SIMD product of a 4×4 matrix and a vector
I wrote a matrix class for use in a ray tracer. I already have a 'scalar' version up and running, but now I'm trying to rewrite it to use Intel SIMD intrinsics. I realize my compiler (clang-7.0.0) will probably generate code that is at least as fast as mine (chances are, it will be faster), but I'm doing this because I enjoy it, and because I find it interesting to think about things like this.
This is my version of a matrix-vector product. The matrix has a fixed dimension of 4×4, and the vector has a fixed dimension of 4. Is there a different approach I could take to make this function even faster?
My matrix class is set up like this:
class alignas(16 * sizeof(float)) matrix
{
private:
    union
    {
        struct
        {
            __m256 ymm0;
            __m256 ymm1;
        };
        float data[16];
    };
    ...
public:
    ...
};
ymm0 stores rows 0 and 1; ymm1 holds rows 2 and 3.
My simd_vec4 class just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)). The vector stores its components in the order {x, y, z, w}.
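For completeness, here's a rough sketch of simd_vec4 (simplified for this post; the exact constructors and member layout are in the linked header, so treat this as illustrative):
#include <immintrin.h>

class alignas(4 * sizeof(float)) simd_vec4
{
public:
    union
    {
        __m128 xmm;                    // [ x y z w ]
        struct { float x, y, z, w; };  // named component access, as used in the tests below
    };

    simd_vec4() noexcept : xmm(_mm_setzero_ps()) {}
    explicit simd_vec4(__m128 v) noexcept : xmm(v) {}
    simd_vec4(float x_, float y_, float z_, float w_) noexcept
        : xmm(_mm_set_ps(w_, z_, y_, x_)) {}  // _mm_set_ps takes arguments from high element to low
};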
Here's my reasoning for the algorithm:
We basically need to perform 4 dot products, so the first thing I do is broadcast my 4-float vector into an 8-float vector with the same 4 floats in both lanes.
Next, we perform the necessary multiplications; these only depend on the broadcast vector from step 1, so they can execute in parallel / out of order.
(a) The next step is to reduce our 16 floats down to 4. We do this by first performing one horizontal add on each of our two 2-lane vectors, and then swapping the lanes in one of the results.
(b) The second part of the reduction merges the two 2-lane vectors into one; due to the way the data is laid out, we end up with the same data in both lanes (in a different order, but we need to fix the order anyway...).
We then extract only the lower lane, because it has all the data we need.
Finally, we shuffle the lower lane to put each component in the right place.
And here's the code:
simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept
{
    // ymm0 = [ A B C D ][ E F G H ]
    // ymm1 = [ I J K L ][ M N O P ]
    __m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm);    // Vec = [ x y z w ][ x y z w ]
    //---------------------------------------------------------------------------------------
    __m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast);     // xy_m256 = [ Ax By Cz Dw ][ Ex Fy Gz Hw ]
    __m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast);     // zw_m256 = [ Ix Jy Kz Lw ][ Mx Ny Oz Pw ]
    //---------------------------------------------------------------------------------------
    __m256 acc0 = _mm256_hadd_ps(xy_m256, zw_m256);          // acc0 = [ Ix+Jy Kz+Lw Ax+By Cz+Dw ][ Mx+Ny Oz+Pw Ex+Fy Gz+Hw ]
    __m256 acc1 = _mm256_hadd_ps(zw_m256, xy_m256);          // acc1 = [ Ax+By Cz+Dw Ix+Jy Kz+Lw ][ Ex+Fy Gz+Hw Mx+Ny Oz+Pw ]
    acc1 = _mm256_permute2f128_ps(acc1, acc1, 0x01);         // acc1 = [ Ex+Fy Gz+Hw Mx+Ny Oz+Pw ][ Ax+By Cz+Dw Ix+Jy Kz+Lw ]
    __m256 merged = _mm256_hadd_ps(acc0, acc1);              // merged = [ Ex+Fy+Gz+Hw Mx+Ny+Oz+Pw Ix+Jy+Kz+Lw Ax+By+Cz+Dw ][ Ax+By+Cz+Dw Ix+Jy+Kz+Lw Mx+Ny+Oz+Pw Ex+Fy+Gz+Hw ]
    __m128 vec = _mm256_extractf128_ps(merged, 0);           // vec = [ Ax+By+Cz+Dw Ix+Jy+Kz+Lw Mx+Ny+Oz+Pw Ex+Fy+Gz+Hw ]
    vec = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); // vec = [ Mx+Ny+Oz+Pw Ix+Jy+Kz+Lw Ex+Fy+Gz+Hw Ax+By+Cz+Dw ]
    return simd_vec4(vec);
}
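For comparison, the scalar version this replaces boils down to four dot products, one per row. Here's a simplified sketch (not my actual scalar code; the free-function name and raw-array interface are just for illustration):
// result[i] = dot(row i of m, v); m is row-major, 4x4.
void mat4_mul_vec4_scalar(const float m[16], const float v[4], float out[4])
{
    for (int i = 0; i < 4; ++i)
    {
        out[i] = (m[4 * i + 0] * v[0]) + (m[4 * i + 1] * v[1])
               + (m[4 * i + 2] * v[2]) + (m[4 * i + 3] * v[3]);
    }
}
The SIMD version computes exactly these four sums; the 16 products are done in two _mm256_mul_ps calls, and the sums are folded together with the horizontal adds.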
I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the available extensions are SSE4.1, SSE4.2 and AVX2 (and everything below those, I'm assuming).
Any advice is welcome. One thing I do struggle with is coming up with meaningful names for the intermediate results, so if anyone has tips, you're more than welcome to share them!
I've written two unit tests so far, and both pass (both are 'normal' cases, though; I'm working on adding more 'edge' cases).
Here's the unit test code (I use Catch2 btw):
TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")
{
    Math::matrix mat;
    Math::simd_vec4 vec;

    // First test case, tested by writing out all operations
    constexpr std::array<float, 16> TEST_CASE_1 =
        { 0,  1,  2,  3,
          4,  5,  6,  7,
          8,  9, 10, 11,
         12, 13, 14, 15};
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            const int arrayIdx = (i * 4) + j;
            mat(i, j) = TEST_CASE_1[arrayIdx];
        }
    }
    vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);
    Math::simd_vec4 result = mat * vec;
    Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(
        (0.0f * 1.0f) + (1.0f * 3.0f) + (2.0f * 2.0f) + (3.0f * 5.0f),
        (4.0f * 1.0f) + (5.0f * 3.0f) + (6.0f * 2.0f) + (7.0f * 5.0f),
        (8.0f * 1.0f) + (9.0f * 3.0f) + (10.0f * 2.0f) + (11.0f * 5.0f),
        (12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));
    CHECK(result.x == Approx(EXPECTED_1.x));
    CHECK(result.y == Approx(EXPECTED_1.y));
    CHECK(result.z == Approx(EXPECTED_1.z));
    CHECK(result.w == Approx(EXPECTED_1.w));

    // Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)
    constexpr std::array<float, 16> TEST_CASE_2 =
        { 1,  7, 23, -5,
          0, -9, -5,  1,
          2,  6, -3,  8,
         -1,  8, 11, -5};
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            const int arrayIdx = (i * 4) + j;
            mat(i, j) = TEST_CASE_2[arrayIdx];
        }
    }
    vec = Math::simd_vec4(-7, 5, -3, 1);
    Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);
    result = mat * vec;
    CHECK(result.x == Approx(EXPECTED_2.x));
    CHECK(result.y == Approx(EXPECTED_2.y));
    CHECK(result.z == Approx(EXPECTED_2.z));
    CHECK(result.w == Approx(EXPECTED_2.w));
}
For those who are interested, the full code can be found here: Header, Implementation. Should you have any problems building, here's my CMakeLists.txt.
Edit: I quickly set up a gist; it contains 5 files (2 for the matrix, 2 for the vector, and 1 for main) along with a shell script containing the command I used to compile everything.
The example code runs the first test case. If any of the results are wrong, the error is written to stderr.
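If you'd rather not use the shell script, a command along these lines should work (the source file names here are placeholders for the ones in the gist; -mavx2 is needed for the intrinsics above):
clang++ -std=c++17 -O2 -mavx2 main.cpp matrix.cpp simd_vec4.cpp -o simd_test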
c++ performance matrix simd
Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*? – Dair, Nov 27 at 20:56
I added the necessary links to the bottom of the post; I'll get to work on a more minimal version. – shmoo6000, Nov 27 at 21:01
@shmoo6000 Minimal is explicitly discouraged here. We are not Stack Overflow. We need code in its entirety in order to review it effectively. – bruglesco, Nov 27 at 21:43