SIMD product of a 4×4 matrix and a vector











up vote
2
down vote

favorite












I wrote a matrix class for use in a ray tracer. I already have a 'scalar' version up and running, but now I'm trying to rewrite it to use Intel SIMD Instrinsics. I realize my compiler (clang-7.0.0) will probably generate code that is at least as fast as mine (chances are, it's going to be faster), but I'm doing this because I enjoy it, and because I find it interesting to think about things like this.



This is my version of a matrix-vector product. The matrix has a fixed dimension of 4x4, the vector has a fixed dimension of 4. I was wondering if I could take a different approach to make this function even faster?



My matrix class is set-up like this:



class alignas(16 * sizeof(float)) matrix
{
private:
union
{
struct
{
__m256 ymm0;
__m256 ymm1;
};

float data [16];
};

...

public:
...
}


ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.



My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.



Here's my reasoning for the algorithm:




  1. We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.


  2. Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.



  3. (a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.



    (b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).



  4. We now extract only the lower lane, because it has all the data we need.


  5. Finally, we shuffle some data around in the lower lane to put the right components in the right places.



And here's the code:



simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept
{
// ymm0 = [ A B C D ][ E F G H ]
// ymm1 = [ I J K L ][ M N O P ]
__m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm); // Vec = [ x y z w ][ x y z w ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast); // xy_m256 = [ Ax By Cz Dw ][ Ex Fy Gz Hw ]
__m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast); // zw_m256 = [ Ix Jy Kz Lw ][ Mx Ny Oz Pw ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 acc0 = _mm256_hadd_ps(xy_m256, zw_m256); // acc0 = [ Ix + Jy Kz + Lw Ax + By Cz + Dw ][ Mx + Ny Oz + Pw Ex + Fy Gz + Hw ]
__m256 acc1 = _mm256_hadd_ps(zw_m256, xy_m256); // acc1 = [ Ax + By Cz + Dw Ix + Jy Kz + Lw ][ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ]
acc1 = _mm256_permute2f128_ps(acc1, acc1, 0x01); // acc1 = [ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ][ Ax + By Cz + Dw Ix + Jy Kz + Lw ]
__m256 merged = _mm256_hadd_ps(acc0, acc1); // merged = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
__m128 vec = _mm256_extractf128_ps(merged, 0); // vec = [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
vec = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); // vec = [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]

return simd_vec4(vec);
}


I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).



Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!



I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)



Here's the unit test code (I use Catch2 btw):



TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")
{
Math::matrix mat;
Math::simd_vec4 vec;

// First test case, tested by writing out all operations
constexpr std::array<float, 16> TEST_CASE_1 =
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;

mat(i, j) = TEST_CASE_1[arrayIdx];
}
}

vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);

Math::simd_vec4 result = mat * vec;
Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(
(0.0f * 1.0f) + (1.0f * 3.0f) + (2.0f * 2.0f) + (3.0f * 5.0f),
(4.0f * 1.0f) + (5.0f * 3.0f) + (6.0f * 2.0f) + (7.0f * 5.0f),
(8.0f * 1.0f) + (9.0f * 3.0f) + (10.0f * 2.0f) + (11.0f * 5.0f),
(12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));

CHECK(result.x == Approx(EXPECTED_1.x));
CHECK(result.y == Approx(EXPECTED_1.y));
CHECK(result.z == Approx(EXPECTED_1.z));
CHECK(result.w == Approx(EXPECTED_1.w));

// Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)
constexpr std::array<float, 16> TEST_CASE_2 =
{1, 7, 23, -5,
0, -9, -5, 1,
2, 6, -3, 8,
-1, 8, 11, -5};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;
mat(i, j) = TEST_CASE_2[arrayIdx];
}
}

vec = Math::simd_vec4(-7, 5, -3, 1);
Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);
result = mat * vec;

CHECK(result.x == Approx(EXPECTED_2.x));
CHECK(result.y == Approx(EXPECTED_2.y));
CHECK(result.z == Approx(EXPECTED_2.z));
CHECK(result.w == Approx(EXPECTED_2.w));
}


For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.



Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.



The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.










share|improve this question
























  • Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
    – Dair
    Nov 27 at 20:56










  • I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
    – shmoo6000
    Nov 27 at 21:01






  • 1




    @shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
    – bruglesco
    Nov 27 at 21:43















up vote
2
down vote

favorite












I wrote a matrix class for use in a ray tracer. I already have a 'scalar' version up and running, but now I'm trying to rewrite it to use Intel SIMD Instrinsics. I realize my compiler (clang-7.0.0) will probably generate code that is at least as fast as mine (chances are, it's going to be faster), but I'm doing this because I enjoy it, and because I find it interesting to think about things like this.



This is my version of a matrix-vector product. The matrix has a fixed dimension of 4x4, the vector has a fixed dimension of 4. I was wondering if I could take a different approach to make this function even faster?



My matrix class is set-up like this:



class alignas(16 * sizeof(float)) matrix
{
private:
union
{
struct
{
__m256 ymm0;
__m256 ymm1;
};

float data [16];
};

...

public:
...
}


ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.



My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.



Here's my reasoning for the algorithm:




  1. We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.


  2. Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.



  3. (a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.



    (b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).



  4. We now extract only the lower lane, because it has all the data we need.


  5. Finally, we shuffle some data around in the lower lane to put the right components in the right places.



And here's the code:



simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept
{
// ymm0 = [ A B C D ][ E F G H ]
// ymm1 = [ I J K L ][ M N O P ]
__m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm); // Vec = [ x y z w ][ x y z w ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast); // xy_m256 = [ Ax By Cz Dw ][ Ex Fy Gz Hw ]
__m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast); // zw_m256 = [ Ix Jy Kz Lw ][ Mx Ny Oz Pw ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 acc0 = _mm256_hadd_ps(xy_m256, zw_m256); // acc0 = [ Ix + Jy Kz + Lw Ax + By Cz + Dw ][ Mx + Ny Oz + Pw Ex + Fy Gz + Hw ]
__m256 acc1 = _mm256_hadd_ps(zw_m256, xy_m256); // acc1 = [ Ax + By Cz + Dw Ix + Jy Kz + Lw ][ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ]
acc1 = _mm256_permute2f128_ps(acc1, acc1, 0x01); // acc1 = [ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ][ Ax + By Cz + Dw Ix + Jy Kz + Lw ]
__m256 merged = _mm256_hadd_ps(acc0, acc1); // merged = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
__m128 vec = _mm256_extractf128_ps(merged, 0); // vec = [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
vec = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); // vec = [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]

return simd_vec4(vec);
}


I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).



Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!



I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)



Here's the unit test code (I use Catch2 btw):



TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")
{
Math::matrix mat;
Math::simd_vec4 vec;

// First test case, tested by writing out all operations
constexpr std::array<float, 16> TEST_CASE_1 =
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;

mat(i, j) = TEST_CASE_1[arrayIdx];
}
}

vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);

Math::simd_vec4 result = mat * vec;
Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(
(0.0f * 1.0f) + (1.0f * 3.0f) + (2.0f * 2.0f) + (3.0f * 5.0f),
(4.0f * 1.0f) + (5.0f * 3.0f) + (6.0f * 2.0f) + (7.0f * 5.0f),
(8.0f * 1.0f) + (9.0f * 3.0f) + (10.0f * 2.0f) + (11.0f * 5.0f),
(12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));

CHECK(result.x == Approx(EXPECTED_1.x));
CHECK(result.y == Approx(EXPECTED_1.y));
CHECK(result.z == Approx(EXPECTED_1.z));
CHECK(result.w == Approx(EXPECTED_1.w));

// Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)
constexpr std::array<float, 16> TEST_CASE_2 =
{1, 7, 23, -5,
0, -9, -5, 1,
2, 6, -3, 8,
-1, 8, 11, -5};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;
mat(i, j) = TEST_CASE_2[arrayIdx];
}
}

vec = Math::simd_vec4(-7, 5, -3, 1);
Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);
result = mat * vec;

CHECK(result.x == Approx(EXPECTED_2.x));
CHECK(result.y == Approx(EXPECTED_2.y));
CHECK(result.z == Approx(EXPECTED_2.z));
CHECK(result.w == Approx(EXPECTED_2.w));
}


For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.



Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.



The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.










share|improve this question
























  • Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
    – Dair
    Nov 27 at 20:56










  • I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
    – shmoo6000
    Nov 27 at 21:01






  • 1




    @shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
    – bruglesco
    Nov 27 at 21:43













up vote
2
down vote

favorite









up vote
2
down vote

favorite











I wrote a matrix class for use in a ray tracer. I already have a 'scalar' version up and running, but now I'm trying to rewrite it to use Intel SIMD Instrinsics. I realize my compiler (clang-7.0.0) will probably generate code that is at least as fast as mine (chances are, it's going to be faster), but I'm doing this because I enjoy it, and because I find it interesting to think about things like this.



This is my version of a matrix-vector product. The matrix has a fixed dimension of 4x4, the vector has a fixed dimension of 4. I was wondering if I could take a different approach to make this function even faster?



My matrix class is set-up like this:



class alignas(16 * sizeof(float)) matrix
{
private:
union
{
struct
{
__m256 ymm0;
__m256 ymm1;
};

float data [16];
};

...

public:
...
}


ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.



My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.



Here's my reasoning for the algorithm:




  1. We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.


  2. Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.



  3. (a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.



    (b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).



  4. We now extract only the lower lane, because it has all the data we need.


  5. Finally, we shuffle some data around in the lower lane to put the right components in the right places.



And here's the code:



simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept
{
// ymm0 = [ A B C D ][ E F G H ]
// ymm1 = [ I J K L ][ M N O P ]
__m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm); // Vec = [ x y z w ][ x y z w ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast); // xy_m256 = [ Ax By Cz Dw ][ Ex Fy Gz Hw ]
__m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast); // zw_m256 = [ Ix Jy Kz Lw ][ Mx Ny Oz Pw ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 acc0 = _mm256_hadd_ps(xy_m256, zw_m256); // acc0 = [ Ix + Jy Kz + Lw Ax + By Cz + Dw ][ Mx + Ny Oz + Pw Ex + Fy Gz + Hw ]
__m256 acc1 = _mm256_hadd_ps(zw_m256, xy_m256); // acc1 = [ Ax + By Cz + Dw Ix + Jy Kz + Lw ][ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ]
acc1 = _mm256_permute2f128_ps(acc1, acc1, 0x01); // acc1 = [ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ][ Ax + By Cz + Dw Ix + Jy Kz + Lw ]
__m256 merged = _mm256_hadd_ps(acc0, acc1); // merged = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
__m128 vec = _mm256_extractf128_ps(merged, 0); // vec = [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
vec = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); // vec = [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]

return simd_vec4(vec);
}


I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).



Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!



I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)



Here's the unit test code (I use Catch2 btw):



TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")
{
Math::matrix mat;
Math::simd_vec4 vec;

// First test case, tested by writing out all operations
constexpr std::array<float, 16> TEST_CASE_1 =
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;

mat(i, j) = TEST_CASE_1[arrayIdx];
}
}

vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);

Math::simd_vec4 result = mat * vec;
Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(
(0.0f * 1.0f) + (1.0f * 3.0f) + (2.0f * 2.0f) + (3.0f * 5.0f),
(4.0f * 1.0f) + (5.0f * 3.0f) + (6.0f * 2.0f) + (7.0f * 5.0f),
(8.0f * 1.0f) + (9.0f * 3.0f) + (10.0f * 2.0f) + (11.0f * 5.0f),
(12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));

CHECK(result.x == Approx(EXPECTED_1.x));
CHECK(result.y == Approx(EXPECTED_1.y));
CHECK(result.z == Approx(EXPECTED_1.z));
CHECK(result.w == Approx(EXPECTED_1.w));

// Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)
constexpr std::array<float, 16> TEST_CASE_2 =
{1, 7, 23, -5,
0, -9, -5, 1,
2, 6, -3, 8,
-1, 8, 11, -5};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;
mat(i, j) = TEST_CASE_2[arrayIdx];
}
}

vec = Math::simd_vec4(-7, 5, -3, 1);
Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);
result = mat * vec;

CHECK(result.x == Approx(EXPECTED_2.x));
CHECK(result.y == Approx(EXPECTED_2.y));
CHECK(result.z == Approx(EXPECTED_2.z));
CHECK(result.w == Approx(EXPECTED_2.w));
}


For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.



Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.



The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.










share|improve this question















I wrote a matrix class for use in a ray tracer. I already have a 'scalar' version up and running, but now I'm trying to rewrite it to use Intel SIMD Instrinsics. I realize my compiler (clang-7.0.0) will probably generate code that is at least as fast as mine (chances are, it's going to be faster), but I'm doing this because I enjoy it, and because I find it interesting to think about things like this.



This is my version of a matrix-vector product. The matrix has a fixed dimension of 4x4, the vector has a fixed dimension of 4. I was wondering if I could take a different approach to make this function even faster?



My matrix class is set-up like this:



class alignas(16 * sizeof(float)) matrix
{
private:
union
{
struct
{
__m256 ymm0;
__m256 ymm1;
};

float data [16];
};

...

public:
...
}


ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.



My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.



Here's my reasoning for the algorithm:




  1. We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.


  2. Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.



  3. (a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.



    (b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).



  4. We now extract only the lower lane, because it has all the data we need.


  5. Finally, we shuffle some data around in the lower lane to put the right components in the right places.



And here's the code:



simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept
{
// ymm0 = [ A B C D ][ E F G H ]
// ymm1 = [ I J K L ][ M N O P ]
__m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm); // Vec = [ x y z w ][ x y z w ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast); // xy_m256 = [ Ax By Cz Dw ][ Ex Fy Gz Hw ]
__m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast); // zw_m256 = [ Ix Jy Kz Lw ][ Mx Ny Oz Pw ]
//-------------------------------------------------------------------------------------------------------------------------------------------------------------
__m256 acc0 = _mm256_hadd_ps(xy_m256, zw_m256); // acc0 = [ Ix + Jy Kz + Lw Ax + By Cz + Dw ][ Mx + Ny Oz + Pw Ex + Fy Gz + Hw ]
__m256 acc1 = _mm256_hadd_ps(zw_m256, xy_m256); // acc1 = [ Ax + By Cz + Dw Ix + Jy Kz + Lw ][ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ]
acc1 = _mm256_permute2f128_ps(acc1, acc1, 0x01); // acc1 = [ Ex + Fy Gz + Hw Mx + Ny Oz + Pw ][ Ax + By Cz + Dw Ix + Jy Kz + Lw ]
__m256 merged = _mm256_hadd_ps(acc0, acc1); // merged = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
__m128 vec = _mm256_extractf128_ps(merged, 0); // vec = [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]
vec = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); // vec = [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]

return simd_vec4(vec);
}


I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).



Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!



I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)



Here's the unit test code (I use Catch2 btw):



TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")
{
Math::matrix mat;
Math::simd_vec4 vec;

// First test case, tested by writing out all operations
constexpr std::array<float, 16> TEST_CASE_1 =
{0, 1, 2, 3,
4, 5, 6, 7,
8, 9, 10, 11,
12, 13, 14, 15};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;

mat(i, j) = TEST_CASE_1[arrayIdx];
}
}

vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);

Math::simd_vec4 result = mat * vec;
Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(
(0.0f * 1.0f) + (1.0f * 3.0f) + (2.0f * 2.0f) + (3.0f * 5.0f),
(4.0f * 1.0f) + (5.0f * 3.0f) + (6.0f * 2.0f) + (7.0f * 5.0f),
(8.0f * 1.0f) + (9.0f * 3.0f) + (10.0f * 2.0f) + (11.0f * 5.0f),
(12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));

CHECK(result.x == Approx(EXPECTED_1.x));
CHECK(result.y == Approx(EXPECTED_1.y));
CHECK(result.z == Approx(EXPECTED_1.z));
CHECK(result.w == Approx(EXPECTED_1.w));

// Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)
constexpr std::array<float, 16> TEST_CASE_2 =
{1, 7, 23, -5,
0, -9, -5, 1,
2, 6, -3, 8,
-1, 8, 11, -5};

for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
const int arrayIdx = (i * 4) + j;
mat(i, j) = TEST_CASE_2[arrayIdx];
}
}

vec = Math::simd_vec4(-7, 5, -3, 1);
Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);
result = mat * vec;

CHECK(result.x == Approx(EXPECTED_2.x));
CHECK(result.y == Approx(EXPECTED_2.y));
CHECK(result.z == Approx(EXPECTED_2.z));
CHECK(result.w == Approx(EXPECTED_2.w));
}


For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.



Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.



The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.







c++ performance matrix simd






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 27 at 21:22

























asked Nov 27 at 20:08









shmoo6000

193311




193311












  • Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
    – Dair
    Nov 27 at 20:56










  • I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
    – shmoo6000
    Nov 27 at 21:01






  • 1




    @shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
    – bruglesco
    Nov 27 at 21:43


















  • Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
    – Dair
    Nov 27 at 20:56










  • I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
    – shmoo6000
    Nov 27 at 21:01






  • 1




    @shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
    – bruglesco
    Nov 27 at 21:43
















Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
– Dair
Nov 27 at 20:56




Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
– Dair
Nov 27 at 20:56












I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
– shmoo6000
Nov 27 at 21:01




I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
– shmoo6000
Nov 27 at 21:01




1




1




@shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
– bruglesco
Nov 27 at 21:43




@shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
– bruglesco
Nov 27 at 21:43















active

oldest

votes











Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208565%2fsimd-product-of-a-4%25c3%25974-matrix-and-a-vector%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown






























active

oldest

votes













active

oldest

votes









active

oldest

votes






active

oldest

votes
















draft saved

draft discarded




















































Thanks for contributing an answer to Code Review Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208565%2fsimd-product-of-a-4%25c3%25974-matrix-and-a-vector%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Список кардиналов, возведённых папой римским Каликстом III

Deduzione

Mysql.sock missing - “Can't connect to local MySQL server through socket”