SIMD product of a 4×4 matrix and a vector

up vote
2
down vote

favorite

I wrote a matrix class for use in a ray tracer. I already have a 'scalar' version up and running, but now I'm trying to rewrite it to use Intel SIMD Instrinsics. I realize my compiler (clang-7.0.0) will probably generate code that is at least as fast as mine (chances are, it's going to be faster), but I'm doing this because I enjoy it, and because I find it interesting to think about things like this.

This is my version of a matrix-vector product. The matrix has a fixed dimension of 4x4, the vector has a fixed dimension of 4. I was wondering if I could take a different approach to make this function even faster?

My matrix class is set-up like this:

class alignas(16 * sizeof(float)) matrix

{

    private:

        union

        {

            struct

            {

                __m256 ymm0;

                __m256 ymm1;

            };



            float data [16];

        };



        ...



    public:

        ...

}

ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.

My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.

Here's my reasoning for the algorithm:

We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.

Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.

(a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.

(b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).

We now extract only the lower lane, because it has all the data we need.

Finally, we shuffle some data around in the lower lane to put the right components in the right places.

And here's the code:

simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept

{

                                                                        // ymm0    = [        A                 B                 C                 D        ][        E                 F                 G                 H        ]

                                                                        // ymm1    = [        I                 J                 K                 L        ][        M                 N                 O                 P        ]

    __m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm);               // Vec     = [         x                 y                 z                 w       ][         x                 y                 z                 w       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast);                // xy_m256 = [        Ax                By                Cz                Dw       ][        Ex                Fy                Gz                Hw       ]

    __m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast);                // zw_m256 = [        Ix                Jy                Kz                Lw       ][        Mx                Ny                Oz                Pw       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256   acc0  = _mm256_hadd_ps(xy_m256, zw_m256);                  // acc0    = [     Ix + Jy           Kz + Lw           Ax + By           Cz + Dw     ][     Mx + Ny           Oz + Pw           Ex + Fy           Gz + Hw     ]

    __m256   acc1  = _mm256_hadd_ps(zw_m256, xy_m256);                  // acc1    = [     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ][     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ]

             acc1  = _mm256_permute2f128_ps(acc1, acc1, 0x01);          // acc1    = [     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ][     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ]

    __m256 merged  = _mm256_hadd_ps(acc0, acc1);                        // merged  = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

    __m128    vec  = _mm256_extractf128_ps(merged, 0);                  //  vec    =                                                                          [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

              vec  = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); //  vec    =                                                                          [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]



    return simd_vec4(vec);

}

I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).

Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!

I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)

Here's the unit test code (I use Catch2 btw):

TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")

{

    Math::matrix mat;

    Math::simd_vec4 vec;



    // First test case, tested by writing out all operations

    constexpr std::array<float, 16> TEST_CASE_1 =

                                        {0,  1,  2,  3,

                                         4,  5,  6,  7,

                                         8,  9,  10, 11,

                                         12, 13, 14, 15};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;



            mat(i, j) = TEST_CASE_1[arrayIdx];

        }

    }



    vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);



    Math::simd_vec4 result   = mat * vec;

    Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(

        (0.0f * 1.0f)  + (1.0f * 3.0f)  + (2.0f * 2.0f)  + (3.0f * 5.0f),

        (4.0f * 1.0f)  + (5.0f * 3.0f)  + (6.0f * 2.0f)  + (7.0f * 5.0f),

        (8.0f * 1.0f)  + (9.0f * 3.0f)  + (10.0f * 2.0f) + (11.0f * 5.0f),

        (12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));



    CHECK(result.x == Approx(EXPECTED_1.x));

    CHECK(result.y == Approx(EXPECTED_1.y));

    CHECK(result.z == Approx(EXPECTED_1.z));

    CHECK(result.w == Approx(EXPECTED_1.w));



    // Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)

    constexpr std::array<float, 16> TEST_CASE_2 =

                                        {1, 7, 23, -5,

                                         0, -9, -5, 1,

                                         2, 6, -3, 8,

                                         -1, 8, 11, -5};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;

            mat(i, j) = TEST_CASE_2[arrayIdx];

        }

    }



    vec = Math::simd_vec4(-7, 5, -3, 1);

    Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);

    result   = mat * vec;



    CHECK(result.x == Approx(EXPECTED_2.x));

    CHECK(result.y == Approx(EXPECTED_2.y));

    CHECK(result.z == Approx(EXPECTED_2.z));

    CHECK(result.w == Approx(EXPECTED_2.w));

}

For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.

Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.

The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.

edited Nov 27 at 21:22

asked Nov 27 at 20:08

shmoo6000

193311

Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
– Dair
Nov 27 at 20:56

I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
– shmoo6000
Nov 27 at 21:01

1

@shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
– bruglesco
Nov 27 at 21:43

add a comment |

up vote
2
down vote

favorite

My matrix class is set-up like this:

class alignas(16 * sizeof(float)) matrix

{

    private:

        union

        {

            struct

            {

                __m256 ymm0;

                __m256 ymm1;

            };



            float data [16];

        };



        ...



    public:

        ...

}

ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.

My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.

Here's my reasoning for the algorithm:

We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.

Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.

(a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.

(b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).

We now extract only the lower lane, because it has all the data we need.

Finally, we shuffle some data around in the lower lane to put the right components in the right places.

And here's the code:

simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept

{

                                                                        // ymm0    = [        A                 B                 C                 D        ][        E                 F                 G                 H        ]

                                                                        // ymm1    = [        I                 J                 K                 L        ][        M                 N                 O                 P        ]

    __m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm);               // Vec     = [         x                 y                 z                 w       ][         x                 y                 z                 w       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast);                // xy_m256 = [        Ax                By                Cz                Dw       ][        Ex                Fy                Gz                Hw       ]

    __m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast);                // zw_m256 = [        Ix                Jy                Kz                Lw       ][        Mx                Ny                Oz                Pw       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256   acc0  = _mm256_hadd_ps(xy_m256, zw_m256);                  // acc0    = [     Ix + Jy           Kz + Lw           Ax + By           Cz + Dw     ][     Mx + Ny           Oz + Pw           Ex + Fy           Gz + Hw     ]

    __m256   acc1  = _mm256_hadd_ps(zw_m256, xy_m256);                  // acc1    = [     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ][     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ]

             acc1  = _mm256_permute2f128_ps(acc1, acc1, 0x01);          // acc1    = [     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ][     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ]

    __m256 merged  = _mm256_hadd_ps(acc0, acc1);                        // merged  = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

    __m128    vec  = _mm256_extractf128_ps(merged, 0);                  //  vec    =                                                                          [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

              vec  = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); //  vec    =                                                                          [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]



    return simd_vec4(vec);

}

I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).

Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!

I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)

Here's the unit test code (I use Catch2 btw):

TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")

{

    Math::matrix mat;

    Math::simd_vec4 vec;



    // First test case, tested by writing out all operations

    constexpr std::array<float, 16> TEST_CASE_1 =

                                        {0,  1,  2,  3,

                                         4,  5,  6,  7,

                                         8,  9,  10, 11,

                                         12, 13, 14, 15};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;



            mat(i, j) = TEST_CASE_1[arrayIdx];

        }

    }



    vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);



    Math::simd_vec4 result   = mat * vec;

    Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(

        (0.0f * 1.0f)  + (1.0f * 3.0f)  + (2.0f * 2.0f)  + (3.0f * 5.0f),

        (4.0f * 1.0f)  + (5.0f * 3.0f)  + (6.0f * 2.0f)  + (7.0f * 5.0f),

        (8.0f * 1.0f)  + (9.0f * 3.0f)  + (10.0f * 2.0f) + (11.0f * 5.0f),

        (12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));



    CHECK(result.x == Approx(EXPECTED_1.x));

    CHECK(result.y == Approx(EXPECTED_1.y));

    CHECK(result.z == Approx(EXPECTED_1.z));

    CHECK(result.w == Approx(EXPECTED_1.w));



    // Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)

    constexpr std::array<float, 16> TEST_CASE_2 =

                                        {1, 7, 23, -5,

                                         0, -9, -5, 1,

                                         2, 6, -3, 8,

                                         -1, 8, 11, -5};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;

            mat(i, j) = TEST_CASE_2[arrayIdx];

        }

    }



    vec = Math::simd_vec4(-7, 5, -3, 1);

    Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);

    result   = mat * vec;



    CHECK(result.x == Approx(EXPECTED_2.x));

    CHECK(result.y == Approx(EXPECTED_2.y));

    CHECK(result.z == Approx(EXPECTED_2.z));

    CHECK(result.w == Approx(EXPECTED_2.w));

}

For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.

Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.

The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.

edited Nov 27 at 21:22

asked Nov 27 at 20:08

shmoo6000

193311

Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
– Dair
Nov 27 at 20:56

I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
– shmoo6000
Nov 27 at 21:01

1

@shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
– bruglesco
Nov 27 at 21:43

add a comment |

up vote
2
down vote

favorite

My matrix class is set-up like this:

class alignas(16 * sizeof(float)) matrix

{

    private:

        union

        {

            struct

            {

                __m256 ymm0;

                __m256 ymm1;

            };



            float data [16];

        };



        ...



    public:

        ...

}

ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.

My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.

Here's my reasoning for the algorithm:

We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.

Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.

(a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.

(b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).

We now extract only the lower lane, because it has all the data we need.

Finally, we shuffle some data around in the lower lane to put the right components in the right places.

And here's the code:

simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept

{

                                                                        // ymm0    = [        A                 B                 C                 D        ][        E                 F                 G                 H        ]

                                                                        // ymm1    = [        I                 J                 K                 L        ][        M                 N                 O                 P        ]

    __m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm);               // Vec     = [         x                 y                 z                 w       ][         x                 y                 z                 w       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast);                // xy_m256 = [        Ax                By                Cz                Dw       ][        Ex                Fy                Gz                Hw       ]

    __m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast);                // zw_m256 = [        Ix                Jy                Kz                Lw       ][        Mx                Ny                Oz                Pw       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256   acc0  = _mm256_hadd_ps(xy_m256, zw_m256);                  // acc0    = [     Ix + Jy           Kz + Lw           Ax + By           Cz + Dw     ][     Mx + Ny           Oz + Pw           Ex + Fy           Gz + Hw     ]

    __m256   acc1  = _mm256_hadd_ps(zw_m256, xy_m256);                  // acc1    = [     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ][     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ]

             acc1  = _mm256_permute2f128_ps(acc1, acc1, 0x01);          // acc1    = [     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ][     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ]

    __m256 merged  = _mm256_hadd_ps(acc0, acc1);                        // merged  = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

    __m128    vec  = _mm256_extractf128_ps(merged, 0);                  //  vec    =                                                                          [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

              vec  = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); //  vec    =                                                                          [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]



    return simd_vec4(vec);

}

I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).

Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!

I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)

Here's the unit test code (I use Catch2 btw):

TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")

{

    Math::matrix mat;

    Math::simd_vec4 vec;



    // First test case, tested by writing out all operations

    constexpr std::array<float, 16> TEST_CASE_1 =

                                        {0,  1,  2,  3,

                                         4,  5,  6,  7,

                                         8,  9,  10, 11,

                                         12, 13, 14, 15};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;



            mat(i, j) = TEST_CASE_1[arrayIdx];

        }

    }



    vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);



    Math::simd_vec4 result   = mat * vec;

    Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(

        (0.0f * 1.0f)  + (1.0f * 3.0f)  + (2.0f * 2.0f)  + (3.0f * 5.0f),

        (4.0f * 1.0f)  + (5.0f * 3.0f)  + (6.0f * 2.0f)  + (7.0f * 5.0f),

        (8.0f * 1.0f)  + (9.0f * 3.0f)  + (10.0f * 2.0f) + (11.0f * 5.0f),

        (12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));



    CHECK(result.x == Approx(EXPECTED_1.x));

    CHECK(result.y == Approx(EXPECTED_1.y));

    CHECK(result.z == Approx(EXPECTED_1.z));

    CHECK(result.w == Approx(EXPECTED_1.w));



    // Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)

    constexpr std::array<float, 16> TEST_CASE_2 =

                                        {1, 7, 23, -5,

                                         0, -9, -5, 1,

                                         2, 6, -3, 8,

                                         -1, 8, 11, -5};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;

            mat(i, j) = TEST_CASE_2[arrayIdx];

        }

    }



    vec = Math::simd_vec4(-7, 5, -3, 1);

    Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);

    result   = mat * vec;



    CHECK(result.x == Approx(EXPECTED_2.x));

    CHECK(result.y == Approx(EXPECTED_2.y));

    CHECK(result.z == Approx(EXPECTED_2.z));

    CHECK(result.w == Approx(EXPECTED_2.w));

}

For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.

Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.

The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.

edited Nov 27 at 21:22

asked Nov 27 at 20:08

shmoo6000

193311

My matrix class is set-up like this:

class alignas(16 * sizeof(float)) matrix

{

    private:

        union

        {

            struct

            {

                __m256 ymm0;

                __m256 ymm1;

            };



            float data [16];

        };



        ...



    public:

        ...

}

ymm0 stores rows 0 and 1, ymm1 holds rows 2 and 3.

My simd_vec4 class is just contains a single __m128 called xmm that's aligned to 16 bytes (4 * sizeof(float)).
The vector stores its components in the as {x, y, z, w}.

Here's my reasoning for the algorithm:

We basically need to perform 4 dot products; so the first thing I do is broadcast my 4-float vector to an 8-float vector with the same 4 floats in every lane.

Now, we perform the necessary multiplications, these only carry a data dependency with the vector from step 1, so they can be parallelized/out-of-order'ed.

(a) The next step is to reduce our 16 floats down to 4 floats. We do this by first performing 1 horzontal reduction on each of our 2-lane vectors, and then swapping lanes in one of them.

(b) The second portion of our reduction operation is to merge our two 2-lane vectors into one 2-lane vector, due to the way our data is laid out, we get the same data in both lanes (different order, but we need to fix that anyway...).

We now extract only the lower lane, because it has all the data we need.

Finally, we shuffle some data around in the lower lane to put the right components in the right places.

And here's the code:

simd_vec4 operator*(const simd_mat4& lhs, const simd_vec4& rhs) noexcept

{

                                                                        // ymm0    = [        A                 B                 C                 D        ][        E                 F                 G                 H        ]

                                                                        // ymm1    = [        I                 J                 K                 L        ][        M                 N                 O                 P        ]

    __m256 broadcast = _mm256_set_m128(rhs.xmm, rhs.xmm);               // Vec     = [         x                 y                 z                 w       ][         x                 y                 z                 w       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256 xy_m256 = _mm256_mul_ps(lhs.ymm0, broadcast);                // xy_m256 = [        Ax                By                Cz                Dw       ][        Ex                Fy                Gz                Hw       ]

    __m256 zw_m256 = _mm256_mul_ps(lhs.ymm1, broadcast);                // zw_m256 = [        Ix                Jy                Kz                Lw       ][        Mx                Ny                Oz                Pw       ]

                                                                        //-------------------------------------------------------------------------------------------------------------------------------------------------------------

    __m256   acc0  = _mm256_hadd_ps(xy_m256, zw_m256);                  // acc0    = [     Ix + Jy           Kz + Lw           Ax + By           Cz + Dw     ][     Mx + Ny           Oz + Pw           Ex + Fy           Gz + Hw     ]

    __m256   acc1  = _mm256_hadd_ps(zw_m256, xy_m256);                  // acc1    = [     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ][     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ]

             acc1  = _mm256_permute2f128_ps(acc1, acc1, 0x01);          // acc1    = [     Ex + Fy           Gz + Hw           Mx + Ny           Oz + Pw     ][     Ax + By           Cz + Dw           Ix + Jy           Kz + Lw     ]

    __m256 merged  = _mm256_hadd_ps(acc0, acc1);                        // merged  = [Ex + Fy + Gz + Hw Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ax + By + Cz + Dw][Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

    __m128    vec  = _mm256_extractf128_ps(merged, 0);                  //  vec    =                                                                          [Ax + By + Cz + Dw Ix + Jy + Kz + Lw Mx + Ny + Oz + Pw Ex + Fy + Gz + Hw]

              vec  = _mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 1, 3, 0)); //  vec    =                                                                          [Mx + Ny + Oz + Pw Ix + Jy + Kz + Lw Ex + Fy + Gz + Hw Ax + By + Cz + Dw]



    return simd_vec4(vec);

}

I'm running this code on an Intel Core i7-5500U CPU @ 2.40GHz, so the allowed extensions are SSE4.1, SSE4.2, AVX2 (and below, I'm assuming).

Any advice is welcome, but one thing I do struggle with is coming up with meaningful names for the intermediate results, if anyone has any tips, you're more than welcome!

I've written two unit test so far, and both are passing (both are 'normal' cases though, I'm working on adding more 'edge' cases)

Here's the unit test code (I use Catch2 btw):

TEST_CASE("matrix * vec4", "[simd_matrix][vec4][multiplication][product]")

{

    Math::matrix mat;

    Math::simd_vec4 vec;



    // First test case, tested by writing out all operations

    constexpr std::array<float, 16> TEST_CASE_1 =

                                        {0,  1,  2,  3,

                                         4,  5,  6,  7,

                                         8,  9,  10, 11,

                                         12, 13, 14, 15};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;



            mat(i, j) = TEST_CASE_1[arrayIdx];

        }

    }



    vec = Math::simd_vec4(1.0f, 3.0f, 2.0f, 5.0f);



    Math::simd_vec4 result   = mat * vec;

    Math::simd_vec4 EXPECTED_1 = Math::simd_vec4(

        (0.0f * 1.0f)  + (1.0f * 3.0f)  + (2.0f * 2.0f)  + (3.0f * 5.0f),

        (4.0f * 1.0f)  + (5.0f * 3.0f)  + (6.0f * 2.0f)  + (7.0f * 5.0f),

        (8.0f * 1.0f)  + (9.0f * 3.0f)  + (10.0f * 2.0f) + (11.0f * 5.0f),

        (12.0f * 1.0f) + (13.0f * 3.0f) + (14.0f * 2.0f) + (15.0f * 5.0f));



    CHECK(result.x == Approx(EXPECTED_1.x));

    CHECK(result.y == Approx(EXPECTED_1.y));

    CHECK(result.z == Approx(EXPECTED_1.z));

    CHECK(result.w == Approx(EXPECTED_1.w));



    // Second test case, checked with Wolfram Alpha (https://www.wolframalpha.com/input/?i=%7B%7B1,+7,+23,+-5%7D,%7B0,+-9,+-5,+1%7D,%7B2,+6,+-3,+8%7D,%7B-1,+8,+11,+-5%7D%7D+*+%7B-7,+5,+-3,+1%7D)

    constexpr std::array<float, 16> TEST_CASE_2 =

                                        {1, 7, 23, -5,

                                         0, -9, -5, 1,

                                         2, 6, -3, 8,

                                         -1, 8, 11, -5};



    for (int i = 0; i < 4; i++)

    {

        for (int j = 0; j < 4; j++)

        {

            const int arrayIdx = (i * 4) + j;

            mat(i, j) = TEST_CASE_2[arrayIdx];

        }

    }



    vec = Math::simd_vec4(-7, 5, -3, 1);

    Math::simd_vec4 EXPECTED_2 = Math::simd_vec4(-46, -29, 33, 9);

    result   = mat * vec;



    CHECK(result.x == Approx(EXPECTED_2.x));

    CHECK(result.y == Approx(EXPECTED_2.y));

    CHECK(result.z == Approx(EXPECTED_2.z));

    CHECK(result.w == Approx(EXPECTED_2.w));

}

For those that are interested: The full code can be found here:
Header, Implementation.
Should you have any problems building, here's my CMakeLists.txt.

Edit: I quickly got a gist setup, it's got 5 files (2 for the matrix, 2 for the vector and 1 for main) along with a shell script that contains the command I used to compile this.

The example code runs the first test case. If any of the results are wrong, the error will be written to stderr.

c++ performance matrix simd

edited Nov 27 at 21:22

asked Nov 27 at 20:08

shmoo6000

193311

edited Nov 27 at 21:22

asked Nov 27 at 20:08

shmoo6000

193311

edited Nov 27 at 21:22

asked Nov 27 at 20:08

shmoo6000

193311

asked Nov 27 at 20:08

shmoo6000

193311

asked Nov 27 at 20:08

shmoo6000

193311

Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
– Dair
Nov 27 at 20:56

I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
– shmoo6000
Nov 27 at 21:01

1

@shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
– bruglesco
Nov 27 at 21:43

add a comment |

Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
– Dair
Nov 27 at 20:56

I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
– shmoo6000
Nov 27 at 21:01

1

@shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
– bruglesco
Nov 27 at 21:43

Hello, is it possible for you to publish the matrix class in its entirety, or a reduced version that will allow me to run operator*?
– Dair
Nov 27 at 20:56

I added the necessary links to the bottom of the post, I'll get to work on a more minimal version.
– shmoo6000
Nov 27 at 21:01

@shmoo6000 minimal is explicitly discouraged here. We are not stack overflow. We need coffee in its entirety in order to review it effectively.
– bruglesco
Nov 27 at 21:43

add a comment |

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f208565%2fsimd-product-of-a-4%25c3%25974-matrix-and-a-vector%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Gfrktyl