Expand every bit into a byte

up vote
3
down vote

favorite

I have the following code:

function TSliverHelper.SlowNorth: TSlice;

var

  i: integer;

begin

  // Add pixels 0,1,2

  // This means expanding every bit into a byte

  // Or rather every byte into an int64;

  for i:= 0 to 7 do begin

    Result.Data8[i]:= TSuperSlice.Lookup012[Self.bytes[i]];

  end;

end;

This uses a straight forward lookup table, but obviously LUT's are slow and clobber the cache. This takes about 2860 millisecs for 100.000.000 items.

The following approach is a bit faster (1797 MS, or 37% faster):

function TSliverHelper.North: TSlice;

const

  SliverToSliceMask: array[0..7] of byte = ($01,$02,$04,$08,$10,$20,$40,$80);

asm

  //RCX = @Self    (a pointer to an Int64)

  //RDX = @Result  (a pointer to an array[0..63] of byte)

  movq xmm0,[rcx]                       //Get the sliver

  mov r9,$8040201008040201

  movq xmm15,r9 //[rip+SliverToSliceMask] //Get the mask

  movlhps xmm15,xmm15                   //extend it

  mov r8,$0101010101010101              //Shuffle mask

  movq xmm14,r8                         //00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 01

  pslldq xmm14,8                        //01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00

  movdqa xmm1,xmm0                      //make a copy of the sliver

  //bytes 0,1

  pshufb xmm1,xmm14                     //copy the first two bytes across

  pand xmm1,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm1,xmm15                    //Expand a bit into a byte

  movdqu [rdx],xmm1

  //bytes 2,3

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm2,xmm0

  pshufb xmm2,xmm14                     //copy the next two bytes across

  pand xmm2,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm2,xmm15                    //Expand a bit into a byte

  movdqu [rdx+16],xmm2

  //bytes 4,5

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm3,xmm0

  pshufb xmm3,xmm14                     //copy the next two bytes across

  pand xmm3,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm3,xmm15                    //Expand a bit into a byte

  movdqu [rdx+32],xmm3

  //bytes 6,7

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm4,xmm0

  pshufb xmm4,xmm14                     //copy the final two bytes across

  pand xmm4,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm4,xmm15                    //Expand a bit into a byte

  //Store the data

  movdqu [rdx+48],xmm4

end;

However, that is a lot of code. I'm hoping there's a way to do with less processing that's going to work faster.
The way the code works (in prose) is simple.

First we clone the input byte 8 times. Next the bit is masked off using the 01,02,04... mask and an AND operation. Finally this randomish bit is expanded into a byte using the compare-equal-to-mask (pcmpeqb).

The opposite operation is a simple PMSKMOVB.

I can use AVX1 code, but not AVX2.

edited Nov 12 at 17:28

200_success

128k15149412

asked Nov 12 at 9:02

Johan

24217

Don't you have any other options than Pascal (if I'm right)?
– Calak
Nov 12 at 9:30

It's in assembly. The Pascal code is just a wrapper, any language will do.
– Johan
Nov 12 at 9:37

1

Can't try it for now, but I assume you can reach the ASM's performance, shorter, with plain C.
– Calak
Nov 12 at 9:55

add a comment |

up vote
3
down vote

favorite

I have the following code:

function TSliverHelper.SlowNorth: TSlice;

var

  i: integer;

begin

  // Add pixels 0,1,2

  // This means expanding every bit into a byte

  // Or rather every byte into an int64;

  for i:= 0 to 7 do begin

    Result.Data8[i]:= TSuperSlice.Lookup012[Self.bytes[i]];

  end;

end;

This uses a straight forward lookup table, but obviously LUT's are slow and clobber the cache. This takes about 2860 millisecs for 100.000.000 items.

The following approach is a bit faster (1797 MS, or 37% faster):

function TSliverHelper.North: TSlice;

const

  SliverToSliceMask: array[0..7] of byte = ($01,$02,$04,$08,$10,$20,$40,$80);

asm

  //RCX = @Self    (a pointer to an Int64)

  //RDX = @Result  (a pointer to an array[0..63] of byte)

  movq xmm0,[rcx]                       //Get the sliver

  mov r9,$8040201008040201

  movq xmm15,r9 //[rip+SliverToSliceMask] //Get the mask

  movlhps xmm15,xmm15                   //extend it

  mov r8,$0101010101010101              //Shuffle mask

  movq xmm14,r8                         //00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 01

  pslldq xmm14,8                        //01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00

  movdqa xmm1,xmm0                      //make a copy of the sliver

  //bytes 0,1

  pshufb xmm1,xmm14                     //copy the first two bytes across

  pand xmm1,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm1,xmm15                    //Expand a bit into a byte

  movdqu [rdx],xmm1

  //bytes 2,3

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm2,xmm0

  pshufb xmm2,xmm14                     //copy the next two bytes across

  pand xmm2,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm2,xmm15                    //Expand a bit into a byte

  movdqu [rdx+16],xmm2

  //bytes 4,5

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm3,xmm0

  pshufb xmm3,xmm14                     //copy the next two bytes across

  pand xmm3,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm3,xmm15                    //Expand a bit into a byte

  movdqu [rdx+32],xmm3

  //bytes 6,7

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm4,xmm0

  pshufb xmm4,xmm14                     //copy the final two bytes across

  pand xmm4,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm4,xmm15                    //Expand a bit into a byte

  //Store the data

  movdqu [rdx+48],xmm4

end;

The opposite operation is a simple PMSKMOVB.

I can use AVX1 code, but not AVX2.

edited Nov 12 at 17:28

200_success

128k15149412

asked Nov 12 at 9:02

Johan

24217

Don't you have any other options than Pascal (if I'm right)?
– Calak
Nov 12 at 9:30

It's in assembly. The Pascal code is just a wrapper, any language will do.
– Johan
Nov 12 at 9:37

1

Can't try it for now, but I assume you can reach the ASM's performance, shorter, with plain C.
– Calak
Nov 12 at 9:55

add a comment |

up vote
3
down vote

favorite

I have the following code:

function TSliverHelper.SlowNorth: TSlice;

var

  i: integer;

begin

  // Add pixels 0,1,2

  // This means expanding every bit into a byte

  // Or rather every byte into an int64;

  for i:= 0 to 7 do begin

    Result.Data8[i]:= TSuperSlice.Lookup012[Self.bytes[i]];

  end;

end;

This uses a straight forward lookup table, but obviously LUT's are slow and clobber the cache. This takes about 2860 millisecs for 100.000.000 items.

The following approach is a bit faster (1797 MS, or 37% faster):

function TSliverHelper.North: TSlice;

const

  SliverToSliceMask: array[0..7] of byte = ($01,$02,$04,$08,$10,$20,$40,$80);

asm

  //RCX = @Self    (a pointer to an Int64)

  //RDX = @Result  (a pointer to an array[0..63] of byte)

  movq xmm0,[rcx]                       //Get the sliver

  mov r9,$8040201008040201

  movq xmm15,r9 //[rip+SliverToSliceMask] //Get the mask

  movlhps xmm15,xmm15                   //extend it

  mov r8,$0101010101010101              //Shuffle mask

  movq xmm14,r8                         //00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 01

  pslldq xmm14,8                        //01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00

  movdqa xmm1,xmm0                      //make a copy of the sliver

  //bytes 0,1

  pshufb xmm1,xmm14                     //copy the first two bytes across

  pand xmm1,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm1,xmm15                    //Expand a bit into a byte

  movdqu [rdx],xmm1

  //bytes 2,3

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm2,xmm0

  pshufb xmm2,xmm14                     //copy the next two bytes across

  pand xmm2,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm2,xmm15                    //Expand a bit into a byte

  movdqu [rdx+16],xmm2

  //bytes 4,5

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm3,xmm0

  pshufb xmm3,xmm14                     //copy the next two bytes across

  pand xmm3,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm3,xmm15                    //Expand a bit into a byte

  movdqu [rdx+32],xmm3

  //bytes 6,7

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm4,xmm0

  pshufb xmm4,xmm14                     //copy the final two bytes across

  pand xmm4,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm4,xmm15                    //Expand a bit into a byte

  //Store the data

  movdqu [rdx+48],xmm4

end;

The opposite operation is a simple PMSKMOVB.

I can use AVX1 code, but not AVX2.

edited Nov 12 at 17:28

200_success

128k15149412

asked Nov 12 at 9:02

Johan

24217

I have the following code:

function TSliverHelper.SlowNorth: TSlice;

var

  i: integer;

begin

  // Add pixels 0,1,2

  // This means expanding every bit into a byte

  // Or rather every byte into an int64;

  for i:= 0 to 7 do begin

    Result.Data8[i]:= TSuperSlice.Lookup012[Self.bytes[i]];

  end;

end;

This uses a straight forward lookup table, but obviously LUT's are slow and clobber the cache. This takes about 2860 millisecs for 100.000.000 items.

The following approach is a bit faster (1797 MS, or 37% faster):

function TSliverHelper.North: TSlice;

const

  SliverToSliceMask: array[0..7] of byte = ($01,$02,$04,$08,$10,$20,$40,$80);

asm

  //RCX = @Self    (a pointer to an Int64)

  //RDX = @Result  (a pointer to an array[0..63] of byte)

  movq xmm0,[rcx]                       //Get the sliver

  mov r9,$8040201008040201

  movq xmm15,r9 //[rip+SliverToSliceMask] //Get the mask

  movlhps xmm15,xmm15                   //extend it

  mov r8,$0101010101010101              //Shuffle mask

  movq xmm14,r8                         //00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 01

  pslldq xmm14,8                        //01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00

  movdqa xmm1,xmm0                      //make a copy of the sliver

  //bytes 0,1

  pshufb xmm1,xmm14                     //copy the first two bytes across

  pand xmm1,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm1,xmm15                    //Expand a bit into a byte

  movdqu [rdx],xmm1

  //bytes 2,3

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm2,xmm0

  pshufb xmm2,xmm14                     //copy the next two bytes across

  pand xmm2,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm2,xmm15                    //Expand a bit into a byte

  movdqu [rdx+16],xmm2

  //bytes 4,5

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm3,xmm0

  pshufb xmm3,xmm14                     //copy the next two bytes across

  pand xmm3,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm3,xmm15                    //Expand a bit into a byte

  movdqu [rdx+32],xmm3

  //bytes 6,7

  psrldq xmm0,2                         //shift in the next two bytes

  movdqa xmm4,xmm0

  pshufb xmm4,xmm14                     //copy the final two bytes across

  pand xmm4,xmm15                       //Mask off the relevant bits

  pcmpeqb xmm4,xmm15                    //Expand a bit into a byte

  //Store the data

  movdqu [rdx+48],xmm4

end;

The opposite operation is a simple PMSKMOVB.

I can use AVX1 code, but not AVX2.

performance assembly native-code pascal x86

edited Nov 12 at 17:28

200_success

128k15149412

asked Nov 12 at 9:02

Johan

24217

edited Nov 12 at 17:28

200_success

128k15149412

asked Nov 12 at 9:02

Johan

24217

edited Nov 12 at 17:28

200_success

128k15149412

edited Nov 12 at 17:28

200_success

128k15149412

edited Nov 12 at 17:28

200_success

128k15149412

asked Nov 12 at 9:02

Johan

24217

asked Nov 12 at 9:02

Johan

24217

asked Nov 12 at 9:02

Johan

24217

Don't you have any other options than Pascal (if I'm right)?
– Calak
Nov 12 at 9:30

It's in assembly. The Pascal code is just a wrapper, any language will do.
– Johan
Nov 12 at 9:37

1

Can't try it for now, but I assume you can reach the ASM's performance, shorter, with plain C.
– Calak
Nov 12 at 9:55

add a comment |

Don't you have any other options than Pascal (if I'm right)?
– Calak
Nov 12 at 9:30

It's in assembly. The Pascal code is just a wrapper, any language will do.
– Johan
Nov 12 at 9:37

1

Can't try it for now, but I assume you can reach the ASM's performance, shorter, with plain C.
– Calak
Nov 12 at 9:55

Don't you have any other options than Pascal (if I'm right)?
– Calak
Nov 12 at 9:30

It's in assembly. The Pascal code is just a wrapper, any language will do.
– Johan
Nov 12 at 9:37

Can't try it for now, but I assume you can reach the ASM's performance, shorter, with plain C.
– Calak
Nov 12 at 9:55

add a comment |

1 Answer
1

active

oldest

votes

up vote
1
down vote

Use a multiplication to perform several shifts in a single instruction.

Trim the input to seven bits to avoid overlap in the second step.

Shift by 0, 7, 14, 21, 28, 35, 42 bits and aggregate the results in a 64-bit integer.

Keep only bits 0, 8, 16, 24, 32, 40, 48.

Handle the 8th bit of the input separately. Shift by 49, then add it to the others.

Example code in C#

ulong Expand(byte b)

{

    ulong shift = 0x0000040810204081ul; // bits set: 0, 7, 14, 21, 28, 35, 42

    ulong mask = 0x0001010101010101ul; // bits set: 0, 8, 16, 24, 32, 40, 48

    return (ulong)(b & 127) * shift & mask | (ulong)(b & 128) << 49;

}

answered Nov 12 at 13:21

Rainer P.

1,066413

Nice trick for byte to qword.
– W. Chang
Nov 27 at 3:32

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f207462%2fexpand-every-bit-into-a-byte%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
1
down vote

Use a multiplication to perform several shifts in a single instruction.

Trim the input to seven bits to avoid overlap in the second step.

Shift by 0, 7, 14, 21, 28, 35, 42 bits and aggregate the results in a 64-bit integer.

Keep only bits 0, 8, 16, 24, 32, 40, 48.

Handle the 8th bit of the input separately. Shift by 49, then add it to the others.

Example code in C#

ulong Expand(byte b)

{

    ulong shift = 0x0000040810204081ul; // bits set: 0, 7, 14, 21, 28, 35, 42

    ulong mask = 0x0001010101010101ul; // bits set: 0, 8, 16, 24, 32, 40, 48

    return (ulong)(b & 127) * shift & mask | (ulong)(b & 128) << 49;

}

answered Nov 12 at 13:21

Rainer P.

1,066413

Nice trick for byte to qword.
– W. Chang
Nov 27 at 3:32

add a comment |

up vote
1
down vote

Use a multiplication to perform several shifts in a single instruction.

Trim the input to seven bits to avoid overlap in the second step.

Shift by 0, 7, 14, 21, 28, 35, 42 bits and aggregate the results in a 64-bit integer.

Keep only bits 0, 8, 16, 24, 32, 40, 48.

Handle the 8th bit of the input separately. Shift by 49, then add it to the others.

Example code in C#

ulong Expand(byte b)

{

    ulong shift = 0x0000040810204081ul; // bits set: 0, 7, 14, 21, 28, 35, 42

    ulong mask = 0x0001010101010101ul; // bits set: 0, 8, 16, 24, 32, 40, 48

    return (ulong)(b & 127) * shift & mask | (ulong)(b & 128) << 49;

}

answered Nov 12 at 13:21

Rainer P.

1,066413

Nice trick for byte to qword.
– W. Chang
Nov 27 at 3:32

add a comment |

up vote
1
down vote

Use a multiplication to perform several shifts in a single instruction.

Trim the input to seven bits to avoid overlap in the second step.

Shift by 0, 7, 14, 21, 28, 35, 42 bits and aggregate the results in a 64-bit integer.

Keep only bits 0, 8, 16, 24, 32, 40, 48.

Handle the 8th bit of the input separately. Shift by 49, then add it to the others.

Example code in C#

ulong Expand(byte b)

{

    ulong shift = 0x0000040810204081ul; // bits set: 0, 7, 14, 21, 28, 35, 42

    ulong mask = 0x0001010101010101ul; // bits set: 0, 8, 16, 24, 32, 40, 48

    return (ulong)(b & 127) * shift & mask | (ulong)(b & 128) << 49;

}

answered Nov 12 at 13:21

Rainer P.

1,066413

Use a multiplication to perform several shifts in a single instruction.

Trim the input to seven bits to avoid overlap in the second step.

Shift by 0, 7, 14, 21, 28, 35, 42 bits and aggregate the results in a 64-bit integer.

Keep only bits 0, 8, 16, 24, 32, 40, 48.

Handle the 8th bit of the input separately. Shift by 49, then add it to the others.

Example code in C#

ulong Expand(byte b)

{

    ulong shift = 0x0000040810204081ul; // bits set: 0, 7, 14, 21, 28, 35, 42

    ulong mask = 0x0001010101010101ul; // bits set: 0, 8, 16, 24, 32, 40, 48

    return (ulong)(b & 127) * shift & mask | (ulong)(b & 128) << 49;

}

answered Nov 12 at 13:21

Rainer P.

1,066413

answered Nov 12 at 13:21

Rainer P.

1,066413

answered Nov 12 at 13:21

Rainer P.

1,066413

answered Nov 12 at 13:21

Rainer P.

1,066413

Nice trick for byte to qword.
– W. Chang
Nov 27 at 3:32

add a comment |

Nice trick for byte to qword.
– W. Chang
Nov 27 at 3:32

Nice trick for byte to qword.
– W. Chang
Nov 27 at 3:32

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

TM5 HKnk3Jm,tkdE2m4SXPl1DHW,UFt3w4hrAz

搜尋此網誌

Gfrktyl