Getting the dominant RGB color of a bitmap











up vote
12
down vote

favorite
1












I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?



public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);

int stride = srcData.Stride;

IntPtr Scan0 = srcData.Scan0;

int totals = new int { 0, 0, 0 };

int width = bmp.Width;
int height = bmp.Height;

unsafe
{
byte* p = (byte*)(void*)Scan0;

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}

int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);

bmp.UnlockBits(srcData);

return Color.FromArgb(avgR, avgG, avgB);
}









share|improve this question
























  • The indentation looks very off - does it look exactly like that in your IDE?
    – Mathieu Guindon
    Mar 13 '17 at 18:56










  • Did you try to move independent computations outside the loop and not recalculating stuff?
    – Tamoghna Chowdhury
    Mar 14 '17 at 5:01










  • Consider Cuda/OpenGl. This is what GPUs are really for.
    – vnp
    Mar 14 '17 at 5:28















up vote
12
down vote

favorite
1












I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?



public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);

int stride = srcData.Stride;

IntPtr Scan0 = srcData.Scan0;

int totals = new int { 0, 0, 0 };

int width = bmp.Width;
int height = bmp.Height;

unsafe
{
byte* p = (byte*)(void*)Scan0;

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}

int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);

bmp.UnlockBits(srcData);

return Color.FromArgb(avgR, avgG, avgB);
}









share|improve this question
























  • The indentation looks very off - does it look exactly like that in your IDE?
    – Mathieu Guindon
    Mar 13 '17 at 18:56










  • Did you try to move independent computations outside the loop and not recalculating stuff?
    – Tamoghna Chowdhury
    Mar 14 '17 at 5:01










  • Consider Cuda/OpenGl. This is what GPUs are really for.
    – vnp
    Mar 14 '17 at 5:28













up vote
12
down vote

favorite
1









up vote
12
down vote

favorite
1






1





I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?



public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);

int stride = srcData.Stride;

IntPtr Scan0 = srcData.Scan0;

int totals = new int { 0, 0, 0 };

int width = bmp.Width;
int height = bmp.Height;

unsafe
{
byte* p = (byte*)(void*)Scan0;

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}

int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);

bmp.UnlockBits(srcData);

return Color.FromArgb(avgR, avgG, avgB);
}









share|improve this question















I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?



public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);

int stride = srcData.Stride;

IntPtr Scan0 = srcData.Scan0;

int totals = new int { 0, 0, 0 };

int width = bmp.Width;
int height = bmp.Height;

unsafe
{
byte* p = (byte*)(void*)Scan0;

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}

int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);

bmp.UnlockBits(srcData);

return Color.FromArgb(avgR, avgG, avgB);
}






c# performance image






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Mar 13 '17 at 19:16









200_success

127k14148410




127k14148410










asked Mar 13 '17 at 18:52









Arjan Kuiper

6113




6113












  • The indentation looks very off - does it look exactly like that in your IDE?
    – Mathieu Guindon
    Mar 13 '17 at 18:56










  • Did you try to move independent computations outside the loop and not recalculating stuff?
    – Tamoghna Chowdhury
    Mar 14 '17 at 5:01










  • Consider Cuda/OpenGl. This is what GPUs are really for.
    – vnp
    Mar 14 '17 at 5:28


















  • The indentation looks very off - does it look exactly like that in your IDE?
    – Mathieu Guindon
    Mar 13 '17 at 18:56










  • Did you try to move independent computations outside the loop and not recalculating stuff?
    – Tamoghna Chowdhury
    Mar 14 '17 at 5:01










  • Consider Cuda/OpenGl. This is what GPUs are really for.
    – vnp
    Mar 14 '17 at 5:28
















The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56




The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56












Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01




Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01












Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28




Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28










6 Answers
6






active

oldest

votes

















up vote
5
down vote













If we look at your for loop very carefully we can see that you can eliminate two multiplications:




for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}



You don't use y except for y * stride, and you don't use x except for x * 4, rewrite the for loop and you can eliminate those entirely.



var heightLimit = height * stride;
var widthLimit = width * 4;

for (int y = 0; y < heightLimit; y += stride) {
for (int x = 0; x < widthLimit; x += 4) {
for (int color = 0; color < 3; color++) {
int idx = y + x + color;
totals[color] += p[idx];
}
}
}


By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)



Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).



With as frequently as you call this method, the next lines may be a potential for optimization as well:




int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);



Why do width * height in all three calculations? Why not store var pixelCount = width * height; then divide by pixelCount? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.



You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.






share|improve this answer






























    up vote
    5
    down vote













    One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.



    Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):



            unsafe
    {
    uint* p = (uint*)(void*)Scan0;

    uint pixelCount = width * height;
    uint idx = 0;
    while (idx < (pixelCount & ~0xff)) {
    uint sumRR00BB = 0;
    uint sum00GG00 = 0;
    for (int j = 0; j < 0x100; j++) {
    sumRR00BB += p[idx] & 0xff00ff;
    sum00GG00 += p[idx] & 0x00ff00;
    idx++;
    }

    totals[0] += sumRR00BB >> 16;
    totals[1] += sum00GG00 >> 8;
    totals[2] += sumRR00BB & 0xffff;
    }

    // And the final partial block of fewer than 0x100 pixels.
    {
    uint sumRR00BB = 0;
    uint sum00GG00 = 0;
    while (idx < pixelCount) {
    sumRR00BB += p[idx] & 0xff00ff;
    sum00GG00 += p[idx] & 0x00ff00;
    idx++;
    }

    totals[0] += sumRR00BB >> 16;
    totals[1] += sum00GG00 >> 8;
    totals[2] += sumRR00BB & 0xffff;
    }
    }





    share|improve this answer

















    • 1




      You need to change to totals[0] += sumRR00BB & 0xffff; and totals[2] += sumRR00BB >> 16; because the color components are stored BGR instead of RGB.
      – Heslacher
      Mar 14 '17 at 11:45










    • This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
      – Viezevingertjes
      Oct 31 at 19:43












    • @Viezevingertjes, I don't understand your objection. What does p[idx] contain if not a pixel's value?
      – Peter Taylor
      Oct 31 at 19:57










    • p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
      – Viezevingertjes
      Oct 31 at 20:01








    • 1




      @Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range 0x00 to 0xff the result will be in the range 0x0000 to 0xffff. So if you mask with 0xff00ff and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000 and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
      – Peter Taylor
      Nov 1 at 7:53


















    up vote
    4
    down vote













    Five ideas, from easy to hard:




    • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.


    • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.


    • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.


    • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.



    • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:



          MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
      MOV EDX, Scan0 ;EDX is our data pointer
      SUB BX, BX ;Set BX to 0 for later
      Loop:
      LODSL ;Load EAX from array, increment pointer
      SHRL 8, EAX ;Dump the least eight bits
      ADDB GreenTotal, AL ;Add least 8 bits to green total
      ADCW GreenTotal+1,BX ;Carry the 1
      SHRL 8, EAX ;Shift right 8 more bits
      ADDB BlueTotal, AL ;Add least 8 bits to blue total
      ADCW BlueTotal+1, BX ;Carry the 1
      SHRL 8, EAX ;Shift right 8 more bits
      ADDB RedTotal, AL ;Add least 8 bits to red total
      ADCW RedTotal+1, BX ;Carry the 1
      LOOPNZ Loop ;Decrement CX and keep going until it is zero


      If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.








    share|improve this answer



















    • 2




      I find it hard to believe that this would improve performance at all.
      – Nic Hartley
      Mar 14 '17 at 4:15










    • Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
      – John Wu
      Mar 14 '17 at 6:20










    • Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
      – Nic Hartley
      Mar 14 '17 at 15:22










    • Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
      – Nic Hartley
      Mar 14 '17 at 17:20


















    up vote
    2
    down vote













    Validation



    Because the method is public you shouldn't assume that you get a valid non null Bitmap. You should add a null check otherwise you are exposing implementation details of your method.



    Naming



    Based on the C# Naming guidelines methods should be named using PascalCase casing. Method level variables should be named using camelCase casing. Hence getDominantColor->GetDominantColor and IntPtr Scan0->IntPtr scan0.



    Possible problems



    You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.



    A problem could come up if you use this method with some different bitmap.




    • If the bitmap which is passed is of DIN A4 size with e.g 300dpi your int totals will overflow.


    Performance



    I would suggest to use pointer arithmetic instead of calculating each time the idx value. I also would remove the most inner loop like @Zefick posted.



    public System.Drawing.Color GetDominantColor(Bitmap bmp)
    {
    if (bmp == null)
    {
    throw new ArgumentNullException("bmp");
    }

    BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);

    int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;

    int stride = srcData.Stride;

    IntPtr scan0 = srcData.Scan0;

    long totals = new long { 0, 0, 0 };

    int width = bmp.Width * bytesPerPixel;
    int height = bmp.Height;

    unsafe
    {
    byte* p = (byte*)(void*)scan0;

    for (int y = 0; y < height; y++)
    {
    for (int x = 0; x < width; x += bytesPerPixel)
    {
    totals[0] += p[x + 0];
    totals[1] += p[x + 1];
    totals[2] += p[x + 2];
    }

    p += stride;
    }
    }

    long pixelCount = bmp.Width * height;

    int avgB = Convert.ToInt32(totals[0] / pixelCount);
    int avgG = Convert.ToInt32(totals[1] / pixelCount);
    int avgR = Convert.ToInt32(totals[2] / pixelCount);

    bmp.UnlockBits(srcData);

    return Color.FromArgb(avgR, avgG, avgB);

    }


    Benchmarking with BechnmarkDotNet with x64 compiled yields



    Yours: 17.5252 ms

    EBrown's: 14.6109 ms

    Mine: 8.4846 ms
    Peter Taylor's: 4.6419 ms



    Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap






    share|improve this answer






























      up vote
      1
      down vote













      At least you can unroll the most inner loop without losing quality following way:



      for (int y = 0; y < height; y++) {
      for (int x = 0; x < width; x++) {
      int idx = (y * stride) + x * 4;
      totals[color] += p[idx];
      totals[color+1] += p[idx+1];
      totals[color+2] += p[idx+2];
      }
      }


      Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.






      share|improve this answer






























        up vote
        0
        down vote













        Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.



        The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.



        Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.



            public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
        {
        var rectangle = new Rectangle(0, 0, source.Width, source.Height);
        var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
        var pixelCount = region.Width * region.Height;
        var scanWidth = bitmapData.Stride / 4;

        uint totals = { 0, 0, 0 };
        int flushCounter = 0;
        uint sumRR00BB = 0;
        uint sum00GG00 = 0;

        for (var y = region.Y; y < region.Y + region.Height; y++)
        {
        uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;

        for (var x = region.X; x < region.X + region.Width; x++)
        {
        sumRR00BB += row[x] & 0xff00ff;
        sum00GG00 += row[x] & 0x00ff00;

        // Flush before overflow occurs.
        if (flushCounter++ == 0xff)
        {
        totals[0] += sumRR00BB >> 16;
        totals[1] += sum00GG00 >> 8;
        totals[2] += sumRR00BB & 0xffff;

        sumRR00BB = 0;
        sum00GG00 = 0;

        flushCounter = 0;
        }
        }
        }

        // Flush left-over's.
        totals[0] += sumRR00BB >> 16;
        totals[1] += sum00GG00 >> 8;
        totals[2] += sumRR00BB & 0xffff;

        // Average the totals.
        totals[0] /= (uint)pixelCount;
        totals[1] /= (uint)pixelCount;
        totals[2] /= (uint)pixelCount;

        source.UnlockBits(bitmapData);

        return totals;
        }





        share|improve this answer





















          Your Answer





          StackExchange.ifUsing("editor", function () {
          return StackExchange.using("mathjaxEditing", function () {
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
          });
          });
          }, "mathjax-editing");

          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "196"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














           

          draft saved


          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f157667%2fgetting-the-dominant-rgb-color-of-a-bitmap%23new-answer', 'question_page');
          }
          );

          Post as a guest
































          6 Answers
          6






          active

          oldest

          votes








          6 Answers
          6






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          5
          down vote













          If we look at your for loop very carefully we can see that you can eliminate two multiplications:




          for (int y = 0; y < height; y++) {
          for (int x = 0; x < width; x++) {
          for (int color = 0; color < 3; color++) {
          int idx = (y * stride) + x * 4 + color;
          totals[color] += p[idx];
          }
          }
          }



          You don't use y except for y * stride, and you don't use x except for x * 4, rewrite the for loop and you can eliminate those entirely.



          var heightLimit = height * stride;
          var widthLimit = width * 4;

          for (int y = 0; y < heightLimit; y += stride) {
          for (int x = 0; x < widthLimit; x += 4) {
          for (int color = 0; color < 3; color++) {
          int idx = y + x + color;
          totals[color] += p[idx];
          }
          }
          }


          By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)



          Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).



          With as frequently as you call this method, the next lines may be a potential for optimization as well:




          int avgB = totals[0] / (width * height);
          int avgG = totals[1] / (width * height);
          int avgR = totals[2] / (width * height);



          Why do width * height in all three calculations? Why not store var pixelCount = width * height; then divide by pixelCount? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.



          You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.






          share|improve this answer



























            up vote
            5
            down vote













            If we look at your for loop very carefully we can see that you can eliminate two multiplications:




            for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
            for (int color = 0; color < 3; color++) {
            int idx = (y * stride) + x * 4 + color;
            totals[color] += p[idx];
            }
            }
            }



            You don't use y except for y * stride, and you don't use x except for x * 4, rewrite the for loop and you can eliminate those entirely.



            var heightLimit = height * stride;
            var widthLimit = width * 4;

            for (int y = 0; y < heightLimit; y += stride) {
            for (int x = 0; x < widthLimit; x += 4) {
            for (int color = 0; color < 3; color++) {
            int idx = y + x + color;
            totals[color] += p[idx];
            }
            }
            }


            By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)



            Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).



            With as frequently as you call this method, the next lines may be a potential for optimization as well:




            int avgB = totals[0] / (width * height);
            int avgG = totals[1] / (width * height);
            int avgR = totals[2] / (width * height);



            Why do width * height in all three calculations? Why not store var pixelCount = width * height; then divide by pixelCount? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.



            You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.






            share|improve this answer

























              up vote
              5
              down vote










              up vote
              5
              down vote









              If we look at your for loop very carefully we can see that you can eliminate two multiplications:




              for (int y = 0; y < height; y++) {
              for (int x = 0; x < width; x++) {
              for (int color = 0; color < 3; color++) {
              int idx = (y * stride) + x * 4 + color;
              totals[color] += p[idx];
              }
              }
              }



              You don't use y except for y * stride, and you don't use x except for x * 4, rewrite the for loop and you can eliminate those entirely.



              var heightLimit = height * stride;
              var widthLimit = width * 4;

              for (int y = 0; y < heightLimit; y += stride) {
              for (int x = 0; x < widthLimit; x += 4) {
              for (int color = 0; color < 3; color++) {
              int idx = y + x + color;
              totals[color] += p[idx];
              }
              }
              }


              By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)



              Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).



              With as frequently as you call this method, the next lines may be a potential for optimization as well:




              int avgB = totals[0] / (width * height);
              int avgG = totals[1] / (width * height);
              int avgR = totals[2] / (width * height);



              Why do width * height in all three calculations? Why not store var pixelCount = width * height; then divide by pixelCount? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.



              You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.






              share|improve this answer














              If we look at your for loop very carefully we can see that you can eliminate two multiplications:




              for (int y = 0; y < height; y++) {
              for (int x = 0; x < width; x++) {
              for (int color = 0; color < 3; color++) {
              int idx = (y * stride) + x * 4 + color;
              totals[color] += p[idx];
              }
              }
              }



              You don't use y except for y * stride, and you don't use x except for x * 4, rewrite the for loop and you can eliminate those entirely.



              var heightLimit = height * stride;
              var widthLimit = width * 4;

              for (int y = 0; y < heightLimit; y += stride) {
              for (int x = 0; x < widthLimit; x += 4) {
              for (int color = 0; color < 3; color++) {
              int idx = y + x + color;
              totals[color] += p[idx];
              }
              }
              }


              By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)



              Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).



              With as frequently as you call this method, the next lines may be a potential for optimization as well:




              int avgB = totals[0] / (width * height);
              int avgG = totals[1] / (width * height);
              int avgR = totals[2] / (width * height);



              Why do width * height in all three calculations? Why not store var pixelCount = width * height; then divide by pixelCount? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.



              You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.







              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited May 23 '17 at 12:40









              Community

              1




              1










              answered Mar 14 '17 at 6:37









              202_accepted

              15.2k249130




              15.2k249130
























                  up vote
                  5
                  down vote













                  One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.



                  Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):



                          unsafe
                  {
                  uint* p = (uint*)(void*)Scan0;

                  uint pixelCount = width * height;
                  uint idx = 0;
                  while (idx < (pixelCount & ~0xff)) {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  for (int j = 0; j < 0x100; j++) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }

                  // And the final partial block of fewer than 0x100 pixels.
                  {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  while (idx < pixelCount) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }
                  }





                  share|improve this answer

















                  • 1




                    You need to change to totals[0] += sumRR00BB & 0xffff; and totals[2] += sumRR00BB >> 16; because the color components are stored BGR instead of RGB.
                    – Heslacher
                    Mar 14 '17 at 11:45










                  • This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
                    – Viezevingertjes
                    Oct 31 at 19:43












                  • @Viezevingertjes, I don't understand your objection. What does p[idx] contain if not a pixel's value?
                    – Peter Taylor
                    Oct 31 at 19:57










                  • p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
                    – Viezevingertjes
                    Oct 31 at 20:01








                  • 1




                    @Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range 0x00 to 0xff the result will be in the range 0x0000 to 0xffff. So if you mask with 0xff00ff and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000 and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
                    – Peter Taylor
                    Nov 1 at 7:53















                  up vote
                  5
                  down vote













                  One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.



                  Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):



                          unsafe
                  {
                  uint* p = (uint*)(void*)Scan0;

                  uint pixelCount = width * height;
                  uint idx = 0;
                  while (idx < (pixelCount & ~0xff)) {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  for (int j = 0; j < 0x100; j++) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }

                  // And the final partial block of fewer than 0x100 pixels.
                  {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  while (idx < pixelCount) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }
                  }





                  share|improve this answer

















                  • 1




                    You need to change to totals[0] += sumRR00BB & 0xffff; and totals[2] += sumRR00BB >> 16; because the color components are stored BGR instead of RGB.
                    – Heslacher
                    Mar 14 '17 at 11:45










                  • This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
                    – Viezevingertjes
                    Oct 31 at 19:43












                  • @Viezevingertjes, I don't understand your objection. What does p[idx] contain if not a pixel's value?
                    – Peter Taylor
                    Oct 31 at 19:57










                  • p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
                    – Viezevingertjes
                    Oct 31 at 20:01








                  • 1




                    @Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range 0x00 to 0xff the result will be in the range 0x0000 to 0xffff. So if you mask with 0xff00ff and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000 and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
                    – Peter Taylor
                    Nov 1 at 7:53













                  up vote
                  5
                  down vote










                  up vote
                  5
                  down vote









                  One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.



                  Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):



                          unsafe
                  {
                  uint* p = (uint*)(void*)Scan0;

                  uint pixelCount = width * height;
                  uint idx = 0;
                  while (idx < (pixelCount & ~0xff)) {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  for (int j = 0; j < 0x100; j++) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }

                  // And the final partial block of fewer than 0x100 pixels.
                  {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  while (idx < pixelCount) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }
                  }





                  share|improve this answer












                  One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.



                  Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):



                          unsafe
                  {
                  uint* p = (uint*)(void*)Scan0;

                  uint pixelCount = width * height;
                  uint idx = 0;
                  while (idx < (pixelCount & ~0xff)) {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  for (int j = 0; j < 0x100; j++) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }

                  // And the final partial block of fewer than 0x100 pixels.
                  {
                  uint sumRR00BB = 0;
                  uint sum00GG00 = 0;
                  while (idx < pixelCount) {
                  sumRR00BB += p[idx] & 0xff00ff;
                  sum00GG00 += p[idx] & 0x00ff00;
                  idx++;
                  }

                  totals[0] += sumRR00BB >> 16;
                  totals[1] += sum00GG00 >> 8;
                  totals[2] += sumRR00BB & 0xffff;
                  }
                  }






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Mar 14 '17 at 9:00









                  Peter Taylor

                  15.3k2657




                  15.3k2657








                  • 1




                    You need to change to totals[0] += sumRR00BB & 0xffff; and totals[2] += sumRR00BB >> 16; because the color components are stored BGR instead of RGB.
                    – Heslacher
                    Mar 14 '17 at 11:45










                  • This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
                    – Viezevingertjes
                    Oct 31 at 19:43












                  • @Viezevingertjes, I don't understand your objection. What does p[idx] contain if not a pixel's value?
                    – Peter Taylor
                    Oct 31 at 19:57










                  • p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
                    – Viezevingertjes
                    Oct 31 at 20:01








                  • 1




                    @Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range 0x00 to 0xff the result will be in the range 0x0000 to 0xffff. So if you mask with 0xff00ff and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000 and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
                    – Peter Taylor
                    Nov 1 at 7:53














                  • 1




                    You need to change to totals[0] += sumRR00BB & 0xffff; and totals[2] += sumRR00BB >> 16; because the color components are stored BGR instead of RGB.
                    – Heslacher
                    Mar 14 '17 at 11:45










                  • This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
                    – Viezevingertjes
                    Oct 31 at 19:43












                  • @Viezevingertjes, I don't understand your objection. What does p[idx] contain if not a pixel's value?
                    – Peter Taylor
                    Oct 31 at 19:57










                  • p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
                    – Viezevingertjes
                    Oct 31 at 20:01








                  • 1




                    @Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range 0x00 to 0xff the result will be in the range 0x0000 to 0xffff. So if you mask with 0xff00ff and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000 and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
                    – Peter Taylor
                    Nov 1 at 7:53








                  1




                  1




                  You need to change to totals[0] += sumRR00BB & 0xffff; and totals[2] += sumRR00BB >> 16; because the color components are stored BGR instead of RGB.
                  – Heslacher
                  Mar 14 '17 at 11:45




                  You need to change to totals[0] += sumRR00BB & 0xffff; and totals[2] += sumRR00BB >> 16; because the color components are stored BGR instead of RGB.
                  – Heslacher
                  Mar 14 '17 at 11:45












                  This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
                  – Viezevingertjes
                  Oct 31 at 19:43






                  This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
                  – Viezevingertjes
                  Oct 31 at 19:43














                  @Viezevingertjes, I don't understand your objection. What does p[idx] contain if not a pixel's value?
                  – Peter Taylor
                  Oct 31 at 19:57




                  @Viezevingertjes, I don't understand your objection. What does p[idx] contain if not a pixel's value?
                  – Peter Taylor
                  Oct 31 at 19:57












                  p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
                  – Viezevingertjes
                  Oct 31 at 20:01






                  p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
                  – Viezevingertjes
                  Oct 31 at 20:01






                  1




                  1




                  @Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range 0x00 to 0xff the result will be in the range 0x0000 to 0xffff. So if you mask with 0xff00ff and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000 and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
                  – Peter Taylor
                  Nov 1 at 7:53




                  @Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range 0x00 to 0xff the result will be in the range 0x0000 to 0xffff. So if you mask with 0xff00ff and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000 and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
                  – Peter Taylor
                  Nov 1 at 7:53










                  up vote
                  4
                  down vote













                  Five ideas, from easy to hard:




                  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.


                  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.


                  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.


                  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.



                  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:



                        MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
                    MOV EDX, Scan0 ;EDX is our data pointer
                    SUB BX, BX ;Set BX to 0 for later
                    Loop:
                    LODSL ;Load EAX from array, increment pointer
                    SHRL 8, EAX ;Dump the least eight bits
                    ADDB GreenTotal, AL ;Add least 8 bits to green total
                    ADCW GreenTotal+1,BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB BlueTotal, AL ;Add least 8 bits to blue total
                    ADCW BlueTotal+1, BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB RedTotal, AL ;Add least 8 bits to red total
                    ADCW RedTotal+1, BX ;Carry the 1
                    LOOPNZ Loop ;Decrement CX and keep going until it is zero


                    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.








                  share|improve this answer



















                  • 2




                    I find it hard to believe that this would improve performance at all.
                    – Nic Hartley
                    Mar 14 '17 at 4:15










                  • Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
                    – John Wu
                    Mar 14 '17 at 6:20










                  • Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
                    – Nic Hartley
                    Mar 14 '17 at 15:22










                  • Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
                    – Nic Hartley
                    Mar 14 '17 at 17:20















                  up vote
                  4
                  down vote













                  Five ideas, from easy to hard:




                  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.


                  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.


                  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.


                  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.



                  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:



                        MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
                    MOV EDX, Scan0 ;EDX is our data pointer
                    SUB BX, BX ;Set BX to 0 for later
                    Loop:
                    LODSL ;Load EAX from array, increment pointer
                    SHRL 8, EAX ;Dump the least eight bits
                    ADDB GreenTotal, AL ;Add least 8 bits to green total
                    ADCW GreenTotal+1,BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB BlueTotal, AL ;Add least 8 bits to blue total
                    ADCW BlueTotal+1, BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB RedTotal, AL ;Add least 8 bits to red total
                    ADCW RedTotal+1, BX ;Carry the 1
                    LOOPNZ Loop ;Decrement CX and keep going until it is zero


                    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.








                  share|improve this answer



















                  • 2




                    I find it hard to believe that this would improve performance at all.
                    – Nic Hartley
                    Mar 14 '17 at 4:15










                  • Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
                    – John Wu
                    Mar 14 '17 at 6:20










                  • Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
                    – Nic Hartley
                    Mar 14 '17 at 15:22










                  • Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
                    – Nic Hartley
                    Mar 14 '17 at 17:20













                  up vote
                  4
                  down vote










                  up vote
                  4
                  down vote









                  Five ideas, from easy to hard:




                  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.


                  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.


                  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.


                  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.



                  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:



                        MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
                    MOV EDX, Scan0 ;EDX is our data pointer
                    SUB BX, BX ;Set BX to 0 for later
                    Loop:
                    LODSL ;Load EAX from array, increment pointer
                    SHRL 8, EAX ;Dump the least eight bits
                    ADDB GreenTotal, AL ;Add least 8 bits to green total
                    ADCW GreenTotal+1,BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB BlueTotal, AL ;Add least 8 bits to blue total
                    ADCW BlueTotal+1, BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB RedTotal, AL ;Add least 8 bits to red total
                    ADCW RedTotal+1, BX ;Carry the 1
                    LOOPNZ Loop ;Decrement CX and keep going until it is zero


                    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.








                  share|improve this answer














                  Five ideas, from easy to hard:




                  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.


                  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.


                  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.


                  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.



                  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:



                        MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
                    MOV EDX, Scan0 ;EDX is our data pointer
                    SUB BX, BX ;Set BX to 0 for later
                    Loop:
                    LODSL ;Load EAX from array, increment pointer
                    SHRL 8, EAX ;Dump the least eight bits
                    ADDB GreenTotal, AL ;Add least 8 bits to green total
                    ADCW GreenTotal+1,BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB BlueTotal, AL ;Add least 8 bits to blue total
                    ADCW BlueTotal+1, BX ;Carry the 1
                    SHRL 8, EAX ;Shift right 8 more bits
                    ADDB RedTotal, AL ;Add least 8 bits to red total
                    ADCW RedTotal+1, BX ;Carry the 1
                    LOOPNZ Loop ;Decrement CX and keep going until it is zero


                    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.









                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited May 23 '17 at 12:40









                  Community

                  1




                  1










                  answered Mar 14 '17 at 3:09









                  John Wu

                  54136




                  54136








                  • 2




                    I find it hard to believe that this would improve performance at all.
                    – Nic Hartley
                    Mar 14 '17 at 4:15










                  • Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
                    – John Wu
                    Mar 14 '17 at 6:20










                  • Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
                    – Nic Hartley
                    Mar 14 '17 at 15:22










                  • Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
                    – Nic Hartley
                    Mar 14 '17 at 17:20














                  • 2




                    I find it hard to believe that this would improve performance at all.
                    – Nic Hartley
                    Mar 14 '17 at 4:15










                  • Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
                    – John Wu
                    Mar 14 '17 at 6:20










                  • Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
                    – Nic Hartley
                    Mar 14 '17 at 15:22










                  • Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
                    – Nic Hartley
                    Mar 14 '17 at 17:20








                  2




                  2




                  I find it hard to believe that this would improve performance at all.
                  – Nic Hartley
                  Mar 14 '17 at 4:15




                  I find it hard to believe that this would improve performance at all.
                  – Nic Hartley
                  Mar 14 '17 at 4:15












                  Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
                  – John Wu
                  Mar 14 '17 at 6:20




                  Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
                  – John Wu
                  Mar 14 '17 at 6:20












                  Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
                  – Nic Hartley
                  Mar 14 '17 at 15:22




                  Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
                  – Nic Hartley
                  Mar 14 '17 at 15:22












                  Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
                  – Nic Hartley
                  Mar 14 '17 at 17:20




                  Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
                  – Nic Hartley
                  Mar 14 '17 at 17:20










                  up vote
                  2
                  down vote













                  Validation



                  Because the method is public you shouldn't assume that you get a valid non null Bitmap. You should add a null check otherwise you are exposing implementation details of your method.



                  Naming



                  Based on the C# Naming guidelines methods should be named using PascalCase casing. Method level variables should be named using camelCase casing. Hence getDominantColor->GetDominantColor and IntPtr Scan0->IntPtr scan0.



                  Possible problems



                  You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.



                  A problem could come up if you use this method with some different bitmap.




                  • If the bitmap which is passed is of DIN A4 size with e.g 300dpi your int totals will overflow.


                  Performance



                  I would suggest to use pointer arithmetic instead of calculating each time the idx value. I also would remove the most inner loop like @Zefick posted.



                  public System.Drawing.Color GetDominantColor(Bitmap bmp)
                  {
                  if (bmp == null)
                  {
                  throw new ArgumentNullException("bmp");
                  }

                  BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);

                  int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;

                  int stride = srcData.Stride;

                  IntPtr scan0 = srcData.Scan0;

                  long totals = new long { 0, 0, 0 };

                  int width = bmp.Width * bytesPerPixel;
                  int height = bmp.Height;

                  unsafe
                  {
                  byte* p = (byte*)(void*)scan0;

                  for (int y = 0; y < height; y++)
                  {
                  for (int x = 0; x < width; x += bytesPerPixel)
                  {
                  totals[0] += p[x + 0];
                  totals[1] += p[x + 1];
                  totals[2] += p[x + 2];
                  }

                  p += stride;
                  }
                  }

                  long pixelCount = bmp.Width * height;

                  int avgB = Convert.ToInt32(totals[0] / pixelCount);
                  int avgG = Convert.ToInt32(totals[1] / pixelCount);
                  int avgR = Convert.ToInt32(totals[2] / pixelCount);

                  bmp.UnlockBits(srcData);

                  return Color.FromArgb(avgR, avgG, avgB);

                  }


                  Benchmarking with BechnmarkDotNet with x64 compiled yields



                  Yours: 17.5252 ms

                  EBrown's: 14.6109 ms

                  Mine: 8.4846 ms
                  Peter Taylor's: 4.6419 ms



                  Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap






                  share|improve this answer



























                    up vote
                    2
                    down vote













                    Validation



                    Because the method is public you shouldn't assume that you get a valid non null Bitmap. You should add a null check otherwise you are exposing implementation details of your method.



                    Naming



                    Based on the C# Naming guidelines methods should be named using PascalCase casing. Method level variables should be named using camelCase casing. Hence getDominantColor->GetDominantColor and IntPtr Scan0->IntPtr scan0.



                    Possible problems



                    You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.



                    A problem could come up if you use this method with some different bitmap.




                    • If the bitmap which is passed is of DIN A4 size with e.g 300dpi your int totals will overflow.


                    Performance



                    I would suggest to use pointer arithmetic instead of calculating each time the idx value. I also would remove the most inner loop like @Zefick posted.



                    public System.Drawing.Color GetDominantColor(Bitmap bmp)
                    {
                    if (bmp == null)
                    {
                    throw new ArgumentNullException("bmp");
                    }

                    BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);

                    int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;

                    int stride = srcData.Stride;

                    IntPtr scan0 = srcData.Scan0;

                    long totals = new long { 0, 0, 0 };

                    int width = bmp.Width * bytesPerPixel;
                    int height = bmp.Height;

                    unsafe
                    {
                    byte* p = (byte*)(void*)scan0;

                    for (int y = 0; y < height; y++)
                    {
                    for (int x = 0; x < width; x += bytesPerPixel)
                    {
                    totals[0] += p[x + 0];
                    totals[1] += p[x + 1];
                    totals[2] += p[x + 2];
                    }

                    p += stride;
                    }
                    }

                    long pixelCount = bmp.Width * height;

                    int avgB = Convert.ToInt32(totals[0] / pixelCount);
                    int avgG = Convert.ToInt32(totals[1] / pixelCount);
                    int avgR = Convert.ToInt32(totals[2] / pixelCount);

                    bmp.UnlockBits(srcData);

                    return Color.FromArgb(avgR, avgG, avgB);

                    }


                    Benchmarking with BechnmarkDotNet with x64 compiled yields



                    Yours: 17.5252 ms

                    EBrown's: 14.6109 ms

                    Mine: 8.4846 ms
                    Peter Taylor's: 4.6419 ms



                    Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap






                    share|improve this answer

























                      up vote
                      2
                      down vote










                      up vote
                      2
                      down vote









                      Validation



                      Because the method is public you shouldn't assume that you get a valid non null Bitmap. You should add a null check otherwise you are exposing implementation details of your method.



                      Naming



                      Based on the C# Naming guidelines methods should be named using PascalCase casing. Method level variables should be named using camelCase casing. Hence getDominantColor->GetDominantColor and IntPtr Scan0->IntPtr scan0.



                      Possible problems



                      You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.



                      A problem could come up if you use this method with some different bitmap.




                      • If the bitmap which is passed is of DIN A4 size with e.g 300dpi your int totals will overflow.


                      Performance



                      I would suggest to use pointer arithmetic instead of calculating each time the idx value. I also would remove the most inner loop like @Zefick posted.



                      public System.Drawing.Color GetDominantColor(Bitmap bmp)
                      {
                      if (bmp == null)
                      {
                      throw new ArgumentNullException("bmp");
                      }

                      BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);

                      int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;

                      int stride = srcData.Stride;

                      IntPtr scan0 = srcData.Scan0;

                      long totals = new long { 0, 0, 0 };

                      int width = bmp.Width * bytesPerPixel;
                      int height = bmp.Height;

                      unsafe
                      {
                      byte* p = (byte*)(void*)scan0;

                      for (int y = 0; y < height; y++)
                      {
                      for (int x = 0; x < width; x += bytesPerPixel)
                      {
                      totals[0] += p[x + 0];
                      totals[1] += p[x + 1];
                      totals[2] += p[x + 2];
                      }

                      p += stride;
                      }
                      }

                      long pixelCount = bmp.Width * height;

                      int avgB = Convert.ToInt32(totals[0] / pixelCount);
                      int avgG = Convert.ToInt32(totals[1] / pixelCount);
                      int avgR = Convert.ToInt32(totals[2] / pixelCount);

                      bmp.UnlockBits(srcData);

                      return Color.FromArgb(avgR, avgG, avgB);

                      }


                      Benchmarking with BechnmarkDotNet with x64 compiled yields



                      Yours: 17.5252 ms

                      EBrown's: 14.6109 ms

                      Mine: 8.4846 ms
                      Peter Taylor's: 4.6419 ms



                      Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap






                      share|improve this answer














                      Validation



                      Because the method is public you shouldn't assume that you get a valid non null Bitmap. You should add a null check otherwise you are exposing implementation details of your method.



                      Naming



                      Based on the C# Naming guidelines methods should be named using PascalCase casing. Method level variables should be named using camelCase casing. Hence getDominantColor->GetDominantColor and IntPtr Scan0->IntPtr scan0.



                      Possible problems



                      You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.



                      A problem could come up if you use this method with some different bitmap.




                      • If the bitmap which is passed is of DIN A4 size with e.g 300dpi your int totals will overflow.


                      Performance



                      I would suggest to use pointer arithmetic instead of calculating each time the idx value. I also would remove the most inner loop like @Zefick posted.



                      public System.Drawing.Color GetDominantColor(Bitmap bmp)
                      {
                      if (bmp == null)
                      {
                      throw new ArgumentNullException("bmp");
                      }

                      BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);

                      int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;

                      int stride = srcData.Stride;

                      IntPtr scan0 = srcData.Scan0;

                      long totals = new long { 0, 0, 0 };

                      int width = bmp.Width * bytesPerPixel;
                      int height = bmp.Height;

                      unsafe
                      {
                      byte* p = (byte*)(void*)scan0;

                      for (int y = 0; y < height; y++)
                      {
                      for (int x = 0; x < width; x += bytesPerPixel)
                      {
                      totals[0] += p[x + 0];
                      totals[1] += p[x + 1];
                      totals[2] += p[x + 2];
                      }

                      p += stride;
                      }
                      }

                      long pixelCount = bmp.Width * height;

                      int avgB = Convert.ToInt32(totals[0] / pixelCount);
                      int avgG = Convert.ToInt32(totals[1] / pixelCount);
                      int avgR = Convert.ToInt32(totals[2] / pixelCount);

                      bmp.UnlockBits(srcData);

                      return Color.FromArgb(avgR, avgG, avgB);

                      }


                      Benchmarking with BechnmarkDotNet with x64 compiled yields



                      Yours: 17.5252 ms

                      EBrown's: 14.6109 ms

                      Mine: 8.4846 ms
                      Peter Taylor's: 4.6419 ms



                      Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Apr 13 '17 at 12:40









                      Community

                      1




                      1










                      answered Mar 14 '17 at 13:25









                      Heslacher

                      44.3k460153




                      44.3k460153






















                          up vote
                          1
                          down vote













                          At least you can unroll the most inner loop without losing quality following way:



                          for (int y = 0; y < height; y++) {
                          for (int x = 0; x < width; x++) {
                          int idx = (y * stride) + x * 4;
                          totals[color] += p[idx];
                          totals[color+1] += p[idx+1];
                          totals[color+2] += p[idx+2];
                          }
                          }


                          Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.






                          share|improve this answer



























                            up vote
                            1
                            down vote













                            At least you can unroll the most inner loop without losing quality following way:



                            for (int y = 0; y < height; y++) {
                            for (int x = 0; x < width; x++) {
                            int idx = (y * stride) + x * 4;
                            totals[color] += p[idx];
                            totals[color+1] += p[idx+1];
                            totals[color+2] += p[idx+2];
                            }
                            }


                            Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.






                            share|improve this answer

























                              up vote
                              1
                              down vote










                              up vote
                              1
                              down vote









                              At least you can unroll the most inner loop without losing quality following way:



                              for (int y = 0; y < height; y++) {
                              for (int x = 0; x < width; x++) {
                              int idx = (y * stride) + x * 4;
                              totals[color] += p[idx];
                              totals[color+1] += p[idx+1];
                              totals[color+2] += p[idx+2];
                              }
                              }


                              Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.






                              share|improve this answer














                              At least you can unroll the most inner loop without losing quality following way:



                              for (int y = 0; y < height; y++) {
                              for (int x = 0; x < width; x++) {
                              int idx = (y * stride) + x * 4;
                              totals[color] += p[idx];
                              totals[color+1] += p[idx+1];
                              totals[color+2] += p[idx+2];
                              }
                              }


                              Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.







                              share|improve this answer














                              share|improve this answer



                              share|improve this answer








                              edited Mar 14 '17 at 5:31

























                              answered Mar 14 '17 at 4:53









                              Zefick

                              29626




                              29626






















                                  up vote
                                  0
                                  down vote













                                  Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.



                                  The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.



                                  Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.



                                      public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
                                  {
                                  var rectangle = new Rectangle(0, 0, source.Width, source.Height);
                                  var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
                                  var pixelCount = region.Width * region.Height;
                                  var scanWidth = bitmapData.Stride / 4;

                                  uint totals = { 0, 0, 0 };
                                  int flushCounter = 0;
                                  uint sumRR00BB = 0;
                                  uint sum00GG00 = 0;

                                  for (var y = region.Y; y < region.Y + region.Height; y++)
                                  {
                                  uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;

                                  for (var x = region.X; x < region.X + region.Width; x++)
                                  {
                                  sumRR00BB += row[x] & 0xff00ff;
                                  sum00GG00 += row[x] & 0x00ff00;

                                  // Flush before overflow occurs.
                                  if (flushCounter++ == 0xff)
                                  {
                                  totals[0] += sumRR00BB >> 16;
                                  totals[1] += sum00GG00 >> 8;
                                  totals[2] += sumRR00BB & 0xffff;

                                  sumRR00BB = 0;
                                  sum00GG00 = 0;

                                  flushCounter = 0;
                                  }
                                  }
                                  }

                                  // Flush left-over's.
                                  totals[0] += sumRR00BB >> 16;
                                  totals[1] += sum00GG00 >> 8;
                                  totals[2] += sumRR00BB & 0xffff;

                                  // Average the totals.
                                  totals[0] /= (uint)pixelCount;
                                  totals[1] /= (uint)pixelCount;
                                  totals[2] /= (uint)pixelCount;

                                  source.UnlockBits(bitmapData);

                                  return totals;
                                  }





                                  share|improve this answer

























                                    up vote
                                    0
                                    down vote













                                    Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.



                                    The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.



                                    Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.



                                        public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
                                    {
                                    var rectangle = new Rectangle(0, 0, source.Width, source.Height);
                                    var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
                                    var pixelCount = region.Width * region.Height;
                                    var scanWidth = bitmapData.Stride / 4;

                                    uint totals = { 0, 0, 0 };
                                    int flushCounter = 0;
                                    uint sumRR00BB = 0;
                                    uint sum00GG00 = 0;

                                    for (var y = region.Y; y < region.Y + region.Height; y++)
                                    {
                                    uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;

                                    for (var x = region.X; x < region.X + region.Width; x++)
                                    {
                                    sumRR00BB += row[x] & 0xff00ff;
                                    sum00GG00 += row[x] & 0x00ff00;

                                    // Flush before overflow occurs.
                                    if (flushCounter++ == 0xff)
                                    {
                                    totals[0] += sumRR00BB >> 16;
                                    totals[1] += sum00GG00 >> 8;
                                    totals[2] += sumRR00BB & 0xffff;

                                    sumRR00BB = 0;
                                    sum00GG00 = 0;

                                    flushCounter = 0;
                                    }
                                    }
                                    }

                                    // Flush left-over's.
                                    totals[0] += sumRR00BB >> 16;
                                    totals[1] += sum00GG00 >> 8;
                                    totals[2] += sumRR00BB & 0xffff;

                                    // Average the totals.
                                    totals[0] /= (uint)pixelCount;
                                    totals[1] /= (uint)pixelCount;
                                    totals[2] /= (uint)pixelCount;

                                    source.UnlockBits(bitmapData);

                                    return totals;
                                    }





                                    share|improve this answer























                                      up vote
                                      0
                                      down vote










                                      up vote
                                      0
                                      down vote









                                      Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.



                                      The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.



                                      Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.



                                          public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
                                      {
                                      var rectangle = new Rectangle(0, 0, source.Width, source.Height);
                                      var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
                                      var pixelCount = region.Width * region.Height;
                                      var scanWidth = bitmapData.Stride / 4;

                                      uint totals = { 0, 0, 0 };
                                      int flushCounter = 0;
                                      uint sumRR00BB = 0;
                                      uint sum00GG00 = 0;

                                      for (var y = region.Y; y < region.Y + region.Height; y++)
                                      {
                                      uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;

                                      for (var x = region.X; x < region.X + region.Width; x++)
                                      {
                                      sumRR00BB += row[x] & 0xff00ff;
                                      sum00GG00 += row[x] & 0x00ff00;

                                      // Flush before overflow occurs.
                                      if (flushCounter++ == 0xff)
                                      {
                                      totals[0] += sumRR00BB >> 16;
                                      totals[1] += sum00GG00 >> 8;
                                      totals[2] += sumRR00BB & 0xffff;

                                      sumRR00BB = 0;
                                      sum00GG00 = 0;

                                      flushCounter = 0;
                                      }
                                      }
                                      }

                                      // Flush left-over's.
                                      totals[0] += sumRR00BB >> 16;
                                      totals[1] += sum00GG00 >> 8;
                                      totals[2] += sumRR00BB & 0xffff;

                                      // Average the totals.
                                      totals[0] /= (uint)pixelCount;
                                      totals[1] /= (uint)pixelCount;
                                      totals[2] /= (uint)pixelCount;

                                      source.UnlockBits(bitmapData);

                                      return totals;
                                      }





                                      share|improve this answer












                                      Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.



                                      The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.



                                      Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.



                                          public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
                                      {
                                      var rectangle = new Rectangle(0, 0, source.Width, source.Height);
                                      var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
                                      var pixelCount = region.Width * region.Height;
                                      var scanWidth = bitmapData.Stride / 4;

                                      uint totals = { 0, 0, 0 };
                                      int flushCounter = 0;
                                      uint sumRR00BB = 0;
                                      uint sum00GG00 = 0;

                                      for (var y = region.Y; y < region.Y + region.Height; y++)
                                      {
                                      uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;

                                      for (var x = region.X; x < region.X + region.Width; x++)
                                      {
                                      sumRR00BB += row[x] & 0xff00ff;
                                      sum00GG00 += row[x] & 0x00ff00;

                                      // Flush before overflow occurs.
                                      if (flushCounter++ == 0xff)
                                      {
                                      totals[0] += sumRR00BB >> 16;
                                      totals[1] += sum00GG00 >> 8;
                                      totals[2] += sumRR00BB & 0xffff;

                                      sumRR00BB = 0;
                                      sum00GG00 = 0;

                                      flushCounter = 0;
                                      }
                                      }
                                      }

                                      // Flush left-over's.
                                      totals[0] += sumRR00BB >> 16;
                                      totals[1] += sum00GG00 >> 8;
                                      totals[2] += sumRR00BB & 0xffff;

                                      // Average the totals.
                                      totals[0] /= (uint)pixelCount;
                                      totals[1] /= (uint)pixelCount;
                                      totals[2] /= (uint)pixelCount;

                                      source.UnlockBits(bitmapData);

                                      return totals;
                                      }






                                      share|improve this answer












                                      share|improve this answer



                                      share|improve this answer










                                      answered 4 hours ago









                                      Viezevingertjes

                                      1375




                                      1375






























                                           

                                          draft saved


                                          draft discarded



















































                                           


                                          draft saved


                                          draft discarded














                                          StackExchange.ready(
                                          function () {
                                          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f157667%2fgetting-the-dominant-rgb-color-of-a-bitmap%23new-answer', 'question_page');
                                          }
                                          );

                                          Post as a guest




















































































                                          Popular posts from this blog

                                          Сан-Квентин

                                          8-я гвардейская общевойсковая армия

                                          Алькесар