Getting the dominant RGB color of a bitmap
up vote
12
down vote
favorite
I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?
public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
int stride = srcData.Stride;
IntPtr Scan0 = srcData.Scan0;
int totals = new int { 0, 0, 0 };
int width = bmp.Width;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)Scan0;
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
c# performance image
add a comment |
up vote
12
down vote
favorite
I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?
public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
int stride = srcData.Stride;
IntPtr Scan0 = srcData.Scan0;
int totals = new int { 0, 0, 0 };
int width = bmp.Width;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)Scan0;
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
c# performance image
The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56
Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01
Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28
add a comment |
up vote
12
down vote
favorite
up vote
12
down vote
favorite
I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?
public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
int stride = srcData.Stride;
IntPtr Scan0 = srcData.Scan0;
int totals = new int { 0, 0, 0 };
int width = bmp.Width;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)Scan0;
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
c# performance image
I'm currently running this function 60 times per second in order to get the dominant color on my screen in RGB values. It is using approximately 15% of my CPU on 30FPS and 25% of my CPU on 60FPS. Is there any way I can improve the efficiency of this loop or is there a better way to get the color altogether?
public Color getDominantColor(System.Drawing.Bitmap bmp) {
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, System.Drawing.Imaging.PixelFormat.Format32bppArgb);
int stride = srcData.Stride;
IntPtr Scan0 = srcData.Scan0;
int totals = new int { 0, 0, 0 };
int width = bmp.Width;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)Scan0;
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
}
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
c# performance image
c# performance image
edited Mar 13 '17 at 19:16
200_success
127k14148410
127k14148410
asked Mar 13 '17 at 18:52
Arjan Kuiper
6113
6113
The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56
Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01
Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28
add a comment |
The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56
Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01
Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28
The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56
The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56
Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01
Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01
Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28
Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28
add a comment |
6 Answers
6
active
oldest
votes
up vote
5
down vote
If we look at your for
loop very carefully we can see that you can eliminate two multiplications:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
You don't use y
except for y * stride
, and you don't use x
except for x * 4
, rewrite the for
loop and you can eliminate those entirely.
var heightLimit = height * stride;
var widthLimit = width * 4;
for (int y = 0; y < heightLimit; y += stride) {
for (int x = 0; x < widthLimit; x += 4) {
for (int color = 0; color < 3; color++) {
int idx = y + x + color;
totals[color] += p[idx];
}
}
}
By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)
Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).
With as frequently as you call this method, the next lines may be a potential for optimization as well:
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
Why do width * height
in all three calculations? Why not store var pixelCount = width * height;
then divide by pixelCount
? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.
You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.
add a comment |
up vote
5
down vote
One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.
Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):
unsafe
{
uint* p = (uint*)(void*)Scan0;
uint pixelCount = width * height;
uint idx = 0;
while (idx < (pixelCount & ~0xff)) {
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (int j = 0; j < 0x100; j++) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
// And the final partial block of fewer than 0x100 pixels.
{
uint sumRR00BB = 0;
uint sum00GG00 = 0;
while (idx < pixelCount) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
}
1
You need to change tototals[0] += sumRR00BB & 0xffff;
andtotals[2] += sumRR00BB >> 16;
because the color components are storedBGR
instead ofRGB
.
– Heslacher
Mar 14 '17 at 11:45
This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
– Viezevingertjes
Oct 31 at 19:43
@Viezevingertjes, I don't understand your objection. What doesp[idx]
contain if not a pixel's value?
– Peter Taylor
Oct 31 at 19:57
p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
– Viezevingertjes
Oct 31 at 20:01
1
@Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range0x00
to0xff
the result will be in the range0x0000
to0xffff
. So if you mask with0xff00ff
and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by0xff0000
and the bottom 16 bits of the result will contain the sums of the bits masked by0x0000ff
. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
– Peter Taylor
Nov 1 at 7:53
|
show 4 more comments
up vote
4
down vote
Five ideas, from easy to hard:
You can simplify your x/y loop to run in one dimension instead of two--
for ( i = 0; i < y * x * c; i += 4 )
. Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use
REPE CMPSD
(yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
MOV EDX, Scan0 ;EDX is our data pointer
SUB BX, BX ;Set BX to 0 for later
Loop:
LODSL ;Load EAX from array, increment pointer
SHRL 8, EAX ;Dump the least eight bits
ADDB GreenTotal, AL ;Add least 8 bits to green total
ADCW GreenTotal+1,BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB BlueTotal, AL ;Add least 8 bits to blue total
ADCW BlueTotal+1, BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB RedTotal, AL ;Add least 8 bits to red total
ADCW RedTotal+1, BX ;Carry the 1
LOOPNZ Loop ;Decrement CX and keep going until it is zero
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of
if
conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.
2
I find it hard to believe that this would improve performance at all.
– Nic Hartley
Mar 14 '17 at 4:15
Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
– John Wu
Mar 14 '17 at 6:20
Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
– Nic Hartley
Mar 14 '17 at 15:22
Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
– Nic Hartley
Mar 14 '17 at 17:20
add a comment |
up vote
2
down vote
Validation
Because the method is public
you shouldn't assume that you get a valid non null
Bitmap
. You should add a null
check otherwise you are exposing implementation details of your method.
Naming
Based on the C# Naming guidelines methods should be named using PascalCase
casing. Method level variables should be named using camelCase
casing. Hence getDominantColor
->GetDominantColor
and IntPtr Scan0
->IntPtr scan0
.
Possible problems
You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.
A problem could come up if you use this method with some different bitmap.
- If the bitmap which is passed is of DIN A4 size with e.g 300dpi your
int totals
will overflow.
Performance
I would suggest to use pointer arithmetic instead of calculating each time the idx
value. I also would remove the most inner loop like @Zefick posted.
public System.Drawing.Color GetDominantColor(Bitmap bmp)
{
if (bmp == null)
{
throw new ArgumentNullException("bmp");
}
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);
int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;
int stride = srcData.Stride;
IntPtr scan0 = srcData.Scan0;
long totals = new long { 0, 0, 0 };
int width = bmp.Width * bytesPerPixel;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)scan0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x += bytesPerPixel)
{
totals[0] += p[x + 0];
totals[1] += p[x + 1];
totals[2] += p[x + 2];
}
p += stride;
}
}
long pixelCount = bmp.Width * height;
int avgB = Convert.ToInt32(totals[0] / pixelCount);
int avgG = Convert.ToInt32(totals[1] / pixelCount);
int avgR = Convert.ToInt32(totals[2] / pixelCount);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
Benchmarking with BechnmarkDotNet with x64 compiled yields
Yours: 17.5252 ms
EBrown's: 14.6109 ms
Mine: 8.4846 ms
Peter Taylor's: 4.6419 ms
Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap
add a comment |
up vote
1
down vote
At least you can unroll the most inner loop without losing quality following way:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int idx = (y * stride) + x * 4;
totals[color] += p[idx];
totals[color+1] += p[idx+1];
totals[color+2] += p[idx+2];
}
}
Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.
add a comment |
up vote
0
down vote
Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.
The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.
Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.
public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
{
var rectangle = new Rectangle(0, 0, source.Width, source.Height);
var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
var pixelCount = region.Width * region.Height;
var scanWidth = bitmapData.Stride / 4;
uint totals = { 0, 0, 0 };
int flushCounter = 0;
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (var y = region.Y; y < region.Y + region.Height; y++)
{
uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;
for (var x = region.X; x < region.X + region.Width; x++)
{
sumRR00BB += row[x] & 0xff00ff;
sum00GG00 += row[x] & 0x00ff00;
// Flush before overflow occurs.
if (flushCounter++ == 0xff)
{
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
sumRR00BB = 0;
sum00GG00 = 0;
flushCounter = 0;
}
}
}
// Flush left-over's.
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
// Average the totals.
totals[0] /= (uint)pixelCount;
totals[1] /= (uint)pixelCount;
totals[2] /= (uint)pixelCount;
source.UnlockBits(bitmapData);
return totals;
}
add a comment |
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
5
down vote
If we look at your for
loop very carefully we can see that you can eliminate two multiplications:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
You don't use y
except for y * stride
, and you don't use x
except for x * 4
, rewrite the for
loop and you can eliminate those entirely.
var heightLimit = height * stride;
var widthLimit = width * 4;
for (int y = 0; y < heightLimit; y += stride) {
for (int x = 0; x < widthLimit; x += 4) {
for (int color = 0; color < 3; color++) {
int idx = y + x + color;
totals[color] += p[idx];
}
}
}
By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)
Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).
With as frequently as you call this method, the next lines may be a potential for optimization as well:
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
Why do width * height
in all three calculations? Why not store var pixelCount = width * height;
then divide by pixelCount
? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.
You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.
add a comment |
up vote
5
down vote
If we look at your for
loop very carefully we can see that you can eliminate two multiplications:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
You don't use y
except for y * stride
, and you don't use x
except for x * 4
, rewrite the for
loop and you can eliminate those entirely.
var heightLimit = height * stride;
var widthLimit = width * 4;
for (int y = 0; y < heightLimit; y += stride) {
for (int x = 0; x < widthLimit; x += 4) {
for (int color = 0; color < 3; color++) {
int idx = y + x + color;
totals[color] += p[idx];
}
}
}
By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)
Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).
With as frequently as you call this method, the next lines may be a potential for optimization as well:
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
Why do width * height
in all three calculations? Why not store var pixelCount = width * height;
then divide by pixelCount
? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.
You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.
add a comment |
up vote
5
down vote
up vote
5
down vote
If we look at your for
loop very carefully we can see that you can eliminate two multiplications:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
You don't use y
except for y * stride
, and you don't use x
except for x * 4
, rewrite the for
loop and you can eliminate those entirely.
var heightLimit = height * stride;
var widthLimit = width * 4;
for (int y = 0; y < heightLimit; y += stride) {
for (int x = 0; x < widthLimit; x += 4) {
for (int color = 0; color < 3; color++) {
int idx = y + x + color;
totals[color] += p[idx];
}
}
}
By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)
Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).
With as frequently as you call this method, the next lines may be a potential for optimization as well:
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
Why do width * height
in all three calculations? Why not store var pixelCount = width * height;
then divide by pixelCount
? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.
You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.
If we look at your for
loop very carefully we can see that you can eliminate two multiplications:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
for (int color = 0; color < 3; color++) {
int idx = (y * stride) + x * 4 + color;
totals[color] += p[idx];
}
}
}
You don't use y
except for y * stride
, and you don't use x
except for x * 4
, rewrite the for
loop and you can eliminate those entirely.
var heightLimit = height * stride;
var widthLimit = width * 4;
for (int y = 0; y < heightLimit; y += stride) {
for (int x = 0; x < widthLimit; x += 4) {
for (int color = 0; color < 3; color++) {
int idx = y + x + color;
totals[color] += p[idx];
}
}
}
By removing all three multiplication operations, we effectively reduce the amount of work we do by a significant portion. (There are roughly 8 instructions in the original, and 6 in the reduced, so we've eliminated 25% of our work.)
Other than that, there really isn't much you can do. You could consider chunking the work, then only recalculate the regions that were re-drawn if that's reasonable, you could also chunk and thread it (this wouldn't reduce CPU usage but it would reduce time spend on this method).
With as frequently as you call this method, the next lines may be a potential for optimization as well:
int avgB = totals[0] / (width * height);
int avgG = totals[1] / (width * height);
int avgR = totals[2] / (width * height);
Why do width * height
in all three calculations? Why not store var pixelCount = width * height;
then divide by pixelCount
? Of course, division is still slow, but you're not using floating-point math so we can't use the reciprocal of it.
You could consider, as mentioned in a comment, using CUDA/OpenGL/GPU-level work. Basically, operating on the GPU itself instead of using the CPU to do what could be very efficient on the GPU. (It's built specifically for this type of processing.) There is at least one Stack Overflow question on running C# code on the GPU, it's not very easy or simple, but it can give you a lot of power.
edited May 23 '17 at 12:40
Community♦
1
1
answered Mar 14 '17 at 6:37
202_accepted
15.2k249130
15.2k249130
add a comment |
add a comment |
up vote
5
down vote
One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.
Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):
unsafe
{
uint* p = (uint*)(void*)Scan0;
uint pixelCount = width * height;
uint idx = 0;
while (idx < (pixelCount & ~0xff)) {
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (int j = 0; j < 0x100; j++) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
// And the final partial block of fewer than 0x100 pixels.
{
uint sumRR00BB = 0;
uint sum00GG00 = 0;
while (idx < pixelCount) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
}
1
You need to change tototals[0] += sumRR00BB & 0xffff;
andtotals[2] += sumRR00BB >> 16;
because the color components are storedBGR
instead ofRGB
.
– Heslacher
Mar 14 '17 at 11:45
This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
– Viezevingertjes
Oct 31 at 19:43
@Viezevingertjes, I don't understand your objection. What doesp[idx]
contain if not a pixel's value?
– Peter Taylor
Oct 31 at 19:57
p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
– Viezevingertjes
Oct 31 at 20:01
1
@Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range0x00
to0xff
the result will be in the range0x0000
to0xffff
. So if you mask with0xff00ff
and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by0xff0000
and the bottom 16 bits of the result will contain the sums of the bits masked by0x0000ff
. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
– Peter Taylor
Nov 1 at 7:53
|
show 4 more comments
up vote
5
down vote
One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.
Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):
unsafe
{
uint* p = (uint*)(void*)Scan0;
uint pixelCount = width * height;
uint idx = 0;
while (idx < (pixelCount & ~0xff)) {
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (int j = 0; j < 0x100; j++) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
// And the final partial block of fewer than 0x100 pixels.
{
uint sumRR00BB = 0;
uint sum00GG00 = 0;
while (idx < pixelCount) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
}
1
You need to change tototals[0] += sumRR00BB & 0xffff;
andtotals[2] += sumRR00BB >> 16;
because the color components are storedBGR
instead ofRGB
.
– Heslacher
Mar 14 '17 at 11:45
This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
– Viezevingertjes
Oct 31 at 19:43
@Viezevingertjes, I don't understand your objection. What doesp[idx]
contain if not a pixel's value?
– Peter Taylor
Oct 31 at 19:57
p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
– Viezevingertjes
Oct 31 at 20:01
1
@Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range0x00
to0xff
the result will be in the range0x0000
to0xffff
. So if you mask with0xff00ff
and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by0xff0000
and the bottom 16 bits of the result will contain the sums of the bits masked by0x0000ff
. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
– Peter Taylor
Nov 1 at 7:53
|
show 4 more comments
up vote
5
down vote
up vote
5
down vote
One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.
Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):
unsafe
{
uint* p = (uint*)(void*)Scan0;
uint pixelCount = width * height;
uint idx = 0;
while (idx < (pixelCount & ~0xff)) {
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (int j = 0; j < 0x100; j++) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
// And the final partial block of fewer than 0x100 pixels.
{
uint sumRR00BB = 0;
uint sum00GG00 = 0;
while (idx < pixelCount) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
}
One useful performance trick when dealing with low-level pixel manipulations like this is that it's often possible to process red and blue together using the 8 bits for green as a gap. Since here you're just adding them, you can add 256 blue values before they would overflow past green into red.
Taking into account John Wu's comment about the stride being irrelevant you can do (untested, and in particular might have endianness bugs; it's several years since I regularly wrote this kind of code, and that was in Java rather than C#):
unsafe
{
uint* p = (uint*)(void*)Scan0;
uint pixelCount = width * height;
uint idx = 0;
while (idx < (pixelCount & ~0xff)) {
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (int j = 0; j < 0x100; j++) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
// And the final partial block of fewer than 0x100 pixels.
{
uint sumRR00BB = 0;
uint sum00GG00 = 0;
while (idx < pixelCount) {
sumRR00BB += p[idx] & 0xff00ff;
sum00GG00 += p[idx] & 0x00ff00;
idx++;
}
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
}
}
answered Mar 14 '17 at 9:00
Peter Taylor
15.3k2657
15.3k2657
1
You need to change tototals[0] += sumRR00BB & 0xffff;
andtotals[2] += sumRR00BB >> 16;
because the color components are storedBGR
instead ofRGB
.
– Heslacher
Mar 14 '17 at 11:45
This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
– Viezevingertjes
Oct 31 at 19:43
@Viezevingertjes, I don't understand your objection. What doesp[idx]
contain if not a pixel's value?
– Peter Taylor
Oct 31 at 19:57
p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
– Viezevingertjes
Oct 31 at 20:01
1
@Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range0x00
to0xff
the result will be in the range0x0000
to0xffff
. So if you mask with0xff00ff
and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by0xff0000
and the bottom 16 bits of the result will contain the sums of the bits masked by0x0000ff
. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
– Peter Taylor
Nov 1 at 7:53
|
show 4 more comments
1
You need to change tototals[0] += sumRR00BB & 0xffff;
andtotals[2] += sumRR00BB >> 16;
because the color components are storedBGR
instead ofRGB
.
– Heslacher
Mar 14 '17 at 11:45
This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
– Viezevingertjes
Oct 31 at 19:43
@Viezevingertjes, I don't understand your objection. What doesp[idx]
contain if not a pixel's value?
– Peter Taylor
Oct 31 at 19:57
p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
– Viezevingertjes
Oct 31 at 20:01
1
@Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range0x00
to0xff
the result will be in the range0x0000
to0xffff
. So if you mask with0xff00ff
and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by0xff0000
and the bottom 16 bits of the result will contain the sums of the bits masked by0x0000ff
. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.
– Peter Taylor
Nov 1 at 7:53
1
1
You need to change to
totals[0] += sumRR00BB & 0xffff;
and totals[2] += sumRR00BB >> 16;
because the color components are stored BGR
instead of RGB
.– Heslacher
Mar 14 '17 at 11:45
You need to change to
totals[0] += sumRR00BB & 0xffff;
and totals[2] += sumRR00BB >> 16;
because the color components are stored BGR
instead of RGB
.– Heslacher
Mar 14 '17 at 11:45
This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
– Viezevingertjes
Oct 31 at 19:43
This makes no sense anyway right? 'p' would be the BGR addresses, so it makes no sense to apply bitmasks over the same p[idx] value unless i'm missing something? Maybe that's how java does it in memory, but surely C# doesnt return uint* in that case.
– Viezevingertjes
Oct 31 at 19:43
@Viezevingertjes, I don't understand your objection. What does
p[idx]
contain if not a pixel's value?– Peter Taylor
Oct 31 at 19:57
@Viezevingertjes, I don't understand your objection. What does
p[idx]
contain if not a pixel's value?– Peter Taylor
Oct 31 at 19:57
p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
– Viezevingertjes
Oct 31 at 20:01
p[idx] = 1 byte = 8 bits, 0-255 so R, G or B, not all three, why would those bitmasks work exactly? Not trying to be a dick, would be interested in how this theoretically would work but not seeying this work in C# as-is.
– Viezevingertjes
Oct 31 at 20:01
1
1
@Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range
0x00
to 0xff
the result will be in the range 0x0000
to 0xffff
. So if you mask with 0xff00ff
and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000
and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff
. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.– Peter Taylor
Nov 1 at 7:53
@Viezevingertjes, I tried to explain that in the first paragraph. If you sum 256 values in the range
0x00
to 0xff
the result will be in the range 0x0000
to 0xffff
. So if you mask with 0xff00ff
and add 256 values then the top 16 bits of the result will contain the sums of the bits masked by 0xff0000
and the bottom 16 bits of the result will contain the sums of the bits masked by 0x0000ff
. If that's a clearer explanation then I can edit it in, or feel free to do so yourself.– Peter Taylor
Nov 1 at 7:53
|
show 4 more comments
up vote
4
down vote
Five ideas, from easy to hard:
You can simplify your x/y loop to run in one dimension instead of two--
for ( i = 0; i < y * x * c; i += 4 )
. Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use
REPE CMPSD
(yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
MOV EDX, Scan0 ;EDX is our data pointer
SUB BX, BX ;Set BX to 0 for later
Loop:
LODSL ;Load EAX from array, increment pointer
SHRL 8, EAX ;Dump the least eight bits
ADDB GreenTotal, AL ;Add least 8 bits to green total
ADCW GreenTotal+1,BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB BlueTotal, AL ;Add least 8 bits to blue total
ADCW BlueTotal+1, BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB RedTotal, AL ;Add least 8 bits to red total
ADCW RedTotal+1, BX ;Carry the 1
LOOPNZ Loop ;Decrement CX and keep going until it is zero
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of
if
conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.
2
I find it hard to believe that this would improve performance at all.
– Nic Hartley
Mar 14 '17 at 4:15
Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
– John Wu
Mar 14 '17 at 6:20
Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
– Nic Hartley
Mar 14 '17 at 15:22
Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
– Nic Hartley
Mar 14 '17 at 17:20
add a comment |
up vote
4
down vote
Five ideas, from easy to hard:
You can simplify your x/y loop to run in one dimension instead of two--
for ( i = 0; i < y * x * c; i += 4 )
. Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use
REPE CMPSD
(yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
MOV EDX, Scan0 ;EDX is our data pointer
SUB BX, BX ;Set BX to 0 for later
Loop:
LODSL ;Load EAX from array, increment pointer
SHRL 8, EAX ;Dump the least eight bits
ADDB GreenTotal, AL ;Add least 8 bits to green total
ADCW GreenTotal+1,BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB BlueTotal, AL ;Add least 8 bits to blue total
ADCW BlueTotal+1, BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB RedTotal, AL ;Add least 8 bits to red total
ADCW RedTotal+1, BX ;Carry the 1
LOOPNZ Loop ;Decrement CX and keep going until it is zero
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of
if
conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.
2
I find it hard to believe that this would improve performance at all.
– Nic Hartley
Mar 14 '17 at 4:15
Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
– John Wu
Mar 14 '17 at 6:20
Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
– Nic Hartley
Mar 14 '17 at 15:22
Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
– Nic Hartley
Mar 14 '17 at 17:20
add a comment |
up vote
4
down vote
up vote
4
down vote
Five ideas, from easy to hard:
You can simplify your x/y loop to run in one dimension instead of two--
for ( i = 0; i < y * x * c; i += 4 )
. Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use
REPE CMPSD
(yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
MOV EDX, Scan0 ;EDX is our data pointer
SUB BX, BX ;Set BX to 0 for later
Loop:
LODSL ;Load EAX from array, increment pointer
SHRL 8, EAX ;Dump the least eight bits
ADDB GreenTotal, AL ;Add least 8 bits to green total
ADCW GreenTotal+1,BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB BlueTotal, AL ;Add least 8 bits to blue total
ADCW BlueTotal+1, BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB RedTotal, AL ;Add least 8 bits to red total
ADCW RedTotal+1, BX ;Carry the 1
LOOPNZ Loop ;Decrement CX and keep going until it is zero
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of
if
conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.
Five ideas, from easy to hard:
You can simplify your x/y loop to run in one dimension instead of two--
for ( i = 0; i < y * x * c; i += 4 )
. Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use
REPE CMPSD
(yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
MOV EDX, Scan0 ;EDX is our data pointer
SUB BX, BX ;Set BX to 0 for later
Loop:
LODSL ;Load EAX from array, increment pointer
SHRL 8, EAX ;Dump the least eight bits
ADDB GreenTotal, AL ;Add least 8 bits to green total
ADCW GreenTotal+1,BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB BlueTotal, AL ;Add least 8 bits to blue total
ADCW BlueTotal+1, BX ;Carry the 1
SHRL 8, EAX ;Shift right 8 more bits
ADDB RedTotal, AL ;Add least 8 bits to red total
ADCW RedTotal+1, BX ;Carry the 1
LOOPNZ Loop ;Decrement CX and keep going until it is zero
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of
if
conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.
edited May 23 '17 at 12:40
Community♦
1
1
answered Mar 14 '17 at 3:09
John Wu
54136
54136
2
I find it hard to believe that this would improve performance at all.
– Nic Hartley
Mar 14 '17 at 4:15
Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
– John Wu
Mar 14 '17 at 6:20
Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
– Nic Hartley
Mar 14 '17 at 15:22
Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
– Nic Hartley
Mar 14 '17 at 17:20
add a comment |
2
I find it hard to believe that this would improve performance at all.
– Nic Hartley
Mar 14 '17 at 4:15
Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
– John Wu
Mar 14 '17 at 6:20
Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
– Nic Hartley
Mar 14 '17 at 15:22
Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
– Nic Hartley
Mar 14 '17 at 17:20
2
2
I find it hard to believe that this would improve performance at all.
– Nic Hartley
Mar 14 '17 at 4:15
I find it hard to believe that this would improve performance at all.
– Nic Hartley
Mar 14 '17 at 4:15
Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
– John Wu
Mar 14 '17 at 6:20
Maybe, maybe not. Depends how you do it. If you render+transform the bitmap onto a framebuffer (that you never actually display), the operation may take advantage of GPU hardware acceleration-- that could actually be blazingly fast.
– John Wu
Mar 14 '17 at 6:20
Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
– Nic Hartley
Mar 14 '17 at 15:22
Oh, for context, the answer has been edited since I replied to it -- when I replied, it was just the point about resizing.
– Nic Hartley
Mar 14 '17 at 15:22
Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
– Nic Hartley
Mar 14 '17 at 17:20
Haha, true. Anyway, most of this conversation is irrelevant; I've purged all but my first two comments. I'd recommend editing the think you commented into your answer as a potential resource for how to link C++ to C#; it's useful to speed things up.
– Nic Hartley
Mar 14 '17 at 17:20
add a comment |
up vote
2
down vote
Validation
Because the method is public
you shouldn't assume that you get a valid non null
Bitmap
. You should add a null
check otherwise you are exposing implementation details of your method.
Naming
Based on the C# Naming guidelines methods should be named using PascalCase
casing. Method level variables should be named using camelCase
casing. Hence getDominantColor
->GetDominantColor
and IntPtr Scan0
->IntPtr scan0
.
Possible problems
You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.
A problem could come up if you use this method with some different bitmap.
- If the bitmap which is passed is of DIN A4 size with e.g 300dpi your
int totals
will overflow.
Performance
I would suggest to use pointer arithmetic instead of calculating each time the idx
value. I also would remove the most inner loop like @Zefick posted.
public System.Drawing.Color GetDominantColor(Bitmap bmp)
{
if (bmp == null)
{
throw new ArgumentNullException("bmp");
}
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);
int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;
int stride = srcData.Stride;
IntPtr scan0 = srcData.Scan0;
long totals = new long { 0, 0, 0 };
int width = bmp.Width * bytesPerPixel;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)scan0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x += bytesPerPixel)
{
totals[0] += p[x + 0];
totals[1] += p[x + 1];
totals[2] += p[x + 2];
}
p += stride;
}
}
long pixelCount = bmp.Width * height;
int avgB = Convert.ToInt32(totals[0] / pixelCount);
int avgG = Convert.ToInt32(totals[1] / pixelCount);
int avgR = Convert.ToInt32(totals[2] / pixelCount);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
Benchmarking with BechnmarkDotNet with x64 compiled yields
Yours: 17.5252 ms
EBrown's: 14.6109 ms
Mine: 8.4846 ms
Peter Taylor's: 4.6419 ms
Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap
add a comment |
up vote
2
down vote
Validation
Because the method is public
you shouldn't assume that you get a valid non null
Bitmap
. You should add a null
check otherwise you are exposing implementation details of your method.
Naming
Based on the C# Naming guidelines methods should be named using PascalCase
casing. Method level variables should be named using camelCase
casing. Hence getDominantColor
->GetDominantColor
and IntPtr Scan0
->IntPtr scan0
.
Possible problems
You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.
A problem could come up if you use this method with some different bitmap.
- If the bitmap which is passed is of DIN A4 size with e.g 300dpi your
int totals
will overflow.
Performance
I would suggest to use pointer arithmetic instead of calculating each time the idx
value. I also would remove the most inner loop like @Zefick posted.
public System.Drawing.Color GetDominantColor(Bitmap bmp)
{
if (bmp == null)
{
throw new ArgumentNullException("bmp");
}
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);
int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;
int stride = srcData.Stride;
IntPtr scan0 = srcData.Scan0;
long totals = new long { 0, 0, 0 };
int width = bmp.Width * bytesPerPixel;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)scan0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x += bytesPerPixel)
{
totals[0] += p[x + 0];
totals[1] += p[x + 1];
totals[2] += p[x + 2];
}
p += stride;
}
}
long pixelCount = bmp.Width * height;
int avgB = Convert.ToInt32(totals[0] / pixelCount);
int avgG = Convert.ToInt32(totals[1] / pixelCount);
int avgR = Convert.ToInt32(totals[2] / pixelCount);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
Benchmarking with BechnmarkDotNet with x64 compiled yields
Yours: 17.5252 ms
EBrown's: 14.6109 ms
Mine: 8.4846 ms
Peter Taylor's: 4.6419 ms
Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap
add a comment |
up vote
2
down vote
up vote
2
down vote
Validation
Because the method is public
you shouldn't assume that you get a valid non null
Bitmap
. You should add a null
check otherwise you are exposing implementation details of your method.
Naming
Based on the C# Naming guidelines methods should be named using PascalCase
casing. Method level variables should be named using camelCase
casing. Hence getDominantColor
->GetDominantColor
and IntPtr Scan0
->IntPtr scan0
.
Possible problems
You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.
A problem could come up if you use this method with some different bitmap.
- If the bitmap which is passed is of DIN A4 size with e.g 300dpi your
int totals
will overflow.
Performance
I would suggest to use pointer arithmetic instead of calculating each time the idx
value. I also would remove the most inner loop like @Zefick posted.
public System.Drawing.Color GetDominantColor(Bitmap bmp)
{
if (bmp == null)
{
throw new ArgumentNullException("bmp");
}
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);
int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;
int stride = srcData.Stride;
IntPtr scan0 = srcData.Scan0;
long totals = new long { 0, 0, 0 };
int width = bmp.Width * bytesPerPixel;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)scan0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x += bytesPerPixel)
{
totals[0] += p[x + 0];
totals[1] += p[x + 1];
totals[2] += p[x + 2];
}
p += stride;
}
}
long pixelCount = bmp.Width * height;
int avgB = Convert.ToInt32(totals[0] / pixelCount);
int avgG = Convert.ToInt32(totals[1] / pixelCount);
int avgR = Convert.ToInt32(totals[2] / pixelCount);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
Benchmarking with BechnmarkDotNet with x64 compiled yields
Yours: 17.5252 ms
EBrown's: 14.6109 ms
Mine: 8.4846 ms
Peter Taylor's: 4.6419 ms
Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap
Validation
Because the method is public
you shouldn't assume that you get a valid non null
Bitmap
. You should add a null
check otherwise you are exposing implementation details of your method.
Naming
Based on the C# Naming guidelines methods should be named using PascalCase
casing. Method level variables should be named using camelCase
casing. Hence getDominantColor
->GetDominantColor
and IntPtr Scan0
->IntPtr scan0
.
Possible problems
You state in your question that this method is used to get the dominant color of your desktop. If you use it only for that, then all will be good.
A problem could come up if you use this method with some different bitmap.
- If the bitmap which is passed is of DIN A4 size with e.g 300dpi your
int totals
will overflow.
Performance
I would suggest to use pointer arithmetic instead of calculating each time the idx
value. I also would remove the most inner loop like @Zefick posted.
public System.Drawing.Color GetDominantColor(Bitmap bmp)
{
if (bmp == null)
{
throw new ArgumentNullException("bmp");
}
BitmapData srcData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadOnly, bmp.PixelFormat);
int bytesPerPixel = Image.GetPixelFormatSize(srcData.PixelFormat) / 8;
int stride = srcData.Stride;
IntPtr scan0 = srcData.Scan0;
long totals = new long { 0, 0, 0 };
int width = bmp.Width * bytesPerPixel;
int height = bmp.Height;
unsafe
{
byte* p = (byte*)(void*)scan0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x += bytesPerPixel)
{
totals[0] += p[x + 0];
totals[1] += p[x + 1];
totals[2] += p[x + 2];
}
p += stride;
}
}
long pixelCount = bmp.Width * height;
int avgB = Convert.ToInt32(totals[0] / pixelCount);
int avgG = Convert.ToInt32(totals[1] / pixelCount);
int avgR = Convert.ToInt32(totals[2] / pixelCount);
bmp.UnlockBits(srcData);
return Color.FromArgb(avgR, avgG, avgB);
}
Benchmarking with BechnmarkDotNet with x64 compiled yields
Yours: 17.5252 ms
EBrown's: 14.6109 ms
Mine: 8.4846 ms
Peter Taylor's: 4.6419 ms
Until @PeterTylor doesn't change its code please see my comment: Getting the dominant RGB color of a bitmap
edited Apr 13 '17 at 12:40
Community♦
1
1
answered Mar 14 '17 at 13:25
Heslacher
44.3k460153
44.3k460153
add a comment |
add a comment |
up vote
1
down vote
At least you can unroll the most inner loop without losing quality following way:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int idx = (y * stride) + x * 4;
totals[color] += p[idx];
totals[color+1] += p[idx+1];
totals[color+2] += p[idx+2];
}
}
Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.
add a comment |
up vote
1
down vote
At least you can unroll the most inner loop without losing quality following way:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int idx = (y * stride) + x * 4;
totals[color] += p[idx];
totals[color+1] += p[idx+1];
totals[color+2] += p[idx+2];
}
}
Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.
add a comment |
up vote
1
down vote
up vote
1
down vote
At least you can unroll the most inner loop without losing quality following way:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int idx = (y * stride) + x * 4;
totals[color] += p[idx];
totals[color+1] += p[idx+1];
totals[color+2] += p[idx+2];
}
}
Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.
At least you can unroll the most inner loop without losing quality following way:
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int idx = (y * stride) + x * 4;
totals[color] += p[idx];
totals[color+1] += p[idx+1];
totals[color+2] += p[idx+2];
}
}
Potentially, the compiler can do this optimization itself, but I'm not sure that it does this inside an "unsafe" block.
edited Mar 14 '17 at 5:31
answered Mar 14 '17 at 4:53
Zefick
29626
29626
add a comment |
add a comment |
up vote
0
down vote
Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.
The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.
Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.
public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
{
var rectangle = new Rectangle(0, 0, source.Width, source.Height);
var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
var pixelCount = region.Width * region.Height;
var scanWidth = bitmapData.Stride / 4;
uint totals = { 0, 0, 0 };
int flushCounter = 0;
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (var y = region.Y; y < region.Y + region.Height; y++)
{
uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;
for (var x = region.X; x < region.X + region.Width; x++)
{
sumRR00BB += row[x] & 0xff00ff;
sum00GG00 += row[x] & 0x00ff00;
// Flush before overflow occurs.
if (flushCounter++ == 0xff)
{
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
sumRR00BB = 0;
sum00GG00 = 0;
flushCounter = 0;
}
}
}
// Flush left-over's.
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
// Average the totals.
totals[0] /= (uint)pixelCount;
totals[1] /= (uint)pixelCount;
totals[2] /= (uint)pixelCount;
source.UnlockBits(bitmapData);
return totals;
}
add a comment |
up vote
0
down vote
Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.
The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.
Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.
public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
{
var rectangle = new Rectangle(0, 0, source.Width, source.Height);
var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
var pixelCount = region.Width * region.Height;
var scanWidth = bitmapData.Stride / 4;
uint totals = { 0, 0, 0 };
int flushCounter = 0;
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (var y = region.Y; y < region.Y + region.Height; y++)
{
uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;
for (var x = region.X; x < region.X + region.Width; x++)
{
sumRR00BB += row[x] & 0xff00ff;
sum00GG00 += row[x] & 0x00ff00;
// Flush before overflow occurs.
if (flushCounter++ == 0xff)
{
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
sumRR00BB = 0;
sum00GG00 = 0;
flushCounter = 0;
}
}
}
// Flush left-over's.
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
// Average the totals.
totals[0] /= (uint)pixelCount;
totals[1] /= (uint)pixelCount;
totals[2] /= (uint)pixelCount;
source.UnlockBits(bitmapData);
return totals;
}
add a comment |
up vote
0
down vote
up vote
0
down vote
Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.
The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.
Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.
public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
{
var rectangle = new Rectangle(0, 0, source.Width, source.Height);
var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
var pixelCount = region.Width * region.Height;
var scanWidth = bitmapData.Stride / 4;
uint totals = { 0, 0, 0 };
int flushCounter = 0;
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (var y = region.Y; y < region.Y + region.Height; y++)
{
uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;
for (var x = region.X; x < region.X + region.Width; x++)
{
sumRR00BB += row[x] & 0xff00ff;
sum00GG00 += row[x] & 0x00ff00;
// Flush before overflow occurs.
if (flushCounter++ == 0xff)
{
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
sumRR00BB = 0;
sum00GG00 = 0;
flushCounter = 0;
}
}
}
// Flush left-over's.
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
// Average the totals.
totals[0] /= (uint)pixelCount;
totals[1] /= (uint)pixelCount;
totals[2] /= (uint)pixelCount;
source.UnlockBits(bitmapData);
return totals;
}
Currently doing real-time processing where every frame counts, with a modified version of Peter Taylor's answer. Tested on a weak Raspberry PI it still resulted in 24 out of 25 frames per second. Which is amazing performance knowing what the specifications are of it and running it through Mono.
The method itself was hard to understand at first, but maybe this way of using it might clear a few things up.
Please note that this only works with 32bpp images due to the use of the Unsigned 32-bit integer to go through the memory.
public unsafe uint GetAverageColorInRegion(Bitmap source, Rectangle region)
{
var rectangle = new Rectangle(0, 0, source.Width, source.Height);
var bitmapData = source.LockBits(rectangle, ImageLockMode.ReadOnly, source.PixelFormat);
var pixelCount = region.Width * region.Height;
var scanWidth = bitmapData.Stride / 4;
uint totals = { 0, 0, 0 };
int flushCounter = 0;
uint sumRR00BB = 0;
uint sum00GG00 = 0;
for (var y = region.Y; y < region.Y + region.Height; y++)
{
uint* row = (uint*)bitmapData.Scan0 + y * scanWidth;
for (var x = region.X; x < region.X + region.Width; x++)
{
sumRR00BB += row[x] & 0xff00ff;
sum00GG00 += row[x] & 0x00ff00;
// Flush before overflow occurs.
if (flushCounter++ == 0xff)
{
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
sumRR00BB = 0;
sum00GG00 = 0;
flushCounter = 0;
}
}
}
// Flush left-over's.
totals[0] += sumRR00BB >> 16;
totals[1] += sum00GG00 >> 8;
totals[2] += sumRR00BB & 0xffff;
// Average the totals.
totals[0] /= (uint)pixelCount;
totals[1] /= (uint)pixelCount;
totals[2] /= (uint)pixelCount;
source.UnlockBits(bitmapData);
return totals;
}
answered 4 hours ago
Viezevingertjes
1375
1375
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f157667%2fgetting-the-dominant-rgb-color-of-a-bitmap%23new-answer', 'question_page');
}
);
Post as a guest
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
The indentation looks very off - does it look exactly like that in your IDE?
– Mathieu Guindon
Mar 13 '17 at 18:56
Did you try to move independent computations outside the loop and not recalculating stuff?
– Tamoghna Chowdhury
Mar 14 '17 at 5:01
Consider Cuda/OpenGl. This is what GPUs are really for.
– vnp
Mar 14 '17 at 5:28