
Making C code run faster

https://www.devze.com 2023-03-16 10:08 (source: web)
I have written a piece of code which is used for counting the frequency of numbers between 0 and 255.


unsigned char arr[4096]; //aligned 64 bytes, filled with random characters

short counter[256]; //aligned 32 bytes

register int i;

for(i = 0; i < 4096; i++)
    ++counter[arr[i]];

It takes a lot of time to execute; the random access to the counter array is very expensive.

Does anyone have any ideas which I could use to either make the accesses sequential or any other approach I could use?


What makes you think that random access to the counter array is expensive? Have you profiled? Try Valgrind, which has a cache profiling tool called "cachegrind". Profiling also lets you know if the code is actually slow or if you just think it is slow because it ought to be.
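A typical cachegrind session looks like this (the source and binary names here are placeholders, not from the original post):

```shell
# compile with debug symbols so cachegrind can attribute misses to source lines
cc -O2 -g histogram.c -o histogram
valgrind --tool=cachegrind ./histogram
# annotate the cachegrind.out.<pid> file it leaves behind (pid is an example)
cg_annotate cachegrind.out.12345
```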

This is a very simple piece of code, and before optimizing it is important to know whether it is memory bound or not (with respect to the input data, not the histogram table). I can't answer that off the top of my head. Try comparing against a simple algorithm which just sums the entire input: if both run at about the same speed, then your algorithm is memory bound and you are done.
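A minimal sketch of that baseline, assuming an input buffer like the one in the question (the function name is illustrative): it streams the input once and does nothing else, so its speed is an upper bound set by memory bandwidth.

```c
#include <stddef.h>

/* Sum every input byte. If the histogram loop runs no faster than
   this, the histogram loop is already memory bound. */
unsigned long sum_bytes(const unsigned char *arr, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}
```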

My best guess is that the main issue which could slow you down is this:

   Registers                      RAM
1.  <-- read data[i] ---------------
2.  <-- read histogram[data[i]] ----
3. increment
4.  --- write histogram[data[i]] -->
5.  <-- read data[i] ---------------
6.  <-- read histogram[data[i]] ----

The compiler and processor are not allowed to reorder most of the instructions here (except #1 and #5, which can be done ahead of time) so you are basically going to be limited by whichever is smaller: the bandwidth of your L1 cache (which is where the histogram is) and the bandwidth of your main RAM, each multiplied by some unknown constant factor. (Note: the compiler can only move #1/5 around if it unrolls the loop, but the processor might be able to move it around anyway.)

Which is why you profile before you try to get clever -- because if your L1 cache has enough bandwidth, then you will always be starving for data and there is nothing you can do about it.
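If profiling does show the read-increment-write chain on the histogram is the limit, one common trick (my suggestion, not from the original post) is to keep several partial histograms so that consecutive increments rarely touch the same counter, then merge them at the end:

```c
#include <stddef.h>
#include <string.h>

/* Four interleaved sub-histograms: increments in one iteration go to
   four independent tables, shortening the load-increment-store
   dependency chain, at the cost of more L1 footprint. */
void histogram4(const unsigned char *arr, size_t n, unsigned counter[256])
{
    unsigned tmp[4][256];
    memset(tmp, 0, sizeof tmp);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        ++tmp[0][arr[i + 0]];
        ++tmp[1][arr[i + 1]];
        ++tmp[2][arr[i + 2]];
        ++tmp[3][arr[i + 3]];
    }
    for (; i < n; i++)          /* leftover bytes */
        ++tmp[0][arr[i]];

    for (int v = 0; v < 256; v++)
        counter[v] = tmp[0][v] + tmp[1][v] + tmp[2][v] + tmp[3][v];
}
```

Whether this wins depends on the machine; it only helps if the single-table version was stalling on the dependency chain rather than on memory bandwidth.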

Footnote:

This code:

register int i;
for(i = 0; i < 4096; i++)
    ++counter[arr[i]];

Generates the same assembly as this code:

int i;
for(i = 0; i < 4096; i++)
    counter[arr[i]]++;

But the second version is easier to read.


More idiomatic:

// make sure you actually fill this with random chars:
// at file scope it is zero-initialized (which makes for a boring
// histogram); inside a function it would hold stack garbage
unsigned char arr[4096];
// no count can exceed the number of input bytes, and an array can't
// have more bytes than fit in memory, so size_t can never overflow
// for this usage
size_t counter[256] = {0};

for(size_t i = 0; i < sizeof(arr)/sizeof(*arr); ++i)
    ++counter[arr[i]];

Now the key is to compile with C99, and some serious optimization flags:

cc mycode.c -O3 -std=c99

Any optimization at all on a simple loop like this will make it extremely quick. Don't waste more time on making something this simple faster.


First, I completely agree with Dietrich: please prove (to yourself and us) where the real bottleneck lies, first.

The only possible improvement I can see is in your use of short. The size of the table will not be a problem here, I guess, but promotions and overflow will. Use a type that handles this by default, namely unsigned.

Anyhow, counters should always be unsigned (even better, size_t); that is the semantics of cardinalities. As an extra advantage, unsigned types don't overflow but wrap around in a controlled way, so the compiler doesn't have to use an additional instruction for that.

Also, arithmetic in C is done with a width of at least that of int, so with short the result has to be converted back after every increment.
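A minimal sketch of the suggested change (the function name is illustrative): with unsigned counters the increment needs no narrowing conversion back to short, and with 4096 inputs no counter can come anywhere near overflow anyway.

```c
#include <stddef.h>

/* counter must be zero-initialized by the caller */
void count_bytes(const unsigned char *arr, size_t n, unsigned counter[256])
{
    for (size_t i = 0; i < n; i++)
        ++counter[arr[i]];   /* unsigned: no conversion back to a narrow type */
}
```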


The code takes data of size 4k, adds every 3 consecutive bytes, and stores the result in a temporary buffer of size 4k. The temporary buffer is then used to generate the histogram.

Vectorization is possible for the addition of 3 consecutive bytes which I did using SIMD instructions.
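A portable scalar sketch of that pipeline (the poster used SIMD intrinsics for stage 1; this just shows the shape). That the 3-byte sums wrap to a byte is my assumption, made so the result still fits the original 256-entry histogram:

```c
#include <stddef.h>
#include <string.h>

/* stage 1: tmp[i] = data[i] + data[i+1] + data[i+2] (mod 256)
   stage 2: histogram of those sums                              */
void sum3_histogram(const unsigned char *data, size_t n,
                    unsigned char *tmp, unsigned counter[256])
{
    for (size_t i = 0; i + 2 < n; i++)
        tmp[i] = (unsigned char)(data[i] + data[i + 1] + data[i + 2]);

    memset(counter, 0, 256 * sizeof counter[0]);
    for (size_t i = 0; i + 2 < n; i++)
        ++counter[tmp[i]];
}
```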

As Dietrich suggested, if instead of generating the histogram I simply add up the values in the temporary buffer, it executes very fast. So generating the histogram is the part that takes time. I profiled the code with cachegrind; the output is:

==11845== 
==11845== I   refs:      212,171
==11845== I1  misses:        842
==11845== LLi misses:        827
==11845== I1  miss rate:    0.39%
==11845== LLi miss rate:    0.38%
==11845== 
==11845== D   refs:       69,179  (56,158 rd   + 13,021 wr)
==11845== D1  misses:      2,905  ( 2,289 rd   +    616 wr)
==11845== LLd misses:      2,470  ( 1,895 rd   +    575 wr)
==11845== D1  miss rate:     4.1% (   4.0%     +    4.7%  )
==11845== LLd miss rate:     3.5% (   3.3%     +    4.4%  )
==11845== 
==11845== LL refs:         3,747  ( 3,131 rd   +    616 wr)
==11845== LL misses:       3,297  ( 2,722 rd   +    575 wr)
==11845== LL miss rate:      1.1% (   1.0%     +    4.4%  )

and the complete output is:

I1 cache:         65536 B, 64 B, 2-way associative
D1 cache:         65536 B, 64 B, 2-way associative
LL cache:         1048576 B, 64 B, 16-way associative
Command:          ./a.out
Data file:        cachegrind.out.11845
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
     Ir I1mr ILmr     Dr  D1mr  DLmr     Dw D1mw DLmw 
--------------------------------------------------------------------------------
212,171  842  827 56,158 2,289 1,895 13,021  616  575  PROGRAM TOTALS

--------------------------------------------------------------------------------
    Ir I1mr ILmr     Dr  D1mr  DLmr     Dw D1mw DLmw  file:function
--------------------------------------------------------------------------------
97,335  651  642 26,648 1,295 1,030 10,883  517  479  ???:???
59,413   13   13 13,348   886   829     17    1    0  ???:_dl_addr
40,023    7    7 12,405    10     8    223   18   17  ???:core_get_signature
 5,123    2    2  1,277    64    19    256   64   64  ???:core_get_signature_parallel
 3,039   46   44    862     9     4    665    8    8  ???:vfprintf
 2,344   11   11    407     0     0    254    1    1  ???:_IO_file_xsputn
   887    7    7    234     0     0    134    1    0  ???:_IO_file_overflow
   720    9    7    250     5     2    150    0    0  ???:__printf_chk
   538    4    4    104     0     0    102    2    2  ???:__libc_memalign
   507    6    6    145     0     0    114    0    0  ???:_IO_do_write
   478    2    2     42     1     1      0    0    0  ???:strchrnul
   350    3    3     80     0     0     50    0    0  ???:_IO_file_write
   297    4    4     98     0     0     23    0    0  ???:_IO_default_xsputn


Well, Richard is certainly right. This is because the compiler has to convert the array subscript to pointer arithmetic, which takes some time, thus increasing execution time. For instance, try this:

for(i = 0; i < 4096; i++)
     ++*(counter+*(arr+i));


Consider using a pointer into arr, instead of indexing.

unsigned char *p = arr;
for (i = 4096-1; 0 <= i; --i)
  ++counter[*p++];
