Why does GCC not optimize this set of branching and conditionals as much as it could?

https://www.devze.com 2023-04-11 19:12 Source: web
The following three pieces of code achieve exactly the same effect. Yet, when compiled with -O3 on GCC 4.5.2, the times for a large number of iterations vary quite markedly.

1 - Normal branching, using multiple conditions, best time 1.0:

// a, b, c, d are set to random values 0-255 before each iteration.
if (a < 16 or b < 32 or c < 64 or d < 128) result += a+b+c+d;

2 - Branching, manually using bitwise or to check conditions, best time 0.92:

if (a < 16 | b < 32 | c < 64 | d < 128) result += a+b+c+d;

3 - Finally, getting the same result without a branch, best time 0.85:

result += (a+b+c+d) * (a < 16 | b < 32 | c < 64 | d < 128);

The above times are the best for each method when run as the inner loop of a benchmark program I made. random() is seeded the same way before each run.

Before I made this benchmark I assumed GCC would optimize away the differences. Especially the 2nd example makes me scratch my head. Can anyone explain why GCC doesn't turn code like this into equivalent faster code?

EDIT: Fixed some errors, and also made it clear that the random numbers are created regardless, and used, so as to not be optimized away. They always were in the original benchmark, I just botched the code I put on here.

Here is an example of an actual benchmark function:

boost::random::mt19937 rng;
boost::random::uniform_int_distribution<> ranchar(0, 255);

double quadruple_or(uint64_t runs) {
  uint64_t result = 0;
  rng.seed(0);

  boost::chrono::high_resolution_clock::time_point start = 
    boost::chrono::high_resolution_clock::now();
  for (; runs; runs--) {
    int a = ranchar(rng);
    int b = ranchar(rng);
    int c = ranchar(rng);
    int d = ranchar(rng);
    if (a < 16 or b < 32 or c < 64 or d < 128) result += a;
    if (d > 16 or c > 32 or b > 64 or a > 128) result += b;
    if (a < 96 or b < 53 or c < 199 or d < 177) result += c;
    if (d > 66 or c > 35 or b > 99 or a > 77) result += d;
  }

  // Force gcc to not optimize away result.
  std::cout << "Result check " << result << std::endl;
  boost::chrono::duration<double> sec = 
    boost::chrono::high_resolution_clock::now() - start;
  return sec.count();
}

The full benchmark can be found here.


The OP has changed a bit since my original answer. Let me try to revisit here.

In case 1, because of or's short-circuiting, I would expect the compiler to generate four compare-then-branch sections. Branches can be pretty expensive, especially if they don't go the predicted path.

In case 2, the compiler can decide to do all four comparisons, convert them to bool 0/1 results, bitwise or all four pieces together, and then do a single (additional) branch. This trades possibly more comparisons for possibly fewer branches. It appears that reducing the number of branches does improve performance.

In case 3, things work pretty much the same as 2 except at the very end one more branch may be eliminated by explicitly telling the compiler "I know the result will be zero or one, so just multiply the thing on the left by that value". The multiply apparently comes out faster than the corresponding branch on your hardware. This is in contrast to the second example where the compiler doesn't know the range of possible outputs from the bitwise or so it has to assume it could be any integer and must do a compare-and-jump instead.

Original answer for history: The first case is functionally different from the second and third if random has side effects (which a normal PRNG would), so it stands to reason that the compiler may optimize them differently. Specifically, the first case will only call random as many times as needed to pass the check while in the other two cases random will always be called four times. This will (assuming random really is stateful) result in the future random numbers being different.

The difference between the second and third is because the compiler probably can't figure out for some reason that the result of the bitwise or will always be 0 or 1. When you give it a hint to do the multiplication instead of branching the multiplication probably comes out faster due to pipelining.


With logical operators the code will branch and early-out. Bitwise operators always do the full job.

Branch prediction will be worse in the first case, although it will outperform the bitwise case for larger examples.

It cannot optimise away random() because that function is not pure: each call has the side effect of advancing the generator's state.


You can always try optimizing away the branch and multiply. Instead of:

if (test) result+= blah;

or

result+= blah*(test);

You can do:

result+= blah&(-(test));

If test is false, -false==0 and (blah&0)==0. If test is true, -true==~0 and (blah&~0)==blah. You may have to fiddle with test as !!test to ensure true==1.


On my machine (Intel E5503) with gcc 4.5.3, I find that version 1 is generally the fastest, though the difference is well within the measurement noise (f3 is the slowest, though only about 2% slower than f1).

How are you measuring your timings? You might find that the differences you're seeing are due more to that than the actual difference in the code produced.
