I wrote a simple loop:
int volatile value = 0;
void loop(int limit) {
for (int i = 0; i < limit; ++i) {
++value;
}
}
I compiled this with gcc and clang(-O3 -fno-unroll-loops
) and got different outputs. They differ in ++value
part:
clang:
add dword ptr [rip + value], 1 # ++value
add edi, -1 # --limit
jne .LBB0_1 # if limit > 0 then continue looping
gcc:
mov eax, DWORD PTR value[rip] # copy value to a register
add edx, 1 # ++i
add eax, 1 # increment a copy of value
mov DWORD PTR value[rip], eax # store incremented copy to value, i. e. ++value
cmp edi, edx # compare i < limit
jne .L3 # if i < limit then continue looping
C and C++ versions are same on each compiler(https://gcc.godbolt.org/z/5x5jGP) So my questions are:
1) Is gcc doing something wrong? What is the point of copying the value
?
2) I have benchmarked that code and for some reason the profiler shows that in gcc's version 73% of time is wasted on instruction add edx, 1
, 13% on mov DWORD PTR value[rip], eax
and 13% on cmp edi, edx
. Am I interpreting this results wrong? Why other addition and move instructions take less than 1% of the time?
3) Why can performance differ on gcc/clang in such a primitive code?