Compilers aren't magical. They need help to generate the best code.
Here we want to compute the following expression:
mask = 0xffffffff;
2 * (a & mask) * (b & mask)
The most efficient way to do this looks like this:
u64 al = (u32)a; // Truncate
u64 bl = (u32)b; // Truncate
u64 x = al * bl; // 32->64 bits multiply
u64 x2 = x << 1; // shift (identifiers can't start with a digit)
return x2;
My compiler doesn't pick up on this, and performs a slower alternative
instead. Either the multiply by two uses an actual multiply instead of a
shift, or the shift is done first, forcing a more expensive 64->64
multiply. More naive compilers may even do both.
Whatever the cause, spelling it out by hand got me 5% faster code on GCC 11.3.
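
To make the comparison concrete, here is a minimal self-contained sketch of both forms. The typedefs and the function names (mul2_naive, mul2_explicit) are my own assumptions for illustration; only the bodies come from the snippets above. Both return the same value for every input, since unsigned 64-bit arithmetic wraps the same way in either order. The only thing that may differ is the code the compiler generates.

```c
#include <stdint.h>

/* Assumed typedefs: the snippets above use u32/u64 without defining them. */
typedef uint32_t u32;
typedef uint64_t u64;

/* Direct transcription of the expression. The compiler is free to choose
 * how to implement the masking, the doubling and the multiply. */
u64 mul2_naive(u64 a, u64 b)
{
    const u64 mask = 0xffffffff;
    return 2 * (a & mask) * (b & mask);
}

/* Explicit form: truncate both operands, do a single 32x32->64 multiply,
 * then double the product with a shift. */
u64 mul2_explicit(u64 a, u64 b)
{
    u64 al = (u32)a;  /* keep the low 32 bits */
    u64 bl = (u32)b;  /* keep the low 32 bits */
    u64 x  = al * bl; /* operands fit in 32 bits, product fits in 64 */
    return x << 1;    /* multiply by two; wraps modulo 2^64 like the naive form */
}
```

Compiling both with something like gcc -O2 -S and comparing the assembly is one way to see which form your compiler actually produces. Results will likely vary with compiler version and target.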