Fixes #156
This modulo operation is implemented in software on many 32-bit
processors, such as the Cortex-M3. This causes the generated binary to
depend on a standard library routine that is often not present on such
small machines. This hurts portability and convenience.
Thankfully, this particular modulo is not needed: its operand is always
less than twice the modulus, so it can be replaced by a simple test and
subtraction. This is not constant time, but we don't care: the index we
are computing does not depend on any secret, so variable timing won't
expose anything.
Performance seems to increase very slightly on x86-64. 32-bit machines
may benefit more.
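
As a rough illustration of why the test and subtraction can stand in for
the modulo, here is a minimal standalone sketch. The names reduce_once
and reduce_once_matches_modulo are hypothetical and not part of the
patch. It assumes the precondition stated in the patch comment, namely
that the value to reduce is less than twice the modulus.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper (not in the patch): reduces ref modulo n.
// Assumes n > 0 and ref < 2 * n; the caller must guarantee this.
static uint64_t reduce_once(uint64_t ref, uint64_t n)
{
    assert(n > 0 && ref < 2 * n); // precondition check
    // ref is in [0, 2n), so at most one subtraction is needed
    return ref < n ? ref : ref - n;
}

// Hypothetical check: compares reduce_once against % for every
// modulus n in [1, max_n] and every ref in [0, 2n).
// Returns 1 if all results agree, 0 otherwise.
static int reduce_once_matches_modulo(uint64_t max_n)
{
    for (uint64_t n = 1; n <= max_n; n++) {
        for (uint64_t ref = 0; ref < 2 * n; ref++) {
            if (reduce_once(ref, n) != ref % n) {
                return 0;
            }
        }
    }
    return 1;
}
```

The branch makes the timing depend on ref, which is acceptable here only
because ref is not derived from any secret.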
u32 start_pos = first_pass ? 0 : next_slice;
// Generate offset from J1 (no need for J2, there's only one lane)
- u64 j1 = ctx->b.a[index] & 0xffffffff; // pseudo-random number
- u64 x = (j1 * j1) >> 32;
- u64 y = (area_size * x) >> 32;
- u64 z = (area_size - 1) - y;
- return (start_pos + z) % ctx->nb_blocks;
+ u64 j1 = ctx->b.a[index] & 0xffffffff; // pseudo-random number
+ u64 x = (j1 * j1) >> 32;
+ u64 y = (area_size * x) >> 32;
+ u64 z = (area_size - 1) - y;
+ u64 ref = start_pos + z; // ref < 2 * nb_blocks
+ return ref < ctx->nb_blocks ? ref : ref - ctx->nb_blocks;
}
// Main algorithm