This modulo operation is implemented in software on many 32-bit
processors, such as the Cortex-M3. This causes the generated binary to
depend on a standard library routine that is often not present on such
small machines. This hurts portability and convenience.
Thankfully, this particular modulo is not needed, and can be replaced by
a simple test and subtraction. This is not constant time, but we don't
care: the index we are computing does not depend on any secret, so
variable timing won't expose anything.
Performance seems to improve very slightly on x86-64. 32-bit machines
may benefit more.
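
As a rough illustration of the test-and-subtraction idea, here is a
minimal sketch. The name wrap_index and the 32-bit types are
hypothetical and not taken from the actual code. It also assumes the
index is already known to be smaller than twice the modulus, which is
what makes a single subtraction sufficient.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Reduces index into [0, size) without using the % operator.
 * Assumption: index < 2 * size, so one subtraction is enough.
 * Not constant time: the branch depends on index, which is fine
 * only because index does not depend on any secret. */
static uint32_t wrap_index(uint32_t index, uint32_t size)
{
    if (index >= size) {
        index -= size;
    }
    return index;
}
```

Under the stated assumption this returns the same value as
index % size, without pulling in a software division routine.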