By actually *rolling* the loading code: the four explicit loads become
a loop. I haven't looked at the assembly, but I suspect the loop is
easier for the compiler to vectorise.
This results in a 5% speed increase on my machine (Intel i5 Skylake
laptop, gcc 7.3.0).
This fix was made possible by @Sadoon-AlBader on GitHub, who submitted
pull request #118.
     // Process the message block by block
     size_t nb_blocks = message_size >> 4;
     FOR (i, 0, nb_blocks) {
-        ctx->c[0] = load32_le(message +  0);
-        ctx->c[1] = load32_le(message +  4);
-        ctx->c[2] = load32_le(message +  8);
-        ctx->c[3] = load32_le(message + 12);
+        FOR (i, 0, 4) {
+            ctx->c[i] = load32_le(message + i*4);
+        }
         poly_block(ctx);
         message += 16;
     }
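
For anyone who wants to convince themselves that the change above is
behaviour-preserving, here is a minimal standalone sketch. It is not the
library's code: load32_le is re-implemented here under the assumption
that it reads 4 bytes as a little-endian word, and load_block_unrolled,
load_block_rolled and check_loads_agree are hypothetical names made up
for this illustration. It says nothing about speed, only that both
forms load the same four words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Assumed helper: read 4 bytes as one little-endian 32-bit word.
static uint32_t load32_le(const uint8_t s[4])
{
    return  (uint32_t)s[0]
         | ((uint32_t)s[1] <<  8)
         | ((uint32_t)s[2] << 16)
         | ((uint32_t)s[3] << 24);
}

// Before: four explicit loads (hypothetical stand-in for the old code).
static void load_block_unrolled(uint32_t c[4], const uint8_t block[16])
{
    c[0] = load32_le(block +  0);
    c[1] = load32_le(block +  4);
    c[2] = load32_le(block +  8);
    c[3] = load32_le(block + 12);
}

// After: the same loads, rolled into a loop.
static void load_block_rolled(uint32_t c[4], const uint8_t block[16])
{
    for (size_t i = 0; i < 4; i++) {
        c[i] = load32_le(block + i*4);
    }
}

// Sanity check: both versions must produce the same four words.
static void check_loads_agree(const uint8_t block[16])
{
    uint32_t a[4];
    uint32_t b[4];
    load_block_unrolled(a, block);
    load_block_rolled(b, block);
    for (size_t i = 0; i < 4; i++) {
        assert(a[i] == b[i]);
    }
}
```

Calling check_loads_agree on any 16-byte block should pass silently.
Whether the rolled form is actually faster depends on the compiler and
the target, so it is worth measuring on your own machine.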