The multiplication chain used in those two functions is probably optimal,
but it is also kind of black magic, and takes quite a bit of code.
TweetNaCl has a much shorter, much easier to read, much slower addition
chain. I figured maybe a middle ground was possible.
Turns out it's difficult. I couldn't come up with a nice multiplication
chain on my own. But I did notice a relationship between 2^252 - 3 and
2^255 - 21 (the latter is p - 2 for p = 2^255 - 19, the exponent used
to invert): they start with the same bit pattern. More specifically:
2^255 - 21 = (2^252 - 3) * 8 + 3
I can use the same multiplication chain for both functions, and just
finish the job for the inversion.
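To make the idea concrete, here is a rough Python sketch of how the
inversion can be finished on top of the shared chain. It is not the
code from the patch: pow_2_252_3 and invert are hypothetical names, the
shared chain is replaced by Python's built-in pow() as a stand-in, and
the operation count of this sketch is not meant to match the patch.

```python
P = 2**255 - 19  # Curve25519 field prime


def pow_2_252_3(x):
    """Stand-in for the shared chain: x^(2^252 - 3) mod P.

    Hypothetical helper; the real code would use a hand-written
    multiplication chain instead of the built-in pow().
    """
    return pow(x, 2**252 - 3, P)


def invert(x):
    """Return x^(P - 2) mod P, finishing from x^(2^252 - 3)."""
    t = pow_2_252_3(x)          # x^(2^252 - 3)
    for _ in range(3):          # three squarings multiply the exponent by 8
        t = t * t % P           # ends at x^(2^255 - 24)
    x3 = x * x % P * x % P      # x^3
    return t * x3 % P           # x^(2^255 - 24 + 3) = x^(2^255 - 21)
```

By Fermat's little theorem, x^(P - 2) is the inverse of any nonzero x
modulo P, so invert(x) * x % P should come out as 1.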
The cost of this patch compared to the ref10 multiplication chain is
five field multiplications, three of which are squarings. The effect on
the benchmark is so small that we don't even notice the difference.
The benefit is 10 meaty lines of code, and a corresponding decrease in
binary size.