Note: I hoped for a 6% speedup, barely observed 3.5%. I suspect this is
because additions, even from pre-computed tables, cost a little more
than doublings.
Note: we could save an additional addition by assuming scalars modulo L
all fit in 252 bits. They do not, but if we pick one at random, they
will in practice (with 2^-128 probability of being wrong, i.e. never).
This would work well in EdDSA, where the scalar is a hash of the private
key. Finding a private key that makes the scalar overflow 252 bits is
about as hard as breaking a 128-bit key, which we can safely assume will
never happen by accident.
However, this scalar multiplication is also used in the dirty public key
generation, which we use to create hidden X25519 keys with Elligator
(from Edwards, because it's twice as fast as the Montgomery ladder).
Here, users provide the scalar directly. Overflowing 252 bits *can*
happen by accident, if users shoot themselves in the foot with a
non-random scalar.
The risk is small, but a measly 1% performance is not worth leaving
ourselves open to subtle corner cases like that.