Loup Vaillant [Thu, 16 Aug 2018 19:29:13 +0000 (21:29 +0200)]
Added tests for HChacha20
Not that it needed any (the XChacha20 tests were enough), but it's easier to
communicate to outsiders that HChacha20 is correct when we have explicit
test vectors.
Loup Vaillant [Wed, 15 Aug 2018 18:02:03 +0000 (20:02 +0200)]
Properly prevent S malleability
S malleability was mostly prevented in a previous commit, for reasons
that had nothing to do with S malleability. This misled users into
thinking Monocypher was not S malleable.
To avoid confusion, I properly verify that S is strictly lower than L
(the order of the curve). S malleability is no longer a thing.
We still have nonce malleability, but that one can't be helped.
Also added Wycheproof test vectors about malleability.
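The strict range check can be sketched as follows. This is a hypothetical helper, not Monocypher's actual code; the byte array is the well-known little-endian encoding of L, the order of the curve.

```c
#include <stdint.h>

// Sketch: return 1 when the 32-byte little-endian scalar s is strictly
// below L (the order of the curve), 0 otherwise. The scan always visits
// all 32 bytes, with no early exit.
static int is_below_l(const uint8_t s[32])
{
    // L = 2^252 + 27742317777372353535851937790883648493, little-endian
    static const uint8_t L[32] = {
        0xed, 0xd3, 0xf5, 0x5c, 0x1a, 0x63, 0x12, 0x58,
        0xd6, 0x9c, 0xf7, 0xa2, 0xde, 0xf9, 0xde, 0x14,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10,
    };
    int lt = 0; // set to 1 once we know s < L
    int gt = 0; // set to 1 once we know s > L
    for (int i = 31; i >= 0; i--) {          // most significant byte first
        lt |= (s[i] < L[i]) & ~gt & ~lt;
        gt |= (s[i] > L[i]) & ~lt & ~gt;
    }
    return lt; // s == L also returns 0: equality is rejected
}
```

Note that s == L is rejected as well, since only values *strictly* below L are canonical.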
Loup Vaillant [Sat, 11 Aug 2018 18:05:28 +0000 (20:05 +0200)]
Signed sliding windows for EdDSA
Signed sliding windows are effectively one bit wider than their unsigned
counterparts, without doubling the size of the corresponding look up
table. Going from 4-bit unsigned to 5-bit signed allowed us to save
almost 17 additions on average.
This gain is less impressive than it sounds: the whole operation still
costs 254 doublings and 56 additions, and going signed made window
construction and look up a bit slower. Overall, we barely gained 2.5%.
We could gain a bit more speed still by precomputing the look up table
for the base point, but the gains would be similar, and the costs in
code size and complexity would be even bigger.
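The recoding behind signed windows can be illustrated with a textbook width-w NAF routine. This is a generic sketch on a 32-bit scalar, not Monocypher's 256-bit implementation.

```c
#include <stdint.h>

// Textbook width-w NAF recoding (an illustration, not Monocypher's code).
// Every nonzero digit is odd and lies in [-(2^(w-1)-1), 2^(w-1)-1], so a
// table of odd multiples covers one extra bit of window width compared to
// the unsigned case: negative digits reuse the same table entries, because
// point negation is nearly free on Edwards curves.
static int wnaf(uint32_t scalar, int w, int8_t digit[40])
{
    uint64_t k = scalar; // wide type: subtracting a negative digit grows k
    int      n = 0;
    while (k > 0) {
        int d = 0;
        if (k & 1) {                            // only odd values get a digit
            d = (int)(k & ((1u << w) - 1));     // low w bits
            if (d >= 1 << (w - 1)) {
                d -= 1 << w;                    // recentre into signed range
            }
            k -= (uint64_t)(int64_t)d;          // k - d is divisible by 2^w
        }
        digit[n++] = (int8_t)d;
        k >>= 1;
    }
    return n; // number of digits produced
}
```

Summing digit[i] * 2^i reconstructs the original scalar, which is an easy way to sanity-check the recoding.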
Loup Vaillant [Sat, 11 Aug 2018 16:19:35 +0000 (18:19 +0200)]
Reduced EdDSA malleability for sliding windows
Signed sliding windows can overflow the initial scalar by one bit. This
is not a problem when the scalar is reduced modulo L, which is smaller
than 2^253. The second half of the signature however is controlled by
the attacker, and can be any value.
Legitimate signatures however always reduce modulo L. They don't really
have to, but this helps with determinism, and enables test vectors. So
we can safely reject any signature whose second half exceeds L.
This patch rejects anything above 2^253-1, thus guaranteeing that the
three most significant bits are cleared. This eliminates S malleability
in most cases, but not all. Besides, there is still nonce malleability.
Users should still assume signatures are malleable.
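The rejection itself only needs to look at the last byte of the little-endian S. A minimal sketch (hypothetical helper name):

```c
#include <stdint.h>

// Sketch: accept s only when s < 2^253, i.e. when the three most
// significant bits of the 32-byte little-endian value are clear.
static int s_fits_253_bits(const uint8_t s[32])
{
    return (s[31] & 0xe0) == 0; // bits 253, 254 and 255 must be zero
}
```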
Loup Vaillant [Sat, 11 Aug 2018 15:36:14 +0000 (17:36 +0200)]
EdDSA sliding windows now indicate the number to add
This is in preparation for signed sliding windows. Instead of choosing
-1 for doing nothing, and an index to point to the table, we write how
much we add directly (that means 0 for nothing). We divide the number
by 2 to get the index.
The double scalarmult routine doesn't handle negative values yet.
Loup Vaillant [Wed, 8 Aug 2018 21:24:25 +0000 (23:24 +0200)]
Signed comb with unsigned table
Or, bitwiseshiftleft saves the day. The current code is hacky as hell,
but it works, and it cleared up my confusion. Turns out a suitable
signed comb is quite different from an unsigned one: the table itself
should represent -1 and 1 bits, instead of 0 and 1 bits.
Right now the same effect is achieved with 2 additions (more precisely,
an addition and a subtraction). With the proper table, it can be one
operation.
Loup Vaillant [Sat, 4 Aug 2018 19:47:40 +0000 (21:47 +0200)]
Avoids the first doubling for EdDSA signatures
The overhead of this first multiplication is not much, but it's
measurable.
Note the use of a macro for the constant time lookup and addition. It
could have been a function, but the function call overhead eats up all
the gains (I guess there are too many arguments to push to and pop from
the stack).
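The constant time lookup pattern the macro implements can be sketched like this (a generic version with made-up sizes, not the actual macro):

```c
#include <stdint.h>
#include <stddef.h>

// Sketch of a constant time table lookup: every entry is read, and a
// mask selects the one we want. The memory access pattern does not
// depend on the secret index. Sizes are made up for illustration.
static void ct_lookup(uint32_t out[4], const uint32_t table[8][4],
                      uint32_t index)
{
    for (size_t j = 0; j < 4; j++) {
        out[j] = 0;
    }
    for (uint32_t i = 0; i < 8; i++) {
        // all-ones when i == index, all-zeros otherwise
        uint32_t mask = (uint32_t)0 - (uint32_t)(i == index);
        for (size_t j = 0; j < 4; j++) {
            out[j] |= table[i][j] & mask;
        }
    }
}
```

Expanding this pattern in place (as a macro) avoids pushing the table pointer, index, and output buffer through a function call, which is the overhead the commit message refers to.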
Loup Vaillant [Sat, 4 Aug 2018 19:37:14 +0000 (21:37 +0200)]
Avoids the first few doublings in EdDSA verification
Legitimate scalars in EdDSA verification are at most 253 bits long.
That's 3 bits less than the full 256 bits. By starting the loop at the
highest bit set, we can save a couple doublings. It's not much, but
it's measurable.
Loup Vaillant [Sat, 4 Aug 2018 19:08:53 +0000 (21:08 +0200)]
Comb for EdDSA signatures in Niels coordinates
While it takes a bit more space to encode, this also avoids some initial
overhead, and significantly reduces stack size.
Note: we could do away with the T2 coordinate to reduce the overhead of
constant time lookup, but this would also require more work per point
addition. Experiments suggest the bigger table is a little faster.
Loup Vaillant [Sat, 4 Aug 2018 13:30:54 +0000 (15:30 +0200)]
All field element constants have the proper invariants
A number of pre-computed constants didn't follow the ideal invariants set
forth by the carry propagation logic. This increased the risk of limb
overflow.
Now all such constants are generated with fe_frombytes(), which
guarantees they can withstand the same number of additions and
subtractions before needing carry propagation. This reduces the risks,
and simplifies the analysis of code using field arithmetic.
Turns out this commit was a huge blunder. Carry propagation works by
minimising the absolute value of each limb. The reverted patch did not
do that, resulting in limbs that were basically twice as big as they
should be.
While it could still work, this would at least reduce the margin for
error. Better safe than sorry: we keep the more versatile loading
routine we had before.
Likewise, constants should minimise the absolute value of their limbs.
Failing to do so caused what was described in issue #107.
Loup Vaillant [Fri, 3 Aug 2018 21:25:55 +0000 (23:25 +0200)]
Cleaner fe_frombytes() (loading field elements)
The old version of fe_frombytes() from the ref10 implementation was not
as clean as I wanted it to be: instead of loading exactly the right
bytes, it played fast and loose, then used a carry operation to
compensate.
It works, but there's a more direct, simpler, and I suspect faster
approach: put the right bits in the right place to begin with.
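The "right bits in the right place" idea can be sketched with the usual ten-limb layout (alternating 26- and 25-bit limbs). This is a simplified illustration, not Monocypher's exact code.

```c
#include <stdint.h>

static uint32_t load32_le(const uint8_t s[4])
{
    return (uint32_t)s[0]
        | ((uint32_t)s[1] <<  8)
        | ((uint32_t)s[2] << 16)
        | ((uint32_t)s[3] << 24);
}

// Sketch of direct loading: limb i starts at a fixed bit offset, so we
// read 32 bits starting at the right byte, shift, and mask. No carry
// propagation pass is needed afterwards. (Simplified illustration.)
static void fe_frombytes_sketch(int32_t h[10], const uint8_t s[32])
{
    static const int offset[10] = {
        0, 26, 51, 77, 102, 128, 153, 179, 204, 230,
    };
    for (int i = 0; i < 10; i++) {
        int      width = (i & 1) ? 25 : 26; // limbs alternate 26/25 bits
        uint32_t w     = load32_le(s + offset[i] / 8) >> (offset[i] % 8);
        h[i] = (int32_t)(w & ((1u << width) - 1)); // also drops bit 255
    }
}
```

Each 32-bit read contains all the bits the limb needs (width plus the sub-byte shift never exceeds 32), so no correcting carry is required.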
Loup Vaillant [Fri, 3 Aug 2018 17:28:31 +0000 (19:28 +0200)]
Specialised adding code for EdDSA signatures
- Saved one multiplication by assuming Z=1
- Hoisted wipes out of loops
- Removed wipes for variable time additions
This made both signatures and verification a bit faster. (Note: current
signature verification speed is only 23% slower than key exchange. I
didn't think it could be that fast.)
Loup Vaillant [Fri, 3 Aug 2018 16:47:15 +0000 (18:47 +0200)]
Full pre-computed table for EdDSA signatures
The main gain for now comes from reducing the amount of constant time
lookup. We could reduce the table's size even further, *or* save a few
multiplications.
I'm currently a little suspicious of the way I generated the table. If
it passes the tests, it shouldn't have any error, but it still requires
some checking.
Point addition used to use 8 intermediate variables. That's 6 more than
what was needed. Removing them made wiping faster, and shrank the stack
by 240 bytes. (Stack size may matter in embedded systems.)
This will require some tweaking yet. We could use signed combs to add 1
bit to the table (currently 4) without making it any bigger. We should
hoist buffer wipes out of the double and add operations. We could
consider other means of adding and doubling, to save multiplications and
to reduce table sizes (smaller tables have faster constant time lookup).
But we're faster than key exchange already. We're on the right track.
There are two reasons: the first is that there are 2 special cases where
the ladder or conversion gives the wrong results (the ladder is correct,
it was just hijacked for another purpose).
The second reason is that fixed-base scalar multiplication can be made even
faster in Twisted Edwards space, by using a comb algorithm. The current
patch uses a window, so it is slower. It will get faster again.
Reduces the number of additions in ge_double_scalarmult_vartime().
Verification is now 80% as fast as signing (in naive implementations,
it's only 50% as fast).
It could be even faster, but it's probably not worth the trouble:
- We could precompute the lookup table for the base point instead of
constructing a cache. This would save about 8 point additions total,
at the cost of 64 lines of code just to lay out the 320 precomputed
constants.
- We could use special, cheaper additions for the precomputed base
point, at the cost of an additional addition function.
- We could use *signed* sliding windows to further reduce the number of
additions, at the cost of an additional point subtraction function
(two if combined with special additions for the base point). Besides,
I don't understand how they work.
The low hanging fruits have been taken. Signature verification is
faster than ever before. This is good enough.
Fusing the two scalar multiplications together halves the number of
required doublings. The code is just as simple, and speeds up signature
verification quite a bit.
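The structure of the fused multiplication (often called Shamir's trick) can be shown with plain integers standing in for curve points; one doubling chain serves both scalars.

```c
#include <stdint.h>

// Shamir's trick, sketched with integers standing in for curve points:
// compute a*P + b*Q with a single shared chain of doublings, instead of
// two independent double-and-add loops (which would double twice per bit).
static int64_t double_scalarmult(uint32_t a, uint32_t b,
                                 int64_t P, int64_t Q)
{
    int64_t acc = 0;
    for (int i = 31; i >= 0; i--) {
        acc += acc;                       // one doubling per bit, shared
        if ((a >> i) & 1) { acc += P; }
        if ((b >> i) & 1) { acc += Q; }
    }
    return acc;
}
```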
General scalar multiplication for EdDSA was only used for checking, and
thus no longer needs to be constant time. Removing buffer wipes and
using a variable time ladder simplifies and speeds up checks quite a
bit.
The one that was removed for version 2.0.4, because a bogus special
case caused it to accept forged signatures (a big fat vulnerability).
To avoid the vulnerability, this optimisation is only used for signing
and public key generation. Those never multiply the base point by zero,
and as such should not hit any nasty special case. More specifically:
- Public key generation works the same as X25519: the scalar is trimmed
before it is multiplied by the base point.
- Signing multiplies the base point by a nonce, which by construction is
a random number between 1 and L-1 (L is the order of the curve).
External scrutiny will be needed to confirm this is safe.
Note: signing is now even faster than it was, because multiplying by a
known point (the base point) lets us avoid a conversion and a division.
Some users tend to rely on security properties that are not provided by
cryptographic signatures. This has led to serious problems in the
past, such as Bitcoin transaction malleability (a replay attack where
the recipient could repeat a previously existing transaction).
Mitigations for signature malleability are possible, but they're at best
easily misunderstood, and at worst incomplete. Better to warn users in
the manual than to encourage reliance on non-standard security
properties.
Those test vectors assume SHA-512, and thus are only activated with the
-DED25519_SHA512 compilation option.
Note the omission of malleability test vectors. Monocypher will
happily accept signatures even when S is not in canonical form. This
is contrary to RFC 8032, which requires implementations to check that S
is lower than L.
I believe RFC 8032 is wrong. Non-malleability means that someone who
only knows the public key, message, and signature, cannot produce
another valid signature. It does *not* mean there is only one valid
signature. In fact, when we know the private key, we can produce a
virtually unlimited number of different, valid, canonical signatures.
Like ECDSA, EdDSA uses a nonce. Unlike ECDSA, that nonce doesn't come
from a random source, it comes from a hash of the message itself. This
determinism prevents nonce reuse, among other problems. However,
nothing prevents someone from bypassing this rule and using a random
nonce instead. This will naturally produce a different, yet valid, signature.
EdDSA signatures are not unique. The difference between this and
malleability is subtle enough that advertising non-malleability will
lead users to believe in uniqueness, and bake that faulty assumption
into their designs, which will then be insecure.
Loup Vaillant [Wed, 27 Jun 2018 09:16:56 +0000 (11:16 +0200)]
Easier tarball generation
- TARBALL_VERSION now defaults to the contents of VERSION.md
- TARBALL_DIR now defaults to the current directory
- tarball_ignore is now included in the tarball (one can use the tarball
to generate other tarballs).
Loup Vaillant [Sun, 24 Jun 2018 13:58:55 +0000 (15:58 +0200)]
Don't free() NULL pointers
The alloc() function in the test suite unconditionally succeeds when
trying to allocate zero bytes. It does so by returning NULL right away,
without exiting the program. This was for portability for platforms
that refuse to allocate zero bytes.
Unfortunately, this meant that the test suite later called free() on
those NULL pointers. ISO C defines free(NULL) as a no-op, but some
older platforms do not; wrapping free() in a dealloc() function
sidesteps the question entirely.
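The wrapper is straightforward; a minimal sketch (assuming dealloc()'s only job is the NULL check):

```c
#include <stdlib.h>

// Wrapper that forwards to free() only for non-NULL pointers.
static void dealloc(void *p)
{
    if (p != NULL) {
        free(p);
    }
}
```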
Loup Vaillant [Sat, 23 Jun 2018 18:34:48 +0000 (20:34 +0200)]
EdDSA no longer accepts all zero signatures
This fixes the critical vulnerability in commit e4cbf84384ffdce194895078c88680be0c341d76 (compute signatures in
Montgomery space (faster)), somewhere between versions 0.8 and 1.0, and
detected by the tests in the parent commit.
The fix basically reverts the optimisation, effectively halving the
performance of EdDSA.
It appears the conversion to Montgomery space and back didn't handle
every edge case correctly. That optimisation will be re-introduced once
the issue has been fully understood. This will probably require expert
advice.
Loup Vaillant [Tue, 19 Jun 2018 22:26:41 +0000 (00:26 +0200)]
Corrected failing test on 32-bit systems
When size_t is not uint64_t, converting "negative" size_t values to
uint64_t yields nonsensical results. That is, the following isn't
portable:
    size_t   x = 42;
    uint64_t y = -x;
because y might be missing the high order bits if size_t is smaller than
uint64_t. Instead, we want to convert to a larger integer type *before*
we negate it:
    uint64_t y = -(uint64_t)x;
Loup Vaillant [Sun, 17 Jun 2018 18:05:36 +0000 (20:05 +0200)]
Tests for crypto_verify*() catch more errors
The tests had a symmetry that caused them to miss a possible error,
where an implementer replaces an `|` operator with an `&` operator.
Breaking that symmetry allows them to catch that error.
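The pattern under test looks like this (a sketch in the style of crypto_verify16, not the actual code):

```c
#include <stdint.h>

// Constant time comparison sketch: all byte differences are OR-ed into
// one accumulator, so a single differing byte makes it nonzero. With
// this formulation, swapping the `|` for an `&` would keep diff stuck
// at zero -- the kind of implementation error the tests must catch.
static int verify16(const uint8_t a[16], const uint8_t b[16])
{
    uint8_t diff = 0;
    for (int i = 0; i < 16; i++) {
        diff |= a[i] ^ b[i];
    }
    return diff == 0 ? 0 : -1; // 0 means equal
}
```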
Loup Vaillant [Sun, 17 Jun 2018 17:47:46 +0000 (19:47 +0200)]
Faster crypto_verify*() tests
There was no need to test every possible value to catch the errors the
tests caught. Cutting them down makes the tests 64000 times faster,
which matters quite a lot when we run the TIS interpreter.
The new tests catch just as many errors as the old ones.
Loup Vaillant [Sun, 17 Jun 2018 17:19:20 +0000 (19:19 +0200)]
Run the TIS interpreter in 2 commands instead of 3.
The TIS interpreter doesn't need to be run from inside the
formal-analysis folder. We can refer to the relevant C files directly.
This simplifies the README a tiny little bit.
Loup Vaillant [Sun, 17 Jun 2018 17:06:12 +0000 (19:06 +0200)]
Corrected variable sized buffer in the tests.
The p_eddsa_random() test was tripping up the TIS interpreter because of
a variable sized array allocated on the stack. The test runs properly
with a fixed sized buffer. (Variable size buffers are tested elsewhere,
most notably with the test vectors).
Loup Vaillant [Sat, 16 Jun 2018 10:29:34 +0000 (12:29 +0200)]
Improved the test suite
The test suite has been trimmed down a little, and improved a bit. The
main goal was to have the TIS interpreter run the entire test suite
in less than 20 hours, so I (and others) could realistically run it on
each new release.
- We now have far fewer Argon2i test vectors. Only those at block size
boundaries have been kept (for instance, it is important that we test
both below and above 512 blocks, and below and above 64-byte hashes,
to hit all code paths).
- X25519 test vectors have been cut in half. We have official test
vectors and the Monte Carlo test already, we don't need too many
vectors. The main advantage here is to reduce the size of the test
vector header file.
- The tests for chacha20_set_ctr() explore the test space more
efficiently.
- The comparison between crypto_argon2i() and crypto_argon2i_general()
is no longer repeated 128 times.
- The overlap tests for Argon2i have been cut down, and the overlap
space has been reduced to compensate (so we're still sure there will
be lots of overlap).
- The incremental tests for Blake2b and SHA-512 were cut down to a
number of iterations, and total message size, that is *not* a multiple
of a block length, so tests can fail more reliably if the MIN() macro
has an error.
- The roundtrip tests for EdDSA have now been cut down, and try several
message sizes as well.
- The random tests for EdDSA (which are mostly meant to test what
happens when the point is outside the curve), have been cut down (from
1000 to 100), and try several message sizes as well.
Loup Vaillant [Sat, 16 Jun 2018 10:03:22 +0000 (12:03 +0200)]
Reset SHA-512 input buffer like Blake2b's
This is mostly for consistency (code that follows the same patterns
everywhere is more easily reviewed). The generated code is also a tiny
bit more efficient that way.
Loup Vaillant [Sat, 16 Jun 2018 09:35:52 +0000 (11:35 +0200)]
Fixed undefined behaviour in Blake2b
Fixes #96
The function blake2b_set_input() was reading uninitialised memory.
While this didn't matter in practice (most platforms don't have trap
representations for unsigned integers), it is undefined behaviour under
the C and C++ standards. To fix it, we reset the whole input buffer
before setting its first byte.
The fix introduces a conditional, but that conditional only depends
on an index, which itself depends on the size of the input, which is not
secret. We're still "constant time" with respect to secrets.
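A sketch of the fix, with a simplified stand-in for the context struct (not Monocypher's exact layout):

```c
#include <stdint.h>
#include <string.h>

// Simplified sketch of the fix: zero the whole input block before its
// first byte is written, so later whole-word reads never touch
// uninitialised memory. The struct is a stand-in, not the real context.
typedef struct {
    uint64_t input[16]; // 128-byte input block, as 64-bit words
    size_t   input_idx; // number of bytes currently in the block
} ctx_sketch;

static void set_input(ctx_sketch *ctx, uint8_t byte)
{
    if (ctx->input_idx == 0) {
        // Branch depends only on the input size, which is not secret.
        memset(ctx->input, 0, sizeof ctx->input);
    }
    size_t word = ctx->input_idx / 8;
    size_t pos  = ctx->input_idx % 8;
    ctx->input[word] |= (uint64_t)byte << (8 * pos);
    ctx->input_idx++;
}
```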
Loup Vaillant [Sat, 12 May 2018 16:06:18 +0000 (18:06 +0200)]
don't recommend 16 bytes for argon2i digests
Listing 16 bytes as a possible size sounds like an endorsement.
But really, 16 bytes is a bit short, and weakens the security
of the subsequent symmetric crypto. You kind of have to know what
you are doing to select such a size, so let's not list it.
This is C we're talking about. Functions that return void only "cannot
fail" when they're used correctly: incorrect inputs can still trigger
undefined behaviour. In this sense, those functions _can_ fail.
Returning void should be an obvious enough hint that the function
requires no error handling. At least it is if you're familiar enough
with C. (If one is not, one is not qualified to use a crypto library in
an unsafe language.)
An unqualified "cannot fail" doesn't give any more information than
`void`, and may even mislead some users. Better to stay on the safe side.