From: Loup Vaillant <loup@loup-vaillant.fr>
Date: Sat, 10 Nov 2018 12:59:38 +0000 (+0100)
Subject: Added -DBLAKE2_NO_UNROLLING preprocessor option
X-Git-Url: https://git.codecow.com/?a=commitdiff_plain;h=2174e60ef8aaf2c9779c12e4caaab087a4e5ab35;p=Monocypher.git

Added -DBLAKE2_NO_UNROLLING preprocessor option

Less bloat, faster on some embedded platforms.
---
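
Note on the `sigma` change in the patch below: the unrolled code spells rounds 10 and 11 as `BLAKE2_ROUND(0)` and `BLAKE2_ROUND(1)`, while the rolled loop indexes `sigma[i]` for `i` from 0 to 11. Appending copies of rows 0 and 1 to the table is what keeps the two paths equivalent. The following is a minimal sketch of that idea, not Monocypher's code: `toy_round`, `mix_unrolled` and `mix_rolled` are hypothetical names, the mixing step is an arbitrary order-sensitive stand-in for `BLAKE2_G`, and the `FOR` macro is assumed to behave like a plain counting loop.

```c
#include <stddef.h>
#include <stdint.h>

// Assumed to mirror Monocypher's FOR helper: a half-open counting loop.
#define FOR(i, start, end) for (size_t i = (start); i < (end); i++)

// BLAKE2b message schedule with rows 0 and 1 repeated as rows 10 and 11,
// as the patch does.
static const uint8_t sigma[12][16] = {
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
    { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
    { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
    {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
    {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
    {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
    { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
    { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
    {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
    { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
    { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
};

// Toy stand-in for one round (NOT the real BLAKE2 G function): it only
// needs to depend on the order in which sigma[i] selects the input words.
static uint64_t toy_round(uint64_t v, const uint64_t input[16], size_t i)
{
    FOR (j, 0, 16) {
        v = (v ^ input[sigma[i][j]]) * 0x9E3779B97F4A7C15ULL;
        v = (v << 7) | (v >> 57); // rotate left by 7 bits
    }
    return v;
}

// Unrolled form: twelve explicit rounds, the last two reusing rows 0 and 1.
static uint64_t mix_unrolled(const uint64_t input[16])
{
    uint64_t v = 0;
#define TOY_ROUND(i) v = toy_round(v, input, (i))
    TOY_ROUND(0);  TOY_ROUND(1);  TOY_ROUND(2);  TOY_ROUND(3);
    TOY_ROUND(4);  TOY_ROUND(5);  TOY_ROUND(6);  TOY_ROUND(7);
    TOY_ROUND(8);  TOY_ROUND(9);  TOY_ROUND(0);  TOY_ROUND(1);
#undef TOY_ROUND
    return v;
}

// Rolled form: one loop over all twelve rows of the extended table.
static uint64_t mix_rolled(const uint64_t input[16])
{
    uint64_t v = 0;
    FOR (i, 0, 12) {
        v = toy_round(v, input, i);
    }
    return v;
}
```

For any 16-word input, `mix_rolled` and `mix_unrolled` return the same value. An alternative would be to index `sigma[i % 10]` and keep the ten-row table; the two extra rows cost 32 bytes of constant data and avoid a modulo in the loop.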

diff --git a/README.md b/README.md
index 4fc4938..63604e5 100644
--- a/README.md
+++ b/README.md
@@ -126,22 +126,24 @@ TweetNaCl (the default is `-O3 march=native`):
 Customisation
 -------------
 
-For simplicity, compactness, and performance reasons, Monocypher
-signatures default to EdDSA with curve25519 and Blake2b.  This is
-different from the more mainstream Ed25519, which uses SHA-512
-instead.
-
-If you need Ed25519 compatibility, you must do the following:
-
-- Compile Monocypher.c with option -DED25519_SHA512.
-- Link the final program with a suitable SHA-512 implementation.  You
-  can use the `sha512.c` and `sha512.h` files provided in
-  `src/optional`.
-
-Note that even though the default hash (Blake2b) is not "standard",
-you can still upgrade to faster implementations if you really need to.
-The Donna implementations of ed25519 for instance can use a custom
-hash (one test does just that).
+Monocypher has two preprocessor flags: `ED25519_SHA512` and
+`BLAKE2_NO_UNROLLING`, which are activated by compiling monocypher.c
+with the options `-DED25519_SHA512` and `-DBLAKE2_NO_UNROLLING`
+respectively.
+
+The `-DED25519_SHA512` option is a compatibility feature for public key
+signatures.  The default is EdDSA with Curve25519 and Blake2b.
+Activating the option replaces it with Ed25519 (EdDSA with Curve25519 and
+SHA-512).  When this option is activated, you will need to link the
+final program with a suitable SHA-512 implementation.  You can use the
+`sha512.c` and `sha512.h` files provided in `src/optional`.
+
+The `-DBLAKE2_NO_UNROLLING` option is a performance tweak.  By default,
+Monocypher unrolls the Blake2b inner loop, because it is over 25% faster
+on modern processors.  On some embedded processors, however, unrolling
+the loop makes it _slower_ (the unrolled loop is 5KB bigger, and may
+strain the instruction cache).  If you're using an embedded platform,
+try this option.  The binary will be smaller, and perhaps even faster.
 
 
 Contributor notes
diff --git a/src/monocypher.c b/src/monocypher.c
index 40bb999..6ea9832 100644
--- a/src/monocypher.c
+++ b/src/monocypher.c
@@ -482,6 +482,8 @@ static void blake2b_compress(crypto_blake2b_ctx *ctx, int is_last_block)
         { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
         {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
         { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
+        {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
+        { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
     };
 
     // init work vector
@@ -511,9 +513,15 @@ static void blake2b_compress(crypto_blake2b_ctx *ctx, int is_last_block)
     BLAKE2_G(v, 2, 7,  8, 13, input[sigma[i][12]], input[sigma[i][13]]);\
     BLAKE2_G(v, 3, 4,  9, 14, input[sigma[i][14]], input[sigma[i][15]])
 
+#ifdef BLAKE2_NO_UNROLLING
+    FOR (i, 0, 12) {
+        BLAKE2_ROUND(i);
+    }
+#else
     BLAKE2_ROUND(0);  BLAKE2_ROUND(1);  BLAKE2_ROUND(2);  BLAKE2_ROUND(3);
     BLAKE2_ROUND(4);  BLAKE2_ROUND(5);  BLAKE2_ROUND(6);  BLAKE2_ROUND(7);
     BLAKE2_ROUND(8);  BLAKE2_ROUND(9);  BLAKE2_ROUND(0);  BLAKE2_ROUND(1);
+#endif
 
     // update hash
     ctx->hash[0] ^= v0 ^ v8;