Though it requires a cast at one point (safe, because everything
involved is aligned), it makes the code simpler and significantly
speeds up non-aligned incremental hashes.
Surprisingly, forgoing word-by-word loading at the beginning of the
update doesn't slow anything down, but forgoing it at the end *does*.
So at the start of an update we align with the block boundary directly,
using a plain byte copy, while at the end we copy the remaining whole
words first and then the remaining bytes.
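
A minimal sketch of that update shape follows. It is not the code from
this change: `toy_ctx`, `toy_update` and the other names are hypothetical,
the 64-byte block size is assumed, and FNV-1a stands in for the real
compression function so that the result is easy to check. The head of the
update tops up the buffer with a byte copy, whole blocks are consumed
straight from the input, and the tail is stored as whole words followed
by the leftover bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64                      /* assumed block size */
#define BLOCK_WORDS (BLOCK_BYTES / 4)

typedef struct {
    uint32_t state;                         /* toy chaining value */
    uint32_t block[BLOCK_WORDS];            /* word-aligned block buffer */
    size_t   used;                          /* buffered bytes, always < BLOCK_BYTES */
} toy_ctx;

static void toy_init(toy_ctx *c) {
    c->state = 0x811c9dc5u;                 /* FNV-1a 32-bit offset basis */
    memset(c->block, 0, sizeof c->block);
    c->used = 0;
}

/* Stand-in compression function: folds n bytes into h with FNV-1a. */
static uint32_t toy_mix(uint32_t h, const unsigned char *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

static void toy_update(toy_ctx *c, const void *data, size_t len) {
    const unsigned char *in = data;
    /* Byte view of the word buffer; a character-type view is always valid. */
    unsigned char *buf = (unsigned char *)c->block;

    assert(c->used < BLOCK_BYTES);
    if (len == 0)
        return;

    /* Head: go straight to the block boundary with a plain byte copy. */
    if (c->used != 0) {
        size_t want = BLOCK_BYTES - c->used;
        size_t take = len < want ? len : want;
        memcpy(buf + c->used, in, take);
        c->used += take;
        in += take;
        len -= take;
        if (c->used < BLOCK_BYTES)
            return;                         /* still no full block */
        c->state = toy_mix(c->state, buf, BLOCK_BYTES);
        c->used = 0;
    }

    /* Middle: whole blocks are consumed directly from the input. */
    while (len >= BLOCK_BYTES) {
        c->state = toy_mix(c->state, in, BLOCK_BYTES);
        in += BLOCK_BYTES;
        len -= BLOCK_BYTES;
    }

    /* Tail: remaining whole words first, then the remaining bytes. */
    size_t words = len / 4;
    for (size_t i = 0; i < words; i++)
        memcpy(&c->block[i], in + 4 * i, 4);    /* safe for unaligned input */
    memcpy(buf + 4 * words, in + 4 * words, len % 4);
    c->used = len;
}

/* Folds the buffered tail into the state and returns the toy digest. */
static uint32_t toy_final(const toy_ctx *c) {
    return toy_mix(c->state, (const unsigned char *)c->block, c->used);
}
```

Feeding the same message in one call or in odd-sized pieces should give
the same digest, which is a quick way to exercise the head and tail paths.
The per-word `memcpy` is a conservative choice for a sketch; whether a
direct word load is valid depends on the alignment guarantees of the
surrounding code.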