The width of the pre-computed window affects the program size. It has
been set to 5 (8 elements) so we can approach maximum performance
without bloating the program too much.
The width of the cached window affects the *stack* size. It has been set
to 3 (2 elements) to avoid blowing up the stack (this matters most on
embedded environments). The performance hit is measurable, yet very
reasonable.
Footgun wielders can adjust those widths as they see fit.