Increased performance by 30% for RGBA and 45% for Gray images, minor
performance increase for 16-bit images.
The start offset calculated by createWeights are stored in a slice and
passed to the resize functions to prevent duplication of effort.
This is a complete rewrite! The tight scaling loop needs data locality for optimal performance. The old version used lots of pointer redirections to access image data which was bad for data locality. By providing the complete loop for each image type, this problem is solved. Unfortunately this increases code duplication but the result should be worth it: I could measure a ~6x speed-up for certain test cases!
For each row the convolution kernel is evaluated at fixed positions around "u". By calculating these values once for each row, a huge speedup is achieved.
Profiling the resize operation using Lanczos kernels showed that most of
the time was spent in the Sin function, which is called twice per
evaluation of the Lanczos kernel. Generating a lookup table and doing a
linear interpolation between table entries speeds up the resize by a
factor of 4.
Filter kernels should yield Zero if they are evaluted outside their
intended size. Though filterModel.Interpolate doesn't do this by design,
it's better to include it anyways.