We were computing abs(sum(signed_pixels)), while the comment says
that sum(abs(signed_pixels)) works well. Change the code to match
the comment. (While I did tweak this code recently, as far as I can
tell the code hasn't matched the comment since it was originally
added.)
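A minimal sketch of the difference (Python with illustrative names; the real code is C++ in LibGfx):

```python
# Illustrative sketch of the two heuristics (not the actual LibGfx C++).
# The PNG filter heuristic is meant to score a candidate filtered row by
# the sum of absolute values of its signed bytes; taking the absolute
# value of the sum instead lets positive and negative deltas cancel.

def score_as_written(signed_pixels):
    return abs(sum(signed_pixels))             # abs(sum(...)): what the code did

def score_as_commented(signed_pixels):
    return sum(abs(p) for p in signed_pixels)  # sum(abs(...)): what the comment says

# +100 and -100 cancel under abs(sum) and score a misleading 0, while
# sum(abs) reports the full 200 of residual energy in the row.
row = [100, -100]
print(score_as_written(row), score_as_commented(row))  # 0 200
```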
Using the benchmarking script and same inputs as in #24819, just changed
to write .png files with different --png-compression-level values:
level 0: 390M -> 390M (no change)
level 1: 83M -> 78M (6% smaller)
level 2: 73M -> 69M (5.8% smaller at default compression level)
level 3: 71M -> 67M (5.6% smaller)
Sizes as computed by `du -hAs`. (`-A` prints apparent size on macOS.
It's important to use that flag to get consistent results. On Linux,
this flag is spelled `-b` instead.)
The size of sunset_retro.png goes up a tiny bit, but less than
0.4% at all sizes. At level 2, the size goes from 908K to 911K, for
example.
The size of Tests/LibGfx/test-inputs/jpg/big_image.jpg encoded as PNG
goes down by about 2.7%, but it's the 2.7% that gets us over an MB
boundary at levels 1 and 2. At level 1, from 14M to 13M; at level 2
from 13M to 12M. (Exact numbers: 14417809 bytes to 13429605 at level 1,
14076443 bytes to 13088791 at level 2.) For comparison, sips writes a
15M (15610049 bytes) file. So we were already writing a smaller file,
and now we're even better. (We need 778 ms at level 1 while sips
needs 723 ms, so sips is still a bit faster, but not by much.)
The size of wow.apng goes from 606K to 584K (3.6% smaller).
Perf-wise, this is close to a wash. Worst case, it's maybe 2-3% slower,
but the numbers are pretty noisy, even with many repetitions in
`hyperfine`. I'm guessing `ministat` would claim that no significant
difference can be shown, but I haven't tried it. For example, for
sunset_retro.png at level 2, time goes from 179.3 ms ± 2.5 ms to
182.8 ms ± 1.9 ms, which would be a 2% slowdown. At level 0, where
the effect is relatively largest, it goes from 21.8 ms ± 0.7 ms to
22.6 ms ± 0.7 ms, a 3.6% slowdown (but with huge error bars).
For big_image.jpg level 1, time goes from 768.5 ms ± 8.4 ms to
777.9 ms ± 6.0 ms, 1.2% slower.
Previously, we were swapping red and blue before filtering.
The filters don't care about channel order, so instead do the
swap only when writing the PNG data.
In theory, this saves the work of channel swizzling when figuring
out which filter is best. In practice, it's perf-neutral:
swizzling is basically free. But it's still conceptually simpler.
No behavior change.
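The claim that the filters don't care about channel order can be sanity-checked with a small sketch (Python, illustrative; the Sub filter and the scoring follow the PNG spec, everything else is made up for the example):

```python
# The PNG Sub filter subtracts bytes at a fixed per-pixel stride, so
# permuting the channels inside every pixel identically (e.g. BGRA vs.
# RGBA) only permutes the output bytes and leaves the score unchanged.

BPP = 4  # bytes per pixel (RGBA)

def sub_filter(row):
    return [(row[i] - (row[i - BPP] if i >= BPP else 0)) & 0xFF
            for i in range(len(row))]

def signed_abs_sum(filtered):
    # Interpret each output byte as signed and sum absolute values.
    return sum(abs(b - 256 if b > 127 else b) for b in filtered)

def swap_red_blue(row):
    out = list(row)
    for p in range(0, len(row), BPP):
        out[p], out[p + 2] = out[p + 2], out[p]
    return out

rgba = [10, 20, 30, 255, 12, 22, 28, 255]  # two RGBA pixels
assert signed_abs_sum(sub_filter(rgba)) == signed_abs_sum(sub_filter(swap_red_blue(rgba)))
```

So the filter scores come out the same either way, and the swizzle can happen once at write time.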
Brings wow.apng from 1.2M to 606K, while reducing encoding time from
233 ms to 167 ms.
(For comparison, writing wow.webp currently takes 88 ms and produces
a 255K file. The input wow.gif is 184K.)
Before, we would compute and store the output of each predictor,
then pick the best one, and then copy its data.
Now, we still run each predictor, but only to compute its score;
we do not store the predicted data. We then pick the best one, and
do a second pass that re-computes the output of the best predictor
and stores it.
Put differently: instead of computing the output of the 5 different
predictors once, we now compute the output of the 5 different
predictors, and then the output of one of them a second time. In
exchange, we only write each output row once instead of 5 times. (We
also have to read the input row twice instead of once, but the second
time round it'll come from L1 or L2 cache.)
Making the simplifying assumption that each predictor takes the same
time to compute, this increases compute to 6/5ths, and halves memory
bandwidth: 6 row transfers before, 3 after. (Before: 1 input row read,
5 output row writes; after: 2 input row reads, 1 output row write.)
Produces exactly the same output, but is faster:
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 0
34.8 ms ± 0.9 ms -> 22.7 ms ± 0.8 ms (34.7% faster)
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 1
64.2 ms ± 4.9 ms -> 50.5 ms ± 0.5 ms (31.3% faster)
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 2
190.3 ms ± 1.6 ms -> 179.0 ms ± 2.8 ms (5.8% faster)
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 3
646.5 ms ± 4.7 ms -> 635.3 ms ± 4.4 ms (3.3% faster)
Compression level 2 is the default, so about a 6% speedup in practice.
`sips` still needs 49.9 ms ± 3.0 ms to convert sunset_retro.bmp to
sunset_retro.png at its default compression level 1.
We used to take 1.27x as long as sips, now we take 1.01x as long,
while producing a smaller output :^)
(For other, larger, input files sips is still faster and produces
smaller output.)
Produces exactly the same output, but a bit faster.
The speedup is relatively bigger for worse compression:
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 0
56.8 ms ± 1.5 ms -> 34.8 ms ± 0.9 ms (38.7% faster)
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 1
84.6 ms ± 1.7 ms -> 64.2 ms ± 4.9 ms (24.1% faster)
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 2
212.1 ms ± 2.5 ms -> 190.3 ms ± 1.6 ms (10.3% faster)
image -o sunset_retro.png sunset_retro.bmp --png-compression-level 3
671.4 ms ± 12.3 ms -> 646.5 ms ± 4.7 ms (3.7% faster)
Compression level 2 is the default, so about a 10% speedup in practice.
For comparison, `sips` needs 49.9 ms ± 3.0 ms to convert
sunset_retro.bmp to sunset_retro.png, and judging from the output file
size, it uses something similar to our compression level 1.
We used to take 1.7x as long as sips, now we take 1.29x as long.
Using the same two benchmarks as in the previous commit:
1.
n | time | size
--+--------------------+--------
0 | 56.5 ms ± 0.9 ms | 2.3M
1 | 88.2 ms ± 14.0 ms | 962K
2 | 214.8 ms ± 5.6 ms | 908K
3 | 670.8 ms ± 3.6 ms | 903K
Compared to the numbers in the previous commit:
n = 0: 17.3% faster, 23.3% smaller
n = 1: 12.9% faster, 12.5% smaller
n = 2: 24.9% faster, 9.2% smaller
n = 3: 49.6% faster, 9.6% smaller
For comparison,
`sips -s format png -o sunset_retro_sips.png sunset_retro.bmp` writes
a 1.1M file (i.e. it always writes RGBA, even when RGB would do),
and it needs 49.9 ms ± 3.0 ms for that (also using a .bmp input). So
our output file size is competitive! We have to get a bit faster though.
For another comparison, `image -o sunset_retro.webp sunset_retro.bmp`
writes a 730K file and needs 32.1 ms ± 0.7 ms for that.
2.
n | time | size
--+----------------+------
0 | 11.334 total | 390M
1 | 13.640 total | 83M
2 | 15.642 total | 73M
3 | 48.643 total | 71M
Compared to the numbers in the previous commit:
n = 0: 15.8% faster, 25.0% smaller
n = 1: 15.5% faster, 7.7% smaller
n = 2: 24.0% faster, 5.2% smaller
n = 3: 29.2% faster, 5.3% smaller
So a relatively bigger speed win for higher levels, and
a bigger size win for lower levels.
Also, the size at n = 2 with this change is now lower than it
was at n = 3 previously.
This warning is triggered when a function (that is not marked with
[[gnu::target(...)]]) accepts or returns vectors that would have been
passed in registers if the current translation unit had been compiled
with more permissive instruction selection flags (i.e. if one adds
-mavx2 to the command line). This will never be a problem for us,
since (a) we never use different instruction selection options across
ABI boundaries, and (b) most of the affected functions are TU-local
anyway.
Moreover, even if we somehow properly annotated all of the SIMD helpers,
calling them across ABI (or target) boundaries would still be very
dangerous because of inconsistent and bogus handling of
[[gnu::target(...)]] across compilers. See
https://github.com/llvm/llvm-project/issues/64706 and
https://www.reddit.com/r/cpp/comments/17qowl2/comment/k8j2odi .