In this blog post I show which compiler options have the biggest impact on Brotli's performance.
But first, I decided to test Brotli with a bigger file.
I ran it on a 1.6 GB CSV file containing mostly lorem ipsum text:
-O0
Real: 27m45.129s
User: 27m42.538s
Sys: 0m1.039s

-O1
Real: 8m53.567s
User: 8m52.045s
Sys: 0m0.839s

-O2
Real: 7m53.465s
User: 7m52.098s
Sys: 0m0.709s

-O3
Real: 7m55.876s
User: 7m54.602s
Sys: 0m0.609s
This falls in line with my previous results. -O2 gives the best performance and -O3 lags slightly behind -O2 in time.
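To make the gap between the optimization levels concrete, the `time` values above can be converted to seconds and compared. This is a minimal sketch, assuming the `NmS.SSSs` format shown in the results; `parse_time` and `REAL_TIMES` are names introduced here for illustration only:

```python
import re

def parse_time(text: str) -> float:
    """Convert a `time`-style value such as '7m53.465s' or '52.252s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))

# Real times from the 1.6 GB run above.
REAL_TIMES = {
    "-O0": "27m45.129s",
    "-O1": "8m53.567s",
    "-O2": "7m53.465s",
    "-O3": "7m55.876s",
}

baseline = parse_time(REAL_TIMES["-O0"])
for level, value in REAL_TIMES.items():
    seconds = parse_time(value)
    # Speedup is expressed relative to the unoptimized -O0 build.
    print(f"{level}: {seconds:8.3f} s, {baseline / seconds:.2f}x the speed of -O0")
```

Most of the gain comes from moving off -O0; the -O2 and -O3 times differ by only a couple of seconds.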
To find the best compiler options, I needed to run through all the possible combinations.
To do this, I wrote a Python script that generated a list of all the possible options.
I then wrote a bash script that went through each option and reported the time.
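The original scripts are not shown here, so the following is only a sketch of how such a list could be generated in Python. The flag list is a small subset of the flags named in this post, and `flag_combinations` and `max_size` are hypothetical names:

```python
from itertools import combinations

# Assumption: a small subset of candidate flags; the real list was longer.
CANDIDATE_FLAGS = [
    "-ftree-loop-distribution",
    "-ftree-slp-vectorize",
    "-fpeel-loops",
    "-fsplit-paths",
]

def flag_combinations(flags, max_size):
    """Yield '-O2' plus every combination of up to max_size extra flags."""
    for size in range(1, max_size + 1):
        for combo in combinations(flags, size):
            yield " ".join(("-O2",) + combo)

options = list(flag_combinations(CANDIDATE_FLAGS, max_size=2))
print(len(options))
for option in options[:3]:
    print(option)
```

The number of combinations grows quickly with the number of flags and the maximum size, which is one reason a full sweep can take a very long time.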
I decided to capture the best real (total) time and the best user-mode time.
One thing to note about the tests is that the scripts did not fully complete their testing, so some of the data might be skewed. Most of the scripts got through enough options that I am confident further testing is not needed.
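A sketch of the timing step, written in Python rather than bash; it is not the script used for these results. `time_command` and `best_by` are hypothetical helpers, and the child user time comes from `os.times()`, which is only populated on Unix-like systems:

```python
import os
import subprocess
import sys
import time

def time_command(cmd):
    """Run cmd once and return (real_seconds, user_seconds) for the child."""
    cpu_before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    real = time.perf_counter() - start
    user = os.times().children_user - cpu_before.children_user
    return real, user

def best_by(results, index):
    """Return the (label, timings) pair with the smallest timing at index."""
    return min(results.items(), key=lambda item: item[1][index])

# Stand-in commands so the sketch runs anywhere; a real run would rebuild
# Brotli with each flag set and time the compression of the test file.
commands = {
    "noop": [sys.executable, "-c", "pass"],
    "loop": [sys.executable, "-c", "sum(range(200000))"],
}
results = {label: time_command(cmd) for label, cmd in commands.items()}
print("best real:", best_by(results, 0))
print("best user:", best_by(results, 1))
```

Tracking real and user time separately matters because the option set with the best real time is not always the one with the best user time.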
Here is a summary of the results:
*Aarchie with 180 MB file
Just with -O2:
real 2m51.523s
user 2m51.247s
Number of options tested: 1284
Real best time: 2m50.799s (CMAKE_CXX_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone)
User best time: 2m50.367s (CMAKE_CXX_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre)
**************************************************************************************************************************
*Bbetty with 180 MB file
Just with -O2:
real 0m52.384s
user 0m52.153s
Number of options tested: 1331
Real best time: 52.252s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops)
User best time: 51.924s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops)
**************************************************************************************************************************
*Ccharlie with 1.7 GB file
Just with -O2:
real 7m53.515s
user 7m52.077s
Number of options tested: 345
Real best time: 7m52.480s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone)
User best time: 7m51.016s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone)
The best results were only about a second (or less) faster than plain -O2.
Another interesting thing is which flags repeated most often:
-ftree-loop-distribution (5 times): Performs loop distribution, which allows further loop optimizations such as vectorization.
-ftree-slp-vectorize (5 times): Performs basic block vectorization on trees.
-fpeel-loops (4 times): Peels loops when there is enough information that they do not roll much, and turns on complete loop peeling.
-fsplit-paths (4 times): Splits paths leading to loop backedges, which can improve dead code elimination.
Reference: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
I find it interesting that the best-performing options all deal with loops. Most of Brotli's expensive functions do a lot of looping. For example, BrotliIsMostlyUTF8 takes up 25 percent of the time and involves a lot of looping. It can be viewed here: https://github.com/google/brotli/blob/35e69fc7cf9421ab04ffc9d52cb36d07fa12984a/c/enc/utf8_util.c. It makes sense that the faster options are the ones that optimize loops, since the program does a lot of it.
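To give a sense of the kind of loop involved, here is a simplified Python illustration of a "mostly UTF-8" check. It is not Brotli's implementation (that is the C code at the link above); the function name, the 0.75 default threshold, and the use of `bytes.decode` to validate each sequence are all assumptions made for this sketch:

```python
def is_mostly_utf8(data: bytes, min_fraction: float = 0.75) -> bool:
    """Return True if more than min_fraction of the bytes decode as UTF-8."""
    utf8_bytes = 0
    i = 0
    while i < len(data):
        step = 1  # skip a single byte when nothing valid starts here
        for length in (1, 2, 3, 4):  # a UTF-8 sequence is 1 to 4 bytes long
            try:
                data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            utf8_bytes += length
            step = length
            break
        i += step
    return utf8_bytes > min_fraction * len(data)

print(is_mostly_utf8("lorem ipsum, héllo".encode("utf-8")))
print(is_mostly_utf8(bytes([0xFF] * 16)))
```

The point is the shape of the work: one pass over every byte of the input with a small amount of branching per step, which is the kind of code that loop-oriented compiler options target.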
Once I had six option combinations from the ARM-based systems, I tested each one individually on xerxes to check whether the results from the ARM systems carried over to the x86 architecture.
Benchmark -O2 (180 MB file):
Run #1
Real 0m41.442s
User 0m41.200s
Run #2
Real 0m41.404s
User 0m41.136s
AVG
Real 0m41.423s
User 0m41.168s

-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
Run #1
Real 0m41.502s
User 0m41.246s
Run #2
Real 0m41.402s
User 0m41.143s
AVG
Real 41.452s
User 41.195s
Relative to Real Benchmark Percent: 0.0700094% increase
Relative to User Benchmark Percent: 0.0655849% increase

-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
Run #1
Real 0m41.433s
User 0m41.165s
Run #2
Real 0m41.433s
User 0m41.169s
AVG
Real 41.433s
User 41.167s
Relative to Real Benchmark Percent: 0.0241412% increase
Relative to User Benchmark Percent: 0.00242907% decrease

-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m41.379s
User 0m41.109s
Run #2
Real 0m41.372s
User 0m41.113s
AVG
Real 41.376s
User 41.111s
Relative to Real Benchmark Percent: 0.113464% decrease
Relative to User Benchmark Percent: 0.138457% decrease

-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m41.984s
User 0m41.735s
Run #2
Real 0m41.367s
User 0m41.123s
AVG
Real 41.676s
User 41.429s
Relative to Real Benchmark Percent: 0.610772% increase
Relative to User Benchmark Percent: 0.633988% increase

-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m41.407s
User 0m41.151s
Run #2
Real 0m41.369s
User 0m41.136s
AVG
Real 41.388s
User 41.144s
Relative to Real Benchmark Percent: 0.0844941% decrease
Relative to User Benchmark Percent: 0.0582977% decrease

-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m41.337s
User 0m41.065s
Run #2
Real 0m41.337s
User 0m41.087s
AVG
Real 41.337s
User 41.076s
Relative to Real Benchmark Percent: 0.207614% decrease
Relative to User Benchmark Percent: 0.223475% decrease

*Increases are worse times and decreases are better times
I was surprised by the results of my testing. It seems that some of the options don't work nearly as well on the x86 architecture. Measured by user time, two of the six combinations did worse than the benchmark and the other four beat it; measured by real time, the split was three and three. In addition, the percentage differences are minuscule: the best times beat the benchmark by less than 1%, and the options that did worse stayed within 1% as well. Adding options on xerxes didn't really affect performance.
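The percentages in the listing are the change in average time relative to the -O2 benchmark on the same machine. A small sketch of that arithmetic, with `percent_change` as an illustrative name and the averages taken from the first xerxes combination:

```python
def percent_change(benchmark: float, measured: float) -> float:
    """Positive means slower than the benchmark, negative means faster."""
    return (measured - benchmark) / benchmark * 100

benchmark_real = 41.423  # -O2 average real time on xerxes, in seconds
measured_real = 41.452   # average real time for the first flag combination

change = percent_change(benchmark_real, measured_real)
direction = "increase" if change > 0 else "decrease"
print(f"{abs(change):.4f}% {direction}")  # prints: 0.0700% increase
```

A difference of a few hundredths of a second on a 41-second run is well inside the run-to-run variation visible between Run #1 and Run #2 for some combinations, so these percentages should be read with caution.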
I also wanted to compare Aarchie and Bbetty. They have different microarchitectures, and I wanted to see how the different options react to them. I used a 180 MB file for this testing.
Aarchie:
Just with -O2:
Run #1
Real 0m49.034s
User 0m48.777s
Run #2
Real 0m49.065s
User 0m48.691s
AVG
Real 49.050s
User 48.734s

-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
Run #1
Real 0m49.201s
User 0m48.911s
Run #2
Real 0m49.176s
User 0m48.888s
AVG
Real 49.189s
User 48.900s
Relative to Real Benchmark Percent: 0.283384% increase
Relative to User Benchmark Percent: 0.340625% increase

-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
Run #1
Real 0m49.536s
User 0m49.313s
Run #2
Real 0m49.541s
User 0m49.363s
AVG
Real 49.539s
User 49.338s
Relative to Real Benchmark Percent: 0.996942% increase
Relative to User Benchmark Percent: 1.23938% increase

-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m49.271s
User 0m49.054s
Run #2
Real 0m49.047s
User 0m48.728s
AVG
Real 49.159s
User 48.891s
Relative to Real Benchmark Percent: 0.222222% increase
Relative to User Benchmark Percent: 0.322157% increase

-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m49.062s
User 0m48.838s
Run #2
Real 0m48.924s
User 0m48.677s
AVG
Real 48.993s
User 48.758s
Relative to Real Benchmark Percent: 0.116208% decrease
Relative to User Benchmark Percent: 0.0492469% increase

-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m48.967s
User 0m48.729s
Run #2
Real 0m49.039s
User 0m48.802s
AVG
Real 49.003s
User 48.766s
Relative to Real Benchmark Percent: 0.0958206% decrease
Relative to User Benchmark Percent: 0.0656626% increase

-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m49.026s
User 0m48.833s
Run #2
Real 0m48.989s
User 0m48.649s
AVG
Real 48.930s
User 48.741s
Relative to Real Benchmark Percent: 0.244648% decrease
Relative to User Benchmark Percent: 0.0143637% increase

Bbetty:
Just with -O2:
Run #1
Real 0m52.795s
User 0m52.556s
Run #2
Real 0m52.760s
User 0m52.476s
AVG
Real 52.776s
User 52.516s

-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
Run #1
Real 0m53.189s
User 0m52.881s
Run #2
Real 0m53.224s
User 0m52.196s
AVG
Real 53.207s
User 52.539s
Relative to Real Benchmark Percent: 0.816659% increase
Relative to User Benchmark Percent: 0.0437962% increase

-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
Run #1
Real 0m54.139s
User 0m53.926s
Run #2
Real 0m54.198s
User 0m53.975s
AVG
Real 54.169s
User 53.951s
Relative to Real Benchmark Percent: 2.63946% increase
Relative to User Benchmark Percent: 2.7325% increase

-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m52.880s
User 0m52.629s
Run #2
Real 0m52.774s
User 0m52.559s
AVG
Real 52.827s
User 52.594s
Relative to Real Benchmark Percent: 0.0966348% increase
Relative to User Benchmark Percent: 0.148526% increase

-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m52.853s
User 0m52.562s
Run #2
Real 0m52.847s
User 0m52.537s
AVG
Real 52.850s
User 52.550s
Relative to Real Benchmark Percent: 0.140215% increase
Relative to User Benchmark Percent: 0.0647422% increase

-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m52.828s
User 0m52.534s
Run #2
Real 0m52.848s
User 0m52.548s
AVG
Real 52.838s
User 52.541s
Relative to Real Benchmark Percent: 0.117478% increase
Relative to User Benchmark Percent: 0.0476045% increase

-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m52.780s
User 0m52.578s
Run #2
Real 0m52.795s
User 0m52.543s
AVG
Real 52.788s
User 52.561s
Relative to Real Benchmark Percent: 0.0227376% increase
Relative to User Benchmark Percent: 0.0856882% increase
An interesting thing to note is that Aarchie has a slight performance advantage over Bbetty, which I believe is because Aarchie has a better CPU. The results on Aarchie and Bbetty are also inconsistent with each other. On Bbetty, every combination failed to beat the benchmark, and one was more than 2% worse. On Aarchie, three combinations had slightly better real times than the benchmark. It looks like slight differences in microarchitecture make a noticeable difference in how these options affect Brotli.
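One way to summarize the comparison is to count, per machine, how many combinations improved on the -O2 benchmark. A short sketch using the real-time percentages reported above, with decreases written as negative values; the dictionary layout and the `count_faster` helper are just for illustration:

```python
# Real-time change relative to -O2, in percent, in the order listed above.
REAL_CHANGE_PERCENT = {
    "Aarchie": [0.283384, 0.996942, 0.222222, -0.116208, -0.0958206, -0.244648],
    "Bbetty": [0.816659, 2.63946, 0.0966348, 0.140215, 0.117478, 0.0227376],
}

def count_faster(changes):
    """Count the combinations whose real time dropped below the benchmark."""
    return sum(1 for change in changes if change < 0)

for machine, changes in REAL_CHANGE_PERCENT.items():
    print(
        f"{machine}: {count_faster(changes)} of {len(changes)} faster, "
        f"worst case {max(changes):+.2f}%"
    )
```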
In my opinion, the results indicate that the -O3 compiler options have a minimal impact on Brotli's performance and should not be used by default, because they do little to affect the time.