SPO600 Stage 2 Part 3: A quest to find the best compiler option

In this blog post I will show my results on which compiler options have the biggest impact on Brotli’s performance.

But first, I decided to test a bigger file with Brotli.

I ran Brotli on a 1.6 GB CSV file containing mostly lorem ipsum text at each optimization level:


-O0
Real: 27m45.129s
User: 27m42.538s
Sys: 0m1.039s
-O1
Real: 8m53.567s
User: 8m52.045s
Sys: 0m0.839s
-O2
Real: 7m53.465s
User: 7m52.098s
Sys: 0m0.709s
-O3
Real: 7m55.876s
User: 7m54.602s
Sys: 0m0.609s


This is in line with my previous results: -O2 gives the best time, with -O3 slightly behind it.
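To make the gap between the levels easier to see, here is a minimal sketch that converts the real times above into seconds and expresses each one as a speedup over -O0. The helper name `to_seconds` is my own for illustration; it is not part of any tool used in these tests.

```python
import re

def to_seconds(stamp: str) -> float:
    """Convert a bash `time` value such as '27m45.129s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s?", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))

# Real times copied from the results above.
real_times = {
    "-O0": "27m45.129s",
    "-O1": "8m53.567s",
    "-O2": "7m53.465s",
    "-O3": "7m55.876s",
}

baseline = to_seconds(real_times["-O0"])
for level, stamp in real_times.items():
    seconds = to_seconds(stamp)
    print(f"{level}: {seconds:.3f}s, {baseline / seconds:.2f}x the speed of -O0")
```

The jump from -O0 to -O1 accounts for almost all of the improvement; -O2 and -O3 are within a few seconds of each other.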

To find the best compiler options, I needed to run through all the possible combinations.

To do this, I wrote a Python script that generated a list of all the possible option combinations.
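The generator script is not reproduced here, so the following is only a rough sketch of the idea, not the script I actually ran. The flag list is an assumption: it contains just the ten flags that appear in the results further down, while the real list must have been longer, since more than 1,023 combinations were tested on some machines. The name `option_sets` is hypothetical.

```python
from itertools import combinations

# Assumed flag list: only the flags that show up in the results in this post.
CANDIDATE_FLAGS = [
    "-fgcse-after-reload",
    "-fipa-cp-clone",
    "-floop-interchange",
    "-fpeel-loops",
    "-fsplit-paths",
    "-ftree-loop-distribute-patterns",
    "-ftree-loop-distribution",
    "-ftree-partial-pre",
    "-ftree-slp-vectorize",
    "-fvect-cost-model",
]

def option_sets(flags, base="-O2"):
    """Yield every non-empty subset of `flags`, each prefixed with `base`."""
    for size in range(1, len(flags) + 1):
        for subset in combinations(flags, size):
            yield " ".join((base, *subset))

all_options = list(option_sets(CANDIDATE_FLAGS))
print(len(all_options))  # 2**10 - 1 = 1023 combinations
```

The number of combinations doubles with every extra flag, which is a big part of why the test runs take so long.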

I then wrote a bash script that went through each combination and recorded the time.

For each machine I captured the combination with the best total (real) time and the one with the best user-mode time.
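My timing script was written in bash, but the bookkeeping is easy to sketch in Python. This is an illustration under assumptions, not the script I ran: `time_command` and `best_by` are hypothetical names, the two commands are placeholders standing in for "rebuild with these flags, then run Brotli", and the `resource` module only works on Unix-like systems.

```python
import resource
import subprocess
import sys
import time

def time_command(argv):
    """Run `argv` and return (real_seconds, user_seconds) for the child."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    real = time.perf_counter() - start
    user = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime - before
    return real, user

def best_by(results, index):
    """Return the (label, times) pair with the lowest value at `index`."""
    return min(results.items(), key=lambda item: item[1][index])

# Placeholder workloads; a real run would rebuild and run Brotli instead.
commands = {
    "short": [sys.executable, "-c", "pass"],
    "longer": [sys.executable, "-c", "sum(range(2_000_000))"],
}
results = {label: time_command(argv) for label, argv in commands.items()}
print("best real:", best_by(results, 0))
print("best user:", best_by(results, 1))
```

Tracking real and user time separately matters because the winner is not always the same combination for both.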

One thing to note is that the scripts did not finish testing every combination, so some of the data might be skewed. Still, most of the scripts got through enough combinations that I am confident further testing is not needed.

Here is a summary of the results:


*Aarchie with 180mb file
Just with -O2:
real 2m51.523s
user 2m51.247s
Number of options tested: 1284
Real best time: 2m50.799s (CMAKE_CXX_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone)
User best time: 2m50.367s (CMAKE_CXX_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre)
**************************************************************************************************************************
*Bbetty with 180mb file
Just with -O2:
real 0m52.384s
user 0m52.153s
Number of options tested: 1331
Real best time: 52.252s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops)
User best time: 51.924s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops)
**************************************************************************************************************************
*Ccharlie with 1.7gb file
Just with -O2:
real 7m53.515s
user 7m52.077s
Number of options tested: 345
Real best time: 7m52.480s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone)
User best time: 7m51.016s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone)


The best results beat plain -O2 by about a second at most.

Another interesting thing is which flags repeat the most:

  • -ftree-loop-distribution (5 times): Performs loop distribution, which can improve cache performance on large loop bodies and allows further loop optimizations such as vectorization.
  • -ftree-slp-vectorize (5 times): Performs basic block vectorization on trees.
  • -fpeel-loops (4 times): Peels loops when there is enough information that they do not roll much, and turns on complete loop peeling.
  • -fsplit-paths (4 times): Splits paths leading to loop backedges, which can improve dead code elimination and common subexpression elimination.

Reference: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

I find it interesting that the best-performing options all deal with loops. Most of Brotli’s expensive functions do a lot of looping. For example, BrotliIsMostlyUTF8 takes up 25 percent of the time and involves a lot of looping. It can be viewed here https://github.com/google/brotli/blob/35e69fc7cf9421ab04ffc9d52cb36d07fa12984a/c/enc/utf8_util.c. It makes sense that the faster options are the ones that optimize loops, since the program does a lot of looping.

Once I had 6 different option combinations from the ARM-based systems, I decided to test each one individually on xerxes, because I wanted to see whether the results from ARM carried over to the x86 architecture.


Benchmark -O2(180mb file):
Run#1
Real 0m41.442s
User 0m41.200s
Run#2
Real 0m41.404s
User 0m41.136s
AVG:
Real 0m41.423s
User 0m41.168s
-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
Run#1
Real 0m41.502s
User 0m41.246s
Run#2
Real 0m41.402s
User 0m41.143s
AVG
Real 41.452s
User 41.195s
Relative to Real Benchmark Percent: 0.0700094% increase
Relative to User Benchmark Percent: 0.0655849% increase
-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
Run#1
Real 0m41.433s
User 0m41.165s
Run#2
Real 0m41.433s
User 0m41.169s
AVG
Real 41.433s
User 41.167s
Relative to Real Benchmark Percent: 0.0241412% increase
Relative to User Benchmark Percent: 0.00242907% decrease
-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run#1
Real 0m41.379s
User 0m41.109s
Run#2
Real 0m41.372s
User 0m41.113s
AVG
Real 41.376s
User 41.111s
Relative to Real Benchmark Percent: 0.113464% decrease
Relative to User Benchmark Percent: 0.138457% decrease
-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run#1
Real 0m41.984s
User 0m41.735s
Run#2
Real 0m41.367s
User 0m41.123s
AVG
Real 41.676s
User 41.429s
Relative to Real Benchmark Percent: 0.610772% increase
Relative to User Benchmark Percent: 0.633988% increase
-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run#1
Real 0m41.407s
User 0m41.151s
Run#2
Real 0m41.369s
User 0m41.136s
AVG
Real 41.388s
User 41.144s
Relative to Real Benchmark Percent: 0.0844941% decrease
Relative to User Benchmark Percent: 0.0582977% decrease
-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run#1
Real 0m41.337s
User 0m41.065s
Run#2
Real 0m41.337s
User 0m41.087s
AVG
Real 41.337s
User 41.076s
Relative to Real Benchmark Percent: 0.207614% decrease
Relative to User Benchmark Percent: 0.223475% decrease
*Increases are worse times and Decreases are better times


I was surprised by the results of my testing. Some of the options don’t work nearly as well on x86. By user time, two of the 6 combinations did worse than the benchmark and the other 4 beat it; by real time the split is three and three. In addition, the percentage differences are minuscule: the best combination beat the benchmark by less than 1%, and the ones that did worse were also within 1% of it. Adding options on xerxes didn’t really impact performance much.
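For reference, the percentages in these tables are the change of each averaged time relative to the plain -O2 average. Here is a small sketch of that arithmetic using the first combination on xerxes. The function names are just for illustration, and I am assuming the averages are rounded to three decimals before the percentage is taken, which is how they appear in the tables.

```python
def average(values):
    """Arithmetic mean of a list of run times in seconds."""
    return sum(values) / len(values)

def percent_change(value, baseline):
    """Positive means slower than the baseline, negative means faster."""
    return (value - baseline) / baseline * 100

baseline_real = average([41.442, 41.404])   # plain -O2 on xerxes
candidate_real = average([41.502, 41.402])  # first flag combination

change = percent_change(round(candidate_real, 3), round(baseline_real, 3))
print(f"{change:+.4f}%")  # +0.0700%
```

With only two runs per combination, a single slow run (like the 41.984s one in the fourth combination) moves the average by more than most of the differences being measured.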

I also wanted to compare Aarchie and Bbetty. They have different microarchitectures, and I wanted to see how the options behave on each. I used a 180mb file for this testing.


Aarchie:
Just with -O2:
Run #1
Real 0m49.034s
User 0m48.777s
Run #2
Real 0m49.065s
User 0m48.691s
AVG
Real 49.050s
User 48.734s
-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
Run #1
Real 0m49.201s
User 0m48.911s
Run #2
Real 0m49.176s
User 0m48.888s
AVG
Real 49.189s
User 48.900s
Relative to Real Benchmark Percent: 0.283384% increase
Relative to User Benchmark Percent: 0.340625% increase
-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
Run #1
Real 0m49.536s
User 0m49.313s
Run #2
Real 0m49.541s
User 0m49.363s
AVG
Real 49.539s
User 49.338s
Relative to Real Benchmark Percent: 0.996942% increase
Relative to User Benchmark Percent: 1.23938% increase
-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m49.271s
User 0m49.054s
Run #2
Real 0m49.047s
User 0m48.728s
AVG
Real 49.159s
User 48.891s
Relative to Real Benchmark Percent: 0.222222% increase
Relative to User Benchmark Percent: 0.322157% increase
-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m49.062s
User 0m48.838s
Run #2
Real 0m48.924s
User 0m48.677s
AVG
Real 48.993s
User 48.758s
Relative to Real Benchmark Percent: 0.116208% decrease
Relative to User Benchmark Percent: 0.0492469% increase
-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m48.967s
User 0m48.729s
Run #2
Real 0m49.039s
User 0m48.802s
AVG
Real 49.003s
User 48.766s
Relative to Real Benchmark Percent: 0.0958206% decrease
Relative to User Benchmark Percent: 0.0656626% increase
-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m49.026s
User 0m48.833s
Run #2
Real 0m48.989s
User 0m48.649s
AVG
Real 48.930s
User 48.741s
Relative to Real Benchmark Percent: 0.244648% decrease
Relative to User Benchmark Percent: 0.0143637% increase
Bbetty:
Just with -O2:
Run#1
Real 0m52.795s
User 0m52.556s
Run#2
Real 0m52.760s
User 0m52.476s
AVG
Real 52.776s
User 52.516s
-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
Run #1
Real 0m53.189s
User 0m52.881s
Run #2
Real 0m53.224s
User 0m52.196s
AVG
Real 53.207s
User 52.539s
Relative to Real Benchmark Percent: 0.816659% increase
Relative to User Benchmark Percent: 0.0437962% increase
-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
Run #1
Real 0m54.139s
User 0m53.926s
Run #2
Real 0m54.198s
User 0m53.975s
AVG
Real 54.169s
User 53.951s
Relative to Real Benchmark Percent: 2.63946% increase
Relative to User Benchmark Percent: 2.7325% increase
-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m52.880s
User 0m52.629s
Run #2
Real 0m52.774s
User 0m52.559s
AVG
Real 52.827s
User 52.594s
Relative to Real Benchmark Percent: 0.0966348% increase
Relative to User Benchmark Percent: 0.148526% increase
-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
Run #1
Real 0m52.853s
User 0m52.562s
Run #2
Real 0m52.847s
User 0m52.537s
AVG
Real 52.850s
User 52.550s
Relative to Real Benchmark Percent: 0.140215% increase
Relative to User Benchmark Percent: 0.0647422% increase
-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m52.828s
User 0m52.534s
Run #2
Real 0m52.848s
User 0m52.548s
AVG
Real 52.838s
User 52.541s
Relative to Real Benchmark Percent: 0.117478% increase
Relative to User Benchmark Percent: 0.0476045% increase
-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
Run #1
Real 0m52.780s
User 0m52.578s
Run #2
Real 0m52.795s
User 0m52.543s
AVG
Real 52.788s
User 52.561s
Relative to Real Benchmark Percent: 0.0227376% increase
Relative to User Benchmark Percent: 0.0856882% increase

An interesting thing to note is that Aarchie has a slight performance advantage over Bbetty. I believe that is because Aarchie has a better CPU. The results on Aarchie and Bbetty are also inconsistent with each other. None of the combinations on Bbetty beat the benchmark, and one of them was more than 2% slower. Aarchie, on the other hand, had 3 combinations with slightly better real times than the benchmark. It looks like the slight differences in microarchitecture make a big difference to how these options affect Brotli.

In my opinion, the results indicate that the -O3 compiler options make a minimal impact on Brotli’s performance and should not be used by default, because they do little to affect the time.
