SPO600 Stage 2 Part 3: A quest to find the best compiler option

In this blog post I will show my results on which compiler options make the biggest impact on Brotli's performance.

But first, I decided to test a bigger file with Brotli.

I ran Brotli on a 1.6 GB CSV file containing mostly lorem ipsum text:

	-O0
	Real: 27m45.129s
	User: 27m42.538s
	Sys: 0m1.039s

	-O1
	Real: 8m53.567s
	User: 8m52.045s
	Sys: 0m0.839s

	-O2
	Real: 7m53.465s
	User: 7m52.098s
	Sys: 0m0.709s

	-O3
	Real: 7m55.876s
	User: 7m54.602s
	Sys: 0m0.609s

This falls in line with my previous results: -O2 gives the best performance, and -O3 lags slightly behind it.

Anyway, to find the best compiler options I needed to run through all the possible combinations.

To do this I created a Python script that generated a list of all the possible option combinations.
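As a rough illustration of that step, here is a minimal sketch of how such a list could be generated. It is not the original script: the `FLAGS` list is an assumption that only contains flags appearing in the results later in this post, and `flag_combinations` is a hypothetical helper name.

```python
from itertools import combinations

# Assumption: only the flags that show up in the results in this post.
# The real list of candidate flags may well have been longer.
FLAGS = [
    "-fgcse-after-reload",
    "-ftree-loop-distribution",
    "-ftree-loop-distribute-patterns",
    "-floop-interchange",
    "-fsplit-paths",
    "-ftree-slp-vectorize",
    "-fvect-cost-model",
    "-ftree-partial-pre",
    "-fpeel-loops",
    "-fipa-cp-clone",
]


def flag_combinations(flags, base="-O2"):
    """Yield every non-empty subset of flags, in list order, prefixed with base."""
    for size in range(1, len(flags) + 1):
        for combo in combinations(flags, size):
            yield " ".join((base,) + combo)


if __name__ == "__main__":
    all_options = list(flag_combinations(FLAGS))
    # Show how many candidate flag strings there are, plus one sample.
    print(len(all_options))
    print(all_options[0])
```

The number of combinations doubles with every extra flag, which is one reason an exhaustive run can take a very long time.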

I then created a bash script that went through each combination and reported the time it took.
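The timing loop itself was a bash script, but the idea can be sketched in Python as well. This is only a stand-in under some assumptions: `time_command` and `best_by` are hypothetical names, the rebuild of Brotli with each set of flags is left out, and the `resource` module means it only runs on Unix-like systems.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (real_seconds, user_seconds), like bash `time`."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User CPU time consumed by the child process we just waited for.
    return real, after.ru_utime - before.ru_utime


def best_by(results, index):
    """Return the (label, times) pair with the smallest times[index]."""
    return min(results, key=lambda item: item[1][index])


if __name__ == "__main__":
    # Placeholder command; in the real test this would be the Brotli binary
    # built with one particular combination of flags.
    command = [sys.executable, "-c", "pass"]
    results = [("placeholder", time_command(command))]
    print("best real:", best_by(results, 0))
    print("best user:", best_by(results, 1))
```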

I decided to capture the best total (real) time and the best user-mode time.

One thing to note about the tests is that the scripts did not finish their runs, so some of the data might be skewed. Most of the scripts got through enough options that I am confident further testing is not needed.

Here is a summary of the results:

	*Aarchie with 180mb file
	Just with -O2:
	real 2m51.523s
	user 2m51.247s

	Number of options tested: 1284
	Real best time: 2m50.799s (CMAKE_CXX_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone)
	User best time: 2m50.367s (CMAKE_CXX_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre)

	**************************************************************************************************************************

	*Bbetty with 180mb file
	Just with -O2:
	real 0m52.384s
	user 0m52.153s

	Number of options tested: 1331
	Real best time: 52.252s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops)
	User best time: 51.924s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops)

	**************************************************************************************************************************

	*Ccharlie with 1.7gb file
	Just with -O2:
	real 7m53.515s
	user 7m52.077s

	Number of options tested: 345
	Real best time: 7m52.480s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone)
	User best time: 7m51.016s (CMAKE_C_FLAGS_DEBUG:STRING= -O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone)

The best results were only about a second or less faster than plain -O2.

Another interesting thing is which flags were repeated the most:

-ftree-loop-distribution (5 times): Performs loop distribution, which can allow further loop optimizations such as vectorization to take place.
-ftree-slp-vectorize (5 times): Performs basic block vectorization on trees.
-fpeel-loops (4 times): Peels loops when there is enough information that they do not roll much, and turns on complete loop peeling.
-fsplit-paths (4 times): Splits paths leading to loop backedges. It can improve dead code elimination.

Reference: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

I find it interesting that the best-performing options all deal with loops. Most of Brotli’s expensive functions do a lot of looping. For example, BrotliIsMostlyUTF8 takes up 25 percent of the time and involves a lot of looping. It can be viewed here: https://github.com/google/brotli/blob/35e69fc7cf9421ab04ffc9d52cb36d07fa12984a/c/enc/utf8_util.c. It makes sense that the faster options are the ones that optimize loops, as the program does a lot of looping.

Once I had the 6 option combinations from the ARM-based systems, I tested each one individually on xerxes, because I wanted to see whether the results from ARM carried over to the x86 architecture.

	Benchmark -O2(180mb file):
	Run#1
	Real 0m41.442s
	User 0m41.200s

	Run#2
	Real 0m41.404s
	User 0m41.136s

	AVG:
	Real 0m41.423s
	User 0m41.168s

	-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
	Run#1
	Real 0m41.502s
	User 0m41.246s

	Run#2
	Real 0m41.402s
	User 0m41.143s

	AVG
	Real 41.452s
	User 41.195s
	Relative to Real Benchmark Percent: 0.0700094% increase
	Relative to User Benchmark Percent: 0.0655849% increase

	-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
	Run#1
	Real 0m41.433s
	User 0m41.165s

	Run#2
	Real 0m41.433s
	User 0m41.169s

	AVG
	Real 41.433s
	User 41.167s
	Relative to Real Benchmark Percent: 0.0241412% increase
	Relative to User Benchmark Percent: 0.00242907% decrease

	-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
	Run#1
	Real 0m41.379s
	User 0m41.109s

	Run#2
	Real 0m41.372s
	User 0m41.113s

	AVG
	Real 41.376s
	User 41.111s
	Relative to Real Benchmark Percent: 0.113464% decrease
	Relative to User Benchmark Percent: 0.138457% decrease

	-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
	Run#1
	Real 0m41.984s
	User 0m41.735s

	Run#2
	Real 0m41.367s
	User 0m41.123s

	AVG
	Real 41.676s
	User 41.429s
	Relative to Real Benchmark Percent: 0.610772% increase
	Relative to User Benchmark Percent: 0.633988% increase

	-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
	Run#1
	Real 0m41.407s
	User 0m41.151s

	Run#2
	Real 0m41.369s
	User 0m41.136s

	AVG
	Real 41.388s
	User 41.144s
	Relative to Real Benchmark Percent: 0.0844941% decrease
	Relative to User Benchmark Percent: 0.0582977% decrease

	-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
	Run#1
	Real 0m41.337s
	User 0m41.065s

	Run#2
	Real 0m41.337s
	User 0m41.087s

	AVG
	Real 41.337s
	User 41.076s
	Relative to Real Benchmark Percent: 0.207614% decrease
	Relative to User Benchmark Percent: 0.223475% decrease

	*Increases are worse times and Decreases are better times

I was surprised by the results of my testing. It seems that some of the options don’t work nearly as well on the x86 architecture. On user time, two of the 6 combinations did worse than the benchmark and the other 4 beat it; on real time the split was three and three. In addition, the differences relative to the benchmark are minuscule: the best times beat the benchmark by less than 1%, and the options that did worse were also within 1% of it. Adding options on xerxes didn’t really impact the performance.
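For reference, the averages and percentages in the listing above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical helper names (`to_seconds`, `mean_seconds`, `percent_vs_baseline`); the sample values are the xerxes -O2 benchmark runs and the runs of the last combination.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` stamp such as '0m41.442s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def mean_seconds(stamps):
    """Average a list of time stamps, in seconds."""
    return sum(to_seconds(stamp) for stamp in stamps) / len(stamps)


def percent_vs_baseline(value, baseline):
    """Positive means slower than the baseline, negative means faster."""
    return (value - baseline) / baseline * 100.0


# Real times from the xerxes listing: plain -O2, then the last combination.
baseline = mean_seconds(["0m41.442s", "0m41.404s"])
candidate = mean_seconds(["0m41.337s", "0m41.337s"])
print(f"{percent_vs_baseline(candidate, baseline):+.3f}%")
```

With only two runs per combination, differences this small are hard to tell apart from run-to-run noise, so more runs per option would make the comparison more trustworthy.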

I also wanted to compare Aarchie and Bbetty. They have different microarchitectures, and I wanted to see how the different options behave on each of them. I used a 180 MB file for this testing.

	Aarchie:
	Just with -O2:

	Run #1
	Real 0m49.034s
	User 0m48.777s
	Run #2
	Real 0m49.065s
	User 0m48.691s

	AVG
	Real 49.050s
	User 48.734s

	-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
	Run #1
	Real 0m49.201s
	User 0m48.911s

	Run #2
	Real 0m49.176s
	User 0m48.888s

	AVG
	Real 49.189s
	User 48.900s
	Relative to Real Benchmark Percent: 0.283384% increase
	Relative to User Benchmark Percent: 0.340625% increase

	-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre

	Run #1
	Real 0m49.536s
	User 0m49.313s

	Run #2
	Real 0m49.541s
	User 0m49.363s

	AVG
	Real 49.539s
	User 49.338s
	Relative to Real Benchmark Percent: 0.996942% increase
	Relative to User Benchmark Percent: 1.23938% increase

	-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
	Run #1
	Real 0m49.271s
	User 0m49.054s

	Run #2
	Real 0m49.047s
	User 0m48.728s

	AVG
	Real 49.159s
	User 48.891s
	Relative to Real Benchmark Percent: 0.222222% increase
	Relative to User Benchmark Percent: 0.322157% increase

	-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
	Run #1
	Real 0m49.062s
	User 0m48.838s

	Run #2
	Real 0m48.924s
	User 0m48.677s

	AVG
	Real 48.993s
	User 48.758s
	Relative to Real Benchmark Percent: 0.116208% decrease
	Relative to User Benchmark Percent: 0.0492469% increase

	-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
	Run #1
	Real 0m48.967s
	User 0m48.729s

	Run #2
	Real 0m49.039s
	User 0m48.802s

	AVG
	Real 49.003s
	User 48.766s
	Relative to Real Benchmark Percent: 0.0958206% decrease
	Relative to User Benchmark Percent: 0.0656626% increase

	-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
	Run #1
	Real 0m49.026s
	User 0m48.833s

	Run #2
	Real 0m48.989s
	User 0m48.649s

	AVG
	Real 48.930
	User 48.741
	Relative to Real Benchmark Percent: 0.244648% decrease
	Relative to User Benchmark Percent: 0.0143637% increase

	Bbetty:
	Just with -O2:

	Run#1
	Real 0m52.795s
	User 0m52.556s

	Run#2
	Real 0m52.760s
	User 0m52.476s

	AVG
	Real 52.776s
	User 52.516s

	-O2 -ftree-loop-distribute-patterns -floop-interchange -ftree-slp-vectorize -fipa-cp-clone
	Run #1
	Real 0m53.189s
	User 0m52.881s

	Run #2
	Real 0m53.224s
	User 0m52.196s

	AVG
	Real 53.207s
	User 52.539s
	Relative to Real Benchmark Percent: 0.816659% increase
	Relative to User Benchmark Percent: 0.0437962% increase

	-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fvect-cost-model -ftree-partial-pre
	Run #1
	Real 0m54.139s
	User 0m53.926s

	Run #2
	Real 0m54.198s
	User 0m53.975s

	AVG
	Real 54.169s
	User 53.951s
	Relative to Real Benchmark Percent: 2.63946% increase
	Relative to User Benchmark Percent: 2.7325% increase

	-O2 -fgcse-after-reload -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
	Run #1
	Real 0m52.880s
	User 0m52.629s

	Run #2
	Real 0m52.774s
	User 0m52.559s

	AVG
	Real 52.827s
	User 52.594s
	Relative to Real Benchmark Percent: 0.0966348% increase
	Relative to User Benchmark Percent: 0.148526% increase

	-O2 -ftree-loop-distribution -fsplit-paths -ftree-slp-vectorize -fpeel-loops
	Run #1
	Real 0m52.853s
	User 0m52.562s

	Run #2
	Real 0m52.847s
	User 0m52.537s

	AVG
	Real 52.850s
	User 52.550s
	Relative to Real Benchmark Percent: 0.140215% increase
	Relative to User Benchmark Percent: 0.0647422% increase

	-O2 -ftree-partial-pre -fpeel-loops -fipa-cp-clone
	Run #1
	Real 0m52.828s
	User 0m52.534s

	Run #2
	Real 0m52.848s
	User 0m52.548s

	AVG
	Real 52.838s
	User 52.541s
	Relative to Real Benchmark Percent: 0.117478% increase
	Relative to User Benchmark Percent: 0.0476045% increase

	-O2 -ftree-loop-distribution -floop-interchange -ftree-partial-pre -fpeel-loops -fipa-cp-clone
	Run #1
	Real 0m52.780s
	User 0m52.578s

	Run #2
	Real 0m52.795s
	User 0m52.543s

	AVG
	Real 52.788s
	User 52.561s
	Relative to Real Benchmark Percent: 0.0227376% increase
	Relative to User Benchmark Percent: 0.0856882% increase

An interesting thing to note is that Aarchie has a slight performance advantage over Bbetty, which I believe is because Aarchie has a better CPU. The results on Aarchie and Bbetty are also inconsistent with each other. On Bbetty every option failed to beat the benchmark, and one of them was more than 2% worse. Aarchie, on the other hand, had 3 options with slightly better real times than the benchmark. It looks like slight differences in microarchitecture change how these options affect Brotli.
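The comparison can be checked directly from the averaged real times listed above. The following is a small sketch; the dictionary layout and the helper names `count_faster` and `worst_slowdown_percent` are my own choices for illustration, and the numbers are the AVG real times from the listing.

```python
# Averaged real times in seconds, copied from the listing above.
REAL_AVERAGES = {
    "Aarchie": {
        "baseline": 49.050,
        "options": [49.189, 49.539, 49.159, 48.993, 49.003, 48.930],
    },
    "Bbetty": {
        "baseline": 52.776,
        "options": [53.207, 54.169, 52.827, 52.850, 52.838, 52.788],
    },
}


def count_faster(baseline, options):
    """Count how many option sets finished faster than the baseline."""
    return sum(1 for value in options if value < baseline)


def worst_slowdown_percent(baseline, options):
    """Largest slowdown relative to the baseline, in percent."""
    return max((value - baseline) / baseline * 100.0 for value in options)


for machine, data in REAL_AVERAGES.items():
    faster = count_faster(data["baseline"], data["options"])
    worst = worst_slowdown_percent(data["baseline"], data["options"])
    print(f"{machine}: {faster} of {len(data['options'])} faster, "
          f"worst slowdown {worst:.2f}%")
```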

In my opinion, the results indicate that the -O3 compiler options make a minimal impact on Brotli’s performance and should not be enabled by default, because they do little to improve the time.

Published by mattprogrammingblog
