Acovea: Analysis of Compiler Options via Evolutionary Algorithm

Using Natural Selection to Investigate Software Complexities


by Scott Robert Ladd
30 May 2004
 


Related Articles
Acovea Overview
Analysis: Acovea 4.0.0 Pentium 4 and Opteron
Acovea FAQ
A Description of the Genetic Algorithm
Analysis: Acovea 3.3.0 on Pentium 4 (Dec 2003)

Abstract

ACOVEA (Analysis of Compiler Options via Evolutionary Algorithm) implements a genetic algorithm to find the "best" options for compiling programs with the GNU Compiler Collection (GCC) C and C++ compilers. "Best", in this context, is defined as those options that produce the fastest executable program from a given source code. Acovea is a C++ framework that can be extended to test other programming languages and non-GCC compilers.

I envision Acovea as an optimization tool, similar in purpose to profiling. Traditional function-level profiling identifies the algorithms most influential in a program's performance; Acovea is then applied to those algorithms to find the compiler flags and options that generate the fastest code. Acovea is also useful for testing combinations of flags for pessimistic interactions, and for testing the reliability of the compiler.

This article describes the application of Acovea to the analysis of the GNU C compiler optimization flags; details about the underlying genetic algorithm can be found in the companion article,  A Description of the Evolutionary Algorithm.

 

Modern software is difficult to understand and verify by traditional means. Millions of lines of code produce applications containing intricate interactions, defying simple description or brute-force investigation. A guided, deterministic approach to testing relies on human testers to envision every possible combination of actions -- an unrealistic proposition given software complexity. Yet, despite that complexity, we need answers to important questions about modern, large-scale software.

What sort of important questions? Consider the GNU Compiler Collection. I write articles that benchmark code generation, a task fraught with difficulties due to the myriad options provided by different compilers. For my benchmarks to have any meaning, I need to know which combination of options produces the fastest code for a given application.

Finding the "best" set of options sounds like a simple task, given the extent of GCC documentation and the conventional wisdom of the GCC developer community. Ah, if it were only so easy! The GCC documentation, while extensive, is also honestly imprecise. I appreciate this style of documentation; unlike many commercial vendors, who make absolute statements about the "quality" of their products, GCC's documenters admit uncertainties in how various options alter code generation. Indeed, code generation is entirely dependent on the type of application being compiled and the target platform. An option that produces fast executable code for one source code may be detrimental to the performance of another program.

"Conventional wisdom" arrives in my inbox whenever I publish a new article. Ranging from the polite to the insistent to the rude, these e-mails contain contradictory suggestions for producing fast code. In the vast majority of cases, such anecdotal assertions lack any formal proof of their validity, and, more often than not, the suggested "improvement" is ineffective or detrimental. It has become increasingly obvious that no one --myself included -- knows precisely how all these GCC options work together in generating program code.

I seek the Holy Grail of Optimization -- but exactly what is optimization? Understanding the problem is the first step in finding a solution.

What is Optimization?

Optimization attempts to produce the "best" machine code from source code. "Best" means different things to different applications; a database shovels chunks of information, while a scientific application is concerned with fast and accurate results; the first concern for an embedded system may be code size. And it is quite possible that small code is fast, or fast code accurate. Optimization is far from being an exact science, given the diversity of hardware and software configurations.

An optimization algorithm may be as simple as removing a loop invariant, or as complex as examining an entire program to eliminate global common sub-expressions. Many optimizations change what the programmer wrote into a more efficient form, producing the same result while altering underlying details for efficiency; other "optimizations" produce code that uses specific characteristics of the underlying hardware, such as special instruction sets.

Memory architectures, pipelines, on- and off-chip caches -- all affect code performance in ways that are not obvious to programmers using a high-level language. An optimization that may seem to produce faster code may, in fact, create large code that causes more cache misses, thus degrading performance.

 Table 1: Options implied by composite optimization switches 
Type Optimization level Implied Options
Speed -O1
also -O
-fmerge-constants
-fdefer-pop
-fthread-jumps
-fdelayed-branch
-fomit-frame-pointer
-fguess-branch-prob
-fcprop-registers
-fif-conversion
-fif-conversion2
-fdelayed-branch
-floop-optimize
-fcrossjumping
-O2
including -O1
-foptimize-sibling-calls
-fcse-follow-jumps
-fcse-skip-blocks
-fgcse
-fexpensive-optimizations
-fstrength-reduce
-frerun-cse-after-loop
-frerun-loop-opt
-fcaller-saves
-fforce-mem
-fpeephole2
-fschedule-insns
-fschedule-insns-after-reload
-fregmove
-fstrict-aliasing
-fdelete-null-pointer-checks
-freorder-blocks
-funit-at-a-time
-falign-loops
-falign-jumps
-falign-labels
-falign-functions
-O3
including -O2
-finline-functions
-frename-registers
-fweb
additional
options
  -funroll-loops
-funroll-all-loops
-ffloat-store
-fprefetch-loop-arrays
-fmove-all-movables
-freduce-all-givs
-fno-inline
-fpeel-loops
-ftracer
-fnew-ra
-funswitch-loops
-fbranch-target-load-optimize
-fbranch-target-load-optimize2
Floating
Point
-ffast-math -funsafe-math-optimizations
-fno-math-errno
-fno-trapping-math
-ffinite-math-only
-no-signalling-nans
Pentium 4 all compiles assume
-march=pentium4
which implies
-mmmx, -msse, and
-msse2
-mieee-fp
-malign-double
-mno-push-args
-maccumulate-outgoing-args
-momit-leaf-frame-pointer
-minline-all-stringops
-mfpmath=
UltraSparc all compiles assume
-mcpu=ultrasparc
-mno-app-regs
-mflat

Even the best hand-tuned C code contains areas of interpretation; there is no absolute, one-to-one correspondence between C statements and machine instructions. Almost any sequence of source code can be compiled into different -- but functionally equivalent -- machine instruction streams with different sizes and performance characteristics. Inlining functions is a classic example of this phenomena: replacing a call to a function with the function code itself may produce a faster program, but may also increase program size. Increased program size, may, in turn, prevent an algorithm from fitting inside high-speed cache memory, thus slowing a program due to cache misses. Notice my use of the weasel word "may" -- inlining small functions sometimes allows other optimization algorithms a chance to further improve code for local conditions, producing faster and smaller code.

Optimization is not simple or obvious, and combinations of algorithms can lead to unexpected results. Which brings me back to the question: For any given application, what are the most effective optimization options?

And the answer is... inscrutable

As you can see from Table 1, the GNU C compiler supports many, many options, dozens of which influence code generation on the ubiquitous 32-bit Intel Architecture (ia32). While the GCC documentation gives a short description of each option, not much (if anything) is said about how option -mAAA affects option -mBBB. Nor do all the side effects get listed; for example, I have yet to find any "official" documentation that lists the optimization switches invoked by the catch-all -O? (what I call "composite") options.

When first approaching this article, I perused the GCC documentation, studied the compiler source code, and compiled a list of potential options. Realizing that I didn't know which options were implied by those "composite" options, I looked in the GCC source file toplev.c; the result of my investigation is Table 1. -O3 includes all options implied by -O2, which in turn includes everything defined by -O1. The -ffast-math option is also a composite of several flags. Documentation for Table 1's options can be found here, at the official GCC web site.

Note: I've been working on this article off-and-on since late 2002; during that time, freshmeat.net published Joao Seabra's GCC Myths and Facts, which also lists the options enabled by the -O flags. I created Table 1 directly from the original GCC 3.4 source code, without knowing about Joao's article. It's good to know that I'm not the only person who's investigating how the GNU tools work.

Are the settings for -O1, -O2, and -O3 reasonable, or should they be changed? The only way to find an empirical answer is to run another set of tests with various combinations of individual optimization algorithms. Even if we assume that -O3 is the "optimal" set of options, which of the experimental and target-specific options will be most effective? Again, the only way to find out is trial and error, a tedious process for humans. It seems to me that most people guess -- since they're using a Pentium 4, they assume that -O3 -mfpmath=sse will generate faster code than will a plain -O3. I've made those sorts of guesses in the past, but they haven't always been correct.

GCC supports different switch sets for different languages and processors; in the case of the Pentium 4, for example, more than 60 different switches affect code optimization and generation. Testing every possible combination of those switches is simply impractical. Even at the unattainable theoretical speed of one test per second, it would take over hundreds of billions of years to run the complete set of possibilities!

My computer, GCC (and I!) will be quite obsolete before such an experiment could be finished. Clearly, a brute-force approach is out of the question. And it isn't possible to test options in isolation, either; optimization algorithms act on each other, requiring examination of different combinations. I might be able to make some educated guesses about specific options, but it is highly unlikely that I can predict the consequences of complex option interactions. 

Complexity is associated with the problem domain! At times like these, I consider using a genetic algorithm.

A What?

I've been using genetic algorithms (GA) for over a decade in various ways, and even wrote a book about them way back in 1996. You'll find GAs being applied across the computing spectrum, from network optimization to the design of aircraft wings. For the most part, applications of GAs tend to be rather specialized; they are only rarely seen in more pedestrian, general-purpose applications.

A particular efficacy of genetic algorithms is their ability to search large solution spaces. Clearly, quadrillions (in U.s. reckoning) of possibilities can be considered "large", and a GA looked to be a natural tool for satiating my curiosity. I've recently begun using GAs in experiments that test software reliability and performance; the GCC "option" question provided an excellent opportunity to put my theories to work on a practical problem.

I'm not going to introduce genetic algorithms in this article; details about the underlying genetic algorithm can be found in the companion article, Acovea: Analysis of Compiler Options via Genetic Algorithm. In a nutshell, a genetic algorithm is an iterative process based loosely on biological evolution through natural selection. A population of "organisms" is tested against a given problem and assigned a "fitness" value based on a level of success; a new generation of "organisms" is then generated, with higher-fitness "organisms" having the best chance of "breeding." Mutations may be introduced into members of the new population. Then the process repeats, until a specific number of generations are run or a desired outcome is reached. If you have difficulty understanding the mechanism of an evolutionary algorithm, please read my articles on the subject, or read Melanie Mitchell's excellent book, An Introduction to Genetic Algorithms (MIT, 1996).

The C++ code for this article is based on the latest version of my Evocosm class library.

The Acovea GA itself is straight-forward: The initial population is built from random sets of compiler options (organisms), as provided by a compiler object. As of Avocea 4.0.0, a compiler is defined by an XML file; the "mechanics" section below contains more detail on how this works.

In each generation, gccacovea compiles an executable image for each set of options. When run, the benchmarks time their critical loop internally using clock_gettime, piping the result back to gccacovea as their "fitness". Smaller fitnesses (faster run times) are "more fit", and more likely to reproduce; mutation and migration between populations introduces variations to prevent populations from stagnating. In essence, finding the best options for GCC is a classic minimization problem. 

Building Acovea

Configuration and building proceeds along the usual lines:

    ./configure
    ./make
    ./make install

To build Acovea, you will also need the Coyotl and Evocosm libraries.

The installation creates a program, runacovea, installed by default to /usr/local/bin. Running this program without arguments will display a command summary.

For parsing the compiler configuration files, which are XML, Acovea uses the expat library. Your distribution likely installed expat; if not, its home page is http://expat.sourceforge.net/

The installation directories can be changed using the --prefix option when invoking configure.

To run Acovea, use a command of the form:

    runacovea -config gcc34_opteron.acovea -bench huffbench.c

XML Compiler Configurations

The configuration file is an XML document that sets the default and evolvable options and commands for a given compiler and platform. For example, the gcc34_opteron.acovea configuration is for GCC version 3.4 running on a 64-bit AMD Opteron system.

A sample set of GCC C compiler configuration files is located (by default) in /usr/local/share/acovea/config. If the given configuration file can not be found based on the explicit name given, Acovea will attempt to locate the file in the installation directory above.

I provide configurations gcc and g++ for Pentium 3, Pentium 4, and Opteron processors using GCC 3.3 and 3.4. As I develop more configuration files (or if I received them from other users), I'll post the additional .acovea files on the web site.

Acovea-style Benchmarks

The benchmark file is a program in a language appropriate for the chosen configuration file; it must write its fitness value to standard output, where it will be read via pipe by the Acovea framework.

A sample set of C benchmark programs is located (by default) in /usr/local/share/acovea/benchmarks If the given benchmark program file can not be found based on the explicit name given, Acovea will attempt to locate the benchmark in the installation directory above.

Acovea was designed with "spikes" in mind -- short programs that implement a limited algorithm that runs quickly (2-10 seconds per run). Using the default evolutionary settings, Acovea will perform 4,000 compiles and runs of the benchmark; even with a very small and fast benchmark, this can take several hours. If your program takes a full minute to compiler, and another minute to run, Acovea will require 8,000 minutes, or more than 5.5 days to evolve optimal option sets!

However, due to popular demand, a future version of Acovea will support the use of Makefiles and an optional timing mechanism for large applications.

About Analysis

As the genetic algorithm cycles through the generations, it refines the best set of options through natural selection; options that produce fast code will occur more often, while adverse option will tend to be winnowed away. Options that have no influence on code speed will occur, due to randomness, about 50% of the time. We can't simply say that the "best" flags are those enabled by the best organism, because it might include "null" options that don't affect performance.

So, how do we tell which options are truly important amid the noise of "null" options?

As it runs, Acovea counts the number of times an option is enabled by the best chromosome in each generation of each population. By default, Acovea evolves five populations of forty individuals for twenty generations; if an option is enable by every organism in every generation, it will accumulate a count of 50; the higher the count, the more often the option was enabled, and the more important it is for producing fast code on the given benchmark. Conversely, an option that is detrimental will appear very few times (or not at all), while neutral options (those that have no effect on code speed) should occur an average number of times (because it doesn't matter whether they are enabled or not).

When a run of Acovea is finished, I take the final totals and calculate a "z score" for each option. The z score measures the distance of a value from a population's average, in units of standard deviation. If an option's z score is greater than 1, I assume it may be beneficial, while a z score of -1 or less indicates a detrimental option. In general, larger runs (more, bigger populations run over longer periods of time) will accentuate the differences between "good" and "bad" options, with some attaining z scores of 3 or more.

After running gccacovea, I first test the fastest set of options from the last generation. I then hand-modify that option set based on the final statistical analysis and a check of options common to all populations in the final generation. It takes a few minutes to verify and refine the results of a run. In the vast majority of cases, gccacovea evolves option sets that perform better than -O3 alone, either by evolving a reduced set of options (when evolving from -O1), or by finding effective experimental and target-specific options for combination with -O3.

When analyzing Acovea runs, I look at three aspects:

  1. Which optimization level produces the best code? For example, does -O3 provide any real benefit over -O1 and -O2?
  2. Can Acovea find a set of option that is better than the existing "predefined" settings of -O1, -O2, and -O3?
  3. Which options does Acovea identify as important in producing the fastest code for a given algorithm?

Drawing Conclusions

Back in December 2003, I published my original presentation of Acovea results. Since then, I have refined the program in several ways while testing it on new hardware.

Results & Analysis: Acovea 4.0.0 on Pentium 4 and Opteron
Published 30 March 2004, this article describes results obtained when running Acovea 4.0.0 with a prerelease version of GCC 3.4.0 on a 32-bit Pentium 4 and a 64-bit Opteron system; both computers were running Linux kernel 2.6.3.

Results & Analysis: Acovea 3.3.0 on Pentium 4
Published in December, 2003, this article describes results obtained when running Acovea 4.0.0 with a prerelease version of GCC 3.4.0 on a 32-bit Pentium 4, which was running a development version of the 2.6 Linux kernel.

As always, I look forward to considered comments.

-- Scott


Copyright 2004
Scott Robert Ladd
All rights reserved.