This page describes how to setup and use PerfTools. See also some related pages:

Intro:

PerfTools help one to identify spots in a program that are responsible for CPU consumption. This is done by periodically taking a "snapshot" of the program's call stack throughout its execution and collecting statistics on the number of times each function is seen on the stack, which function is calling it and which functions it is calling. These snapshots are collected into a graph showing, statistically, which paths of execution are most expensive. These graphs can then be examined to gain an understanding of where a program is spending its time. From this, it may be possible to optimize computationally expensive algorithms and thus speed up the running of the program.

This method differs from other code profilers such as gprof in that no special compilation is needed, more of the functions are seen and the results are easier to interpret.. However, its running is limited to Intel x86 (and compatible) platforms.

The PerfTools package is composed of two parts. The first (and most important) was written by Jim Kowalkowski and consists of:

The second part was writen by Brett Viren and consists of a main program (ptrun.py) and a suite of Python classes which: The rest of this will describe the use of PerfTools through the running of ptrun.py in the context of MINOS SRT (although the dependency on an SRT environment is very weak).

Prerequisites

  1. ELFIO library

    The ELFIO library is used by Jim's code to produce the "snapshots" of the call stack. You can get it from http://elfio.sourceforge.net/ It is recommended to install this into /usr/local (the default).

  2. Python

    Python is either already installed or comes packaged for you GNU/Linux distribution. If necessary the source can be had from http://www.python.org/. The development was done with Python 2.3, but 2.1+ may work.

    It is assumed that python will be installed in /usr/bin. If it is installed elsewhere, you will have to change the first line of ptrun.py to match.

  3. Graphviz

    Graphviz may also be packaged for your system, otherwise get the source from http://www.graphviz.org/. Version 1.10 was used for development.

    Note the license for graphviz is less than Free Software and probably isn't even Open Source (despite AT&T's Graphviz group's claim to the contrary). However, for our purposes, you should be okay to download, compile and use it. However, if you intend to re-distribute it, read the FAQ.txt in the top level directory of the distribution.

  4. SWIG

    SWIG is optional as the generated Python glue code is committed. It is only needed for development.

Getting PerfTools

PerfTools is located in the PerfTools package of the Minos CVS repository. The code from Jim is currently being mirrored from his repository. This means that everything in PerfTools/src and other directories is NOT definitive. The stuff in PerfTools/python contains Brett's high level Python classes and IS definitive. At some point this will (may) be unified and generalized. For those w/out SSH access, package can be retrieved via a CVSROOT of :pserver:anonymous@minos1.fnal.gov:/cvs/minoscvs/rep1. The CVS module is PerfTools.

Building

Part of PerfTools is interpreted Python, but there are two parts that must be built, Jim;s underlying profiling tools and the Python-GraphViz interface module.

  1. cd PerfTools

  2. Edit and source the file sourceme.build

  3. make This will build everything and install libProfLogger.so, ProfParse and ptrun.py into ../{bin,lib}/$SRT_SUBDIR.

Running

To run, the PerfTools/../{bin,lib}/$SRT_SUBDIR directories must be in your PATH and LD_LIBRARY_PATH, respectively. If your normal SRT environment is setup, this will be handled for you.

In addition, python needs to be able to find the classes that ptrun.py relies on. To do that do:

    # bash:
    export PYTHONPATH=/path/to/PerfTools/python

    # tcsh:
    setenv PYTHONPATH /path/to/PerfTools/python

You can now run ptrun.py in a variety of ways as described by the help message produced with running ptrun.py w/out any flags:

shell> ptrun.py
 
usage: ptrun.py [options]
 
There are three steps to running:
         
Three steps:
 
1) Run command under the proifiler to get prof.out.PID and
prof_libs.out.PID files
 
2) Postprocess these files to get BASE_{names,fullnames,paths,edges}
files.
 
3) Use BASE_{...} files to generate call graphs and explore these
graphs to find potential spots for improvement.
 
These steps can be run with separate invocations of ptrun.py or all at
once depending on what options are given.
 
To run step 1 include:
 
        -c <command to run plus args>
        -l<logfile>
 
To run step 2 include:
 
        -n <PID or "guess" to use more recent>
        -b <basename for post process files>
 
To run step 3 include:
 
        -g
        -b <basename for post process files>
 

Navigating the GUI

You can:

Understanding the Call Graphs

This section gives an example of using these tools. It will profile and optimize the program in the file cpuhog.cc. While this is a contrived example, it is based on what was found using these tools on real code.

This is a self contained program. To compile do:

shell> g++ -o cpuhog cpuhog.cc
You can add in any compiler optimizations you want.

Before starting, get some gross measurements:

shell> time ./cpuhog > cpuhog.out
 
real    0m11.190s
user    0m11.010s
sys     0m0.130s
Now, run it under the profiler. This example breaks the running into three steps.
shell> ptrun.py -c ./cpuhog -l cpuhog.out
Running LD_PRELOAD=libProfLogger.so ./cpuhog > cpuhog.out 2>&1
prof_libs.out.30809 prof.out.30809
shell> ptrun.py -n guess -b cpuhog
Running ProfParse prof_libs.out.30809 prof.out.30809 cpuhog. >> cpuhog.postproc-log 2>&1
shell> ptrun.py -b cpuhog -g
These can be combined in a single step
shell> ptrun.py -c ./cpuhog -l cpuhog.out -n guess -b cpuhog -g
Running LD_PRELOAD=libProfLogger.so ./cpuhog > cpuhog.out 2>&1
prof_libs.out.30946 prof.out.30946
Running ProfParse prof_libs.out.30946 prof.out.30946 cpuhog. >> cpuhog.postproc-log 2>&1
Note: you will need to put your command in quotes if it consisted of multiple words. For example:
    ptrun.py -c "md5sum /usr/include/*" -l md5sum.out
What you will see is a partial view of the call graph:

This graph shows the following:

Interactive things to try: The full call graph:

This example is a very simple call graph. When profiling actual analysis code, one can expect 10-100x more nodes. But, even with this simple program, it is hard to see everything. To help with this, one can pick a node and right click (button 3) on it which will open a menu, from which you can select "Spawn Sub Graph". After exploring this graph, node #27 (cpu_hog(int,int)) looks interesting. Selecting "Spawn Sub Graph" from this node's popup menu and playing a bit with zoom and window size, one sees:

Note at the top of these sub graphs there are three cuts:

Here is what the #27 sub graph looks like after some cuts are applied:

So, what we learn is that while #27, cpu_hog() is seen on the stack a substantial number of times, it is actually not doing anything else but calling others. We could follow down stream until we find what is using the most CPU directly but that just leads deap into STL land. But, the fact that cpu_hog() seems to be a major switch yard for other execution implies it might be a site for optimization.

Since this is a contrived example, this is indeed the case. The problem is the two lines marked below as BAD_IDEA:

void cpu_hog(int min, int max)
{
    CpuHog ch;

    ch.fill_data();

    ch.sort_data();            // BAD_IDEA
    ch.select_data(min,max);   // BAD_IDEA

    ch.dump_data();
}
This causes the entire data to be sorted before a selection is made. Reversing these should then lead to a fast improvement. Again, since this is contrived, the fix has been made easy. Recompile like:
shell> g++ -DOPTIMIZE -o cpuhog cpuhog.cc
Now, the gross running time is thus:
shell> time ./cpuhog > cpuhog.out
 
real    0m5.068s
user    0m4.870s
sys     0m0.190s
And rerunning the profiler produces the following graph around cpu_hog() (now node #30):

It can be seen that the total time on the stack is now much reduced and the time spent calling sort_data() (node 37 before and node 42 after the optimization) is also greatly reduced.


Last Modified: $Date: 2004/02/13 23:14:33 $