This page describes how to setup and use PerfTools. See also some related pages:
CPUPROFILE=gp.out LD_PRELOAD=/path/to/libprofiler.so.0 <your_command>
pprof --ps <your_command> gp.out > gp.ps
gv gp.ps
Run "pprof -h" for more options. Not, directly running gv with
"--gv" fails for me.
Here is an example that Sergei generated on a minimal job that just does I/O on Candidate files.
This method differs from other code profilers such as gprof in that no special compilation is needed, more of the functions are seen and the results are easier to interpret.. However, its running is limited to Intel x86 (and compatible) platforms.
The PerfTools package is composed of two parts. The first (and most important) was written by Jim Kowalkowski and consists of:
The ELFIO library is used by Jim's code to produce the "snapshots" of the call stack. You can get it from http://elfio.sourceforge.net/ It is recommended to install this into /usr/local (the default).
Python is either already installed or comes packaged for you GNU/Linux distribution. If necessary the source can be had from http://www.python.org/. The development was done with Python 2.3, but 2.1+ may work.
It is assumed that python will be installed in /usr/bin. If it is installed elsewhere, you will have to change the first line of ptrun.py to match.
Graphviz may also be packaged for your system, otherwise get the source from http://www.graphviz.org/. Version 1.10 was used for development.
Note the license for graphviz is less than Free Software and probably isn't even Open Source (despite AT&T's Graphviz group's claim to the contrary). However, for our purposes, you should be okay to download, compile and use it. However, if you intend to re-distribute it, read the FAQ.txt in the top level directory of the distribution.
SWIG is optional as the generated Python glue code is committed. It is only needed for development.
In addition, python needs to be able to find the classes that ptrun.py relies on. To do that do:
# bash:
export PYTHONPATH=/path/to/PerfTools/python
# tcsh:
setenv PYTHONPATH /path/to/PerfTools/python
You can now run ptrun.py in a variety of ways as described by the help message produced with running ptrun.py w/out any flags:
shell> ptrun.py
usage: ptrun.py [options]
There are three steps to running:
Three steps:
1) Run command under the proifiler to get prof.out.PID and
prof_libs.out.PID files
2) Postprocess these files to get BASE_{names,fullnames,paths,edges}
files.
3) Use BASE_{...} files to generate call graphs and explore these
graphs to find potential spots for improvement.
These steps can be run with separate invocations of ptrun.py or all at
once depending on what options are given.
To run step 1 include:
-c <command to run plus args>
-l<logfile>
To run step 2 include:
-n <PID or "guess" to use more recent>
-b <basename for post process files>
To run step 3 include:
-g
-b <basename for post process files>
This is a self contained program. To compile do:
shell> g++ -o cpuhog cpuhog.ccYou can add in any compiler optimizations you want.
Before starting, get some gross measurements:
shell> time ./cpuhog > cpuhog.out real 0m11.190s user 0m11.010s sys 0m0.130sNow, run it under the profiler. This example breaks the running into three steps.
shell> ptrun.py -c ./cpuhog -l cpuhog.out Running LD_PRELOAD=libProfLogger.so ./cpuhog > cpuhog.out 2>&1 prof_libs.out.30809 prof.out.30809 shell> ptrun.py -n guess -b cpuhog Running ProfParse prof_libs.out.30809 prof.out.30809 cpuhog. >> cpuhog.postproc-log 2>&1 shell> ptrun.py -b cpuhog -gThese can be combined in a single step
shell> ptrun.py -c ./cpuhog -l cpuhog.out -n guess -b cpuhog -g Running LD_PRELOAD=libProfLogger.so ./cpuhog > cpuhog.out 2>&1 prof_libs.out.30946 prof.out.30946 Running ProfParse prof_libs.out.30946 prof.out.30946 cpuhog. >> cpuhog.postproc-log 2>&1Note: you will need to put your command in quotes if it consisted of multiple words. For example:
ptrun.py -c "md5sum /usr/include/*" -l md5sum.out
What you will see is a partial view of the call graph:
This graph shows the following:
#FID: (NTOP/NSTACK) function_name(args)
The FID and function name are self-explanitory, but the two
numbers in parenthesis mean:
These two numbers do not change due to any node cuts (again, see below).
This example is a very simple call graph. When profiling actual analysis code, one can expect 10-100x more nodes. But, even with this simple program, it is hard to see everything. To help with this, one can pick a node and right click (button 3) on it which will open a menu, from which you can select "Spawn Sub Graph". After exploring this graph, node #27 (cpu_hog(int,int)) looks interesting. Selecting "Spawn Sub Graph" from this node's popup menu and playing a bit with zoom and window size, one sees:
Note at the top of these sub graphs there are three cuts:
So, what we learn is that while #27, cpu_hog() is seen on the stack a substantial number of times, it is actually not doing anything else but calling others. We could follow down stream until we find what is using the most CPU directly but that just leads deap into STL land. But, the fact that cpu_hog() seems to be a major switch yard for other execution implies it might be a site for optimization.
Since this is a contrived example, this is indeed the case. The problem is the two lines marked below as BAD_IDEA:
void cpu_hog(int min, int max)
{
CpuHog ch;
ch.fill_data();
ch.sort_data(); // BAD_IDEA
ch.select_data(min,max); // BAD_IDEA
ch.dump_data();
}
This causes the entire data to be sorted before a selection is made.
Reversing these should then lead to a fast improvement. Again, since
this is contrived, the fix has been made easy. Recompile like:
shell> g++ -DOPTIMIZE -o cpuhog cpuhog.ccNow, the gross running time is thus:
shell> time ./cpuhog > cpuhog.out real 0m5.068s user 0m4.870s sys 0m0.190sAnd rerunning the profiler produces the following graph around cpu_hog() (now node #30):
It can be seen that the total time on the stack is now much reduced and the time spent calling sort_data() (node 37 before and node 42 after the optimization) is also greatly reduced.