Overview
hpmcount is a simple stand-alone utility that starts
an application and reports summary utilization data for the entire run.
libhpm, by contrast, is a library interface for obtaining
utilization statistics for selected regions of code. The libhpm
interface writes its output to two files: a plain text file that
resembles the hpmcount output, and a file designed
to be visualized with the hpmviz utility.
If many regions of code are instrumented with libhpm
calls, the visualization tool may be the easier way to view the
statistics, because it condenses the information well. For
only a few instrumented regions, it offers little advantage over
the plain text file.
hpmcount
The hpmcount utility starts an application and provides summary
utilization data for the entire run. If you need utilization statistics
for specific regions of code, this tool is not appropriate; use the
libhpm interface instead.
The hpmcount utility is simple to use. It reports wall-clock
time, hardware performance counter statistics, and utilization information.
It supports both serial and parallel (MPI, threaded, and mixed-mode)
applications written in Fortran, C, and C++.
Usage:
hpmcount [-h] [-o filename] [-s set] [-e ev[,ev]*] program
or, for parallel jobs:
poe hpmcount [-h] [-o filename] [-s set] [-e ev[,ev]*] program
See the HPM "README" file for an explanation
of options.
Example: (matrix-matrix multiplication)
% hpmcount matmul
adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed
ndim = 2000
time taken is 17.3199996948242188
mflops is 923.787545145357171
STOP done
hpmcount (V 2.3.1) summary
Total execution time (wall clock time): 17.890108 seconds
######## Resource Usage Statistics ########
Total amount of time in user mode : 16.720000 seconds
Total amount of time in system mode : 0.980000 seconds
Maximum resident set size : 94008 Kbytes
Average shared memory use in text segment : 21228 Kbytes*sec
Average unshared memory use in data segment : 149030560 Kbytes*sec
Number of page faults without I/O activity : 23512
Number of page faults with I/O activity : 0
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 0
Number of involuntary context switches : 1876
####### End of Resource Statistics ########
PM_CYC (Cycles) : 6221755432
PM_INST_CMPL (Instructions completed) : 16525717994
PM_TLB_MISS (TLB misses) : 4193504
PM_ST_CMPL (Stores completed) : 103816592
PM_LD_CMPL (Loads completed) : 6805124281
PM_FPU0_CMPL (FPU 0 instructions) : 4783352231
PM_FPU1_CMPL (FPU 1 instructions) : 3253672191
PM_EXEC_FMA (FMAs executed) : 8024880769
Utilization rate : 92.743 %
Avg number of loads per TLB miss : 1622.778
Load and store operations : 6908.941 M
Instructions per load/store : 2.392
MIPS : 923.735
Instructions per cycle : 2.656
HW Float points instructions per Cycle : 1.292
Floating point instructions + FMAs : 16061.905 M
Float point instructions + FMA rate : 897.809 Mflip/s
FMA percentage : 99.924 %
Computation intensity : 2.325
Because no events were specified, hpmcount loaded the default event
set, set 1, as shown in the first 8 lines of output. Counters 0 through
7 are loaded, though not in numerical order. The next four lines of output come
from the application itself, a matrix-matrix multiplication.
The matrices are of order 2000, and the multiplication took approximately
17 seconds at a rate of 923.8 Mflop/s.
The remainder of the output is summary information from hpmcount.
The summary starts with the total execution time (wall clock) in seconds,
followed by resource usage statistics. After the resource statistics
come the totals gathered by the event set; in this example, these are
the totals from the 8 hardware counters mentioned earlier.
The last three counters in this set (FPU 0 instructions, FPU 1 instructions,
and FMAs executed) together give the total number of floating-point
instructions executed. From this total and the total execution time,
hpmcount derives a floating-point instruction rate (Mflip/s).
The Mflip/s rate is one of several derived statistics. Others include
utilization rate, average number of loads per TLB miss, instructions
per cycle, and total number of floating-point instructions including fused
multiply-adds (FMAs).
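The derived statistics can be reproduced from the raw counter totals. The following Python sketch recomputes several of them from the numbers in the example output above. The formulas are assumptions inferred from those numbers (for instance, that the FMA percentage counts each FMA as two operations), not taken from the HPM documentation, and the function and key names are hypothetical. The utilization rate is left out because its formula is not evident from the summary alone.

```python
# Raw counter totals copied from the hpmcount summary above.
counters = {
    "PM_CYC": 6221755432,
    "PM_INST_CMPL": 16525717994,
    "PM_TLB_MISS": 4193504,
    "PM_ST_CMPL": 103816592,
    "PM_LD_CMPL": 6805124281,
    "PM_FPU0_CMPL": 4783352231,
    "PM_FPU1_CMPL": 3253672191,
    "PM_EXEC_FMA": 8024880769,
}
wall_clock_seconds = 17.890108  # "Total execution time (wall clock time)"


def derived_metrics(c, wall):
    """Recompute derived statistics (inferred formulas, not official ones)."""
    load_store = c["PM_LD_CMPL"] + c["PM_ST_CMPL"]
    fpu = c["PM_FPU0_CMPL"] + c["PM_FPU1_CMPL"]
    flips = fpu + c["PM_EXEC_FMA"]  # floating-point instructions + FMAs
    return {
        "loads_per_tlb_miss": c["PM_LD_CMPL"] / c["PM_TLB_MISS"],
        "load_store_millions": load_store / 1e6,
        "instructions_per_load_store": c["PM_INST_CMPL"] / load_store,
        "mips": c["PM_INST_CMPL"] / wall / 1e6,
        "instructions_per_cycle": c["PM_INST_CMPL"] / c["PM_CYC"],
        "fpu_instructions_per_cycle": fpu / c["PM_CYC"],
        "flips_millions": flips / 1e6,
        "mflips_per_second": flips / wall / 1e6,
        # Assumption: each FMA is counted as two floating-point operations.
        "fma_percentage": 100.0 * 2 * c["PM_EXEC_FMA"] / flips,
        "computation_intensity": flips / load_store,
    }


metrics = derived_metrics(counters, wall_clock_seconds)
for name, value in metrics.items():
    print(f"{name:30s} {value:12.3f}")
```

The printed values agree with the hpmcount summary to the three decimal places shown there, which suggests the rate statistics are computed against the wall-clock time.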
libhpm
The libhpm interface provides a way to instrument regions
of code to obtain hardware performance counter information, derived metrics,
and utilization statistics. It can also provide data on an entire run, but
hpmcount does that more simply, without the need to edit and recompile code.
The libhpm interface writes its results to files, one of which can
be visualized with the hpmviz utility.
The libhpm interface can report wall-clock time,
hardware performance counter statistics, and utilization information for regions
of code. It supports both serial and parallel (MPI, threaded,
and mixed-mode) applications written in Fortran, C, and C++.
libhpm supports multiple instrumentation sections and nested
instrumentation, and each instrumented section can be called multiple times.
libhpm supports OpenMP and threaded applications, but the thread-safe
version of the library (libhpm_r) should be used. Also, 64-bit applications
can be linked with the 64-bit versions of the library (libhpm64 and
libhpm64_r).
libhpm collects information and performs summarization
at run time, so the overhead can be considerable if instrumentation
sections are placed inside inner loops.
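To illustrate why placement matters, the following Python sketch uses a hypothetical stand-in for start/stop instrumentation (the SectionTimer class is invented for this example and is not libhpm). It only shows how the number of start/stop calls grows when a section is placed inside the inner loops of a small matrix multiplication instead of around the whole computation.

```python
import time
from collections import defaultdict


class SectionTimer:
    """Hypothetical stand-in for start/stop instrumentation; not libhpm."""

    def __init__(self):
        self.count = defaultdict(int)      # times each section was executed
        self.elapsed = defaultdict(float)  # accumulated wall-clock seconds
        self._started = {}                 # start time of open sections

    def start(self, section_id):
        self._started[section_id] = time.perf_counter()

    def stop(self, section_id):
        begun = self._started.pop(section_id)
        self.elapsed[section_id] += time.perf_counter() - begun
        self.count[section_id] += 1


def multiply(a, b, timer, instrument_inner):
    """Multiply square matrices, instrumenting either the whole product
    (section 1) or every single element computation (section 2)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    if not instrument_inner:
        timer.start(1)
    for i in range(n):
        for j in range(n):
            if instrument_inner:
                timer.start(2)
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
            if instrument_inner:
                timer.stop(2)
    if not instrument_inner:
        timer.stop(1)
    return c


n = 20
a = [[1.0] * n for _ in range(n)]
b = [[2.0] * n for _ in range(n)]
timer = SectionTimer()
c_outer = multiply(a, b, timer, instrument_inner=False)
c_inner = multiply(a, b, timer, instrument_inner=True)
print("section 1 (around the product) executed:", timer.count[1])
print("section 2 (inside the loops) executed:  ", timer.count[2])
```

Both variants compute the same product, but the inner placement pays the start/stop cost once per matrix element rather than once per multiplication.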
libhpm uses the same set of hardware counter events as hpmcount.
See the HPM "README" file for a detailed
explanation of libhpm.
Example: (matrix-matrix multiplication)
This is the same example used in the hpmcount section, except
that the code has been instrumented with calls to libhpm functions
so that statistics are gathered only for the matrix-matrix multiply.
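#include <f_hpm.h>
....
! Initialize libhpm for this task; the second argument is the program name.
call f_hpminit(taskID,"matmul-inst")
! Start counting for instrumented section 1, labeled "f90 matmul".
call f_hpmstart(1,"f90 matmul")
C = matmul( A, B )
! Stop counting for section 1.
call f_hpmstop(1)
! Finish instrumentation for this task and generate the output files.
call f_hpmterminate(taskID)
....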
Compile the code with the following path settings:
HPM_DIR = /usr/local/HPM_V2.3
HPM_INC = -I$(HPM_DIR)/include
HPM_LIB = -L$(HPM_DIR)/lib -lhpm -lpmapi -lm -lessl
FFLAGS = -qsuffix=cpp=f -O4
FF = xlf
$(FF) $(FFLAGS) $(HPM_INC) matmul-inst.f $(HPM_LIB) -o matmul-inst.x
Note that "-qsuffix=cpp=f" is used to tell the compiler to run the C preprocessor
on ".f" files. (It automatically runs the C preprocessor on ".F" files.)
When the code is run, it creates two output files: one with a name of
the form "perfhpmtaskID.pid", which contains readable performance data
(similar to the hpmcount output), and the other of the form "hpmtask_progname_pid.viz",
for use with hpmviz. The default names of the output files can be
overridden by setting the "LIBHPM_OUTPUT_NAME" environment variable. The
".viz" file is not written if the "LIBHPM_VIZ_OUTPUT" environment variable is set to
"FALSE".
The "perfhpmtaskID.pid" file for the above code segment looks
like the following:
libhpm (Version 2.3.1) summary - running on POWER3-II
Total execution time of instrumented code (wall time): 16.710888 seconds
######## Resource Usage Statistics ########
Total amount of time in user mode : 16.690000 seconds
Total amount of time in system mode : 0.450000 seconds
Maximum resident set size : 94248 Kbytes
Average shared memory use in text segment : 150568 Kbytes*sec
Average unshared memory use in data segment : 144400668 Kbytes*sec
Number of page faults without I/O activity : 23591
Number of page faults with I/O activity : 0
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 8
Number of involuntary context switches : 1745
####### End of Resource Statistics ########
Instrumented section: 1 - Label: f90 matmul - process: 0
file: matmul-inst.f, lines: 24 <--> 26
Count: 1
Wall Clock Time: 16.710615 seconds
Total time in user mode: 16.4863835747709 seconds
PM_CYC (Cycles) : 6182207225
PM_INST_CMPL (Instructions completed) : 16481641172
PM_TLB_MISS (TLB misses) : 4133943
PM_ST_CMPL (Stores completed) : 95806437
PM_LD_CMPL (Loads completed) : 6805048888
PM_FPU0_CMPL (FPU 0 instructions) : 4822034722
PM_FPU1_CMPL (FPU 1 instructions) : 3181448245
PM_EXEC_FMA (FMAs executed) : 8000011908
Utilization rate : 98.658 %
Avg number of loads per TLB miss : 1646.140
Load and store operations : 6900.855 M
Instructions per load/store : 2.388
MIPS : 986.298
Instructions per cycle : 2.666
HW Float points instructions per Cycle : 1.295
Floating point instructions + FMAs : 16003.495 M
Float point instructions + FMA rate : 957.684 Mflip/s
FMA percentage : 99.978 %
Computation intensity : 2.319
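Because the counter lines in the text file follow a regular "NAME (description) : value" pattern, they are straightforward to post-process. The following Python sketch, which is not part of the HPM toolkit, parses a few lines copied from the listing above and recomputes the floating-point rate for the instrumented section. The regular expressions and the function name are assumptions; a real output file may use different spacing.

```python
import re

# A few lines copied from the libhpm summary above.
SUMMARY = """\
Instrumented section: 1 - Label: f90 matmul - process: 0
Count: 1
Wall Clock Time: 16.710615 seconds
PM_CYC (Cycles) : 6182207225
PM_INST_CMPL (Instructions completed) : 16481641172
PM_TLB_MISS (TLB misses) : 4133943
PM_ST_CMPL (Stores completed) : 95806437
PM_LD_CMPL (Loads completed) : 6805048888
PM_FPU0_CMPL (FPU 0 instructions) : 4822034722
PM_FPU1_CMPL (FPU 1 instructions) : 3181448245
PM_EXEC_FMA (FMAs executed) : 8000011908
Utilization rate : 98.658 %
"""

# Matches lines such as "PM_CYC (Cycles) : 6182207225".
COUNTER_LINE = re.compile(r"^\s*(PM_\w+)\s*\(([^)]*)\)\s*:\s*(\d+)\s*$")
# Matches the per-section wall-clock line.
WALL_LINE = re.compile(r"^\s*Wall Clock Time:\s*([0-9.]+)\s+seconds\s*$")


def parse_section(text):
    """Return (counters, wall_seconds) for one instrumented section."""
    counters = {}
    wall_seconds = None
    for line in text.splitlines():
        counter_match = COUNTER_LINE.match(line)
        if counter_match:
            name, _description, value = counter_match.groups()
            counters[name] = int(value)
            continue
        wall_match = WALL_LINE.match(line)
        if wall_match:
            wall_seconds = float(wall_match.group(1))
    return counters, wall_seconds


section_counters, section_wall = parse_section(SUMMARY)
flips = (section_counters["PM_FPU0_CMPL"]
         + section_counters["PM_FPU1_CMPL"]
         + section_counters["PM_EXEC_FMA"])
mflips_per_second = flips / section_wall / 1e6
print(f"Floating point instructions + FMAs: {flips / 1e6:.3f} M")
print(f"Rate: {mflips_per_second:.3f} Mflip/s")
```

The recomputed rate matches the 957.684 Mflip/s reported for the section when the section's own wall-clock time (16.710615 seconds) is used as the divisor.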