
HPM Overview


Overview

hpmcount is a simple-to-use stand-alone utility that starts an application and provides summary utilization data for the entire run.

libhpm, by contrast, is an interface for obtaining utilization statistics for selected regions of code. It writes its output to two files: a plain-text file that resembles the hpmcount output, and a file designed to be visualized with the hpmviz utility. If many regions of code are instrumented with libhpm calls, the visualization tool may make the statistics easier to read, because it condenses the information well. For only a few instrumented regions, it offers little advantage over the plain-text file.


hpmcount

The hpmcount utility starts an application and provides summary utilization data for the entire run. If you need utilization statistics for specific regions of code, this tool is not appropriate; use the libhpm interface instead.

The hpmcount utility is simple to use. It provides wall-clock time, hardware-performance-counter statistics, and utilization information. It supports both serial and parallel (MPI, threaded, and mixed-mode) applications written in Fortran, C, and C++.

Usage:

hpmcount [-h] [-o filename] [-s set] [-e ev[,ev]*] program
or, for parallel jobs:
poe hpmcount [-h] [-o filename] [-s set] [-e ev[,ev]*] program

See the HPM "README" file for an explanation of options.

Example: (matrix-matrix multiplication)

% hpmcount matmul

adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed

ndim = 2000
time taken is 17.3199996948242188
mflops is 923.787545145357171
STOP done

hpmcount (V 2.3.1) summary

Total execution time (wall clock time): 17.890108 seconds

######## Resource Usage Statistics ########

Total amount of time in user mode : 16.720000 seconds
Total amount of time in system mode : 0.980000 seconds
Maximum resident set size : 94008 Kbytes
Average shared memory use in text segment : 21228 Kbytes*sec
Average unshared memory use in data segment : 149030560 Kbytes*sec
Number of page faults without I/O activity : 23512
Number of page faults with I/O activity : 0
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 0
Number of involuntary context switches : 1876

####### End of Resource Statistics ########

PM_CYC (Cycles) : 6221755432
PM_INST_CMPL (Instructions completed) : 16525717994
PM_TLB_MISS (TLB misses) : 4193504
PM_ST_CMPL (Stores completed) : 103816592
PM_LD_CMPL (Loads completed) : 6805124281
PM_FPU0_CMPL (FPU 0 instructions) : 4783352231
PM_FPU1_CMPL (FPU 1 instructions) : 3253672191
PM_EXEC_FMA (FMAs executed) : 8024880769

Utilization rate : 92.743 %
Avg number of loads per TLB miss : 1622.778
Load and store operations : 6908.941 M
Instructions per load/store : 2.392
MIPS : 923.735
Instructions per cycle : 2.656
HW Float points instructions per Cycle : 1.292
Floating point instructions + FMAs : 16061.905 M
Float point instructions + FMA rate : 897.809 Mflip/s
FMA percentage : 99.924 %
Computation intensity : 2.325 

Because no events were specified, hpmcount loaded its default, event set 1, as shown in the first 8 lines of output. Counters 0 through 7 are loaded, though not in numerical order. The next four lines of output come from the application itself, a matrix-matrix multiplication. The matrices are of order 2000, and the multiplication took approximately 17 seconds at a rate of 923.8 Mflop/s.

The remainder of the output is summary information from hpmcount. The summary starts with the total execution time (wall clock) in seconds, followed by resource usage statistics. After the resource statistics comes the information gathered by the event set; in this example, the totals from the 8 hardware counters mentioned earlier. Together, the last three counters in this set give the total number of floating-point instructions executed. From that total and the total execution time, hpmcount derives a floating-point instruction rate (Mflip/s).
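The counter totals in the summary follow a regular "NAME (description) : value" layout, so they are easy to pull out of saved hpmcount output for further analysis. The following is a minimal sketch, assuming the line format shown in the example above; the `parse_counters` helper and its regular expression are hypothetical and are not part of HPM.

```python
import re

# Matches counter lines such as "PM_CYC (Cycles) : 6221755432".
# The pattern is an assumption based on the example output in this document.
COUNTER_LINE = re.compile(r"^\s*(PM_\w+)\s+\((.+?)\)\s*:\s*(\d+)\s*$")


def parse_counters(text):
    """Return {event_name: count} for every counter line found in text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(3))
    return counters


# A few lines copied from the hpmcount example; the derived-statistic
# line is included to show that it is ignored by the parser.
sample = """\
PM_CYC (Cycles) : 6221755432
PM_INST_CMPL (Instructions completed) : 16525717994
PM_TLB_MISS (TLB misses) : 4193504
Utilization rate : 92.743 %
"""

counters = parse_counters(sample)
for name, value in counters.items():
    print(name, value)
```

Only lines whose first token starts with "PM_" are collected, so resource statistics and derived statistics are skipped.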

The Mflip/s rate is one of several derived statistics. Others include the utilization rate, the average number of loads per TLB miss, instructions per cycle, and the total number of floating-point instructions including fused multiply-adds (FMAs).
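Most of the derived statistics can be reproduced from the raw counter totals and the wall-clock time. The sketch below recomputes them from the numbers in the hpmcount example. The formulas are inferred from the printed values (they match to three decimal places), not taken from the HPM documentation, so treat them as assumptions. The utilization rate is omitted because it cannot be reproduced from the values printed by hpmcount alone.

```python
# Raw totals and wall-clock time copied from the hpmcount example output.
wall_seconds = 17.890108
counters = {
    "PM_CYC": 6221755432,
    "PM_INST_CMPL": 16525717994,
    "PM_TLB_MISS": 4193504,
    "PM_ST_CMPL": 103816592,
    "PM_LD_CMPL": 6805124281,
    "PM_FPU0_CMPL": 4783352231,
    "PM_FPU1_CMPL": 3253672191,
    "PM_EXEC_FMA": 8024880769,
}


def derived_metrics(c, wall):
    """Recompute hpmcount-style derived statistics (inferred formulas)."""
    fp = c["PM_FPU0_CMPL"] + c["PM_FPU1_CMPL"]    # instructions on both FPUs
    fp_plus_fma = fp + c["PM_EXEC_FMA"]           # FMAs counted once more
    load_store = c["PM_LD_CMPL"] + c["PM_ST_CMPL"]
    return {
        "loads_per_tlb_miss": c["PM_LD_CMPL"] / c["PM_TLB_MISS"],
        "load_store_millions": load_store / 1e6,
        "instructions_per_load_store": c["PM_INST_CMPL"] / load_store,
        "mips": c["PM_INST_CMPL"] / wall / 1e6,
        "instructions_per_cycle": c["PM_INST_CMPL"] / c["PM_CYC"],
        "fp_instructions_per_cycle": fp / c["PM_CYC"],
        "fp_plus_fma_millions": fp_plus_fma / 1e6,
        "mflips": fp_plus_fma / wall / 1e6,
        # Assumption: each FMA contributes two operations to the total.
        "fma_percentage": 100.0 * 2 * c["PM_EXEC_FMA"] / fp_plus_fma,
        "computation_intensity": fp_plus_fma / load_store,
    }


metrics = derived_metrics(counters, wall_seconds)
for name, value in metrics.items():
    print(f"{name:30s} {value:.3f}")
```

Note that the rates (MIPS and Mflip/s) divide by the total wall-clock time, not by the user-mode time.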


libhpm

The libhpm interface provides a method for instrumenting regions of code to obtain hardware-performance-counter information, derived metrics, and utilization statistics. This interface could also be used to provide data on an entire run, but hpmcount does that more simply, without the need to edit and recompile code. The libhpm interface writes its information to files, one of which can be visualized with the hpmviz utility.

The libhpm interface can be used to provide wall-clock time, hardware-performance-counter statistics, and utilization information on regions of code. This interface supports both serial and parallel (MPI, threaded, and mixed-mode) applications written in Fortran, C, and C++.

libhpm supports multiple instrumentation sections and nested instrumentation, and each instrumented section can be called multiple times. libhpm supports OpenMP and threaded applications, but these should use the thread-safe version of the library (libhpm_r). Also, 64-bit applications can be linked with the 64-bit versions of the library (libhpm64 and libhpm64_r).

libhpm collects information and performs summarization at run time. Overhead can therefore be considerable if instrumentation sections are placed inside inner loops.

libhpm uses the same set of hardware-counter events used by hpmcount.

    See the HPM "README" file for a detailed explanation of libhpm.

    Example: (matrix-matrix multiplication)

    This is the same example used in the hpmcount introduction, except that the code has been instrumented with calls to libhpm functions so that statistics are gathered only for the matrix-matrix multiply.

    #include <f_hpm.h>
    ....
    call f_hpminit(taskID,"matmul-inst")     ! initialize libhpm; string is the program name
    call f_hpmstart(1,"f90 matmul")          ! begin instrumented section 1 with this label
    C = matmul( A, B )
    call f_hpmstop(1)                        ! end instrumented section 1
    call f_hpmterminate(taskID)              ! finish and write the output files
    ....

    Compile the code with the following path settings:

    HPM_DIR = /usr/local/HPM_V2.3
    HPM_INC = -I$(HPM_DIR)/include
    HPM_LIB = -L$(HPM_DIR)/lib -lhpm -lpmapi -lm -lessl
    FFLAGS = -qsuffix=cpp=f -O4
    FF = xlf
    $(FF) $(FFLAGS) $(HPM_INC) matmul-inst.f $(HPM_LIB) -o matmul-inst.x

    Note that "-qsuffix=cpp=f" is used to tell the compiler to run the C preprocessor on ".f" files. (It automatically runs the C preprocessor on ".F" files.)

    When the code runs, it creates two output files: one with a name of the form "perfhpmtaskID.pid", which contains readable performance data (similar to the hpmcount output), and one with a name of the form "hpmtaskID_progname_pid.viz", for use with hpmviz. The default names of the output files can be overridden by setting the "LIBHPM_OUTPUT_NAME" environment variable. The ".viz" file is not generated if the "LIBHPM_VIZ_OUTPUT" environment variable is set to "FALSE".

    The "perfhpmtaskID.pid" file for the above code segment looks like the following:

    libhpm (Version 2.3.1) summary - running on POWER3-II
    
    Total execution time of instrumented code (wall time): 16.710888 seconds
    
    ######## Resource Usage Statistics ########
    
    Total amount of time in user mode : 16.690000 seconds
    Total amount of time in system mode : 0.450000 seconds
    Maximum resident set size : 94248 Kbytes
    Average shared memory use in text segment : 150568 Kbytes*sec
    Average unshared memory use in data segment : 144400668 Kbytes*sec
    Number of page faults without I/O activity : 23591
    Number of page faults with I/O activity : 0
    Number of times process was swapped out : 0
    Number of times file system performed INPUT : 0
    Number of times file system performed OUTPUT : 0
    Number of IPC messages sent : 0
    Number of IPC messages received : 0
    Number of signals delivered : 0
    Number of voluntary context switches : 8
    Number of involuntary context switches : 1745
    
    ####### End of Resource Statistics ########
    
    Instrumented section: 1 - Label: f90 matmul - process: 0
    file: matmul-inst.f, lines: 24 <--> 26
    Count: 1
    Wall Clock Time: 16.710615 seconds
    Total time in user mode: 16.4863835747709 seconds
    
    PM_CYC (Cycles) : 6182207225
    PM_INST_CMPL (Instructions completed) : 16481641172
    PM_TLB_MISS (TLB misses) : 4133943
    PM_ST_CMPL (Stores completed) : 95806437
    PM_LD_CMPL (Loads completed) : 6805048888
    PM_FPU0_CMPL (FPU 0 instructions) : 4822034722
    PM_FPU1_CMPL (FPU 1 instructions) : 3181448245
    PM_EXEC_FMA (FMAs executed) : 8000011908
    
    Utilization rate : 98.658 %
    Avg number of loads per TLB miss : 1646.140
    Load and store operations : 6900.855 M
    Instructions per load/store : 2.388
    MIPS : 986.298
    Instructions per cycle : 2.666
    HW Float points instructions per Cycle : 1.295
    Floating point instructions + FMAs : 16003.495 M
    Float point instructions + FMA rate : 957.684 Mflip/s
    FMA percentage : 99.978 %
    Computation intensity : 2.319
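The per-section rates in the libhpm output above appear to be computed against the wall-clock time of the instrumented section rather than the total run. The sketch below checks this using the values printed for section 1. The formulas, including utilization as user-mode time divided by wall-clock time, are inferred from the printed numbers and should be read as assumptions, not as the documented definitions.

```python
def section_rates(user_seconds, wall_seconds, instructions, fp_plus_fma):
    """Recompute per-section rates from libhpm totals (inferred formulas)."""
    return {
        # Share of the section's wall-clock time spent in user mode.
        "utilization_percent": 100.0 * user_seconds / wall_seconds,
        "mips": instructions / wall_seconds / 1e6,
        "mflips": fp_plus_fma / wall_seconds / 1e6,
    }


# Values copied from the "Instrumented section: 1" block of the output.
section = section_rates(
    user_seconds=16.4863835747709,
    wall_seconds=16.710615,
    instructions=16481641172,
    # FPU 0 + FPU 1 instructions + FMAs executed
    fp_plus_fma=4822034722 + 3181448245 + 8000011908,
)

for name, value in section.items():
    print(f"{name:22s} {value:.3f}")
```

Because the instrumented region excludes program setup, its rates come out higher than the whole-run figures reported by hpmcount for the same program.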

hpmviz

If you prefer a GUI tool for viewing the performance data, you can use hpmviz. Invoke it as follows:

hpmviz hpmtaskID_progname_pid.viz

Once the GUI appears, right-click on any of the labels you defined in the hpmstart calls to see the statistics.

URL http://www.ccs.ornl.gov/ibm/hpm.html
Updated: Tuesday, 26-Mar-2002 08:07:32 EST
consult@ccs.ornl.gov