Using Performance Co-Pilot (PCP) on COSMA
or how to find out what resources your job is using...
Performance Co-Pilot (PCP) is a system for monitoring the status of computers, and is installed on the compute nodes of COSMA. Some of the information that it monitors (so-called metrics) is available for remote access, so you can use it to check things like the load and memory use of any node without any special privilege (logging into compute nodes is not allowed unless you are a member of the developers group).
PCP offers a lot of commands, which is somewhat overwhelming at first glance (and the documentation is usually focused on system configuration), so we just offer a somewhat arbitrary selection of hopefully useful commands to get started.
To get an overview of memory and CPU usage on a compute node, use the pmstat command with the -h option to name the node to query:
# pmstat -h m7014
 loadavg                      memory      swap        io    system         cpu
   1 min   swpd   free   buff  cache   pi   po   bi   bo   in   cs  us  sy  id
   27.00      0   478g 177040  2249m    0    0    0    0  29K 2924  47   1  52
   27.00      0   478g 177040  2249m    0    0    0    0  28K 1831  47   1  52
This is updated every 5 seconds until interrupted. You can change the update interval to get a faster or slower rate using the -t flag:
# pmstat -h m7014 -t 1min
# pmstat -h m7014 -t .5sec
You can also query more than one node at a time:
# pmstat -h m7014 -h m7250
node      loadavg               memory      swap        io    system         cpu
            1 min   swpd   buff  cache   pi   po   bi   bo   in   cs  us  sy  id
m7014.p     27.00      0   478g  2249m    0    0    0    0  28K  742  47   1  52
m7250.p     28.00      0   442g 20604m    0    0    0    0  29K 1262  49   1  50
m7014.p     27.00      0   478g  2249m    0    0    0   11  29K 3774  47   1  52
m7250.p     28.00      0   442g 20604m    0    0    0   25  29K 1250  49   1  50
m7014.p     27.00      0   478g  2249m    0    0    0    0  28K 1793  47   1  52
m7250.p     28.00      0   442g 20604m    0    0    0   20  29K 1146  49   1  50
but that seems to have a bug (the column labelled buff is obviously the free memory).
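If you have a longer list of nodes, typing out each -h option gets tedious. As a minimal sketch, the shell function below builds the repeated -h options from a list of node names. Note that host_args is a hypothetical name chosen for this example and is not part of PCP.

```shell
# Hypothetical helper (not part of PCP): print "-h NODE" for every
# node name given as an argument, separated by single spaces.
host_args() {
    sep=''
    for node in "$@"; do
        printf '%s-h %s' "$sep" "$node"
        sep=' '
    done
    printf '\n'
}
```

It could then be used as pmstat $(host_args m7014 m7250), with the substitution left unquoted so that the shell splits it into separate options.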
The pmstat man page describes the various fields, but loadavg is probably the first thing to look at. It is roughly the number of runnable processes (not quite; for more details see the uptime man page). In general this should not exceed the number of physical cores on the machine, so an optimal load is 28 for COSMA7 and 16 for COSMA6 and COSMA5. If your code uses hyper-threads these numbers may double. In the example above we can see a job using fewer cores than are available, which may be necessary to get the required memory per core. Hybrid codes will also not, in general, show loads equal to the number of cores, as their threads are not continually runnable. It would be nice if they were, but do not be fooled into thinking a lower load is bad: most MPI codes that use a single rank per core are not continually busy either, they are usually just spinning the CPU while waiting for communication.
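The comparison described above is easy to script. The following is only a sketch: check_load is a hypothetical helper, and the 90% threshold used to report a node as under-loaded is an arbitrary assumption for illustration, not a COSMA recommendation. As noted above, a load below the core count is not necessarily a problem.

```shell
# Hypothetical helper: compare a 1 minute load average with the number
# of physical cores on the node (28 for COSMA7, 16 for COSMA6/COSMA5).
# Usage: check_load LOAD CORES
# Prints "over" if the load exceeds the core count, "under" if it is
# below 90% of it (an arbitrary threshold), and "ok" otherwise.
check_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        load += 0; cores += 0            # force numeric comparison
        if (load > cores)            { print "over" }
        else if (load < 0.9 * cores) { print "under" }
        else                         { print "ok" }
    }'
}
```

For the node shown above, check_load 27.00 28 reports ok.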
You can get the 1min, 5min and 15min loads using the pcp dstat sub-command:
# pcp -h m7001 dstat -l
---load-avg---
 1m   5m  15m
27.9 27.8 27.7
27.9 27.8 27.7
27.9 27.8 27.7
27.9 27.8 27.7
dstat also has options to report CPU and memory use. See the pcp-dstat man page.
Detailed memory use
The memory profile of the whole machine is important for understanding whether your code is making optimal use of the memory. There are various commands that report memory use, but a good one is:
# pmrep -h m7420 :sar-r
         kbmemfree kbmemavai kbmemused %memused kbbuffers kbcached  kbcommit %commit  kbactive  kbinact kbdirty
17:44:30 337693080 386612632 190087944    36.02    128736 49919624 127783768   23.84 146485992 29643324     188
17:44:31 337693080 386612632 190087944    36.02    128736 49919624 127783768   23.84 146485992 29643324     188
17:44:32 337693080 386612632 190087944    36.02    128736 49919624 127783768   23.84 146485992 29643324     188
17:44:33 337685304 386610904 190095720    36.02    128736 49926392 127783768   23.84 146488008 29647356   10716
This is a clone of the output from the sar -r command, and reports much of the node memory information from the /proc/meminfo file. The pmrep command, like pcp dstat, has a large number of possible reports; these are defined in the file /etc/pcp/pmrep/pmrep.conf
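The values in this report are in kilobytes, which are awkward to read on nodes with hundreds of gigabytes of memory. As a sketch, the hypothetical helper below (mem_used_gib is our name, not a PCP command) reads lines in the layout shown above and prints the used memory in GiB. It assumes the column order of the example: time, kbmemfree, kbmemavai, kbmemused, %memused.

```shell
# Hypothetical helper: read ":sar-r" style lines on stdin and print the
# time, the used memory in GiB and the percentage used for each data row.
# Assumes kbmemused is the 4th field and %memused the 5th, as in the
# example output; header lines are skipped because they do not start
# with a HH:MM:SS time stamp.
mem_used_gib() {
    awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
        printf "%s %.1f GiB used (%s%%)\n", $1, $4 / 1048576, $5
    }'
}
```

A possible use is pmrep -h m7420 :sar-r | mem_used_gib. For the first row of the example this prints 17:44:30 181.3 GiB used (36.02%).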
A well behaved batch job will distribute itself over all the cores and sockets of the machine. A classic issue is more than one pinned thread sharing a single core, so it is a good idea to check that both the CPUs and the cores are loaded as expected. The socket balance can be seen using:
# pmrep -h m7452 :numa-per-node-cpu
          NUMA n    %usr   %nice    %sys %iowait  %steal    %irq   %soft  %guest  %gnice   %idle
17:47:09   node0   49.97    0.00    0.04    0.00    0.00    0.00    0.00    0.00    0.00   50.01
17:47:09   node1   50.05    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   50.01
17:47:10   node0   50.00    0.00    0.07    0.00    0.00    0.00    0.00    0.00    0.00   49.92
17:47:10   node1   49.92    0.00    0.14    0.00    0.00    0.00    0.00    0.00    0.00   50.00
In this case the balance is good: each NUMA node has the same work and all the physical cores are in use, since 50% corresponds to all the physical cores. The other 50% would be used if hyper-threading were in use. Note that a node in this case is equivalent to a socket holding a CPU. Some architectures allow CPUs to be split into more NUMA regions (COSMA8 has 4 per socket), usually depending on how the memory is accessed.
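When there are many NUMA regions it is easier to look at a summary than at every line. The sketch below reads output in the layout shown above and reports the lowest and highest %usr seen, so that a large spread stands out. The name numa_usr_spread is hypothetical, and the function assumes the node name is the 2nd field and %usr the 3rd, as in the example.

```shell
# Hypothetical helper: read ":numa-per-node-cpu" style lines on stdin
# and print the minimum, maximum and spread of %usr over all rows.
# Only rows whose 2nd field looks like "nodeN" are used, so the header
# line is ignored.
numa_usr_spread() {
    awk '$2 ~ /^node[0-9]+$/ {
        v = $3 + 0                       # %usr as a number
        if (n == 0 || v < min) { min = v }
        if (n == 0 || v > max) { max = v }
        n++
    }
    END {
        if (n > 0) {
            printf "min %.2f max %.2f spread %.2f\n", min, max, max - min
        }
    }'
}
```

For the four rows in the example this prints min 49.92 max 50.05 spread 0.13, which confirms the even balance.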
The activity per core can be seen using:
# pmrep -h m7452 :mpstat-P-ALL
the output can be very long (256 lines for COSMA8), but looking closely you can see which CPUs are active. If you have pinning enabled these should remain fixed.
The only IO of interest on a typical node is how much traffic is being generated on the InfiniBand fabric. There are potentially a lot of metrics that can be reported about this, but the command:
# pmrep -h m7001 infiniband.port.total.bytes -t 3s -b MB
will report the total megabytes per second averaged over 3 seconds.
The metrics that can be reported from a node can be seen using the command:
# pminfo -h m7452
each of these can be queried using the pmval or pmrep commands, they can also be visualized in a graph using the pmchart command. This is very flexible and can be used to see the metrics of more than one node at a time, as in:
# pmchart -h m7001 m7002 m7003
you can then select the metrics of interest from the nodes for comparison and can also capture the information. There is fuller help at:
At the time of writing, however, pmchart has a problem when the window is resized or when it samples quickly. It is unclear whether this is affected by network speed, but that could also be a factor. Not resizing the window seems to help, so the recommendation is to use startup commands like:
# pmchart -t 2 -geometry 800x400 -c CPU -h m7452
The CPU value selects a view, which is a set of predefined plots. Other useful views are Memory and Overview, but if using Overview increase the vertical size to 1000. On COSMA the InfiniBand read and write rates are also available in the IB view.
As was noted at the beginning, and as you will have seen if you have looked at the output of pminfo, PCP lets you monitor a lot of details of the compute nodes, but there are some limitations. The metrics that report at the process level require privilege that is not available using remote access, so if you try them they will fail (a shame, as pcp atop would be very nice, like a remote top command). All metrics are available locally, that is when running on the node itself; just leave the -h part out.
It is also not possible to see any archival material, so you need to check things while your job is running. In the longer term we may start to capture values that can be used to see how a job ran, but this will be a limited set, both in the metrics available and in the time sampling, as that is a lot of data to capture.
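Until then, one workaround is to keep your own record while the job runs. The function below is a minimal sketch of that idea, not a COSMA-provided tool: log_with_time is a hypothetical helper that appends whatever it reads to a file, prefixing each line with the date and time so samples can be matched to stages of the job later.

```shell
# Hypothetical helper: append each line read from stdin to the file
# named by the first argument, prefixed with the local date and time.
# Usage: some_command | log_with_time FILE
log_with_time() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done >> "$1"
}
```

For example, pmstat -h m7014 -t 1min | log_with_time m7014.log could be left running for the duration of the job. Depending on how a command buffers its output when writing to a pipe, lines may arrive in batches, so the time stamps should be treated as approximate.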