Using perf to measure Linux command performance

Recently I explored a few options to measure the performance of compression of directory files on Linux.
Usually, if I would want to measure the time consumed; appending the keyword ‘time’ helped. However, this time I was suggested to check the libraries being invoked underneath. This led me to ‘perf’ command which we are going to look into today.

First, let’s go back in ‘time‘. For our experiment, we are going to create a simple while loop in bash as follows:

Using “time” command on the above script:

[root@centosstream9 ~]# time sh command_loop.sh
0
1
..

..
9
Number 10!
Done!

real    0m0.520s
user    0m0.007s
sys     0m0.000s

In the above output:
real: total time command took to execute
user: Time spent in user-space
sys: Time spent in kernel space


If we want to really dig deep into how our command is spending time on Linux system calls, CPU cycles, etc.; we should use perf utility. I have Centos 9 machine where I have installed this utility using yum install perf -y

A quick look at the help of the perf command gives us this:

[root@centosstream9 ~]# perf --help

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   daemon          Run record sessions on background
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   iostat          Show I/O performance metrics
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   kwork           Tool to trace/measure kernel work properties (latencies)
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

Examples of perf command:

1. Simple usage of command:
perf record -o <PERFORMANCE_FILE_DATA> <LINUX COMMAND TO EXECUTE>
Executing the above on our bash script:

[root@centosstream9 ~]# perf record -o command_script_execution_record.data sh command_loop.sh
0
1
...
...
9
Number 10!
Done!
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.012 MB command_script_execution_record.data (24 samples) ]
[root@centosstream9 ~]# ll
total 28
-rw-------. 1 root root  1134 Dec  5  2021 anaconda-ks.cfg
-rw-r--r--. 1 root root   217 May 21 07:44 command_loop.sh
-rw-------. 1 root root 20080 May 21 08:43 command_script_execution_record.data

2. Check what data is captured in the performance data file:
perf report -i <performance_data_file> –header.
The output of the above command for our performance file:

# ========
# captured on    : Sun May 21 09:07:38 2023
# header version : 1
# data offset    : 256
# data size      : 12008
# feat offset    : 12264
# hostname : centosstream9.linuxvmimages.local
# os release : 5.14.0-22.el9.x86_64
# perf version : 5.14.0-312.el9.x86_64
# arch : x86_64
# nrcpus online : 1
# nrcpus avail : 1
# cpudesc : 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
# cpuid : GenuineIntel,6,140,1
# total memory : 1817896 kB
# cmdline : /usr/bin/perf record -o command_script_execution_record.data sh command_loop.sh 
# event : name = cpu-clock:pppH, , id = { 19 }, type = 1, size = 128, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|PERIOD, read_format = ID, disabled = 1, inherit = 1, mmap = 1, comm = 1, freq = 1, enable_on_exec = 1, task = 1, precise_ip = 3, sample_id_all = 1, exclude_guest = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: software = 1, uprobe = 7, breakpoint = 5, tracepoint = 2, kprobe = 6, msr = 8
# CACHE info available, use -I to display
# time of first sample : 7411.707292
# time of last sample : 7412.232653
# sample duration :    525.361 ms
# MEM_TOPOLOGY info available, use -I to display
# bpf_prog_info 22: bpf_prog_40ddf486530245f5 addr 0xffffffffc02660ac size 309
# bpf_prog_info 23: bpf_prog_6deef7357e7b4530 addr 0xffffffffc026888c size 54
# bpf_prog_info 24: bpf_prog_6deef7357e7b4530 addr 0xffffffffc026ac78 size 54
# bpf_prog_info 25: bpf_prog_ca8e50a3c7fb034b addr 0xffffffffc02624a4 size 307
# bpf_prog_info 26: bpf_prog_6deef7357e7b4530 addr 0xffffffffc02cc6ec size 54
# bpf_prog_info 27: bpf_prog_6deef7357e7b4530 addr 0xffffffffc02d2738 size 54
# bpf_prog_info 28: bpf_prog_3a0ef5414c2f6fca addr 0xffffffffc0869a80 size 447
# bpf_prog_info 29: bpf_prog_6deef7357e7b4530 addr 0xffffffffc086bd1c size 54
# bpf_prog_info 30: bpf_prog_6deef7357e7b4530 addr 0xffffffffc086d010 size 54
# bpf_prog_info 31: bpf_prog_ee0e253c78993a24 addr 0xffffffffc08ad840 size 255
# missing features: TRACING_DATA BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID DIR_FORMAT COMPRESSED CPU_PMU_CAPS CLOCK_DATA HYBRID_TOPOLOGY
# ========
#
#
# Total Lost Samples: 0
#
# Samples: 24  of event 'cpu-clock:pppH'
# Event count (approx.): 6000000
#
# Overhead  Command  Shared Object         Symbol
# ........  .......  ....................  .....................................
#
    29.17%  sh       [kernel.kallsyms]     [k] finish_task_switch.isra.0
    20.83%  sleep    ld-linux-x86-64.so.2  [.] intel_check_word.constprop.0
    16.67%  sh       ld-linux-x86-64.so.2  [.] intel_check_word.constprop.0
     8.33%  sleep    ld-linux-x86-64.so.2  [.] init_cpu_features.constprop.0
     4.17%  sh       ld-linux-x86-64.so.2  [.] get_common_indices.constprop.0
     4.17%  sh       ld-linux-x86-64.so.2  [.] handle_intel.constprop.0
     4.17%  sh       ld-linux-x86-64.so.2  [.] init_cpu_features.constprop.0
     4.17%  sleep    [kernel.kallsyms]     [k] finish_task_switch.isra.0
     4.17%  sleep    ld-linux-x86-64.so.2  [.] get_common_cache_info.constprop.0
     4.17%  sleep    ld-linux-x86-64.so.2  [.] handle_intel.constprop.0

The “Symbol” field points to the functions from the kernel source code. Shared Object points to libraries on the system which run your command.

3. Export the data into sqlite db, postgresql db, or generate a flame-graph with the performance data file.
How to do it? Just run: perf report -i <perf_data_file> and then you will see the following events:

and then:

and then select the option you want to do.

Note: To generate a flame-graph, you need to have the js-d3-flame-graph library installed and the name of the file has to be perf.data(just rename your generated perf data file). This generated an HTML file, which looks like this:

4. See all the events from performance data:
perf script -i <perf_data_file>
The output of above command on our performance data file:

[root@centosstream9 ~]# perf script -i ./command_script_execution_record.data                                                                                
              sh 13623  7411.707292:     250000 cpu-clock:pppH:      7f37c7e173fa get_common_indices.constprop.0+0x7a (/usr/lib64/ld-linux-x86-64.so.2)
              sh 13623  7411.707562:     250000 cpu-clock:pppH:      7f37c7e18e39 init_cpu_features.constprop.0+0xcf9 (/usr/lib64/ld-linux-x86-64.so.2)
              sh 13623  7411.707782:     250000 cpu-clock:pppH:      7f37c7e179f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)       ◆
              sh 13623  7411.708069:     250000 cpu-clock:pppH:      7f37c7e17b92 handle_intel.constprop.0+0x52 (/usr/lib64/ld-linux-x86-64.so.2)
              sh 13623  7411.708319:     250000 cpu-clock:pppH:      7f37c7e179f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)       ▒
              sh 13623  7411.708567:     250000 cpu-clock:pppH:      7f37c7e179f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)
              sh 13623  7411.708801:     250000 cpu-clock:pppH:      7f37c7e179f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)       ▒
              sh 13623  7411.710569:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])
              sh 13623  7411.713140:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])
              sh 13623  7411.716419:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])
              sh 13623  7411.720121:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])                         ▒
              sh 13623  7411.721515:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])
              sh 13623  7411.724388:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])
           sleep 13625  7411.727006:     250000 cpu-clock:pppH:      7ff8d52e8170 init_cpu_features.constprop.0+0x30 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7411.727284:     250000 cpu-clock:pppH:      7ff8d52e8e87 init_cpu_features.constprop.0+0xd47 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7411.727571:     250000 cpu-clock:pppH:      7ff8d52e79f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)       ▒
           sleep 13625  7411.727857:     250000 cpu-clock:pppH:      7ff8d52e79f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7411.728062:     250000 cpu-clock:pppH:      7ff8d52e79f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7411.728279:     250000 cpu-clock:pppH:      7ff8d52e7b92 handle_intel.constprop.0+0x52 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7411.728527:     250000 cpu-clock:pppH:      7ff8d52e79f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7411.728775:     250000 cpu-clock:pppH:      7ff8d52e79f4 intel_check_word.constprop.0+0x124 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7411.729058:     250000 cpu-clock:pppH:      7ff8d52e7608 get_common_cache_info.constprop.0+0xe8 (/usr/lib64/ld-linux-x86-64.so.2)
           sleep 13625  7412.230184:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])
              sh 13623  7412.232653:     250000 cpu-clock:pppH:  ffffffffb3116f0b finish_task_switch.isra.0+0xbb ([kernel.kallsyms])

5. Generate Time chart data:
Perf tool even allows you to zoom into kernel calls and get call graphs. To generate time chart data, run the command, run perf with timechart options: perf timechart record <Linux command here>. This generates an SVG file that looks like this:


6. Record kernel memory: use perf kmem (you need to interrupt it to record the data to a file) or you can give “sleep <timeout>” at the end:

[root@centosstream9 ~]# perf kmem --slab record
[ perf record: Woken up 0 times to write data ]
Warning:
Processed 24165116 events and lost 6 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 1416.506 MB perf.data (16131228 samples) ]


[root@centosstream9 ~]# perf kmem stat
Warning:
Processed 24165116 events and lost 6 chunks!
Check IO/CPU overload!

SUMMARY (SLAB allocator)
========================
Total bytes requested: 9,590,832
Total bytes allocated: 9,700,896
Total bytes freed:     6,271,416
Net total bytes allocated: 3,429,480
Total bytes wasted on internal fragmentation: 110,064
Internal fragmentation: 1.134576%
Cross CPU allocations: 129/24,912

7. Record CPU cycles for 10 seconds:
perf record -a -o <perf_data_file> sleep 10


Output:
# Samples: 27K of event 'cpu-clock:pppH'
# Event count (approx.): 6906000000
#
# Overhead  Command          Shared Object         Symbol
# ........  ...............  ....................  .................................................
#
    99.56%  swapper          [kernel.kallsyms]     [k] default_idle
     0.11%  swapper          [kernel.kallsyms]     [k] __softirqentry_text_start
     0.07%  kworker/0:2-eve  [kernel.kallsyms]     [k] vmw_goal_waiter_remove
     0.05%  kworker/0:2-eve  [kernel.kallsyms]     [k] vmw_cmdbuf_header_submit
     0.04%  kworker/0:2-eve  [kernel.kallsyms]     [k] __raw_callee_save___native_queued_spin_unlock
     0.02%  kworker/0:1H-xf  [kernel.kallsyms]     [k] ata_scsi_queuecmd
     0.02%  kworker/0:1-ata  [kernel.kallsyms]     [k] __softirqentry_text_start
     0.02%  kworker/0:1-ata  [kernel.kallsyms]     [k] e1000_update_stats
     0.02%  kworker/0:1H-xf  [kernel.kallsyms]     [k] __softirqentry_text_start
     0.02%  sleep            ld-linux-x86-64.so.2  [.] intel_check_word.constprop.0
     0.01%  kworker/0:1-ata  [kernel.kallsyms]     [k] e1000_clean
     0.01%  kworker/0:2-eve  [kernel.kallsyms]     [k] ata_scsi_queuecmd
     0.01%  kworker/0:1-ata  [kernel.kallsyms]     [k] ata_sff_pio_task
     0.01%  sleep            ld-linux-x86-64.so.2  [.] init_cpu_features.constprop.0
     0.01%  swapper          [kernel.kallsyms]     [k] e1000_clean
     0.00%  kworker/0:2-eve  [kernel.kallsyms]     [k] __softirqentry_text_start
     0.00%  kworker/0:2-eve  [kernel.kallsyms]     [k] drm_mode_object_put
     0.00%  swapper          [kernel.kallsyms]     [k] e1000_alloc_rx_buffers
     0.00%  swapper          [kernel.kallsyms]     [k] e1000_xmit_frame

There are several other use cases of the perf command for which this blog would not be enough. I learned something new with this exploration and I hope you learned something too.

Interested in exploring tools to send concurrent requests to a web server, check out my previous blog post.

Leave a Reply

Your email address will not be published. Required fields are marked *