Command memlat is a web-based browser for memory load latency profiles.
Memory stalls and conflicts are increasingly important for software performance. Memory load latency profiles can give deep insights into these problems; however, the richness of these profiles makes them difficult to interpret using traditional profiling tools and techniques.
memlat is a profile browser built for understanding and interpreting memory load latency profiles. The central concept is a "latency distribution", which is a statistical distribution of the number of cycles spent in memory load or store operations. For example, if there are 10 loads that take 10 cycles and 2 loads that take 100 cycles, the latency distribution consists of a 100 cycle spike at 10 cycles (10 loads × 10 cycles) and a 200 cycle spike at 100 cycles (2 loads × 100 cycles). The total weight of this distribution, 300 cycles in this example, accounts for the total cycles spent waiting on memory loads or stores.
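The arithmetic in the example above can be sketched in a few lines of Go. This is only an illustration of the definition, not memlat's implementation; the names latencyDistribution and totalWeight are made up for this sketch, and it assumes each sample is a single load latency in cycles.

```go
package main

import (
	"fmt"
	"sort"
)

// latencyDistribution maps each observed latency (in cycles) to the total
// cycles spent in loads of that latency, i.e. count * latency.
// Hypothetical helper for illustration; not part of memlat.
func latencyDistribution(latencies []int) map[int]int {
	dist := make(map[int]int)
	for _, l := range latencies {
		dist[l] += l
	}
	return dist
}

// totalWeight sums the weight of every spike in the distribution.
func totalWeight(dist map[int]int) int {
	total := 0
	for _, w := range dist {
		total += w
	}
	return total
}

func main() {
	// 10 loads that take 10 cycles and 2 loads that take 100 cycles.
	var samples []int
	for i := 0; i < 10; i++ {
		samples = append(samples, 10)
	}
	samples = append(samples, 100, 100)

	dist := latencyDistribution(samples)

	// Print the spikes in order of increasing latency.
	keys := make([]int, 0, len(dist))
	for k := range dist {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for _, k := range keys {
		fmt.Printf("%d cycles: weight %d\n", k, dist[k])
	}
	fmt.Println("total:", totalWeight(dist))
}
```

Running this prints a weight of 100 at 10 cycles, a weight of 200 at 100 cycles, and a total of 300.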
memlat presents a profile as a multidimensional latency distribution and provides tools for viewing and filtering this distribution on each dimension, such as by function, by source line, by data source (L1 hit, TLB miss, etc.), by address, etc. Each tab in the UI browses the profile on a different dimension, and clicking on a row filters the profile down to just that function, source line, etc. An active filter can be removed by clicking on it in the filter bar at the top.
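One way to picture this is as a list of samples with one field per dimension, where a filter is a predicate on a sample and each tab sums latency grouped by one field. The following is a minimal sketch under that assumption. The sample type, applyFilters, and weightBy are hypothetical names for illustration, and memlat's internal representation may well differ.

```go
package main

import "fmt"

// sample is a hypothetical record with one field per dimension.
// memlat's internal representation may differ.
type sample struct {
	Function   string
	Line       int
	DataSource string
	Address    uint64
	Latency    int // cycles
}

// filter reports whether a sample passes one active filter.
type filter func(sample) bool

// applyFilters keeps only the samples accepted by every active filter.
// With no filters, every sample is kept.
func applyFilters(samples []sample, filters ...filter) []sample {
	var out []sample
	for _, s := range samples {
		keep := true
		for _, f := range filters {
			if !f(s) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, s)
		}
	}
	return out
}

// weightBy sums latency for each value of one dimension, which is
// roughly what a single tab would display.
func weightBy(samples []sample, key func(sample) string) map[string]int {
	weights := make(map[string]int)
	for _, s := range samples {
		weights[key(s)] += s.Latency
	}
	return weights
}

func main() {
	// Made-up data for illustration only.
	profile := []sample{
		{"parse", 12, "L1 hit", 0x1000, 10},
		{"parse", 14, "LLC miss", 0x2000, 100},
		{"emit", 30, "LLC miss", 0x2000, 100},
		{"emit", 31, "L1 hit", 0x3000, 10},
	}

	// "Click" the parse row, then view the result by data source.
	inParse := func(s sample) bool { return s.Function == "parse" }
	filtered := applyFilters(profile, inParse)
	bySource := weightBy(filtered, func(s sample) string { return s.DataSource })
	fmt.Println(bySource["L1 hit"], bySource["LLC miss"])
}
```

Removing a filter corresponds to calling applyFilters again without that predicate, so the remaining filters apply to the full profile.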
For example, suppose we want to understand the primary source of memory latency in a profile. Select the "By source line" tab and click on the top row to filter to the source line that contributed the most total memory latency. You can select the "Source annotation" tab to see the text of this line. To drill down to a particular memory address, select the "By address" tab to see the memory addresses touched by this source line. Click the top one to further filter to the hottest address touched by this source line. Then, click the source line filter in the filter bar at the top to remove the source line filter. Finally, select the "Source annotation" tab to see the other source code lines that touch this hot address.
Note that the latency reported by the hardware is the time from instruction issue to retire. Hence, a "fast load" (say, an L1 hit) that happens immediately after a "slow load" (say, an LLC miss) will have a high reported latency because it has to wait for the slow load to retire, even though the actual memory operation for the fast load is fast.
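A toy model makes this effect concrete. The sketch below assumes strictly in-order retirement and ignores everything else a real core does, so it is an approximation for intuition rather than a description of any particular processor. The load type, its fields, and the cycle counts are made up for the example.

```go
package main

import "fmt"

// load is a hypothetical description of one load instruction.
type load struct {
	Name      string
	Issue     int // cycle at which the load issues
	MemCycles int // cycles the memory operation itself takes
}

// reportedLatencies returns the issue-to-retire latency of each load,
// assuming loads retire in program order: a load cannot retire before
// the load ahead of it has retired.
func reportedLatencies(loads []load) []int {
	out := make([]int, len(loads))
	prevRetire := 0
	for i, l := range loads {
		retire := l.Issue + l.MemCycles
		if retire < prevRetire {
			retire = prevRetire // stuck behind the earlier load
		}
		out[i] = retire - l.Issue
		prevRetire = retire
	}
	return out
}

func main() {
	loads := []load{
		{"LLC miss", 0, 200},
		{"L1 hit", 1, 4},
	}
	for i, lat := range reportedLatencies(loads) {
		fmt.Printf("%s: memory %d cycles, reported %d cycles\n",
			loads[i].Name, loads[i].MemCycles, lat)
	}
}
```

In this model the L1 hit finishes its memory operation in 4 cycles but is reported at 199 cycles, because it cannot retire until the LLC miss ahead of it retires at cycle 200.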
To download and install memlat, run
go get github.com/aclements/go-perf/cmd/memlat
memlat works with the memory load latency profiles recorded by the Linux perf tool. This requires hardware support that has been available since Intel Nehalem. To record a memory latency profile, use perf's "mem" subcommand. For example,
perf mem record <command>  # Record a memory profile for command
perf mem record -a         # Record a system-wide memory profile
This will write the profile to a file called perf.data. Then, simply start memlat.
memlat will parse and symbolize the profile and start a web server listening by default on localhost:8001.