I recently made some optimizations at work that saved more than 40% CPU at peak time. In this post, I walk through how I identified and fixed the bottleneck.

Preparation

The first step is to identify the bottleneck; there may or may not be a single big one in a system. It’s usually not straightforward to replicate the production workload on a staging or developer machine, so instead of trying, I focused on profiling the system in production without disrupting live workloads.

Erlang provides a range of tools to find memory or CPU bottlenecks. However, most of them are only suitable for a developer machine because of their performance overhead. Erlang also has built-in support for the Linux profiler perf; I added +JPperf true to the rel/vm.args.eex file to enable it.
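For reference, enabling it is a one-line change in rel/vm.args.eex (the comment is mine):

# Emit JIT symbol information so perf can resolve BEAM frames
+JPperf true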

Data collection

sudo perf record -g -F 99 --pid 607343 --call-graph=fp -- sleep 10
sudo perf script --header > out.stacks

Let’s break down the steps. The first command, perf record, collects samples for 10 seconds (the -- sleep 10 part) from the BEAM process given by --pid, and the second command reformats the collected data into a text file. -F controls the sampling frequency; the default is higher (in the thousands), but I typically lower it to 99 on a production system to reduce the profiling overhead.

~/FlameGraph/stackcollapse-perf.pl < /tmp/out.stacks | \
  sed -e 's/^[0-9]\+_//' -e 's/^erts_\([^_]\+\)_[0-9]\+/erts_\1/' | \
  ~/FlameGraph/flamegraph.pl --hash > /tmp/flame.svg

Copy the out.stacks file to your local machine and run Brendan Gregg’s FlameGraph scripts, as above, to convert it into a flamegraph (flame.svg).

Hypothesis

The percentage of the Elixir stack was so small that it didn’t show up anywhere unless I zoomed in. There were also a lot of [unknown] stacks, which I haven’t yet figured out how to fix. Playing around with the flamegraph for a bit, I noticed binary_to_term_1 with 3.02% usage, and another stack with exactly the same children, just with the binary_to_term_1 frame missing. Together they added up to 31.87%. I assumed both were the same code path and started looking through our codebase for binary_to_term usage.

$global::call_light_bif_shared > binary_to_term_1 > [beam.smp] > [beam.smp] > erts_atom_put > erts_atom_put_index
$global::call_bif_shared > beam_jit_call_bif > [beam.smp] > [beam.smp] > erts_atom_put > erts_atom_put_index

We use Redis heavily for our caching use cases, with :erlang.term_to_binary(term, compressed: 6) for serialization and :erlang.binary_to_term(data) for deserialization. Switching to another serialization format was out of the question, since it’s unlikely to be faster than the native one.
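To make the existing setup concrete, here is a minimal sketch of the read and write path; the Redix connection and the Cache.Redis module name are assumptions for illustration, while the serialization calls are the ones quoted above.

defmodule Cache.Redis do
  # Serialize with the native term format, compressed, before writing to Redis.
  def put(conn, key, term) do
    data = :erlang.term_to_binary(term, compressed: 6)
    Redix.command!(conn, ["SET", key, data])
  end

  # Deserialize on every read; this is where binary_to_term burns CPU.
  def get(conn, key) do
    case Redix.command!(conn, ["GET", key]) do
      nil -> :miss
      data -> {:ok, :erlang.binary_to_term(data)}
    end
  end
end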

Using an in-memory cache is another viable approach, and an ETS-based cache is one of the most straightforward ways to get there. But there is one caveat: unlike many other runtimes, Erlang uses a per-process heap, so whenever you fetch something from ETS, the term is copied into the current process’s heap. The question is: how does that copying compare with binary_to_term?
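As a quick illustration of the copy semantics (the table and key names here are just placeholders):

# Terms stored in ETS live in the table's own memory, outside any process heap;
# :ets.lookup/2 copies the matching object onto the calling process's heap.
big_term = for i <- 1..500_000, do: {i, "value-#{i}"}
table = :ets.new(:demo_cache, [:set, :public, read_concurrency: true])
:ets.insert(table, {"big_key", big_term})
[{"big_key", copy}] = :ets.lookup(table, "big_key")  # copy is a fresh copy on our heap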

I ran a local benchmark for a very large object (7 MB compressed), and the results were impressive. I chose a large object to make sure the deserialization would be the bottleneck.

Name            ips        average  deviation         median         99th %
ets           11.77       84.93 ms     ±7.99%       83.92 ms      111.97 ms
redis          1.83      546.06 ms     ±5.44%      537.49 ms      626.95 ms

Comparison:
ets           11.77
redis          1.83 - 6.43x slower +461.13 ms
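The table above looks like Benchee output; here is a sketch of how such a comparison could be set up (the test term, the local Redis instance, and the Cache.Redis helper from the earlier sketch are all assumptions):

# Compare reading a large cached term from ETS vs. fetching and deserializing
# it from Redis. Assumes a Redis server on localhost and the Benchee package.
{:ok, conn} = Redix.start_link("redis://localhost:6379")
large_term = for i <- 1..500_000, do: {i, "value-#{i}"}

:ets.new(:bench_cache, [:set, :named_table, :public, read_concurrency: true])
:ets.insert(:bench_cache, {"big_key", large_term})
Cache.Redis.put(conn, "big_key", large_term)

Benchee.run(%{
  "ets" => fn ->
    [{_, term}] = :ets.lookup(:bench_cache, "big_key")
    term
  end,
  "redis" => fn ->
    {:ok, term} = Cache.Redis.get(conn, "big_key")
    term
  end
})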

Experiment

[Figure: CPU usage]

I pushed a buggy version on day 1 that didn’t properly backfill the in-memory cache on a cache miss. Once the fix landed, we got a CPU reduction of more than 40%.

I regenerated the flamegraph, and as expected, the binary_to_term_1 function is nearly gone from the graph. Of course, we now get ets_lookup_2 at 11.92%, but that is much better than 31.87%. We also created the ETS table with the :compressed option, which stores the cached terms more compactly (see the sketch below). Even though the profiled stacks only account for roughly a 20-point reduction, the net CPU difference is 40%. I assume the other 20% comes from reduced garbage-collection pressure, since we no longer keep creating duplicate binaries every time we deserialize.

We still use Redis, with the in-memory cache as another layer in front of it. We use versioned cache keys, so invalidating the in-memory cache is not really an issue.
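To make the layering concrete, here is a minimal sketch of the read path, reusing the hypothetical Cache.Redis module from above; it’s not our exact implementation, just the shape of it: check ETS first, fall back to Redis on a miss, and backfill the ETS table.

defmodule Cache.Layered do
  @table :my_cache

  # Created once at startup; :compressed stores the terms more compactly.
  def init do
    :ets.new(@table, [:set, :named_table, :public, :compressed, read_concurrency: true])
  end

  # Keys are versioned, so stale in-memory entries are simply never read again
  # and this layer needs no explicit invalidation.
  def get(conn, versioned_key) do
    case :ets.lookup(@table, versioned_key) do
      [{^versioned_key, term}] ->
        {:ok, term}

      [] ->
        with {:ok, term} <- Cache.Redis.get(conn, versioned_key) do
          # Backfill the in-memory layer on a miss (the step the buggy day-1 version skipped).
          :ets.insert(@table, {versioned_key, term})
          {:ok, term}
        end
    end
  end
end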