In my previous post, I discussed how to identify the causes of high CPU usage. In this post, I’ll focus on addressing system bottlenecks. Our production workload runs on 16-core machines, yet the CPU usage has never exceeded 70%. In other words, the system effectively uses no more than 11 cores.

I took some time to eliminate the usual suspects to ensure I wasn’t investigating the wrong issue. In my case, this involved verifying that neither Google Cloud Platform nor Kubernetes (K8S) was throttling the CPU.
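
For the Kubernetes part, one quick sanity check (a sketch, assuming cgroup v1 paths; they differ under cgroup v2) is to read the CFS throttling counters from inside the pod:

# cgroup v1 path shown; under cgroup v2 the same counters live in /sys/fs/cgroup/cpu.stat
cat /sys/fs/cgroup/cpu/cpu.stat
# nr_throttled and throttled_time staying at zero while under load means the
# pod's CPU limit is not what is holding the system back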

Low CPU utilization would normally suggest there aren't enough runnable tasks to keep all cores busy. After confirming that the system was under sufficient load and ruling out application-level issues such as connection pool limits, I was confident that neither was responsible. To identify the root cause, I needed more detailed system-level insight.

lcnt - The Lock Profiler

Erlang offers a tool called lcnt to measure VM-level lock issues. Unfortunately, it is not enabled by default, and Erlang must be compiled with the --enable-lock-counter flag to activate this feature. I couldn’t find any existing Elixir Docker image with this flag enabled, so I ended up building Erlang separately and copying the relevant files.

COPY --from=acme/erlang-lcnt:26.1.2-alpine-3.18.4 \
      /usr/local/lib/erlang/erts-14.1.1/bin/*lcnt* \
     ./backend/erts-14.1.1/bin/
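
Building the image itself isn't covered here, but an image like acme/erlang-lcnt can be produced with a from-source OTP build. A builder stage roughly along these lines should work (versions, package names, and the source URL are illustrative and may need adjusting):

FROM alpine:3.18.4 AS erlang-lcnt

# Build dependencies for OTP; a real image may need a few more packages
RUN apk add --no-cache build-base autoconf perl ncurses-dev openssl-dev

ADD https://github.com/erlang/otp/releases/download/OTP-26.1.2/otp_src_26.1.2.tar.gz /tmp/otp_src.tar.gz

# --enable-lock-counter builds the additional lcnt emulator alongside the default one
RUN tar -xzf /tmp/otp_src.tar.gz -C /tmp \
 && cd /tmp/otp_src_26.1.2 \
 && ./configure --enable-lock-counter \
 && make -j"$(nproc)" \
 && make install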

Building with --enable-lock-counter produces a separate emulator with lock profiling enabled. To run it instead of the default emulator, I had to start the VM with the -emu_type flag.

ERL_FLAGS="-emu_type lcnt" iex -S mix phx.server
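
As a sanity check of my own (not something lcnt requires), you can ask the VM which build it is running; for the lock-counting emulator this should report :lcnt.

iex> :erlang.system_info(:build_type)
:lcnt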

Initially, I was concerned about potential performance degradation, but it turned out there was no noticeable difference between this emulator and the default one in our case.

Lock-related information is collected continuously in the background, much like counters that are incremented over time. The :lcnt.clear() function can be used to reset all these counters to their default state. This is important because, in most cases, you would want to capture this information for a specific period rather than from the start of the VM.

A typical session would look like this:

# Reset the existing metrics
:lcnt.clear()

# Run your workload

# Collect metrics - this captures a snapshot of all counters
:lcnt.collect()

# View the counters snapshot in a tabular format
:lcnt.conflicts()

:lcnt.collect() captures a snapshot of the current metrics, which can be queried in various ways. Below is a sample result from :lcnt.conflicts(), where the data is grouped by lock class.

iex> :lcnt.conflicts()

                       lock    id    #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----   ---   ------- ------------ --------------- ---------- -------------
               db_hash_slot  5696  13510419       113487          0.8400   49404882       11.7761
             alcu_allocator    10   3330003        35775          1.0743   23423967        5.5833
                  run_queue    18 126707580      1178423          0.9300    2939175        0.7006
                  proc_main 10916  40061045       151795          0.3789    2586523        0.6165
                  port_lock  5151  15861626        13096          0.0826     686935        0.1637
                       mseg     1    151821          764          0.5032     382948        0.0913
                   pix_lock  1024     31091           13          0.0418     295868        0.0705
                     db_tab   285  80911441         8026          0.0099     258938        0.0617
                   proc_btm 10916   5601908        10564          0.1886      79911        0.0190
               drv_ev_state   128  12502249         3028          0.0242      76296        0.0182
                proc_status 10916  39289252         7236          0.0184      47459        0.0113
                  proc_msgq 10916  49683469        35247          0.0709      33712        0.0080
            port_sched_lock  5153   9761618         4078          0.0418       9071        0.0022
              process_table     1    177504           73          0.0411       2075        0.0005
 dirty_run_queue_sleep_list     2    448095           24          0.0054         72        0.0000

You can also skip grouping by class and instead list results per lock id. This gives a clearer view of which specific locks are consuming the most time.

iex> :lcnt.conflicts(combine: false, print: [:name, :id, :tries, :ratio, :time, :duration])

           lock                       id  #tries  collisions [%]  time [us]  duration [%]
          -----                      --- ------- --------------- ---------- -------------
 alcu_allocator              eheap_alloc  366897          8.8891   23322582        5.5591
   db_hash_slot  prometheus_metrics_dist 1897002          3.7327   17451758        4.1598
   db_hash_slot  prometheus_metrics_dist  898806          1.4163    6976784        1.6630
   db_hash_slot  prometheus_metrics_dist 1589817          0.5860    6691033        1.5949
   db_hash_slot  prometheus_metrics_dist  864521          0.2190    6078246        1.4488
   db_hash_slot  prometheus_metrics_dist 1000479          0.3126    5075027        1.2097
   db_hash_slot  prometheus_metrics_dist 1083966          0.3056    4273343        1.0186
      proc_main  <user@127.0.0.1.4901.0> 2832033          2.3855    1229493        0.2931
   db_hash_slot         shards_partition    3947          0.7601     908902        0.2166
           mseg                        0  151821          0.5032     382948        0.0913
   db_hash_slot         shards_partition    1588          0.3778     236682        0.0564
      run_queue                        8 8737378          0.9529     211859        0.0505
      run_queue                        1 8963035          0.9492     207612        0.0495
      run_queue                        7 8751836          0.9598     204747        0.0488
      run_queue                        3 8652748          0.9513     202607        0.0483
      run_queue                       11 7993632          0.9512     198954        0.0474
      run_queue                        9 8624551          0.9514     194698        0.0464
      proc_main          user_drv_writer  461096          0.1091     192408        0.0459
      run_queue                       10 8350881          0.9302     188309        0.0449
      run_queue                       14 6502914          0.8891     186478        0.0444

:lcnt.information() displays aggregate statistics for the whole VM. Below is a sample response from a different machine.

iex> :lcnt.information()

information:
              #locks : 20959
            duration : 45751741 us (45.7517 s)

summated stats:
              #tries : 8848260
              #colls : 7961
           wait time : 7111 us ( 0.0071 s)
 percent of duration : 0.0155 %
:ok

I started with alcu_allocator because it was one of the two lock classes with the most contention. The :lcnt.inspect/1 function can be used to drill into a specific lock class.

iex> :lcnt.inspect(:alcu_allocator)

           lock            id  #tries  #collisions  collisions [%]  time [us]  duration [%] histogram [log2(us)]
          -----           --- ------- ------------ --------------- ---------- ------------- ---------------------
 alcu_allocator   eheap_alloc  366897        32614          8.8891   23322582        5.5591 |     ..x.......xxXXXXx...     |
 alcu_allocator      sl_alloc 1829947         3044          0.1663      50913        0.0121 |    .X...xx...........        |
 alcu_allocator    temp_alloc   12700           92          0.7244      49736        0.0119 |             ..xXXXXX..       |
 alcu_allocator     fix_alloc  183316            8          0.0044        425        0.0001 |       X  X     x x           |
 alcu_allocator     ets_alloc  182897            4          0.0022        127        0.0000 |        X    XX X             |
 alcu_allocator     std_alloc  182897            3          0.0016         99        0.0000 |       X      X X             |
 alcu_allocator  binary_alloc  204975            8          0.0039         85        0.0000 |      X Xx x  xx              |
 alcu_allocator  driver_alloc  182897            1          0.0005          0        0.0000 |        X                     |
 alcu_allocator literal_alloc     579            0          0.0000          0        0.0000 |                              |
 alcu_allocator      ll_alloc  182898            1          0.0005          0        0.0000 |         X                    |

The contention seemed to come from one specific lock. Passing a {class, id} tuple narrows the inspection to a single lock, and with the locations option set, it also prints the exact source locations where the lock is acquired.

iex> :lcnt.inspect({:alcu_allocator, :eheap_alloc}, locations: true, combine: true)

                     location  #tries  #collisions  collisions [%]  time [us]  duration [%] histogram [log2(us)]
                    --------- ------- ------------ --------------- ---------- ------------- ---------------------
 'beam/erl_alloc_util.c':1809   16013         1160          7.2441    1046038        0.9328 |      .........xxXXXXx...     |
 'beam/erl_alloc_util.c':2081   28164         1142          4.0548     868892        0.7748 |     ..........xxXXXx....     |
 'beam/erl_alloc_util.c':6093   19881         1307          6.5741     831003        0.7410 |     .xX.... ..xxXXxx...      |
 'beam/erl_alloc_util.c':6559    7331          755         10.2987     242410        0.2162 |      .Xx......Xxxx.....      |
                  undefined:0     154            6          3.8961       3945        0.0035 |               x  Xxx         |

I spent some time reviewing the available documentation on Erlang's memory allocators. Erlang uses a separate allocator instance per scheduler, so memory allocation calls should generally not collide. Erlang also uses dirty schedulers to handle various tasks, including garbage collection (GC). Unlike normal schedulers, dirty schedulers don't each get their own allocator instance; they all share a single one. While digging through the beam/erl_alloc_util.c source code, I found that several of the contended call sites were related to memory deallocation, which fits the theory that the contention arises from GC activity on the dirty schedulers. The results from msacc also showed a significant amount of CPU time being spent on garbage collection on the dirty schedulers.

iex> :msacc.start(1000); :msacc.print()

Average thread real-time    :  1000961 us
Accumulated system run-time : 10992566 us
Average scheduler run-time  :   596752 us

        Thread emulator     port      aux check_io       gc    other    sleep

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
       aux( 1)    0.00%    0.00%    0.44%    0.01%    0.00%    0.07%   99.48%
dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 2)    0.37%    0.00%    0.00%    0.00%   35.73%    0.04%   63.86%
dirty_cpu_( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 5)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 7)    0.65%    0.00%    0.00%    0.00%   26.59%    0.03%   72.73%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_( 9)    0.23%    0.00%    0.00%    0.00%   28.08%    0.03%   71.66%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(11)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(13)    0.15%    0.00%    0.00%    0.00%   21.61%    0.02%   78.23%
dirty_cpu_(14)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(15)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_cpu_(16)    0.47%    0.00%    0.00%    0.00%   13.26%    0.02%   86.24%
dirty_io_s( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 5)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
dirty_io_s( 9)    2.24%    0.00%    0.00%    0.00%    0.00%    0.14%   97.61%
dirty_io_s(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
      poll( 0)    0.00%    0.00%    0.00%    5.88%    0.00%    0.00%   94.12%
 scheduler( 1)   56.34%    4.50%    2.80%    0.68%    1.98%    1.92%   31.78%
 scheduler( 2)   57.17%    4.20%    2.59%    0.66%    1.99%    2.23%   31.16%
 scheduler( 3)   63.33%    4.09%    2.57%    0.66%    1.95%    1.75%   25.65%
 scheduler( 4)   57.13%    3.94%    2.52%    0.70%    2.17%    1.73%   31.81%
 scheduler( 5)   58.30%    3.80%    2.40%    0.59%    1.96%    1.66%   31.29%
 scheduler( 6)   55.95%    4.18%    2.58%    0.69%    2.30%    1.76%   32.53%
 scheduler( 7)   57.02%    4.54%    2.77%    0.70%    2.26%    1.89%   30.81%
 scheduler( 8)   54.80%    4.14%    2.72%    0.66%    2.13%    1.82%   33.73%
 scheduler( 9)   58.94%    3.91%    2.70%    0.62%    1.59%    1.91%   30.33%
 scheduler(10)   57.23%    4.22%    2.62%    0.67%    2.03%    2.13%   31.10%
 scheduler(11)   56.40%    4.20%    2.61%    0.68%    1.87%    1.95%   32.29%
 scheduler(12)   55.24%    4.02%    2.59%    0.63%    2.10%    1.94%   33.48%
 scheduler(13)   54.48%    4.75%    2.63%    0.75%    2.17%    2.07%   33.15%
 scheduler(14)   41.82%    3.37%    1.99%    0.56%    1.34%    1.49%   49.43%
 scheduler(15)    9.51%    0.47%    0.87%    0.19%    0.27%    0.58%   88.12%
 scheduler(16)    0.00%    0.00%    0.33%    0.06%    0.00%    0.25%   99.36%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%
           aux    0.00%    0.00%    0.44%    0.01%    0.00%    0.07%   99.48%
dirty_cpu_sche    0.12%    0.00%    0.00%    0.00%    7.83%    0.01%   92.05%
dirty_io_sched    0.22%    0.00%    0.00%    0.00%    0.00%    0.01%   99.76%
          poll    0.00%    0.00%    0.00%    5.88%    0.00%    0.00%   94.12%
     scheduler   49.60%    3.65%    2.33%    0.59%    1.76%    1.69%   40.38%

Fortunately, this issue had already been addressed in OTP 24.2+, behind a flag named +Mdai. Setting it to max gives each dirty scheduler an independent allocator instance.

I also experimented with the super carrier by pre-allocating 32 GB of RAM. This shifted the CPU usage split: system CPU time dropped noticeably relative to user CPU time. The number of page faults also decreased, which I assume accounts for the reduction in system time. Overall, these changes seemed to save more than 10% of CPU time, though it's hard to quantify exactly because several variables changed at once. We ultimately settled on the configuration: +Mdai max +MMscs 32768 +MMsco false.
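
For reference, in a mix release these emulator flags would typically go into the release's vm.args; the annotated sketch below reflects my reading of what each flag does.

## One allocator instance per dirty scheduler (available since OTP 24.2)
+Mdai max

## Pre-allocate a 32 GB super carrier; +MMscs takes the size in MB
+MMscs 32768

## Allow carrier allocations outside the super carrier if it fills up
## (the default, true, restricts mseg to the super carrier only)
+MMsco false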

Once the allocator fix was in place, contention on eheap_alloc was mostly gone, and under high load the wait time on db_hash_slot started to climb instead. That's the trouble with locks: remove one bottleneck and the next one takes its place.

iex> :lcnt.conflicts()

                       lock    id    #tries  #collisions  collisions [%]  time [us]  duration [%]
                      -----   ---   ------- ------------ --------------- ---------- -------------
               db_hash_slot  6720  14550391       207204          1.4240  104478530       36.5303
                  run_queue    18 149169186      1846889          1.2381   15757389        5.5095
                  proc_main 17659  47119883       308688          0.6551   13635415        4.7675
             alcu_allocator   154   5528359         7396          0.1338    3241577        1.1334
                  erts_mmap     2   1678465       210117         12.5184    2434174        0.8511
                   pix_lock  1024     64208           31          0.0483    1130359        0.3952
                  port_lock  8538  17741611        14465          0.0815     767397        0.2683
                proc_status 17659  47387203        23613          0.0498     247990        0.0867
                     db_tab   302  89190210         8624          0.0097     240916        0.0842
                   proc_btm 17659   6399333        19711          0.3080     234541        0.0820
               drv_ev_state   128  13943321         7285          0.0522     213674        0.0747
                       mseg    17    357701           68          0.0190     141553        0.0495
                  proc_msgq 17659  58578218        50508          0.0862      66442        0.0232
            port_sched_lock  8540  10925389         4531          0.0415      14092        0.0049
              process_table     1    193629          124          0.0640       1093        0.0004
                   atom_tab     1 151663081           21          0.0000         92        0.0000
 dirty_run_queue_sleep_list     2    550178           58          0.0105         43        0.0000
              dirty_gc_info     1     83698            9          0.0108          4        0.0000
:ok

Simply looking at :lcnt.inspect(:db_hash_slot) immediately revealed the problem. We were using telemetry_metrics_prometheus, which relies on a single ets table. While ets supports concurrent read and write operations, a single table became a bottleneck under high load.

iex> :lcnt.inspect(:db_hash_slot)

         lock                      id  #tries  #collisions  collisions [%]  time [us]  duration [%] histogram [log2(us)]
        -----                     --- ------- ------------ --------------- ---------- ------------- ---------------------
 db_hash_slot prometheus_metrics_dist 1897002        70810          3.7327   17451758        4.1598 |     .....xXX...............  |
 db_hash_slot prometheus_metrics_dist  898806        12730          1.4163    6976784        1.6630 |      ....xxX..............   |
 db_hash_slot prometheus_metrics_dist 1589817         9317          0.5860    6691033        1.5949 |     .xxXXxx................  |
 db_hash_slot prometheus_metrics_dist  864521         1893          0.2190    6078246        1.4488 |     ...xxX.x..............   |
 db_hash_slot prometheus_metrics_dist 1000479         3127          0.3126    5075027        1.2097 |     .xxXXXx......... .....   |
 db_hash_slot prometheus_metrics_dist 1083966         3313          0.3056    4273343        1.0186 |     ..xXXx................   |
 db_hash_slot        shards_partition    3947           30          0.7601     908902        0.2166 |                 .  . ..XX.   |
 db_hash_slot        shards_partition    1588            6          0.3778     236682        0.0564 |                       X Xx   |
 db_hash_slot        shards_partition     400            4          1.0000     163255        0.0389 |                      X XXX   |
 db_hash_slot        shards_partition    1380            6          0.4348     136208        0.0325 |                       xX     |
 db_hash_slot prometheus_metrics_dist  124562          150          0.1204     119496        0.0285 |      .....xX... .   ...      |
 db_hash_slot prometheus_metrics_dist  269110          316          0.1174      95729        0.0228 |     .xxXXx........ . ...     |
 db_hash_slot prometheus_metrics_dist  105175           68          0.0647      90335        0.0215 |      xxXXXxx..  x..xXxx      |
 db_hash_slot        shards_partition     400            1          0.2500      80775        0.0193 |                          X   |
 db_hash_slot      prometheus_metrics 1085784        10155          0.9353      74510        0.0178 |     .....X.........          |
 db_hash_slot        shards_partition    1191            8          0.6717      70953        0.0169 |                   x XxxX     |
 db_hash_slot        shards_partition     351            2          0.5698      49079        0.0117 |                        X     |
 db_hash_slot        shards_partition    1141            5          0.4382      47324        0.0113 |                    x xX      |
 db_hash_slot prometheus_metrics_dist   74890           53          0.0708      45480        0.0108 |      ...Xx.Xx..... ..x.      |
 db_hash_slot        shards_partition     368            1          0.2717      43281        0.0103 |                         X    |
:ok

Fortunately, someone else had faced the exact same problem and solved it for us. The solution was to use one ets table per scheduler. Once we switched to the peep library, the db_hash_slot no longer appeared as the top bottleneck.
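
For intuition, here is a minimal sketch of the striping idea (not peep's actual implementation): one ets table per scheduler, writes routed to the table of whichever scheduler the caller happens to be running on, reads summed across all tables.

defmodule StripedCounters do
  # Create one public ETS table per scheduler. Concurrent writers rarely
  # collide because each scheduler's processes write to their "own" table.
  def new(base_name) do
    for i <- 1..:erlang.system_info(:schedulers) do
      :ets.new(table_name(base_name, i), [:named_table, :public, :set, write_concurrency: true])
    end
  end

  # Route the increment to the table of the scheduler we are currently on.
  def incr(base_name, key, by \\ 1) do
    table = table_name(base_name, :erlang.system_info(:scheduler_id))
    :ets.update_counter(table, key, by, {key, 0})
  end

  # Reads pay the price instead: they have to sum the counter across tables.
  def read(base_name, key) do
    Enum.reduce(1..:erlang.system_info(:schedulers), 0, fn i, acc ->
      case :ets.lookup(table_name(base_name, i), key) do
        [{^key, count}] -> acc + count
        [] -> acc
      end
    end)
  end

  defp table_name(base_name, i), do: :"#{base_name}_#{i}"
end

Writes stay cheap and spread across many tables (and therefore many db_hash_slot locks), while reads do the merging, which is the right trade-off for metrics that are only scraped every few seconds. After the switch, :lcnt.conflicts() looked like this: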

iex> :lcnt.conflicts()

              lock   id    #tries  #collisions  collisions [%]  time [us]  duration [%]
             -----  ---   ------- ------------ --------------- ---------- -------------
         run_queue   18 156178443      1754050          1.1231    5528276        1.1750
<127.0.0.1.4911.0>    5  21670879       152885          0.7055    2467737        0.5245
    alcu_allocator  154   4515554         2227          0.0493    1724497        0.3665
      db_hash_slot 7616  16602840           85          0.0005     811074        0.1724
         erts_mmap    2   1041183       100172          9.6210     792307        0.1684
          pix_lock 1024     17687           19          0.1074     347020        0.0738
   user_drv_writer    5   2068081         7754          0.3749     210819        0.0448
            db_tab  316 101504868        11294          0.0111     200130        0.0425
      drv_ev_state  128  16028857         6865          0.0428     143241        0.0304
   redis_shard_0_5    5   6741258        12497          0.1854      93468        0.0199
   redis_shard_0_1    5   6722557        12383          0.1842      92689        0.0197
   redis_shard_0_3    5   6745599        12483          0.1851      89651        0.0191
#Port<127.0.0.1.0>    1    990558         2325          0.2347      89150        0.0189
   redis_shard_0_4    5   6713662        12269          0.1827      88936        0.0189
   redis_shard_0_7    5   6759899        12583          0.1861      86267        0.0183
   redis_shard_0_2    5   6736890        12375          0.1837      82596        0.0176
#Port<127.0.0.1.0>    1    993415         2350          0.2366      82142        0.0175
   redis_shard_0_6    5   6743413        12384          0.1836      80555        0.0171
#Port<127.0.0.1.0>    1    994243         2220          0.2233      80386        0.0171
#Port<127.0.0.1.0>    1    994250         2323          0.2336      79686        0.0169

At this point, we were able to reach up to 95% CPU utilization. Given the improvements and the limited potential for further optimization, I decided to call it a day and not investigate further.