Chasing an IO Phantom

My home server has been acting weird for months: it occasionally becomes unresponsive. It is annoying, but it happens only rarely, so normally I'd just wait or reboot it. A few weeks ago I decided to get to the bottom of it.


What's Wrong

My system setup is:
  • Root: SSD, LUKS + LVM + Ext4
  • Data: HDD, LUKS + ZFS
  • 16GB RAM + 1GB swap
  • Rootless dockerd

The system may become unresponsive when IO on the HDD stays high for a while. Also:
  • Often kswapd0 has high CPU
  • High IO on root fs (SSD)
    • From dockerd and some containers
  • RAM usage is high, swap usage is low
It is very strange that IO on the HDD can affect the SSD. Note that when this happens, even stopping the IO on the HDD does not always help.

Usually restarting dockerd does not help, but rebooting does.


Investigation: Swap

An obvious potential root cause is the swap. High CPU usage from kswapd0 usually means that free memory is low and the kernel is busy reclaiming memory, e.g. by swapping pages out to disk.

However, I tried the following steps, and none of them helped (a sketch of the commands is below):
  • Disable swap
    • swapoff -a
    • remove the entries in /etc/fstab
  • Add more swap
  • Adjust vm.swappiness
While kswapd0 still looked suspicious, I decided that swap was irrelevant.
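
For reference, the attempts amounted to roughly the following; the sizes and the swappiness value are placeholders, not my exact configuration:

```
# Disable swap entirely (also comment out the swap entry in /etc/fstab to persist)
sudo swapoff -a

# Or keep swap but make the kernel less eager to use it (default is 60)
sudo sysctl vm.swappiness=10

# Or add an extra swap file temporarily
sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
```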


Investigation: Cache

`echo 3 > /proc/sys/vm/drop_caches` sometimes helps, sometimes not.
`free` does not show a high number of buffer/cache, so it seems irrelevant.
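
For completeness, the check amounted to something like this:

```
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h    # buff/cache stays low, which (misleadingly) suggested caching was not the issue
```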

Investigation: Individual Docker Containers

`docker stats` shows that some docker containers have higher IO than others. Usually those are related to databases.

Note that `docker stats` is known to be inaccurate when there are mapped devices, bind mounts etc. References: 1, 2. However, even with that in mind, the IO numbers are still too high.
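
To narrow things down, it helps to compare Docker's per-container numbers with the kernel's own cgroup accounting. A sketch; the cgroup path is an assumption for rootless Docker on cgroup v2 and varies by distro, UID and container ID:

```
# Docker's view of per-container block IO
docker stats --no-stream --format 'table {{.Name}}\t{{.BlockIO}}'

# Kernel's view for one container (path is an assumption; adjust UID and CONTAINER_ID)
cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/docker-CONTAINER_ID.scope/io.stat
```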

So I chose one container and examined whether it is due to a faulty image.

Well, I tried a few modifications, but none of them helped:
  • Disable unnecessary scheduled tasks that may read the disk.
  • Move container volumes from SSD to HDD.
  • Stop other containers.
Eventually I concluded that this was not a problem with individual docker containers.


Investigation: Docker

So I suspected there might be something wrong with Docker itself, but I couldn't be 100% sure, because there are almost no other processes running on the server to compare against.

IO Scheduler

I found some seemingly relevant discussions: 1, 2, 3. They mention tuning the disk IO schedulers.

I checked that `mq-deadline` is enabled for the SSD. The SSD is M.2 NVMe but connected to the motherboard via a USB enclosure, so it is recognized as sdX. According to 1 and 2, it is better to use `none` for SSDs. I tried `echo none > /sys/block/sdX/queue/scheduler`, but it didn't help.
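
For reference, checking and switching the scheduler looks like this (sdX is a placeholder; the change does not persist across reboots):

```
# The active scheduler is shown in brackets
cat /sys/block/sdX/queue/scheduler
# e.g. [mq-deadline] none

# Switch to none for this session
echo none | sudo tee /sys/block/sdX/queue/scheduler
```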

Docker Configuration

I learned about issues related to logging: 1, 2. But they were not helpful.

Currently the data-root is on the SSD, which contains images, the overlay filesystems, etc. So I tried moving it onto the HDD.

It kind of helped: the IO is still high, but only for dockerd, not for the containers. And again, the high IO can only be triggered by high IO on the HDD, not on the SSD.

This seemed to reveal something, but I didn't realize what at that moment.

The Dockerd Process

Using `strace` I was able to reveal more information (the invocations are sketched below):
  • `strace -e trace=file` shows a flood of SIGURG signals, which correlates clearly with the IO issue.
    • There was a relevant bug about overlayfs for rootless Docker, but it had been fixed.
  • `strace` shows lots of futex timeouts.
    • I found a few discussions: 1, 2, 3 and 4. But they were not helpful.
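
A sketch of the invocations, assuming the daemon's PID is in DOCKERD_PID (for rootless dockerd, sudo may not be needed):

```
# Trace file-related syscalls of the running daemon
strace -f -e trace=file -p "$DOCKERD_PID"

# Summarize syscall counts and time spent, which makes futex timeouts stand out
strace -f -c -p "$DOCKERD_PID"
```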

/proc/DOCKERD_PID/stack shows that dockerd is often blocked at:
  • filemap_fault
  • do_fault
  • handle_mm_fault
  • epoll_wait
  • sys_poll
This suggests that maybe dockerd did something wrong with mmap. While several tools can confirm that the number of page faults is very high, none can tell me which files caused the page faults.
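
The fault counters and the kernel stack can be read like this (DOCKERD_PID is a placeholder):

```
# Minor/major page fault counts for the process
ps -o pid,min_flt,maj_flt,cmd -p "$DOCKERD_PID"

# Kernel stack the process is currently blocked in (requires root)
sudo cat /proc/"$DOCKERD_PID"/stack
```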

To find the culprit files, I learned about the `systemtap` tool, which is very powerful and able to reveal the exact address that caused a page fault. References: 1, 2, 3.

Interestingly, `systemtap` didn't just work: there were compilation errors at runtime. It turned out this was due to an API change in the kernel, and I had to install an older kernel. Ref. I also learned to use `grub-reboot` to boot into a specific kernel version.
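
A minimal sketch of the kind of probe that can reveal the faulting address; the exact tapset probe points and variables depend on the kernel and SystemTap versions, and DOCKERD_PID is a placeholder:

```
# Print the faulting address for page faults in the target process
sudo stap -x "$DOCKERD_PID" -e '
probe vm.pagefault {
  if (pid() == target())
    printf("%s: fault at 0x%x (write=%d)\n", execname(), address, write_access)
}'
```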

When the high IO issue was triggered, I verified that the following files caused major page faults in dockerd and containerd.

For dockerd
  • /usr/bin/dockerd
  • /home/user/.local/share/docker/buildkit/history.db
  • /home/user/.local/share/docker/network/files/local-kv.db
For containerd
  • /home/user/.local/share/docker/containerd/daemon/io.containerd.metadata.v1.bolt/meta.db
  • /usr/bin/containerd
Disappointingly, nothing looks suspicious.

Other Theories

I had a couple more theories:
  • Maybe it is related to the USB-to-M.2 adapter/enclosure, and/or to the high temperature of the SSD.
    • I ordered a PCIe-to-M.2 card, but have not received it yet.
  • I have special POSIX ACLs on the docker directory.
    • I moved the directory to a new place without POSIX ACLs. It didn't help.
  • Maybe a rootful dockerd would work better?
  • Is it related to ...?
    • cgroups
    • SSD garbage collection?
    • LUKS
    • LVM
    • LXC
    • Some kernel bug
    • /dev/shm
    • bind mount
  • Some containers may be leaking memory

Catching the Phantom

After all the failed attempts, I reviewed all my notes and drew the following conclusion:

Maybe this is normal after all, and I just need to add more memory. kswapd is busy because of page faults on mmap'ed files. I didn't find anything wrong with mmap itself, so perhaps Linux just decided to evict some mmap'ed pages, such that a page fault is triggered the next time they are read. I guess I was so focused on swap that I overlooked the fact that mmap can also cause major page faults.

The mmap'ed regions are often database files (e.g. Docker BuildKit's), which are frequently accessed by processes. So page faults keep happening, as the kernel keeps dropping the cache.

However some questions still remain unanswered:
  • Why does the kernel decide to drop the cache? I feel the kernel could theoretically handle it better, especially since the swap is almost unused.
  • Does it really need to consistently read the disk at 400MB/s?
  • Why can high IO on the HDD trigger high IO on the SSD?
Especially for the first item: `free` shows the total memory usage, while `top` and `ps` show the memory usage of individual processes. But there's a problem: the numbers do not match. `free` shows >10GB of RAM in use, but all the processes together seem to use <5GB in total (a rough cross-check is sketched below).
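
A quick way to make the mismatch concrete; note that summing RSS double-counts shared pages, so it over-estimates the per-process total if anything:

```
# Sum of resident set sizes across all processes
ps -eo rss= | awk '{sum += $1} END {printf "total RSS: %.1f GiB\n", sum / 1024 / 1024}'

# Compare against the "used" column here
free -h
```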

I actually noticed this discrepancy a while ago, but until now I had no idea why it happened. I'd thought maybe I just wasn't interpreting the numbers correctly, e.g. due to shared memory.

But this time I was a bit luckier: while randomly searching Google, I found the name.

ZFS ARC

I had only a vague understanding of ZFS ARC. I knew that an SSD can be used as L2ARC, and I am not using that, so I had always assumed that ARC was not enabled either. After all, it is a "cache", which sounds like an optional, nice-to-have feature that is probably not enabled by default unless I turn it on manually.

But obviously I had been wrong. It's enabled, and by default it can use up to 50% of all physical memory!

The actual usage can be found via any of the following (example commands below):
  • /proc/spl/kstat/zfs/arcstats
  • arcstat
  • arc_summary
I verified that ZFS ARC was using 8GB of memory!
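
A sketch of how to read it; the first command works on any system with the zfs module loaded, the other two depend on the ZFS userland tools being installed:

```
# Current ARC size in bytes
awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats

# Higher-level views
arcstat 1          # one line of ARC hit/size stats per second
arc_summary | less
```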

It's known that ZFS ARC may confuse the kernel and other processes. While it is technically a "file cache", it may appear as allocated memory in `free`. Even `echo 3 > /proc/sys/vm/drop_caches` may not always free the ZFS ARC; `echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit` may be needed as well.

So ZFS manages its own cache. I learned that this may be related to its non-Linux roots.

The size of the ARC may be controlled by zfs_arc_min and zfs_arc_max. However, zfs_arc_sys_free may be a better fit: it instructs ZFS to keep at least this much memory free, by shrinking the ARC when needed. On the other hand, when memory pressure is low, ZFS can still use as much memory as it wants.

The default value of zfs_arc_sys_free is max(TOTAL_MEMORY/64, 512KiB), which in my case computes to 256MiB. It doesn't quite match the result shown in `free`, which reports ~800MiB free, but I guess the kernel had already considered the pressure too high and started kswapd. Meanwhile, ZFS may still keep freeing cache (slowly), because the amount of free memory is low. This way, the total memory usage never went very high, and thus swap usage remained low.

The parameter can be modified dynamically via /sys/module/zfs/parameters/zfs_arc_sys_free, or permanently via a modprobe conf file (see below).
It immediately helped when I changed the value to 1GiB.
However, I still observed high IO after a while, so eventually I kept it at 2GiB.
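
Roughly what that looks like; the value is in bytes (2GiB shown), and the conf file name is just a convention:

```
# Takes effect immediately, lost on reboot
echo $((2 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_sys_free

# Persist across reboots (applied when the zfs module is loaded)
echo 'options zfs zfs_arc_sys_free=2147483648' | sudo tee -a /etc/modprobe.d/zfs.conf
```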

Conclusion

While not 100% sure, I think ZFS ARC is the root cause. It explains all my questions:
  • ZFS ARC really does use a lot of memory. With the default ZFS parameters and only 16GiB of RAM, ZFS may use up to 8GiB for the ARC, as long as at least 256MiB stays free. This causes a large number of page faults when other processes are already using a lot of memory.
  • ZFS ARC is managed independently of the kernel, so the kernel may not be able to free it correctly when needed (the cache may not even be recognized as page cache by the kernel). Therefore the kernel may decide to drop caches for mmap'ed regions instead.
    • For the same reason, `free` does not count memory used by ZFS ARC as buffer/cache, which can be very misleading.
    • This is also why `echo 3 > /proc/sys/vm/drop_caches` may or may not help.
  • ZFS may slowly free some cache when the free memory drops below 256MiB, so as long as the real total usage stays below 16GiB, I won't see much swap usage or OOM errors.
  • Only high IO on the HDD fills (and grows) the ZFS ARC, which leads to the original high IO issue on the SSD. High IO on the SSD only affects the Ext4 page cache, which the kernel seems to handle well.
  • Restarting dockerd may not help, because the ZFS ARC may not be cleared. On the other hand, rebooting always helps because it clears the ZFS ARC.
  • Moving docker volumes from SSD to HDD didn't help much, because they are not read that frequently.
  • Moving docker data-root from SSD to HDD helped a bit:
    • Container processes no longer have high IO, because all the binaries (in overlay fs) are in ZFS.
    • dockerd still has high IO, because /usr/bin/dockerd is still on Ext4. The IO is lower than before because some data in the data-root is now on ZFS.
