Chasing an IO Phantom

My home server has been acting weird for months: it occasionally becomes unresponsive. It is annoying, but it happens only rarely, so normally I'd just wait or reboot it. A few weeks ago I decided to get to the bottom of it.


What's Wrong

My system setup is:
  • Root: SSD, LUKS + LVM + Ext4
  • Data: HDD, LUKS + ZFS
  • 16GB RAM + 1GB swap
  • Rootless dockerd

The system may become unresponsive when IO on the HDD stays high for a while. Also:
  • Often kswapd0 has high CPU
  • High IO on root fs (SSD)
    • From dockerd and some containers
  • RAM usage is high, swap usage is low
It is very strange that IO on the HDD can affect the SSD. Note that when this happens, even stopping the IO on the HDD does not always help.

Usually restarting dockerd does not help, but rebooting does.


Investigation: Swap

An obvious potential root cause is the swap. High CPU usage from kswapd0 usually means that free memory is low and the kernel is busy reclaiming memory, e.g. by swapping pages out to disk.

However, I tried the following steps, and none of them helped (a sketch of the commands is below):
  • Disable swap
    • swapoff -a
    • remove the entries in /etc/fstab
  • Add more swap
  • Adjust vm.swappiness
While kswapd0 still looked suspicious, I decided that swap was irrelevant.
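
For reference, the attempts amounted to roughly the following; the sizes and the swappiness value are placeholders, not my exact configuration:

```
# Disable swap entirely (also comment out the swap entry in /etc/fstab to persist)
sudo swapoff -a

# Or keep swap but make the kernel less eager to use it (default is 60)
sudo sysctl vm.swappiness=10

# Or add an extra swap file temporarily
sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
```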


Investigation: Cache

`echo 3 > /proc/sys/vm/drop_caches` sometimes helps, sometimes not.
`free` does not show a high number of buffer/cache, so it seems irrelevant.
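
For completeness, the check amounted to something like this:

```
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h    # buff/cache stays low, which (misleadingly) suggested caching was not the issue
```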

Investigation: Individual Docker Containers

`docker stats` shows that some docker containers have higher IO than others. Usually those are related to databases.

Note that `docker stats` is known to be inaccurate when there are mapped devices, bind mounts etc. References: 1, 2. However, even with that in mind, the IO numbers are still too high.
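
To narrow things down, it helps to compare Docker's per-container numbers with the kernel's own cgroup accounting. A sketch; the cgroup path is an assumption for rootless Docker on cgroup v2 and varies by distro, UID and container ID:

```
# Docker's view of per-container block IO
docker stats --no-stream --format 'table {{.Name}}\t{{.BlockIO}}'

# Kernel's view for one container (path is an assumption; adjust UID and CONTAINER_ID)
cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/user.slice/docker-CONTAINER_ID.scope/io.stat
```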

So I chose one container and examined whether it is due to a faulty image.

Well, I tried a few modifications, but none of them helped:
  • Disable unnecessary scheduled tasks that may read the disk.
  • Move container volumes from SSD to HDD.
  • Stop other containers.
Eventually I concluded that this was not a problem with individual docker containers.


Investigation: Docker

So I suspected there might be something wrong with Docker itself, but I couldn't be 100% sure, because there are almost no other processes running on the server to compare against.

IO Scheduler

I found some seemingly relevant discussions: 1, 2, 3. They mention tuning the disk IO schedulers.

I checked that `mq-deadline` is enabled for the SSD. The SSD is M.2 NVMe but connected to the motherboard via a USB enclosure, so it is recognized as sdX. According to 1 and 2, it is better to use `none` for SSDs. I tried `echo none > /sys/block/sdX/queue/scheduler`, but it didn't help.
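
For reference, checking and switching the scheduler looks like this (sdX is a placeholder; the change does not persist across reboots):

```
# The active scheduler is shown in brackets
cat /sys/block/sdX/queue/scheduler
# e.g. [mq-deadline] none

# Switch to none for this session
echo none | sudo tee /sys/block/sdX/queue/scheduler
```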

Docker Configuration

I learned about issues related to logging: 1, 2. But they were not helpful.

Currently the data-root is on the SSD, which contains images, the overlay filesystems, etc. So I tried moving it onto the HDD.

It kind of helped: the IO is still high, but only for dockerd, not for the containers. And again, the high IO can only be triggered by high IO on the HDD, not on the SSD.

This seemed to reveal something, but I didn't realize what at that moment.

The Dockerd Process

Using `strace` I was able to reveal more information (the invocations are sketched below):
  • `strace -e trace=file` shows a flood of SIGURG signals, which correlates clearly with the IO issue.
    • There was a relevant bug about overlayfs for rootless Docker, but it had been fixed.
  • `strace` shows lots of futex timeouts.
    • I found a few discussions: 1, 2, 3 and 4. But they were not helpful.
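
A sketch of the invocations, assuming the daemon's PID is in DOCKERD_PID (for rootless dockerd, sudo may not be needed):

```
# Trace file-related syscalls of the running daemon
strace -f -e trace=file -p "$DOCKERD_PID"

# Summarize syscall counts and time spent, which makes futex timeouts stand out
strace -f -c -p "$DOCKERD_PID"
```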

/proc/DOCKERD_PID/stack shows that dockerd is often blocked at:
  • filemap_fault
  • do_fault
  • handle_mm_fault
  • epoll_wait
  • sys_poll
This suggests that maybe dockerd did something wrong with mmap. While several tools can confirm that the number of page faults is very high, none can tell me which files caused the page faults.
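
The fault counters and the kernel stack can be read like this (DOCKERD_PID is a placeholder):

```
# Minor/major page fault counts for the process
ps -o pid,min_flt,maj_flt,cmd -p "$DOCKERD_PID"

# Kernel stack the process is currently blocked in (requires root)
sudo cat /proc/"$DOCKERD_PID"/stack
```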

To find the culprit files, I learned about the `systemtap` tool, which is very powerful and able to reveal the exact address that caused a page fault. References: 1, 2, 3.

Interestingly, `systemtap` didn't just work: there were compilation errors at runtime. It turned out this was due to an API change in the kernel, and I had to install an older kernel. Ref. I also learned to use `grub-reboot` to boot into a specific kernel version.
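
A minimal sketch of the kind of probe that can reveal the faulting address; the exact tapset probe points and variables depend on the kernel and SystemTap versions, and DOCKERD_PID is a placeholder:

```
# Print the faulting address for page faults in the target process
sudo stap -x "$DOCKERD_PID" -e '
probe vm.pagefault {
  if (pid() == target())
    printf("%s: fault at 0x%x (write=%d)\n", execname(), address, write_access)
}'
```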

When the high IO issue was triggered, I verified that the following files caused major page faults in dockerd and containerd.

For dockerd
  • /usr/bin/dockerd
  • /home/user/.local/share/docker/buildkit/history.db
  • /home/user/.local/share/docker/network/files/local-kv.db
For containerd
  • /home/user/.local/share/docker/containerd/daemon/io.containerd.metadata.v1.bolt/meta.db
  • /usr/bin/containerd
Disappointingly, nothing looks suspicious.

Other Theories

I had a couple more theories:
  • Maybe it is related to the USB-to-M.2 adapter/enclosure, and/or to the high temperature of the SSD.
    • I ordered a PCIe-to-M.2 card, but have not received it yet.
  • I have special POSIX ACLs on the docker directory.
    • I moved the directory to a new place without POSIX ACLs. It didn't help.
  • Maybe a rootful dockerd would work better?
  • Is it related to ...?
    • cgroups
    • SSD garbage collection?
    • LUKS
    • LVM
    • LXC
    • Some kernel bug
    • /dev/shm
    • bind mount
  • Some containers may be leaking memory

Catching the Phantom

After all the failed attempts, I reviewed all my notes and drew the following conclusion:

Maybe this is normal after all, and I just need to add more memory. kswapd is busy because of page faults on mmap'ed files. I didn't find anything wrong with mmap itself, so perhaps Linux just decided to evict some mmap'ed pages, such that a page fault is triggered the next time they are read. I guess I was so focused on swap that I overlooked the fact that mmap can also cause major page faults.

The mmap'ed regions are often database files (e.g. Docker BuildKit's), which are frequently accessed by processes. So page faults keep happening, as the kernel keeps dropping the cache.

However some questions still remain unanswered:
  • Why does the kernel decide to drop the cache? I feel the kernel could theoretically handle it better, especially since the swap is almost unused.
  • Does it really need to consistently read the disk at 400MB/s?
  • Why can high IO on the HDD trigger high IO on the SSD?
Especially for the first item: `free` shows the total memory usage, while `top` and `ps` show the memory usage of individual processes. But there's a problem: the numbers do not match. `free` shows >10GB of RAM in use, but all the processes together seem to use <5GB in total (a rough cross-check is sketched below).
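
A quick way to make the mismatch concrete; note that summing RSS double-counts shared pages, so it over-estimates the per-process total if anything:

```
# Sum of resident set sizes across all processes
ps -eo rss= | awk '{sum += $1} END {printf "total RSS: %.1f GiB\n", sum / 1024 / 1024}'

# Compare against the "used" column here
free -h
```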

I actually noticed this discrepancy a while ago, but until now I had no idea why it happened. I'd thought maybe I just wasn't interpreting the numbers correctly, e.g. due to shared memory.

But this time I was a bit luckier: while randomly searching Google, I found the name.

ZFS ARC

I had only a vague understanding of ZFS ARC. I knew that an SSD can be used as L2ARC, and I am not using that, so I had always assumed that ARC was not enabled either. After all, it is a "cache", which sounds like an optional, nice-to-have feature that is probably not enabled by default unless I turn it on manually.

But obviously I had been wrong. It's enabled, and by default it can use up to 50% of all physical memory!

The actual usage can be found via any of the following (example commands below):
  • /proc/spl/kstat/zfs/arcstats
  • arcstat
  • arc_summary
I verified that ZFS ARC was using 8GB of memory!
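
A sketch of how to read it; the first command works on any system with the zfs module loaded, the other two depend on the ZFS userland tools being installed:

```
# Current ARC size in bytes
awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats

# Higher-level views
arcstat 1          # one line of ARC hit/size stats per second
arc_summary | less
```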

It's known that ZFS ARC may confuse the kernel and other processes. While it is technically a "file cache", it may appear as allocated memory in `free`. Even `echo 3 > /proc/sys/vm/drop_caches` may not always free the ZFS ARC; `echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit` may be needed as well.

So ZFS manages its own cache. I learned that this may be related to its non-Linux roots.

The size of the ARC may be controlled by zfs_arc_min and zfs_arc_max. However, zfs_arc_sys_free may be a better fit: it instructs ZFS to keep at least this much memory free, by shrinking the ARC when needed. On the other hand, when memory pressure is low, ZFS can still use as much memory as it wants.

The default value of zfs_arc_sys_free is max(TOTAL_MEMORY/64, 512KiB), which in my case computes to 256MiB. It doesn't quite match the result shown in `free`, which reports ~800MiB free, but I guess the kernel had already considered the pressure too high and started kswapd. Meanwhile, ZFS may still keep freeing cache (slowly), because the amount of free memory is low. This way, the total memory usage never went very high, and thus swap usage remained low.

The parameter can be modified dynamically via /sys/module/zfs/parameters/zfs_arc_sys_free, or permanently via a modprobe conf file (see below).
It immediately helped when I changed the value to 1GiB.
However, I still observed high IO after a while, so eventually I kept it at 2GiB.
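
Roughly what that looks like; the value is in bytes (2GiB shown), and the conf file name is just a convention:

```
# Takes effect immediately, lost on reboot
echo $((2 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_sys_free

# Persist across reboots (applied when the zfs module is loaded)
echo 'options zfs zfs_arc_sys_free=2147483648' | sudo tee -a /etc/modprobe.d/zfs.conf
```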

Conclusion

While not 100% sure, I think ZFS ARC is the root cause. It explains all my questions:
  • ZFS ARC really does use a lot of memory. With the default ZFS parameters and only 16GiB of RAM, ZFS may use up to 8GiB for the ARC, as long as at least 256MiB stays free. This causes a large number of page faults when other processes are already using a lot of memory.
  • ZFS ARC is managed independently of the kernel, so the kernel may not be able to free it correctly when needed (the cache may not even be recognized as page cache by the kernel). Therefore the kernel may decide to drop caches for mmap'ed regions instead.
    • For the same reason, `free` does not count memory used by ZFS ARC as buffer/cache, which can be very misleading.
    • This is also why `echo 3 > /proc/sys/vm/drop_caches` may or may not help.
  • ZFS may slowly free some cache when the free memory drops below 256MiB, so as long as the real total usage stays below 16GiB, I won't see much swap usage or OOM errors.
  • Only high IO on the HDD fills (and grows) the ZFS ARC, which leads to the original high IO issue on the SSD. High IO on the SSD only affects the Ext4 page cache, which the kernel seems to handle well.
  • Restarting dockerd may not help, because the ZFS ARC may not be cleared. On the other hand, rebooting always helps because it clears the ZFS ARC.
  • Moving docker volumes from SSD to HDD didn't help much, because they are not read that frequently.
  • Moving docker data-root from SSD to HDD helped a bit:
    • Container processes no longer have high IO, because all the binaries (in overlay fs) are in ZFS.
    • dockerd still has high IO, because /usr/bin/dockerd is still on Ext4. The IO is lower than before because some data in the data-root is now on ZFS.
