2024-10-12

Chasing an IO Phantom

My home server has been weird since months ago, it just becomes unresponsive occassionally. It is annoying but it happens only rarely, so normally I'd just wait or reboot it. But weeks ago I decided to get to the bottom of it.


What's Wrong

My system set up is:
  • Root: SSD, LUKS + LVM + Ext4
  • Data: HDD, LUKS + ZFS
  • 16GB RAM + 1GB swap
  • Rootless dockerd

The system may become unresponsive, when the IO on HDD is persistantly high for a while. Also:
  • Often kswapd0 has high CPU
  • High IO on root fs (SSD)
    • From dockerd and some containers
  • RAM usage is high, swap usage is low
It is very strange that IO on HDD can affect SSD. Note that when this happens, even stopping the IO on HDD does not always help.

Usually restarting dockerd does not help, but rebooting helps.


Investigation: Swap

An obvious potential root cause is the swap. High CPU on kswapd0 usually means the free memory is low and the kernel is busy exchanging data between disk and swap.

However, I tried the following steps, none of them helped
  • Disable swap
    • swapoff -a
    • remove the entries in /etc/fstab
  • Add more swap
  • Adjust vm.swappiness
While kswapd0 still looks suspicious, I decided that swap is irrelevant.


Investigation: Cache

`echo 3 > /proc/sys/vm/drop_caches` sometimes helps, sometimes not.
`free` does not show a high number of buffer/cache, so it seems irrelevant.

Investigation: Individual Docker Containers

`docker stats` shows that some docker containers have higher IO than others. Usually those are related to database.

Note that `docker stats` is known to be inaccurate when there are mapped devices, bind mounts etc. References: 1, 2. However, even with that in mind, the IO numbers are still too high.

So I chose one container and examined whether it is due to a faulty image.

Well, I tried a few modifications, none of them helped:
  • Disable unnecessary scheduled tasks that may read the disk.
  • Move container volumes from SSD to HDD.
  • Stop other containers.
Eventually I concluded that this is not a problem of individual docker containers.


Investigation: Docker

So I suspect there may be something wrong with Docker itself, but I cannot be 100% sure, because there are almost no other processes running on the server.

IO Scheduler

I found some seeming relevant discussions: 1, 2, 3. There they mentioned tuning disk IO schedulers. 

I checked that `mq-deadline` is enabled for the SSD. The SSD is M.2 NVME but connected to the motherboard via a USB enclosure, so it is recognized as sdX. According to 1 and 2, it is better to use `none` for SSD. I tried `echo none > /sys/block/sdX/queue/scheduler` but it didn't help.

Docker Configuration

I learned about issues related to logging: 12. But they are not helpful.

Currently the data-root is on SSD, which contains images, overlay fs etc. So I tried moving it on to HDD.

It kind of helped, the IO is still high, but only for dockerd, not the containters. And again, the high IO can only be triggered if there's high IO on HDD, but not SSD.

While it seems to reveal something, but I didn't realize it yet at that moment.

The Dockerd Process

Using `strace` I was able to reveal more information:
  • `strace -e trace=file` shows flooding SIGURG, which has clear correlationship with the IO issue.
    • The was a relevant bug about overlayfs for rootless docker, but had been fixed.
  • `strace` shows lots of futex timeouts.
    • I found a few discussions: 1, 2, 3 and 4. But not helpful.

/proc/DOCKERD_PID/stack shows that dockerd is often blocked at:
  • filemap_fault
  • do_fault
  • handle_mm_fault
  • epoll_wait
  • sys_poll
This suggests that maybe dockered did something wrong with mmap. While several tools can confirm that the number of page faults is very high, none can tell me which files caused the page fault.

To find the culprit file targets, I learned about the `systemtap` tool, which is very powerful, and is able to reveal the exact address that caused the page fault. References: 1, 2, 3.

Interestingly, `systemtap` didn't just work, there's some compilation error during runtime. 
I found out that it was due to an API change in the kernel, and I have to install an older 
kernel. Ref. I also learned to use `grub-reboot` to reboot to a specified kernel version.

When the high IO issue is triggered, I verified the following files caused major page faults in dockerd and containerd.

For dockerd
  • /usr/bin/dockerd
  • /home/user/.local/share/docker/buildkit/history.db
  • /home/user/.local/share/docker/network/files/local-kv.db
For containerd
  • /home/user/.local/share/docker/containerd/daemon/io.containerd.metadata.v1.bolt/meta.db
  • /usr/bin/containerd
Disappointingly, nothing looks suspicious.

Other Theories

I had a couple of more theories:
  • Maybe it is related to the USB to M.2 adapter/enclsoure, and/or related to the high temperature of the SSD
    • I ordered a PCIe to M.2 card, but have not received it yet.
  • I have special POSIX ACL on the docker directory.
    • I move the directory to a new place without POSIX ACL. It didn't help
  • Maybe a rootful dockerd would work better?
  • Is it related to ...?
    • cgroups
    • SSD garbage collection?
    • LUKS
    • LVM
    • LXC
    • Some kernel bug
    • /dev/shm
    • bind mount
  • Some containers may be leaking memory

Catching the Phantom

After all the failed attempts, I reviewed all my notes and drew the following conclusion:

Maybe this is normal after all, I just need to add more memory. kswapd is busy because of the page fault from mmap'ed files. I didn't find anything wrong with mmap, so perhaps linux just decided to free some mmap'ed regions, such that a page fault is triggered the next time they are read. I guess I was so focused on swap files and omitted the fact that mmap can also cause major page faults.

The mmap'ed regions are often related to databases (i.e. docker buildkit), which are frequently accessed by processes. So page fault will consistently happen, as the kernel keeps dropping the cache.

However some questions still remain unanswered:
  • Why does the kernel decide to drop the cache? I feel that the kernel can theoretically handle it better, especially the swapfile are almost unused.
  • Does it really need to consistently read the disk at 400MB/s?
  • Why the high IO on SSD can be triggered by high IO on HDD?
Especially for the first item, `free` shows the total memory usage, `top` and `ps` shows memory usage of individual processes. But there's a problem: the numbers do not match. `free` shows >10GB RAM usage, but the processes all together seems to use <5GB in total.

I actually noticed this effect a while ago, and until now I had no idea why it happened. I'd thought maybe I just didn't interpret the numbers correctly, e.g. due to shared memory.

But this time I'm a bit luckier, as I random surfing Google, I found the name.

ZFS ARC

I have vague understanding of ZFS ARC, I only knew that SSD can be used as L2ARC, and I'm not using it. So I always assume that ARC is not enabled either. After all, it is a "cache", sounds like an optional, nice-to-have feature, that is probably not enabled by default, unless I turn it on manually.

But obvious I've been wrong. It's enabled, and by default it can use 50% of all physical memory!

The actually usage can be found via:
  • /proc/spl/kstat/zfs/arcstats
  • arcstat
  • arc_summary
I verified that ZFS ARC is using 8GB memory!

It's known that ZFS ARC may confuse the kernel and other processes. While it is technically "file cache", but it may appear as allocated memory in `free`. Even `echo 3 > /proc/sys/vm/drop_caches` may not always free ZFS ARC, `echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
` may be needed.

So ZFS manages its own cache. I learned that this may be related to its non-Linux root.

The size of the ARC may be controlled by zfs_arc_min and zfs_arc_max. However, zfs_arc_sys_free may be better. It instrusts ZFS to make sure we should have at least this much free memory, by freeing ZFS ARC when needed. On the other hand, when the memory pressure is low, ZFS can still use as much memory as it wants to.

The default value of zfs_arc_sys_free is max(TOTAL_MEMORY/64, 512KiB), in my case that computes to 256MiB. It doesn't really match with the result shown in `free`, which shows ~800MiB free memory, but I guess the kernel already considered the pressure too high, and started kswapd. Meanwhile, ZFS may still keep freeing cache (slowly), because the amount of free memory is low. This way, the total memory usage never went very high, thus swap usage remained low.

The parameters can be modified dynamically via  /sys/module/zfs/parameters/zfs_arc_sys_free, or permanently via modprobe conf.
It immediately helps when I change the value to 1GiB. 
However, I still observed high IO after a while, so eventually I kept it at 2GiB.

Conclusion

While not 100% sure, I think ZFS ARC is the root cause. It explains all my questions:
  • ZFS ARC really uses a lot of memory. Because of the default ZFS parameters, and I have only 16GiB memory, by default ZFS may use at most 8GiB memory, as long as there is still at least 256MiB memory. This will cause a large amount of page faults when other processes are already using a lot of memory.
  • ZFS ARC is managed independently to the kernel, so the kernel may not be able to free it correctly when needed. (The cache may not be even recognized has page cache by the kernel). Therefore the kernel may decide to drop caches for mmap'ed regions.
    • For the same reason, `free` does not count memory used by ZFS ARC as buffer/cache, which can be very misleading.
    • This is also why `echo 3 > /proc/sys/vm/drop_caches` may or may not help.
  • ZFS may slowly free some cache when the free memory is lower than 256MiB, so as long as the real total usage is lower than 16GiB, I won't see much swap usage, or OOM errors.
  • Only high IO on HDD will fill (and increase) the ZFS ARC, which leads to the original high IO issue on SSD. High IO on SSD will only affect the page cache for Ext4, which seems to be handled well by the kernel.
  • Restarting dockerd may not help, because the ZFS ARC may not be cleared. On the other hand, rebooting always helps because it clears the ZFS ARC.
  • Moving docker volumes from SSD to HDD didn't help much, because they are not read that frequently.
  • Moving docker data-root from SSD to HDD helped a bit:
    • Container processes no longer have high IO, because all the binaries (in overlay fs) are in ZFS.
    • dockerd still has high IO, because /usr/bin/dockerd is still in Ext4. The IO is lower than before because some data in data-root are in ZFS.

2024-10-05

Find Duplicate Photos and Videos

Last month I received a warning that there's almost no more free space left for my Google account, so I decided to delete all files in Google Photos.

But before that, I downloaded all files via Google Takeout. I decided to copy files to my local backup if I don't have a local copy, for example, panorama photos generated by Google Photos. However I don't know which exactly ones that I should copy.

There are ~27k files in total. I processed them in different steps, through each step I expect to find more files that I do not need to copy.

Binary Comparison

I can easily find binary identical files, by computing hashsums of all files.

Unfortunatley, only ~7k files have a binary identical local backup.

Binary Content Comparison

It is possible two files have slighly different metadata but exactly same content (i.e. video stream).

Using `ffmpeg -f md5 -` I can compute a hashsum of the content of a file.

Unforunately, only 1 file is found in this step. Quite disappointing!

Of course, I should probably skip the first step, and just use ffmpeg from the beginning.

ImageMagick Comparison

Using `compare -metric mse`, I can compute a "similarity metric" between two files. But before that I need to scale the two input files into the same resolution.

To identify "visually identical" image files, I need to set a threshold of the MSE metric. I also use other signals like EXIF datetime and filenames. For example, if two files have exactly the same filename and exif datetime, I can use a very permissive MSE threshold.

With a few rounds of tuning, eventually I found a good set of thresholds. The end result was pretty good. I found local visually identical files for ~18k files.

These files are not binary identical, because Google Photos would rescale and compress the input files.

In general ImageMagick works pretty well, but there are some issues:
- It cannot handle CR2 files, it depends on `ufraw-batch`, which is no longer available on Debian.
- It cannot handle video files well, it converts video files to something huge, which is quite inefficient.

Manual Comparison

For about ~500 files, some potential candidates (as local duplicate) are found, but I'm not sure yet they are valid or not. So I just created a HTML file that present all files and their candidates side by side.

It's proven quite helpful. Most candidates are actual duplicates, but I was able to identify the rest.

Final Step

After all the steps, there are ~500 files left. I just copied all of them to my local backup.

Other Tools

There are other tools I found during the process, but didn't use:

2024-09-16

Dates in Leaked Passwords

I have been playing with data from haveibeenpwned.com, especially I was exploring 8-digit passwords that look like dates.


I checked all dates between 19000101 and 20231231, where

- The date must be in the YYYYMMDD format

- Some invalid dates are also included, e.g. 19120231


Further, by default I only analyzed valid dates after 19240101.


By Year

1986 is the most popular year. It roughly forms a normal distribution. The distribution might review the distribution of the ages of the users (of the leaked services).


By Month & Day

It is clear that October, November and December are more popular than other months, but I think this is is because people may use 1912312 rather than 19120312.

We can also see a pattern for each month, especially 1st is more popular than other days.

Other than those, the passwords are roughly evenly distributed among all days.

The most popular dates are:
  • 0101, #=81516
  • 1010, #=79997
  • 1212, #=71724
  • 1111, #=62065
  • 1001, #=61109
Seems that they are just nice & easy patterns to type.

And the least popular dates are:
  • 0229, #=7075
  • 0731, #=20302
  • 0730, #=20846
  • 0531, #=21543
  • 0131, #=21646
0229 is obvious, but all other dates are all about the end of months, maybe this is not an coincidence.


By Day

31st is the least popular, which is obvious.

There's a sudden drop between 12th and 13th, which I don't really understand. This might contradict with my idea above, that people may type 9 instead of 09.



By Year & Month & Day

The most popular ones are:
  • 19870212, #=12359
  • 19821108, #=10812
  • 19941028, #=10097
  • 19810301, #=9528
  • 19960122, #=8906
The least popular ones (but appear in the database) are:
  • 19410417, #=1
  • 19410504, #=1
  • 19350824, #=1
  • 19290519, #=1
  • 19290525, #=1
Besides the popularily of the years, I cannot really draw any conclusions.


Invalid Dates

The most popular invalid dates are:
  • 19860229, #=362
  • 19820229, #=296
  • 19870229, #=228
  • 19890229, #=203
  • 19860230, #=192

Conclusions

By no means are the results scientific, it's just me having some fun.
If there's anything to remember, never use 19870212 as your password!