2022-01-16

On Data Backup

Around 2013, I would burn all my important data onto a single DVD (4.7GB) every year. Nowadays I have ~5TB of data, and I don't even bother optimizing away 5GB of it.

Background

I realized that it is time to consider backups. I guess I have seen enough signs:
  • The NAS shows that the disks are quite full.
  • I just happened to see articles and videos about data backup.
  • I found corrupted data in my old DVDs.
  • I realized that most of my important data are not properly backed up.
  • I have a few scripts that manage different files, and they might contain bugs.
The goal is to have good coverage under acceptable cost.

The Plan

All my data are categorized into 4 classes.

Class 1: Most Important + Frequently Accessed

Roughly ~50GB in total. Average file size is ~5MB.
Examples include official documents, my artworks and source code.

The plan: sync to multiple locations to maximize robustness. Sometimes I choose a smaller subset when I don't have enough free space.
In case some copies are down or corrupted, I can still access the data quickly.
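For example, a minimal sketch of such a multi-location sync, as a thin Python wrapper around rclone (the local path and remote names here are hypothetical, not my actual setup):

import subprocess

CLASS1 = "/volume1/data/class1"          # hypothetical local path of Class 1 data
DESTINATIONS = [
    "onedrive:backup/class1",            # hypothetical rclone remotes
    "gdrive:backup/class1",
    "/mnt/usb/backup/class1",            # a local copy counts as a location too
]

for dest in DESTINATIONS:
    # "rclone sync" makes dest an exact mirror of CLASS1; a failure on one
    # destination should not stop the others, so just record the exit code
    result = subprocess.run(["rclone", "sync", CLASS1, dest, "--verbose"])
    print(f"{dest}: exit code {result.returncode}")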

Class 2: Important + Frequently Modified

Roughly ~500GB in total. Average file size is ~500KB.
Typically there are groups of small files, which must be used together.
Examples include source code, git repos and backup repos.
Note that it overlaps with Class 1.

The plan: hot backup with versioning/snapshots, plus yearly cold archives.

Class 3: Important + Frozen

Roughly ~1TB in total. Average file size is ~50MB.
Frozen means they are never (or at least rarely) changed once created.
Most data in this class is labeled, for example /Media/Video/2020/2020-01-03.mp4.
Examples include raw GoPro footage.
Note that it overlaps with Class 1.

The plan: hot backup with versioning/snapshots; labeled data is synced directly to cold storage, and unlabeled data goes into yearly cold archives.
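As a rough illustration of how the labeled/unlabeled split can be automated, here is a sketch; the date-prefix rule and the example paths are only an approximation of my actual layout:

import os
import re

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}")   # e.g. 2020-01-03

def is_labeled(path: str) -> bool:
    # labeled files are synced directly to cold storage;
    # everything else ends up in the yearly cold archive
    return bool(DATE_PREFIX.match(os.path.basename(path)))

print(is_labeled("/Media/Video/2020/2020-01-03.mp4"))  # True  -> direct sync
print(is_labeled("/Media/Video/misc/clip.mp4"))        # False -> yearly archive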

Class 4: Unimportant

The rest of the data is not important. I wouldn't worry too much if it were lost, but I'm happy to keep it around at minimal cost.
Examples include downloaded Steam games.

The plan: upload some of it to hot backup storage, should I have enough quota.
No cold archive is planned.

Thoughts

I put a lot of thought into designing the plan, and I'm happy with the result.
On the other hand, I had too many headaches throughout the process. 
To name a few:


Hot Backup or Cold Archive
I had a hard time choosing between hot backups and cold archives. Hot backups are more up-to-date, but cold archives are safer.

Originally I had planned to use only one per data class (and the classes were defined slightly differently). But I just couldn't decide. 

The decision is to try both and revisit later.


Format of Cold Archives
There are two possibilities:
  1. Directly upload the files, keeping the same local file structure.
  2. Create an archive and upload it. This also includes chunk-based backup methods.
Note that cold storage is special:
  • Files on cold storage cannot be modified/moved/renamed, or more precisely, it is more expensive to do so.
  • There is typically a per-API-call/per-object cost, so we get a penalty for storing too many small objects.
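To make the per-object penalty concrete, here is a back-of-the-envelope example; the request price below is purely an assumed, illustrative number, not any provider's actual rate:

# Illustrative only: the per-request price is an assumption, not a real quote.
put_price_per_1000 = 0.05      # assumed $ per 1,000 upload (PUT) requests
num_files = 1_000_000          # e.g. one million small files
request_cost = num_files / 1000 * put_price_per_1000
print(f"~${request_cost:.0f} in request fees alone")   # ~$50, regardless of file sizes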
With option 1, I can easily access individual files on the storage, but if I rename or move some files locally, it will be a disaster in the next backup cycle.
With option 2, there is no problem with too many files, but I have to download the whole archive in order to access a single file inside it. Also, I need to make sure that the archives do not overlap (too much), or they will just waste space.

My solution is to organize and label the data, mostly by year. The good news is that most frozen data can be labeled this way, and the files are often large. This makes it mostly safe to upload them directly: files may be added or removed, but they are unlikely to be renamed or modified.

For unlabeled data, I'd just create archives every year; the total size is small compared with the labeled data, so I wouldn't worry about it.


Format of Hot Backups
There are also two possibilities:
  1. File-based. Every time a file is modified or removed, the old version is saved somewhere else.
  2. Chunk-based. All files are broken into chunks and stored that way, like git repos.
There are lots of things to consider, e.g. size, speed, safety/robustness, and ease of access.

The decision is to go for #1 for all relevant data classes. My thoughts are:
  • In the worst case, the whole chunk-based repo may be affected by a few rotten bits. This is not the case for file-based solutions.
  • I want to be able to access individual files without special tools.
  • Benefits of chunk-based approaches include deduplication and smaller sizes (mostly for changed files), but that does not really apply to my data: most of the big files are videos, which are rarely changed and cannot be compressed much.
On the other hand, I do plan to try out some chunk-based software in the future.


Backup for Repos
In my NAS I have a few git repos and (chunk-based) backup repos. So how should I back them up?

On one hand, the repos already store versions of the source files, so simply syncing them to cloud storage should work well enough.
On the other hand, should there be local data corruption, the cloud copy would also be damaged after the next sync.

The decision is to keep versions in the repo backups as well. Fortunately they are not very big.
I plan to revisit this later, and hopefully I will never need to recover a repo this way.



Backup Storage

I don't have enough spare HDDs to back up all my data. In any case, I prefer cloud storage for this task.
Hot and cold storage need to be discussed separately.


Hot Backup

Most cloud storage services would work well as a hot backup repo. In general the files are always available for reading and writing, which makes them suitable for simple rsync-like backups or chunk-based backups.

It is not too difficult to fit Class 1 data into free quota, although I do need to subset it.
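For example, such a subset can be expressed with rclone filter rules; the directory names and the remote name below are hypothetical:

import subprocess

# keep only the most critical subdirectories of Class 1 on the free-quota remote
subprocess.run(
    ["rclone", "sync", "/volume1/data/class1", "freequota:class1",
     "--filter", "+ /Documents/**",   # include this subtree
     "--filter", "+ /Code/**",        # and this one
     "--filter", "- *",               # exclude everything else
     "--verbose"],
    check=True,
)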

It is tricky to choose a single service for the other classes, as I'd like to keep all backup data together.
I have checked a number of services and found the following especially interesting.
  • Backblaze B2
  • Amazon S3
  • Google Cloud Storage
I'd just pick one, balancing cost, speed, reputation, etc. I wouldn't worry too much about software support, since all of them are popular.


Cold Archive

I only recently learned about cold archives from Jeff Geerling's backup plan. After some reading, I find the concept really interesting.

I'd mostly narrow it down to the following:
  • Amazon S3 Glacier Deep Archive
  • Google Archival Cloud Storage
  • Azure Archive
I remember also seeing similar storage classes from Huawei and Tencent, but their support among the open source tools I have found is not as good.


Software

I'd like to manage all backup tasks on my Raspberry Pi. 

rclone, the so-called Swiss army knife of cloud storage, is an easy winner. I just couldn't find another tool that matches it. The more I learn about it, the more I like it. To name a few observations:
  • Great coverage on cloud storage providers.
  • Comprehensive and well-written documents.
  • Active community.
  • Lots of useful features and safety checks.
  • Outputs both human-readable and machine-readable information.
So I ended up writing my own scripts that call rclone. With rclone it is easy to execute a task like "copy all files from A to B, and in case some files in B need to be modified, save a copy in C". So I just needed to focus on defining my tasks, scoping the data, and setting up the routines. It is not trivial though; more on that later.
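As an illustration, here is a minimal sketch of that "A to B, old versions to C" task using rclone's --backup-dir; the paths and remote names are made up, and my real scripts do more bookkeeping:

import subprocess
from datetime import date

SRC = "/volume1/data/class2"                        # hypothetical local path (A)
DST = "hotbackup:class2/current"                    # hypothetical rclone remote (B)
OLD = f"hotbackup:class2/versions/{date.today()}"   # dated directory for old versions (C)

subprocess.run(
    ["rclone", "sync", SRC, DST,
     "--backup-dir", OLD,    # files that would be overwritten or deleted are moved here
     "--verbose"],
    check=True,
)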

I also spent quite some time researching chunk-based backup tools, mainly BorgBackup, restic, and Duplicacy.
I also checked a few others, but not as extensively as these three.
I pulled myself out of the rabbit hole as soon as I realized that I don't need them at the moment, though I still cannot decide which one I would use should I need a chunk-based backup today. Here's a summary of my 2-page notes on these tools:
  • BorgBackup is mature (a fork of Attic, which dates back to 2010), but it has limited backend support.
  • restic is relatively new (first GitHub commit in 2014). It used to have performance issues with pruning, which seem to have been fixed. The backup format is not finalized (yet).
  • Duplicacy is even younger (first GitHub commit in 2016). Its license is non-standard, which concerns many people. Thanks to its lock-free design it benchmarks faster than the others, especially when multiple clients connect to the same repo. However, it may waste some space to achieve that.
Maybe things will change in a few years. I will keep an eye on them.


Technical Issues

I had quite a few issues when using OneDrive + WebDAV.
  • Limit on max path length
  • Limit on max file length
  • No checksums
  • No quota metrics.
Fortunately, most of them are not big problems.

Another issue, this time with 7zip, is that I cannot add empty directories to an archive without putting files in them. This is particularly important for my yearly cold archives.

Eventually I used Python and tarfile to achieve this. I could probably do the same with a 7z Python library, but I used tarfile anyway because it is part of the standard library, and most of the archives cannot be compressed effectively anyway.
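A minimal sketch of that step (the paths are hypothetical); tarfile's add() recurses by default and writes directory entries, so empty directories survive, which is exactly what I could not get out of 7zip:

import os
import tarfile

SOURCE = "/volume1/unlabeled/2021"       # hypothetical directory of unlabeled data
ARCHIVE = "/volume1/archives/2021.tar"   # plain tar: this data barely compresses

with tarfile.open(ARCHIVE, "w") as tar:  # "w" = uncompressed tar
    # add() walks the tree recursively and records directory entries,
    # so empty directories are preserved in the archive
    tar.add(SOURCE, arcname=os.path.basename(SOURCE))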


Next Steps

I will probably add a few more scripts to monitor and verify the backups. For example, download and verify ~10GB of data randomly selected from the backup repo.
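A sketch of what that verification might look like (the remote name is hypothetical; rclone's lsjson, copy, and check subcommands do the heavy lifting):

import json
import os
import random
import subprocess
import tempfile

REMOTE = "hotbackup:class3"    # hypothetical backup remote
TARGET_BYTES = 10 * 1024**3    # ~10GB sample

# list all files in the backup with their sizes
listing = subprocess.run(
    ["rclone", "lsjson", "-R", "--files-only", REMOTE],
    capture_output=True, text=True, check=True,
)
files = json.loads(listing.stdout)
random.shuffle(files)

# pick random files until we reach roughly the target size
sample, total = [], 0
for f in files:
    if total >= TARGET_BYTES:
        break
    sample.append(f["Path"])
    total += f["Size"]

with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "sample-list.txt")
    download_dir = os.path.join(tmp, "data")
    with open(list_path, "w") as fh:
        fh.write("\n".join(sample) + "\n")
    # download only the sampled files, then verify them against the remote
    subprocess.run(["rclone", "copy", REMOTE, download_dir,
                    "--files-from", list_path], check=True)
    subprocess.run(["rclone", "check", download_dir, REMOTE,
                    "--one-way"], check=True)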

I will also keep an eye on chunk-based solutions.
