2022-01-16

On Data Backup

Around 2013, I would burn all my important data onto a single DVD (4.7GB) every year. Nowadays I have ~5TB of data, and I don't even bother optimizing away 5GB of it.

Background

I realized that it is time to consider backups. I guess I have seen enough signs:
  • The NAS shows that the disks are quite full.
  • I just happened to see articles and videos about data backup.
  • I found corrupted data in my old DVDs.
  • I realized that most of my important data are not properly backed up.
  • I have a few scripts that manage different files, and they might contain bugs.
The goal is to have good coverage under acceptable cost.

The Plan

All my data are categorized into 4 classes.

Class 1: Most Important + Frequently Accessed

Roughly ~50GB in total. Average file size is ~5MB.
Examples include official documents, my artworks and source code.

The plan: sync to multiple locations to maximize robustness. Sometimes I choose a smaller subset when I don't have enough free space.
In case some copies are down or corrupted, I can still access the data quickly.
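For example, a minimal sketch of such a multi-location sync, as a thin Python wrapper around rclone (the local path and remote names here are hypothetical, not my actual setup):

import subprocess

CLASS1 = "/volume1/data/class1"          # hypothetical local path of Class 1 data
DESTINATIONS = [
    "onedrive:backup/class1",            # hypothetical rclone remotes
    "gdrive:backup/class1",
    "/mnt/usb/backup/class1",            # a local copy counts as a location too
]

for dest in DESTINATIONS:
    # "rclone sync" makes dest an exact mirror of CLASS1; a failure on one
    # destination should not stop the others, so just record the exit code
    result = subprocess.run(["rclone", "sync", CLASS1, dest, "--verbose"])
    print(f"{dest}: exit code {result.returncode}")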

Class 2: Important + Frequently Modified

Roughly ~500GB in total. Average file size is ~500KB.
Typically there are groups of small files, which must be used together.
Examples include source code, git repos and backup repos.
Note that it overlaps with Class 1.

The plan: hot backup with versioning/snapshots, plus yearly cold archives.

Class 3: Important + Frozen

Roughly ~1TB in total. Average file size is ~50MB.
Frozen means they are never (or at least rarely) changed once created.
Most data in this class is labeled, for example /Media/Video/2020/2020-01-03.mp4.
Examples include raw GoPro footage.
Note that it overlaps with Class 1.

The plan: hot backup with versioning/snapshots; labeled data is synced directly to cold storage, and unlabeled data goes into yearly cold archives.
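As a rough illustration of how the labeled/unlabeled split can be automated, here is a sketch; the date-prefix rule and the example paths are only an approximation of my actual layout:

import os
import re

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}")   # e.g. 2020-01-03

def is_labeled(path: str) -> bool:
    # labeled files are synced directly to cold storage;
    # everything else ends up in the yearly cold archive
    return bool(DATE_PREFIX.match(os.path.basename(path)))

print(is_labeled("/Media/Video/2020/2020-01-03.mp4"))  # True  -> direct sync
print(is_labeled("/Media/Video/misc/clip.mp4"))        # False -> yearly archive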

Class 4: Unimportant

The rest of the data is not important. I wouldn't worry too much if it were lost, but I'm happy to keep it around at minimal cost.
Examples include downloaded Steam games.

The plan: upload some of it to hot backup storage, should I have enough quota.
No cold archive is planned.

Thoughts

I put a lot of thought into designing the plan, and I'm happy with the result.
On the other hand, I had too many headaches throughout the process. 
To name a few:


Hot Backup or Cold Archive
I had a hard time choosing between hot backups and cold archives. Hot backups are more up-to-date, but cold archives are safer.

Originally I had planned to use only one per data class (and the classes were defined slightly differently). But I just couldn't decide. 

The decision is to try both and revisit later.


Format of Cold Archives
There are two possibilities:
  1. Directly upload the files, keeping the same local file structure.
  2. Create an archive and upload it. This also includes chunk-based backup methods.
Note that cold storage is special:
  • Files on cold storage cannot be modified/moved/renamed, or more precisely, it is more expensive to do so.
  • There is typically a per-API-call/per-object cost, so we get a penalty for storing too many small objects.
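To make the per-object penalty concrete, here is a back-of-the-envelope example; the request price below is purely an assumed, illustrative number, not any provider's actual rate:

# Illustrative only: the per-request price is an assumption, not a real quote.
put_price_per_1000 = 0.05      # assumed $ per 1,000 upload (PUT) requests
num_files = 1_000_000          # e.g. one million small files
request_cost = num_files / 1000 * put_price_per_1000
print(f"~${request_cost:.0f} in request fees alone")   # ~$50, regardless of file sizes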
With option 1, I can easily access individual files on the storage, but if I rename or move some files locally, it will be a disaster in the next backup cycle.
With option 2, there is no problem with too many files, but I have to download the whole archive in order to access a single file inside it. Also, I need to make sure that the archives do not overlap (too much), or they will just waste space.

My solution is to organize and label the data, mostly by year. The good news is that most frozen data can be labeled this way, and the files are often large. This makes it mostly safe to upload them directly: files may be added or removed, but they are unlikely to be renamed or modified.

For unlabeled data, I'd just create archives every year; the total size is small compared with the labeled data, so I wouldn't worry about it.


Format of Hot Backups
There are also two possibilities:
  1. File-based. Every time a file is modified or removed, the old version is saved somewhere else.
  2. Chunk-based. All files are broken into chunks and stored that way, like git repos.
There are lots of things to consider, e.g. size, speed, safety/robustness, and ease of access.

The decision is to go for #1 for all relevant data classes. My thoughts are:
  • In the worst case, the whole chunk-based repo may be affected by a few rotten bits. This is not the case for file-based solutions.
  • I want to be able to access individual files without special tools.
  • Benefits of chunk-based approaches include deduplication and smaller sizes (mostly for changed files), but that does not really apply to my data: most of the big files are videos, which are rarely changed and cannot be compressed much.
On the other hand, I do plan to try out some chunk-based software in the future.


Backup for Repos
In my NAS I have a few git repos and (chunk-based) backup repos. So how should I back them up?

On one hand, the repos already store versions of the source files, so simply syncing them to cloud storage should work well enough.
On the other hand, should there be local data corruption, the cloud copy would also be damaged after the next sync.

The decision is to keep versions in the repo backups as well. Fortunately they are not very big.
I plan to revisit this later, and hopefully I will never need to recover a repo this way.



Backup Storage

I don't have enough spare HDDs to back up all my data. In any case, I prefer cloud storage for this task.
Hot and cold storage need to be discussed separately.


Hot Backup

Most cloud storage services would work well as a hot backup repo. In general the files are always available for reading and writing, which makes them suitable for simple rsync-like backups or chunk-based backups.

It is not too difficult to fit Class 1 data into free quota, although I do need to subset it.
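For example, such a subset can be expressed with rclone filter rules; the directory names and the remote name below are hypothetical:

import subprocess

# keep only the most critical subdirectories of Class 1 on the free-quota remote
subprocess.run(
    ["rclone", "sync", "/volume1/data/class1", "freequota:class1",
     "--filter", "+ /Documents/**",   # include this subtree
     "--filter", "+ /Code/**",        # and this one
     "--filter", "- *",               # exclude everything else
     "--verbose"],
    check=True,
)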

It is tricky to choose a single service for the other classes, as I'd like to keep all backup data together.
I have checked a number of services and found the following especially interesting.
  • Backblaze B2
  • Amazon S3
  • Google Cloud Storage
I'd just pick one, balancing cost, speed, reputation, etc. I wouldn't worry too much about software support, since all of them are popular.


Cold Archive

I only recently learned about cold archives from Jeff Geerling's backup plan. After some reading, I find the concept really interesting.

I'd mostly narrow it down to the following:
  • Amazon S3 Glacier Deep Archive
  • Google Archival Cloud Storage
  • Azure Archive
I remember also seeing similar storage classes from Huawei and Tencent, but their support among the open source tools I have found is not as good.


Software

I'd like to manage all backup tasks on my Raspberry Pi. 

rclone, the so-called Swiss army knife of cloud storage, is an easy winner. I just couldn't find another tool that matches it. The more I learn about it, the more I like it. To name a few observations:
  • Great coverage on cloud storage providers.
  • Comprehensive and well-written documents.
  • Active community.
  • Lots of useful features and safety checks.
  • Outputs both human-readable and machine-readable information.
So I ended up writing my own scripts that call rclone. With rclone it is easy to execute a task like "copy all files from A to B, and in case some files in B need to be modified, save a copy in C". So I just needed to focus on defining my tasks, scoping the data, and setting up the routines. It is not trivial though; more on that later.
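As an illustration, here is a minimal sketch of that "A to B, old versions to C" task using rclone's --backup-dir; the paths and remote names are made up, and my real scripts do more bookkeeping:

import subprocess
from datetime import date

SRC = "/volume1/data/class2"                        # hypothetical local path (A)
DST = "hotbackup:class2/current"                    # hypothetical rclone remote (B)
OLD = f"hotbackup:class2/versions/{date.today()}"   # dated directory for old versions (C)

subprocess.run(
    ["rclone", "sync", SRC, DST,
     "--backup-dir", OLD,    # files that would be overwritten or deleted are moved here
     "--verbose"],
    check=True,
)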

I also spent quite some time researching chunk-based backup tools, mainly BorgBackup, restic, and Duplicacy.
I also checked a few others, but not as extensively as these three.
I pulled myself out of the rabbit hole as soon as I realized that I don't need them at the moment, though I still cannot decide which one I would use should I need a chunk-based backup today. Here's a summary of my 2-page notes on these tools:
  • BorgBackup is mature (a fork of Attic, which dates back to 2010), but it has limited backend support.
  • restic is relatively new (first GitHub commit in 2014). It used to have performance issues with pruning, which seem to have been fixed. The backup format is not finalized (yet).
  • Duplicacy is even younger (first GitHub commit in 2016). Its license is non-standard, which concerns many people. Thanks to its lock-free design it benchmarks faster than the others, especially when multiple clients connect to the same repo. However, it may waste some space to achieve that.
Maybe things will change in a few years. I will keep an eye on them.


Technical Issues

I had quite a few issues when using OneDrive + WebDAV.
  • Limit on max path length
  • Limit on max file length
  • No checksums
  • No quota metrics.
Fortunately, most of them are not big problems.

Another issue, this time with 7zip, is that I cannot add empty directories to an archive without putting files in them. This is particularly important for my yearly cold archives.

Eventually I used Python and tarfile to achieve this. I could probably do the same with a 7z Python library, but I used tarfile anyway because it is part of the standard library, and most of the archives cannot be compressed effectively anyway.
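A minimal sketch of that step (the paths are hypothetical); tarfile's add() recurses by default and writes directory entries, so empty directories survive, which is exactly what I could not get out of 7zip:

import os
import tarfile

SOURCE = "/volume1/unlabeled/2021"       # hypothetical directory of unlabeled data
ARCHIVE = "/volume1/archives/2021.tar"   # plain tar: this data barely compresses

with tarfile.open(ARCHIVE, "w") as tar:  # "w" = uncompressed tar
    # add() walks the tree recursively and records directory entries,
    # so empty directories are preserved in the archive
    tar.add(SOURCE, arcname=os.path.basename(SOURCE))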


Next Steps

I will probably add a few more scripts to monitor and verify the backups. For example, download and verify ~10GB of data randomly selected from the backup repo.
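A sketch of what that verification might look like (the remote name is hypothetical; rclone's lsjson, copy, and check subcommands do the heavy lifting):

import json
import os
import random
import subprocess
import tempfile

REMOTE = "hotbackup:class3"    # hypothetical backup remote
TARGET_BYTES = 10 * 1024**3    # ~10GB sample

# list all files in the backup with their sizes
listing = subprocess.run(
    ["rclone", "lsjson", "-R", "--files-only", REMOTE],
    capture_output=True, text=True, check=True,
)
files = json.loads(listing.stdout)
random.shuffle(files)

# pick random files until we reach roughly the target size
sample, total = [], 0
for f in files:
    if total >= TARGET_BYTES:
        break
    sample.append(f["Path"])
    total += f["Size"]

with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "sample-list.txt")
    download_dir = os.path.join(tmp, "data")
    with open(list_path, "w") as fh:
        fh.write("\n".join(sample) + "\n")
    # download only the sampled files, then verify them against the remote
    subprocess.run(["rclone", "copy", REMOTE, download_dir,
                    "--files-from", list_path], check=True)
    subprocess.run(["rclone", "check", download_dir, REMOTE,
                    "--one-way"], check=True)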

I will also keep an eye on chunk-based solutions.
