NOTE: This document is somewhat obsolete. Because it was presented at a LUG meeting years ago, I have kept it intact in its original form, but I have also written a more up-to-date version of the presentation. The new version is available at http://www.sanitarium.net/golug/rsync_backups_2010.html.
Backups using rsync
Written by Kevin Korb
as a presentation for GOLUG
Presented on 2005-08-04
This document is available at http://www.sanitarium.net/golug/rsync_backups_2005.html
- What is rsync?
Rsync is a program for synchronizing 2 directory trees across different filesystems, even if they are on different computers. It can run its host-to-host communications over ssh to keep things secure and to provide key based authentication. Rsync can also do a block level comparison of 2 files and transfer only the parts that have changed, which is a huge benefit if you are transferring large files over a slow link.
- What are hard links?
Hard links are similar to symlinks. They are normally created using the ln command but without the -s switch. A hard link is when 2 file entries point to the same inode and disk blocks. Unlike symlinks there isn't a file and a symlink but rather 2 links to the same file. If you delete either entry the other will remain and will still contain the data. Here is an example of both:
------- Symbolic Link Demo -------
% echo foo > x
% ln -s x y
% ls -li ?
38062 -rw-r--r-- 1 kmk users 4 Jul 25 14:28 x
38066 lrwxrwxrwx 1 kmk users 1 Jul 25 14:28 y -> x
% grep . ?
% rm x
% ls -li ?
38066 lrwxrwxrwx 1 kmk users 1 Jul 25 14:28 y -> x
% grep . ?
grep: y: No such file or directory
------- Hard Link Demo -------
% echo foo > x
% ln x y
% ls -li ?
38062 -rw-r--r-- 2 kmk users 4 Jul 25 14:28 x
38062 -rw-r--r-- 2 kmk users 4 Jul 25 14:28 y
% grep . ?
% rm x
% ls -li ?
38062 -rw-r--r-- 1 kmk users 4 Jul 25 14:28 y
% grep . ?
------- Breaking a Hard Link -------
% echo foo > x
% ln x y
% ls -li ?
38062 -rw-r--r-- 2 kmk users 4 Jul 25 14:34 x
38062 -rw-r--r-- 2 kmk users 4 Jul 25 14:34 y
% grep . ?
% unlink y ; echo bar > y
% ls -li ?
38062 -rw-r--r-- 1 kmk users 4 Jul 25 14:34 x
38066 -rw-r--r-- 1 kmk users 4 Jul 25 14:34 y
% grep . ?
Why backup with rsync instead of something else?
- Disk based: Rsync is a disk based backup system. It doesn't use tapes which are too slow to backup modern systems with large hard drives.
- Fast: Rsync only backs up what has changed since the last backup. It NEVER has to repeat the full backup unlike most other systems that have monthly/weekly/daily configurations.
- Less work for the backup client: Most of the work in rsync backups including the rotation process is done on the backup server which is usually dedicated to doing backups. This means that the client system being backed up is not hit with as much load as with some other backup programs. The load can also be tailored to your particular needs through several rsync options.
- Fastest restores possible: If you just need to restore a single file or set of files it is as simple as a cp or scp command. Restoring an entire filesystem is just a reverse of the backup procedure. Restoring an entire system is a bit long but is less work than backup systems that require you to reinstall your OS first and about the same as other manual backup systems like dump or tar.
- Only one restore needed: Even though each backup is an incremental they are all accessible as full backups. This means you only restore the backup you want instead of restoring a full and an incremental.
- Cross Platform: Rsync can backup and recover anything that can run rsync. I have used it to backup Linux, Windows, DOS, OpenBSD, Solaris, and even ancient SunOS 4 systems.
- Cheap: It doesn't seem like it would be cheap to have enough disk space for 2 copies of everything and then some but it is. With tape drives you have to choose between a cheap drive with expensive tapes or an expensive drive with cheap tapes. In a hard drive based system you just buy cheap hard drives and use RAID to tie them together. My current backup server uses 2 300GB Maxtor drives on an old 3Ware 6200 RAID controller giving me a total of 600GB for about $380 which is less than I paid for the DDS3 tape drive that I used to use and that doesn't even include the tapes that cost about $10/12GB.
- Internet: Since rsync can run over ssh and only transfers what has changed (at the block level not the file level) it is perfect for backing up things across the internet if you need to do so. This is perfect for backing up a web site at a web hosting company or even a colocated server.
- Do-it-yourself: There are FOSS backup packages out now that use rsync as their back end but the nice thing here is that you are using standard command line tools (rsync, ssh, cp, rm) so you can engineer your own backup system that will do EXACTLY what you want and you don't need a special tool to restore.
Why/When wouldn't you want to use rsync for backups?
- Databases: Rsync is a file level backup so it is not suitable for databases. If your primary data is databases then you should look somewhere else. If you have databases but they are not your primary data then there is a procedure below to integrate a database backup into the rsync backups.
- Windows: If you plan to backup Windows boxes then rsync probably isn't for you. It is possible to backup Windows boxes with rsync but the system recovery process is UGLY and if you want a complete backup of the OS you will have to boot the computer into Linux to be able to read some of the files. Windows also has a very annoying difference in the way that it handles time stamps. This forces a full backup whenever your Windows box changes time zones (or DST).
- Compression: Since rsync doesn't put the files into any kind of archive there is no compression at all. In most cases it is still more cost effective to store uncompressed data on a hard drive than it is to store compressed data on a tape or some other media but this might not be true for everyone. Also, most modern file formats are already compressed so in many cases the compression wouldn't help anyway.
- Commercial support: Like most of the stuff I talk about there is no real commercial support for this. If you want a backup software vendor that you can call and beg for help from then go buy some big commercial backup system but expect to pay a ton of money for something that isn't anywhere near as nice as rsync.
- Security: Since rsync runs over ssh you would normally set it up so that root on your backup server can ssh into all of your other machines as root without a password. This means that the security of your backup server becomes very important as anyone who roots it can root any other server with one command. There are ways that you could design around this or you could simply require the person running the backup to type in the root passwords as it goes but those solutions all over-complicate things. Giving your backup server all of the keys isn't really as bad as it sounds though when you consider that in any other backup system the backup server would still have some kind of root access to the other servers as well as a complete copy of them that a hacker could use to break in.
- Do-it-yourself: This is still a do-it-yourself system. You have to decide how you want your backups to work and how you want them organized. If you don't want to write/modify shell scripts then look for something else or look at the available backup systems that use rsync as their back end.
Why not just use RAID / Is this like using RAID-1?
I don't think I can ever say this enough times.... RAID is NOT a backup system! RAID (other than level 0) does a wonderful job of protecting your data from disk failures. However, it provides absolutely NO protection against file corruption, files destroyed by a virus or a hacker, or the "oops, I deleted the wrong file" problem which most of us have encountered. There is a time and a place for RAID and RAID is not always needed however data should ALWAYS be backed up regardless of what media it is stored on or how redundant that media may be.
How do you do offsite/offline backups with rsync?
The best way to do an offsite or offline backup is to do the rsync backup like normal and then backup the backup to tape or whatever media you want to use for your offline/offsite backups. This gives you all the speed advantages of rsync during the actual backups and restores while allowing you to do the slower tape backups during the day when the backup server would otherwise be idle. Note that I do not recommend using removable hard drives for offsite rsync backups. Hard drives have very fragile moving parts and if you are constantly transporting them around they will not last long and will probably fail when you need them most as that is when they will be transported.
How do you handle databases?
The best way is to use the file based backup tool that the database vendor provides to backup the database to the local disk. In the case of MySQL you have a choice of mysqldump or mysqlhotcopy. Once the database is backed up to the local disk rsync will backup the backup just as it would any other file. You should not have rsync backup the actual database files while the database is running because it is likely that the backup will be corrupted. The downside of this procedure is that you will be doing a full backup of the databases each time which can add up to a significant amount of data so you shouldn't use this if you have huge databases.
How much space does it take to do rsync backups while keeping old copies?
This completely depends on how much change there is between each backup and how many backups you store. I have found that my personal backups which store 10 copies at irregular intervals require an average of about 30% more disk space than a single full backup.
Since this is a do-it-yourself system this is totally up to you to design. I have my backup storage mounted under /backup and put all of my rsync backups under /backup/rsync. Within that directory I make a directory for each host that gets backed up. Then for each backup of each file system I change '/' to '_' in the mount point name and time stamp the filesystem so my backup of /home/asylum done at 17:47 on 2005-07-25 would be stored in /backup/rsync/asylum/_home_asylum.2005-07-25.17-47-42. When the backup is done I would create a symlink from that directory to /backup/rsync/asylum/_home_asylum.current to make it easier to find especially from scripts.
Rsync doesn't do incremental backups itself (actually it has recently gained that ability but I still prefer this method). In order to get incremental backups we will use a special feature of the cp command that will copy an entire directory tree by making hard links of each file instead of actually duplicating them. First you count up how many existing backups there are and decide if there are too many. If so delete the oldest until you are happy. Then you use cp -al to duplicate the most recent backup into a new directory with the current time stamp. Here is a demonstration:
# readlink _home_asylum.current
# time cp -al _home_asylum.2005-07-25.15-32-42 _home_asylum.demo
0.218u 2.836s 0:25.62 11.8% 0+0k 0+0io 0pf+0w
# ls -ld _home_asylum.2005-07-25.15-32-42 _home_asylum.demo
drwxr-xr-x 5 root root 120 Jul 13 15:24 _home_asylum.2005-07-25.15-32-42/
drwxr-xr-x 5 root root 120 Jul 13 15:24 _home_asylum.demo/
# du -shc _home_asylum.2005-07-25.15-32-42 _home_asylum.demo
As you can see the directory is copied very quickly but it doesn't take up any additional space. This is because cp -al does not copy files but rather makes hard links to them and du is smart enough to only count each inode once. However both directories are complete if looked at individually:
# du -sh _home_asylum.2005-07-25.15-32-42
# du -sh _home_asylum.demo
This concept is the key to doing incremental backups with rsync. It is also what makes restoration so easy since each backup looks like a full backup to all standard file utilities. Here is a du comparison of all 10 backups of my home directory currently in my rotation:
# du -shc _home_asylum.2*
# foreach f (_home_asylum.2*)
foreach? du -sh $f
Note that in this particular case there is only a 15.4% space increase for these 10 copies even though they span more than a month due to the fact that I don't do my backups as regularly as I should.
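The rotation described above (prune the oldest backups, then hard-link copy the newest one under a fresh time stamp) can be sketched as a few shell functions. This is only an illustration under assumptions: the function names backup_prefix, rotate, and mark_current are made up for this example, the directory layout is the one described earlier, and cp -al requires GNU cp.

```shell
# Hypothetical helpers for illustration; not the author's actual script.

# Turn a mount point into a directory prefix: /home/asylum -> _home_asylum
backup_prefix() {
    printf '%s\n' "$1" | tr '/' '_'
}

# rotate HOSTDIR MOUNTPOINT KEEP
# Prune old backups, then hard-link the newest one into a new time-stamped
# directory and print that directory's name.
rotate() {
    hostdir=$1
    prefix=$(backup_prefix "$2")
    keep=$3
    new="$hostdir/$prefix.$(date +%Y-%m-%d.%H-%M-%S)"

    # Time-stamped names glob in oldest-first order.
    set -- "$hostdir/$prefix".2*
    [ -e "$1" ] || set --        # the glob matched nothing

    # Delete the oldest backups, leaving room for the new one but always
    # keeping at least one to link from.
    while [ "$#" -ge "$keep" ] && [ "$#" -gt 1 ]; do
        rm -rf -- "$1"
        shift
    done

    if [ "$#" -gt 0 ]; then
        for latest in "$@"; do :; done   # last element = newest backup
        cp -al -- "$latest" "$new"       # hard links only, no file data copied
    else
        mkdir -p -- "$new"               # very first backup: start empty
    fi
    printf '%s\n' "$new"
}

# mark_current HOSTDIR MOUNTPOINT NEWDIR
# Once the backup has finished, repoint the .current symlink at it.
mark_current() {
    ln -sfn -- "$(basename "$3")" "$1/$(backup_prefix "$2").current"
}
```

A run might look like new=$(rotate /backup/rsync/asylum /home/asylum 10), followed by the rsync step covered in the next section with "$new/" as its destination, and finally mark_current /backup/rsync/asylum /home/asylum "$new" once the backup has succeeded.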
Actually backing up
Now we get to actually look at rsync. When you run rsync you will tell it to backup the live filesystem into the directory where you just made that tree of hard links. Whenever rsync finds a new file it will copy over that file. Whenever it finds a modified file it will copy over the differences and break the hard link relationship causing the files to become independent of each other so the old version is still in the old directory. There is a wide variety of options that can be used with rsync but here is what I would usually use for that demo directory we have been working with:
rsync -vaHx --progress --numeric-ids --delete \
--exclude-from=asylum_backup.excludes --delete-excluded \
root@asylum:/home/asylum/ /backup/rsync/asylum/_home_asylum.demo/
Now I will explain the components of that rather long command...
- rsync: Duh, the rsync command ;)
- -v: Verbose. This causes rsync to list each file that it touches. I would leave this out if running from cron.
- -a: Archive. This causes rsync to maintain things like file permissions and ownerships.
- -H: Hard Links. This causes rsync to maintain hard links that are on the server being backed up. This has nothing to do with the hard links used during the rotation.
- -x: One File System. This causes rsync to NOT recurse into other filesystems. If you use this like I do then you must backup each filesystem (mount point) one at a time. The alternative is to simply backup / and exclude things you don't want to backup (like /proc, /tmp, and any network or CDROM mounts)
- --progress: This adds to the -v and tells rsync to print out a %completion and transfer speed while transferring large files. I would definitely leave this out when running from cron!
- --numeric-ids: This tells rsync to not attempt to translate UID <> userid or GID <> groupid. This is very important when doing backups and restores. If you are doing a restore from a live cd such as Knoppix your file ownerships will be completely screwed up if you leave this out.
- --delete: This tells rsync to delete files that are no longer on the server from the backup. The delete is done BEFORE any of the new data is transferred.
- --exclude-from=asylum_backup.excludes: This is a plain text file with a list of paths that I do not want backed up on this particular server. The format of the file is simply one path per line.
- --delete-excluded: This tells rsync that it can delete stuff from a previous backup that is now within the excluded list.
- root@: This is the userid given to rsync which it will then use to ssh to the server getting backed up.
- asylum:: This is the hostname that rsync will ssh to.
- /home/asylum/: This is the path on the server that is to be backed up. Note that the trailing slash IS significant.
- /backup/rsync/asylum/_home_asylum.demo/: This again is that tree we created with cp -al. Note that the trailing slash is significant here as well.
There is also an environment variable that rsync uses to determine what command to use for its network communications. Here is the variable that I use:
RSYNC_RSH "ssh -c arcfour -o Compression=no -x"
Now I will explain the components of that variable...
- ssh: use ssh instead of the default of rsh.
- -c arcfour: Uses the weakest but fastest encryption that ssh supports.
- -o Compression=no: Turns off ssh's compression. Rsync has its own if you want it.
- -x: Turns off ssh's X tunneling feature (if you actually have it on by default)
Recovering files from backups
Because rsync doesn't put the backed up files into any kind of archive this is as simple as copying a file. Just find the file you need on the backup server and copy it to where you need it to be. If you are restoring it to another server just use scp to get it there. Here are 2 examples of files that can be restored from my home directory:
# ls -li _home_asylum.2*/kmk/bin/encode
3605946 5 kmk users 2223 Jul 2 11:34 _home_asylum.2005-07-05.11-05-34/kmk/bin/encode
3605946 5 kmk users 2223 Jul 2 11:34 _home_asylum.2005-07-07.13-43-22/kmk/bin/encode
3605946 5 kmk users 2223 Jul 2 11:34 _home_asylum.2005-07-07.17-22-09/kmk/bin/encode
3605946 5 kmk users 2223 Jul 2 11:34 _home_asylum.2005-07-13.11-14-32/kmk/bin/encode
3605946 5 kmk users 2223 Jul 2 11:34 _home_asylum.2005-07-18.16-32-54/kmk/bin/encode
4853134 1 kmk users 4012 Jul 21 19:31 _home_asylum.2005-07-25.15-32-42/kmk/bin/encode
# ls -li _home_asylum.2*/kmk/bin/mp3db
4074469 1 kmk users 29598 Jun 19 16:01 _home_asylum.2005-06-21.15-29-25/kmk/bin/mp3db
4082467 1 kmk users 29943 Jun 22 19:10 _home_asylum.2005-06-22.20-12-01/kmk/bin/mp3db
4124342 1 kmk users 30570 Jun 30 17:22 _home_asylum.2005-06-30.18-36-21/kmk/bin/mp3db
2617551 1 kmk users 30701 Jul 1 12:17 _home_asylum.2005-07-01.12-15-05/kmk/bin/mp3db
3605948 1 kmk users 35604 Jul 1 16:50 _home_asylum.2005-07-05.11-05-34/kmk/bin/mp3db
4411207 2 kmk users 35668 Jul 6 11:06 _home_asylum.2005-07-07.13-43-22/kmk/bin/mp3db
4411207 2 kmk users 35668 Jul 6 11:06 _home_asylum.2005-07-07.17-22-09/kmk/bin/mp3db
4523360 1 kmk users 37041 Jul 9 17:28 _home_asylum.2005-07-13.11-14-32/kmk/bin/mp3db
4675812 1 kmk users 37201 Jul 18 09:50 _home_asylum.2005-07-18.16-32-54/kmk/bin/mp3db
4853138 1 kmk users 37200 Jul 19 16:46 _home_asylum.2005-07-25.15-32-42/kmk/bin/mp3db
As you can see my encode script has been fairly constant while my mp3db script has changed almost every time I have run a backup. I can choose to restore whichever version I want as they are all just plain files.
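Because unchanged files are hard links to the same inode, the distinct versions of a file can be picked out without comparing contents. Here is a minimal sketch, assuming the directory naming used above. The function name distinct_versions is made up for this example, and the -ef test (same device and inode) is assumed to be available in your shell, as it is in bash and dash.

```shell
# distinct_versions HOSTDIR PREFIX RELPATH
# Print the first backup copy of each distinct version of RELPATH,
# skipping copies that are just hard links to the previous backup's copy.
distinct_versions() {
    prev=
    for f in "$1/$2".2*/"$3"; do
        [ -e "$f" ] || continue              # glob matched nothing
        if [ -z "$prev" ] || ! [ "$f" -ef "$prev" ]; then
            printf '%s\n' "$f"
        fi
        prev=$f
    done
}
```

Called as distinct_versions /backup/rsync/asylum _home_asylum kmk/bin/encode, it should list only the 2005-07-05 and 2005-07-25 copies from the encode listing above, since the copies in between share the first one's inode.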
Recovering entire filesystems from backups
This is a simple reverse of the backup procedure. Just format the new filesystem and rsync the files back to it.
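As a rough sketch of what "reverse of the backup procedure" means, the helper below only prints a candidate rsync command with the source and destination swapped, so it can be reviewed before it is run. The function name restore_cmd is hypothetical, and dropping the --delete and exclude options is an assumption for a freshly formatted target, not something the procedure above spells out.

```shell
# restore_cmd BACKUP_DIR HOST TARGET_DIR
# Print (do not run) an rsync command that pushes a backup tree back to a host.
# The trailing slashes are added here because they are significant to rsync.
restore_cmd() {
    printf 'rsync -vaHx --progress --numeric-ids %s/ root@%s:%s/\n' "$1" "$2" "$3"
}

restore_cmd /backup/rsync/asylum/_home_asylum.current asylum /home/asylum
```

Printing the command instead of executing it keeps the sketch harmless. Paste the output into a shell on the backup server once the new filesystem is formatted and mounted on the target.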
Recovering entire systems from backups
This is where things get a little ugly. Of course this is for times that are already ugly because you probably just lost your boot drive and have a brand new one installed that is completely blank. This procedure varies a bit depending on what OS you are restoring but here is the general idea:
- Boot from some media that gives you an OS, networking, rsync, and ssh. Knoppix can do the job however I have found that Knoppix >= 3.7 makes things different later on. I usually use either Knoppix-STD or the Gentoo install CD. Unfortunately there is no such disc for either OpenBSD or Solaris. In the case of OpenBSD I use their install disc and then use ftp to transfer a tarball of the rsync backup instead of using rsync. The same thing will work in Solaris although it is usually easier to NFS mount the backup repository. Of course if you are restoring the backup server you don't need any of these tools as a simple cp will do.
- Partition the new drive with fdisk or whatever you usually use. If you follow my advice in the advanced section you will have an .sfdisk file and you can duplicate the original partition table with 'sfdisk /dev/whatever < file.sfdisk'.
- Format the new partitions. Linux choices are mke2fs, mkreiserfs, mkfs.xfs, and mkswap. For most other operating systems it is simply newfs.
- Mount up the new partitions in a convenient location with something like:
mount -vt [fstype] /dev/[root partition] /s
mkdir /s/usr /s/var /s/proc /s/dev /s/tmp
chmod 1777 /s/tmp
mount -vt [fstype] /dev/[var partition] /s/var
mount -vt [fstype] /dev/[usr partition] /s/usr
- Now run your filesystem level restores just like you would if you weren't recovering the entire system. You will need to restore each filesystem that was on the old boot disk.
- If you have made any changes such as device names, mount points, or partition layouts you should now update /s/etc/fstab and /s/boot/grub/grub.conf (or /s/etc/lilo.conf if using lilo).
- Now you have to make the disk bootable again. This totally varies by operating system and boot loader...
For Linux systems using grub:
- mount -vo bind /dev /s/dev
- mount -vo bind /proc /s/proc
- chroot /s /bin/bash
- root (hd0,0) - or whatever partition matches your boot disk (this and the next command are entered at the grub prompt)
- setup (hd0)
For Linux systems using lilo:
- mount -vo bind /dev /s/dev
- mount -vo bind /proc /s/proc
- chroot /s /bin/bash
- lilo -v
For OpenBSD systems:
- cd /s/usr/mdec
- ./installboot /s/boot ./biosboot /dev/rwd0c (or /dev/rsd0c if using SCSI)
For Solaris systems:
- installboot /s/usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
- Format of backup repository: Assuming you are using a Linux box as your backup server you have multiple choices for the filesystem type that you want to format the backup drive with. I generally use reiserfs because it is better on small files (like the ones in the OS part of the backup) and because it is pretty fast. However, xfs is also a good choice because it is better at dealing with large files and it is much better at doing the delete portion of the backup rotation. You may want to play with these 2 choices a bit before you make your final decision.
- RAIDed backup repository: IMHO RAID redundancy is not needed on the backups because they are an extra copy of the data anyway. My backup drive is a 600GB RAID-0 made up of 2 300GB IDE drives and the RAID-0 provides no redundancy at all. If you are extra paranoid and you want redundancy through RAID-1 or RAID-5 then you can add it but most people will find that it isn't needed. If you use RAID-0 or RAID-5 you should set your stripe size small because most of the disk work is done at the filesystem metadata level not the file level so that is where you want your speed boost.
- Separating the rotation process from the backups: If your backup window is extra short you can separate out the components of the backup and the rotation. You can run your backup at night during the short window and you can do the rotation during the day since it doesn't affect the other servers.
- Cross-platform handling of /dev and other device files: Since different operating systems handle major and minor numbers differently I suggest excluding /dev from the rsync backups. I keep a /dev.tar tarball on all of my boxes with a backup of /dev in it just in case I ever need to restore that. The tarball will be very small since there are no actual files in it.
- What is different between 2 backups: I wrote a perl script that scans 2 backups of the same directory and lists what has changed between them. I have published that script at http://www.sanitarium.net/unix_stuff/backups/diff_backup.pl.txt
- Storing data that isn't kept in a file: I wrote a perl script that does backups of data that isn't stored in files such as partition tables. My main backup script runs this "getinfo" script whenever it backs up a root filesystem. The script is published at http://www.sanitarium.net/unix_stuff/backups/getinfo.pl.txt. I also have Linux and OpenBSD examples of its tab files published at http://www.sanitarium.net/unix_stuff/backups/asylum_backup.getinfo.tab.txt and http://www.sanitarium.net/unix_stuff/backups/hellmouth_backup.getinfo.tab.txt
- rsync -n: This is rsync's "dry run mode". You can use this on any other rsync command to have rsync tell you what it would have done without the -n parameter without actually doing anything.
- rsync -W: This tells rsync to transfer entire files instead of using its block level comparison system. If you have a nice fast link (like a LAN) this can make things faster since rsync doesn't have to checksum files at all but if you are transferring across the internet you don't want this.
- rsync -c: This tells rsync to checksum all files. Normally rsync compares the timestamp and the size of a file to determine if it has changed since the last backup. If you use -c rsync will checksum ALL files which will take a long time. You wouldn't normally use this option however it is good to have if you believe your data has become corrupted in a way that doesn't affect the information you see in an ls -l output.
- rsync -S: This tells rsync to handle sparse files as sparse files. If you have sparse files you should probably add this.
- rsync --delete vs. --delete-after: --delete-after tells rsync to delete files from the backup if they no longer exist on the server just like --delete does but --delete-after does the delete part at the end instead of the beginning. Doing the delete at the beginning gets the old junk out of the way however it forces rsync to scan the entire backup directory at the beginning. With --delete-after you get a small performance boost because rsync has already scanned the backup directory when it gets around to the delete part. Note that if you use the --delete-after option you should also use the --force option or rsync will not properly handle cases where a directory turns into a file or the other way around.
- rsync -T: If you have a tmpfs mount you can get a very small speed boost by using this parameter. It causes the partial files used during the block level transfers to be stored in an alternate (faster) location until the file is complete. This will only help if you are doing block level transfers and if the directory you specify is on a tmpfs mount. Note that your tmpfs mount must be big enough to hold any single file or it will cause rsync to fail with an insufficient disk space error. Also, if your tmpfs mount goes into swap you will completely kill your performance. IOW, don't use this unless you are sure it is going to help.
- rsync --link-dest vs. cp -al: The --link-dest parameter is a somewhat new feature of rsync that allows it to provide the functionality of the cp -al we did during the rotation. In this case you would tell rsync to backup to an empty directory instead of a populated tree and have --link-dest= the old backup directory. This is significantly faster as rsync does not need to link files that aren't there anymore or files that need to be updated and it also has no need for the delete part. I personally don't like --link-dest because when used with -v it displays every single directory name within the backup since directories can't be linked. If you think my backup method takes too long this is definitely where you want to start.
- rsync --bwlimit: Allows you to limit how much bandwidth rsync uses in its network communications.
- rsync --ignore-errors: This overrides one of rsync's built in safety features. Normally if there is a problem during the backup rsync will NOT run its delete pass. If you use --ignore-errors the delete pass will run regardless of any other errors. Note that this isn't as dangerous as it sounds since you still have older backups.
- rsync --max-delete: This allows you to reimplement the safety feature above with a threshold. You can tell rsync how many files it can delete before it decides that something must be wrong and stops.
- rsync -z: This tells rsync to use zlib compression on its communications. This would be good if you are backing up over the internet but it is usually counter productive on a LAN.
- rsync -A: This tells rsync to preserve ACLs in addition to permissions.
- rsync -D: This tells rsync to preserve device files.
- push instead of pull: Rsync can push data just as well as it can pull it. It is possible to have all servers push their backups to the backup server instead of the backup server pulling the data from them. I personally don't like this approach because it means that all your servers have the key to your backup server instead of the other way around. You also have to engineer a much more complicated way of doing the rotations and make sure you don't have 10 servers trying to back themselves up at once, which would flood the backup server.
- Buddy backups: If you don't want to dedicate a box to running backups you could pair off your boxes and have them backup each other. You could also do this in a ring layout.
- LVM Snapshots: It is also possible to use LVM to take an instant snapshot of a filesystem and then backup that snapshot. This would remove any chance of a filesystem changing during the backup.
- Squashfs for archives: If you want to make a permanent archive of a particular backup (perhaps to burn it) squashfs is a great way to do it. Squashfs creates a compressed mountable archive of a directory tree. You create a squashfs archive with mksquashfs which works much like mkisofs and then you can mount the resulting file as a loopback device. Here is a quick example of a squashfs containing old email archives which saves me 67% of the disk space that would be taken up without squashfs:
% df -hT /home/asylum/kmk/mail/backup_inbox.old
Filesystem Type Size Used Avail Use% Mounted on
squashfs 972M 972M 0 100% /home/asylum/kmk/mail/backup_inbox.old
% du -sh /home/asylum/kmk/mail/.backup_inbox.old.squashfs
% du -sh /home/asylum/kmk/mail/backup_inbox.old
The /etc/fstab entry for mounting such an archive takes this form:
[squashfs file] [mount point] squashfs ro,loop 0 0
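Returning to the "What is different between 2 backups" item in the list above: with the hard-link rotation, a file that is unchanged between two backups is the same inode in both, so a comparison never has to read file contents. The following is a minimal shell sketch of that idea. It is not the perl script linked above, the function name changed_between is made up for this example, and it only looks at regular files.

```shell
# changed_between OLD NEW
# Compare two backups of the same tree made with the hard-link rotation.
changed_between() {
    old=$1
    new=$2
    # Files in NEW that are missing from OLD or no longer share its inode.
    ( cd "$new" && find . -type f ) | while IFS= read -r rel; do
        if [ ! -e "$old/$rel" ]; then
            printf 'added   %s\n' "$rel"
        elif ! [ "$old/$rel" -ef "$new/$rel" ]; then
            printf 'changed %s\n' "$rel"
        fi
    done
    # Files that exist only in OLD were deleted before NEW was taken.
    ( cd "$old" && find . -type f ) | while IFS= read -r rel; do
        [ -e "$new/$rel" ] || printf 'removed %s\n' "$rel"
    done
}
```

For example, changed_between _home_asylum.2005-07-18.16-32-54 _home_asylum.2005-07-25.15-32-42 would be run from the host's backup directory. Because it relies on inode identity, it only gives meaningful results for backups that live on the same filesystem and were produced by the rotation described earlier.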