Copying large files with Rsync, and some misconceptions

There is a notion that people working in the IT industry often copy and paste from internet howtos. We all do it, and the copying and pasting itself is not a problem. The problem is when we run things without understanding them.

Some years ago, a friend who used to work on my team needed to copy virtual machine templates from site A to site B. They could not understand why the file they copied was 10GB at site A but became 100GB at site B.

The friend believed that rsync is a magic tool that should just “sync” the file as it is. However, what most of us forget is to understand what rsync really is, how it is used, and, most importantly in my opinion, where it comes from. This article provides some further information about rsync, and an explanation of what happened in that story.

About rsync

rsync is a tool created by Andrew Tridgell and Paul Mackerras, who were motivated by the following problem:

Imagine you have two files, file_A and file_B. You wish to update file_B to be the same as file_A. The obvious method is to copy file_A onto file_B.

Now imagine that the two files are on two different servers connected by a slow communications link, for example, a dial-up IP link. If file_A is large, copying it onto file_B will be slow, and sometimes not even possible. To make it more efficient, you could compress file_A before sending it, but that would usually only gain a factor of 2 to 4.

Now assume that file_A and file_B are quite similar, and that to speed things up, you take advantage of this similarity. A common method is to send just the differences between file_A and file_B down the link and then use this list of differences to reconstruct the file on the remote end.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used. (Once you have copied the file over, you no longer need the differences.) This is the problem that rsync addresses.

The rsync algorithm efficiently computes which parts of a source file match parts of an existing destination file. Matching parts then do not need to be sent across the link; all that is needed is a reference to the part of the destination file. Only the parts of the source file that do not match need to be sent over.

The receiver can then construct a copy of the source file using the references to parts of the existing destination file and the original material.

Additionally, the data sent to the receiver can be compressed using any of a range of common compression algorithms for further speed improvements.

As many of us know, the rsync algorithm addresses this problem in a lovely way.
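A quick way to see this in practice is rsync’s --stats output, which reports how much data was matched on the receiving end versus sent literally (the file name and host below are placeholders):

# First run: the whole file is sent (compressed with -z).
rsync -avz --stats big_file user@host:/backup/

# Modify a small part of big_file and run the same command again:
# "Matched data" should dominate and "Literal data" stay small,
# showing that only the changed parts crossed the link.
rsync -avz --stats big_file user@host:/backup/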

After this introduction to rsync, back to the story!

Problem 1: Thin provisioning

There were two things that would have helped the friend understand what was going on.

The problem with the file getting significantly bigger on the other side was caused by Thin Provisioning (TP) being enabled on the source system, a method of optimizing the efficiency of available space in Storage Area Networks (SAN) or Network Attached Storage (NAS).

The source file took up only 10GB because TP was enabled, but when transferred over using rsync without any additional configuration, the destination received the full 100GB. rsync could not do the magic automatically; it had to be configured.

The flag that does this work is -S or --sparse, and it tells rsync to handle sparse files efficiently. It does what it says: it skips over the empty blocks and sends only the actual data, so both source and destination end up with a 10GB file.
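For example (the file name, host, and paths below are placeholders), you can verify the result by comparing the allocated size with the logical size on the destination:

rsync -avS vm_template.img user@host:/templates/

# Allocated blocks vs. logical file size:
du -h /templates/vm_template.img                   # roughly 10GB actually allocated
du -h --apparent-size /templates/vm_template.img   # 100GB logical size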

Problem 2: Updating files

The second problem appeared when sending over an updated file. The destination was now receiving just the 10GB, but the whole file (containing the virtual disk) was transferred every time, even when just a single configuration file had changed on that virtual disk. In other words, only a small portion of the file changed, but the entire file was resent.

The command used for this transfer was:

rsync -avS vmdk_file syncuser@host1:/destination

Again, understanding how rsync works would help with this problem as well.

The above is the biggest misconception about rsync. Many of us think rsync will simply send the delta updates of the files, and that it will automatically update only what needs to be updated. But this is not the default behaviour of rsync.

As the man page says, the default behaviour of rsync is to create a new copy of the file in the destination and to move it into the right place when the transfer is completed.
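You can observe this while a transfer is running: the destination directory briefly holds a hidden temporary file, which is renamed into place once the transfer completes (the temporary name below is illustrative):

ls -la /destination
# .vmdk_file.mUhPrj   <- in-progress temporary copy (needs extra disk space)
# vmdk_file           <- replaced atomically when the transfer completes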

To change this default behaviour of rsync, you have to set the following flags and then rsync will send only the deltas:

--inplace               update destination files in-place
--partial               keep partially transferred files
--append                append data onto shorter files
--progress              show progress during transfer

So the full command that would do exactly what the friend wanted is:

rsync -av --partial --inplace --append --progress vmdk_file syncuser@host1:/destination

Note that the sparse flag -S had to be removed, for two reasons. The first is that --sparse and --inplace cannot be used together when sending a file over the wire. The second is that once you have sent a file over with --sparse, you can’t update it with --inplace anymore. Note that versions of rsync older than 3.1.3 will reject the combination of --sparse and --inplace.
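On such older versions, combining the two simply fails (the exact error wording may vary between versions):

rsync -avS --inplace vmdk_file syncuser@host1:/destination
# rsync: --sparse cannot be used with --inplace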

So even though the friend ended up copying 100GB over the wire, that only had to happen once. All the following updates copied only the difference, making the copies extremely efficient.

Comments

  1. 4e4

    Hm… better to write about Tox VPN.
    Tox is a communicator: https://github.com/cleverca22/toxvpn
    Or iodine, to put files through DNS.

    • RG

      Could someone provide a package or build for the official Fedora repository? Is there any Copr at least?

  2. Kathirvel Rajendran

    Hi Daniel,

    Very useful article!

  3. Garrett Nievin

    Very nice article, thank you!

    Tiny spelling correction: ‘--inaplce’ should be ‘--inplace’.

  4. Night Romantic

    Daniel, thanks for the article!

    There’s a small typo here:

    The first is that you can not use --sparse and --inaplce together when sending a file over the wire.

  5. Rakshith

    I use rsync a lot…. Thank you for letting us know the nitty-gritty…. 👍

  6. Jake

    I thought the delta-xfer algorithm is always used, unless you use the --whole-file option to disable it? So I would’ve thought that the large file should be transferred “efficiently” over the network; it’s just that the receiver side chews up twice the space when not using --inplace (the existing local destination file is used as a “basis”, and matching blocks can be copied from there to the temporary file).

    I could be wrong, but don’t know really how to check this… or this could be what you meant and I wasn’t understanding on my end.

    • SVD

      That was my understanding too, could it be that the author has it wrong?

      • Mohammed El-Afifi

        I’ve reviewed the rsync man page and I can conclude the same. rsync by default uses the delta-xfer algorithm for the network transfer; it’s only the handling of the transferred delta chunks on the destination end that is by default applied to a temporary file before being copied to the target file. The options the author has listed override the target-file reconstruction step by dropping the use of a temporary file altogether.
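        (For reference, the delta algorithm can be toggled explicitly; the file names and host below are placeholders:)

        rsync -av --whole-file big_file user@host:/dst/   # disable delta-xfer; this is the default for local copies
        rsync -av --no-whole-file /src/big_file /dst/     # force delta-xfer even for a local copy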

  7. Linux Dummy

    rsync is one of those commands I have never really understood, but it gets recommended a lot when you ask about making backups. I’ll bet I have asked some variation of this question a dozen times, and have had rsync recommended every single time, but no one seems to know the correct options to use. The question is this:

    Assume I am running a Linux server with no desktop using something like the “lite” version of Raspbian, Debian, or Ubuntu. This means I cannot use anything that requires a GUI configuration utility.

    What I want to be able to do is back up in a manner similar to how Apple’s Time Machine does it, but again without the GUI. Specifically, on a regular schedule (every hour or every day) I want to do a full backup of all files to a different drive on the machine or elsewhere on the local network, but of course I only want it to write the data that has changed. My understanding is that Apple uses “sparse” files of some kind in Time Machine, so that’s fine.

    What I really need is two things. The first is the ability to easily recover a single file, group of files, or entire directory from the backup (not necessarily the most recent backup; I should be able to view or recover files from earlier backups in addition to the most recent, as is possible in Time Machine). But more important, and this is the thing that nobody seems to be able to explain how to do, is that I need to be able to reformat the hard drive and then put everything (system files AND user files) back as they were prior to the last update.

    A big bonus would be the ability to roll back the last kernel update and get back to how everything was just before that kernel update. In other words, something similar to taking a snapshot of a virtual machine, then restoring from that snapshot, but without actually using a virtual machine.

    Everybody seems to think rsync can do all this but no one seems to know how, and if there’s a guide that explains it I haven’t found it yet. Typically the next thing people will suggest is some program that uses a GUI (if only for configuration) but of course that doesn’t work if your only access is via ssh.

    I wish someone would make a very basic guide for noobs that basically says that if you want to do this type of backup or recovery, these are the options you should use. Sometimes I get the feeling that nobody but the developers truly understand rsync, but a lot of people know a little bit about it.

    • Night Romantic

      @Linux Dummy, as far as I know, rsync isn’t the solution for you. Rsync is for file syncing, and although it can be used (and is used) as a light backup solution, it isn’t well-suited to do everything you ask for (without some heavy-duty scripting).

      I think you should look at the backup options available. One possibility would be this:

      https://fedoramagazine.org/backup-on-fedora-silverblue-with-borg/

      The article is about Silverblue, but Borg itself is generic, not Silverblue-specific. There are other good and time-tested CLI backup options, but I can’t remember them off the top of my head. You can ask for advice on Ask Fedora; I’m sure someone there will provide more options for you.

      • @Linux Dummy: You could also look into Deja Dup.

        • Yes, Deja Dup is a friendly front end for Duplicity (http://duplicity.nongnu.org/), something I incorporate into a lot of DR and backup solutions on a regular basis. They both take a lot of the guesswork out of using tar and rsync together, and offer the flexibility to be used in a myriad of ways.

          Great article @danbreuu, by the way 🙂

      • mcshoggoth

        Another option is Veeam Agent for Linux. There’s a free edition that will do what you want and you can set it up from the command line. There is not an ARM version of it yet, but it does exist for debian, rpm, openSUSE, and several other flavors.

        It offers file-level recovery and Forever Forward Incremental backups. Scheduling of the job occurs using crontab.

        https://www.veeam.com/linux-backup-free.html

    • RG

      What about backintime (in the official repository)? It’s a GUI frontend that automates jobs with rsync as the backend, and it supports various targets.

    • cmurf

      Time Machine has gone through many revisions, but the central way it works is the snapshots are all hardlink based, and changed files tracked by FSEvents are written over “stale” hardlinks. And to this day it’s still based on Journaled HFS+, even if the source is APFS.

      Some time ago I found this approximate idea with rsync, specifically the first answer.
      https://serverfault.com/questions/489289/handling-renamed-files-or-directories-in-rsync

      A more modern approach is ZFS/Btrfs snapshots with incremental send/receive. That’s not a generic solution, of course. Both source and destination must be the same filesystem to take advantage of that approach.

    • @Linux Dummy, FWIW, I do something kind of similar and I’ve been very happy with using ZFS as a backup/rollback solution for my VMs. I have several VM “root” filesystems that I share out from my ZFS pool over NFS. I can snapshot them, roll them back, perform incremental backups, live migrate them between hypervisors — all the good stuff. And because ZFS is a log-based filesystem, the snapshots, backups, and clones are all extremely efficient. The destination of the backup doesn’t have to be ZFS BTW (as someone else stated earlier). You can write the backup out (incremental or otherwise) as a regular file on any target filesystem. I wouldn’t recommend that though as it might be difficult to re-assemble the incremental backup later on.

    • The way that is usually done is to organize the backup directories into one per date, using --link-dest to share unchanged files with the last backup (a minimal sketch follows the listing below). While this can certainly be done manually, and I sometimes do so, at the very least I will capture the command in a backup.sh script in the backup directory.

      Here are the internal scripts we use: https://github.com/sdgathman/rbackup

      The root backup destination has a flag file and directory per system:

      l /mnt/unilit

      total 24
      -rw-r--r--. 1 root root 0 Sep 5 2017 BMS_BACKUP_V1
      drwxr-xr-x. 3 root root 4096 Sep 7 2017 C6
      drwxr-xr-x. 4 root root 4096 Sep 7 2017 c6mail
      drwxr-xr-x. 4 root root 4096 Sep 4 22:07 c7mail
      drwxr-xr-x. 10 root root 4096 Sep 12 21:15 cumplir7
      drwxr-xr-x. 35 root root 4096 Sep 7 2017 current
      drwxr-xr-x. 4 root root 4096 Sep 7 2017 hulk

      And each directory looks like this:

      l /mnt/unilit/cumplir7

      total 44
      dr-xr-xr-x. 25 root root 4096 Jun 28 2017 17Sep07a
      dr-xr-xr-x. 20 root root 4096 Jan 1 2019 19Jan05
      dr-xr-xr-x. 20 root root 4096 Sep 4 18:13 19Sep03
      dr-xr-xr-x. 20 root root 4096 Jan 7 2019 19Sep04
      dr-xr-xr-x. 20 root root 4096 Sep 5 02:08 19Sep06
      dr-xr-xr-x. 20 root root 4096 Sep 5 02:08 19Sep12
      -rw-r--r--. 1 root root 0 Sep 12 21:15 BMS_BACKUP_COMPLETE
      drwxr-xr-x. 2 root root 4096 Sep 12 21:15 current
      lrwxrwxrwx. 1 root root 7 Sep 12 21:15 last -> 19Sep12
      drwx------. 2 root root 16384 Sep 5 2017 lost+found
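      (A minimal sketch of that scheme; the paths and date format are illustrative:)

      TODAY=$(date +%y%b%d)   # e.g. 19Sep13, matching the directory names above
      rsync -a --link-dest=/mnt/unilit/cumplir7/last \
            /source/ /mnt/unilit/cumplir7/$TODAY/
      ln -sfn $TODAY /mnt/unilit/cumplir7/last   # point "last" at the new backup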

  8. Raghav Rao

    Oh wow, this whole time I’ve been confusing the --update option with what the --inplace option does. Thanks!

  9. svsv sarma

    Now I understand rsync. A good practical example of rsync is Timeshift.

  10. RG

    Maybe you’re interested in backintime; it’s a GUI frontend that uses rsync as the backend to automate backups to various targets.

  11. Davide

    Just one important note about one option that you suggested:

    --append
    This causes rsync to update a file by appending data onto the end of the file …

    The use of --append can be dangerous if you aren’t 100% sure that the files that are longer have only grown by the appending of data onto the end

    So if you use --append and then modify the source file by changing content in the middle of it (not only appending content), you will get a different file on the destination! You can check it with md5sum.

    This article is very interesting, and this is a big misunderstanding! Thank you.
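    (A small experiment that illustrates the point; the host and paths are placeholders:)

    seq 1 1000 > data.txt
    rsync data.txt user@host:/tmp/

    sed -i 's/^500$/XXX/' data.txt   # same-length change in the middle
    echo extra >> data.txt           # grow the file

    # --append only sends the new tail; the change in the middle never arrives.
    rsync --append data.txt user@host:/tmp/
    md5sum data.txt                      # local checksum
    ssh user@host md5sum /tmp/data.txt   # differs from the local one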

  12. evgeniusx

    rsync with inotifywait makes for powerful on-change synchronization.
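    (A hypothetical watch-and-sync loop along those lines, using inotifywait from inotify-tools; the paths and host are placeholders:)

    while inotifywait -r -e modify,create,delete /data; do
        rsync -av --delete /data/ user@host:/backup/data/
    done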

  13. Diego Roberto

    Nice!

  14. Interesting and useful article, thanks!

  15. pamsu

    With faster and faster wifi ac, I’m seeing a lot of stuff on file transfers… or maybe it’s just me… Love this!
