My name is Philipp C. Heckel and I write about nerdy things.
This site moved here from blog.philippheckel.com/blog.heckel.xyz!

Minimizing remote storage usage and synchronization time using deduplication and multichunking: Syncany as an example


Cloud Computing, Distributed Systems, Security, Synchronization

Minimizing remote storage usage and synchronization time using deduplication and multichunking: Syncany as an example


Contents


1. Introduction
2. Related Work
3. Deduplication
4. Syncany
5. Implications of the Architecture
6. Experiments
7. Future Research
8. Conclusion
A. List of Configurations
B. Pre-Study Folder Statistics
C. List of Variables Recorded
D. Best Algorithms by Deduplication Ratio
E. Best Algorithms by Duration
F. Best Algorithms by CPU Usage
Bibliography

Download as PDF: This article is a web version of my Master’s thesis. Feel free to download the original PDF version.


This thesis describes a synchronization software as well as a deduplication based algorithm. It uses techniques known in various disciplines and systems. This chapter introduces related research and projects from the areas of synchronization, version control and distributed file systems.

2.1. Synchronization Software

Synchronization software attempts to efficiently solve the remote file synchronization problem [44,46,47,49], i.e. the problem of keeping several replicas of a file or folder up-to-date throughout different locations. Since the problem occurs frequently in many different settings and applications, it is the goal of several algorithms and projects to solve it in an efficient manner with regard to their particular application. While the goal is the same (or at least very similar) in all cases, the solutions differ fundamentally due to the application’s requirements.

Depending on the main purpose of the software, algorithms either focus on constantly updating other replicas or on populating updates only periodically. Since in many situations hourly or even daily updates are sufficient, most synchronization and backup software follow the latter paradigm [44,50]. Another way of differentiating existing algorithms is to divide them into single-round and multi-round protocols. While multi-round algorithms often use recursive divide-and-conquer mechanisms on a hash-basis to detect changes in remote files, single-round protocols mostly use non-recursive fixed-size or variable-size chunking mechanisms to compare file contents [48,51].

A very widely used synchronization software is the open source tool rsync [44]. It is based on the rsync algorithm which uses a fixed-size single round synchronization protocol. rsync creates file and block lists of both local and remote file system, compares them and transfers the changed blocks in a single batch per direction. While rsync efficiently groups changed blocks and compresses the stream to additionally accelerate the transfer, it requires both sides to actively gather file lists and generate checksums for each round it is run. This characteristic makes it unsuitable for frequent small changes on a big file base. Since it requires a server with rsync installed, it cannot be used with arbitrary storage.

A few research papers have attempted to improve the original single-round rsync design in terms of bandwidth savings for specific applications or even suggested new algorithms. Irmak et al. [48] use several techniques to optimize rsync’s single-round algorithm. In their approach they simulate a complete multi-round algorithm by using erasure codes for the computed hashes in each round. Other researchers have proposed multi-round algorithms [72,47,74] that use the advantages of the divide-and-conquer paradigm when comparing remote and local files. While these algorithms can perform better in terms of bandwidth [51], this advantage might be lost in wide area networks due to higher latencies [73].

Another well-known file synchronizer is Unison [50,56,57]. Similarly to rsync, it generates file lists of both sides, compares and then reconciles them. While rsync only synchronizes in one direction, Unison has a bi-directional synchronization algorithm. It divides the process of synchronization in the two phases update detection and reconciliation. In the first phase, it detects updates on both sides based on modification time and inode numbers. It marks files dirty if either one of these properties has changed with regard to the last synchronization run. In the second phase, it applies the updates based on a well-defined recursive multi-round synchronization algorithm.

Similar to rsync, Unison relies on a rolling checksum algorithm to detect the parts of a file that have changed. It only works with two replicas and strongly relies on the Unison software to be present on both sides. It hence shares rsync’s drawbacks regarding frequent updates of small files. However, since updates are detected using metadata rather than checksums, the Unison update detection phase is typically much shorter.

Traditional backup and synchronization software such as Unison and rsync concentrate on providing on-demand file synchronization. They are mostly used for synchronizing two replicas periodically or to backup files and folders overnight. For cases in which more frequent (near-live/real-time) updates are desired, they are mostly not suitable due to the significant overhead required for the generation of file lists.

For this use case, synchronizers such as iFolder (was at ifolder.com, site now defunct, July 2019), Dropbox or Jungle Disk offer a solution. Instead of only analyzing the files on-demand, they register watch listeners in the operating system and react whenever a file is changed locally. Since every file update is registered by the software, generating file lists is superfluous. Moreover, it allows for the automatic recording of file versions by the software.

Due to its simplicity, especially Dropbox has become very popular. It offers live synchronization or continuous reconciliation [57] on different operating systems and integrates in the native file manager. While its core functionality is synchronization, it additionally provides basic revision control features and simple Web hosting functionalities. Dropbox uses an rsync-like chunking algorithm and only transfers binary diffs of updated files to save bandwidth and remote storage. On the server side, Dropbox uses deduplication among all Dropbox users and stores encrypted file chunks on Amazon S3.1

2.2. Version Control Systems

Similar to regular synchronization software, Version Control Systems (VCS) are designed to coordinate and synchronize files among different locations. However, while file synchronization and version control are closely related in terms of functionality, they are often used for entirely different purposes. The aim of synchronization algorithms is to identify the differences among two or more replicas in order to keep them up-to-date. Version control, however, does not only seek to synchronize the end state of two replicas, but the entire version history of each individual file. While VCS are best suitable for text files (e.g. source code), synchronization software is designed for arbitrary data. Even though VCS can handle binary files, their main focus lies on source control [3].

Unlike synchronization software, VCS keep track of changes made on a file, allow branching/forking file sets and can merge different branches into one. They keep track of updates using revision numbers and allow restoring older versions if necessary.

Widely used versioning software include the Concurrent Versions System (CVS), Apache Subversion (SVN) as well as the newer distributed revision control systems Git or Bazaar.

While their specific architecture differs in file formats and network protocols, they all rely on similar concepts. To store the contents of a revision, most of them record incremental changes of files and compare them to a snapshot of the same file. When a file is updated or restored, these changes are then sequentially applied to the snapshot [2,52]. While this is beneficial if the local and the remote file are only a few updates apart, it has a negative impact on performance and bandwidth if many updates have to be applied. Some revision control systems such as Mercurial counteract this behavior by creating more full snapshots so that the reconstruction mechanism can use closer snapshots and has to apply fewer deltas. Even though this benefits the overall update speed and improves the bandwidth, it adds significant storage overhead to the repository. Using this technique, the repository does not only save the changes of files, but also full copies of selected versions as snapshots.

While most versioning systems support multiple types of network protocols and sometimes even allow the use of arbitrary protocols, they are typically used with the respective standard protocol. Commonly supported are SSH/SFTP or HTTP/WebDAV.

To the best of the author’s knowledge, none of the widely used revision control systems support client-side encryption.

2.3. Distributed File Systems

Distributed file systems (DFS) such as the Network File System or the Andrew File System aim to transparently deliver storage services to clients. While they internally have to deal with similar problems as version control systems and synchronization software, their front-end is invisible to end-users. Distributed file systems are mounted on the operating system level (or as user file systems) and behave like local file systems. They are usually not visible to users and whenever a file is saved, changes are immediately applied to the attached replicas. Among others, they differ from VCS and synchronization software in the following aspects [50,56,57]:

The main purpose of distributed file systems is to be able to share files among different workstations and to extend the local disk space with remote storage. Unlike version control systems and file synchronizers, DFS do not keep a working copy on the local machine, but retrieve the file when it is opened by the user.

While most VCS and file synchronizers need an explicit user-originated commit command, DFS are usually based on implicit commits. That is they either block until the reconciliation operation is complete or they cache the files locally and publish the changes to the connected replicas at a later point in time. While nearly all distributed file systems today have a local cache, many operations are still synchronous and can negatively affect user experience [58].

A largely popular distributed file system is the Network File System (NFS). Originally designed in 1989 by Sun Microsystems [75], it is now the de-facto standard of the network file systems on Unix/Linux platforms. NFS exports parts of the server’s file system to its clients and interacts via lightweight Remote Procedure Calls (RPC). NFS clients do not have any permanent local caches and store metadata (such as file permissions) only for a very short time. In versions 2 and 3, the NFS protocol is entirely stateless and based around synchronous procedures, i.e. the server does not need to maintain any protocol state and RPCs block until the operation is complete. Since version 4 [76], NFS partially supports asynchronous calls and has improved caching performance. While NFS also works over wide area networks, its design specification focuses on local area networks with high bandwidth. Even though NFS reads and writes files in blocks, it does not use any bandwidth optimizing techniques and hence scales poorly [60].

The Low Bandwidth File System (LBFS) [36,38], developed at the University of Texas in Austin, addresses this problem by exploiting similarities among files and transferring only the altered file parts. LBFS maintains large caches on the local clients and performs most read and write operations on the local system. To minimize the bandwidth usage, it breaks files into chunks using the Rabin fingerprinting method and only transfers missing chunks to the LBFS server component. The original LBFS design extends the NFSv3 protocol with numerous performance related features [36]: (1) The local cache allows the client to operate mostly offline, (2) remote procedure calls are pipelined asynchronously and (3) traffic is compressed using Gzip. A slim lined version of LBFS [38] suggests further ways to minimize bandwidth consumption using the Pretty Light And Intuitive (PLAIN) fingerprinting algorithm. Rabin and PLAIN are further explained in chapter 3.4.3.

While NFS assumes a high bandwidth in the network, LBFS uses deduplication techniques to optimize bandwidth usage. However, even though LBFS outperforms NFS with small files, its file open latency for bigger files still relies on the network speed. Since LBFS only caches files after they have been opened before, file open operations can cause applications to freeze until the entire file is retrieved [38].

>> Next chapter: Deduplication


  • 1 Since Dropbox is closed-source, the information in this paragraph cannot be verified with the source code. It is gathered from (a) the Dropbox Web site and help pages, (b) trusted Internet resources such as Dropbox employees, as well as (c) strace output of the Dropbox binary on Linux.

1. Introduction
2. Related Work
3. Deduplication
4. Syncany
5. Implications of the Architecture
6. Experiments
7. Future Research
8. Conclusion
A. List of Configurations
B. Pre-Study Folder Statistics
C. List of Variables Recorded
D. Best Algorithms by Deduplication Ratio
E. Best Algorithms by Duration
F. Best Algorithms by CPU Usage
Bibliography

Pages:<12 3 ... 1415>

3 Comments

  1. JP

    Hi,

    I would love to see a ebook version of your thesis (epub or mobi). Would that be possible ?

    thanks



  2. Thiruven Madhavan

    Hi Philipp:
    Good Morning. Possible to receive pdf version of your thesis.
    cheers
    Madhavan


Leave a comment

I'd very much like to hear what you think of this post. Feel free to leave a comment. I usually respond within a day or two, sometimes even faster. I will not share or publish your e-mail address anywhere.