Download as PDF: This article is a web version of my Master’s thesis. Feel free to download the original PDF version.
File synchronization applications such as Dropbox and Jungle Disk have become increasingly popular in the last few years. While they allow easy file sharing among participating users, they strongly depend on a provider and lack confidentiality.
This thesis introduced Syncany, a file synchronizer designed with security and provider independence as a core part of its architecture. Syncany provides the same functionality as other file synchronizers, but encrypts files locally before uploading and supports any kind of storage. The core processes in Syncany are based on deduplication, a technology that drastically reduces the amount of storage required on the remote servers and at the same time enables versioning capabilities.
The key goal of this thesis was to determine the suitability of deduplication for end-user applications. In particular, using Syncany as an example, the goals were to find a deduplication algorithm suitable for the application and to minimize the synchronization time among Syncany clients.
To determine the best algorithm for Syncany, the thesis performed experiments on four different datasets. Experiment 1 focused on reducing the amount of storage required on the remote servers. By varying six different algorithm parameters, it analyzed 144 algorithm configurations in terms of deduplication ratio, CPU usage and duration. The experiment found that some of the parameters strongly influence these measures, while others hardly make a difference.
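To make the deduplication ratio measure concrete, the sketch below computes it by hashing every chunk and counting duplicate chunks only once. The choice of SHA-1 as the chunk identifier and the ratio definition (original size divided by deduplicated size) are illustrative assumptions, not necessarily the exact setup used in the experiments.

```python
# Sketch: compute a deduplication ratio over a list of chunks.
# Assumption: SHA-1 identifies chunks and the ratio is defined as
# original size / deduplicated size (higher is better).
import hashlib

def dedup_ratio(chunks):
    """Return original size divided by the size after deduplication."""
    total = sum(len(c) for c in chunks)
    unique = {}
    for c in chunks:
        # Store each distinct chunk's size only once.
        unique.setdefault(hashlib.sha1(c).hexdigest(), len(c))
    stored = sum(unique.values())
    return total / stored if stored else 1.0
```

For example, three chunks of which two are identical deduplicate to two stored chunks, giving a ratio above 1.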
The most important factor when performing deduplication on a client is that the calculations are very CPU intensive. While the experiments did not find any configuration with an acceptable level of CPU usage, they identified the write pause parameter as the key factor for reducing processor load. More specifically, none of the tested write pause values lowered the CPU usage to the desired level, but the concept behind the parameter proved effective and opens up possibilities for future research.
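The idea behind the write pause can be sketched as a short sleep inserted between chunk-processing steps, trading indexing throughput for lower average CPU usage. The function and parameter names below are hypothetical; only the concept comes from the thesis.

```python
# Sketch: a "write pause" between chunk-processing steps.
# Assumption: process() is the CPU-intensive step (fingerprinting,
# hashing); the pause length and names are illustrative only.
import time

def index_with_write_pause(chunks, process, write_pause_ms=30):
    """Process chunks one by one, pausing between them to cap CPU usage."""
    for chunk in chunks:
        process(chunk)                        # CPU-intensive work
        time.sleep(write_pause_ms / 1000.0)   # yield the CPU to other tasks
```

A longer pause lowers average CPU load but stretches the total indexing time, which is exactly the trade-off the experiments explored.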
In terms of deduplication ratio, the fingerprinting algorithm Adler-32 has a strong impact on the efficiency of the deduplication algorithm. In combination with the variable-size chunking method TTTD, Adler-32 achieved remarkable space savings in the experiments and outperformed other fingerprinters by an order of magnitude. To the best of the author's knowledge, Adler-32 has not previously been used as a fingerprinter in other deduplication systems.
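The following is a minimal sketch of content-defined chunking with a rolling Adler-32 fingerprint, in the spirit of TTTD but without its secondary (backup) divisor: a chunk boundary is declared wherever the fingerprint of a sliding window matches a fixed pattern, with minimum and maximum chunk sizes as safeguards. The window size, divisor and boundary condition are illustrative assumptions, not the thesis's tuned values.

```python
# Sketch: content-defined chunking driven by a rolling Adler-32
# fingerprint. Simplified relative to TTTD (no backup divisor);
# window/divisor/size limits are assumptions for illustration.
ADLER_MOD = 65521  # largest prime below 2**16, as in zlib's Adler-32

def chunk_boundaries(data, window=48, divisor=1024,
                     min_size=512, max_size=4096):
    """Yield exclusive end offsets of content-defined chunks in data."""
    a, b = 1, 0   # Adler-32 components for the current window
    start = 0     # start offset of the current chunk
    for i, byte in enumerate(data):
        if i - start < window:
            # Still filling the window at the start of a chunk.
            a = (a + byte) % ADLER_MOD
            b = (b + a) % ADLER_MOD
            continue
        # Slide the window by one byte (rolling update).
        old = data[i - window]
        a = (a + byte - old) % ADLER_MOD
        b = (b + a - 1 - window * old) % ADLER_MOD
        size = i - start + 1
        fingerprint = (b << 16) | a
        if (size >= min_size and fingerprint % divisor == divisor - 1) \
                or size >= max_size:
            yield i + 1            # boundary found (or max size reached)
            start = i + 1
            a, b = 1, 0            # restart the window for the next chunk
    if start < len(data):
        yield len(data)            # trailing partial chunk
```

Because boundaries depend only on local content, an insertion near the start of a file shifts only nearby boundaries, so most chunks of the modified file still deduplicate against the old version.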
Experiments 2 and 3 focused on reducing the overall synchronization time between Syncany clients. By transferring the chunks and multichunks produced by the deduplication algorithms to the remote storage, the experiments measured the time and bandwidth required by each algorithm configuration. The results show that, independent of the storage type, the multichunk concept is invaluable to the Syncany architecture. By combining chunks into multichunks, both upload and download times decreased significantly on all of the analyzed types of remote storage. While the results indicate a larger reconstruction size when the size of the multichunks is increased, downloading a few large multichunks is still faster than downloading many small chunks with high per-request latency.
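A multichunk can be thought of as several chunks framed into one blob, so that a single large transfer replaces many small, latency-bound requests. The length-prefixed framing below is an assumption chosen for simplicity; it is not Syncany's actual container format.

```python
# Sketch: pack several chunks into one "multichunk" blob and unpack
# them again. Assumption: a simple [4-byte big-endian length][payload]
# framing, purely for illustration.
import io
import struct

def pack_multichunk(chunks):
    """Concatenate chunks into one uploadable blob."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(struct.pack(">I", len(chunk)))  # length prefix
        buf.write(chunk)                          # chunk payload
    return buf.getvalue()

def unpack_multichunk(blob):
    """Recover the individual chunks from a multichunk blob."""
    chunks, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        chunks.append(blob[offset:offset + length])
        offset += length
    return chunks
```

Uploading one such blob amortizes the per-request latency over all contained chunks, which is why larger multichunks sped up both directions of transfer in the experiments.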
In conclusion, all of the experiments show that deduplication is a viable technology for end-user applications if controlled properly and combined with other concepts. Especially in the case of Syncany, in which the storage abstraction prohibits the use of server-side software, deduplication cannot function properly without additional data processing. As demonstrated in the experiments, the multichunk concept fulfills this role by packing chunks into a container format before uploading. Using multichunks, the deduplication ratio can be increased further, and the synchronization time has been shown to drop significantly. The experiments have also demonstrated that without multichunks, the upload and download times for Syncany clients would be infeasible for end-users.
With regard to the overall goal of the thesis, the selected algorithm has proven very effective on all of the datasets. It offers a good trade-off between chunking efficiency, processor usage and duration. At the same time, it reduces the total synchronization time between Syncany clients.
I'd very much like to hear what you think of this post. Feel free to leave a comment. I usually respond within a day or two, sometimes even faster. I will not share or publish your e-mail address anywhere.