Fast Two Way Sync in Ubuntu!

I love the portability of a laptop. I have a 45-minute train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long-running analytics on my laptop when I'm in the office because it bogs down my laptop and all those videos on The Superficial get all jerky and stuff.

I get around this conundrum by running much of my analytics on either my work server or on an EC2 machine (I'm going to call these collectively "my servers" for the rest of this post). The nagging problem with this has been keeping files in sync. RStudio Server has been a great help to my workflow because it lets me edit files in my browser while the code runs on my servers. But when a long-running R job blows out files I want those IMMEDIATELY synced with my laptop. That way I know when I undock my laptop to run to the train station that all my files will be there for me to spill Old Style beer on as I ride the Metra North line.

I experimented with Dropbox and I gotta say, it's great. It really is well engineered, fast, and drop-dead simple. I love that with Dropbox I could pull up most any file from my Dropbox on my iPad or iPhone. That's a very handy feature. And it's fast: if I created a small text file on my server, it would be synced with my laptop in a few seconds. Perfect! Well… almost. Dropbox has a huge limitation: encryption, or more precisely, who holds the keys. Dropbox encrypts files for transmission and may even store them encrypted on their end. However, Dropbox controls the key. So if a rogue employee, a crafty Russian hacker, or a law enforcement officer with a subpoena gained access to Dropbox, they could get access to my files without my knowledge. As a risk manager I can't help but see Dropbox's security as a huge, targeted, single point of failure. It's hard to say which would be a bigger payday: cracking Gmail or cracking Dropbox. But I'm suspicious it's Dropbox. There are some workarounds that try to shoehorn file encryption into Dropbox, and they all suck.

So Dropbox can't really give me what I want (what I really really want). But I stumbled onto Spideroak, who are like the smarter but lesser-known cousins of Dropbox. Their software does everything Dropbox does (including tracking all revisions!) but they have a "trust no one" model which encrypts all files before they leave my computer using, and this is critical, MY key, which they don't store. Pretty cool, eh? Spideroak also has an iPad/iPhone app and offers a neat feature that lets me email any file in my Spideroak "bucket" to anyone from my iPhone without having to download the file to my iPhone first. They do this by sending the recipient a special link that opens only the file you wanted them to have. This could be a huge bacon saver on the road.

So Spideroak's the panacea then? Well… um… no. They have two critical flaws: 1) They depend on file time stamps to determine which copy of a file is most recent. 2) Syncs are slow, sometimes taking more than 5 minutes for very small files. The time stamp issue is an engineering failure, plain and simple. I've talked to their tech support and been assured that in the future they are going to change this and index using server time, not local system time. But as of April 6, 2011, Spideroak uses local system time. For most users this is no big deal. For my use case it's painful. My server and my laptop differed by 6 seconds, and that was enough to get Spideroak confused about which files were freshest. This is a big deal when syncing two file systems with fast-changing files. The other issue, slow sync, was actually more painful, and is probably the result of their attempt to be polite with CPU time, plus the encryption overhead. When jobs on my server finished, I expected those files to start syncing within seconds, with bandwidth the only delay. With Spideroak, syncs might take 5 minutes to start, and then it would go out for coffee, come back jittery, and finally complete. Even if Spideroak fixed the time stamp issue (or I forced my laptop to set its time from my server), it still would not work for my sync because of the huge lags.

So looking at Dropbox and Spideroak I realized that I liked everything about Spideroak except its sync. It's a great cloud backup tool that seems to do encryption properly, it's multi-platform (Windows, Linux, Mac), it has an iPad/iPhone app for viewing/sending files, and it's smart about backups: it won't upload the same file twice (even if the file is on two different computers). For my business use, I just can't use Dropbox. The lack of "trust no one" encryption is a deal killer. So what I really need is a sync solution to use alongside Spideroak.

There are some neat projects out there for sync. Projects like SparkleShare look really promising, but they are trying to do all sorts of things, not just sync. I've already settled on letting Spideroak do backup and version tracking, so I don't really need all those features… OK, OK, I can hear you muttering, "just use rsync and be done with it already." Yeah, that's a good idea. But rsync is single-directional, and while it does a lot of things well, it can also be a bit of an asshole if you don't set all the flags right and rub its belly the right way. If you google "bidirectional sync" you'll see this problem has plagued a lot of folks. This blog post has already gone on long enough, so I'll cut to the chase. Here's the stack of tools I settled on for cobbling together my own secure, real-time, bidirectional sync between two Ubuntu boxes (one of which changes IP address and is often behind a NAT router):

1) Unison – Fast sync using rsync-esque algos and really fast caching/scanning

2) lsyncd – Live (real-time) sync daemon

3) autossh – ssh client with a nifty wrapper that keeps the connection alive and respawns the connection if dropped

I'll do another post with the nitty-gritty of how I set this up, but the short version is that I installed Unison and lsyncd on both the laptop and the server. Single-direction sync from my laptop to the server is pretty straightforward: lsyncd watches files, and when one changes it calls Unison, which syncs the files to the server. The tricky bit was getting my server to sync back to my laptop, which is often behind a NAT router. The solution was to open an ssh connection from my laptop to my server using autossh and reverse-forward port 5555 on the server back to my laptop's port 22. That way an lsyncd process on the server can monitor the file system, and when it sees a change it can kick off a Unison job that syncs the server to ssh://localhost:5555//some/path, which is forwarded to my laptop! Autossh keeps that connection alive and respawns it if it gets dropped. So with a little shell scripting to start the lsyncd daemon on both machines, some lsyncd config, and a local shell script to fire off the autossh connection, I've got real-time bidirectional sync!
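For the curious, here's roughly what the two halves look like on the command line. This is an illustrative sketch, not my exact setup: the user names, paths, and autossh monitor port are made up; 5555 is just the forwarded port described above.

```shell
# On the laptop: keep a persistent reverse tunnel open to the server.
# autossh babysits the connection (via monitor port 20000) and respawns
# it if it drops. Port 5555 on the server now leads back to the laptop's
# sshd on port 22, even when the laptop sits behind a NAT router.
autossh -M 20000 -f -N -R 5555:localhost:22 me@my-server

# On the server: the kind of command lsyncd fires when it sees a change.
# -batch keeps Unison from prompting; localhost:5555 rides the tunnel
# back to the laptop.
unison /home/me/projects ssh://localhost:5555//home/me/projects -batch
```

The lsyncd config on each side just needs to watch the directory and shell out to a unison command like that instead of its default rsync.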

In a follow-up post I'll put up all the details of this configuration. Stay tuned. (EDIT: Update posted!)

If you’ve solved sync a different way and you like your solution, please comment. I’ve not settled that this is my long-term solution. It’s just a solution that works. Which is more than I had yesterday.

22 Comments

  1. Vinh Nguyen says:

    Have you tried [this](http://fak3r.com/geek/howto-build-your-own-open-source-dropbox-clone/)? I haven’t tried it yet…waiting until I start my academic job after grad school.

  2. didi says:

    What about Ubuntu One?

  3. Nice hack! My own sync problems are a little different to yours, but may be relevant:

    I do all my work in a DVCS repository – specifically mercurial. If I want to make the syncing automatic, I just use the autosync extension:
    https://bitbucket.org/obensonne/hg-autosync/wiki/Home

    Since this runs as a client pull process it's not instant, but with short poll intervals it can get very close to that. (Or, it seems, you could use your own file watcher like lsyncd.) That solves all the NAT issues at once, too – although your port forwarding system is ingenious. There is no problem with time stamps, since sync conflicts are marked for resolution and both versions are kept. There is no need to trust a 3rd-party solution – the repository is on my machine in my office. Additionally, it works on OS X, Windows, and Linux, and with an arbitrarily large number of clients (which port forwarding does not scale to).

    The costs are:
    1) it’s a DVCS, designed for code, not data, so for your case of temporary data files, not so great – you don’t need a revision history of your ancient intermediate results, and you certainly don’t want to maintain copies of all previous data there
    2) hg does not perform very well with large binary files, which might give you the same problem you had with Spideroak.

    In any case, it might be of interest to you.

    (I understand there is an equivalent solution for git – https://github.com/commandline/flashbake/wiki – but haven’t tried it myself)
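    For reference, enabling the extension is the usual Mercurial incantation in ~/.hgrc (the clone path here is just an example):

    ```ini
    [extensions]
    autosync = ~/src/hg-autosync/autosync.py
    ```

    Then running `hg autosync` inside the repository starts the commit/pull/push loop.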

  4. John says:

    6 seconds? Why on earth aren’t you/they using NTP? Then the time difference would be closer to 6 milliseconds and the correct wall clock time to boot. There has been no reason to have inaccurate clocks for decades now.

  5. I have an lsyncd + autossh setup to sync files between my server and PC. It works OK, but it gets jammed about once every two weeks or so, and you need to kill the lsyncd process and restart it.

    I’m curious about how/why you are using Unison here, rsync is all you need.

  6. JP Mehers says:

    Have you tried pogoplug? It links your external HDD to the internet creating a private cloud.

  7. JD Long says:

    JP, my one concern with Pogoplug is that its software is totally proprietary. Obviously I'm willing to use proprietary tools like Spideroak and Dropbox, so it's not a deal-breaker. The bigger concern for my use case is that it does not seem to solve my desire for real-time sync between two machines.

  8. JD Long says:

    Hey Dan, thanks for reading and commenting. You're right that rsync is really good. My use of Unison is part fact-based and part emotional. First the emotional: I burned my fingers once using rsync because I didn't fully grok how it handles deletions. Ever since then I've been a little gun-shy with rsync. Another, more fact-based, reason for Unison is that it's cross-platform. I have a Windows server that I'm currently keeping in (slow) sync with Spideroak. If Unison works out for me I might be able to use it to keep that machine in faster sync too. My third challenge with rsync is that I don't 100% grok how it handles conflicting edits. I know that with Unison I'm not going to have data loss, and that's a big deal for me.

  9. JD Long says:

    John, I agree, NTP is the way to go. I don't know why Spideroak made the engineering choices they did. I could have gone the route of setting up my EC2 machine as an NTP server and then syncing all my other machines to it. But since my EC2 machine is not up all the time, that has drawbacks. Another limitation is that EC2 machines can't have their system clocks changed. And even if I could sync the clocks, Spideroak's sync speed would still be a limitation for my use. So I didn't spend any time trying to fix the time issue.

  10. JD Long says:

    Dan, I love the idea of full file system versioning using a DVCS. However, for full file system use it gets slow, for exactly the reasons you list in your costs. The "right" way to do this is in the file system itself, which is how ZFS handles full version history. Eventually that will get built into our OSes, but for now we're left layering hackish solutions on top.

  11. JD Long says:

    Didi, Ubuntu One does not use user-key encryption for stored files. So it's no better than Dropbox in that regard.

  12. JD Long says:

    Vinh, that post on fak3r is what got me interested in lsyncd. I tried using lipsync but had some trouble getting it set up properly. In the process of understanding lipsync I realized that I really wanted to use Unison, not rsync, and I really wanted to handle real-time changes on BOTH machines. lipsync uses one-way real-time sync, and for the other direction it just runs a cron job every 5 minutes. Not the real-time bidirectional solution I was looking for.

  13. Paula says:

    Hi JD,

    I am setting up something very similar, and your solution will work well. I have several laptops (four) that I use at different times for different reasons. One is an Internet file server with a VPN that lets me connect any of my other laptops, and it holds one of my primary stores. Another laptop, which may or may not be with me, has a second primary store. Both primary stores must remain synced, and the second one may or may not be on. Both have VPNs, so if I am not near either of them I can connect to at least one while on a trip. I think your solution will work well for the two primary stores and am bringing up the various pieces. Thanks for the work!

    Paula

  14. Siah says:

    Hear, hear.

    They say an angel dies in heaven every time you have to close a trashy YouTube music video to save a little CPU power for your real programs.

  15. [...] a previous post I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted [...]

  16. Malahal says:

    You could use encryption on your systems; that way everything transmitted to Dropbox is already encrypted. I use encfs for that. I don't use Dropbox myself, but it should work.

    With the following mount idea, the encrypted data lives under ~/Dropbox while you work with the clear files under ~/clear. Performance may be a problem with big files, since Dropbox may need to transmit the whole file when you make a small change to it.

    $ encfs ~/Dropbox ~/clear

  17. [...] uses lsyncd to keep his R files (specifically, R Studio output) in sync with his local machine. Post 1. Post 2. At DSN, we use lsyncd to create a magic folder on our server that pushes R plots generated [...]

  18. Pete says:

    Hi JD,

    I stumbled across your blog and just thought I'd suggest that you use encfs to encrypt your files in a virtual directory on Dropbox; it's the solution I use and I think it's pretty much the best compromise for me.

    It works perfectly across my MacBook and home workstation running Ubuntu; I just have something like the following command run on startup (obviously not my real password):

    echo 8b7fdb3408c2e89708e6ab442366e1e4 | encfs -S /Users/pete/Dropbox/Secure/ /Users/pete/SecureOnlineSync/

    Now you access your files in the directory called ‘SecureOnlineSync’ and they're transparently encrypted on Dropbox. Unlike with a TrueCrypt volume, it only needs to sync the actual file that's updated.

    Cheers,
    Pete

