Teens and the Internet…

Given that I have a small version of myself running around my house, I think about how she’ll use the Internet when she gets older. Just the other day she “asked Google something,” which made me realize that, although she’s just barely literate, my kid is going to “be online” for the rest of her life. I’m not really sure what “be online” will mean for her over the years. So without putting too many more “things in quotes,” I wanted to share a few resources I’ve found helpful in framing how I think about my child’s future online.

First is an interview with danah boyd on Triangulation (she prefers lowercase; I presume it’s an artistic thing and just go with it). danah’s book _It’s Complicated: The Social Life of Networked Teens_ dives into the research around how teens use the Internet and what the real risks are online. Despite the title, my big takeaway was that it’s not all that complicated. Teens want to do basically the same things teens have always done. Without providing any spoilers, some of what danah discusses reminds me of this:

http://xkcd.com/1289/

Then today I stumbled on an article over at Boing Boing titled “Everything You Know About the Teenage Brain is Bullshit,” and it’s worth reading. The tl;dr is “There’s no real evidence that online use is eating teens’ brains.”

Will I monitor how my child interacts with the connected digital world? Of course. I will also watch how she behaves at school and at the playground and talk about those things with her. Will I encourage her to put down the iPad and go fishing with me on a sunny Saturday? You betcha. But it’s not because the Internet is eating her brain… It’s because I want her to learn to have balanced interests.

Of course I’ll likely freak out at some point and lose perspective. Because, you know, parenting.

**EDIT**

As a great historical analog, according to this article, in 1859, _Scientific American_ had the following to say about… chess:

Those who are engaged in mental pursuits should avoid a chess-board as they would an adder’s nest, because chess misdirects and exhausts their intellectual energies. Rather let them dance, sing, play ball, perform gymnastics, roam in the woods or by the seashore, than play chess. It is a game which no man who depends on his trade, business or profession can afford to waste time in practicing; it is an amusement — and a very unprofitable one — which the independently wealthy alone can afford time to lose in its pursuit. 

We really do need to be careful what activities we let our children waste their youth on.

 

Installing & Debugging ODBC on Mac OS X


I just spent nearly two full days in a bare-knuckle brawl with my MacBook Pro trying to get it to talk to a corporate MS SQL Server. I had abandoned MSSQL more than a year ago in favor of PostgreSQL because of how much easier it is to work with PostgreSQL from a non-Microsoft stack. At that point I was running R on Linux and, soon after, R on OS X. As part of changing roles at the company I work for, I’ve joined a team where everyone else uses Python, so I’m now trying to play nice with the Python guys. In addition to using Python, I need to talk to corporate servers which happen to run Microsoft SQL Server.

So enough about why I was banging my head against the Mac ODBC wall. Here are some things I didn’t understand until I fought with them for a day. Maybe I can save you some pain.

The diagram above shows the sort of stack I now have working. But when I started, I didn’t even understand how the pieces fit together. Keep in mind that, for better or worse, I’m using Macports, and in some ways I really like letting Macports do the installation of things. Yes, I know Homebrew rocks and yada yada yada, but I’m not currently using Homebrew. I’m using Macports. So the things I describe below are biased toward Macports, but much of it applies to any install method.

It’s pretty obvious that when debugging something that’s not working we should move through each piece and make sure each little chunk works before going on to the next. Of course, in real life we just flail, randomly try things, and cuss a lot until we get all red in the face and take a step back. So let’s start at the beginning:

The install process should look something like the following:

SQL Server: Make sure you can log into SQL Server from a Windows box on the same network you are ultimately trying to connect from. And make sure you log in using the same credentials you plan to use elsewhere.

FreeTDS: This is the driver which sits between the Mac ODBC layer and MS SQL Server. FreeTDS does the talking to MS SQL, so it’s really important, obviously. I ultimately installed FreeTDS using the following Macports command:

sudo port install freetds +mssql +odbc +universal

The +odbc bit installs unixODBC; the +mssql and +universal bits are totally mysterious to me. The command line tool tsql comes with FreeTDS. In the debugging section I’ll comment more on using tsql and isql.

For what it’s worth, the current version as of this writing is freetds @0.92.405_0

unixODBC: Comes along with the FreeTDS install above (that’s the +odbc variant). unixODBC includes the command line tool isql, which is a lot like tsql. More on that below.

pyodbc: (Sept 2013 update: the Macport of pyodbc is now named py27-pyodbc, the way God intended.) I didn’t think there was a Macport for pyodbc. The naming pattern for Python packages in Macports is pythonVersion-packagename. So, for example, Pandas is py27-pandas for the Python 2.7 version. So I tried the following:

sudo port install py27-pyodbc

Which fails to find the package. So I pulled the tarball and spent hours yesterday trying to get pyodbc to compile and work on Mac OS X, including fighting with setup.py and trying to learn about build directories and other black magic things about Python. Then I stumbled on a discussion of using Macports to install pyodbc. It turns out that the Macport of pyodbc is called, confusingly, simply odbc. So the install command is this:

sudo port install py27-odbc

OK, so if I had simply installed pyodbc properly with Macports I probably could have saved myself 4 hours of pain. But I did learn some things along the way. Let me capture a few of those so hopefully future adventurers will be spared some pain.

Debugging:

Problem: At one point I kept getting the following type of error in Python:

pyodbc.Error: ('00000', '[00000] [iODBC][Driver Manager]dlopen({SQL Server}, 6): image not found (0) (SQLDriverConnect)')

Solution: The clue here is the [iODBC] bit. iODBC is the default ODBC driver manager that now comes with Mac OS X, and it’s slightly less desirable than unixODBC. Despite elaborate comments online where folks say, “iODBC works fine for me,” most folks agree on using unixODBC, and many have tried and failed to make iODBC work as expected. I spent a lot of time trying to figure out how to replace iODBC with unixODBC. It turns out the choice between iODBC and unixODBC happens when pyodbc is built and installed. Once I dropped back and installed FreeTDS and pyodbc as outlined above, I moved on to errors like the following:

ProgrammingError: ('42000', "[42000] [unixODBC][FreeTDS][SQL Server]Login failed for user 'sa'. (18456) (SQLDriverConnectW)")

You can see from that error message that I’m getting errors back through unixODBC and FreeTDS. Huge progress!

Problem: How can I tell if FreeTDS is properly installed?

Solution: Use the tsql command line program to connect to your DB. You’ll need to do something like the following:

tsql -S myserver -U username -P mypassword

The errors returned by tsql are fairly uninformative. You’ll likely see simply

There was a problem connecting to the server

or

Error 100 (severity 11):	unrecognized msgno

The good news is that FreeTDS has only one configuration file, and only a couple of things in that file are important. The config file for FreeTDS (if installed by Macports) is /opt/local/etc/freetds/freetds.conf

The two things in freetds.conf which I found REALLY matter are the port number and the TDS version. For my SQL Server 2008 I needed to set the TDS version to 7.2 by editing the [global] section of freetds.conf to look like this:

[global]
     # TDS protocol version
     tds version = 7.2

Then I found that I also had to include the port number for every server I wanted to connect to. This should not be needed, because tsql allows passing the port number with the -p switch. In my experience the -p switch resulted in a failure to connect, but putting the port number for the server in the freetds.conf file works. Note that my SQL Server instance uses the default port number, so I didn’t think I would need this at all. But I do. My entry looks like this:

[MYSERVER]
   host = MYSERVER
   port = 1433

If you are connecting to different vintages of SQL Server, you may need to override the global TDS version with a different version in the server-specific section. If you want some clues about which TDS version you should be using, start with this chart from FreeTDS. If you get the port settings and the names right, you’ll likely get a > prompt after you use tsql to connect to the DB. If you get a > prompt, you’re in pretty good shape!

Problem: How can I tell if unixODBC is properly installed?

Solution: First, make sure FreeTDS is installed properly. Seriously. Don’t skip that. After you’re able to connect using tsql, we need to configure some text files in order to move forward. unixODBC has three types of files to edit (paths assume installation using Macports):

Global driver configuration:

/opt/local/etc/odbcinst.ini

Global DSN configuration file:

/opt/local/etc/odbc.ini

Local DSN configuration file:

~/.odbc.ini

I elected not to bother creating the local DSN file at all and to do everything from the two global files. What can I say? I’m global; that’s just how I roll.

The global driver config file (/opt/local/etc/odbcinst.ini) needs to contain a link to the FreeTDS driver. Mine looks like this:

[FreeTDS]
Description=FreeTDS Driver for Linux & MSSQL on Win32
Driver=/opt/local/lib/libtdsodbc.so
Setup=/opt/local/lib/libtdsodbc.so
UsageCount=1

From what I can tell, Macports does NOT create this file and it has to be manually created.

The DSN configuration file does not strictly have to be created. It’s possible to connect without DSN entries. But for the sake of testing, I highly recommend setting up at least one server DSN. You can find the DSN format in a lot of places online. Mine looks like this:

[MYSERVER]
Description         = Test to SQLServer
Driver              = FreeTDS
Trace               = Yes
TraceFile           = /tmp/sql.log
Database            = TechnicalProvisions
Servername          = MYSERVER
UserName            = myusername
Password            = mypassword
Port                = 1433
Protocol            = 7.2
ReadOnly            = No
RowVersioning       = No
ShowSystemTables    = No
ShowOidColumn       = No
FakeOidIndex        = No

After you set up odbcinst.ini and odbc.ini you are ready to test unixODBC using isql. To connect, do your version of the following at the command prompt:

isql MYSERVER myusername mypassword

YOU MUST INCLUDE YOUR USERNAME AND PASSWORD in the command line. Yes, even though the username and password are in the DSN entry, they must be included in the command line. I lost the better part of an hour trying to fix a “broken” connection that wasn’t actually broken, only because I didn’t include my username and password. The only error isql returns when the username and password are missing is:

[ISQL]ERROR: Could not SQLConnect

Yeah, that’s part of why this crap is hard.

Now if you’ve got isql connecting, you’re ready to move on to testing pyodbc in Python. If you’ve installed pyodbc as outlined above and then set up your ODBC files and FreeTDS and tested each one, this should be a snap! I use Pandas for its DataFrame structure. And after getting each of the previous bits working, I was able to do this:

import pyodbc
import pandas
import pandas.io.sql as psql

# connect via the DSN defined in odbc.ini; the UID and PWD are still required here
cnxn = pyodbc.connect('DSN=MYSERVER;UID=myusername;PWD=mypassword')
cursor = cnxn.cursor()

# pull the test table straight into a pandas DataFrame
sql = "SELECT * FROM dbo.pandasTest"
df = psql.frame_query(sql, cnxn)

And it worked! But it’s worth noting that, just like on the command line, if I failed to include my username and password, I would get a failed connection. In my case it looked like this:

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
in ()
3 import pandas.io.sql as psql
4
----> 5 cnxn = pyodbc.connect('DSN=MYSERVER' )
6 cursor = cnxn.cursor()
7 sql = ("SELECT * FROM dbo.pandasTest")
Error: ('08001', '[08001] [unixODBC][FreeTDS][SQL Server]Unable to connect to data source (0) (SQLDriverConnectW)')

Which, not unlike the isql error, is not particularly helpful.

Good luck. And may the ODBC be with you!

Solving easy problems the hard way

There’s a charming little brain teaser that’s going around the Interwebs. It’s got various forms, but they all look something like this:

This problem can be solved by pre-school children in 5-10 minutes, by programer – in 1 hour, by people with higher education … well, check it yourself! :)

8809=6
7111=0
2172=0
6666=4
1111=0
3213=0
7662=2
9313=1
0000=4
2222=0
3333=0
5555=0
8193=3
8096=5
7777=0
9999=4
7756=1
6855=3
9881=5
5531=0
2581=?

SPOILER ALERT…

The answer has to do with how many circles are in each digit. The number 8 has two circles in its shape, so it counts as two. And 0 is one big circle, so it counts as one. So 2581=2. OK, that’s cute; it’s an alternative mapping of values with implied addition.
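
If you want to check that shape-based answer with a quick bit of R, the circle counts per digit can be hard coded, which is exactly the sort of by-hand mapping I’d rather not rely on:

## circles per digit, 0 through 9: 0 has one, 6 has one, 8 has two, 9 has one
circles <- c(1, 0, 0, 0, 0, 0, 1, 0, 2, 1)

## 2581 -> 0 + 0 + 2 + 0 = 2
sum(circles[c(2, 5, 8, 1) + 1])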

What bugged me was how I might solve this if the mapping of values were not based on shape. How could I program a computer to solve this puzzle? I gave it a little thought, and since I like to pretend I’m an econometrician, this looked a LOT like a series of equations that could be solved with an OLS regression. So how can I refactor the problem and data into a trivial OLS? I really need to convert each row of the training data into a frequency-of-occurrence table. So instead of 8809=6 I need to refactor that into something like:

1,0,0,0,0,0,0,0,2,1 = 6

In this format the independent variables are the digits 0-9, and their values are the number of times each digit occurs in a row of the training data. I couldn’t figure out how to build the frequency table so, as is my custom, I created a concise simplification of the problem and put it on StackOverflow.com, which yielded a great solution. Once I had the frequency table built, it was simply a matter of a linear regression with 10 independent variables, a dependent variable, and no intercept term.

My whole script, which you should be able to cut and paste into R if you are so inclined, is the following:

## read in the training data
## more lines than it should be because of the https requirement in Github
temporaryFile <- tempfile()
download.file("https://raw.github.com/gist/2061284/44a4dc9b304249e7ab3add86bc245b6be64d2cdd/problem.csv",destfile=temporaryFile, method="curl")
series <- read.csv(temporaryFile)

## munge the data to create a frequency table
freqTable <- as.data.frame( t(apply(series[,1:4], 1, function(X) table(c(X, 0:9))-1)) )
names(freqTable) <- c("zero","one","two","three","four","five","six","seven","eight","nine")
freqTable$dep <- series[,5]

## now a simple OLS regression with no intercept
myModel <- lm(dep ~ 0 + zero + one + two + three + four + five + six + seven + eight + nine, data=freqTable)
round(myModel$coefficients)

The final result looks like this:

> round(myModel$coefficients)
zero   one   two three  four  five   six seven eight  nine
   1     0     0     0    NA     0     1     0     2     1

So we can see that zero, six, and nine all get mapped to 1 and eight gets mapped to 2. Everything else is zero. And four is NA because there were no fours in the training data.
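
And just to close the loop, here’s a quick way to score the final row, 2581, with the fitted model. This bit isn’t in the original script and assumes myModel from above is still in your workspace:

## frequency-of-digits row for 2581
new2581 <- table(factor(c(2, 5, 8, 1), levels = 0:9))

## 'four' is NA in the fit, but 2581 contains no 4s, so drop it from the sum
round(sum(coef(myModel) * as.numeric(new2581), na.rm = TRUE))
## [1] 2
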
There. I’m as smart as a preschooler. And I have code to prove it.

Fitting Distribution X to Data From Distribution Y

I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well it would fit. I’m not a “closed form” kinda guy; I’m more of a “numerical simulation” type of fellow. So I whipped up a little R code to illustrate the process, then we changed the parameters of the gamma distribution to see how that impacted the fit. An exercise like this is what I call building a “toy model,” and I think it’s invaluable as a method for building intuition and a visceral understanding of data.

Here’s the example code we played with:

set.seed(3)
x <- rgamma(1e5, 2, .2)
plot(density(x))
 
# normalize the gamma so it's between 0 & 1
# .0001 added because having exactly 1 causes fail
xt <- x / ( max( x ) + .0001 )
 
# fit a beta distribution to xt
library( MASS )
fit.beta <- fitdistr( xt, "beta", start = list( shape1=2, shape2=5 ) )
 
x.beta <- rbeta(1e5,fit.beta$estimate[[1]],fit.beta$estimate[[2]])
 
## plot the pdfs on top of each other
plot(density(xt))
lines(density(x.beta), col="red" )
 
## plot the qqplots
qqplot(xt, x.beta)

It’s not illustrated above, but it’s probably useful to transform the simulated data (x.beta) back into pre-normalized space by multiplying by max( x ) + .0001. (I swore I’d never say this, but I lied.) I’ll leave that as an exercise for the reader.
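
If you’d rather not do the exercise yourself, a minimal sketch of that step (assuming the objects from the script above are still in your workspace) looks like this:

## map the simulated beta draws back onto the original gamma scale
x.beta.rescaled <- x.beta * ( max( x ) + .0001 )

## sanity check: overlay the rescaled draws against the original gamma sample
plot(density(x))
lines(density(x.beta.rescaled), col="red")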

Another very useful tool in building a mental road map of distributions is the graphical chart of distribution relationships that John Cook introduced me to.

Shell scripting EC2 for fun and profit

Lately I’ve been doing some work creating ad hoc clusters of EC2 machines. My ultimate goal is to create a simple way to spin up a cluster of EC2 machines for use with Bryan Lewis’s very cool doRedis backend for the R foreach package. But that’s a whole other post. What I was scratching my head about today was that I’d really just like to, with a single command, spin up an EC2 instance, wait for it to come up, and then ssh into it. I go through this iteration about 20 times a day when I’m testing things, so it seemed to make sense to shell script it.

To do this, you need the EC2 command line tools installed on your workstation. In Ubuntu that’s as easy as `sudo apt-get install ec2-api-tools`.

So here’s a short shell script to spin up an instance, wait 30 seconds, then connect:

If you’re reading this through an RSS reader, you can see the script over at github.

Obviously you’ll need to change the parameters at the top of the script to suit your needs. But since this was a bit of a pain in the donkey hole for me to figure out, I thought I would share.

If you want to help out, I’d love you to enlighten me on how to have the script figure out if an instance has finished booting so I could eliminate the sleep step.

The best interview question I’ve ever been asked

In 2005 I was interviewing for a job as Risk Manager with Genworth Financial. I was working a gig up in Armonk, NY, so I hopped a car to the GNW office and met with Mark Griffin, at that point the Chief Risk Officer (CRO) for GNW. After some small talk, Mark asked me the single most interesting interview question I’ve ever been asked. I don’t recall the exact wording, but the gist was:

If you could go back and work more on one project from your past, what would it be and why?

This immediately struck me as a good question. Like all really good interview questions, there is no right answer, but any answer tells a LOT about the person giving it. I talked about a few projects I had really enjoyed from my past, like a fuel hedging dashboard for an international airline and data mining government program data, but said that the one thing I wished I could work more on was reinsurance ceding strategies for insurance companies. Naturally he responded, “Why so?” So I explained the challenge and how I felt that, with a little more time and a little more data, I could numerically optimize reinsurance strategies. When I last worked on the problem it was 2001; now, four years later, computing power was better and I thought I could really get it right.

I’m pretty sure I didn’t explain it very well. Mark was obviously fishing around to see if I got a little OCD about analytical challenges and if I loved digging. I thought about Mark’s question a lot three years later when I left Genworth to go work in reinsurance, optimizing reinsurance strategies.

Details of two-way sync between two Ubuntu machines

In a previous post I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two-way file syncing. This is the detail of how I set up two machines, both running Ubuntu 10.10, to perform two-way sync, where a file change on either machine results in that change being replicated on the other.

I initially tried running Unison on BOTH my laptop and the server, with the server’s Unison set to sync with my laptop back through an SSH reverse proxy. After testing this for a while I discovered this is totally the wrong way to do it. The problem is that the Unison process makes temp directories and files in the file system of the target. So my Unison job on the laptop would be trying to sync files and, in the process, create temp files, which would kick off a Unison sync on the server, which would make temp files on the laptop… I think you can see how convoluted this gets.

So a much better solution is to only run Unison from one machine (I chose my laptop) and have the other machine (server in my case) send an SSH command (over the aforementioned reverse proxy) to the laptop asking the laptop to kick off a Unison sync. This way all of the syncs happen from the laptop.

So, in short, both machines run lsyncd, which monitors files for changes. I keep up an SSH tunnel with reverse port forwarding, which forwards a port on the remote machine back to my laptop’s port 22 (SSH). Unison needs to be installed ONLY on my laptop. When a change happens on my laptop, lsyncd fires off a Unison sync from my laptop that syncs it with the server. When a file changes on the server, the lsyncd job on the server connects to my laptop via ssh and fires off a Unison sync between my laptop and the server.

Here are examples of my lsyncd config scripts:

Laptop:

settings = {
    logfile = "/home/jal/lsyncd/laptop/lsyncd.log",
    statusFile = "/home/jal/lsyncd/laptop/lsyncd.status",
    maxDelays = 15,
    -- nodaemon = true,
}

runUnison2 = {
    maxProcesses = 1,
    delay = 15,
    onAttrib = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onCreate = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onDelete = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onModify = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onMove = "/usr/bin/unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
}

sync{runUnison2, source="/home/jal/Documents"}

Server:

settings = {
    logfile = "/home/jal/lsyncd/server/lsyncd.log",
    statusFile = "/home/jal/lsyncd/server/lsyncd.status",
    maxDelays = 15,
    -- nodaemon = true,
}

runUnison2 = {
    maxProcesses = 1,
    delay = 15,
    onAttrib = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onCreate = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onDelete = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onModify = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
    onMove = "ssh localhost -p 5432 unison -batch /home/jal/Documents ssh://12.34.56.78//home/jal/Documents",
}

sync{runUnison2, source="/home/jal/Documents"}

Keep in mind that I am using version 2 of lsyncd which can be downloaded here: http://code.google.com/p/lsyncd/

The version of lsyncd available in the Ubuntu repo is version 1.x, which does not use the same config format I illustrate above. However, if you run into dependency issues with v2, the easiest thing to do is install the repo version (which will pull in the dependencies) and then manually download and install v2 from the URL above.

My reverse port forwarding setup looks like this:

autossh -2 -4 -X -R 5432:localhost:22 12.34.56.78

The -R bit forwards remote port 5432 to my laptop’s port 22, which is SSH. So on my server, if I run ssh localhost -p 5432, what actually happens is that I ssh from the remote machine to my laptop.

Notes:

  • The IP address of my server in this example is 12.34.56.78.
  • Don’t try to sync the directories where the lsyncd logs are kept. That will result in an endless sync cycle as each machine keeps noticing changes. Don’t ask me how I know this.
  • The command to start the sync on the laptop is “lsyncd /home/jal/lsyncd/laptop/configfile” where configfile is the above lsyncd configuration file.
  • lsyncd could, conceivably, tell Unison to sync only the part of the directory tree that changed. I have not been able to make that feature work right, however. And it only takes Unison a few seconds to sync, so I’ve not worried about it.

This has greatly sped up my RStudio-based workflow when doing analysis with R. Now when I change files on my server using RStudio, they are immediately (well, it waits 15 seconds) replicated to my local machine, and vice versa!

Good luck and if you have any suggestions please post a comment!

Fast Two Way Sync in Ubuntu!

I love the portability of a laptop. I have a 45-minute train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long-running analytics on my laptop when I’m in the office because it bogs down my laptop and all those videos on The Superficial get all jerky and stuff.

I get around this conundrum by running much of my analytics on either my work server or on an EC2 machine (I’m going to call these collectively “my servers” for the rest of this post). The nagging problem with this has been keeping files in sync. RStudio Server has been a great help to my workflow because it lets me edit files in my browser and run them on my servers. But when a long-running R job blows out files, I want those IMMEDIATELY synced with my laptop. That way I know that when I undock my laptop to run to the train station, all my files will be there for me to spill Old Style beer on as I ride the Metra North line.

I experimented with Dropbox and I gotta say, it’s great. It really is well engineered, fast, and drop-dead simple. I love that with Dropbox I could pull up almost any file in my Dropbox on my iPad or iPhone. That’s a very handy feature. And it’s fast: if I created a small text file on my server, it would be synced with my laptop in a few seconds. Perfect! Well… almost. Dropbox has a huge limitation: encryption. Dropbox encrypts for transmission and may even store files encrypted on their end. However, Dropbox controls the key. So if a rogue employee, a crafty Russian hacker, or a law enforcement officer with a subpoena gained access to Dropbox, they could get access to my files without my knowledge. As a risk manager I can’t help but see Dropbox’s security as a huge, targeted single point of failure. It’s hard to say which would be a bigger payday: cracking GMail or cracking Dropbox. But I’m suspicious it’s Dropbox. There are some workarounds to try and shoehorn file encryption into Dropbox, and they all suck.

So Dropbox can’t really give me what I want (what I really really want). But I stumbled onto Spideroak, who are like the smarter but lesser-known cousins of Dropbox. Their software does everything Dropbox does (including tracking all revisions!), but they have a “trust no one” model which encrypts all files before they leave my computer using, and this is critical, MY key, which they don’t store. Pretty cool, eh? Spideroak also has an iPad/iPhone app and offers a neat feature that lets me email any file in my Spideroak “bucket” to anyone from my iPhone without having to download the file to my iPhone first. They do this by sending the email recipient a special link that allows them to open only the file you wanted them to have. This could be a huge bacon saver on the road.

So Spideroak’s the panacea then? Well… um… no. They have two critical flaws: 1) They depend on file time stamps to determine the most recent file. 2) Syncs are slow, sometimes taking more than 5 minutes for very small files. The time stamp issue is an engineering failure, plain and simple. I’ve talked to their tech support and been assured that they are going to change this and index using server time, not local system time, in the future. But as of April 6, 2011, Spideroak uses local system time. For most users this is no big deal; for my use case it’s painful. My server and my laptop were 6 seconds apart, and that time difference was enough to get Spideroak confused about which files were the freshest. This is a big deal when syncing two file systems with fast-changing files. The other issue, slow sync, was actually more painful, but it’s probably the result of their attempt to be nice with CPU time, plus the encryption overhead. When jobs on my server finished, I expected those files to start syncing within seconds, with bandwidth as the only constraint. With Spideroak, a sync might take 5 minutes to start, then go out for coffee, come back jittery, and finally complete. Even if Spideroak fixed the time stamp issue (or I forced my laptop to set its time based on my server), it still would not work for my sync because of the huge lags.

So looking at Dropbox and Spideroak, I realized that I liked everything about Spideroak except its sync. It’s a great cloud backup tool that seems to do encryption properly, it’s multiplatform (Windows, Linux, Mac), it has an iPad/iPhone app for viewing/sending files, and it’s smart about backups and won’t upload the same file twice (even if the file is on two different computers). For my business use, I just can’t use Dropbox: the lack of “trust no one” encryption is a deal killer. So what I really need is a sync solution to use alongside Spideroak.

There are some neat projects out there for sync. Projects like SparkleShare look really promising, but they are trying to do all sorts of things, not just sync. I’ve already settled on letting Spideroak do backup and version tracking, so I don’t really need all those features… OK, OK, I can hear you muttering, “just use rsync and be done with it already.” Yeah, that’s a good idea. But rsync is unidirectional; it does a lot of things well, but it can also be a bit of an asshole if you don’t set all the flags right and rub its belly the right way. If you Google “bidirectional sync” you’re going to see this problem has plagued a lot of folks. This blog post has already gone on long enough, so I’ll cut to the chase. Here’s the stack of tools I settled on for cobbling together my own secure, real-time, bidirectional sync between two Ubuntu boxes (one of which changes IP address and is often behind a NAT router):

1) Unison – Fast sync using rsync-esque algos and really fast caching/scanning

2) lsyncd – Live (real-time) sync daemon

3) autossh – ssh client with a nifty wrapper that keeps the connection alive and respawns the connection if dropped

I’ll do another post with the nitty-gritty of how I set this up, but the short version is that I installed Unison and lsyncd on both the laptop and the server. Single-direction sync from my laptop to the server is pretty straightforward: lsyncd watches files, and if one changes it calls Unison, which syncs the files with the server. The tricky bit was getting my server to sync with my laptop, which is often behind a NAT router. The solution was to open an ssh connection from my laptop to my server using autossh and reverse-port-forward port 5555 from the server back to my laptop’s port 22. That way an lsyncd process on the server can monitor the file system and, when it sees a change, kick off a Unison job that syncs the server to ssh://localhost:5555//some/path, which is forwarded to my laptop! autossh makes sure that connection does not get dropped, and respawns it if it does. So with a little shell scripting to start the lsyncd daemon on both machines, some lsyncd config, and a local shell script to fire off the autossh connection, I’ve got real-time bidirectional sync!

In a follow-up post I’ll put up the details of this configuration. Stay tuned. (EDIT: Update posted!)

If you’ve solved sync a different way and you like your solution, please comment. I haven’t decided that this is my long-term solution; it’s just a solution that works, which is more than I had yesterday.

Where the heck has JD been?

It’s been pointed out to me that I haven’t had any blog posts in a while. It’s true. I’m fairly slack. But in the last few months I’ve changed jobs (same firm, new role), written an R abstraction on top of Hadoop, been to China, and managed to stay married. While that sounds pretty awesome, I’m nothing compared to Hideaki Akaiwa.

And you may have heard that the _R Cookbook_ by Chicago’s own Paul Teetor has been printed! Way to go, Paul! And for a limited time you can get the book 50% off direct from O’Reilly.

And let it be known: I double dog dare you to find a stats or programming book with better back cover quotes:

Controlling Amazon Web Services using rJava and the AWS Java SDK

I’ve been messing around with Amazon Web Services for a while. I’ve had some projects where I wanted to upload files to S3 or fire off EMR jobs, and I’ve been controlling AWS services using a hodgepodge of command line tools and the R system() function to call those tools from the command line. This has some real disadvantages, however. Using the command line tools means each tool has to be configured individually, which is painful on a new machine. It’s also much harder to roll my R code up into a CRAN package, because I have to check dependencies on the command line tools and ensure that the user has properly configured each one. Clearly a pain in the ass.

So I was looking for a simpler, more elegant solution. After thinking the Boto library for Python might be helpful, I realized that the easiest way to use it would be through rJython, which meant having to interact with R, Python, AND Java. Considering I don’t program in Python or Java, that seemed like a fair bit of complexity. Then I realized that the canonical implementation of the AWS API is the AWS Java SDK, and the rJava package makes interacting with Java from R a viable option.

Since I’ve never written a single line of Java code in my pathetic life, this was somewhat harder than it could have been. But with some help from Romain Francois I was able to cobble together “something that works.” The code below gives a simple example of interfacing with S3. The example looks to see if a given bucket exists on S3 and, if not, creates it. Then it uploads a single file from your PC into that bucket. You will have to download the SDK, unzip it in the location of your choice, and then change the script to reflect your configuration.

If you are running R in Ubuntu, you should install rJava using apt-get instead of using install.packages() from inside of R:

sudo apt-get install r-cran-rjava

Here’s the codez. And a direct link for you guys reading this through an RSS reader:
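
For those reading in a format where the embedded code doesn’t show up, here’s a rough sketch of the same idea. Treat it as a sketch rather than my exact script: the SDK path, credentials, bucket name, and file name are all placeholders, and you’ll need to point the classpath at wherever you unzipped the SDK.

library(rJava)

## placeholder: wherever you unzipped the AWS Java SDK
sdkHome <- "/path/to/aws-java-sdk"

## start the JVM and put the SDK jar plus its bundled third-party jars on the classpath
.jinit()
.jaddClassPath(list.files(file.path(sdkHome, "lib"),
                          pattern = "\\.jar$", full.names = TRUE))
.jaddClassPath(list.files(file.path(sdkHome, "third-party"),
                          pattern = "\\.jar$", full.names = TRUE, recursive = TRUE))

## placeholder credentials -- swap in your own AWS access key and secret key
creds <- .jnew("com/amazonaws/auth/BasicAWSCredentials", "MY_ACCESS_KEY", "MY_SECRET_KEY")
s3    <- .jnew("com/amazonaws/services/s3/AmazonS3Client", creds)

bucketName <- "my-example-bucket"     # placeholder bucket name
localFile  <- "~/path/to/myfile.csv"  # placeholder file to upload

## create the bucket only if it doesn't already exist
if (!s3$doesBucketExist(bucketName)) {
  s3$createBucket(bucketName)
}

## upload the file, keyed by its file name
s3$putObject(bucketName, basename(localFile),
             .jnew("java/io/File", path.expand(localFile)))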

I realize that Duncan Temple Lang has created the RAmazonS3 package which can easily do what the above code sample does. The advantage of using rJava and the AWS Java SDK is the ability to apply the same approach to ALL the AWS services. And since Amazon maintains the SDK this guarantees that future AWS services and features will be supported as well.