Today I was working on the Springleaf Marketing Response competition on Kaggle and ran into a problem loading the training data into memory.
The data has almost a thousand features (columns) and about 150k lines, roughly 921 megabytes in all: a little too big to load into pandas raw.
Before I could even start doing any exploratory analysis with the data I needed a way to make some smaller samples that I could quickly load into memory and run some test algorithms and visualizations on.
I decided a good first step in looking at the data would be to randomly sample lines using the shuf command (available on Linux), which shuffles the lines of a file.
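A quick way to see it in action (seq just generates the numbers 1 through 5):
seq 5 | shuf
prints those five numbers, one per line, in a random order.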
Before I could shuffle the file, I needed to remove the first line containing the column names; after shuffling, I would add the column names back on.
There are a lot of ways to remove the head (a reasonable way would be to just use the tail -n +2 command, which outputs everything from the second line onward; see the one-liner below), but for fun I decided to do it using symmetric (set) difference. The symmetric difference is the union minus the intersection: all elements contained in one set or the other but not both.
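For comparison, that straightforward route would look something like
tail -n +2 data.csv | shuf | head -n 1000
with data.csv standing in for the real training file.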
One way to take the symmetric difference of two files is to sort both files together and then filter out the duplicated lines. The sorting step is necessary because uniq only compares adjacent lines. Hence, the symmetric difference of two files can be computed with
sort file1 file2 | uniq -u
(The -u flag tells uniq to display only the unique lines, i.e. those that appear exactly once.)
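To see this on a toy example with two throwaway files (set1.txt and set2.txt are made-up names):
printf 'a\nb\nc\n' > set1.txt
printf 'b\nc\nd\n' > set2.txt
sort set1.txt set2.txt | uniq -u
This prints a and d, the lines that appear in exactly one of the two files.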
Before we can apply this symmetric difference, we still have one more problem: it's inelegant to make a temporary file holding just the first line of the data so that we have two files to compare.
To get around making a temp file, I can use process substitution, the <( ) construct, which runs a command in a subshell and presents its output as if it were a file.
cmd2 <(cmd1 file1) file2
means run cmd2 on two inputs: the output of cmd1 applied to file1, and file2.
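A classic use of process substitution (file1 and file2 being stand-in names) is comparing two files after sorting each, with no temporary files in sight:
diff <(sort file1) <(sort file2)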
So to strip the head off the main file and take a random sample, we use
sort <(head -1 datafile) datafile | uniq -u | shuf | head -n 1000
which translates to “sort the first line of the data file together with the data file itself, then drop every line that appears more than once; since the header now occurs twice, it is filtered out (the symmetric difference). Finally, shuffle what remains and take the first 1000 lines.”
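On a toy file (toy.csv is a made-up example) this is easy to verify:
printf 'id,val\n2,b\n1,a\n' > toy.csv
sort <(head -1 toy.csv) toy.csv | uniq -u
Only the two data rows come out; the header sorts next to its extra copy and uniq -u drops both. (One caveat of this trick: any data rows that happen to be exact duplicates of each other get dropped too.)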
Now I just need to concatenate the column titles with the first thousand lines of the shuffled data. I can do this with another process substitution.
cat <(cmd1 file1; cmd2 file2)
means “run cmd1 on file1, then cmd2 on file2, and concatenate the two outputs into one stream”.
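A toy example of what that looks like:
cat <(echo "letter,num"; printf 'b,2\na,1\n' | sort)
prints the letter,num line first, followed by the two sorted rows, all as a single stream.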
Hence, a command for stripping the first line off a dataset, randomly sampling 1000 lines, and sticking the first line back on top is
cat <(head -1 data.csv; sort <(head -1 data.csv) data.csv | uniq -u | shuf | head -n 1000) > raw_short.csv
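(As an aside, GNU shuf can draw the sample directly with its -n flag, so the shuf | head pair could be collapsed:
cat <(head -1 data.csv; sort <(head -1 data.csv) data.csv | uniq -u | shuf -n 1000) > raw_short.csv
Both versions should behave the same here.)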
To confirm that my command did what I wanted I first checked the number of lines
wc -l raw_short.csv
(this should return 1001 lines, since we sampled 1000 lines and prepended the header).
Next I checked that the IDs (in column 1) were actually shuffled:
cut -f1 -d"," raw_short.csv | head -n 5
This says “take the first field (f) using ‘,’ as the delimiter (d) and display the first five lines”. Since the original IDs were in numeric order, seeing the first few IDs out of order confirms the shuffle.
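For a stricter check, sort -c complains whenever its input is out of order, which is exactly what we hope to see here:
cut -f1 -d"," raw_short.csv | tail -n +2 | sort -nc
(tail -n +2 skips the header so the column name doesn't confuse the numeric sort.) If the IDs were still in their original order this would exit silently; a “disorder” message means the shuffle worked.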
I then ran the full sampling command on my file and it finished in less than three seconds. Not bad!