Data science at the command line
The UNIX command line is a powerful tool for diving into large data files and piping specialized utilities to create summary statistics. This is a compilation of some useful commands.
Parsing CSV
Read the first line of a CSV file, which contains the
column names, and list each column name in a new line
after splitting by comma and stripping surrounding
quotes. Notice awk
can create an array and
has operations like length
that returns the
number of elements in an array.
head -n 1 WebExtract.txt | awk '{ split($0, a, ","); max = length(a) }
END { for (x=1; x<=max;x++) {gsub(/"/, "", a[x]); print a[x]} }'
Take the average of a column of integers in a CSV file after filtering for a category.
cat iris.csv | grep 'Iris-setosa' | awk -F "," '{ sum += $1; n += 1; }
END { printf "%0.5f\n", sum/n }'
To select a single line in a CSV file and output one column.
sed -n '2405p;2405q' WebExtract.txt | awk -F "\",\"" '{ print $3; }'
We can also select a subset of lines in the middle of the document and return a single column.
sed -n '1000,1010p' WebExtract.txt | awk -F "\",\"" { 'print $3; }'
Count the number of lines with wc
.
cat trip_data_1.csv | wc -l
JSON files can be parsed using jq. The following
command parses a 35MB file for the attribute
city
, sorts the cities, and returns the
number of occurrences.
cat ./yelp_train_academic_dataset_business.json | jq '.city' | sort | uniq -c
Parsing JSON
To test a command on a large file we can select just the first few lines.
head -n 100 ./yelp_train_academic_dataset_business.json | jq '.city' | sort | uniq -c
To further filter the output for occurrences greater
than 5 we can use awk
.
head -n 100 ./yelp_train_academic_dataset_business.json | jq '.city' |
sort | uniq -c | awk '{if ($1 > 5) {print $0}}'
The above filter also works reasonably well on the entire dataset.
cat ./yelp_train_academic_dataset_business.json | jq '.city' |
sort | uniq -c | awk '{if ($1 > 100) {print $0}}'
To get the total number of entires with over 100 occurrences we can take the sum.
cat ./yelp_train_academic_dataset_business.json | jq '.city' |
sort | uniq -c | awk '{ if ($1 > 100) {sum += $1;} } END { print sum }'
To filter by a key within jq
we can pipe
commands and return only the city names.
head -n 3 ./yelp_train_academic_dataset_business.json | jq 'select(.city |
contains("De Forest")) | .city'