Exploring open data
I've recently become interested in exploring open data. This article presents the tools and platforms I use when exploring the most recent household surveys run by the World Bank and the Nigerian Bureau of Statistics as part of the Living Standards Measurement Study.
In June 2020, I came across the Living Standards Measurement Study (LSMS) organised by the World Bank. These studies are carried out in different parts of the world to better understand households and provide policy makers with information that can help positively affect society. I became particularly interested in the Nigerian Living Standards Survey. There are official reports published for the datasets, but I decided to explore the data and create my own visual interpretations. This article is a summary of the tools I found useful and the processes I go through when exploring the datasets.
Data Files
The datasets are provided as CSV, Stata or SPSS files. I opted for CSV as it did not tie me to any special licences. My intent was to first get a visual overview of the dataset and see whether there are any identifiable trends before deep-diving into more rigorous statistical analysis. Each LSMS survey is accompanied by a series of data files that fit a relational database model.
AWS Athena
Amazon Athena is part of the AWS tooling for building a data-analysis platform in the cloud. You provide it with data sources such as logs, CSV or JSON files, which are typically uploaded to AWS S3. You then create a crawler in AWS Glue that transforms the input data into a relational format; Glue can also run a scheduled job that automatically updates the generated tables. Finally, Athena provides a console for querying the tables using SQL.
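As a rough sketch of that workflow using the AWS CLI, this is roughly what the upload and crawler steps look like. The bucket, role, database and crawler names below are placeholders I have made up, not values from the LSMS datasets.

# Upload the CSV data to S3 (local path and bucket name are placeholders).
aws s3 cp ./nlss-data s3://my-lsms-bucket/nlss/ --recursive

# Create a Glue crawler pointed at the uploaded data, then run it.
# The role must be a Glue service role with read access to the bucket.
aws glue create-crawler \
    --name nlss-crawler \
    --role arn:aws:iam::123456789012:role/MyGlueServiceRole \
    --database-name nlss \
    --targets '{"S3Targets": [{"Path": "s3://my-lsms-bucket/nlss/"}]}'
aws glue start-crawler --name nlss-crawler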
One gotcha with the CSV dataset I was processing was that the AWS Glue crawler expects each data file to sit in its own folder before it will create a separate table for that file. I remedied that with a shell script that loops through the files in a given directory, creates a folder with the same name as each file and moves the file into it. See below for an example.
#!/bin/sh
# Move each data file into a directory of the same name so that the
# Glue crawler creates one table per file.
for filename in --REPLACE-WITH-PATH--
do
    # Strip the file extension to get the directory name.
    directoryName=$(echo "$filename" | cut -d'.' -f1)
    mkdir -p "$directoryName"
    mv "$filename" "$directoryName"
    echo "moved $filename to $directoryName"
done
echo "done"
I was able to complete the import of the 94 tables into AWS in under 15 minutes. I am also exploring ways to automate provisioning users who might be interested in querying or analysing the data using SQL on AWS.
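As an illustration, once the crawler has built the tables, queries can be submitted from the CLI as well as from the console. The database, table, column and bucket names below are hypothetical placeholders rather than the actual NLSS schema.

# Submit an Athena query from the command line (all names are placeholders).
aws athena start-query-execution \
    --query-string "SELECT state, COUNT(*) AS households FROM household GROUP BY state" \
    --query-execution-context Database=nlss \
    --result-configuration OutputLocation=s3://my-lsms-bucket/athena-results/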
Note that Athena costs some money to run. The costs are proportional to the amount of data scanned while processing a query. Familiarise yourself with the pricing before using it.
ObservableHQ
Once I have a good view of the data, which might involve filtering, grouping and joining multiple tables, I export the new views to CSV and visualise the resulting datasets using d3 or vega-lite on ObservableHQ. ObservableHQ provides an interactive platform for quickly generating visualisations.
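Athena writes each query's result set as a CSV file named after the query execution ID in the configured S3 output location, so the export step can be as simple as copying that file down before loading it into a notebook. The bucket path and query execution ID below are placeholders.

# Download a query's result CSV from the Athena output location
# (bucket path and query execution id are placeholders).
aws s3 cp s3://my-lsms-bucket/athena-results/1b2c3d4e-5678-90ab-cdef-1234567890ab.csv household_summary.csv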
Why not use MySQL Workbench?
MySQL Workbench Community Edition is a free tool that can import CSV files as tables into a MySQL database. It attempts to determine each column's type from the values in the first few rows of the data file. I was initially optimistic, but my optimism dwindled once I saw how error prone the process was. Because the column type is inferred from the first few rows, a column judged to be an INTEGER causes the import engine to throw an error, or to skip the entire record, when it later encounters an empty value, instead of writing a NULL to the cell. Conversely, the wizard assigns a VARCHAR type to columns whose fields happen to be empty in the first few rows. This quickly became tedious: some of the tables had over 20 columns with nullable INTEGER values, and I found myself changing most of the columns to VARCHAR since that was the only way to be sure that every record in the data files was imported into the database tables.
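A made-up example file shows the failure mode: judging only by the first rows, the age column looks like an INTEGER, but the blank value further down is what trips up the import.

# Create a toy CSV whose "age" column looks like an INTEGER from the
# first rows but contains an empty value further down (made-up data).
cat > sample.csv <<'EOF'
hhid,age
1,34
2,27
3,
4,41
EOF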
Aside from manually updating the column types, the import process was slow. For instance, it took nearly an hour to import a data file with half a million records. Attempting to run multiple imports at the same time produced lots of errors in the logs and did not fill me with confidence that the records had been imported into the tables successfully.
I’m publishing this as part of 100 Days To Offload. You can join in yourself by visiting https://100daystooffload.com.