AWS S3 Select — Retrieving Subsets of Data



AWS’s S3 is their immensely popular object storage service. S3 is relatively cheap, flexible and extremely durable. It’s fairly common for me to store large data files in an S3 bucket and pull them down for analysis on my laptop. These data files can be several gigs and I usually only operate on subsets that fit comfortably into memory. Downloading these large files only to use part of them makes for sub-par performance and increased S3 costs.

An option to the above pattern is using the relatively unknown S3 Select feature. S3 Select allows you to use SQL to fetch only the data you need. The selection takes place on AWS so performance goes up and costs go down. There are other ways (Athena, Presto, Drill, etc) to run SQL against S3, but it’s hard to beat the simplicity of this setup.

Let’s get to some code. I’ll demonstrate how to perform a select on a CSV file using Python and boto3.

First some imports. Notice that I’m using Pandas as it’s my preferred way to analyze data.

import boto3
import pandas as pd
from io import StringIO

Next four lines are straightforward. Creating an S3 object, S3 bucket name, CSV file name and the SQL statement. The file we’re using is some bay area bike share data I had laying around.

s3 = boto3.client(‘s3’)
bucket_name = ‘my-bucket’
file_name =201508_trip_data.csv’
sql_stmt = """SELECT Duration, \"Start Date\", \"Start Station\"
FROM s3object
WHERE \"Start Station\" = 'San Jose City Hall'"""

Next, we make the selection with the parameters we covered above. We also use CSV for both the Input and Output serialization parameters. JSON is the other option by the way. We elect to use FileHeaderInfo so the SQL statement can refer to column names instead of column positions.

# making the request
req = s3.select_object_content(
InputSerialization = {CSV: {‘FileHeaderInfo’: ‘Use’}},
OutputSerialization = {CSV: {}},

Things get a little interesting in this next section. The select_object_content method returns an AWS EventStream that you have to loop through to get your data. The data is found in one or more Records keys. Each value in Records is a byte string that is likely to be made up of several rows in your CSV file. We also capture the Stats (bytes scanned, bytes returned, etc) associated with the request.

records = []
# looping through the payload of the AWS EventStream and getting one
# or more Records and Stats
for event in req[‘Payload’]:
if ‘Records’ in event:
elif ‘Stats’ in event:
stats = event[‘Stats’][‘Details’]

I want to get the query results into a Pandas DataFrame for analysis so I need to do some additional work. First, I convert the byte strings into regular strings and then join all the records to form one large string. I then use StringIO() so Pandas read_csv() can read it just like a file off disk.

# converting the byte strings to strings and then joining them together
# to form one large string
file_str = ‘’.join(r.decode(‘utf-8) for r in records)
# doing StringIO(file_str) so it looks like CSV file to pd.read_csv()
select_df = pd.read_csv(StringIO(file_str),
names=[‘Duration’,‘Start Date’, ‘Start Station’])
>> 832
Duration Start Date Start Station
0 409 8/31/2015 23:10 San Jose City Hall
1 2915 8/30/2015 22:32 San Jose City Hall
2 2915 8/30/2015 22:31 San Jose City Hall
3 447 8/28/2015 15:31 San Jose City Hall
4 756 8/27/2015 13:06 San Jose City Hall

Retrieving those 832 records from the CSV took less than a second. Compare that to the code below which yields the same 832 records, but took 10 seconds.

# reading the entire file from S3
df = pd.read_csv(‘s3://my-bucket/201508_trip_data.csv’, usecols=[1,2,3])
# filtering for stations marked as San Jose City Hall
filter = df[‘Start Station’] == ‘San Jose City Hall’
>> 832

Here’s the code in the form of a Jupyter notebook.