Efficiently uploading Serratus files to AWS databases using Lambda
I recently implemented an update across the stack for serratus.io. Here, I’ll share some experiences from the process.
- The data
- The goals
- Our approach, including experiments with various AWS offerings.
Data
`summary2/<sra_id>.summary`
: Serratus analysis results stored as S3 files in Serratus summary format.
  - For the Nucleotide search, there were 3,837,755 files.
  - For the RdRP search, there were 5,686,715 files.

`index.txt`
: index file containing one `sra_id` per line.
Goals
- Quickly upload all summary data.
- Expose the data as publicly searchable. Consumption options:
  - Direct database connection
  - R [tantalus]
  - Graphical interface [serratus.io]
  - REST API [serratus-summary-api]
Upload
Source code: serratus-summary-uploader
This didn’t take much trial and error: it was clear that AWS Lambda would be perfect for this. We used a manager-worker parallelism approach, which means two Lambda functions:
- Manager lambda
  - Iterate through the index file
  - Invoke the worker lambda for every batch of lines to be processed (used `n=1000`)
  - Self-invoke prior to function timeout to continue index file iteration
- Worker lambda
  - Download summary files from S3 (with exponential backoff for throttling)
  - Parse summary files into `pandas` dataframes
  - Upload data to the destination. The destination choice was an experiment in itself, which is discussed in the next section.
Each worker lambda takes ~5 minutes to complete. Factoring in the sequential nature of the manager lambda plus the time for Lambda function invocation, an index file with 3,837,755 lines finished in under 20 minutes.
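To make the pattern concrete, here is a minimal sketch of the two handlers. The bucket name, worker function name, and timeout margin are illustrative assumptions; the real implementation is in serratus-summary-uploader.

```python
import json
import random
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BUCKET = "serratus-example-bucket"   # assumed bucket name
BATCH_SIZE = 1000                    # lines per worker invocation
TIMEOUT_MARGIN_MS = 30_000           # self-invoke when less than this remains

def manager_handler(event, context):
    """Iterate index.txt, dispatching one worker invocation per batch of sra_ids."""
    body = s3.get_object(Bucket=BUCKET, Key="index.txt")["Body"].read()
    lines = body.decode().splitlines()
    offset = event.get("offset", 0)

    while offset < len(lines):
        if context.get_remaining_time_in_millis() < TIMEOUT_MARGIN_MS:
            # Close to the Lambda timeout: re-invoke ourselves asynchronously
            # so iteration continues from the current offset.
            lambda_client.invoke(
                FunctionName=context.function_name,
                InvocationType="Event",
                Payload=json.dumps({"offset": offset}),
            )
            return
        lambda_client.invoke(
            FunctionName="serratus-worker",  # assumed worker function name
            InvocationType="Event",
            Payload=json.dumps({"sra_ids": lines[offset:offset + BATCH_SIZE]}),
        )
        offset += BATCH_SIZE

def download_with_backoff(key, max_attempts=6):
    """GET one summary file, backing off exponentially on S3 throttling."""
    for attempt in range(max_attempts):
        try:
            return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        except ClientError as err:
            if err.response["Error"]["Code"] != "SlowDown":
                raise
            time.sleep(2 ** attempt + random.random())  # backoff with jitter
    raise RuntimeError(f"still throttled after {max_attempts} attempts: {key}")

def worker_handler(event, context):
    for sra_id in event["sra_ids"]:
        raw = download_with_backoff(f"summary2/{sra_id}.summary")
        # ...parse `raw` into pandas dataframes and upload to the destination...
```

`InvocationType="Event"` makes each invocation asynchronous, which is what lets the workers fan out in parallel while the manager keeps walking the index.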
Data storage
The choice of upload destination was tricky and much time was spent experimenting with various AWS offerings. Below is a section for each offering we tried.
Athena
Pros:
- Easily view files in S3
- `awswrangler` made it easy to upload data with optimizations
Cons:
- Not publicly query-able (requires access to Athena on host AWS account)
- S3 upload is slow, with frequent throttling during parallelized upload (`SlowDown`)
- Simple queries are extremely slow, e.g. `select * from nfamily where sra_id = 'ERR2756788' limit 10`
  - Initially, this was the problem for any column used in `where`. After adding partition projection for `score` and `percent_identity` (since those are the filter options on serratus.io), queries on those columns were manageable. However, partitioning on a high-cardinality column, e.g. `sra_id`, isn’t the right optimization. Bucketing could be an option, but it didn’t seem possible with our parallelized upload approach, and it would result in many small parquet files (undesirable for Athena queries).
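For reference, here is roughly what the `awswrangler` upload path looks like, as a sketch assuming awswrangler 2.x; the bucket, database, table name, and projection ranges here are illustrative, not our exact configuration.

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({
    "sra_id": ["ERR2756788"],
    "score": [95],
    "percent_identity": [90],
})

# Write partitioned parquet, register it in the Glue catalog, and enable
# Athena partition projection so queries filtering on the partition
# columns don't need explicit partition management.
wr.s3.to_parquet(
    df=df,
    path="s3://serratus-example-bucket/athena/nfamily/",  # assumed path
    dataset=True,
    database="serratus",  # assumed Glue database
    table="nfamily",
    partition_cols=["score", "percent_identity"],
    projection_enabled=True,
    projection_types={"score": "integer", "percent_identity": "integer"},
    projection_ranges={"score": "0,100", "percent_identity": "0,100"},
)
```

Partition projection lets Athena compute partition locations from the query predicate instead of enumerating them in the catalog, which is why it only helps for the projected (low-cardinality) columns.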
DynamoDB
I didn’t spend much time evaluating this one.
Pros:
- Fast upload
Cons:
- NoSQL (our data is relational)
- PartiQL isn’t SQL, and there’s no way to offer a direct public database connection
Aurora Serverless (PostgreSQL) (v1)
Pros:
- Low cost
- Auto/manual scaling of “capacity units”
- Fast upload when instance count manually scaled up
Cons:
- Direct connection must be from within the VPC.
- Data API is publicly accessible but has drastic limitations:
  - 1 MB / 1,000-row query result limit
  - Requires Secrets Manager entry retrieval (not publicly accessible via password auth)
- Waking from “0 capacity units” state takes up to a minute, even for simple queries. This would result in painful experiences for the end user on serratus.io.
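To illustrate the friction: every Data API call must carry both the cluster ARN and a Secrets Manager secret ARN, so there is no plain user/password public access. A sketch with boto3 (the ARNs are placeholders):

```python
import boto3

rds_data = boto3.client("rds-data")

response = rds_data.execute_statement(
    resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:serratus",            # placeholder
    secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:serratus-x",  # placeholder
    database="summary",
    sql="select * from nfamily where sra_id = :sra_id limit 10",
    parameters=[{"name": "sra_id", "value": {"stringValue": "ERR2756788"}}],
)
# Results are capped at 1 MB / 1,000 rows per call, so larger result
# sets must be paginated manually.
```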
Aurora Provisioned
Pros:
- Publicly accessible
- Always-on
Cons:
- No way to scale up/down for batch uploads to save cost.
Final setup
The final setup is a mixture of provisioned and serverless Aurora:
- Serverless Aurora for data upload
- Restore from snapshot to a `db.t3.medium` provisioned Aurora instance for every update.
- Provide read-only access to the database with public credentials.
We can do this because our data is served read-only with infrequent updates. All updates go to the serverless cluster first; then a snapshot is created and restored (takes ~30 minutes), and references to the current endpoint are updated. This way we can also stage changes on a separate endpoint and make a PR downstream (example).
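Here is a sketch of that snapshot-and-restore cycle with boto3; the identifiers are placeholders, not our actual names.

```python
import boto3

rds = boto3.client("rds")

# 1. Snapshot the serverless cluster after a batch upload completes.
rds.create_db_cluster_snapshot(
    DBClusterSnapshotIdentifier="serratus-summary-2021-01-01",  # placeholder
    DBClusterIdentifier="serratus-serverless",                  # placeholder
)

# 2. Restore the snapshot into a new provisioned cluster...
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="serratus-public-new",
    SnapshotIdentifier="serratus-summary-2021-01-01",
    Engine="aurora-postgresql",
)

# 3. ...and attach a db.t3.medium instance to serve public read-only traffic.
rds.create_db_instance(
    DBInstanceIdentifier="serratus-public-new-instance",
    DBClusterIdentifier="serratus-public-new",
    DBInstanceClass="db.t3.medium",
    Engine="aurora-postgresql",
)
# Finally, update downstream references to point at the new cluster endpoint.
```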
See this wiki page for more info on Serratus database management.