Efficiently uploading Serratus files to AWS databases using Lambda
I recently implemented an update across the stack for serratus.io. Here, I’ll share some experiences from the process.
- The data
- The goals
- Our approach, including experiments with various AWS offerings.
Data
- `summary2/<sra_id>.summary`: Serratus analysis results stored as S3 files in Serratus summary format.
  - For the Nucleotide search, there were 3,837,755 files.
  - For the RdRP search, there were 5,686,715 files.
- `index.txt`: index file containing one `sra_id` per line.
Goals
- Quickly upload all summary data.
- Expose the data as publicly searchable. Consumption options:
- Direct database connection
- R [tantalus]
- Graphical interface [serratus.io]
- REST API [serratus-summary-api]
Upload
Source code: serratus-summary-uploader
This didn't take much trial and error: it was clear that AWS Lambda would be perfect for this. We used a manager-worker parallelism approach, meaning two Lambda functions (both sketched below):
- Manager lambda
  - Iterate through the index file
  - Invoke the worker lambda for every batch of lines to be processed (used `n=1000`)
  - Self-invoke prior to the function timeout to continue index file iteration
- Worker lambda
  - Download summary files from S3 (with exponential backoff for throttling)
  - Parse summary files into `pandas` dataframes
  - Upload data to the destination. The destination choice was an experiment in itself, discussed in the next section.
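The manager's batching and self-invocation pattern looks roughly like the sketch below. This is a minimal illustration rather than the actual uploader code: the bucket/key, the worker function name, and the `load_index_lines` helper are all hypothetical.

```python
import json

import boto3

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")

BATCH_SIZE = 1000  # lines of index.txt handled per worker invocation


def load_index_lines():
    # Hypothetical bucket/key for the index file
    obj = s3.get_object(Bucket="my-serratus-bucket", Key="index.txt")
    return obj["Body"].read().decode().splitlines()


def handler(event, context):
    # "offset" records where the previous run stopped (0 on the first run)
    offset = event.get("offset", 0)
    lines = load_index_lines()

    for start in range(offset, len(lines), BATCH_SIZE):
        # If we're close to the Lambda timeout, hand off the remaining
        # work to a fresh copy of this function and exit
        if context.get_remaining_time_in_millis() < 30_000:
            lambda_client.invoke(
                FunctionName=context.function_name,
                InvocationType="Event",  # async invoke, don't wait
                Payload=json.dumps({"offset": start}),
            )
            return

        # Fire off one worker per batch of sra_ids, asynchronously
        lambda_client.invoke(
            FunctionName="summary-worker",  # hypothetical worker name
            InvocationType="Event",
            Payload=json.dumps({"sra_ids": lines[start:start + BATCH_SIZE]}),
        )
```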
Each worker lambda takes ~5 minutes to complete. Factoring in the sequential nature of the manager lambda plus Lambda invocation time, an index file with 3,837,755 lines finished in under 20 minutes.
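On the worker side, the throttling-aware download loop can be sketched like this. Again, an illustration rather than the real code: the bucket name is made up, and `parse_summary` is a stub for the project-specific parsing step.

```python
import random
import time

import boto3
import pandas as pd
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def download_with_backoff(bucket, key, max_retries=8):
    """Fetch an S3 object, backing off exponentially when throttled."""
    for attempt in range(max_retries):
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except ClientError as e:
            if e.response["Error"]["Code"] != "SlowDown":
                raise
            # 1s, 2s, 4s, ... plus jitter so retries don't align
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"gave up downloading s3://{bucket}/{key}")


def parse_summary(raw):
    # Stub: the real parser converts the Serratus summary format
    # into a pandas dataframe
    raise NotImplementedError


def handler(event, context):
    frames = []
    for sra_id in event["sra_ids"]:
        raw = download_with_backoff("my-serratus-bucket", f"summary2/{sra_id}.summary")
        frames.append(parse_summary(raw))
    df = pd.concat(frames, ignore_index=True)
    # df is then uploaded to the destination (next section)
```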
Data storage
The choice of upload destination was tricky and much time was spent experimenting with various AWS offerings. Below is a section for each offering we tried.
Athena
Pros:
- Easily view files in S3
- `awswrangler` made it easy to upload data with optimizations
Cons:
- Not publicly query-able (requires access to Athena on the host AWS account)
- S3 upload is slow, with frequent throttling (`SlowDown`) during parallelized upload
- Simple queries are extremely slow, e.g. `select * from nfamily where sra_id = 'ERR2756788' limit 10`
  - Initially, this was the problem for any column used in `where`. After adding partition projection for `score` and `percent_identity` (since those are the filter options on serratus.io), queries on those columns were manageable. However, partitioning on a high-cardinality column, e.g. `sra_id`, isn't the right optimization. Bucketing could be an option, but it didn't seem possible with our parallelized upload approach, and would result in many small parquet files (undesirable for Athena queries)
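For reference, an `awswrangler` upload along these lines, writing partitioned parquet to S3 and registering the table in the Glue catalog so Athena can query it, might look like this sketch (bucket, database, and table names are made up):

```python
import awswrangler as wr
import pandas as pd

# A toy stand-in for a batch of parsed summary rows
df = pd.DataFrame({
    "sra_id": ["ERR2756788"],
    "score": [87],
    "percent_identity": [96.0],
})

# Write partitioned parquet and register it in the Glue catalog;
# partition_cols maps to the partition projection discussed above
wr.s3.to_parquet(
    df=df,
    path="s3://my-serratus-bucket/athena/nfamily/",
    dataset=True,
    mode="append",
    database="serratus",
    table="nfamily",
    partition_cols=["score"],
)
```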
DynamoDB
I didn’t spend much time evaluating this one.
Pros:
- Fast upload
Cons:
- NoSQL (our data is relational)
- PartiQL isn't SQL, and there is no way to get a direct public database connection
Aurora Serverless (PostgreSQL) (v1)
Pros:
- Low cost
- Auto/manual scaling of “capacity units”
- Fast upload when instance count manually scaled up
Cons:
- Direct connection must be from within the VPC.
- Data API is publicly accessible but has drastic limitations
- 1MB / 1000-row query result limit
- Requires Secrets Manager entry retrieval (not publicly accessible via password auth)
- Waking from “0 capacity units” state takes up to a minute, even for simple queries. This would result in painful experiences for the end user on serratus.io.
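For context, a Data API query looks something like the sketch below (the ARNs are placeholders). Since every caller must be able to read the Secrets Manager entry, this can't simply be opened to the public with password auth, and results are truncated at the 1MB / 1000-row limit.

```python
import boto3

rds_data = boto3.client("rds-data")

# Query the serverless cluster over HTTPS via the Data API;
# resourceArn/secretArn below are placeholders, not real ARNs
response = rds_data.execute_statement(
    resourceArn="arn:aws:rds:us-east-1:123456789012:cluster:serratus",
    secretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:serratus",
    database="summary",
    sql="SELECT * FROM nfamily WHERE sra_id = :sra_id LIMIT 10",
    parameters=[{"name": "sra_id", "value": {"stringValue": "ERR2756788"}}],
)
rows = response["records"]  # capped at 1 MB / 1,000 rows per call
```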
Aurora Provisioned
Pros:
- Publicly accessible
- Always-on
Cons:
- No way to scale up/down for batch uploads to save cost.
Final setup
The final setup is a mixture of provisioned and serverless Aurora:
- Serverless Aurora for data upload
- Restore from snapshot to a `db.t3.medium` provisioned Aurora instance for every update.
- Provide read-only access to the database with public credentials.
We can do this because our data is served read-only with infrequent updates. All updates go to the serverless cluster first; then a snapshot is created and restored (takes ~30min), and references to the current endpoint are updated. This way we can also stage changes on a separate endpoint and make a PR downstream (example).
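With boto3, the snapshot-and-restore step can be sketched as follows. The cluster, snapshot, and instance identifiers here are placeholders, not the real Serratus names.

```python
import boto3

rds = boto3.client("rds")

# 1. Snapshot the serverless cluster after an upload completes
rds.create_db_cluster_snapshot(
    DBClusterIdentifier="serratus-serverless",
    DBClusterSnapshotIdentifier="serratus-2021-01-01",
)
rds.get_waiter("db_cluster_snapshot_available").wait(
    DBClusterSnapshotIdentifier="serratus-2021-01-01",
)

# 2. Restore the snapshot into a new provisioned cluster
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="serratus-public-new",
    SnapshotIdentifier="serratus-2021-01-01",
    Engine="aurora-postgresql",
)

# 3. Attach a db.t3.medium instance to serve the read-only traffic
rds.create_db_instance(
    DBInstanceIdentifier="serratus-public-new-1",
    DBClusterIdentifier="serratus-public-new",
    DBInstanceClass="db.t3.medium",
    Engine="aurora-postgresql",
)
```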
See this wiki page for more info on Serratus database management.