Scaling Doc Processing with Serverless
Intro to Document Processing
Standard Metrics' mission is to accelerate innovation in the private markets. We're building a platform that aims to become a financial "lingua franca" between startups and venture firms.
We help startups share financial metrics with their investors effortlessly. One of our core services is analyzing, validating, and capturing metrics from the financial documents that startups share with their investors.
Our Data Solutions team is in charge of this process. They offer strong SLAs on the time it takes from when a document is uploaded to our platform until it's analyzed and its metrics are available. It's part of their quarterly OKRs, so they track it religiously and have goals around it.
The Problem
Excel files are used ubiquitously for financial data. They represent 60% of all the documents on the platform. They're so important that we built a dedicated tool where the Data Solutions team analyzes, validates, and maps each value to its financial metric.
A JSON representation of the Excel file powers the tool's UI. Files uploaded to the platform go to S3 for storage and antivirus scanning. Then, a Celery task kicks off in the background, processes the file, transforms it into its JSON representation, and uploads it back to S3 for the UI to consume.
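To make this concrete, here is a simplified sketch of what a cell-grid JSON representation for a workbook could look like, expressed as TypeScript types. The field names below are illustrative assumptions, not our exact schema:

```typescript
// Hypothetical shape of the JSON representation the UI consumes.
// Field names are illustrative; the real schema is more detailed.
interface ProcessedWorkbook {
  documentId: string;        // internal ID assigned when the file is uploaded
  sheets: ProcessedSheet[];
}

interface ProcessedSheet {
  name: string;              // e.g. "P&L FY2023"
  rows: ProcessedCell[][];   // row-major grid of cells
}

interface ProcessedCell {
  value: string | number | boolean | null;
  formatted: string;         // the value as displayed in Excel, e.g. "$1,200.00"
}
```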
There were several issues with this first implementation of the Document Processing queue. Files would fail to process from time to time, "clogging" the queue. We did not have instrumentation for the failures, so this would happen silently until the Data Solutions team noticed and reported the issue.
Investigating these issues was a time sink for engineering and negatively impacted the Data Solutions team's SLAs and OKRs.
Step Zero
The initial theory was that Calamine, the library we used to read the Excel files, was consuming too much memory on big files or files with images and getting killed by Kubernetes. That blocked other tasks in the same Pod, as well as the tasks queued behind it while the Pod was rescheduled.
Our highest priority was instrumenting the Celery task to get visibility into the failures. We improved logs and metrics on the task to track "Time in Queue" and the number of "Items in the Queue."
We added two alarms:
- When the time in queue exceeded 30 minutes.
- When there were more than 25 files in the queue for more than 15 minutes.
Things improved significantly, and we detected issues in hours instead of days. On the flip side, these alarms were triggering as often as three times per week. That still wasn't good enough for the Solutions team's SLAs.
We also noticed a new pattern: during reporting season (when startups share more data with their investors), time in queue went up, with individual tasks taking up to 20 minutes to run.
With all this new information, and knowing this was happening more often than initially thought, we carved out space in our roadmap to find a solution.
Doc Processing v2
With a better understanding of how things needed to work, the team proposed a new architecture for Doc Processing where queueing and JSON generation happen outside our core backend, using AWS Lambda and SQS instead of Celery tasks running inside our Kubernetes cluster.
Our goals for the project were:
- Better isolation. Processing wouldn't affect other tasks on the platform, and when a file fails, it is retried automatically instead of blocking the tasks behind it.
- Built with good bones. Good instrumentation makes understanding problems easier, and we have better guard rails against failures, with strong retry guarantees from the infra and the ability to “replay” failures.
- Dogfooding and integrating with the core service. The plan was to leverage our existing backend codebase as much as possible. We replaced only the consumer module, using webhooks for event notifications, and kept the original workflow and business rules largely untouched. We also integrated the existing antivirus workflow and improved its interaction with the backend.
High-level implementation
We swapped out the Celery tasks for Node.js Lambda processors. We integrated directly with the antivirus events from S3 via EventBridge: as soon as the antivirus marks a file as clean, a Lambda picks it up and queues it for processing.
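As a simplified sketch, the queueing side is a small Lambda that receives the scan result from EventBridge and drops a message on the processing queue. The event detail type, its fields, and the queue URL variable below are illustrative assumptions rather than our exact contract:

```typescript
import { EventBridgeEvent } from "aws-lambda";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// Hypothetical payload emitted when the antivirus finishes scanning a file.
interface ScanResultDetail {
  bucket: string;
  key: string;
  scanStatus: "CLEAN" | "INFECTED";
}

const sqs = new SQSClient({});

// Triggered by EventBridge; enqueues clean files for processing.
export const handler = async (
  event: EventBridgeEvent<"ScanResult", ScanResultDetail>
): Promise<void> => {
  if (event.detail.scanStatus !== "CLEAN") {
    return; // infected or failed scans never reach the processing queue
  }

  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.PROCESSING_QUEUE_URL, // hypothetical env var
      MessageBody: JSON.stringify({
        bucket: event.detail.bucket,
        key: event.detail.key,
      }),
    })
  );
};
```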
When a file is processed, its JSON representation is uploaded back to S3, and the core backend is notified of the status via a webhook. Because of the SQS integration, failed files are automatically retried three times, and they're sent to a dead-letter queue (DLQ) when they're unprocessable.
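The processing side is conceptually a handler like the sketch below. The `workbookToJson` helper, the environment variable name, and the `.json` output key are placeholders for illustration; the important part is that any thrown error hands the message back to SQS for retry and, eventually, the DLQ:

```typescript
import { SQSEvent } from "aws-lambda";
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { workbookToJson } from "./workbook"; // hypothetical Excel-to-JSON helper

const s3 = new S3Client({});

export const handler = async (event: SQSEvent): Promise<void> => {
  // With a batch size of 1, each invocation handles a single document.
  for (const record of event.Records) {
    const { bucket, key } = JSON.parse(record.body);

    // 1. Fetch the original Excel file.
    const original = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    const bytes = await original.Body!.transformToByteArray();

    // 2. Transform it into the JSON representation the UI consumes.
    const json = workbookToJson(Buffer.from(bytes));

    // 3. Upload the result next to the original file.
    await s3.send(
      new PutObjectCommand({
        Bucket: bucket,
        Key: `${key}.json`,
        Body: JSON.stringify(json),
        ContentType: "application/json",
      })
    );

    // 4. Notify the core backend via webhook. Throwing here lets SQS retry
    //    the message and move it to the DLQ after the configured attempts.
    const response = await fetch(process.env.BACKEND_WEBHOOK_URL!, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ bucket, key, status: "processed" }),
    });
    if (!response.ok) {
      throw new Error(`Webhook failed with status ${response.status}`);
    }
  }
};
```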
We limit the volume of requests to the backend with Lambda's reserved concurrency settings. Any spikes are processed gradually, limiting their contention with other backend tasks for resources.
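Because infra and handlers can live in the same TypeScript codebase, the wiring itself can be described in code. Here's a minimal AWS CDK sketch of the queue, DLQ, and reserved concurrency; apart from the three retries, the specific values and names are illustrative, not our production settings:

```typescript
import { Duration, Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { SqsEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

export class DocProcessingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Unprocessable documents land here after three failed attempts.
    const dlq = new sqs.Queue(this, "DocProcessingDlq", {
      retentionPeriod: Duration.days(14),
    });

    const queue = new sqs.Queue(this, "DocProcessingQueue", {
      visibilityTimeout: Duration.minutes(5), // should exceed the function timeout
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 },
    });

    const processor = new lambda.Function(this, "DocProcessor", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "processor.handler",
      code: lambda.Code.fromAsset("dist"),
      timeout: Duration.minutes(3),
      memorySize: 1024,
      // Caps how many documents are processed in parallel, which also caps
      // webhook traffic to the core backend. The value here is illustrative.
      reservedConcurrentExecutions: 10,
    });

    processor.addEventSource(new SqsEventSource(queue, { batchSize: 1 }));
  }
}
```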
Observability is a first-class citizen this time around. We use DataDog to trace the process end to end. Logs from all our Lambda functions and the backend are centralized in DataDog, too, making it much simpler to figure out issues.
We have alarms for processing time, queue length, and documents in the DLQ. We can trace by document name, document ID, request ID, or execution ID and follow the flow from when a file is uploaded, to when it's queued and picked up for processing, all the way to the webhook in the backend.
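On the Lambda side, the tracing setup is conceptually just wrapping the handler. The sketch below uses the datadog-lambda-js wrapper (typically paired with the Datadog Lambda extension); the metric name is an illustrative assumption:

```typescript
import { SQSEvent } from "aws-lambda";
import { datadog, sendDistributionMetric } from "datadog-lambda-js";

async function processDocument(event: SQSEvent): Promise<void> {
  const started = Date.now();

  // ... download, transform, upload, webhook (see the processor sketch above) ...

  // Hypothetical custom metric: per-document processing time, for dashboards and alarms.
  sendDistributionMetric("doc_processing.duration", (Date.now() - started) / 1000);
}

// Wrapping the handler correlates logs, traces, and metrics for each invocation,
// which is what lets a document be followed from upload all the way to the webhook.
export const handler = datadog(processDocument);
```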
Processing files is much faster. Our average processing time is now 8 seconds, down from 20 minutes, and the maximum processing time went from 4 days down to 3 minutes. Because of that, we also alarm much earlier: when a document has been queued for 10 minutes (down from 30) or when there are any files in the DLQ.
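To make those thresholds concrete, here's how the same two alarms could be expressed against the SQS metrics with CDK and CloudWatch. This is an equivalent illustration for the queues defined in the stack sketch above, not our actual monitor definitions:

```typescript
import { Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as sqs from "aws-cdk-lib/aws-sqs";

// Adds the two alarms for the queues created in the DocProcessingStack sketch.
export function addProcessingAlarms(scope: Construct, queue: sqs.IQueue, dlq: sqs.IQueue): void {
  // Alarm when the oldest queued document has been waiting for ~10 minutes.
  queue
    .metricApproximateAgeOfOldestMessage({ period: Duration.minutes(1), statistic: "Maximum" })
    .createAlarm(scope, "DocQueuedTooLong", {
      threshold: 600, // seconds
      evaluationPeriods: 1,
    });

  // Alarm as soon as anything lands in the dead-letter queue.
  dlq
    .metricApproximateNumberOfMessagesVisible({ period: Duration.minutes(1) })
    .createAlarm(scope, "DocumentsInDlq", {
      threshold: 0,
      evaluationPeriods: 1,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    });
}
```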
We also handle spikier loads gracefully now. In our previous implementation, heavier traffic usually meant the queue slowed to a crawl, and when tasks spiked to 3x or 4x our regular traffic, resource contention started bleeding into other endpoints and Celery tasks.
Our most recent spike was 35x our average traffic. It was very noticeable on the dashboards compared with our typical weekly traffic; however, its impact was negligible.
We processed the spike with a 100% success rate. Maximum time in queue went up to 2.43 minutes as the new system “buffered” files in the queue for processing. Still, processing time per document (each task's execution) stayed well under the 8-second average.
Learning Lessons
This project was a success, and we learned a lot while building it. We ran the two versions side by side for a week and identified all the failure modes in processing that had been hidden before. We migrated all traffic in about three weeks, including a few extra features, with no major issues.
Our big wins were:
- We reduced the time in queue and improved the Solutions team's experience with extra features and a better integration with the antivirus workflow.
- We improved error handling, processing speed, traffic-spike handling, and instrumentation; reduced our time to recovery; and made our runbooks a lot simpler.
Some trade-offs:
- This project adds complexity: more infrastructure and a new serverless pattern. Engineers will need time to learn and work with the new stack, and testing locally is also more difficult.
- We had to choose between TypeScript and Python for Lambda. We went with TypeScript because Lambda makes it much easier to work with Node (both infra and logic use the same codebase).
Overall, we are very happy with this new implementation. Not having to think about “the queue” anymore was a win for both the Data Solutions and Engineering teams.
For the Data Solutions team specifically, it was a huge quality-of-life improvement: they now focus on the parts of their day-to-day that matter, not on how the processing is implemented. They also had a huge win on their SLAs last quarter! It's nice to think the engineering team helped them achieve it.