
LLM Experimentation

Intro

For Q4 2024, my focus at Standard Metrics has been experimenting with LLMs for data extraction from the financial documents startups share with their investors (think an export of your income statement from QuickBooks, or a slide deck for your investors).

It's been a learning experience, and for me, a good first foray into the world of "productionizing" Generative AI.

I joined the project specifically to iterate on and improve the accuracy of the LLM results, and to build our own evaluation framework.

Evaluations

Evaluations are a set of "deterministic tests" that run against a prompt, a model, and a dataset, scoring how good the LLM results are for a specific use case and, in our case, how much of the result can flow into our system.

A representative set of evaluations measures how well an AI feature is working, and helps detect and avoid regressions as the feature evolves or the models update.

I modeled our evaluation metrics by treating data extraction as "suggestions to a human in the loop": we give the LLM a document, and it says, "These are the metrics from this document I suggest you push into your system." The Data Solutions team then analyzes the results and can accept or reject any of the data points.

If you're a developer, this is very similar to the suggestions you get when using Gen AI in your IDE or a terminal like Warp:

[Image: Warp Terminal auto-completing the "uv add" command]

Accuracy is essential to maintain Standard Metrics' high bar for data quality. The "human in the loop" model treats the LLM more like a "suggestions engine" or an "AI-powered assistant" to guarantee accuracy. It acknowledges that while the LLMs are very valuable for data extraction, they are still improving and need oversight.

Evaluation Metrics

Our evaluation metrics penalize the LLM if it fails to include one of our Standard Metrics from the document, if the suggested metric value does not match the actual reported value, or if the team rejects any part of the suggestions.

Additionally, since LLMs do not always provide the expected output, we also score whether the response aligns with the requested schema format.
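
To make the scoring concrete, here is a minimal TypeScript sketch of how these four scores could be computed for a single document. The types and the scoreDocument function are illustrative stand-ins, not our production code.

typescript
// Illustrative shapes: what the LLM suggested for a document vs. the gold-standard values.
interface SuggestedMetric {
  name: string;
  value: number;
  accepted?: boolean; // set once the Data Solutions team reviews the suggestion
}

interface GoldMetric {
  name: string;
  value: number;
}

interface DocumentScores {
  schema_consistency: number; // did the response parse against the requested JSON schema?
  metric_count: number;       // share of expected metrics the LLM found
  metric_match: number;       // share of found metrics whose values match the gold standard
  acceptance_rate: number;    // share of suggestions the team accepted
}

function scoreDocument(
  schemaValid: boolean,
  suggested: SuggestedMetric[],
  gold: GoldMetric[],
): DocumentScores {
  const goldByName = new Map(gold.map((g) => [g.name, g.value]));
  const found = suggested.filter((s) => goldByName.has(s.name));
  const matched = found.filter((s) => goldByName.get(s.name) === s.value);
  const accepted = suggested.filter((s) => s.accepted);

  return {
    schema_consistency: schemaValid ? 1 : 0,
    metric_count: gold.length ? found.length / gold.length : 1,
    metric_match: found.length ? matched.length / found.length : 0,
    acceptance_rate: suggested.length ? accepted.length / suggested.length : 0,
  };
}

Per-document scores like these are then averaged across the dataset to produce the summary tables shown later in this post.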

Gold-Standard Dataset

A key component of our framework is our "gold-standard" dataset used for scoring the accuracy of the data extracted by the LLMs against a known set of metrics in our platform (this is sometimes also called a "reference-based" or "context-free" evaluation dataset). This dataset provides the ground truth for evaluating every experiment we run.

As the first step of this project, we wanted to measure the current performance as a baseline to compare against as we ran experiments. The Collections and Data Solutions teams had been running this feature's proof of concept since Q3, so I had enough data to build a "gold-standard dataset" for evaluations.

I built our dataset with a Jupyter Notebook, pulling historical documents from the system, comparing the LLM responses against metric values in Standard Metrics, and storing the "gold-standard" values in a DuckDB database.
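
The real pipeline lives in that notebook, but the shape of the data is simple. As a rough sketch (table and column names are illustrative), this is what writing and reading the gold-standard values could look like with the duckdb Node package:

typescript
import duckdb from "duckdb";

// One row per (document, metric) pair with the verified value from Standard Metrics.
const db = new duckdb.Database("gold_standard.duckdb");
const con = db.connect();

con.run(`
  CREATE TABLE IF NOT EXISTS gold_standard (
    document_id      VARCHAR,
    metric_name      VARCHAR,
    metric_value     DOUBLE,
    reporting_period VARCHAR
  )
`);

// Store a verified data point (illustrative values).
con.run(
  "INSERT INTO gold_standard VALUES (?, ?, ?, ?)",
  "doc_123",
  "Revenue",
  125000.0,
  "2024-09",
);

// The evaluator later pulls the expected metrics for a document to score against.
con.all(
  "SELECT metric_name, metric_value FROM gold_standard WHERE document_id = ?",
  "doc_123",
  (err, rows) => {
    if (err) throw err;
    console.log(rows);
  },
);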

Evaluation Framework

The evaluation framework uses Prompt Foo with a custom provider and a custom assertion for each of our four core metrics.

The custom provider fetches the LLM results, and the assertions compare them with the "gold-standard" values from DuckDB, applying our scoring metrics.
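
As a rough sketch of how those pieces fit together (simplified stand-ins, not our production code): a Prompt Foo custom JavaScript/TypeScript provider exposes an id() and a callApi() that returns an { output } object, and a file-based assertion receives that output and returns a score that Prompt Foo aggregates.

typescript
// customProvider.ts — simplified custom provider.
// Hypothetical helper: in our setup this fetches the stored LLM extraction results.
async function fetchStoredExtraction(documentId: string): Promise<unknown> {
  return { metrics: [{ name: "Revenue", value: 125000 }] }; // stub
}

export default class ExtractionProvider {
  id() {
    return "extraction-pipeline";
  }

  async callApi(prompt: string) {
    // For illustration, treat the rendered prompt as the document id.
    const extraction = await fetchStoredExtraction(prompt);
    return { output: JSON.stringify(extraction) };
  }
}

// metricMatchAssert.ts — simplified custom assertion for one of the four metrics.
// It compares the suggested values against gold-standard rows (hard-coded here)
// and returns a { pass, score, reason } grading result.
export function metricMatchAssert(output: string) {
  const suggested: { name: string; value: number }[] = JSON.parse(output).metrics ?? [];
  const gold = new Map([
    ["Revenue", 125000],
    ["Net Income", -20000],
  ]);
  const matched = suggested.filter((m) => gold.get(m.name) === m.value).length;
  const score = gold.size ? matched / gold.size : 0;
  return { pass: score >= 0.5, score, reason: `${matched}/${gold.size} metrics matched` };
}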

At the end of each evaluation, we get a summary table with the total scores for each one of our metrics.

These were our baseline results using GPT-4o:

bash
$ pnpm tsx evaluator.ts --hash baseline --pdf

 Processing llm results: 10:35.255 (m:ss.mmm) 
🧮 Collecting results: 1.038s

Eval scores for pdf files:
┌────────────────────┬──────────────┬──────────────┬─────────────────┐
│ schema_consistency │ metric_count │ metric_match │ acceptance_rate │
│       double       │    double    │    double    │     double      │
├────────────────────┼──────────────┼──────────────┼─────────────────┤
│               0.95 │         0.73 │         0.47 │            0.61 │
└────────────────────┴──────────────┴──────────────┴─────────────────┘

⚫◗ Adding score to duckdb: 81.669ms

This is how we interpret these results:

  • Schema consistency: GPT-4o's responses (with "function calling") followed our expected schema 95% of the time. It generally produced a response matching our JSON schema, but not every time. This is one of the main subtle differences between calling a normal API endpoint and calling an LLM (see the sketch after this list).
  • Metric Count: It identified and extracted the expected metrics in each document 73% of the time. It captures a high portion of the metrics, but it still misses some.
  • Metric Match: This score shows how often the values GPT-4o suggested matched what we expect from the reference dataset. Only 47% of the time, which is pretty low.
  • Acceptance Rate: The Data Solutions team accepted only 61% of the data points GPT-4o suggested from each document. A little over half of the data points were deemed correct or useful by the team, so there is a lot of room for improvement.
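
For reference, the schema-consistency score comes from function calling: we ask the model to answer through a tool whose parameters are a JSON schema, then check that the returned arguments actually validate. Here is a hedged sketch with the OpenAI SDK; the schema below is an illustrative stand-in for our real one.

typescript
import OpenAI from "openai";

const openai = new OpenAI();
const documentText = "..."; // pre-processed document content (placeholder)

// Illustrative tool definition; the real schema covers our full metric taxonomy.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "report_extracted_metrics",
      description: "Report the financial metrics found in the document.",
      parameters: {
        type: "object",
        properties: {
          metrics: {
            type: "array",
            items: {
              type: "object",
              properties: {
                name: { type: "string" },
                value: { type: "number" },
                period: { type: "string" },
              },
              required: ["name", "value", "period"],
            },
          },
        },
        required: ["metrics"],
      },
    },
  },
];

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: documentText }],
  tools,
  tool_choice: { type: "function", function: { name: "report_extracted_metrics" } },
});

// The arguments come back as a JSON string that still has to be parsed and validated;
// this is exactly where the schema_consistency score is taken.
const args = completion.choices[0].message.tool_calls?.[0]?.function.arguments ?? "{}";
console.log(JSON.parse(args));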

Summarizing the results: GPT-4o consistently identified most of the metrics, but there was a significant opportunity to improve its accuracy.

Our experimentation workflow

With this framework, we've shortened the feedback loop between coming up with an experiment that might improve the accuracy of the LLM results and getting a score that tells us whether to push the experiment to production.

We have three phases after defining the experiment:

  1. Building the test case: Getting a list of documents we'll use for the test and validating the parameters (like the mix of prompt, model, and version).
  2. Generating the results: Invoking the LLM pipeline with our parameters against all the documents in the dataset for the experiment.
  3. Evaluating results: Fetching the results, running the evaluations and scores, and finally analyzing and comparing the results.

When we started working with LLMs, this was a manual, nuanced, and error-prone process that took a couple of days. Now, the entire workflow runs in less than 30 minutes, and at the end of the run we have a clear decision on whether the experiment is good to go to production.

Key experiments & findings

Anthropic's PDF Pre-processing (beta)

I've always been partial to Claude in my day-to-day. Comparing Sonnet against GPT-4o was one of the first things I wanted to evaluate, so when Anthropic announced beta support for PDFs, we experimented with it right away.

bash
$ pnpm tsx evaluator.ts --hash V3ZpJYMXI --prompt "anthropic-base-prompt:6" --pdf

 Processing llm results: 10:44.656 (m:ss.mmm)
🧮 Collecting results:  1.318s

Eval scores for pdf files:
┌────────────────────┬──────────────┬──────────────┬─────────────────┐
│ schema_consistency │ metric_count │ metric_match │ acceptance_rate │
│       double       │    double    │    double    │     double      │
├────────────────────┼──────────────┼──────────────┼─────────────────┤
│                1.0 │         0.94 │         0.64 │             0.8 │
└────────────────────┴──────────────┴──────────────┴─────────────────┘

⚫◗ Adding score to duckdb: 82.101ms

Anthropic does an outstanding job pre-processing the file on their side and injecting it as context for the prompt. Claude Sonnet (claude-3-5-sonnet-20241022) performed better than GPT-4o across all four evaluation metrics in this experiment. It's now our default model for PDFs that fall within their API limits.
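
For context, here is roughly what sending a PDF to that beta endpoint looks like with the Anthropic TypeScript SDK. Treat the beta flag, model string, and prompt as illustrative of the feature at the time; the current docs are the source of truth.

typescript
import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// The PDF travels as a base64-encoded "document" content block next to the prompt;
// Anthropic pre-processes it on their side before the model sees it.
const pdfBase64 = fs.readFileSync("income_statement.pdf").toString("base64");

const message = await client.beta.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 4096,
  betas: ["pdfs-2024-09-25"], // PDF support was behind a beta flag at the time
  messages: [
    {
      role: "user",
      content: [
        {
          type: "document",
          source: { type: "base64", media_type: "application/pdf", data: pdfBase64 },
        },
        { type: "text", text: "Extract the financial metrics reported in this document." },
      ],
    },
  ],
});

console.log(message.content);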

OpenAI's o1-preview model

When OpenAI announced o1, we were curious whether it improved on GPT-4o's performance. We had to tweak our OpenAI wrapper module for this experiment because o1 did not support tool calling or system prompts.
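
The tweak was small. Here is a hedged sketch of the shape of the change (prompt contents are placeholders): fold the system prompt into the user message, drop the tool definitions, and parse the JSON out of plain text.

typescript
import OpenAI from "openai";

const openai = new OpenAI();

const systemPrompt = "You extract financial metrics and reply with JSON only."; // placeholder
const documentText = "..."; // pre-processed document content (placeholder)

// o1-preview accepted neither a system message nor tools at the time, so the
// system prompt is folded into the user message and no function calling is used.
const completion = await openai.chat.completions.create({
  model: "o1-preview",
  messages: [{ role: "user", content: `${systemPrompt}\n\n${documentText}` }],
});

// Without function calling, the JSON has to be parsed (and schema-checked) from plain text.
const raw = completion.choices[0].message.content ?? "{}";
console.log(JSON.parse(raw));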

The o1-preview model could only process 30% of the documents in the dataset; for those documents, the results were much better than GPT-4o's:

bash
$ pnpm tsx evaluator.ts --hash o1 --prompt "openai-base-prompt:8" --pdf

 Processing llm results: 15:32.182 (m:ss.mmm) 
🧮 Collecting results: 1.002ms

Eval scores for pdf files:
┌────────────────────┬──────────────┬──────────────┬─────────────────┐
│ schema_consistency │ metric_count │ metric_match │ acceptance_rate │
│       double       │    double    │    double    │     double      │
├────────────────────┼──────────────┼──────────────┼─────────────────┤
│               0.96 │         0.97 │         0.65 │             0.8 │
└────────────────────┴──────────────┴──────────────┴─────────────────┘

⚫◗ Adding score to duckdb: 47.028ms

While o1-preview's results were substantially better than GPT-4o's for the subset of documents it could process, Claude Sonnet still scored higher on that same subset:

bash
$ pnpm tsx evaluator.ts --hash V3ZpJYMXI --prompt "anthropic-base-prompt:6" --pdf

 Processing llm results: 9:22.180 (m:ss.mmm) 
🧮 Collecting results: 973.689ms

Eval scores for pdf files:
┌────────────────────┬──────────────┬──────────────┬─────────────────┐
│ schema_consistency │ metric_count │ metric_match │ acceptance_rate │
│       double       │    double    │    double    │     double      │
├────────────────────┼──────────────┼──────────────┼─────────────────┤
│                1.0 │         0.94 │         0.67 │             0.9 │
└────────────────────┴──────────────┴──────────────┴─────────────────┘

⚫◗ Adding score to duckdb: 72.189ms

What's next

Our next set of experiments is around "chained prompts" or "prompt routing": having the LLM identify which prompt from our library to use for a specific type of document, then doing several passes with these simpler, more specific prompts to extract different sets of data.
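
A hedged sketch of what that routing step could look like (the prompt library, the classification prompt, and the callModel wrapper are all illustrative):

typescript
// Illustrative prompt library: each entry is a simpler prompt aimed at one document type.
const promptLibrary = {
  income_statement: "Extract revenue, COGS, gross profit, and net income...",
  balance_sheet: "Extract cash, total assets, total liabilities, and equity...",
  investor_update: "Extract the KPIs the company highlights for investors...",
} as const;

type DocumentKind = keyof typeof promptLibrary;

// Hypothetical wrapper around the LLM pipeline; stubbed here.
async function callModel(prompt: string): Promise<string> {
  return "income_statement";
}

// Pass 1: a small classification prompt picks the route.
async function classifyDocument(documentText: string): Promise<DocumentKind> {
  const routingPrompt =
    `Classify this document as one of: ${Object.keys(promptLibrary).join(", ")}.\n\n` +
    documentText;
  const kind = (await callModel(routingPrompt)).trim();
  return kind in promptLibrary ? (kind as DocumentKind) : "investor_update";
}

// Passes 2..n: run the simpler, more specific prompt for that document type.
async function extractWithRouting(documentText: string): Promise<string> {
  const kind = await classifyDocument(documentText);
  return callModel(`${promptLibrary[kind]}\n\n${documentText}`);
}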

We also want to experiment with Docling and Markitdown for pre-processing different types of documents. They do something very similar to Anthropic's PDF beta feature: they parse documents in formats like PDF, PPT, Excel, and Word, and export them to Markdown or JSON.

These exports make the documents easier for LLMs to understand, and we theorize they will increase extraction accuracy across the board.
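
For example, a minimal sketch of the kind of pre-processing step we have in mind, assuming the Markitdown CLI is installed and on the PATH (it prints Markdown to stdout):

typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Convert a supported document (PDF, PPT, Excel, Word, ...) to Markdown,
// then feed the Markdown to the extraction prompt instead of the raw file.
async function toMarkdown(path: string): Promise<string> {
  const { stdout } = await run("markitdown", [path]);
  return stdout;
}

// Usage (illustrative file name):
// const markdown = await toMarkdown("board_deck.pptx");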

Closing

Language models continue to get bigger and better; we expect scores to improve, too. We'll continue refining, adopting new strategies, and experimenting towards our goal of 100% accuracy, and engineering more "AI-enhanced" features for Standard Metrics.

Thanks for reading. Now, coffee time!

Coffee Time: Coffee Cupping

Method: Cupping
Preparation Time: 20 minutes

Coffee cupping is the standard way to evaluate coffee. It gives you a shared language for describing how a coffee tastes.

Steps
  1. Grind your favorite coffee beans to a medium grind, similar to what you'd use for a V60.
  2. Add 10 grams of the ground coffee to a cup.
  3. Fill the cup evenly with hot water to the brim and let it sit for four minutes.
  4. Break "the crust" of grounds floating on the surface with a round spoon, pushing and swirling gently as you lower your nose close to the cup.
  5. Take note of the aromas you get from the coffee.
  6. Remove the floating grounds from the top with the spoon.
  7. Wait for the coffee to cool down to a temperature you're comfortable drinking.
  8. Slurp the coffee from the spoon. Be bold when slurping; it sprays the coffee across your palate so you "catch" more flavors.
  9. Take note again of the flavors you get from the slurp.

When tasting, try to identify: acidity (bright? citrusy? winey?), body (light? heavy? creamy?), flavors (fruity? nutty? chocolate? earthy?), and the aftertaste.

Here's the full video What Is Coffee Cupping from Onyx Lab.

