In this post, we will dive a bit into model evaluation. Building with Large Language Models (LLMs) presents two major challenges:
- How do you know if the output is any good?
- How do you prevent API costs from spiraling out of control?
Today, I tackled both of these head-on by building a robust evaluation and cost-tracking framework. It was another interesting stretch of the learning journey, and an important step in moving from a fun prototype to a reliable tool.
The Goal: Confidence in Quality and Cost
I wanted a system that could answer these questions automatically:
- When I change a prompt, does the analysis quality get better or worse?
- Which part of the analysis is the most expensive?
- Can I quantify the trade-off between cost and quality for different LLM models?
To do this, my AI assistants helped me build three key components: a golden data generator, a model evaluator, and a centralized cost registry.
Step 1: Establishing the “Golden Truth”
You can’t evaluate quality without a benchmark. I created a script, `create_golden_data.py`, to generate a “golden set” of ideal analysis outputs for a few representative tickers (like AMZN, MSFT, and MA).
```python
# scripts/create_golden_data.py
async def create_golden_file(ticker: str, email: str):
    """Run comprehensive analysis and save results as a golden file."""
    logger.info("Creating golden file for ticker: %s", ticker)
    # ... setup ...
    response = await comprehensive_analysis(
        ticker=ticker,
        include_financials_history=True,
        # ... other flags ...
    )
    # Save the complete, ideal JSON output
    output_path = os.path.join(GOLDEN_DIR, f"{ticker}.json")
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(response.model_dump(), f, indent=2)
    # ...
```
This script runs the full analysis pipeline and saves the structured JSON output. This becomes the “correct” answer that future runs will be compared against.
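Later, the evaluation harness reads these files back for comparison. A minimal loader might look like the sketch below; the `GOLDEN_DIR` constant and the `load_golden` name follow how the helper is used in Step 3, but the project's actual implementation isn't shown in this post.

```python
# Hypothetical sketch of the golden-file loader; directory and signature are assumptions.
import json
import os

GOLDEN_DIR = "data/golden"  # assumed location of the saved golden files


def load_golden(ticker: str) -> dict:
    """Load the saved 'ideal' analysis output for a ticker."""
    path = os.path.join(GOLDEN_DIR, f"{ticker}.json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```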
Step 2: Tracking Every Penny with `CostTracker` and `CostRegistry`
LLM costs are measured per million tokens, which can add up deceptively fast. I needed a way to track this meticulously.
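To put rough numbers on it (the price here is purely illustrative, not an actual rate):

```python
# Back-of-the-envelope cost of a single large request, using a made-up price.
price_per_million_input_tokens = 2.50   # USD, illustrative only
tokens_in_one_analysis_pass = 40_000
print(tokens_in_one_analysis_pass / 1_000_000 * price_per_million_input_tokens)  # -> 0.1
```

A dime per pass sounds harmless, until it is multiplied by every analyzer, formatter, ticker, and iteration of prompt tweaking.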
First, I created `CostTracker`, a simple class to record each API request’s model, token counts, and purpose.
```python
# src/utils/cost_tracker.py
from dataclasses import dataclass, field
from typing import List


@dataclass
class LLMRequest:
    model: str
    prompt_tokens: int
    completion_tokens: int
    request_type: str


@dataclass
class CostTracker:
    requests: List[LLMRequest] = field(default_factory=list)

    def add_request(self, request: LLMRequest) -> None:
        self.requests.append(request)

    def get_total_cost(self) -> float:
        # ... calculates cost based on model pricing ...
```
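The pricing logic itself is elided above. Presumably `get_total_cost` sums per-million-token prices over the recorded requests, along the lines of this sketch; the price table is invented for illustration and is not the project's real pricing.

```python
# Hypothetical sketch of the elided pricing calculation; prices are made up.
PRICE_PER_MILLION_USD = {
    # model name: (input price, output price) per one million tokens
    "example-model": (2.50, 10.00),
}


def estimate_total_cost(requests: List[LLMRequest]) -> float:
    """Sum the cost of all recorded requests using per-million-token pricing."""
    total = 0.0
    for req in requests:
        input_price, output_price = PRICE_PER_MILLION_USD.get(req.model, (0.0, 0.0))
        total += req.prompt_tokens / 1_000_000 * input_price
        total += req.completion_tokens / 1_000_000 * output_price
    return total
```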
But with multiple analyzers (for Business, MD&A, Risks) running concurrently, passing a `CostTracker` instance around would be messy. The solution? A singleton `CostRegistry`.
```python
# src/utils/cost_registry.py
import threading


class CostRegistry:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Singleton pattern implementation
        ...

    def register_tracker(self, component_name: str, tracker: CostTracker):
        self.trackers[component_name] = tracker

    def get_total_cost(self) -> float:
        # ... aggregates costs from all registered trackers ...
```
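The `__new__` body is elided above. A common way to implement it, and roughly what I have in mind (the exact project code may differ), is double-checked locking:

```python
# Sketch of a thread-safe singleton __new__; the real implementation may differ.
import threading


class CostRegistry:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking: only one instance is ever created,
        # even if several components initialize concurrently.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance.trackers = {}  # component name -> CostTracker
                    cls._instance = instance
        return cls._instance

    def reset(self) -> None:
        # Clear recorded trackers between evaluation runs.
        self.trackers.clear()
```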
Now, each `BaseAnalyzer` and `BaseFormatter` automatically registers its own `CostTracker` with the central registry upon initialization. This is clean, decoupled, and works seamlessly with concurrent execution.
```python
# src/utils/base_analyzer.py
class BaseAnalyzer(Generic[T]):
    def __init__(self, ...):
        self.cost_tracker = CostTracker()
        # Register this analyzer's cost tracker with the registry
        component_name = self.__class__.__name__.lower().replace("analyzer", "")
        CostRegistry().register_tracker(component_name, self.cost_tracker)
```
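End to end, the flow looks roughly like this; the token counts and model name are placeholders, not real measurements from the project.

```python
# Illustration only: record one request on a tracker, then read the aggregate
# back through the registry. Numbers and model name are placeholders.
tracker = CostTracker()
tracker.add_request(
    LLMRequest(
        model="example-model",
        prompt_tokens=38_000,
        completion_tokens=3_600,
        request_type="BaseAnalyzer",
    )
)
CostRegistry().register_tracker("base", tracker)
print(f"${CostRegistry().get_total_cost():.4f}")  # same singleton instance, visible from anywhere
```

Because every `CostRegistry()` call returns the same instance, the evaluation script can pull a full cost breakdown without any analyzer having to pass its tracker up the call stack.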
Step 3: The Evaluation Harness
With the golden set and cost tracking in place, the final piece was the evaluation script, `model_evaluation.py`. This script:
- Takes a list of tickers to test.
- Runs the full, current analysis pipeline for each one.
- Loads the corresponding “golden” file.
- Compares the results, calculating metrics like `parse_success_rate` (did the section get extracted at all?) and `field_fill_rate` (how many optional fields were successfully populated?); see the sketch after this list.
- Pulls cost data from the `CostRegistry`.
- Saves a detailed JSON report with metrics and cost breakdowns.
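The comparison logic isn't shown in this post, but my mental model of `compare_results` is roughly the sketch below, assuming the analysis output is a dict of sections, each itself a dict of fields; the real metric definitions may differ.

```python
# Hypothetical sketch of compare_results; field handling and weighting are assumptions.
def compare_results(golden: dict, actual: dict) -> tuple[float, float]:
    sections = list(golden.keys())

    # parse_success_rate: fraction of golden sections that were extracted at all
    parsed = sum(1 for s in sections if actual.get(s) is not None)
    parse_success_rate = parsed / len(sections) if sections else 0.0

    # field_fill_rate: of the fields populated in the golden output,
    # how many are also populated in the actual output
    golden_fields = [
        (s, k)
        for s in sections
        for k, v in (golden[s] or {}).items()
        if v is not None
    ]
    filled = sum(1 for s, k in golden_fields if (actual.get(s) or {}).get(k) is not None)
    field_fill_rate = filled / len(golden_fields) if golden_fields else 0.0

    return parse_success_rate, field_fill_rate
```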
```python
# scripts/model_evaluation.py
async def evaluate_ticker(ticker: str, email: str):
    # Reset the cost registry for a clean run
    cost_registry = CostRegistry()
    cost_registry.reset()

    # Run the analysis
    actual = await comprehensive_analysis(...)

    # Load the benchmark
    golden = load_golden(ticker)

    # Compare and calculate metrics
    parse_rate, fill_rate = compare_results(golden, actual.model_dump())

    # Get costs from the central registry
    total_cost = cost_registry.get_total_cost()

    return {
        "ticker": ticker,
        "parse_success_rate": parse_rate,
        "field_fill_rate": fill_rate,
        "cost": {"total_cost_usd": total_cost, ...},
    }
```
The Result: Data-Driven Development
Now, after I make a change—whether it’s tweaking a prompt, swapping a model, or refactoring code—I can run a single command:
```bash
python scripts/model_evaluation.py --tickers AAPL MSFT --email "my@email.com"
```
And get a report like this:
```json
{
  "run_time": "2025-06-07T20:57:18.945306",
  "elapsed_seconds": 326.41,
  "tickers": ["MA", "ARWR", "VST", "MSFT", "AMZN", "UBER"],
  "results": [
    {
      "ticker": "MA",
      "parse_success_rate": 0.8333333333333334,
      "field_fill_rate": 1.0,
      "latency_s": 280.47,
      "errors": {},
      "cost": {
        "total_cost_usd": 0.0813,
        "breakdown": {
          "base": {
            "total_cost": 0.0258,
            "total_tokens": 41638,
            "cost_by_type": {
              "BaseAnalyzer": 0.0258
            },
            "tokens_by_type": {
              "BaseAnalyzer": 41638
            },
            "request_count": 1
          },
          "incomestatement": {
            "total_cost": 0.0135,
            "total_tokens": 20608,
            "cost_by_type": {
              "IncomeStatementFormatter": 0.0135
            },
            "tokens_by_type": {
              "IncomeStatementFormatter": 20608
            },
            "request_count": 13
          },
          "balancesheet": {
            "total_cost": 0.0196,
            "total_tokens": 30754,
            "cost_by_type": {
              "BalanceSheetFormatter": 0.0196
            },
            "tokens_by_type": {
              "BalanceSheetFormatter": 30754
            },
            "request_count": 13
          },
          "cashflowstatement": {
            "total_cost": 0.0198,
            "total_tokens": 30983,
            "cost_by_type": {
              "CashFlowStatementFormatter": 0.0198
            },
            "tokens_by_type": {
              "CashFlowStatementFormatter": 30983
            },
            "request_count": 13
          },
          "equitystatement": {
            "total_cost": 0.0026,
            "total_tokens": 3910,
            "cost_by_type": {
              "EquityStatementFormatter": 0.0026
            },
            "tokens_by_type": {
              "EquityStatementFormatter": 3910
            },
            "request_count": 7
          }
        },
        "metrics": {
          "parse_success_per_dollar": 10.2501,
          "field_fill_per_dollar": 12.3001
        }
      }
    }
  ]
}
```
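The per-dollar metrics at the bottom appear to be quality divided by cost (0.8333 / 0.0813 ≈ 10.25 and 1.0 / 0.0813 ≈ 12.30), which I read as roughly:

```python
# How I read the derived efficiency metrics in the report; not code copied from the project.
def efficiency_metrics(parse_rate: float, fill_rate: float, total_cost_usd: float) -> dict:
    return {
        "parse_success_per_dollar": round(parse_rate / total_cost_usd, 4),
        "field_fill_per_dollar": round(fill_rate / total_cost_usd, 4),
    }
```

That gives a single number for weighing cheap-but-sloppy runs against expensive-but-thorough ones across models and prompts.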
This framework gives me a much clearer view of the project’s performance and cost dynamics: I can see immediately whether a change breaks something or meaningfully shifts the cost. It has been a good learning experience building this LLM-based application, and I’m excited to see how it helps guide the project forward.