Introducing Fundamental Research Labs

May 26, 2025

At this year's AI for Finance Symposium - the 2nd Workshop on LLMs and Generative AI for Finance (ACM ICAIF '25) - the focus was squarely on turning AI from prototypes into reliable workflow tools. Speakers from BlackRock, Fidelity, Lion Global, CLSA, IBM, and MIT Sloan converged on three themes: evaluation discipline, randomness control, and context engineering. Together, these themes captured how buy-side, sell-side, and research teams are tackling the practical challenges of bringing AI into production.

(1) Evaluation is becoming the core of AI deployment

Across buy- and sell-side teams, the focus has shifted from testing individual models to evaluating entire workflows. Speakers from Fidelity and Lion Global emphasized that answers must stay tied to evidence, behave consistently across multi-turn questions, and be testable with point-in-time data. Many firms described evaluation frameworks that combine reproducibility checks, backtestable retrieval, and clearer documentation of assumptions. The takeaway was consistent: the ability to evaluate models rigorously - not new model capability - determines whether AI reaches production in the investment process.

(2) Randomness control is now a gating requirement

A recurring theme, highlighted by IBM's Raffi Khatchadourian and Rolando Franco, was the need to control nondeterminism. Investment workflows cannot rely on outputs that shift from run to run. Output drift - where a model gives different answers to the same question - showed up even in well-tuned systems. IBM demonstrated that smaller 7–8B models can deliver fully deterministic outputs, while larger models tend to "out-reason themselves" and lose stability. To manage this, teams are adding cross-provider validation, agreement tests across multiple models, drift monitoring, and multi-turn reproducibility checks. Firms like BlackRock stressed that stability and predictability matter more than raw capability.
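The run-to-run and cross-model checks described above can be sketched as a small harness. This is a minimal illustration, not any firm's actual tooling: `ask` is a hypothetical stand-in for a real model call, and exact string match is the simplest possible agreement criterion (production systems would use semantic comparison).

```python
from collections import Counter

def agreement_rate(answers):
    """Fraction of runs matching the most common answer (1.0 = no drift)."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

def drift_check(ask, prompt, runs=5, threshold=1.0):
    """Drift monitoring: re-ask the same prompt and flag unstable outputs."""
    answers = [ask(prompt) for _ in range(runs)]
    rate = agreement_rate(answers)
    return {"agreement": rate, "stable": rate >= threshold, "answers": answers}

def cross_model_agreement(asks, prompt):
    """Agreement test across multiple models/providers on one prompt."""
    answers = [ask(prompt) for ask in asks]
    return agreement_rate(answers)
```

Gating on `stable` before an answer reaches a workflow is one way to make determinism a requirement rather than an aspiration.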
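The evaluation discipline from theme (1) can be sketched the same way. The sketch below assumes a bracket-style citation convention (e.g. `[10K-2023]`) and a simple evidence record; both are illustrative assumptions, not a standard described at the workshop.

```python
import re
from dataclasses import dataclass
from datetime import date

@dataclass
class Evidence:
    doc_id: str       # identifier the model may cite, e.g. "10K-2023"
    published: date   # publication date, needed for point-in-time checks
    text: str

def point_in_time_ok(evidence, as_of):
    """Backtestable retrieval: no retrieved document may postdate the as-of date."""
    return all(e.published <= as_of for e in evidence)

def answer_grounded(answer, evidence):
    """Evidence tie: the answer must cite at least one known retrieved document,
    and every bracketed citation must resolve to the retrieved set."""
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    known = {e.doc_id for e in evidence}
    return bool(cited) and cited <= known
```

Running such checks over a fixed question set is one concrete form of the workflow-level evaluation the speakers described.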
(3) Context engineering drives real workflow quality

Speakers across BlackRock, Bernstein, and CLSA showed that performance improves dramatically when AI systems are fed structured internal data and firm-specific frameworks. Teams are investing in cleaner retrieval pipelines, consistent formatting of filings and broker research, and deterministic extraction layers that reduce noise. Several firms demonstrated layered agents that separate retrieval, reasoning, memory, and validation. Adding lightweight memory systems often increased correctness from ~30% to ~85% by helping agents follow firm-specific processes rather than relying on one-shot prompts.
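The layered-agent idea can be sketched as a pipeline whose retrieval, memory, reasoning, and validation stages are separate, swappable components. All names here are hypothetical; the callables stand in for whatever models or services a firm actually wires in, and `memory` is the lightweight store of firm-specific process steps the speakers credited with the correctness gains.

```python
class LayeredAgent:
    """Toy layered agent: retrieval, memory, reasoning, and validation
    are kept as separate layers rather than one one-shot prompt."""

    def __init__(self, retrieve, reason, validate, memory=None):
        self.retrieve = retrieve    # question -> list of evidence passages
        self.reason = reason        # (question, evidence, steps) -> draft answer
        self.validate = validate    # (answer, evidence) -> bool
        self.memory = memory or {}  # firm-specific process steps, keyed by task

    def run(self, question, task):
        evidence = self.retrieve(question)
        steps = self.memory.get(task, [])  # recall the firm's process, if stored
        answer = self.reason(question, evidence, steps)
        if not self.validate(answer, evidence):
            raise ValueError("validation layer rejected the draft answer")
        return answer
```

Because each layer is injected, the retrieval pipeline or validation rule can be tested and replaced independently, which is what makes the workflow evaluable end to end.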