SR 11-7 · Model Risk Management · AI Governance

SR 11-7 and AI: How to Validate LLM-Based Models Under OCC/Fed Guidance

XeroML Team

The adoption of large language model (LLM) based agents in financial services is accelerating. From automated credit decisioning to fraud detection narratives and customer-facing advisory bots, institutions are deploying AI systems that would have been unthinkable five years ago. But the regulatory framework governing these deployments is not new. SR 11-7, the Federal Reserve and OCC’s supervisory guidance on model risk management, remains the definitive standard — and examiners are already applying it to LLM-based systems.

The challenge is that SR 11-7 was written for traditional statistical models. Applying its requirements to non-deterministic, prompt-driven, black-box AI agents demands a fundamentally different validation approach. This guide maps each SR 11-7 pillar to practical LLM validation techniques that compliance officers and AI engineers can implement today.

What SR 11-7 Requires: A Quick Refresher

Issued in 2011 by the Federal Reserve Board and adopted by the OCC, SR 11-7 establishes three core pillars of model risk management:

  1. Model Development, Implementation, and Use — Models must be built on sound theory, tested rigorously before deployment, and used only within their intended scope.
  2. Model Validation — Independent review must confirm that a model performs as expected, including evaluation of conceptual soundness, outcomes analysis, and benchmarking.
  3. Model Risk Governance — Ongoing oversight, including board-level accountability, model inventories, and escalation procedures for model failures.

SR 11-7 defines a “model” broadly as any quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. Under this definition, LLM-based agents that generate credit scores, risk assessments, or lending recommendations are unambiguously within scope.

“The use of models invariably presents model risk, which is the potential for adverse consequences from decisions based on incorrect or misused model outputs and reports.” — SR 11-7, Section 1

How LLM-Based Agents Differ From Traditional Models

Before mapping SR 11-7 to LLM validation, it is critical to understand why traditional model validation frameworks break down when applied to generative AI systems.

Non-Deterministic Outputs

Traditional regression or scorecard models produce the same output for the same input every time. LLMs do not. Even with temperature set to zero, floating-point effects in batched inference, silent provider-side model updates, and other API-level behavior can produce varying outputs for identical inputs. This makes reproducibility — a cornerstone of SR 11-7 validation — significantly harder.
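
One practical response is to measure reproducibility rather than assume it. The sketch below is a minimal illustration; generate is a placeholder for whatever temperature-zero client wrapper the institution already uses, and the 0.95 threshold is illustrative, not a standard.

    from collections import Counter
    from typing import Callable

    def consistency_rate(generate: Callable[[str], str], prompt: str, runs: int = 20) -> float:
        """Re-run the same prompt and return the share of outputs that exactly
        match the most common response. `generate` is a placeholder for the
        institution's own temperature-0 client wrapper."""
        outputs = [generate(prompt) for _ in range(runs)]
        most_common_count = Counter(outputs).most_common(1)[0][1]
        return most_common_count / runs

    # A rate below an agreed threshold (say 0.95) flags the prompt and model
    # configuration for reproducibility review before validation proceeds.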

Black-Box Architecture

While logistic regression coefficients can be inspected directly, the internal reasoning of a 70-billion-parameter language model cannot. Feature importance analysis, a standard validation technique, does not translate cleanly to transformer architectures. Explainability requires new approaches such as attention analysis, chain-of-thought extraction, and input attribution methods.

Prompt-Driven Behavior

In traditional models, behavior is governed by trained weights and fixed feature engineering. In LLM agents, behavior is heavily influenced by prompts — system instructions, few-shot examples, and retrieval-augmented generation (RAG) context. A single prompt change can fundamentally alter model behavior in ways that traditional change management processes do not capture.
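
One way to bring prompt changes inside change management is to treat the full prompt configuration as a versioned artifact. The sketch below is illustrative rather than a prescribed method: it derives a stable fingerprint that can be logged with every decision so behavior changes can be traced back to a specific prompt version.

    import hashlib

    def prompt_fingerprint(system_prompt: str, few_shot_examples: list[str]) -> str:
        """Stable identifier for the exact prompt configuration in effect.
        Logging it alongside each decision ties outputs back to the prompt
        version that produced them."""
        material = system_prompt + "\n---\n" + "\n---\n".join(few_shot_examples)
        return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]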

Emergent Capabilities and Failure Modes

LLMs exhibit emergent behaviors that are difficult to predict from their training data. They can hallucinate plausible-sounding but factually incorrect information, exhibit biases not present in explicit training examples, and fail unpredictably on edge cases that would be straightforward for traditional models.

Mapping SR 11-7 Requirements to LLM Validation

Pillar 1: Model Development and Documentation

SR 11-7 requires thorough documentation of model design, theory, data, assumptions, and limitations. For LLM-based agents, this translates to:

  • System prompt documentation — Full version-controlled records of all system prompts, including the rationale for prompt design decisions, tested alternatives, and known limitations.
  • Architecture documentation — Which foundation model is used, what version, fine-tuning details (if any), RAG pipeline configuration, and guardrail implementations.
  • Training data provenance — For fine-tuned models, complete documentation of training data sources, preprocessing steps, and bias assessments. For off-the-shelf models, documentation of known training data characteristics and limitations.
  • Intended use and scope boundaries — Explicit documentation of what the agent is and is not designed to do, including prohibited use cases and fallback procedures.
  • Input/output logging — Complete audit trails of every input sent to and output received from the LLM, with timestamps, user context, and session metadata. This is not optional; it is foundational to every other validation activity.
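
As a concrete illustration of that last point, the record below sketches one possible audit-trail schema. The field names are examples rather than a regulatory standard, and production systems would typically add guardrail outcomes and RAG sources and write to append-only or write-once storage.

    import json
    import time
    import uuid
    from dataclasses import dataclass, asdict

    @dataclass
    class LLMAuditRecord:
        # Illustrative fields; extend as needed for your deployment.
        session_id: str
        model_id: str          # foundation model and version
        prompt_version: str    # hash or tag of the system prompt in use
        user_context: dict     # channel, application ID, reviewer, etc.
        input_text: str
        output_text: str
        timestamp: float

    def log_llm_call(record: LLMAuditRecord, path: str = "llm_audit.jsonl") -> None:
        # Append-only JSONL; tamper-evidence (checksums, WORM storage) not shown.
        with open(path, "a") as f:
            f.write(json.dumps(asdict(record)) + "\n")

    log_llm_call(LLMAuditRecord(
        session_id=str(uuid.uuid4()),
        model_id="example-llm-2025-01",
        prompt_version="prompt-v3.2",
        user_context={"channel": "credit_decisioning"},
        input_text="...",
        output_text="...",
        timestamp=time.time(),
    ))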

Pillar 2: Model Validation

Independent validation under SR 11-7 must include conceptual soundness evaluation, outcomes analysis, and benchmarking. For LLMs, each of these takes a different form.

Conceptual Soundness Evaluation:

  • Review prompt engineering methodology for logical soundness
  • Evaluate RAG retrieval quality and relevance scoring
  • Assess guardrail design and coverage of known risk scenarios
  • Validate that the LLM architecture is appropriate for the use case (e.g., generative models should not be used where deterministic rule-based systems would suffice)

Outcomes Analysis:

  • Compare LLM outputs against known-correct outcomes on historical data
  • Measure accuracy, consistency, and compliance of outputs across demographic segments
  • Test for disparate impact using matched-pair testing methodologies (see the sketch after this list)
  • Evaluate adverse action notice quality against ECOA requirements
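
The sketch below shows one simple outcomes-analysis check on logged decisions: approval-rate ratios by segment relative to a reference group. The column names and data are illustrative, and this complements rather than replaces matched-pair testing.

    import pandas as pd

    def adverse_impact_ratios(df: pd.DataFrame, group_col: str, approved_col: str,
                              reference_group: str) -> pd.Series:
        """Approval-rate ratio of each group relative to the reference group.
        Ratios below roughly 0.8 (the four-fifths rule of thumb) warrant review."""
        rates = df.groupby(group_col)[approved_col].mean()
        return rates / rates[reference_group]

    # Toy data: any segment whose ratio falls below 0.8 would be flagged for review.
    decisions = pd.DataFrame({
        "segment":  ["A", "A", "A", "B", "B", "B"],
        "approved": [1,   1,   1,   1,   0,   1],
    })
    print(adverse_impact_ratios(decisions, "segment", "approved", reference_group="A"))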

Benchmarking:

  • Implement champion-challenger testing (detailed below)
  • Compare LLM performance against traditional model baselines
  • Benchmark against human expert decisions on representative samples

Pillar 3: Ongoing Monitoring and Governance

  • Drift detection — Monitor input distributions, output distributions, and performance metrics over time to detect model degradation
  • Performance dashboards — Real-time visibility into accuracy, consistency, latency, and compliance metrics
  • Escalation procedures — Automated alerts when performance drops below defined thresholds, with clear escalation paths to model risk committees
  • Periodic revalidation — Scheduled comprehensive revalidation at least annually, with event-driven revalidation triggered by model updates, prompt changes, or performance anomalies

Champion-Challenger Testing for LLM Agents

Champion-challenger testing is a well-established practice in traditional model risk management. The concept is straightforward: run the production model (champion) alongside a candidate model (challenger) on the same inputs, and compare performance. For LLM agents, this methodology requires adaptation.

Designing Effective Tests

Test dataset construction is critical. Build datasets that include:

  • Representative production traffic across all customer segments
  • Edge cases that have historically caused errors
  • Protected class scenarios to test for fair lending compliance
  • Adversarial inputs designed to trigger hallucinations or policy violations

Evaluation criteria must go beyond accuracy to include:

  • Consistency — Does the challenger produce more stable outputs across repeated runs?
  • Compliance — Does the challenger generate outputs that meet regulatory requirements (e.g., complete adverse action reasons)?
  • Explainability — Can the challenger’s decisions be explained to consumers and regulators?
  • Latency and cost — Does the challenger meet operational requirements?

Running Parallel Evaluations

Best practice is to run champion and challenger models simultaneously on a shadow basis — the challenger processes the same inputs but its outputs are not served to customers. This eliminates the risk of exposing customers to an unvalidated model while generating direct comparison data.

Log all inputs, champion outputs, and challenger outputs to your compliance observability platform. Automated comparison reports should be generated weekly, with statistical significance testing on all key metrics.
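
A minimal sketch of the shadow pattern, assuming the champion and challenger are exposed as callables and that only the champion's output is ever served (serving and logging infrastructure are out of scope here):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ShadowResult:
        input_id: str
        champion_output: str
        challenger_output: str

    def shadow_evaluate(inputs: list[dict],
                        champion: Callable[[dict], str],
                        challenger: Callable[[dict], str]) -> list[ShadowResult]:
        """Run both models on identical inputs. The caller serves only the
        champion output to customers; both outputs are retained for offline
        comparison and weekly reporting."""
        return [
            ShadowResult(
                input_id=item["id"],
                champion_output=champion(item),
                challenger_output=challenger(item),
            )
            for item in inputs
        ]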

Promotion Criteria

Define explicit, quantitative criteria for promoting a challenger to champion status:

  • Minimum sample size (typically 10,000+ decisions for lending models)
  • Statistical significance thresholds for performance improvement (see the sketch after this list)
  • No degradation in fair lending metrics across any protected class
  • Compliance team sign-off on output quality
  • Model risk committee approval
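
For the statistical significance check, a standard two-proportion z-test is often sufficient when the metric is a rate such as accuracy or approval rate. The sketch below uses only the Python standard library; the counts are illustrative.

    from math import sqrt
    from statistics import NormalDist

    def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
        """Two-sided p-value for the difference between two proportions, e.g.
        challenger vs. champion accuracy on the same shadow sample."""
        p_a, p_b = success_a / n_a, success_b / n_b
        pooled = (success_a + success_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    # Illustrative counts: challenger correct on 9,310 of 10,000 shadow cases,
    # champion on 9,180 of 10,000. Significance alone does not trigger promotion;
    # the fair-lending and sign-off criteria above must also be met.
    p_value = two_proportion_z_test(9310, 10000, 9180, 10000)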

Ongoing Monitoring: What Examiners Expect

Recent examination findings make clear that regulators expect robust ongoing monitoring programs for AI models. The following areas receive particular scrutiny.

Input Drift Detection

Monitor the distribution of inputs to your LLM agents. If the characteristics of incoming applications, queries, or transactions shift meaningfully from the data on which the model was validated, performance may degrade. Track:

  • Feature distribution statistics (mean, variance, percentiles) over rolling windows
  • New category values or out-of-vocabulary inputs
  • Changes in input volume or timing patterns
  • Shifts in the demographic composition of the applicant population
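
A common way to quantify the first item in that list is the population stability index (PSI). The sketch below assumes a continuous feature (for example, requested loan amount) and cites widely used rule-of-thumb thresholds; categorical features and out-of-vocabulary inputs need separate handling.

    import numpy as np

    def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                                   bins: int = 10) -> float:
        """PSI between the validation-period distribution ('expected') and a recent
        production window ('actual'). Rules of thumb: < 0.1 stable, 0.1-0.25
        investigate, > 0.25 significant shift."""
        edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        exp_pct = np.clip(exp_pct, 1e-6, None)   # avoid log(0)
        act_pct = np.clip(act_pct, 1e-6, None)
        return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))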

Output Drift Detection

Even when inputs remain stable, LLM outputs can drift due to subtle changes in API behavior, context window utilization, or upstream data pipeline changes. Monitor:

  • Output distribution statistics (approval rates, risk scores, denial reason distributions)
  • Output consistency (variance across repeated evaluations of identical inputs)
  • Compliance metrics (completeness of required disclosures, accuracy of reason codes)
  • Sentiment and tone consistency in customer-facing outputs
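
For categorical outputs such as the denial reason distributions and reason codes noted above, a chi-square goodness-of-fit test against the validation-period distribution is one simple drift signal. The sketch below assumes scipy is available and that the set of reason-code categories is fixed.

    from scipy.stats import chisquare

    def reason_code_shift_pvalue(baseline_counts: dict, current_counts: dict) -> float:
        """Goodness-of-fit p-value comparing current denial-reason frequencies to
        the validation-period baseline. Low p-values indicate a distribution
        shift worth investigating."""
        categories = sorted(baseline_counts)
        observed = [current_counts.get(c, 0) for c in categories]
        baseline_total = sum(baseline_counts.values())
        total_current = sum(observed)
        expected = [baseline_counts[c] / baseline_total * total_current for c in categories]
        return float(chisquare(f_obs=observed, f_exp=expected).pvalue)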

Performance Degradation Signals

Establish automated alerting for:

  • Accuracy dropping below validation-period baselines by more than a defined threshold (commonly 2-5%)
  • Approval rate shifts that diverge from expected seasonal patterns
  • Disparate impact ratios approaching the four-fifths threshold
  • Increase in override rates by human reviewers
  • Customer complaint trends related to AI-driven decisions
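
One way such alerting might be wired together is sketched below. The baseline values, thresholds, and metric names are placeholders to be drawn from your validation report and model risk policy, not recommended settings.

    BASELINES = {"accuracy": 0.94, "approval_rate": 0.62, "adverse_impact_ratio": 0.91}
    THRESHOLDS = {"accuracy": 0.03, "approval_rate": 0.05, "adverse_impact_ratio": 0.80}

    def degradation_alerts(current: dict[str, float]) -> list[str]:
        """Return alert messages when current metrics breach configured limits.
        Accuracy and approval-rate alerts fire on moves beyond the allowed delta;
        the adverse impact ratio alerts as it approaches the four-fifths level."""
        alerts = []
        if BASELINES["accuracy"] - current["accuracy"] > THRESHOLDS["accuracy"]:
            alerts.append("accuracy below validation-period baseline")
        if abs(current["approval_rate"] - BASELINES["approval_rate"]) > THRESHOLDS["approval_rate"]:
            alerts.append("approval rate shift exceeds expected range")
        if current["adverse_impact_ratio"] < THRESHOLDS["adverse_impact_ratio"]:
            alerts.append("adverse impact ratio approaching four-fifths threshold")
        return alerts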

Documentation and Audit Trail

Every monitoring observation, alert, investigation, and remediation action must be documented. Examiners expect to see a clear trail from anomaly detection through root cause analysis to resolution. This documentation should be generated automatically where possible and reviewed by model risk management staff.

How Compliance Observability Platforms Meet SR 11-7 Requirements

The validation and monitoring requirements described above are technically feasible but operationally demanding. Manual processes cannot scale to the volume and velocity of LLM-based decisions at most financial institutions.

A compliance observability platform like XeroML addresses this by providing:

Automated Input/Output Logging — Every interaction with your LLM agents is captured with full context, creating the audit trail that examiners require without burdening engineering teams with custom logging infrastructure.

Real-Time Drift Detection — Continuous statistical monitoring of input and output distributions, with automated alerting when drift exceeds configurable thresholds. This replaces quarterly manual reviews with continuous oversight.

Bias and Fair Lending Testing — Automated disparate impact analysis across protected classes, integrated into your CI/CD pipeline so that every model update is tested before deployment. For more detail on fair lending testing, see our ECOA compliance guide.

Champion-Challenger Infrastructure — Built-in support for running parallel model evaluations, with automated comparison dashboards and statistical significance testing.

Exam-Ready Reporting — Pre-built report templates aligned to SR 11-7 requirements, including model inventory reports, validation summaries, ongoing monitoring dashboards, and issue tracking. For a comprehensive framework, review our SR 11-7 AI Model Governance Guide.

Version Control and Change Management — Full version history of prompts, model configurations, and guardrails, with automated impact analysis for proposed changes.

Building Your SR 11-7 Compliance Program for AI

Implementing a compliant model risk management program for LLM-based agents is not a one-time project. It requires a systematic approach.

Step 1: Inventory and Classify

Identify all LLM-based systems in use or in development. Classify each by risk tier using your institution’s existing model risk tiering framework, adjusting criteria as needed for AI-specific risk factors (non-determinism, explainability limitations, data privacy exposure).
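
For illustration, an inventory entry for an LLM agent might extend the existing schema with a few AI-specific fields. Everything below is an example, not a required format.

    from dataclasses import dataclass, field

    @dataclass
    class LLMInventoryEntry:
        model_name: str
        owner: str
        risk_tier: str                      # per the existing tiering framework
        foundation_model: str               # provider model and version
        deployment_pattern: str             # e.g., prompt-only, RAG, fine-tuned
        intended_use: str
        prohibited_uses: list[str] = field(default_factory=list)
        nondeterminism_assessed: bool = False
        last_validation_date: str = ""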

Step 2: Establish Validation Standards

Develop LLM-specific validation standards that map to SR 11-7 requirements. These should be reviewed by your model risk committee and approved at the appropriate governance level. Standards should address prompt validation, output quality testing, bias testing, and documentation requirements.

Step 3: Implement Monitoring Infrastructure

Deploy continuous monitoring for all production LLM agents. At minimum, this must include input/output logging, output drift detection, performance tracking, and automated alerting. Manual monitoring is insufficient for the volume of decisions most institutions process.

Step 4: Train Your Teams

Both model developers and validators need training on LLM-specific validation techniques. Traditional model validators may not have experience with prompt engineering, attention analysis, or generative AI evaluation methodologies.

Step 5: Prepare for Examination

Assemble exam-ready documentation packages for each LLM agent, including development documentation, validation reports, ongoing monitoring summaries, and issue logs. Conduct mock examinations to identify gaps before regulators arrive.

Conclusion

SR 11-7 compliance for LLM-based agents is not optional, and the bar is rising. Institutions that invest in robust validation and monitoring infrastructure now will be better positioned for both regulatory examinations and the operational risk management that AI systems demand.

The key is to recognize that while SR 11-7’s principles are technology-agnostic, the specific techniques for applying those principles to LLMs require significant adaptation. Input/output logging, continuous drift monitoring, automated bias testing, and champion-challenger infrastructure are not luxuries — they are the minimum standard that examiners will expect.

For institutions building or evaluating their AI model risk management programs, XeroML provides the compliance observability infrastructure needed to meet SR 11-7 requirements at scale. From automated logging and drift detection to exam-ready reporting, the platform is purpose-built for the unique challenges of LLM validation in financial services.

For related guidance, explore our fair lending risk assessment guide, understand the true cost of compliance breaches, and download our Fair Lending Risk Assessment Template.