Agentic code review in production: orchestration, evaluation, and the cost of being wrong

What "agentic" actually buys you over a linter, why single-model approaches stall, and why false positives — not raw model capability — determine whether the system stays in the loop.

"Agentic" has become a marketing label, but in code review it carries a precise technical meaning: the system, not the user, decides which tools to invoke against a change, in what order, and how to weight their findings. A linter runs a fixed pipeline. A single-pass language-model reviewer reads the diff and emits comments end-to-end. An agentic reviewer chooses between a compiler, a type checker, a test runner, a secret scanner, a static analyzer, and one or more language-model calls — then arbitrates their disagreements before surfacing a review comment.

The model is one tool among several. The system's value is in the arbitration policy that decides which findings reach the developer.
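
To make that concrete, here is a minimal sketch of an arbitration policy. The tool names, trust weights, and threshold are all illustrative, not drawn from any particular system:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    tool: str          # which tool produced the finding
    message: str       # what would be shown to the developer
    confidence: float  # tool-reported confidence in [0, 1]

# Illustrative trust weights: deterministic tools outrank the model.
TOOL_WEIGHT = {
    "type_checker": 1.0,
    "secret_scanner": 1.0,
    "static_analyzer": 0.9,
    "llm_reviewer": 0.5,
}

SURFACE_THRESHOLD = 0.6  # findings scoring below this are never shown

def arbitrate(findings: list[Finding]) -> list[Finding]:
    """Weight each finding by tool trust times confidence; keep the survivors."""
    scored = [(TOOL_WEIGHT.get(f.tool, 0.3) * f.confidence, f) for f in findings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for score, f in scored if score >= SURFACE_THRESHOLD]
```

The weights are the policy: keeping llm_reviewer below the deterministic tools is what prevents an unverified model claim from outranking a type error.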

The orchestration problem

Single-model approaches stall on three axes that pull in different directions: accuracy, latency, and cost. A frontier model gives the strongest multi-step reasoning on a non-trivial change but typically adds several seconds of latency and an order-of-magnitude cost premium per call; a small open-weights model returns in under a second but misses subtle invariants. Three routing strategies cover most of the production space:

- Task classification: a cheap classifier inspects the diff and routes trivial changes (formatting, renames, dependency bumps) to the small model, and security-sensitive or architecturally broad changes to the frontier model.
- Confidence-based fallback: the small model reviews first, and the change escalates to the stronger model whenever the small model's confidence falls below a threshold.
- Evaluation-driven weighting: offline evaluation suites score each model on each change category, and the router's weights are updated from those scores.

In practice, production systems combine all three: classify first, fall back on low confidence, and let offline evaluations reshape the weights every release cycle.
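
A sketch of how the three compose. Here classify_change, small_model, and frontier_model are hypothetical stand-ins for a diff classifier and two model tiers:

```python
CONFIDENCE_FLOOR = 0.7  # retuned each release from offline evaluation runs

# Hypothetical stand-ins for a diff classifier and two model endpoints.
def classify_change(diff: str) -> str:
    return "refactor"  # e.g. "trivial", "refactor", "security"

def small_model(diff: str) -> tuple[list[str], float]:
    return (["nit: prefer early return"], 0.9)  # (comments, confidence)

def frontier_model(diff: str) -> list[str]:
    return ["possible race on shared cache update"]

def review(diff: str) -> list[str]:
    """Classify first, try the cheap tier, escalate on low confidence."""
    if classify_change(diff) == "security":
        return frontier_model(diff)  # sensitive changes skip the cheap tier
    comments, confidence = small_model(diff)
    if confidence < CONFIDENCE_FLOOR:
        return frontier_model(diff)  # fall back to the strong model
    return comments
```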

Grounding with static analysis and retrieval

A pure language-model review hallucinates fixes — proposing API calls that do not exist, citing version-specific behavior incorrectly, suggesting refactors that break other call sites the model never saw. Two anchors push the hallucination rate down.

First, deterministic static analyzers run in parallel with the language model. Type errors, null dereferences, missing await, unused imports — these are cheap, deterministic, and not worth a model call. The agent uses their output as ground truth and frames its review around facts the static analyzer surfaced, not facts the model invented.
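
A sketch of the pattern, assuming mypy and ruff as the deterministic checks; whatever analyzers the repository already runs would slot in the same way:

```python
import concurrent.futures
import subprocess

# Assumed analyzer commands; substitute the repo's existing checks.
ANALYZERS = {
    "types": ["mypy", "--no-error-summary", "."],
    "lint": ["ruff", "check", "."],
}

def run_analyzers() -> dict[str, str]:
    """Run deterministic checks in parallel; their output is ground truth."""
    def run(cmd: list[str]) -> str:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run, cmd) for name, cmd in ANALYZERS.items()}
        return {name: f.result() for name, f in futures.items()}

def build_prompt(diff: str, facts: dict[str, str]) -> str:
    """Frame the model's review around analyzer output, not invented facts."""
    fact_block = "\n".join(f"[{name}] {out}" for name, out in facts.items() if out)
    return (
        "Review this diff. Treat the analyzer findings below as ground truth; "
        "do not contradict them.\n\n"
        f"Findings:\n{fact_block}\n\nDiff:\n{diff}"
    )
```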

Second, retrieval-augmented generation over the repository itself: prior review threads, commit messages, and the project's design documents. Most code review observations are not novel. The same patterns get flagged across files — null-safety regressions, missing index migrations, inconsistent error wrapping. Retrieving prior review comments scoped to the touched files, modules, or owners shifts the model from generic best-practice advice to comments that match the codebase's established conventions.
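
The simplest version is exact path scoping over an index of prior comments; a production system would layer embeddings and module or owner scoping on top. ReviewMemory and its contents are illustrative:

```python
from collections import defaultdict

class ReviewMemory:
    """Toy retrieval index over prior review comments, keyed by file path."""

    def __init__(self) -> None:
        self._by_path: dict[str, list[str]] = defaultdict(list)

    def add(self, path: str, comment: str) -> None:
        self._by_path[path].append(comment)

    def retrieve(self, touched_paths: list[str], limit: int = 5) -> list[str]:
        """Return prior comments scoped to the files this diff touches."""
        hits = [c for p in touched_paths for c in self._by_path[p]]
        return hits[:limit]

memory = ReviewMemory()
memory.add("db/migrations.py", "New columns need a backfill migration.")
print(memory.retrieve(["db/migrations.py"]))
```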

False positives as the dominant cost

Developer trust in an automated reviewer collapses non-linearly: a handful of bad comments is usually enough for the team to start dismissing the bot reflexively. The arithmetic is unforgiving: a 5% false-positive rate at twenty review comments per pull request is one bogus flag per PR. Within a sprint, the team stops reading the bot's output.

Three controls keep the rate manageable:

- Precision-biased thresholds: a finding is surfaced only when its arbitration score clears a bar tuned for precision over recall; the default action is silence.
- Grounding requirements: a model-originated comment must cite a deterministic analyzer finding or a retrieved precedent from the repository's own review history before it is eligible to surface.
- A closed feedback loop: every developer reaction (resolved, dismissed, ignored) is logged per finding category and folded back into the thresholds and suppression rules.

The third is where most teams underinvest. Without the loop, the false-positive rate is whatever the underlying model happens to produce. With it, the rate trends down release over release.
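
A toy version of that loop, tracking developer reactions per finding category and muting categories whose observed precision drops. The floor and sample minimum are illustrative:

```python
from collections import Counter

class SuppressionPolicy:
    """Mute finding categories whose observed precision falls below a floor."""

    PRECISION_FLOOR = 0.7
    MIN_SAMPLES = 20  # don't judge a category on a handful of reactions

    def __init__(self) -> None:
        self.accepted: Counter[str] = Counter()
        self.dismissed: Counter[str] = Counter()

    def record(self, category: str, was_accepted: bool) -> None:
        """Log one developer reaction to a surfaced finding."""
        (self.accepted if was_accepted else self.dismissed)[category] += 1

    def should_surface(self, category: str) -> bool:
        total = self.accepted[category] + self.dismissed[category]
        if total < self.MIN_SAMPLES:
            return True  # not enough signal yet; keep surfacing
        return self.accepted[category] / total >= self.PRECISION_FLOOR
```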

Compliance as a routing constraint

Compliance is not a separate stage tacked on at the end. It belongs at the same layer as task classification: a first-class routing input.

Code touching regulated data — protected health information, payment card numbers, EU resident identifiers — has to route differently. GDPR shapes both transfer (no diffs leaving the controller's processors without a Data Processing Agreement) and retention (logged prompts and completions are themselves processing activity). HIPAA obligations — Business Associate Agreements and minimum-necessary access — determine which model endpoints are eligible to process diffs containing PHI. PCI-DSS controls dictate cardholder-data redaction before model invocation. SOC 2 controls dictate operational guarantees on the reviewer service itself. Bolting any of this on after the fact produces gaps that surface during the first audit, not during development.
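
In code, this looks like an eligibility filter evaluated at routing time. The endpoint registry and control names below are illustrative, and classifying the diff (a PHI or card-number scanner) is assumed to happen upstream:

```python
# Illustrative registry: which contractual controls each endpoint satisfies.
ENDPOINTS = {
    "onprem-reviewer": {"baa", "dpa", "pci_scope"},
    "cloud-frontier": {"dpa"},
}

# Controls each data classification demands before a model call is allowed.
REQUIRED = {
    "phi": {"baa"},               # HIPAA: Business Associate Agreement
    "card_data": {"pci_scope"},   # PCI-DSS: redaction and in-scope processing
    "eu_personal": {"dpa"},       # GDPR: Data Processing Agreement
}

def eligible_endpoints(classifications: set[str]) -> list[str]:
    """Return endpoints whose controls cover every classification in the diff."""
    needed = set().union(*(REQUIRED[c] for c in classifications))
    return [name for name, controls in ENDPOINTS.items() if needed <= controls]

# A diff containing PHI may only route to the endpoint operating under a BAA.
print(eligible_endpoints({"phi"}))          # ['onprem-reviewer']
print(eligible_endpoints({"eu_personal"}))  # ['onprem-reviewer', 'cloud-frontier']
```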

Closing

Agentic code review is a coordination system with a language model embedded in it, not a language model with tools attached. The hard problems are not in the model — they are in the routing, the grounding, the evaluation, and the feedback loops that decide what the system does next time. Teams that treat the model as the product underinvest in everything that actually determines whether the product stays in use.