Agentic code review in production: orchestration, evaluation, and the cost of being wrong

What "agentic" actually buys you over a linter, why single-model approaches stall, and why false positives — not raw model capability — determine whether the system stays in the loop.

"Agentic" has become a marketing label, but in code review it carries a precise technical meaning: the system, not the user, decides which tools to invoke against a change, in what order, and how to weight their findings. A linter runs a fixed pipeline. A single-pass language-model reviewer reads the diff and emits comments end-to-end. An agentic reviewer chooses between a compiler, a type checker, a test runner, a secret scanner, a static analyzer, and one or more language-model calls — then arbitrates their disagreements before surfacing a review comment.

The model is one tool among several. The system's value is in the arbitration policy that decides which findings reach the developer.
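
To make that concrete, here is a minimal sketch of an arbitration policy. The tool names, trust weights, and threshold are all illustrative, not drawn from any particular system:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    tool: str          # which tool produced the finding
    message: str       # what would be shown to the developer
    confidence: float  # tool-reported confidence in [0, 1]

# Illustrative trust weights: deterministic tools outrank the model.
TOOL_WEIGHT = {
    "type_checker": 1.0,
    "secret_scanner": 1.0,
    "static_analyzer": 0.9,
    "llm_reviewer": 0.5,
}

SURFACE_THRESHOLD = 0.6  # findings scoring below this are never shown

def arbitrate(findings: list[Finding]) -> list[Finding]:
    """Weight each finding by tool trust times confidence; keep the survivors."""
    scored = [(TOOL_WEIGHT.get(f.tool, 0.3) * f.confidence, f) for f in findings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for score, f in scored if score >= SURFACE_THRESHOLD]
```

The weights are the policy: keeping llm_reviewer below the deterministic tools is what prevents an unverified model claim from outranking a type error.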

The orchestration problem

Single-model approaches stall on three axes that pull in different directions: accuracy, latency, and cost. A frontier model gives the strongest multi-step reasoning on a non-trivial change but typically adds several seconds of latency and an order-of-magnitude cost premium per call; a small open-weights model returns in under a second but misses subtle invariants. Three routing strategies cover most of the production space:

- Task classification: a cheap classifier inspects the diff and routes trivial changes (formatting, renames, dependency bumps) to the small model, and security-sensitive or architecturally broad changes to the frontier model.
- Confidence-based fallback: the small model reviews first, and the change escalates to the stronger model whenever the small model's confidence falls below a threshold.
- Evaluation-driven weighting: offline evaluation suites score each model on each change category, and the router's weights are updated from those scores.

In practice, production systems combine all three: classify first, fall back on low confidence, and let offline evaluations reshape the weights every release cycle.
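
A sketch of how the three compose. Here classify_change, small_model, and frontier_model are hypothetical stand-ins for a diff classifier and two model tiers:

```python
CONFIDENCE_FLOOR = 0.7  # retuned each release from offline evaluation runs

# Hypothetical stand-ins for a diff classifier and two model endpoints.
def classify_change(diff: str) -> str:
    return "refactor"  # e.g. "trivial", "refactor", "security"

def small_model(diff: str) -> tuple[list[str], float]:
    return (["nit: prefer early return"], 0.9)  # (comments, confidence)

def frontier_model(diff: str) -> list[str]:
    return ["possible race on shared cache update"]

def review(diff: str) -> list[str]:
    """Classify first, try the cheap tier, escalate on low confidence."""
    if classify_change(diff) == "security":
        return frontier_model(diff)  # sensitive changes skip the cheap tier
    comments, confidence = small_model(diff)
    if confidence < CONFIDENCE_FLOOR:
        return frontier_model(diff)  # fall back to the strong model
    return comments
```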

Grounding with static analysis and retrieval

A pure language-model review hallucinates fixes — proposing API calls that do not exist, citing version-specific behavior incorrectly, suggesting refactors that break other call sites the model never saw. Two anchors push the hallucination rate down.

First, deterministic static analyzers run in parallel with the language model. Type errors, null dereferences, missing await, unused imports — these are cheap, deterministic, and not worth a model call. The agent uses their output as ground truth and frames its review around facts the static analyzer surfaced, not facts the model invented.
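
A sketch of the pattern, assuming mypy and ruff as the deterministic checks; whatever analyzers the repository already runs would slot in the same way:

```python
import concurrent.futures
import subprocess

# Assumed analyzer commands; substitute the repo's existing checks.
ANALYZERS = {
    "types": ["mypy", "--no-error-summary", "."],
    "lint": ["ruff", "check", "."],
}

def run_analyzers() -> dict[str, str]:
    """Run deterministic checks in parallel; their output is ground truth."""
    def run(cmd: list[str]) -> str:
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run, cmd) for name, cmd in ANALYZERS.items()}
        return {name: f.result() for name, f in futures.items()}

def build_prompt(diff: str, facts: dict[str, str]) -> str:
    """Frame the model's review around analyzer output, not invented facts."""
    fact_block = "\n".join(f"[{name}] {out}" for name, out in facts.items() if out)
    return (
        "Review this diff. Treat the analyzer findings below as ground truth; "
        "do not contradict them.\n\n"
        f"Findings:\n{fact_block}\n\nDiff:\n{diff}"
    )
```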

Second, retrieval-augmented generation over the repository itself: prior review threads, commit messages, and the project's design documents. Most code review observations are not novel. The same patterns get flagged across files — null-safety regressions, missing index migrations, inconsistent error wrapping. Retrieving prior review comments scoped to the touched files, modules, or owners shifts the model from generic best-practice advice to comments that match the codebase's established conventions.
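
The simplest version is exact path scoping over an index of prior comments; a production system would layer embeddings and module or owner scoping on top. ReviewMemory and its contents are illustrative:

```python
from collections import defaultdict

class ReviewMemory:
    """Toy retrieval index over prior review comments, keyed by file path."""

    def __init__(self) -> None:
        self._by_path: dict[str, list[str]] = defaultdict(list)

    def add(self, path: str, comment: str) -> None:
        self._by_path[path].append(comment)

    def retrieve(self, touched_paths: list[str], limit: int = 5) -> list[str]:
        """Return prior comments scoped to the files this diff touches."""
        hits = [c for p in touched_paths for c in self._by_path[p]]
        return hits[:limit]

memory = ReviewMemory()
memory.add("db/migrations.py", "New columns need a backfill migration.")
print(memory.retrieve(["db/migrations.py"]))
```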

False positives as the dominant cost

Developer trust in an automated reviewer collapses non-linearly: a handful of bad comments is usually enough for the team to start dismissing the bot reflexively. The arithmetic is unforgiving: a 5% false-positive rate at twenty review comments per pull request is one bogus flag per PR. Within a sprint, the team stops reading the bot's output.

Three controls keep the rate manageable:

- Precision-biased thresholds: a finding is surfaced only when its arbitration score clears a bar tuned for precision over recall; the default action is silence.
- Grounding requirements: a model-originated comment must cite a deterministic analyzer finding or a retrieved precedent from the repository's own review history before it is eligible to surface.
- A closed feedback loop: every developer reaction (resolved, dismissed, ignored) is logged per finding category and folded back into the thresholds and suppression rules.

The third is where most teams underinvest. Without the loop, the false-positive rate is whatever the underlying model happens to produce. With it, the rate trends down release over release.
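
A toy version of that loop, tracking developer reactions per finding category and muting categories whose observed precision drops. The floor and sample minimum are illustrative:

```python
from collections import Counter

class SuppressionPolicy:
    """Mute finding categories whose observed precision falls below a floor."""

    PRECISION_FLOOR = 0.7
    MIN_SAMPLES = 20  # don't judge a category on a handful of reactions

    def __init__(self) -> None:
        self.accepted: Counter[str] = Counter()
        self.dismissed: Counter[str] = Counter()

    def record(self, category: str, was_accepted: bool) -> None:
        """Log one developer reaction to a surfaced finding."""
        (self.accepted if was_accepted else self.dismissed)[category] += 1

    def should_surface(self, category: str) -> bool:
        total = self.accepted[category] + self.dismissed[category]
        if total < self.MIN_SAMPLES:
            return True  # not enough signal yet; keep surfacing
        return self.accepted[category] / total >= self.PRECISION_FLOOR
```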

Compliance as a routing constraint

Compliance is not a separate stage tacked on at the end. It belongs at the same layer as task classification: a first-class routing input.

Code touching regulated data — protected health information, payment card numbers, EU resident identifiers — has to route differently. GDPR shapes both transfer (no diffs leaving the controller's processors without a Data Processing Agreement) and retention (logged prompts and completions are themselves processing activity). HIPAA obligations — Business Associate Agreements and minimum-necessary access — determine which model endpoints are eligible to process diffs containing PHI. PCI-DSS controls dictate cardholder-data redaction before model invocation. SOC 2 controls dictate operational guarantees on the reviewer service itself. Bolting any of this on after the fact produces gaps that surface during the first audit, not during development.
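
In code, this looks like an eligibility filter evaluated at routing time. The endpoint registry and control names below are illustrative, and classifying the diff (a PHI or card-number scanner) is assumed to happen upstream:

```python
# Illustrative registry: which contractual controls each endpoint satisfies.
ENDPOINTS = {
    "onprem-reviewer": {"baa", "dpa", "pci_scope"},
    "cloud-frontier": {"dpa"},
}

# Controls each data classification demands before a model call is allowed.
REQUIRED = {
    "phi": {"baa"},               # HIPAA: Business Associate Agreement
    "card_data": {"pci_scope"},   # PCI-DSS: redaction and in-scope processing
    "eu_personal": {"dpa"},       # GDPR: Data Processing Agreement
}

def eligible_endpoints(classifications: set[str]) -> list[str]:
    """Return endpoints whose controls cover every classification in the diff."""
    needed = set().union(*(REQUIRED[c] for c in classifications))
    return [name for name, controls in ENDPOINTS.items() if needed <= controls]

# A diff containing PHI may only route to the endpoint operating under a BAA.
print(eligible_endpoints({"phi"}))          # ['onprem-reviewer']
print(eligible_endpoints({"eu_personal"}))  # ['onprem-reviewer', 'cloud-frontier']
```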

Closing

Agentic code review is a coordination system with a language model embedded in it, not a language model with tools attached. The hard problems are not in the model — they are in the routing, the grounding, the evaluation, and the feedback loops that decide what the system does next time. Teams that treat the model as the product underinvest in everything that actually determines whether the product stays in use.