Working draft. Please do not circulate or reproduce without permission.

For questions, feedback, or access to the full manuscript, contact anjali@analoguegroup.org.

Measurement for Shared Compute

This guide develops a measurement framework for multi-tenant compute infrastructure. While we focus on AI inference systems, the principles apply to any resource allocation decision that affects both an individual request and the broader shared capacity. We'll cover routing, rate limits, caching, batching and similar mechanisms.

The central challenge is that resource allocation decisions affect shared system state, so they must be evaluated at the policy level rather than at the user or request level (some exceptions apply).

The framework draws from my research into state-dependent inference costs and my working experience as a pricing data scientist. An earlier version of these ideas was discussed on Live with Tim O'Reilly.

01

Shared Compute Is Not a Standard A/B Test

In shared compute, the "treatment" is typically a dynamic rule rather than a static action. Hereafter, we refer to any such dynamic rule as a "policy." For example, a router, rate limit, caching logic, or batching algorithm selects an action based on both the request-level workload and the live system state, and may also consider historical data.

Consider an example. A scheduler sends latency-sensitive, high-value requests to a geographically near cluster when capacity is available, then falls back to a farther cluster during congestion. The action A is whatever the policy π selects given X and S:
A = π(X, S)
X = pre-decision workload and user information
S = live decision-time system state: queue, cache, region, capacity, incidents, and recent demand

In the scheduler example, cluster assignment is not simply "treated." The same request may be routed to a nearby cluster when the queue is shallow and sent to a farther cluster when it is congested.
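The scheduler rule can be written as a small function to make the dependence on both X and S concrete. This is a minimal sketch: the field names, the congestion threshold, and the choice to send all other traffic to the far cluster are illustrative assumptions, not part of any real router.

```python
from dataclasses import dataclass

@dataclass
class Request:              # X: pre-decision workload and user information
    latency_sensitive: bool
    high_value: bool

@dataclass
class SystemState:          # S: live decision-time system state
    near_queue_depth: int
    near_queue_limit: int   # hypothetical congestion threshold

def policy(x: Request, s: SystemState) -> str:
    """A = pi(X, S): pick a cluster from the request and the live state."""
    congested = s.near_queue_depth >= s.near_queue_limit
    if x.latency_sensitive and x.high_value and not congested:
        return "near"
    return "far"            # assumption: everything else goes to the far cluster

req = Request(latency_sensitive=True, high_value=True)
shallow = SystemState(near_queue_depth=3, near_queue_limit=50)
deep = SystemState(near_queue_depth=80, near_queue_limit=50)
print(policy(req, shallow))  # near
print(policy(req, deep))     # far
```

The same request receives different actions in different states, which is why "treated versus untreated" is not a well-defined label at the request level.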

System State Is Both Confounder and Outcome

Shared inference systems make the timing of state measurement central. At time t, system state is an input into the routing, admission, or fallback decision at hand. At time t + 1, that same state may reflect the consequences of earlier decisions.

A_t = π(X_t, S_t)
S_{t+1} = f(S_t, A_t, W_t)
W_t = arrivals, failures, provider events, cache eviction, and other external disturbances

This loop is what differentiates shared compute from a conventional A/B test. Queue depth before a routing decision can be a confounder. Queue depth after the policy has run can be part of the policy's effect that you'll want to measure. When an experiment modifies the capacity state that later requests draw from, the unit-level treatment is no longer well defined (see Section 3, Experimental Designs, for recommendations and mitigations).
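A toy simulation can make the feedback loop tangible. The sketch below assumes a single near-cluster queue, a fixed service rate, and uniformly random arrivals; all of these numbers are hypothetical and chosen only to show that the state the policy reads is also the state it writes.

```python
import random

def policy(queue_depth: int, limit: int) -> str:
    """Action at time t: route to the near cluster unless its queue is at the limit."""
    return "near" if queue_depth < limit else "far"

def step(queue_depth: int, action: str, arrivals: int, service_rate: int) -> int:
    """Next state: the near queue grows only with traffic routed to it."""
    routed_here = arrivals if action == "near" else 0
    return max(0, queue_depth + routed_here - service_rate)

def simulate(limit: int, steps: int = 200, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    depth, trace = 0, []
    for _ in range(steps):
        arrivals = rng.randint(0, 12)      # external disturbance (hypothetical demand)
        action = policy(depth, limit)      # depth is a pre-decision input here
        depth = step(depth, action, arrivals, service_rate=5)
        trace.append(depth)                # depth is a post-decision outcome here
    return trace

loose = simulate(limit=40)
tight = simulate(limit=10)
print(max(loose), max(tight))
```

Both runs see the same arrivals (same seed), so any difference in the queue trace is produced by the policy itself.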

Fat-Tailed Costs and Capacity Risk

For AI infrastructure and inference products, a plain difference-in-means readout can be fragile when the metric is tokens, cost, or GPU-hours per user. That is because these metrics are highly variable and can be dominated by a small number of very large users, long-context workflows, or high-concurrency windows.

A note on vantage point. This section extends my earlier analysis of why fat-tailed costs emerge at scale, which was written from outside: I do not operate these systems, and the distributional claims are reasoned from serving architecture and public information rather than from operator telemetry. Treat the shape claims as hypotheses to check against your own data. The actual distribution of tokens per request, and how strongly cost depends on system state, are empirical questions that only real data can settle.

Tokens per request plausibly follow something like a truncated power law. The mean is finite because the context window acts as an upper limit for every request, but the variance is large and likely growing as use cases become more heterogeneous and context windows lengthen with each model release. The distribution is estimable: the mean exists, but it likely converges slowly, and a handful of long-context or agentic workflows can dominate any short experiment.

True resource cost per request is harder to measure. It depends on live system state: batch concurrency, KV-cache pressure, prefill/decode balance, GPU efficiency, and temporal demand. There is no fixed upper limit, because the effective capacity boundary moves with this system state. As a result, the conditional mean of resource cost may not converge inside the experiment window.

Tokens per request: bounded support, finite mean, fat tails → mean estimates converge slowly
Resource cost per request: cost = f(request, peak system state), state non-stationary → conditional mean may not converge
Implication: use robust estimation for token cost; treat resource cost as a capacity-state outcome, not a per-request scalar
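To see what slow convergence looks like in practice, the sketch below draws token counts from a truncated Pareto distribution and compares the means of many small "experiments." The shape parameter, the minimum, and the cap are hypothetical; the truncated power law is the hypothesis stated above, not a measured distribution.

```python
import random

def sample_tokens(rng, alpha=1.2, x_min=50.0, x_max=200_000.0):
    """Inverse-CDF draw from a Pareto(alpha) truncated to [x_min, x_max]."""
    u = rng.random()
    tail = (x_min / x_max) ** alpha
    return x_min * (1.0 - u * (1.0 - tail)) ** (-1.0 / alpha)

def top_share(values, fraction=0.01):
    """Share of total tokens contributed by the largest `fraction` of requests."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(7)
draws = [sample_tokens(rng) for _ in range(100_000)]

# Means of 100 repeated small experiments of 1,000 requests each.
small_means = [sum(draws[i:i + 1000]) / 1000 for i in range(0, 100_000, 1000)]
print("range of 1k-request means:", round(min(small_means)), "to", round(max(small_means)))
print("share of tokens from top 1% of requests:", round(top_share(draws), 2))
```

If the spread of the small-sample means is wide relative to the effect you hope to detect, a plain difference in means over a short window is unlikely to be reliable.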

This matters because KV-cache memory scales with sequence length and batch size. Unrelated long contexts arriving concurrently can push a resource pool past its memory boundary even when every individual request is within bounds. The tail risk is therefore better characterized as resource overcommitment than as a single expensive spike. Overcommitment can cause elevated fallback rates, or even a crash, erasing the margin earned on other requests. This is a business-critical risk to monitor.
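A back-of-the-envelope calculation illustrates the overcommitment point. The sketch uses the usual per-token KV-cache size for a transformer (keys and values, per layer, per KV head) with a hypothetical model shape, a hypothetical pool budget, and a hypothetical per-request context limit; it ignores paging, weights, and activations.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Keys and values (factor 2) are stored per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
PER_TOKEN = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
POOL_BUDGET = 40 * GIB          # hypothetical KV-cache budget for one pool
MAX_CONTEXT = 131_072           # hypothetical per-request token limit

def pool_usage(seq_lengths: list[int]) -> int:
    """Total KV-cache bytes for a set of concurrently resident sequences."""
    return sum(n * PER_TOKEN for n in seq_lengths)

batch = [100_000, 100_000, 100_000, 100_000]   # four unrelated long contexts
assert all(n <= MAX_CONTEXT for n in batch)    # each request is individually in bounds
print(PER_TOKEN)                               # 131072 bytes per token
print(pool_usage(batch[:3]) <= POOL_BUDGET)    # True: three fit
print(pool_usage(batch) <= POOL_BUDGET)        # False: the fourth overcommits the pool
```

No single request is an outlier here; the boundary is crossed by concurrency alone.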

TK: explain the behavioral tail risk (dominant users) vs. system-state tail risk (peak-concurrency windows) and develop a POV on how to deal with each.

Measurement Implications

  1. Make the time window a unit of analysis. Switchback and cluster-level experiments are discussed in Section 3; difference-in-differences, synthetic control, propensity score weighting, and regression discontinuity in Section 4.
  2. Ensure the measurement period observes the boundary. You should include peak-concurrency periods; a decision made entirely off-peak provides no information about the states where the policy could fail. If the boundary was never observed, the readout should state that explicitly.
  3. Report boundary events as treatment effects, not incident noise. Queue overflow, cache pressure, and error spikes during peak windows are outcomes of the allocation policy under evaluation. It's misleading to exclude these as outliers.
  4. Prefer capacity-impact signals to per-request cost prediction. The marginal resource cost of a request is statistically impractical to predict, but a request's impact on aggregate capacity state (predicted sequence length, compute time consumed) is measurable and a more practical guardrail. This also yields a better understanding of the relationship between workload characteristics and resource requirements.
02

Name the Decision First

A useful discipline is to ask what you would measure if the experiment were perfect. If every request could be randomly routed to a performant cluster with no interference, no capacity limits, and full observability, what would that experiment tell you? That line of questioning exposes the causal target and the measurement constraints, so the design can approximate the ideal experiment and mitigate the measurement challenges.

  1. Name the decision. Is the decision to ship broadly, target a segment, or change a dynamic rule?
  2. Choose the comparison. The counterfactual might be the old policy, untreated users, a different cluster, a prior model checkpoint, or a different time block (peak vs. off-peak hours).
  3. Select the decision metric. What is the business goal? Common success metrics include task completion, retention, performance against internal evals, throughput, and LTV. The right choice depends heavily on stakeholders and the organization's existing measurement infrastructure.
  4. Use unit economics as a guardrail. A policy can improve the decision metric while burning too much compute. Pair it with latency, errors, GPU-hours, or cost per successful request.
  5. End with a rule. Ship, ramp, target, hold out, pause, roll back, or instrument more. The readout should state the recommendation and the tradeoff that would change it. We'll discuss what to do under different scenarios in Section 9.

Decision Pattern Cheat Sheet

Decision pattern | Comparison frame | Potential tradeoff
Ship broadly | Average Treatment Effect (ATE): what happens across the eligible population? | Can hide that only one workload or customer segment benefits enough to justify the cost.
Explain treated traffic | Average Treatment Effect on the Treated (ATT): what happened to users or requests that actually received treatment? | Can be less useful for future rollout if treated users were unusually selected.
Target segment | Conditional Average Treatment Effect (CATE): who benefits enough to receive higher limits or better routing? | Can overfit unless the segment is stable, interpretable, and operationally usable.
Change the rule | Policy value: what is the expected outcome if the system runs rule π? | Needs support in the logs, system guardrails, and a clear cost/value objective.
Policy value, and why it is not ATE or CATE: ATE and CATE are properties of a treatment: what does action A do on average, or for segment x? Policy value is a property of a decision rule: what happens if the system runs rule π, where π looks at context and chooses an action. A dynamic rule only treats some units, in some states, and its value depends on how often those contexts occur and on operational constraints.
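The distinction can be worked through with a toy table. The sketch below assumes two system states with fixed frequencies and hypothetical mean task-completion rates per (state, action). Holding state frequencies fixed is itself an assumption that the feedback loop in Section 1 can break.

```python
# Hypothetical inputs: state frequencies and mean task completion by (state, action).
state_prob = {"shallow_queue": 0.7, "congested": 0.3}
mean_outcome = {
    ("shallow_queue", "near"): 0.95, ("shallow_queue", "far"): 0.90,
    ("congested", "near"): 0.80,     ("congested", "far"): 0.88,
}

def policy_value(rule):
    """V(pi) = sum over states of P(s) * E[Y | s, pi(s)]."""
    return sum(p * mean_outcome[(s, rule(s))] for s, p in state_prob.items())

def ate(action_a, action_b):
    """Average effect of always taking action_a instead of always taking action_b."""
    return policy_value(lambda s: action_a) - policy_value(lambda s: action_b)

def dynamic_rule(s):
    """Route near only when the queue is shallow."""
    return "near" if s == "shallow_queue" else "far"

print(round(ate("near", "far"), 3))          # 0.011
print(round(policy_value(dynamic_rule), 3))  # 0.929
```

In this toy setup the static action "near" looks barely better than "far" on average, while the dynamic rule beats both static choices (0.905 and 0.894) because it takes each action only in the state where that action helps.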
03

Experimental Designs

Experiments still start with the same question: what is the unit that receives the policy, and what other traffic changes because that unit was treated? If the policy does not change shared state, a user-level or request-level randomized controlled trial (RCT) might make sense. If it changes queue, cache, or capacity state, the experiment has to randomize a larger unit or time block.

A core principle: randomize at or above the unit of analysis. If your decision metric is user-level (retention, expansion), randomizing by request is not recommended because the same user receives mixed treatment. If your metric is request-level (latency, errors), randomizing by user is valid but less efficient.
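One common way to keep assignment consistent at the chosen unit is to hash the unit's identifier. The following is a sketch under assumptions: the experiment name, the ID format, and the 50/50 split are hypothetical, and a production system would also need exposure logging.

```python
import hashlib

def assign(unit_id: str, experiment: str, treat_share: float = 0.5) -> str:
    """Deterministic assignment: the same unit always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000   # uniform in [0, 1)
    return "treatment" if bucket < treat_share else "control"

# User-level metric (e.g., retention): key on the user, so every request sees one arm.
requests = [("user-17", "req-a"), ("user-17", "req-b"), ("user-42", "req-c")]
by_user = {req: assign(user, "rate-limit-v2") for user, req in requests}
print(by_user["req-a"] == by_user["req-b"])   # True
```

Keying on the request ID instead would randomize at the request level, which is the mixed-treatment case the principle above warns against for user-level metrics.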

Request RCT

Use for: latency, error rate, cost per request, retry rate, or whether a route succeeds.

Why: request-level randomization gives much more statistical power, especially for high-volume traffic. It is a good fit for isolated serving changes where the main outcome is request-level.

Potential downside: The same account can see mixed treatment, which is a poor user experience and makes user-level or session-level analysis invalid. If the policy materially alters shared state, request-level randomization will also understate spillovers to untreated requests.

Mitigation: use it only when the change is invisible or low-risk to the user and does not materially alter shared state.

User or Workspace RCT

Use for: user-level outcomes like retention, expansion, and sustained usage; or for changes where the same user should keep seeing the same experience (rate limits, prices, model access, priority tiers).

Why: if the outcome is retention, expansion, or sustained usage, the randomization unit should match how the user actually experiences the product.

Potential downside: randomizing by user usually has less statistical power than randomizing by request, and takes longer to read.

Mitigation: use short-run system metrics as early checks, keep a holdout for long-run outcomes.

Cluster RCT

Use for: changes to batching, caching, regional routing, or capacity allocation where the intervention changes a shared capacity pool.

Why: spillovers break user-level randomization. If treated traffic changes queue depth or available capacity, control users in the same resource pool experience the effects of the treated traffic. Randomizing by cluster contains most interference within the randomized unit.

Potential downside: there are fewer independent units, so statistical power is worse. Clusters can also differ by chip generation, geography, user mix, latency, customer value, or incident patterns.

Mitigation: stratify before randomization, analyze at the cluster level, and report system-level outcomes.
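Stratifying before randomization might look like the sketch below. The cluster names and the stratum labels (chip generation and geography) are hypothetical; the point is only that arms are balanced within each stratum rather than across the whole fleet.

```python
import random
from collections import Counter, defaultdict

def stratified_assign(strata: dict[str, str], seed: int = 0) -> dict[str, str]:
    """Within each stratum, shuffle clusters and alternate arms so arms stay balanced."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for cluster, stratum in strata.items():
        groups[stratum].append(cluster)
    assignment = {}
    for stratum in sorted(groups):
        members = sorted(groups[stratum])
        rng.shuffle(members)
        offset = rng.randrange(2)   # randomize which arm gets the extra cluster
        for i, cluster in enumerate(members):
            assignment[cluster] = "treatment" if (i + offset) % 2 == 0 else "control"
    return assignment

clusters = {
    "c1": "gen-a/us", "c2": "gen-a/us", "c3": "gen-a/us", "c4": "gen-a/us",
    "c5": "gen-b/eu", "c6": "gen-b/eu", "c7": "gen-b/eu",
}
arms = stratified_assign(clusters, seed=42)
balance = Counter((clusters[c], arm) for c, arm in arms.items())
print(dict(balance))
```

The analysis should then use one observation per cluster (or per cluster-period), not one per request.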

Time-based RCT (Switchbacks)

Use for: cache policy, global routing configuration, scheduler settings, or any change that has to be turned on for a whole system at once.

Why: when the policy changes queue or cache state for everyone, compare blocks of time under policy A versus policy B. In this design, state after the policy runs is an outcome, not a pre-treatment nuisance variable.

Potential downside: demand changes over time. Hour-of-day, day-of-week, incidents, launches, and seasonal traffic can get mixed into the treatment effect. And carryover is present: queue backlog, retries, and latency may persist across blocks.

Mitigation: randomize across comparable time blocks, balance treatment across weekends and weekdays, record incidents. Insert washout periods between blocks to let queue and cache state clear if necessary. Blocks must be long enough for the policy to take effect but short enough to avoid conflating treatment with demand cycles.
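A switchback schedule with washout can be generated ahead of time. The sketch below pairs consecutive days and, for each hour-of-day slot, flips a coin for which day of the pair runs policy A, so every slot sees both policies equally often. The block length, washout length, and start date are hypothetical, and this pairing does not by itself guarantee weekday/weekend balance.

```python
import random
from datetime import datetime, timedelta

def switchback_schedule(start: datetime, day_pairs: int, block_hours: int = 4,
                        washout_minutes: int = 30, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    slots_per_day = 24 // block_hours
    schedule = []
    for pair in range(day_pairs):
        for slot in range(slots_per_day):
            arms = ["A", "B"]
            rng.shuffle(arms)                    # which day of the pair gets A
            for day_in_pair, arm in enumerate(arms):
                block_start = start + timedelta(days=2 * pair + day_in_pair,
                                                hours=slot * block_hours)
                schedule.append({
                    "policy": arm,
                    "block_start": block_start,
                    # Skip the first minutes so queue and cache state can clear.
                    "measure_from": block_start + timedelta(minutes=washout_minutes),
                    "block_end": block_start + timedelta(hours=block_hours),
                })
    return sorted(schedule, key=lambda b: b["block_start"])

plan = switchback_schedule(datetime(2025, 1, 6), day_pairs=7)
print(len(plan), sum(b["policy"] == "A" for b in plan))   # 84 42
```

Outcomes are then aggregated per block, using only the interval from `measure_from` to `block_end`.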

This is the main takeaway! For a shared compute intervention, do not pretend requests are independently treated.
04

Quasi-Experimental Designs (for When You Can't Randomize)

[I'll rewrite this section.]

The most obvious cases where you can't randomize are when the organization lacks an experimentation engine, when partial treatment would be unethical, or when it would trigger customer backlash.

But the most common scenario, by far, is when the change has already been launched, especially in fast-moving industries like AI where improvements are expected to ship quickly. Another common case is a universal policy that is costly to reverse. If you can reverse it, switchback experiments are still an option even without an untreated group. But if reversal is too expensive (e.g., the change is a large-scale refactoring), you are left with the methods below.

The main thing to understand is that causal inference without randomization requires strong assumptions. You need a clear grasp of what drives the outcomes and how the business operates around those factors, because incorrect assumptions will undermine the validity of your results. It also takes considerable care to produce reliable findings.

Difference-in-Differences

Use for: When you have a natural control unit that looks like the treated unit in trend—e.g., a rollout that hits one region, fleet, or customer tier before another.

Why: It replaces the missing counterfactual by taking the treated unit's pre-period baseline and adding the control unit's growth trend. Unlike a simple before-and-after, it does not require the treated and control to start at the same level.

Potential downside: The parallel trends assumption is doing all the work. If the treated unit was selected because it was already underperforming or growing faster, the trend bias can be large. With data aggregated to a handful of units and periods, conventional standard errors are also unreliable.

Mitigation: Plot pre-trends across multiple pre-periods; if the treated and control diverge before the intervention, DiD is not credible. Check for simultaneous incidents, launches, or demand shifts that break the common trend.

Example. You deploy a new KV-cache eviction policy to the us-east-1 cluster on January 15, but delay it in us-west-2 until February 1. Average latency in us-east-1 drops from 180 ms to 140 ms. But us-west-2 also improved, from 175 ms to 150 ms, because traffic patterns shifted. A naive before-and-after would claim a 40 ms win; DiD subtracts the 25 ms control trend and estimates a 15 ms improvement. If us-east-1 was chosen because it had the worst latency spikes, the pre-trend was already steeper, and the true effect is smaller still.
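Working the arithmetic through with the latencies quoted in the example (a negative number is a latency reduction):

```python
def did(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: change in treated minus change in control."""
    return (treated_post - treated_pre) - (control_post - control_pre)

naive = 140 - 180                     # before-and-after in us-east-1, in ms
estimate = did(180, 140, 175, 150)    # subtract the us-west-2 trend
print(naive, estimate)                # -40 -15
```

This two-by-two calculation gives a point estimate only; it says nothing about uncertainty or about whether the pre-trends were parallel.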

Synthetic Control

Use for: When no single control unit looks enough like the treated unit, but a weighted combination of many donors might—e.g., one region, fleet, or provider changes while many others stay constant.

Why: Instead of relying on a single control unit, it builds a weighted combination of donor units to forge a synthetic counterfactual that closely tracks the treated unit before the intervention.

Potential downside: OLS can overfit the pre-period perfectly, producing a synthetic control that matches noise rather than signal; post-intervention variance then explodes. Inference is hard because there is only one treated unit.

Mitigation: Use regularized weights rather than pure OLS. Show pre-period fit, donor weights, and placebo tests: run the same procedure on every donor unit pretending it was treated. Remove donors with high pre-treatment error. Report the distribution of placebo effects against the treated effect.

Example. You switch the eu-central-1 cluster to a new continuous-batching scheduler. No other cluster uses it. For the six months before the switch, you build a synthetic eu-central-1 from weighted combinations of eu-west-1, us-east-1, and ap-south-1 so that the synthetic cluster tracks your actual GPU utilization and p99 latency closely. After the switch, actual utilization is 12% lower than the synthetic counterfactual, but the synthetic line wobbles because the donor weights overfit a two-week maintenance window. You re-run the procedure pretending each donor cluster got the scheduler; only one placebo shows a gap as large as the real one. With three donors plus the treated unit, that gives a permutation p-value of 2/4 = 0.5, so the evidence is suggestive at best until more donor clusters are available.
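The placebo comparison reduces to a rank-based p-value, and its resolution is limited by the size of the donor pool. The sketch below uses hypothetical gap values and compares raw post-period gaps; ratios of post-period to pre-period error are a common alternative statistic.

```python
def placebo_p_value(treated_gap: float, placebo_gaps: list[float]) -> float:
    """Share of units (treated included) whose |gap| is at least the treated |gap|."""
    extreme = sum(1 for g in placebo_gaps if abs(g) >= abs(treated_gap))
    return (1 + extreme) / (1 + len(placebo_gaps))

treated_gap = -12.0                      # hypothetical post-period gap, in points
three_donors = [3.1, -13.0, 1.8]         # one placebo is as large as the real gap
many_donors = [-13.0] + [1.0] * 38       # same single exceedance, 39 donors

print(placebo_p_value(treated_gap, three_donors))   # 0.5
print(placebo_p_value(treated_gap, many_donors))    # 0.05
```

With n donors the smallest achievable p-value is 1/(n + 1), so a small donor pool cannot deliver a small p-value no matter how large the treated gap is.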

Propensity Score Weighting

Use for: Treatment assignment is non-random but driven by observed covariates; non-compliance in an otherwise randomized setting; overlapping treated and untreated populations with different baseline probabilities.

Why: It collapses the confounder space into a single balancing score. Units with the same propensity score are comparable even if their raw covariates differ, making it possible to weight or match on one dimension instead of many.

Potential downside: Maximizing the predictive accuracy of the propensity score does not improve causal estimation; adding variables that predict treatment but not the outcome inflates variance. Weak overlap produces extreme weights and dangerous extrapolation. Standard errors must account for the two-step estimation.

Mitigation: Check overlap visually; if distributions barely touch, narrow the estimand or switch designs. Trim extreme weights cautiously, but recognize that clipping introduces bias. Bootstrap the entire procedure to get valid standard errors. Include confounders, not pure treatment predictors.

Example. Your router sends high-value enterprise requests to a premium GPU pool. The decision is logged and depends on account tier, estimated sequence length, and time-of-day queue depth. After six months, executives want to know whether premium routing actually improves task completion. A naive comparison shows 94% completion on premium vs. 87% on standard, but premium requests are shorter and submitted at off-peak hours. You estimate propensity scores from the logged routing rule, reweight the standard pool to look like the premium pool, and find the adjusted completion lift is 4 percentage points, not 7. A few premium requests have propensity scores near 0.95 with almost no standard counterparts; you trim those weights and note the estimate is local to the overlap region.
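Reweighting the standard pool to resemble the premium pool corresponds to weighting each standard request by e/(1 - e), where e is its propensity of premium routing. The sketch below uses a small synthetic dataset with two strata and known propensities; the counts are toy numbers chosen so the result can be checked by hand, and they are not the figures from the example.

```python
def att_ipw(rows):
    """rows: (treated, propensity, outcome). Reweight controls by e/(1-e) to match the treated."""
    treated = [y for t, e, y in rows if t == 1]
    control = [(e / (1 - e), y) for t, e, y in rows if t == 0]
    treated_mean = sum(treated) / len(treated)
    control_mean = sum(w * y for w, y in control) / sum(w for w, _ in control)
    return treated_mean - control_mean

def make_rows(treated, propensity, n, successes):
    """n requests in one (arm, stratum) cell, of which `successes` completed."""
    return [(treated, propensity, 1 if i < successes else 0) for i in range(n)]

# Toy data: short off-peak requests (e = 0.8) and long peak requests (e = 0.2).
rows = (make_rows(1, 0.8, 80, 76) + make_rows(0, 0.8, 20, 18)
        + make_rows(1, 0.2, 20, 16) + make_rows(0, 0.2, 80, 60))

naive = (sum(y for t, _, y in rows if t == 1) / 100
         - sum(y for t, _, y in rows if t == 0) / 100)
print(round(naive, 3), round(att_ipw(rows), 3))   # 0.14 0.05
```

The within-stratum lift is 5 points in both strata; the naive 14-point gap comes from premium traffic being concentrated in the easier stratum.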

Regression Discontinuity Design

Use for: Eligibility, limits, priority tiers, or pricing that change sharply at a known cutoff; threshold-based routing or admission rules.

Why: Near the threshold, assignment is as good as random: units just above and just below are otherwise similar. The jump in the outcome at the cutoff identifies a local average treatment effect.

Potential downside: The estimate is local to the cutoff and may not generalize away from it. Entities can manipulate their position around the threshold (bunching). Linear extrapolation can misestimate the true counterfactual slope. In fuzzy designs, treatment probability does not jump cleanly to one.

Mitigation: Use kernel weighting (e.g., triangular) to focus on observations close to the threshold. Run the McCrary density test to check for manipulation. Test bandwidth sensitivity and covariate continuity. For fuzzy RD, use a Wald estimator (jump in outcome divided by jump in treatment probability).

Example. Your rate-limiting policy grants "fast lane" priority to any workspace at or above 1,000 TPM. Workspaces at 998 TPM and 1,002 TPM are otherwise similar, but the 1,002 TPM workspace gets priority. You compare latency for requests just below the cutoff (995–999 TPM) to those just above (1,000–1,005 TPM). A simple linear fit shows a 22 ms drop at the threshold, but a triangular kernel that downweights observations far from 1,000 TPM shrinks the estimate to 14 ms. You check workspace density around the cutoff and see no bunching just above 1,000 TPM, which suggests workspaces are not gaming the threshold, so the local effect is credible. You caveat that the 14 ms improvement only applies to workspaces near the limit, not to small developers at 50 TPM.
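A local linear fit with triangular kernel weights can be written without any libraries. The sketch below fits a weighted line on each side of the cutoff and takes the difference of the two fitted values at the cutoff. The data are synthetic and noise-free (a linear trend plus a 14 ms drop at 1,000 TPM), so the estimator should recover the drop exactly; the bandwidth is a hypothetical choice.

```python
def triangular(x, cutoff, bandwidth):
    """Kernel weight: 1 at the cutoff, falling linearly to 0 at the bandwidth edge."""
    return max(0.0, 1.0 - abs(x - cutoff) / bandwidth)

def wls_intercept(points, cutoff, bandwidth):
    """Weighted linear fit of y on (x - cutoff); returns the fitted value at the cutoff."""
    data = [(x - cutoff, y, triangular(x, cutoff, bandwidth)) for x, y in points]
    data = [(x, y, w) for x, y, w in data if w > 0]
    sw = sum(w for _, _, w in data)
    xb = sum(w * x for x, _, w in data) / sw
    yb = sum(w * y for _, y, w in data) / sw
    slope = (sum(w * (x - xb) * (y - yb) for x, y, w in data)
             / sum(w * (x - xb) ** 2 for x, _, w in data))
    return yb - slope * xb

def rdd_estimate(points, cutoff, bandwidth):
    """Jump at the cutoff: fitted value from above minus fitted value from below."""
    below = [(x, y) for x, y in points if x < cutoff]
    above = [(x, y) for x, y in points if x >= cutoff]
    return wls_intercept(above, cutoff, bandwidth) - wls_intercept(below, cutoff, bandwidth)

# Synthetic latency (ms) by workspace TPM, with a built-in 14 ms drop at the cutoff.
points = [(tpm, 200.0 - 0.01 * (tpm - 1000) - (14.0 if tpm >= 1000 else 0.0))
          for tpm in range(900, 1101)]
print(round(rdd_estimate(points, cutoff=1000, bandwidth=50), 6))   # approximately -14
```

On real data the estimate will move with the bandwidth, which is why bandwidth sensitivity belongs in the readout.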
05

Targeting and Dealing with Heterogeneity

AI usage is inherently heterogeneous: task diversity, user type, volume, frequency, and required resources all vary. Many allocation decisions will be targeted toward a specific customer profile, behavioral profile, or workload type. But before segmenting, you must decide what you are partitioning by.

Prediction vs. Sensitivity

A model that predicts who uses the most tokens or who has the highest latency is doing prediction, not causal segmentation. It partitions units by baseline outcome level, E[Y|X]. For allocation decisions, what matters is treatment sensitivity: the slope of outcome on treatment, ∂Y/∂T. You want to know whose cost or latency changes the most under the new policy.

Consider the scheduler example from Section 1. Partitioning requests by "who has high baseline latency" is not the same as partitioning by "who sees the biggest latency improvement when routed to the nearby cluster." The first partition might be long-context batch jobs that are slow regardless of routing. The second might be short real-time requests that are queue-sensitive. A policy targeted at high-baseline users could waste capacity with no marginal gain.

In other words, segments are useful for targeting only if the treatment-effect slope differs across them. If the slopes are similar, the segmentation is not operationally useful, no matter how different the baseline levels are.

How to Validate a Segment

  1. Start with the average effect. Show that the policy has a real lever worth segmenting. If the ATE is near zero, there may be no heterogeneity to exploit.
  2. Pre-specify operational segments. Good definitions include workload shape, model family, latency sensitivity, customer tier, and peak versus off-peak demand. Do not data-mine segments after seeing the results.
  3. Check that slopes differ. Within each segment, estimate the treatment effect. If the effect sizes are statistically similar across segments, the segmentation is not useful for targeting.
  4. Convert lift into net value. A high segment-level treatment effect can still be a bad allocation if it uses expensive capacity for low-margin traffic.
  5. Keep some exploration alive. If the router only exploits the current best segment, future evaluation loses support for alternatives.
Bad segmentation: Segmenting by raw GPU-hours, token volume, or "power user" status is a common mistake. A high-volume segment may have high baseline cost but near-zero treatment effect if the policy does not change their bottleneck. Always validate that the segment changes the slope, not just the intercept.

Segment Lenses for Shared Compute

Segment lens | Causal question (what slope are you estimating?) | Why this matters for allocation | What to watch
Workload shape | Does routing or batching sensitivity differ by context length, real-time vs. batch, or agentic vs. single-turn? | A policy that helps chat may hurt batch; averaging them together can hide a zero or negative net effect. | Check that segments have different treatment effects, not just different baselines.
Customer tier | Does priority routing improve outcomes more for enterprise vs. pro vs. free? | Scarce capacity should go where the marginal gain is highest, not where baseline value is highest. | Enterprise users may have inelastic demand; the lift may be smaller than for mid-tier users who currently get starved.
Latency sensitivity | Is ∂(task completion)/∂(latency) higher for real-time chat than for async jobs? | A 50 ms improvement matters for chat, not for overnight batch. The policy value depends on the workload mix. | Do not average latency-sensitive and latency-tolerant requests; their slopes differ.
Model family / size | Do routing changes affect small and large models differently? | Large models may saturate memory; small models may not. Segment-level effects can flip signs. | A cache eviction policy might help small models and hurt large ones due to KV-cache pressure.
System state | Does the same policy have different effects when the queue is deep vs. shallow? | A cache policy may help off-peak and hurt peak. The same user can be highly sensitive during congestion and insensitive otherwise. | Report CATE by state, not just ATE across all states. The policy may need state-dependent rules.

Minimum Viable CATE

You do not need machine learning to start. A linear regression with interaction terms is often enough to test whether a segment is worth targeting:

latency = β₀ + β₁(treatment) + β₂(workload_type) + β₃(treatment × workload_type) + controls

Here, β₃ is the difference in treatment effect between workload types. If β₃ is small or not significant, your segmentation by workload type is not operationally useful, regardless of how different the baseline latencies are. You can extend this to any segment by replacing workload_type with the segment of interest.
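For a binary treatment and a binary segment with no additional controls, the regression is saturated and its coefficients equal differences of cell means, so β₃ can be computed directly. The sketch below uses that shortcut with hypothetical latencies; with controls or continuous covariates you would fit the regression with a statistics package instead.

```python
from statistics import mean

def interaction_effect(rows):
    """rows: (treated, segment, latency_ms), with binary treated and segment.
    Returns (beta1, beta3) from the saturated 2x2 model via cell means."""
    cell = {(t, s): mean(y for tt, ss, y in rows if (tt, ss) == (t, s))
            for t in (0, 1) for s in (0, 1)}
    beta1 = cell[(1, 0)] - cell[(0, 0)]             # treatment effect in segment 0
    beta3 = (cell[(1, 1)] - cell[(0, 1)]) - beta1   # extra effect in segment 1
    return beta1, beta3

# Hypothetical data: segment 0 = batch, segment 1 = real-time.
rows = [
    (0, 0, 900), (0, 0, 920), (1, 0, 905), (1, 0, 915),
    (0, 1, 180), (0, 1, 200), (1, 1, 150), (1, 1, 170),
]
beta1, beta3 = interaction_effect(rows)
print(beta1, beta3)   # 0 -30
```

Batch jobs have a far higher baseline latency but no treatment effect, while real-time requests improve by 30 ms: the segment changes the slope, which is what makes it a candidate for targeting.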

Operational principle: Segments are for changing the allocation rule, not for decorating the analysis. If the treatment-effect slope does not differ across segments, do not target by segment.
06

TK: Offline Policy Evaluation

07

TK: Holdouts and Longer-Term Decisions

08

Assumption Checks and Guardrails

In a standard A/B test, a broken randomization check is a reason to discard the result. In shared compute, the same principle applies to identification assumptions: a violated parallel-trends assumption, poor overlap, or unobserved boundary states mean the comparison is not credible, regardless of the p-value. Run these checks before looking at effect sizes.

Identification Checks: Is the Comparison Credible?

Check | What to look for | How it changes the decision
Assignment integrity | SRM, eligibility, exposure time, and whether treatment started when the logs say it did. | If assignment is broken, stop. The comparison is invalid.
Identification assumption | Parallel pre-trends (DiD), propensity overlap, no manipulation at cutoff (RDD), pre-period synthetic fit, cluster balance, or washout clearance (switchback). | If the core assumption is violated, downgrade to descriptive or stop. Do not report a causal effect.
Execution fidelity | Intended action, executed action, fallback reason, timeout path, and retry path. | If execution diverged, report the operational failure separately. Do not attribute the outcome to the policy as designed.
Boundary coverage | Did the experiment observe peak congestion, cache pressure, or the capacity states where the policy could fail? | If boundary states were never observed, the readout cannot support a broad rollout. Recommend a limited ramp or more data.
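The assignment-integrity row mentions SRM (sample ratio mismatch). One standard way to check it is a chi-square goodness-of-fit test on the arm counts, sketched below with hypothetical counts. The intended 50/50 split is an assumption; substitute the split your experiment was configured with.

```python
import math

def srm_p_value(n_control: int, n_treatment: int, expected_treat_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch."""
    total = n_control + n_treatment
    exp_t = total * expected_treat_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

print(srm_p_value(50_000, 50_400) > 0.05)    # True: consistent with a 50/50 split
print(srm_p_value(50_000, 51_500) < 0.001)   # True: investigate before reading effects
```

A very small p-value here points to broken assignment or logging, not to a treatment effect.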

Guardrail Checks: Is the System Safe?

Check | What to look for | How it changes the decision
System spillover | Untreated-user latency, errors, queue depth, cache state, and utilization. | If spillovers are material, the user-level lift is overstated. Move to cluster-level evidence or stop.
Economic significance | User value per GPU-hour, cost per successful request, margin impact, and reliability risk. | If the practical effect does not clear the cost bar, do not ship just because the p-value is good.
Rollback trigger | Latency above X, error rate above Y, cost per success above Z, or no movement in task completion. | If triggers fire, pause or roll back even if the average user metric looks fine. In the router example, elevated fallback rates during congestion should stop the ramp even if premium-routed requests improved.
Operational principle: A statistically significant effect under a violated assumption is not evidence. A positive user-level effect with negative spillovers is not a win. Run the checks first, then interpret the numbers.
09

The Decision Readout

The readout should follow a strict hierarchy: first verify that the evidence is credible, then check that the system is safe, then evaluate whether the effect is large enough to justify the change. Only then should you decide to ship, ramp, target, hold out, pause, or roll back.

Stop Gates: Do Not Launch If

  1. Identification assumptions are violated. Parallel trends diverge, propensity overlap is poor, the RDD cutoff shows manipulation, or synthetic control pre-period fit is weak. (See Section 8.)
  2. Guardrail metrics show negative movement. Untreated-user latency degrades, error rates exceed thresholds, queue depth spikes, or cost per successful request rises.
  3. Boundary states were unobserved. The experiment ran only off-peak; you have no evidence about how the policy behaves under congestion or cache pressure.
  4. Execution fidelity is broken. The policy did not run as intended—fallbacks, timeouts, or operational incidents dominated the treatment period.

If any stop gate is true, the recommendation is do not launch. Downgrade to descriptive findings, fix the instrumentation, or run a stronger design.

Proceed Criteria: Launch or Ramp If

  1. Success metrics are practically and statistically significant. The effect is large enough to matter to users or the business, and the uncertainty is small enough to act on. In cluster or switchback designs, use cluster-robust or time-block standard errors; conventional user-level standard errors are anti-conservative.
  2. All identification assumptions pass. The design's core assumption is plausibly satisfied, with documented caveats if needed.
  3. Guardrail metrics are flat or positive. No material spillover to untreated users, no reliability degradation, and no cost regression.
  4. The effect is economically significant. User lift justifies compute cost, latency cost, or operational complexity. A 2% retention lift that requires 20% more GPU-hours may not be a good trade.

If all four criteria are met, the recommendation is ship or ramp. If state coverage is limited or tail risk is moderate, recommend ramp with monitoring rather than full ship.
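The gate ordering can be encoded as a small function so the readout applies it the same way every time. This is a sketch: the gate and criterion names are hypothetical labels for the items listed above, and the mixed case simply defers to the tradeoff analysis in the next subsection.

```python
def recommend(stop_gates: dict[str, bool], proceed: dict[str, bool],
              limited_coverage: bool) -> str:
    """Apply stop gates first, then proceed criteria, then the ship-versus-ramp choice."""
    tripped = [name for name, is_tripped in stop_gates.items() if is_tripped]
    if tripped:
        return "do not launch: " + ", ".join(tripped)
    if all(proceed.values()):
        return "ramp with monitoring" if limited_coverage else "ship"
    return "tradeoff analysis needed"

gates = {"identification_violated": False, "guardrails_negative": False,
         "boundary_unobserved": False, "execution_broken": False}
criteria = {"success_significant": True, "identification_passes": True,
            "guardrails_flat_or_positive": True, "economically_significant": True}

print(recommend(gates, criteria, limited_coverage=False))   # ship
print(recommend(gates, criteria, limited_coverage=True))    # ramp with monitoring
print(recommend({**gates, "boundary_unobserved": True}, criteria, False))
```

The last call returns a do-not-launch recommendation naming the tripped gate, regardless of how strong the success metrics are.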

Tradeoff Analysis: When Results Are Mixed

If success metrics are positive but guardrails are mixed, or if effects differ across segments, do not default to "launch and monitor." Do the work to unify the evidence into a single recommendation.

1. Unify to a single metric

Combine user value and compute cost into one denominator: value per GPU-hour, cost per successful request, or margin per request. Do not let a positive user metric and a negative cost metric coexist without a common currency. If the unified metric is negative, the answer is no.
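As a worked illustration, margin per request puts task completion and GPU spend in the same currency. All numbers below are hypothetical: the value per successful request, the GPU-hour price, and the arm totals are invented for the arithmetic.

```python
def margin_per_request(successes: int, value_per_success: float,
                       gpu_hours: float, cost_per_gpu_hour: float,
                       requests: int) -> float:
    """(value delivered - compute cost) / requests, in currency units per request."""
    return (successes * value_per_success - gpu_hours * cost_per_gpu_hour) / requests

# Per 1,000 requests: the new policy completes more tasks but uses more GPU time.
old = margin_per_request(successes=870, value_per_success=0.05,
                         gpu_hours=10, cost_per_gpu_hour=2.0, requests=1000)
new = margin_per_request(successes=910, value_per_success=0.05,
                         gpu_hours=12, cost_per_gpu_hour=2.0, requests=1000)
print(round(old, 4), round(new, 4))   # 0.0235 0.0215
print(new > old)                      # False: the unified metric is negative
```

Here completion rises by 4 points, yet margin per request falls, which is the situation the single-metric rule is meant to surface.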

2. Segment by user groups and features

A positive average effect can mask negative effects on a specific segment, or vice versa. Report CATE by workload type, customer tier, model family, and system state. A policy should only be targeted if the segment-level treatment-effect slope differs materially. (See Section 5.)

3. Discuss overall desirability

Even with a positive unified metric, consider operational complexity, maintenance burden, and tail risk. A small average gain with high variance in outcomes—or a gain that only holds in narrow state windows—may not be worth the risk.

4. State what would change the decision

Every readout should close with a pre-specified contingency: "If error rates rise above 0.5% in the next two weeks, roll back." This prevents post-hoc rationalization when early monitoring data arrives.

Decision Matrix

Situation | Recommendation | Contingency
All proceed criteria met; low tail risk | Ship | Monitor guardrails for two weeks; roll back if triggers fire.
All proceed criteria met; limited boundary coverage | Ramp | Expand to peak windows gradually; pause if congestion behavior diverges from off-peak.
Mixed results: strong positive for some segments, flat or negative for others | Target | Deploy only to the winning segment; keep a holdout for the excluded segment to validate the negative finding.
Positive short-run metrics; long-run outcomes (retention, LTV) unobserved | Holdout | Keep the smallest durable holdout that can answer the long-run question; ship only if the surrogate ladder is validated.
Assumption concerns that are not fatal, or mixed tradeoffs that need more data | Pause | Collect more data, fix instrumentation, or run a stronger design before deciding.
Guardrail violations, assumption failures, or negative unified metric | Roll back | Do not wait for more data. Reverse the policy and investigate.
Assumption checks failed due to missing data or broken logging | Instrument more | Fix logging, add decision-time state fields, and re-run before any causal claim.
Template for the readout summary. For [population], the new policy improved [success metric] by [effect size] relative to [the old policy or best counterfactual]. Identification relies on [design and assumption], which [passed / passed with caveats / failed]. Guardrails [passed / showed mixed results / failed]. The unified metric is [positive / negative / uncertain]. Recommendation: [ship, ramp, target, holdout, pause, roll back, or instrument more]. If [contingency], then [revised action].
10

TK: Minimum Logging for a Policy Readout