Working draft. Please do not circulate or reproduce without permission.
For questions, feedback, or access to the full manuscript, contact anjali@analoguegroup.org.
Measurement for Shared Compute
This guide develops a measurement framework for multi-tenant compute infrastructure. While we focus on AI inference systems, the principles apply to any resource allocation decision that affects both an individual request and the broader shared capacity. We'll cover routing, rate limits, caching, batching and similar mechanisms.
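The central challenge is that resource allocation decisions affect shared system state, so they must be evaluated at the policy level rather than at the user or request level. The exception is a change that does not alter shared state, where user-level or request-level randomization can still work.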
The framework draws from my research into state-dependent inference costs and my working experience as a pricing data scientist. An earlier version of these ideas was discussed on Live with Tim O'Reilly.
02
Name the Decision First
A useful discipline is to ask what you would measure if the experiment were perfect. If every request could be randomly routed to a performant cluster with no interference, no capacity limits, and full observability, what would that experiment tell you? That line of questioning exposes the causal target and the measurement constraints, so the real design can approximate the ideal experiment and mitigate the measurement challenges.
- Name the decision. Is the decision to ship broadly, target a segment, or change a dynamic rule?
- Choose the comparison. The counterfactual might be the old policy, untreated users, a different cluster, a prior model checkpoint, or a different time block (peak vs. off-peak hours).
- Select the decision metric. What is the business goal? Common success metrics include task completion, retention, performance against internal evals, throughput, and LTV. The right choice depends heavily on stakeholders and the organization's existing measurement infrastructure.
- Use unit economics as a guardrail. A policy can improve the decision metric while burning too much compute. Pair it with latency, errors, GPU-hours, or cost per successful request.
- End with a rule. Ship, ramp, target, hold out, pause, roll back, or instrument more. The readout should state the recommendation and the tradeoff that would change it. We'll discuss what to do under different scenarios in Section 9.
Decision Pattern Cheat Sheet
| Decision pattern | Comparison frame | Potential tradeoff |
|---|---|---|
| Ship broadly | Average Treatment Effect (ATE): what happens across the eligible population? | Can hide that only one workload or customer segment benefits enough to justify the cost. |
| Explain treated traffic | Average Treatment Effect on the Treated (ATT): what happened to users or requests that actually received treatment? | Can be less useful for future rollout if treated users were unusually selected. |
| Target segment | Conditional Average Treatment Effect (CATE): who benefits enough to receive higher limits or better routing? | Can overfit unless the segment is stable, interpretable, and operationally usable. |
| Change the rule | Policy value: what is the expected outcome if the system runs rule π? | Needs support in the logs, system guardrails, and a clear cost/value objective. |
03
Experimental Designs
Experiments still start with the same question: what is the unit that receives the policy, and what other traffic changes because that unit was treated? If the policy does not change shared state, a user-level or request-level randomized controlled trial (RCT) might make sense. If it changes queue, cache, or capacity state, the experiment has to randomize a larger unit or time block.
A core principle: randomize at or above the unit of analysis. If your decision metric is user-level (retention, expansion), randomizing by request is not recommended because the same user receives mixed treatment. If your metric is request-level (latency, errors), randomizing by user is valid but less efficient.
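One way to make the randomization unit explicit is to derive assignment from a hash of the unit's identifier. The sketch below is illustrative only: the function name `assign_arm`, the salt `exp-42`, and the ID formats are hypothetical and not tied to any particular experimentation platform.

```python
import hashlib

def assign_arm(unit_id: str, experiment_salt: str, treat_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to an arm.

    Hashing the unit ID means every event keyed on that ID lands in the same arm.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treat_share else "control"

# Hypothetical log rows: (user_id, request_id)
requests = [("user-1", "req-a"), ("user-1", "req-b"), ("user-2", "req-c")]

# User-level randomization: key on user_id, so user-1 always sees one arm.
by_user = [assign_arm(user, "exp-42") for user, _ in requests]

# Request-level randomization: key on request_id, so user-1 may see both arms.
by_request = [assign_arm(req, "exp-42") for _, req in requests]
```

Keying on the user ID keeps a user-level metric interpretable; keying on the request ID gives more units but allows mixed exposure within a user.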
Request RCT
Use for: latency, error rate, cost per request, retry rate, or whether a route succeeds.
Why: request-level randomization gives much more statistical power, especially for high-volume traffic. It is a good fit for isolated serving changes where the main outcome is request-level.
Potential downside: The same account can see mixed treatment, which is a poor user experience and makes user-level or session-level analysis invalid. If the policy materially alters shared state, request-level randomization will also understate spillovers to untreated requests.
Mitigation: use it only when the change is invisible or low-risk to the user and does not materially alter shared state.
User or Workspace RCT
Use for: user-level outcomes like retention, expansion, and sustained usage; or for changes where the same user should keep seeing the same experience (rate limits, prices, model access, priority tiers).
Why: if the outcome is retention, expansion, or sustained usage, the randomization unit should match how the user actually experiences the product.
Potential downside: randomizing by user usually has less statistical power than randomizing by request, and takes longer to read.
Mitigation: use short-run system metrics as early checks, and keep a holdout for long-run outcomes.
Cluster RCT
Use for: changes to batching, caching, regional routing, or capacity allocation where the intervention changes a shared capacity pool.
Why: spillovers break user-level randomization. If treated traffic changes queue depth or available capacity, control users in the same resource pool feel the effects of the treated traffic. Randomizing by cluster contains most interference within the randomized unit.
Potential downside: there are fewer independent units, so statistical power is worse. Clusters can also differ by chip generation, geography, user mix, latency, customer value, or incident patterns.
Mitigation: stratify before randomization, analyze at the cluster level, and report system-level outcomes.
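Stratifying before randomization might look like the following sketch. The cluster names, the chip-generation strata, and the function `stratified_assignment` are hypothetical; a real fleet would likely stratify on several of the attributes listed above.

```python
import random
from collections import defaultdict

def stratified_assignment(clusters, stratum_of, seed=0):
    """Randomize clusters to arms within strata.

    Within each stratum, shuffle and alternate arms so the arms differ by
    at most one cluster on the stratifying attribute.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for c in clusters:
        strata[stratum_of[c]].append(c)
    assignment = {}
    for _, members in sorted(strata.items()):
        members = sorted(members)
        rng.shuffle(members)
        # Randomize which arm receives the extra unit in odd-sized strata.
        first = rng.choice(["treatment", "control"])
        second = "control" if first == "treatment" else "treatment"
        for i, c in enumerate(members):
            assignment[c] = first if i % 2 == 0 else second
    return assignment

# Hypothetical fleet: eight clusters across two chip generations.
clusters = [f"c{i}" for i in range(1, 9)]
stratum_of = {c: ("gen-a" if i < 4 else "gen-b") for i, c in enumerate(clusters)}
assignment = stratified_assignment(clusters, stratum_of, seed=7)
```

Because the cluster is the randomized unit, the analysis should then compare cluster-level outcomes rather than pooling individual requests.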
Time-based RCT (Switchbacks)
Use for: cache policy, global routing configuration, scheduler settings, or any change that has to be turned on for a whole system at once.
Why: when the policy changes queue or cache state for everyone, compare blocks of time under policy A versus policy B. In this design, state after the policy runs is an outcome, not a pre-treatment nuisance variable.
Potential downside: demand changes over time. Hour-of-day, day-of-week, incidents, launches, and seasonal traffic can get mixed into the treatment effect. And carryover is present: queue backlog, retries, and latency may persist across blocks.
Mitigation: randomize across comparable time blocks, balance treatment across weekends and weekdays, and record incidents. If necessary, insert washout periods between blocks to let queue and cache state clear. Blocks must be long enough for the policy to take effect but short enough to avoid conflating treatment with demand cycles.
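A switchback schedule with these properties can be sketched as below. The block length, washout length, and start date are placeholder assumptions, and the sketch flags a washout at the start of every block for simplicity, even when the policy did not change.

```python
import random
from datetime import datetime, timedelta

def switchback_schedule(start, days, block_hours=4, washout_minutes=30, seed=0):
    """Build a randomized switchback schedule.

    Each day is split into equal blocks. Within every day, half the blocks
    run policy A and half run policy B in random order, so both policies are
    balanced across weekdays and weekends. The first `washout_minutes` of
    each block are excluded from analysis via `analysis_start`.
    """
    blocks_per_day = 24 // block_hours
    if 24 % block_hours or blocks_per_day % 2:
        raise ValueError("block_hours must give an even number of blocks per day")
    rng = random.Random(seed)
    schedule = []
    for d in range(days):
        arms = ["A", "B"] * (blocks_per_day // 2)
        rng.shuffle(arms)
        for i, arm in enumerate(arms):
            block_start = start + timedelta(days=d, hours=i * block_hours)
            schedule.append({
                "policy": arm,
                "block_start": block_start,
                "analysis_start": block_start + timedelta(minutes=washout_minutes),
                "block_end": block_start + timedelta(hours=block_hours),
                "is_weekend": block_start.weekday() >= 5,
            })
    return schedule

# Two weeks starting on a Monday (hypothetical start date).
schedule = switchback_schedule(datetime(2025, 1, 6), days=14)
```

Balancing within each day is one design choice among several; balancing within hour-of-day slots across days is a reasonable alternative when intraday demand cycles dominate.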
04
Quasi-Experimental Designs (for When You Can't Randomize)
The most obvious cases where you can't randomize are when the organization lacks an experimentation engine, when partial treatment would be unethical, or when it would trigger customer backlash.
But the most common scenario, by far, is when the change has already been launched, especially in fast-moving industries like AI where improvements are expected to ship quickly. Another common case is a universal policy that is costly to reverse. If you can reverse it, switchback experiments are still an option even without an untreated group. But if reversal is too expensive (e.g., the change is a large-scale refactoring), you are left with the methods described below.
Difference-in-Differences
Use for: When you have a natural control unit that looks like the treated unit in trend—e.g., a rollout that hits one region, fleet, or customer tier before another.
Why: It replaces the missing counterfactual by taking the treated unit's pre-period baseline and adding the control unit's growth trend. Unlike a simple before-and-after, it does not require the treated and control to start at the same level.
Potential downside: The parallel trends assumption is doing all the work. If the treated unit was selected because it was already underperforming or growing faster, the trend bias can be large. With aggregated data you also lose the ability to compute standard errors.
Mitigation: Plot pre-trends across multiple pre-periods; if the treated and control diverge before the intervention, DiD is not credible. Check for simultaneous incidents, launches, or demand shifts that break the common trend.
Example: Suppose you roll out a change to the us-east-1 cluster on January 15, but delay it in us-west-2 until February 1. Average latency in us-east-1 drops from 180 ms to 140 ms. But us-west-2 also improved, from 175 ms to 150 ms, because traffic patterns shifted. A naive before-and-after would claim a 40 ms win; DiD subtracts the 25 ms control trend and estimates a 15 ms improvement. If us-east-1 was chosen because it had the worst latency spikes, the pre-trend was already steeper, and the true effect is smaller still.
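The arithmetic in this example can be checked with a few lines. The helper name `did_estimate` is illustrative; the latencies are the ones quoted above, and a negative result means latency went down.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: treated change minus control change."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Naive before-and-after in us-east-1: 140 - 180 = -40 ms.
naive = 140 - 180

# DiD removes the us-west-2 trend: (-40) - (150 - 175) = -40 - (-25) = -15 ms.
did = did_estimate(treated_pre=180, treated_post=140, control_pre=175, control_post=150)
```

Subtracting the control region's 25 ms change from the treated region's 40 ms change leaves a 15 ms reduction attributable to the rollout, provided parallel trends holds.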
Synthetic Control
Use for: When no single control unit looks enough like the treated unit, but a weighted combination of many donors might—e.g., one region, fleet, or provider changes while many others stay constant.
Why: Instead of relying on a single control unit, it builds a weighted combination of donor units to forge a synthetic counterfactual that closely tracks the treated unit before the intervention.
Potential downside: OLS can overfit the pre-period perfectly, producing a synthetic control that matches noise rather than signal; post-intervention variance then explodes. Inference is hard because there is only one treated unit.
Mitigation: Use regularized weights rather than pure OLS. Show pre-period fit, donor weights, and placebo tests: run the same procedure on every donor unit pretending it was treated. Remove donors with high pre-treatment error. Report the distribution of placebo effects against the treated effect.
Example: Suppose you switch the eu-central-1 cluster to a new continuous-batching scheduler. No other cluster uses it. For the six months before the switch, you build a synthetic eu-central-1 from weighted combinations of eu-west-1, us-east-1, and ap-south-1 so that the synthetic cluster closely tracks your actual GPU utilization and p99 latency. After the switch, actual utilization is 12% lower than the synthetic counterfactual, but the synthetic line wobbles because the donor weights overfit a two-week maintenance window. You re-run the procedure pretending each donor cluster got the scheduler; only one placebo shows a gap as large as the real one. With only three donors, that gives a permutation p-value of (1 + 1)/(3 + 1) = 0.5, so the placebo test is inconclusive; a p-value near 0.03 would require dozens of donor clusters.
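The placebo comparison can be expressed as a rank-based p-value. This is a sketch under one common convention (the treated unit counts as one of the permutations); the gap values below are hypothetical and stand in for the post-period difference between actual and synthetic utilization.

```python
def placebo_p_value(treated_gap, placebo_gaps):
    """Share of units (treated included) whose absolute post-period gap
    is at least as large as the treated unit's gap."""
    placebo_gaps = list(placebo_gaps)
    as_extreme = sum(abs(g) >= abs(treated_gap) for g in placebo_gaps)
    return (1 + as_extreme) / (1 + len(placebo_gaps))

# Hypothetical gaps, in percentage points of GPU utilization.
treated_gap = -12.0
placebo_gaps = {"eu-west-1": 3.1, "us-east-1": -12.5, "ap-south-1": 1.4}

# One placebo is as extreme as the treated unit: (1 + 1) / (1 + 3) = 0.5.
p = placebo_p_value(treated_gap, placebo_gaps.values())

# The smallest p-value attainable with three donors is 1 / (1 + 3) = 0.25.
min_p = 1 / (1 + len(placebo_gaps))
```

The resolution of the test is bounded by the donor count, which is worth stating in the readout alongside the placebo distribution itself.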
Propensity Score Weighting
Use for: Treatment assignment is non-random but driven by observed covariates; non-compliance in an otherwise randomized setting; overlapping treated and untreated populations with different baseline probabilities.
Why: It collapses the confounder space into a single balancing score. Units with the same propensity score are comparable even if their raw covariates differ, making it possible to weight or match on one dimension instead of many.
Potential downside: Maximizing the predictive accuracy of the propensity score does not improve causal estimation; adding variables that predict treatment but not the outcome inflates variance. Weak overlap produces extreme weights and dangerous extrapolation. Standard errors must account for the two-step estimation.
Mitigation: Check overlap visually; if distributions barely touch, narrow the estimand or switch designs. Trim extreme weights cautiously, but recognize that clipping introduces bias. Bootstrap the entire procedure to get valid standard errors. Include confounders, not pure treatment predictors.
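A minimal weighting sketch, assuming the propensity scores have already been estimated elsewhere: it computes a normalized (Hajek) inverse-propensity-weighted effect and an effective sample size per arm as a rough overlap diagnostic. The function name, the clipping bounds, and the data rows are hypothetical.

```python
def ipw_ate(rows, clip=(0.05, 0.95)):
    """Normalized inverse-propensity-weighted ATE.

    rows: iterable of (treated 0/1, outcome, propensity score).
    Scores are clipped to `clip`, which limits extreme weights at the
    cost of some bias. Returns (ate, ess_treated, ess_control).
    """
    lo, hi = clip
    w1 = y1 = sq1 = 0.0
    w0 = y0 = sq0 = 0.0
    for t, y, e in rows:
        e = min(max(e, lo), hi)
        if t:
            w = 1.0 / e
            w1 += w
            y1 += w * y
            sq1 += w * w
        else:
            w = 1.0 / (1.0 - e)
            w0 += w
            y0 += w * y
            sq0 += w * w
    ate = y1 / w1 - y0 / w0
    # Effective sample size: (sum of weights)^2 / sum of squared weights.
    return ate, w1 ** 2 / sq1, w0 ** 2 / sq0

# Hypothetical rows: (treated, latency_ms, estimated propensity).
rows = [
    (1, 120.0, 0.8), (1, 130.0, 0.6), (1, 150.0, 0.3),
    (0, 160.0, 0.7), (0, 155.0, 0.4), (0, 170.0, 0.2),
]
ate, ess_treated, ess_control = ipw_ate(rows)
```

An effective sample size far below the raw count signals that a few units carry most of the weight. Valid standard errors would still require bootstrapping the whole procedure, including the propensity model, which this sketch does not do.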
Regression Discontinuity Design
Use for: Eligibility, limits, priority tiers, or pricing that change sharply at a known cutoff; threshold-based routing or admission rules.
Why: Near the threshold, assignment is as good as random: units just above and just below are otherwise similar. The jump in the outcome at the cutoff identifies a local average treatment effect.
Potential downside: The estimate is local to the cutoff and may not generalize away from it. Entities can manipulate their position around the threshold (bunching). Linear extrapolation can misestimate the true counterfactual slope. In fuzzy designs, treatment probability does not jump cleanly to one.
Mitigation: Use kernel weighting (e.g., triangular) to focus on observations close to the threshold. Run the McCrary density test to check for manipulation. Test bandwidth sensitivity and covariate continuity. For fuzzy RD, use a Wald estimator (jump in outcome divided by jump in treatment probability).
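The Wald estimator with triangular kernel weights can be sketched with a local linear fit on each side of the cutoff. The function names, bandwidth, and synthetic data are assumptions made for illustration; the data is constructed so that the true effect of treatment is 10.

```python
def local_linear_at_cutoff(xs, ys, cutoff, bandwidth, side):
    """Triangular-kernel weighted linear fit on one side of the cutoff.
    Returns the fitted value at the cutoff."""
    pts = []
    for x, y in zip(xs, ys):
        d = x - cutoff
        on_side = d >= 0 if side == "right" else d < 0
        w = 1 - abs(d) / bandwidth  # triangular kernel weight
        if on_side and w > 0:
            pts.append((d, y, w))
    sw = sum(w for _, _, w in pts)
    mx = sum(w * d for d, _, w in pts) / sw
    my = sum(w * y for _, y, w in pts) / sw
    sxx = sum(w * (d - mx) ** 2 for d, _, w in pts)
    sxy = sum(w * (d - mx) * (y - my) for d, y, w in pts)
    slope = sxy / sxx
    return my - slope * mx  # intercept, i.e. fitted value at d = 0

def fuzzy_rd_wald(xs, treated, ys, cutoff, bandwidth):
    """Jump in outcome divided by jump in treatment probability."""
    jump_y = (local_linear_at_cutoff(xs, ys, cutoff, bandwidth, "right")
              - local_linear_at_cutoff(xs, ys, cutoff, bandwidth, "left"))
    jump_t = (local_linear_at_cutoff(xs, treated, cutoff, bandwidth, "right")
              - local_linear_at_cutoff(xs, treated, cutoff, bandwidth, "left"))
    return jump_y / jump_t

# Synthetic fuzzy design: most units above 0 are treated, a few below are too.
xs = list(range(-10, 10))
treated = [1 if (x >= 0 and x not in (3, 7)) or x == -4 else 0 for x in xs]
ys = [2 + 0.5 * x + 10 * t for x, t in zip(xs, treated)]
effect = fuzzy_rd_wald(xs, treated, ys, cutoff=0, bandwidth=8)  # about 10 by construction
```

Re-running `fuzzy_rd_wald` over a range of bandwidths is a simple way to perform the sensitivity check mentioned above.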
05
Targeting and Dealing with Heterogeneity
AI usage is inherently heterogeneous: task diversity, user type, volume, frequency, and required resources all vary. Many allocation decisions will be targeted toward a specific customer profile, behavioral profile, or workload type. But before segmenting, you must decide what you are partitioning by.
Prediction vs. Sensitivity
A model that predicts who uses the most tokens or who has the highest latency is doing prediction, not causal segmentation. It partitions units by baseline outcome level, E[Y|X]. For allocation decisions, what matters is treatment sensitivity: the slope of outcome on treatment, ∂Y/∂T. You want to know whose cost or latency changes the most under the new policy.
In other words, segments are useful for targeting only if the treatment-effect slope differs across them. If the slopes are similar, the segmentation is not operationally useful, no matter how different the baseline levels are.
How to Validate a Segment
- Start with the average effect. Show that the policy has a real lever worth segmenting. If the ATE is near zero, there may be no heterogeneity to exploit, unless opposite-signed segment effects are canceling out.
- Pre-specify operational segments. Good definitions include workload shape, model family, latency sensitivity, customer tier, and peak versus off-peak demand. Do not data-mine segments after seeing the results.
- Check that slopes differ. Within each segment, estimate the treatment effect. If the effect sizes are statistically similar across segments, the segmentation is not useful for targeting.
- Convert lift into net value. A high segment-level treatment effect can still be a bad allocation if it uses expensive capacity for low-margin traffic.
- Keep some exploration alive. If the router only exploits the current best segment, future evaluation loses support for alternatives.
Segment Lenses for Shared Compute
| Segment lens | Causal question (what slope are you estimating?) | Why this matters for allocation | What to watch |
|---|---|---|---|
| Workload shape | Does routing or batching sensitivity differ by context length, real-time vs. batch, or agentic vs. single-turn? | A policy that helps chat may hurt batch; averaging them together can hide a zero or negative net effect. | Check that segments have different treatment effects, not just different baselines. |
| Customer tier | Does priority routing improve outcomes more for enterprise vs. pro vs. free? | Scarce capacity should go where the marginal gain is highest, not where baseline value is highest. | Enterprise users may have inelastic demand; the lift may be smaller than for mid-tier users who currently get starved. |
| Latency sensitivity | Is ∂(task completion)/∂(latency) higher for real-time chat than for async jobs? | A 50 ms improvement matters for chat, not for overnight batch. The policy value depends on the workload mix. | Do not average latency-sensitive and latency-tolerant requests; their slopes differ. |
| Model family / size | Do routing changes affect small and large models differently? | Large models may saturate memory; small models may not. Segment-level effects can flip signs. | A cache eviction policy might help small models and hurt large ones due to KV-cache pressure. |
| System state | Does the same policy have different effects when the queue is deep vs. shallow? | A cache policy may help off-peak and hurt peak. The same user can be highly sensitive during congestion and insensitive otherwise. | Report CATE by state, not just ATE across all states. The policy may need state-dependent rules. |
Minimum Viable CATE
You do not need machine learning to start. A linear regression with interaction terms is often enough to test whether a segment is worth targeting:
latency = β₀ + β₁·treatment + β₂·workload_type + β₃·(treatment × workload_type) + ε
Here, β₃ is the difference in treatment effect between workload types. If β₃ is small or not significant, your segmentation by workload type is not operationally useful, regardless of how different the baseline latencies are. You can extend this to any segment by replacing workload_type with the segment of interest.
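For a binary treatment and a binary segment, the interaction coefficient can be read directly from cell means, because the saturated regression reproduces them exactly. The sketch below uses that shortcut; the function name and the latency rows are hypothetical, and it reports point estimates only, so a significance test would still need standard errors.

```python
from statistics import mean

def interaction_effect(rows):
    """Point estimates for a saturated 2x2 model with a binary treatment,
    a binary segment, and their interaction.

    rows: iterable of (treated 0/1, segment 0/1, outcome).
    """
    rows = list(rows)
    cell = {
        (t, s): mean(y for tt, ss, y in rows if (tt, ss) == (t, s))
        for t in (0, 1) for s in (0, 1)
    }
    effect_s0 = cell[1, 0] - cell[0, 0]  # treatment effect in the baseline segment
    effect_s1 = cell[1, 1] - cell[0, 1]  # treatment effect in the other segment
    return {"beta_1": effect_s0, "beta_3": effect_s1 - effect_s0}

# Hypothetical rows: (treated, is_batch, latency_ms).
rows = [
    (0, 0, 200.0), (0, 0, 210.0), (1, 0, 170.0), (1, 0, 180.0),  # chat
    (0, 1, 900.0), (0, 1, 920.0), (1, 1, 895.0), (1, 1, 915.0),  # batch
]
coefs = interaction_effect(rows)  # chat effect -30 ms, batch effect -5 ms, beta_3 = 25
```

Here the baselines differ by roughly 700 ms, but what matters for targeting is that the treatment effect differs by 25 ms between the two workload types.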
06
TK: Offline Policy Evaluation
07
TK: Holdouts and Longer-Term Decisions
08
Assumption Checks and Guardrails
In a standard A/B test, a broken randomization check is a reason to discard the result. In shared compute, the same principle applies to identification assumptions: a violated parallel-trends assumption, poor overlap, or unobserved boundary states mean the comparison is not credible, regardless of the p-value. Run these checks before looking at effect sizes.
Identification Checks: Is the Comparison Credible?
| Check | What to look for | How it changes the decision |
|---|---|---|
| Assignment integrity | Sample ratio mismatch (SRM), eligibility, exposure time, and whether treatment started when the logs say it did. | If assignment is broken, stop. The comparison is invalid. |
| Identification assumption | Parallel pre-trends (DiD), propensity overlap, no manipulation at cutoff (RDD), pre-period synthetic fit, cluster balance, or washout clearance (switchback). | If the core assumption is violated, downgrade to descriptive or stop. Do not report a causal effect. |
| Execution fidelity | Intended action, executed action, fallback reason, timeout path, and retry path. | If execution diverged, report the operational failure separately. Do not attribute the outcome to the policy as designed. |
| Boundary coverage | Did the experiment observe peak congestion, cache pressure, or the capacity states where the policy could fail? | If boundary states were never observed, the readout cannot support a broad rollout. Recommend a limited ramp or more data. |
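The assignment-integrity check usually starts with a sample ratio mismatch (SRM) test. A minimal sketch using only the standard library follows; the function name, the counts, and any alert threshold applied to the result are assumptions.

```python
import math

def srm_p_value(n_control, n_treatment, expected_treat_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) comparing
    observed arm counts with the intended split."""
    total = n_control + n_treatment
    exp_t = total * expected_treat_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for an intended 50/50 split.
p_balanced = srm_p_value(50_000, 50_000)  # exactly balanced, so p = 1.0
p_skewed = srm_p_value(50_000, 51_000)    # 1,000 extra treated units
```

A very small p-value suggests the assignment or logging pipeline is dropping or duplicating units in one arm, which is a reason to stop before reading any effect size.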
Guardrail Checks: Is the System Safe?
| Check | What to look for | How it changes the decision |
|---|---|---|
| System spillover | Untreated-user latency, errors, queue depth, cache state, and utilization. | If spillovers are material, the user-level lift is overstated. Move to cluster-level evidence or stop. |
| Economic significance | User value per GPU-hour, cost per successful request, margin impact, and reliability risk. | If the practical effect does not clear the cost bar, do not ship just because the p-value is good. |
| Rollback trigger | Latency above X, error rate above Y, cost per success above Z, or no movement in task completion. | If triggers fire, pause or roll back even if the average user metric looks fine. For a routing policy, for example, elevated fallback rates during congestion should stop the ramp even if premium-routed requests improved. |
09
The Decision Readout
The readout should follow a strict hierarchy: first verify that the evidence is credible, then check that the system is safe, then evaluate whether the effect is large enough to justify the change. Only then should you decide to ship, ramp, target, hold out, pause, or roll back.
Stop Gates: Do Not Launch If
- Identification assumptions are violated. Parallel trends diverge, propensity overlap is poor, the RDD cutoff shows manipulation, or synthetic control pre-period fit is weak. (See Section 8.)
- Guardrail metrics show negative movement. Untreated-user latency degrades, error rates exceed thresholds, queue depth spikes, or cost per successful request rises.
- Boundary states were unobserved. The experiment ran only off-peak; you have no evidence about how the policy behaves under congestion or cache pressure.
- Execution fidelity is broken. The policy did not run as intended—fallbacks, timeouts, or operational incidents dominated the treatment period.
If any stop gate is true, the recommendation is do not launch. Downgrade to descriptive findings, fix the instrumentation, or run a stronger design.
Proceed Criteria: Launch or Ramp If
- Success metrics are practically and statistically significant. The effect is large enough to matter to users or the business, and the uncertainty is small enough to act on. In cluster or switchback designs, use cluster-robust or time-block standard errors; conventional user-level standard errors are anti-conservative.
- All identification assumptions pass. The design's core assumption is plausibly satisfied, with documented caveats if needed.
- Guardrail metrics are flat or positive. No material spillover to untreated users, no reliability degradation, and no cost regression.
- The effect is economically significant. User lift justifies compute cost, latency cost, or operational complexity. A 2% retention lift that requires 20% more GPU-hours may not be a good trade.
If all four criteria are met, the recommendation is ship or ramp. If state coverage is limited or tail risk is moderate, recommend ramp with monitoring rather than full ship.
Tradeoff Analysis: When Results Are Mixed
If success metrics are positive but guardrails are mixed, or if effects differ across segments, do not default to "launch and monitor." Do the work to unify the evidence into a single recommendation.
1. Unify to a single metric
Combine user value and compute cost into one common metric: value per GPU-hour, cost per successful request, or margin per request. Do not let a positive user metric and a negative cost metric coexist without a common currency. If the unified metric moves in the wrong direction, the answer is no.
2. Segment by user groups and features
A positive average effect can mask negative effects on a specific segment, or vice versa. Report CATE by workload type, customer tier, model family, and system state. A policy should only be targeted if the segment-level treatment-effect slope differs materially. (See Section 5.)
3. Discuss overall desirability
Even with a positive unified metric, consider operational complexity, maintenance burden, and tail risk. A small average gain with high variance in outcomes—or a gain that only holds in narrow state windows—may not be worth the risk.
4. State what would change the decision
Every readout should close with a pre-specified contingency: "If error rates rise above 0.5% in the next two weeks, roll back." This prevents post-hoc rationalization when early monitoring data arrives.
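Step 1 above, unifying value and cost into a single metric, can be illustrated with a small calculation. Every number here is hypothetical, as is the choice of net value per request as the common currency.

```python
def net_value_per_request(value_per_success, success_rate,
                          gpu_hours, cost_per_gpu_hour, requests):
    """Expected value of a request minus its share of compute cost."""
    expected_value = value_per_success * success_rate
    compute_cost = gpu_hours * cost_per_gpu_hour / requests
    return expected_value - compute_cost

# Hypothetical readout: success rate improves, but GPU-hours rise 20%.
control = net_value_per_request(
    value_per_success=0.01, success_rate=0.90,
    gpu_hours=1_000, cost_per_gpu_hour=2.0, requests=1_000_000)
treatment = net_value_per_request(
    value_per_success=0.01, success_rate=0.92,
    gpu_hours=1_200, cost_per_gpu_hour=2.0, requests=1_000_000)

delta = treatment - control  # negative: the extra successes do not pay for the compute
```

In this made-up case the user metric improves while the unified metric falls, which is the situation where a common currency prevents a misleading "launch and monitor" call.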
Decision Matrix
| Situation | Recommendation | Contingency |
|---|---|---|
| All proceed criteria met; low tail risk | Ship | Monitor guardrails for two weeks; roll back if triggers fire. |
| All proceed criteria met; limited boundary coverage | Ramp | Expand to peak windows gradually; pause if congestion behavior diverges from off-peak. |
| Mixed results: strong positive for some segments, flat or negative for others | Target | Deploy only to the winning segment; keep a holdout for the excluded segment to validate the negative finding. |
| Positive short-run metrics; long-run outcomes (retention, LTV) unobserved | Holdout | Keep the smallest durable holdout that can answer the long-run question; ship only if the surrogate ladder is validated. |
| Assumption concerns that are not fatal, or mixed tradeoffs that need more data | Pause | Collect more data, fix instrumentation, or run a stronger design before deciding. |
| Guardrail violations, assumption failures, or negative unified metric | Roll back | Do not wait for more data. Reverse the policy and investigate. |
| Assumption checks failed due to missing data or broken logging | Instrument more | Fix logging, add decision-time state fields, and re-run before any causal claim. |
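The ordering of the readout (credibility first, then safety, then effect size) can be encoded as a small function. This sketch covers only the stop gates and the ship-versus-ramp choice, not the target, holdout, pause, or instrument rows of the matrix; the `Readout` fields and the three-level `boundary_coverage` value are assumptions introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Readout:
    identification_ok: bool         # design assumptions and assignment checks pass
    execution_ok: bool              # policy ran as intended
    guardrails_ok: bool             # no material spillover or cost regression
    boundary_coverage: str          # "none", "limited", or "full"
    effect_significant: bool        # practically and statistically
    economically_significant: bool  # lift justifies compute and complexity
    low_tail_risk: bool = True

def recommend(r: Readout) -> str:
    # 1. Credibility gates.
    if not (r.identification_ok and r.execution_ok):
        return "do not launch"
    # 2. Safety gates.
    if not r.guardrails_ok or r.boundary_coverage == "none":
        return "do not launch"
    # 3. Is the effect large enough to justify the change?
    if not (r.effect_significant and r.economically_significant):
        return "do not launch"
    # 4. Full ship only with full state coverage and low tail risk.
    if r.boundary_coverage == "limited" or not r.low_tail_risk:
        return "ramp"
    return "ship"

clean = Readout(True, True, True, "full", True, True)
off_peak_only = Readout(True, True, True, "limited", True, True)
```

Checking the gates in this fixed order keeps a large effect size from being read before the comparison has been shown to be credible.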