Private working draft. Please do not circulate or reproduce without permission.
Measurement for Shared Compute
This guide develops a practical framework for evaluating routing, rate limits, cache policies and other allocation decisions in AI inference systems where an intervention affects both an individual request and the shared capacity system around it.
It combines my research into state-dependent inference costs and fat-tailed demand with my working experience as a pricing and product data scientist. An earlier version of these ideas was discussed on Live with Tim O'Reilly.
The central challenge is that compute allocation decisions should usually be evaluated as policies under shared system state, rather than as isolated user-level treatments.
01
Start With the Question
Turn the operational decision into a concrete causal question before choosing a method.
Whenever a question is causal, start from the ideal experiment: if there were no constraints, what experiment would you run to uncover this effect? Answering that tends to clarify how the effect can be identified even without the ideal measurement environment.
- Name the decision. Is the decision to ship broadly, target a segment, change a routing policy, reserve more capacity, or instrument before acting?
- Choose the comparison. The counterfactual might be the old policy, the untreated users, a different cluster, a prior model checkpoint, or a threshold-adjacent group.
- Pick the estimand. Average Treatment Effect (ATE) is for broad rollout, Average Treatment Effect on the Treated (ATT) is for the users who actually received treatment, Conditional Average Treatment Effect (CATE) is for targeting, and policy value is for routing or allocation decisions.
- Use unit economics as a guardrail. A policy can improve task success while burning too much compute. Pair retention or task completion with latency, errors, GPU-hours, and cost per successful request.
- End with a rule. Ship, ramp, target, hold out, pause, roll back, or instrument more. The point is to state the recommendation and the tradeoff that could change it.
Estimand Cheat Sheet
| Estimand | Use it when the decision is... | Potential Tradeoff |
|---|---|---|
| ATE | Should the policy ship to the whole eligible population? | Can hide that only one workload or customer segment benefits enough to justify the cost. |
| ATT | What happened to the users or requests that actually received the treatment? | Can be less useful for future rollout if treated users were unusually selected. |
| CATE | Who should receive scarce compute, higher limits, or better routing? | Can overfit unless the segment is stable, interpretable, and operationally usable. |
| Policy value | Which routing, scheduling, or allocation rule should run? | Needs support in the logs, system guardrails, and a clear cost/value objective. |
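To make the difference between the estimands concrete, here is a minimal sketch on synthetic data. It assumes both potential outcomes are known for every unit, which is only possible in a simulation; the `Unit` class, the segment names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Unit:
    segment: str   # hypothetical workload segment
    treated: bool  # whether the unit actually received the treatment
    y0: float      # outcome without treatment (observable only in simulation)
    y1: float      # outcome with treatment (observable only in simulation)


def ate(units):
    """Average effect over the whole eligible population."""
    return mean(u.y1 - u.y0 for u in units)


def att(units):
    """Average effect over the units that actually received treatment."""
    return mean(u.y1 - u.y0 for u in units if u.treated)


def cate(units, segment):
    """Average effect within one segment."""
    return mean(u.y1 - u.y0 for u in units if u.segment == segment)


units = [
    Unit("long_context", True, 0.50, 0.80),
    Unit("long_context", True, 0.40, 0.70),
    Unit("long_context", False, 0.50, 0.80),
    Unit("short_chat", True, 0.90, 0.90),
    Unit("short_chat", False, 0.90, 0.90),
    Unit("short_chat", False, 0.80, 0.80),
]

print(round(ate(units), 2), round(att(units), 2))
print(round(cate(units, "long_context"), 2), round(cate(units, "short_chat"), 2))
```

In this toy population the treatment only helps long-context work, and treated units are mostly long-context, so the ATT is larger than the ATE. That is the selection caveat in the ATT row of the table.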
02
Allocation Policies are Not Static Treatments
When the action depends on workload and system state, evaluate the rule that made the decision.
In shared compute, the "treatment" is often a policy rather than a single static action. A router, rate limit, cache policy, or capacity allocation rule is dynamic, and chooses an action based on the request-level workload and the live system state.
This changes what has to be measured. A credible evaluation requires the inputs the policy saw, the intended action, the executed action, the fallback path, and the reason for any divergence. Otherwise the analysis can confuse the policy effect with the state that caused the policy to act.
In the router example, premium-fleet assignment is not simply "treated." The same request may be routed to premium capacity when the queue is shallow and sent to a cheaper fleet when the system is congested. The estimand must therefore attach to the allocation rule, not merely to the serving path that happened to execute.
Minimum Logging for a Policy Readout
| Field | Why it matters | Failure if missing |
|---|---|---|
| Decision-time state | Queue depth, region, cache state, incidents, and capacity pressure may drive both assignment and outcome. | You mistake a hard-to-serve state for a treatment effect. |
| Intended action | This is the policy recommendation: route, admit, defer, cache, limit, or prioritize. | You cannot report intent-to-treat. |
| Executed action | Serving paths can change because of timeouts, provider issues, or fallback logic. | You analyze the policy as if the intended action always ran. |
| System guardrails | Latency, errors, queue depth, cost per successful request, and untreated-user experience show spillovers. | A user-level lift hides damage to the shared system. |
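One way to picture the minimum log is as a single record per decision. The sketch below is illustrative only: the `DecisionLog` fields, the action names, and the fallback reason string are hypothetical, not a schema from any particular system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionLog:
    request_id: str
    queue_depth_at_decision: int    # decision-time state, measured before acting
    intended_action: str            # what the policy recommended
    executed_action: str            # what the serving path actually did
    fallback_reason: Optional[str]  # why the two diverged, if they did
    success: bool


def divergence_rate(logs):
    """Share of decisions where the executed action differs from the intended one."""
    return sum(l.intended_action != l.executed_action for l in logs) / len(logs)


def success_by_intended_action(logs):
    """Intent-to-treat style readout: group by what the policy asked for."""
    totals = {}
    for log in logs:
        wins, n = totals.get(log.intended_action, (0, 0))
        totals[log.intended_action] = (wins + int(log.success), n + 1)
    return {action: wins / n for action, (wins, n) in totals.items()}


logs = [
    DecisionLog("r1", 3, "premium", "premium", None, True),
    DecisionLog("r2", 40, "premium", "standard", "premium_timeout", False),
    DecisionLog("r3", 5, "standard", "standard", None, True),
    DecisionLog("r4", 38, "standard", "standard", None, False),
]

print(divergence_rate(logs))
print(success_by_intended_action(logs))
```

Grouping by `intended_action` keeps the comparison tied to the rule. Grouping by `executed_action` would instead mix in whatever state triggered the fallback.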
03
State, Mechanisms, and Bias [needs better title]
Use the DAG as a short checklist: confounder, mediator, collider, spillover.
The control question should come from the assignment story, not from feature importance. Adjustment variables should exist before the policy decision and plausibly cause both assignment and the outcome. The analysis should not control away the mechanism by which the treatment worked.
Basic System DAG [I'll update this later]
X and S are candidates for adjustment when they are pre-treatment common causes. Latency and errors sit after the action, so they should be reported as mechanism metrics unless the question is specifically about a direct effect.
System State Is Both a Confounder and an Outcome
Shared inference systems make the timing of state measurement central. At time t, system state can be a pre-treatment confounder for the current routing, admission, or fallback decision. At time t + 1, that same class of state variables can be downstream of earlier policy decisions.
Queue depth before a routing decision may be an adjustment variable. Queue depth after the policy has run may be part of the policy's effect. In the router example, congestion can determine whether a high-value request reaches the premium fleet, while the routing decision itself can also change later congestion for untreated requests.
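The timing rule can be enforced when building the adjustment set. This is a small sketch, assuming state snapshots are stored as `(timestamp, state)` pairs sorted by time; the function name and the data are hypothetical.

```python
from bisect import bisect_left


def state_before(snapshots, decision_ts):
    """Return the latest (ts, state) snapshot strictly before decision_ts, or None.

    `snapshots` must be sorted by timestamp. Anything measured at or after the
    decision may already be affected by it, so it is excluded here.
    """
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_left(timestamps, decision_ts)  # first index with ts >= decision_ts
    return snapshots[idx - 1] if idx > 0 else None


snapshots = [
    (100, {"queue_depth": 4}),
    (160, {"queue_depth": 9}),
    (220, {"queue_depth": 31}),
]

print(state_before(snapshots, 200))
print(state_before(snapshots, 160))
print(state_before(snapshots, 50))
```

The snapshot at timestamp 220 would be reported as an outcome or mechanism metric for a decision made at 200, not used as a control.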
- Control for common causes. If queue depth, customer tier, workload length, or provider health affected both routing and success, it belongs in the adjustment story.
- Do not control for mediators in a total-effect question. Latency, errors, cache hit, and completion may be how the routing change affects retention or task success.
- Do not select only completed requests. Completion can be caused by the treatment. Dropping failures makes a bad serving policy look cleaner than it is.
- Name the spillover path. If the treatment changes a shared queue, cache, fleet, or regional capacity pool, user-level treatment and control groups may no longer be independent.
Probably Bad Controls
| Variable | How to treat it | Why |
|---|---|---|
| Queue depth before routing | Usually adjust for it. | It can drive both assignment and outcome. |
| Queue depth after routing | Report it as an outcome or mechanism metric. | It can be downstream of the policy and part of the effect on the shared system. |
| Latency after routing | Do not control for it in a total-effect estimate. | It is probably part of the mechanism. |
| Completed requests only | Avoid conditioning on this sample. | Treatment may affect completion, so failures disappear from the readout. |
| Untreated-user latency | Use as a spillover guardrail. | It shows whether treated traffic harmed the shared system. |
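The completed-requests row is easy to underestimate, so here is a toy calculation with made-up counts. It assumes the new policy causes more requests to fail before completion; the arm names and numbers are hypothetical.

```python
# Hypothetical counts per arm: requests assigned, requests that completed,
# and completed requests where the user's task succeeded.
arms = {
    "old_policy": {"assigned": 100, "completed": 80, "satisfied": 64},
    "new_policy": {"assigned": 100, "completed": 60, "satisfied": 54},
}


def satisfied_rate(arm, denominator):
    """Success rate using either 'completed' or 'assigned' as the denominator."""
    return arm["satisfied"] / arm[denominator]


def policy_diff(arms, denominator):
    """New-policy rate minus old-policy rate for the chosen denominator."""
    new = satisfied_rate(arms["new_policy"], denominator)
    old = satisfied_rate(arms["old_policy"], denominator)
    return new - old


completed_only = policy_diff(arms, "completed")  # conditions on a post-treatment outcome
all_assigned = policy_diff(arms, "assigned")     # keeps failures in the readout

print(round(completed_only, 2), round(all_assigned, 2))
```

Among completed requests the new policy looks better, but over all assigned requests it is worse, because the failures it caused were dropped from the first comparison.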
04
Fat-Tailed Costs and Capacity Risk
Cost tails matter both across users and across the system states that set capacity risk.
For AI infrastructure and inference products, a plain difference-in-means readout can be fragile when the metric is tokens, cost, or GPU-hours per user. Those outcomes can be dominated by a small number of very large users, long-context workflows, or high-concurrency windows.
Tokens per request are bounded by the context window, but the variance can still be large as use cases get more heterogeneous. True resource cost is messier because it depends on live system state: cache pressure, batch concurrency, prefill/decode balance, provider health, and the current capacity boundary. The sample mean can move because the policy worked, or because one arm happened to receive more unusually large workloads.
Two tail risks should be kept separate. Cross-sectional tail risk means a small number of users, accounts, or workflows dominate cost. Temporal and system-state tail risk means peak-concurrency periods push the system toward capacity boundaries and alter batching, cache pressure, fallback behavior, and untreated-user experience.
Robust Cost Readouts [needs review]
| Move | When to use it | Decision caveat |
|---|---|---|
| Raw mean difference | Start here when the business question is aggregate cost, margin, or total capacity consumed. | The tail is part of the economics, so it should not be thrown away automatically. Sensitivity checks show whether one account or time window carries the result. |
| Winsorized or trimmed mean | Use as a robustness readout for tokens, cost, or GPU-hours per user. | This changes the estimand. The readout is no longer total average cost; it is a capped or central-user version of cost. |
| Quantile treatment effects | Use when the tail is the question: p90, p95, p99 cost, latency, or GPU-hours. | This answers a different question than average ROI. It is about tail risk and capacity exposure. |
| Log transform | Use for a more stable multiplicative readout, such as percent movement in cost or usage. | It downweights very large users and can be easier to model, but the result is no longer in raw dollars or raw GPU-hours. |
| Randomization inference | Use when CLT-based confidence intervals feel shaky because the sample is small, clustered, or tail-dominated. | It is tied to the actual assignment mechanism, which is good, but it does not solve bad randomization or spillovers. |
| CUPED / regression adjustment | Use pre-period usage as a covariate, especially for heavy users with persistent behavior. | This is often the biggest variance win, but the covariate has to be pre-treatment and measured consistently. |
For a heavy-tailed cost metric, the raw mean should remain visible because it is the business cost, but it should not be the only readout. Pair it with a winsorized or trimmed estimate, a quantile readout if tail risk matters, and CUPED or regression adjustment using pre-period usage. If the confidence interval depends heavily on normal approximations, use randomization inference tied to the actual assignment mechanism.
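A minimal sketch of three of these readouts side by side, on made-up per-user costs where one treated account is very large. The function names and data are hypothetical, the 95th-percentile cap is an arbitrary choice, and the CUPED step is the standard single-covariate version with a pooled coefficient.

```python
from statistics import mean, quantiles


def pooled_cap(control, treated, pct=95):
    """Upper cap from the pooled sample, so both arms share one threshold."""
    cuts = quantiles(control + treated, n=100, method="inclusive")
    return cuts[pct - 1]


def winsorized_diff(control, treated, pct=95):
    """Difference in means after capping values at the pooled percentile."""
    cap = pooled_cap(control, treated, pct)
    return mean(min(v, cap) for v in treated) - mean(min(v, cap) for v in control)


def cuped_diff(control, treated, pre_control, pre_treated):
    """Difference in means after removing variance explained by pre-period usage."""
    y = control + treated
    x = pre_control + pre_treated
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    theta = cov_xy / var_x  # pooled regression slope of outcome on pre-period usage
    adjusted = [b - theta * (a - x_bar) for a, b in zip(x, y)]
    n_control = len(control)
    return mean(adjusted[n_control:]) - mean(adjusted[:n_control])


# Hypothetical cost per user; the last treated user is a very heavy account
# that was already heavy before the experiment started.
control = [10, 12, 11, 13, 9, 14, 10, 12]
treated = [11, 13, 12, 14, 10, 15, 11, 400]
pre_control = [10, 12, 11, 13, 9, 14, 10, 12]
pre_treated = [10, 12, 11, 13, 9, 14, 10, 390]

raw_diff = mean(treated) - mean(control)
print(round(raw_diff, 2))
print(round(winsorized_diff(control, treated), 2))
print(round(cuped_diff(control, treated, pre_control, pre_treated), 2))
```

The raw difference is carried almost entirely by one account. The winsorized number answers a capped-cost question, and the CUPED number shows that most of the gap was already present in the pre-period. All three belong in the readout, each labeled with the question it answers.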
When the policy changes a shared system, uncertainty may need to be computed at the time-window, cluster, fleet, or account level. In the router example, a policy can look efficient on ordinary traffic while creating unacceptable fallback rates during peak-concurrency blocks. That is a capacity-risk question, not only a central-tendency question.
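Randomization inference at the block level can be sketched as follows. This assumes treatment was assigned by completely randomizing a fixed number of time blocks, so the reference distribution enumerates every assignment with the same number of treated blocks. The block ids and fallback rates are made up.

```python
from itertools import combinations
from statistics import mean


def randomization_p_value(block_outcomes, treated_blocks):
    """Exact two-sided p-value for a difference in block-level means.

    Re-assigns treatment across blocks, holding the number of treated blocks
    fixed, and counts assignments at least as extreme as the observed one.
    """
    ids = sorted(block_outcomes)

    def diff(treated):
        t = [block_outcomes[b] for b in ids if b in treated]
        c = [block_outcomes[b] for b in ids if b not in treated]
        return mean(t) - mean(c)

    observed = diff(set(treated_blocks))
    draws = [diff(set(combo)) for combo in combinations(ids, len(treated_blocks))]
    extreme = sum(1 for d in draws if abs(d) >= abs(observed) - 1e-12)
    return observed, extreme / len(draws)


# Hypothetical fallback rate per time block; blocks 0, 2, 5, 7 ran the new policy.
block_outcomes = {0: 0.08, 1: 0.02, 2: 0.09, 3: 0.03, 4: 0.02, 5: 0.07, 6: 0.04, 7: 0.10}
observed, p_value = randomization_p_value(block_outcomes, [0, 2, 5, 7])
print(round(observed, 4), round(p_value, 4))
```

With eight blocks there are only 70 possible assignments, so the smallest achievable two-sided p-value is 2/70. That is an honest reflection of how little independent evidence a handful of time blocks carries, regardless of how many requests each block contains.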
05
Choosing a Credible Design
Pick the strongest credible design, then state the assumption that could fail.
| If the situation is... | Recommendation | Key identification question |
|---|---|---|
| Randomization is feasible and spillovers are low. | Run a user/workspace/request RCT, depending on the outcome. | Check SRM, balance, guardrails, and whether the randomization unit matches the decision. |
| The intervention changes shared system state. | Use cluster randomization or a switchback experiment. | Do not pretend requests are independent if the policy changes queue or cache state. |
| The rollout happened in waves. | Use DiD or an event study. | Show pre-trends and check for incidents or launches that happened at the same time. |
| Treatment changes at a threshold. | Use RDD around the cutoff. | Keep the claim local to users near the cutoff and check manipulation. |
| Assignment nudges treatment but take-up varies. | Use IV / encouragement and report LATE. | Check first stage and defend the exclusion restriction. |
| Only observational logs are available. | Use regression, propensity weighting, or doubly robust estimation if pre-treatment confounders are logged. | Say clearly that unobserved confounding is the remaining risk. |
| One region, fleet, or provider changed. | Use synthetic control if untreated units match the pre-period. | Show pre-fit and placebo tests. If pre-fit is bad, do not oversell it. |
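The SRM check in the first row is a chi-square goodness-of-fit test on arm counts. A minimal sketch for two arms follows; the function name and counts are hypothetical, and the p-value uses the identity that a chi-square variable with one degree of freedom has upper tail `erfc(sqrt(x / 2))`.

```python
import math


def srm_p_value(n_control, n_treated, expected_treated_share=0.5):
    """Chi-square test (1 degree of freedom) for sample ratio mismatch."""
    total = n_control + n_treated
    expected_treated = total * expected_treated_share
    expected_control = total - expected_treated
    chi2 = (
        (n_treated - expected_treated) ** 2 / expected_treated
        + (n_control - expected_control) ** 2 / expected_control
    )
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


print(srm_p_value(5000, 5000))
print(srm_p_value(5000, 5300))
```

A very small p-value on a split that was supposed to be even suggests the assignment or logging pipeline is broken, which is a reason to pause the readout rather than to interpret the effect.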
06
Experimental Designs Under Shared Compute
The unit of randomization has to match the user experience and the capacity system.
User or Workspace RCT
Use for: rate limits, prices, model access, priority tiers, or any change where the same user should keep seeing the same experience.
Why: if the outcome is retention, expansion, or sustained usage, the randomization unit should match how the user actually experiences the product.
Potential downside: user-level randomization usually has less statistical power than request-level randomization, and the readout takes longer.
Mitigation: use short-run system metrics as early checks, and keep a holdout for long-run outcomes.
Request RCT
Use for: latency, error rate, cost per request, retry rate, or whether a route succeeds.
Why: request-level randomization gives much more statistical power, especially for high-volume traffic. It is a good fit for isolated serving changes where the main outcome is request-level.
Potential downside: it can be a bad user experience if the same workflow gets mixed treatment.
Mitigation: use it only when the change is invisible or low-risk to the user.
Cluster Randomization
Use for: changes to batching, caching, regional routing, or fleet allocation where the intervention changes a shared capacity pool.
Why: if users share the same capacity pool, individual randomization can break down because spillover effects from the treated group can change the experience of the control group. Randomizing by cluster contains most interference within the randomized unit.
Potential downside: there are fewer independent units, so statistical power is worse. The clusters may also be of different sizes or otherwise look different: different chip generation, geography, user types, latency, customer value, or incident patterns.
Mitigation: stratify before randomization, check pre-period balance, analyze at the cluster level, and report system-level outcomes instead of only the treated users' outcomes.
Switchback Experiment
Use for: cache policy, global routing configuration, fleet allocation, scheduler settings, or any change that has to be turned on for a whole system at once.
Why: if the policy changes queue state or cache state, compare blocks of time under policy A versus policy B rather than pretending individual requests are independent.
Potential downside: demand changes over time. Hour-of-day, day-of-week, incidents, launches, and seasonal traffic can get mixed into the treatment effect. Carryover is also real: cache warmth, queue backlog, retry storms, and provider throttling can persist across blocks.
Mitigation: randomize across comparable time blocks, balance treatment across weekends and weekdays, record incidents, and include washout when cache or queue persistence is material. Blocks must be long enough for the policy to take effect but short enough to avoid conflating treatment with demand cycles.
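A switchback schedule along these lines can be sketched as below. It assumes each day is split into an even number of equal blocks and randomizes the order of policies A and B within consecutive block pairs, which keeps every day balanced. The function names, the seed, and the washout rule are illustrative choices, not a prescribed design.

```python
import random


def switchback_schedule(n_days, blocks_per_day, seed=0):
    """Randomize policy A/B within consecutive block pairs so each day is balanced."""
    if blocks_per_day % 2:
        raise ValueError("blocks_per_day must be even to balance each day")
    rng = random.Random(seed)
    schedule = []
    for day in range(n_days):
        for pair_start in range(0, blocks_per_day, 2):
            order = ["A", "B"]
            rng.shuffle(order)  # random order within the pair
            for offset, policy in enumerate(order):
                schedule.append({"day": day, "block": pair_start + offset, "policy": policy})
    return schedule


def keep_after_washout(minute_in_block, switched, washout_minutes):
    """Drop early observations in a block only when the policy just changed."""
    return (not switched) or minute_in_block >= washout_minutes


schedule = switchback_schedule(n_days=2, blocks_per_day=4, seed=7)
for row in schedule:
    print(row)
```

Pairing adjacent blocks means each policy sees similar hours of the day. The washout filter discards the minutes where cache warmth or queue backlog from the previous policy is still likely to dominate.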
07
Observational Methods
Use them when randomization is unavailable, but be honest about measured versus unmeasured confounding.
Observational work should not oversell the estimator. The real question is whether the historical comparison has the right pre-treatment variables and enough overlap to support the recommendation.
| Method | When to use it | Identification risk |
|---|---|---|
| Outcome regression | Good first pass when the main assignment variables are logged: model, account tier, region, time of day, queue depth, prompt length, and past usage. | A model that predicts retention well can still estimate the treatment effect badly if the assignment story is wrong. |
| Propensity weighting or matching | Useful when treated and untreated units overlap but had different probabilities of treatment. | Extreme weights or poor overlap mean the result depends on a few unusual records. Show overlap and trim if needed. |
| Doubly robust estimation | Stronger default when both assignment and outcome can be modeled. | It still cannot fix missing confounders. Lead with the identification assumption, not the estimator name. |
| Double ML | Useful with rich telemetry and many covariates where flexible models help with nuisance functions. | Better prediction is not better causality by itself. Explain cross-fitting plainly and still show diagnostics. |
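As a small illustration of propensity weighting with an overlap check, the sketch below uses a single discrete pre-treatment stratum (queue state at decision time) and estimates the propensity as the treated share within each stratum. The data, the 0.05 overlap threshold, and the function names are hypothetical, and the result is only valid if that stratum captures all confounding.

```python
from collections import defaultdict
from statistics import mean


def stratum_propensities(rows):
    """Treated share within each stratum; rows are (stratum, treated, outcome)."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for stratum, treated, _ in rows:
        counts[stratum][0] += int(treated)
        counts[stratum][1] += 1
    return {s: t / n for s, (t, n) in counts.items()}


def ipw_ate(rows, min_propensity=0.05):
    """Normalized inverse-propensity-weighted difference in mean outcomes."""
    props = stratum_propensities(rows)
    bad = [s for s, e in props.items() if not (min_propensity <= e <= 1 - min_propensity)]
    if bad:
        raise ValueError(f"poor overlap in strata: {bad}")
    num_t = den_t = num_c = den_c = 0.0
    for stratum, treated, y in rows:
        e = props[stratum]
        if treated:
            num_t += y / e
            den_t += 1 / e
        else:
            num_c += y / (1 - e)
            den_c += 1 / (1 - e)
    return num_t / den_t - num_c / den_c


# Hypothetical logs: premium routing is mostly used when the queue is shallow,
# and shallow queues have better outcomes regardless of routing.
rows = (
    [("low_queue", True, 0.9)] * 8 + [("low_queue", False, 0.8)] * 2
    + [("high_queue", True, 0.5)] * 2 + [("high_queue", False, 0.4)] * 8
)

naive = mean(y for _, t, y in rows if t) - mean(y for _, t, y in rows if not t)
print(round(naive, 2), round(ipw_ate(rows), 2))
```

The naive comparison is inflated because treated requests were mostly served in easy states. Weighting recovers the within-stratum effect here only because the toy data has no unlogged confounder.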
08
Quasi-Experimental Designs
Use rollout timing, thresholds, instruments, or a synthetic comparison only when the assignment story supports it.
These are strongest when the business already created quasi-random variation: a phased rollout, a threshold, a beta invite, or one region/fleet/provider changing before the rest. Pick the design based on the assignment story, then state the assumption that could fail.
| Design | Recommendation | Tradeoff to check |
|---|---|---|
| Difference-in-differences | Use when capacity, routing, pricing, or provider changes roll out to some units before others. | Pre-trends have to look credible. Also check incidents, launches, or demand shifts at the same time. |
| Synthetic control | Use when one region, fleet, provider, or customer tier changes and there is a long pre-period. | A clean post-period gap is not enough. Show pre-period fit, placebo units, and donor weights. |
| Instrumental variables / LATE | Use when assignment nudges treatment but actual take-up is imperfect or self-selected. | The estimate is local to compliers, and the instrument has to affect the outcome through treatment only. |
| Regression discontinuity | Use when eligibility, limits, priority, or pricing changes sharply at a known cutoff. | The result is local to the cutoff. Check bunching, covariate continuity, and bandwidth sensitivity. |
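For the difference-in-differences row, the basic two-group, two-period arithmetic and a placebo check on two pre-periods can be sketched as follows. The unit-level success rates are made up, and a real analysis would add uncertainty estimates and an event-study plot.

```python
from statistics import mean


def did(pre_treated, post_treated, pre_control, post_control):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(post_treated) - mean(pre_treated)
    control_change = mean(post_control) - mean(pre_control)
    return treated_change - control_change


# Hypothetical task-success rates per unit, before and after the rollout.
effect = did(
    pre_treated=[0.70, 0.72], post_treated=[0.78, 0.80],
    pre_control=[0.60, 0.62], post_control=[0.63, 0.65],
)

# Placebo: the same calculation on two periods that both precede the rollout.
# A value far from zero suggests the groups were already diverging.
placebo = did(
    pre_treated=[0.69, 0.71], post_treated=[0.70, 0.72],
    pre_control=[0.59, 0.61], post_control=[0.60, 0.62],
)

print(round(effect, 3), round(placebo, 3))
```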
09
Heterogeneity and Targeting
Use segments to make a decision, not to decorate the analysis.
Heterogeneity analysis should change the allocation rule. If the result does not change who gets capacity, which workload is routed differently, or which customer segment is worth serving differently, it is mostly decoration.
- Start with the average effect. That shows whether there is a real lever worth segmenting.
- Pre-specify operational segments. Good candidates are workload length, customer tier, model family, region, latency sensitivity, and peak versus off-peak demand.
- Use CATE only if targeting is the decision. The segment has to be stable, interpretable, and simple enough to operate.
- Convert lift into net value. A high treatment effect can still be a bad allocation if it uses expensive capacity for low-margin traffic.
- Keep some exploration alive. If the router only exploits the current best segment, future evaluation loses support for alternatives.
| Segment lens | What it helps decide | What to watch |
|---|---|---|
| Workload shape | Whether long-context, batch, real-time, or agentic workflows should receive different routing or limits. | Aggregate usage can hide very different cost and latency profiles. |
| Customer tier | Whether scarce capacity should be prioritized for customers with higher value or stricter SLAs. | Higher value may also come with higher reliability expectations. |
| System state | Whether the policy should change during peak demand, incidents, or low-utilization periods. | The same treatment can be worth it off-peak and too expensive during congestion. |
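Converting lift into net value is simple arithmetic, sketched below with hypothetical segments and numbers. It assumes the lift is a change in success probability per request and that value and incremental cost are expressed in the same currency.

```python
def net_value_per_request(lift, value_per_success, extra_cost):
    """Expected incremental value of the treatment minus its incremental cost."""
    return lift * value_per_success - extra_cost


# Hypothetical per-segment inputs: estimated lift in success probability,
# value of one successful request, and extra compute cost of the treatment.
segments = {
    "agentic": {"lift": 0.05, "value_per_success": 2.00, "extra_cost": 0.04},
    "short_chat": {"lift": 0.08, "value_per_success": 0.20, "extra_cost": 0.03},
}

net = {name: net_value_per_request(**inputs) for name, inputs in segments.items()}
target = sorted(name for name, value in net.items() if value > 0)

for name, value in net.items():
    print(name, round(value, 3))
print(target)
```

The segment with the larger lift is the worse allocation in this toy example, because each success is worth too little to cover the extra compute.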
10
Offline Evaluation of Routing Policies
Use OPE only when the logs actually support the policy you want to evaluate.
Offline policy evaluation is useful for routers, schedulers, ranking policies, and allocation rules, but only if the logs contain enough examples of the actions the new policy wants to take. Check support before estimating anything.
Logged propensities can support short-horizon contextual routing evaluation, especially when the action affects the current request and the relevant state was logged. They are less sufficient for full dynamic-policy evaluation when actions alter future queue and cache state. In that setting, IPS or doubly robust estimates can help screen policies, but live validation is still needed before broad deployment.
- Check support. Did the old policy try the same actions for similar state often enough?
- Check propensities. If action probabilities were logged, IPS and doubly robust methods become more credible.
- Inspect weights. If a few records drive the answer, the offline estimate is too fragile for a broad rollout.
- Validate live. Even a good OPE result should usually become a limited ramp or controlled exploration bucket before full deployment.
| Estimator | When to use it | Identification risk |
|---|---|---|
| Direct method | Quick baseline when rich state and enough outcomes exist for each action. | Low variance, but biased if the outcome model is wrong for rarely chosen actions. |
| IPS | Stochastic routers or experiments where action probabilities are known. | Unstable when weights explode, especially if the new policy chooses actions the old one rarely tried. |
| Doubly robust OPE | Logged propensities plus a reasonable reward model. | Stronger default, but it still needs support, the right state variables, and no major interference. |
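A minimal sketch of IPS with a weight diagnostic, assuming the logging policy's action probabilities were recorded and each request's reward depends only on its own context and action (no interference through queue or cache state). The log records, the context labels, and the target policy are hypothetical.

```python
def ips_readout(logs, target_prob):
    """IPS, self-normalized IPS, and effective sample size for a target policy.

    `logs` holds (context, action, reward, logging_prob) tuples and
    `target_prob(context, action)` is the target policy's action probability.
    """
    weights = [target_prob(ctx, action) / p for ctx, action, _, p in logs]
    rewards = [reward for _, _, reward, _ in logs]
    weighted_sum = sum(w * r for w, r in zip(weights, rewards))
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no support: the logs never took the target policy's actions")
    return {
        "ips": weighted_sum / len(logs),
        "snips": weighted_sum / total_weight,
        "effective_n": total_weight ** 2 / sum(w * w for w in weights),
        "max_weight": max(weights),
    }


def target_prob(context, action):
    """Hypothetical deterministic policy: premium only when the queue is shallow."""
    preferred = "premium" if context == "shallow" else "standard"
    return 1.0 if action == preferred else 0.0


# Hypothetical logs from a router that chose uniformly at random (probability 0.5).
logs = [
    ("shallow", "premium", 1, 0.5), ("shallow", "premium", 1, 0.5),
    ("shallow", "standard", 0, 0.5), ("shallow", "standard", 1, 0.5),
    ("deep", "premium", 0, 0.5), ("deep", "premium", 1, 0.5),
    ("deep", "standard", 1, 0.5), ("deep", "standard", 0, 0.5),
]

print(ips_readout(logs, target_prob))
```

The effective sample size is half the number of log records here, since only the records where the logged action matches the target policy carry weight. When that number is small or the maximum weight is large, the offline estimate is better treated as a screen than as a rollout decision.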
11
Long-Run Outcomes and Short-Run Signals
Use short-term metrics as evidence only after showing why they should predict the delayed outcome.
Short-run metrics are useful, but they should not be treated as business outcomes by default. First state what part of the mechanism they measure, then explain why that mechanism should translate into retention, expansion, or LTV.
Surrogate Ladder
- System metric: TTFT, latency, cache hit rate, error rate, retry rate, or cost per successful request.
- Session outcome: task completion, abandonment, re-run rate, or successful workflow completion.
- User behavior: return rate, deeper usage, broader adoption, or more high-value workflows.
- Business outcome: retention, expansion, renewal, LTV, or margin.
The ladder can break. A faster request is not automatically higher retention, and more usage can still hurt margin if the workload is expensive or low value. A metric that predicts retention is not necessarily a valid surrogate for retention. A valid surrogate claim requires evidence that movement in the short-run metric captures the causal path by which the policy changes the long-run outcome.
If the long-run outcome takes too long to observe, the recommendation should be conditional: ramp only if the mechanism metric improves, the cost and reliability guardrails hold, and early behavior moves in the direction past experiments suggest. For policies expected to affect retention or expansion, keep the smallest durable holdout that can answer the long-run question. If no holdout is possible, label the surrogate risk explicitly rather than pretending the short-run metric proves the whole case.
12
Checks That Change the Decision
Checks should tell you whether to ship, ramp, target, pause, roll back, or instrument more.
A check is not there to show diligence for its own sake. Each one should be able to change the recommendation.
| Check | What to look for | How it changes the recommendation |
|---|---|---|
| Assignment | SRM, eligibility, exposure time, and whether treatment started when the logs say it did. | If assignment is broken, pause the readout. |
| Execution | Intended action, executed action, fallback reason, timeout path, and retry path. | If execution diverged, report ITT and separately analyze the operational failure. |
| Balance and overlap | Balance tables, propensity overlap, pre-trends, or pre-period synthetic fit depending on the design. | If overlap is poor, narrow the estimand or downgrade the claim. |
| System spillover | Untreated-user latency, errors, queue depth, cache state, and utilization. | If spillovers are material, move to cluster or switchback evidence. |
| Economic significance | User value per GPU-hour, cost per successful request, margin impact, and reliability risk. | If the practical effect does not clear the cost bar, do not ship just because the p-value is good. |
| Rollback trigger | Latency above X, error rate above Y, cost per success above Z, or no movement in task completion. | If those triggers fire, pause or roll back even if the average user metric looks fine. In the router example, elevated fallback rates during congestion should stop the ramp even if premium-routed requests improved. |
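Rollback triggers are easiest to honor when they are written down as explicit limits before the ramp starts. The sketch below is one hypothetical encoding: the field names, metric keys, and every threshold value are placeholders, since the actual limits depend on the product and are not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RollbackLimits:
    max_p95_latency_ms: float
    max_error_rate: float
    max_cost_per_success: float
    max_peak_fallback_rate: float
    min_completion_lift: float


def rollback_reasons(metrics: Dict[str, float], limits: RollbackLimits) -> List[str]:
    """Return the list of triggers that fired; an empty list means keep ramping."""
    reasons = []
    if metrics["p95_latency_ms"] > limits.max_p95_latency_ms:
        reasons.append("latency")
    if metrics["error_rate"] > limits.max_error_rate:
        reasons.append("errors")
    if metrics["cost_per_success"] > limits.max_cost_per_success:
        reasons.append("cost_per_success")
    if metrics["peak_fallback_rate"] > limits.max_peak_fallback_rate:
        reasons.append("peak_fallback")
    if metrics["completion_lift"] < limits.min_completion_lift:
        reasons.append("no_completion_movement")
    return reasons


# Placeholder limits and metrics, for illustration only.
limits = RollbackLimits(1200.0, 0.02, 0.15, 0.10, 0.0)
metrics = {
    "p95_latency_ms": 950.0,
    "error_rate": 0.01,
    "cost_per_success": 0.12,
    "peak_fallback_rate": 0.18,  # measured during congestion blocks only
    "completion_lift": 0.03,
}

print(rollback_reasons(metrics, limits))
```

In this example every average-looking metric passes, and the ramp still stops because the fallback rate during congestion exceeds its limit.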
13
Decision Readout Template
A compact structure for a technical decision memo or experiment readout.
State how treatment or allocation was assigned. If randomization is available and spillovers are low, use the right RCT unit. If the policy changes shared capacity, use cluster randomization or a switchback. If the evidence is historical, look for rollout timing, a threshold, an instrument, or enough pre-treatment controls.
Name the main failure mode: confounding, bad controls, noncompliance, spillovers, weak support, or tail-state risk. Mitigate it with balance checks, pre-trends, full-funnel logging, holdouts, cluster-level uncertainty, washout, or controlled exploration.
Close with the decision: ship, ramp, target a segment, keep a holdout, pause, roll back, instrument more, or run a stronger design, based on user lift, system guardrails, and compute cost.
Common Decision Patterns
| Observed situation | Decision pattern | Contingency |
|---|---|---|
| Routed users retained better. | Do not compare routed vs not-routed directly. Ask what drove routing. | If queue state or workload type drove routing, control for pre-treatment state or run an experiment. |
| The policy can be tested safely. | Randomize at the unit that matches the outcome. | If shared capacity matters, move up to cluster or switchback. |
| The policy already rolled out. | Use DiD/event study if the rollout timing gives a credible comparison. | If pre-trends fail or there was a simultaneous incident, do not lean on DiD. |
| There is a cutoff. | Use RDD near the threshold. | Check manipulation and keep the claim local. |
| Only router logs are available. | Use OPE only if propensities and support exist. | If support is weak, recommend controlled exploration. |