Methodology

The AI ROI Score, explained: how we grade investments from 0 to 150

Eric Avery · Founder & CEO · May 5, 2026 · 6 min read

Most AI ROI numbers do not survive a serious finance review. They are produced by the vendor, framed by the vendor, and quietly retired the moment a CFO asks where each input came from. We built the Umbral Intelligence ROI Score because the question deserves a defensible answer — one a board can read in ten seconds and an auditor can pressure-test for an afternoon. This piece walks through what the score is, what it measures, and why we deliberately chose a 0-to-150 range instead of the more familiar 0-to-100.

The score is a single number that summarizes how a portfolio of AI investments is performing against the cost of those investments. One hundred is the anchor. At one hundred, the realized value of an AI program — measured against the same period's all-in spend on it — exactly equals the cost. The math is unambiguous: at 100, you have recovered what you paid. The reason boards care about this anchor is that it is the only number in an AI conversation that does not require interpretation. It is the breakeven line, full stop.
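For intuition, here is a minimal sketch of that anchor, assuming the score were a simple value-to-cost ratio capped at the ceiling of the scale. The real score is a weighted composite (more on that below), so treat this strictly as an illustration of the breakeven math, not the production formula.

```python
def illustrative_score(realized_value: float, all_in_spend: float) -> float:
    """Illustrative only: map a value-to-cost ratio onto the 0-to-150 scale.

    The published score is a weighted composite, not a raw ratio; this sketch
    just shows why 100 is the breakeven anchor and 150 the ceiling.
    """
    if all_in_spend <= 0:
        raise ValueError("all-in spend must be positive")
    return min(150.0, max(0.0, 100.0 * realized_value / all_in_spend))

# At breakeven, realized value equals spend and the score is exactly 100.
assert illustrative_score(1_200_000, 1_200_000) == 100.0
```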

Everything up to 100 is a story of cost recovery, and we narrate it in four bands. A score in the 0-to-49 range reads as below cost recovery — spend is exceeding realized value, and the portfolio is, in finance terms, underwater. The 50-to-79 band reads as partial recovery: there is real value being created, but the gap to breakeven is meaningful and worth a conversation about adoption, feature usage, or scope. The 80-to-99 band reads as near full recovery — most of the cost has been recouped, and a small lift in adoption or impact would clear the line. At exactly 100, the program has reached full ROI recovery: savings equal investment.

Above 100 is where the design choice matters. A 0-to-100 scale collapses every successful AI investment into a single ceiling. A program that returned its cost and a program that returned three times its cost both map to 100, which is useful for nobody. So we extended the scale to 150 to represent exceeding recovery — the band where a portfolio is meaningfully outperforming its cost basis. One hundred and fifty is not infinite upside; it is a deliberately bounded ceiling that says, in plain language, this investment is paying back at a rate well past breakeven. The board can see at a glance not only that the program cleared the bar, but how comfortably.
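Reading a score into its band is mechanical. The sketch below encodes the thresholds described above; it is a reader's aid, not platform code.

```python
def band(score: float) -> str:
    """Translate a 0-to-150 ROI Score into the band language used in this post."""
    if not 0 <= score <= 150:
        raise ValueError("score must be between 0 and 150")
    if score < 50:
        return "below cost recovery"
    if score < 80:
        return "partial recovery"
    if score < 100:
        return "near full recovery"
    if score == 100:
        return "full ROI recovery"
    return "exceeding recovery"

print(band(72))   # partial recovery
print(band(118))  # exceeding recovery
```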

Underneath the single number is a composite. We do not publish the exact weights — they are a calibration we tune against research and customer outcomes — but the conceptual structure is straightforward, and it is the same structure on every customer's dashboard. The score is built from six categories of value. Time savings: hours that come back to the workforce because a tool absorbed work that previously consumed them. Velocity: cycle times that compress, releases that ship faster, decisions that close sooner. Quality: error rates that drop, defects that get caught earlier, accuracy that improves. Cost avoidance: spend that did not happen because a tool replaced a process or deferred a hire. Revenue impact: pipeline that converted faster, deals that closed sooner, accounts that expanded. Risk reduction: compliance signals that strengthened, security posture that improved, audit exposure that fell.
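To make that structure concrete without publishing the calibration, here is a sketch of the composite's shape. The six category names are the ones above; the weights are placeholders invented for this post, not the tuned values, and each category input is assumed to already sit on the 0-to-150 scale.

```python
# Placeholder weights for illustration only -- not the calibrated values.
ILLUSTRATIVE_WEIGHTS = {
    "time_savings":   0.30,
    "velocity":       0.20,
    "quality":        0.15,
    "cost_avoidance": 0.15,
    "revenue_impact": 0.15,
    "risk_reduction": 0.05,  # a tiebreaker, not a driver
}

def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted average of six category scores, each already on the 0-to-150 scale."""
    return round(
        sum(w * category_scores[name] for name, w in ILLUSTRATIVE_WEIGHTS.items()),
        1,
    )
```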

Each of these categories is anchored to a different kind of evidence. Time savings come from adoption telemetry crossed with role-level baselines — how many hours of which kind of work the tool actually absorbed, not how many hours the vendor claims it should have. Velocity and quality come from the engineering and operational systems your teams already run. Cost avoidance is reconciled against headcount and procurement records. Revenue impact is traced through the CRM motion the tool was meant to influence. Risk reduction is the most conservative of the six and contributes the least; we treat it as a tiebreaker rather than a driver, because the data is harder to verify under audit.

The word that does the most work in our methodology is conservative. A lot of published research on AI productivity reports headline effects under conditions that do not hold in most enterprises — strong onboarding, motivated cohorts, narrow task scopes. We discount aggressively. For every category above, we model a distribution rather than a point estimate, and the score uses the central estimate, not the optimistic ceiling. The optimistic ceiling exists in the platform — finance leaders can see it when they want to stress-test a renewal — but it is never the number on the front page. The goal is not the largest defensible number. The goal is the number a finance leader could put in front of a skeptical auditor without flinching.
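In practice that means each category input is a modeled distribution, and the headline number takes the central estimate. A minimal sketch of the idea, assuming simulated outcomes for a single category and using the 90th percentile as a stand-in for the optimistic ceiling:

```python
import statistics

def central_estimate(simulated_outcomes: list[float]) -> float:
    """The number that feeds the score: the median of the modeled distribution."""
    return statistics.median(simulated_outcomes)

def optimistic_ceiling(simulated_outcomes: list[float]) -> float:
    """A 90th-percentile view, available for stress-testing but never the headline."""
    return statistics.quantiles(simulated_outcomes, n=10)[-1]
```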

Research-adjusted means something specific in our context. The role-level baselines that anchor time savings, velocity, and cost avoidance are derived from the empirical literature on workforce automation and task-level AI impact. We update those baselines as the research evolves, and the platform is explicit about which baseline informs which input. When a customer's score moves because we updated a research input, the dashboard says so. When a score moves because the customer's adoption rose or fell, the dashboard says that too. The score is not a black box; it is a ledger.

A natural question is what makes a score change over time. The fastest movers are adoption depth and feature usage. A tool that is licensed to a thousand seats but actively used by two hundred is, in the language of the score, mostly cost. As adoption deepens — more users, more features, more persistence in month six versus month one — the realized value catches up to the spend, and the score climbs. The slowest movers are the structural ones: revenue impact and risk reduction take time to verify and are weighted accordingly. A program that is moving the score primarily through revenue and risk is a program that is delivering durable value, even if the climb is gradual.
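A back-of-the-envelope example of the adoption point, with hypothetical per-seat figures: a thousand licensed seats, two hundred active users, and a conservative estimate of realized value per active user.

```python
# Hypothetical figures, for illustration only.
licensed_seats = 1_000
active_users = 200
annual_cost_per_seat = 600            # license cost, hypothetical
annual_value_per_active_user = 1_200  # conservative realized value, hypothetical

spend = licensed_seats * annual_cost_per_seat        # 600,000
value = active_users * annual_value_per_active_user  # 240,000

# Using the same illustrative ratio-to-score mapping as earlier in the post:
print(min(150.0, 100.0 * value / spend))  # 40.0 -- below cost recovery until adoption deepens
```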

Equally important is what a low score actually says. It does not say the investment was a mistake. It says, with specificity, where the gap is. A score of 72 with strong time savings and weak adoption is a different conversation than a score of 72 with weak time savings and strong adoption. The first is a rollout problem. The second is a fit problem. The platform decomposes the score into its categories on every page, so the conversation that follows is never about whether the number is right — it is about which lever to pull next.

We chose to expose a single composite number because boards do not read dashboards; they read summaries. A CFO who has thirty seconds to assess a portfolio of AI investments needs an entry point — one number that frames the conversation that follows. The Umbral Intelligence ROI Score is that entry point. It is not the whole story, and the platform is built to surface the rest of the story the moment anyone asks. But the score itself is the thing that turns an AI portfolio review from a narrative into a measurement, and that is the work we set out to do.

If you are running an AI portfolio today and the most precise number you can produce for it came from a vendor deck, you are operating on faith. The score is a way to operate on evidence instead. Defensible, conservative, research-adjusted, and bounded. A number a board can trust, and an auditor can verify. That is the bar we set, and the bar the platform is built to clear.