Inside the Engine · 14 min read · June 12, 2026

55 Directives Scored: Inside Allocera's Accuracy Measurement Methodology

As of May 2026, Allocera's CDAI engine has issued 56 directives and scored 55 of them through automated 30-day retest. 44 came back correct. 11 came back incorrect. The resulting accuracy is 80 percent, measured rather than modeled, against actual contribution margin outcomes. This is a breakdown of the methodology behind that number, a figure no other marketing analytics platform reports as far as documented practice goes.

By Nick Baum · Founder, Allocera Intelligence

The marketing analytics category does not validate itself. The platforms that issue recommendations — multi-touch attribution tools, marketing mix models, the dashboard-driven advisory features built into HubSpot, Salesforce, and the major CRM stacks — produce numbers and move on. There is no native mechanism that goes back 30 days later and checks whether the number predicted reality. The industry has accepted that absence as the default state.

Allocera was built around the opposite premise. Every directive the CDAI engine issues — SCALE, HOLD, CUT, PAUSE, FLAG — is stored with its pre-directive state. Thirty days later, the engine returns automatically, queries the post-directive contribution margin, classifies the outcome against deterministic rules, and updates an aggregate directive accuracy metric. There is no review step. There is no override. The grade gets recorded whether anybody wants to see it or not.
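The issue-then-retest cycle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Allocera's implementation: the `Directive` record, its field names, the `due_for_retest` helper, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

RETEST_WINDOW = timedelta(days=30)  # the 30-day retest window from the text


@dataclass
class Directive:
    # Hypothetical record; field names are assumptions for illustration.
    directive_id: str
    directive_type: str      # SCALE, HOLD, CUT, PAUSE, or FLAG
    issued_on: date
    pre_margin_pct: float    # contribution margin stored at issuance


def due_for_retest(directives, today):
    """Return directives whose 30-day window has elapsed as of `today`."""
    return [d for d in directives if today - d.issued_on >= RETEST_WINDOW]


# Made-up rows purely to exercise the function; not Allocera data.
directives = [
    Directive("d-001", "SCALE", date(2026, 4, 1), 34.0),
    Directive("d-002", "HOLD", date(2026, 5, 20), 22.0),
]
due = due_for_retest(directives, today=date(2026, 5, 31))
```

Here only `d-001` is old enough to score; `d-002` stays pending, which mirrors the one directive still inside its window in the published counts.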

As of May 2026 the result of that methodology is concrete. 56 directives issued. 55 scored. 44 correct. 11 incorrect. 80 percent measured directive accuracy. One directive remains within its 30-day window and is pending retest.

80% Measured directive accuracy · 55 of 56 directives scored · validated by automated 30-day retest

That number is meaningfully above the 60 to 70 percent directional accuracy typical of multi-touch attribution and marketing mix models. More importantly, it is reported under a different epistemic posture than most marketing accuracy claims. This piece walks through the full methodology: how the scoring works, what the outcome labels mean, why a directive can end up classified inconclusive, what the 80 percent figure actually reveals about the directive logic, and what it does not reveal.

Directive Accuracy: What the Industry Reports vs What Allocera Reports

The distinction matters because the word "accuracy" is used differently across the category.

When a multi-touch attribution platform reports accuracy, the figure is typically a model fit metric — how well the platform's attribution model explains the variance in observed conversions, evaluated against a holdout set or via cross-validation. This is a statistical property of the model. It is not a measurement of whether the platform's recommendations led to better outcomes when acted on. Two different things share the same word.

When Allocera reports 80 percent directive accuracy, the figure is a directive outcome score. Every directive issued by the engine is a specific recommendation — increase budget on this campaign, cut this one, hold this one steady. The directive carries a prediction: if you act on this directive, contribution margin will move in the predicted direction. Thirty days later, the engine measures whether contribution margin actually moved in the predicted direction. The directive is scored CORRECT, INCORRECT, or INCONCLUSIVE on that basis alone.

Model fit is a statistical property. Directive outcome accuracy is whether the engine's recommendations produced the predicted result when acted on. These are different things measured against different ground truths.

Both methodologies are legitimate. They answer different questions. Model fit answers "how well does our model explain the past?" Directive outcome accuracy answers "did our recommendations produce the predicted future when acted on?" For a CFO making capital allocation decisions, the second question is the one that matters. We covered the broader thinking on this distinction in our 30-day retest methodology deep dive.

How the Engine Issues and Scores Each Directive Type

The engine issues directives across five publicly named types — SCALE, HOLD, CUT, PAUSE, FLAG — plus additional directive types for specific risk signals (QUARANTINE for fraud spikes, RENEGOTIATE for partner-payout overages, INVESTIGATE for quality decay). Each type has its own issuance criteria and its own outcome classification logic.

The five primary directive types covered in the scored 55-directive set work as follows:

SCALE

Issued when contribution margin runs at or above 30 percent with no risk signals (stable quality scores, no fraud rate elevation, refund and chargeback rates within normal ranges for the vertical). Confidence score typically lands at 88 percent. Recommended action: increase budget to 1.5x current spend. Scored CORRECT if post-directive contribution margin held or increased. Scored INCORRECT if margin compressed materially after the scale.

HOLD

Issued when contribution margin runs in the 18 to 30 percent range with stable signals and no major risk indicators. Confidence score typically 70 percent — the lowest confidence of the directive set because hold is the engine's "more data needed" call. Recommended action: maintain current spend. Scored CORRECT if margin stayed within plus or minus 5 percent of the pre-directive baseline. Scored INCORRECT if margin moved more than 15 percent in either direction during the retest window.
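As a rough sketch, the HOLD scoring rule can be written as a pure function of the stored baseline and the retest measurement. Two things here are assumptions rather than statements from the methodology: the thresholds are treated as relative change against the baseline (not percentage points), and the band between 5 and 15 percent, which the rule above does not define, is returned as `UNSPECIFIED` instead of being guessed. The function name is hypothetical.

```python
def score_hold(pre_margin: float, post_margin: float) -> str:
    """Score a HOLD directive from pre- and post-directive contribution margin.

    Assumes thresholds are relative changes against the baseline
    (an interpretation; the text does not say relative vs. percentage points).
    """
    if pre_margin == 0:
        raise ValueError("a zero baseline makes relative change undefined")
    change_pct = abs(post_margin - pre_margin) / abs(pre_margin) * 100
    if change_pct <= 5:
        return "CORRECT"
    if change_pct > 15:
        return "INCORRECT"
    # The source does not define the outcome between 5 and 15 percent.
    return "UNSPECIFIED"
```

Keeping the undefined band explicit makes the gap visible rather than silently folding it into one of the two labels.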

CUT

Issued when contribution margin runs below 10 percent, or when the campaign produced spend with no revenue, or when refund rate exceeded 15 percent. Confidence score typically 85 to 90 percent. Recommended action: reduce spend by 50 percent or eliminate entirely. Scored CORRECT if cutting the campaign improved overall portfolio margin or stopped active losses. Scored INCORRECT if margin actually worsened after the cut — which can happen if the campaign was a feeder for higher-quality channels and the cut produced a downstream attribution gap.

PAUSE

Issued on emergency conditions: fraud rate above 20 percent of leads, contribution margin below negative 50 percent, or chargeback rate above 15 percent. Confidence score typically 93 to 97 percent — the highest of the directive set because pause triggers only on unambiguous severe distress signals. Recommended action: emergency stop immediately. Scored CORRECT if pausing eliminated or substantially reduced the fraud or catastrophic margin signal. Scored INCORRECT if the underlying problem persisted past the pause.

FLAG

Issued when cost-per-lead distortion exceeds 50 percent against the seven-cost reconciled baseline, or when the engine detects a data anomaly the underlying directive logic cannot resolve automatically. Confidence score typically 85 to 90 percent. Recommended action: human review required, data integrity issue likely. Scored CORRECT if human review identified and resolved the data issue within the retest window. Scored INCORRECT if the data issue persisted unresolved.

The directive framework as a whole is the engine's primary client-facing output. We covered the directive logic and how it translates reconciled margin into capital allocation actions in our Scale, Hold, Cut, Pause framework breakdown.
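The issuance criteria for the five primary types can be condensed into one decision function. This is a sketch under assumptions: the `CampaignSnapshot` shape and field names are hypothetical, the order in which rules are checked (most severe first) is not stated in the methodology, and margins that fall outside every stated band, such as 10 to 18 percent, return `None` rather than an invented directive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CampaignSnapshot:
    # Hypothetical input shape; all field names are assumptions.
    margin_pct: float              # contribution margin, percent
    fraud_rate_pct: float = 0.0    # share of leads flagged as fraud
    chargeback_rate_pct: float = 0.0
    refund_rate_pct: float = 0.0
    cpl_distortion_pct: float = 0.0
    spend: float = 0.0
    revenue: float = 0.0
    risk_signals: bool = False     # any quality, fraud, or refund warning
    data_anomaly: bool = False


def choose_directive(c: CampaignSnapshot) -> Optional[str]:
    # Precedence (most severe first) is an assumption, not stated in the text.
    if c.fraud_rate_pct > 20 or c.margin_pct < -50 or c.chargeback_rate_pct > 15:
        return "PAUSE"
    if c.cpl_distortion_pct > 50 or c.data_anomaly:
        return "FLAG"
    if c.margin_pct < 10 or (c.spend > 0 and c.revenue == 0) or c.refund_rate_pct > 15:
        return "CUT"
    if c.margin_pct >= 30 and not c.risk_signals:
        return "SCALE"
    if 18 <= c.margin_pct < 30 and not c.risk_signals:
        return "HOLD"
    return None  # e.g. the 10 to 18 percent band, which the text leaves open
```

For example, `choose_directive(CampaignSnapshot(margin_pct=35.0))` yields `"SCALE"`, while the same margin with a 25 percent fraud rate yields `"PAUSE"` because the emergency check runs first.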

The Math Behind the 80 Percent Directive Accuracy Figure

The directive accuracy calculation is deliberately simple. The complexity is in the outcome classification rules, not the aggregate math.

Accuracy Calculation

Accuracy % = (Correct Directives ÷ Total Scored Directives) × 100

Current state (May 2026):

Total Directives Issued = 56

Directives Scored (30+ days old) = 55

Pending Retest = 1

Inconclusive (excluded from calculation) = 0

Correct = 44

Incorrect = 11

Accuracy = 44 ÷ 55 × 100 = 80%

Metric	Basis	Value
Total Directives Issued		56
Directives Scored (reached 30-day mark)		55
Directives Pending Retest	Within 30-day window	1
Scored CORRECT	Predicted outcome matched actual	44
Scored INCORRECT	Predicted outcome did not match actual	11
Scored INCONCLUSIVE	Excluded (see methodology)	0
Measured Directive Accuracy	44 ÷ 55 × 100	80%

The inconclusive count of zero on this scored set is notable. Inconclusive outcomes exist as a category — they apply when the engine cannot fairly score a directive (campaign paused by the client before the 30-day window, insufficient post-directive data, external platform outage contaminating the measurement). On the 55-directive scored set, none of those conditions applied in a way that prevented scoring. Every directive was either confirmed correct or incorrect against actual outcomes.
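The aggregate arithmetic is small enough to show directly. The sketch below uses the published May 2026 counts; the function name and the list-of-labels input format are assumptions made for illustration.

```python
from collections import Counter


def directive_accuracy(outcomes):
    """Percent CORRECT among scored outcomes, excluding INCONCLUSIVE."""
    counts = Counter(outcomes)
    scored = counts["CORRECT"] + counts["INCORRECT"]
    if scored == 0:
        return None  # nothing scorable yet
    return counts["CORRECT"] / scored * 100


# Counts reported for May 2026: 44 correct, 11 incorrect, 0 inconclusive.
outcomes = ["CORRECT"] * 44 + ["INCORRECT"] * 11
accuracy = directive_accuracy(outcomes)
```

Because INCONCLUSIVE labels are dropped from both numerator and denominator, appending any number of them to `outcomes` leaves the result unchanged.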

The Methodology Properties That Matter

The 80 percent figure sits on top of several methodological choices that determine what the number actually means. Four of them matter most.

Deterministic Outcome Rules

The outcome classifications — CORRECT, INCORRECT, INCONCLUSIVE — are deterministic functions of stored pre-directive state and current post-directive measurement. There is no human review step that could move a borderline outcome from INCORRECT to INCONCLUSIVE. There is no model output that could inflate the result by changing the classification criteria after the fact. The rules are set at directive issuance and applied at retest without modification.

Immutable Pre-Directive State

When a directive issues, the engine records the pre-directive contribution margin, the underlying cost-layer breakdown, the campaign's quality signals, and the specific reason codes that triggered the directive. That state is locked into the database and cannot be modified or overwritten. The 30-day retest measures against the recorded state, not against a state that could have been adjusted retroactively to make the directive look better.
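An in-memory analogue of that locked record is a frozen dataclass. This is only an illustration of the idea: the class and field names are hypothetical, the values are made up, and a production system would more plausibly enforce immutability at the database layer than in application objects.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Tuple


@dataclass(frozen=True)
class PreDirectiveState:
    # Hypothetical snapshot; field names are assumptions for illustration.
    margin_pct: float
    cost_layers: Tuple[float, ...]   # cost-layer breakdown at issuance
    quality_score: float
    reason_codes: Tuple[str, ...]


# Made-up values purely to exercise the class; not Allocera data.
state = PreDirectiveState(
    margin_pct=31.5,
    cost_layers=(1200.0, 300.0, 150.0),
    quality_score=0.92,
    reason_codes=("MARGIN_AT_OR_ABOVE_30", "NO_RISK_SIGNALS"),
)

try:
    state.margin_pct = 40.0   # any attempt to rewrite the baseline...
    modified = True
except FrozenInstanceError:
    modified = False          # ...is rejected
```

Tuples are used for the nested fields so the breakdown and reason codes cannot be mutated in place either.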

The Engine Only Issues When It Can Stand Behind the Call

The CDAI engine refuses to issue directives when the underlying data does not support them. If incoming data is stale, or campaign attribution is missing, the engine's health monitor sets a single boolean — directive_safe = FALSE — and no directives issue. This behavior is independently verified across two real businesses in our published validation case study. The implication for the directive accuracy metric: the 80 percent figure is the directive accuracy of a curated set of high-confidence calls — the directives the engine had enough data to issue with conviction. It is not the accuracy the engine would achieve if it issued recommendations on every campaign regardless of data quality. The metric is structurally pessimistic compared to platforms that issue confident-looking outputs on incomplete data.
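The gating behavior can be sketched as a single boolean check that runs before any directive is emitted. The 24-hour staleness limit, the function names, and the input parameters are assumptions for illustration; the text specifies only that stale data or missing attribution sets `directive_safe` to false.

```python
from datetime import datetime, timedelta

MAX_DATA_AGE = timedelta(hours=24)  # hypothetical staleness limit


def directive_safe(last_sync: datetime, attribution_present: bool,
                   now: datetime) -> bool:
    """Single boolean gate: False when data is stale or attribution is missing."""
    fresh = now - last_sync <= MAX_DATA_AGE
    return fresh and attribution_present


def maybe_issue(candidates, last_sync, attribution_present, now):
    """Issue nothing at all unless the health gate passes."""
    if not directive_safe(last_sync, attribution_present, now):
        return []
    return list(candidates)
```

The gate is all-or-nothing by design in this sketch: a failed check suppresses every candidate directive rather than filtering them one by one.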

Inconclusive Outcomes Are Excluded From Both Numerator and Denominator

When the engine cannot fairly score a directive (the campaign was paused before the 30-day window completed, post-directive data volume was insufficient, or an external factor contaminated the measurement), the outcome is labeled INCONCLUSIVE and excluded from both sides of the directive accuracy calculation. This prevents the engine from being scored wrong for outcomes it could not have predicted, and it prevents the engine from being credited correct for outcomes that happened by chance. The directive accuracy figure reflects only the directives where a fair measurement was possible.
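One way to picture the exclusion step is a guard that runs before scoring and names the reason a directive cannot be graded. The function name, its parameters, and the minimum-volume constant are hypothetical; the text lists the three conditions but gives no numeric floor for "insufficient" data.

```python
from typing import Optional

MIN_POST_EVENTS = 30  # hypothetical volume floor; the text gives no number


def inconclusive_reason(paused_by_client_early: bool,
                        post_events: int,
                        external_outage: bool) -> Optional[str]:
    """Return why a directive cannot be fairly scored, or None if it can."""
    if paused_by_client_early:
        return "campaign paused before the 30-day window completed"
    if post_events < MIN_POST_EVENTS:
        return "insufficient post-directive data"
    if external_outage:
        return "external factor contaminated the measurement"
    return None
```

A directive with a non-`None` reason would be labeled INCONCLUSIVE and skipped by the accuracy calculation; everything else proceeds to CORRECT or INCORRECT.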

The 80 percent figure is the directive accuracy of a curated set of high-confidence calls. The engine only issues when it can stand behind the call. That is what makes the number conservative, not generous.

What the 11 Incorrect Directives Reveal About the Methodology

Eleven directives in the scored set came back incorrect. The methodology question worth examining is not which specific campaigns they were — that's not the level of analysis a methodology breakdown operates at — but what categories of incorrect calls a directive engine can produce, and what the engine does about them.

Pattern 1: Correlated Confounders the Engine Did Not See

A SCALE directive can come back incorrect when the engine's pre-directive analysis missed a correlated factor — a seasonality shift, a competitive ad spend change, a quality decay signal too subtle to cross the engine's threshold at issuance. The directive was right based on the data the engine had. The data the engine had was incomplete relative to what the campaign environment actually contained 30 days later. This is the most common pattern.

Pattern 2: Inferential Limits of Lookback Data

The engine's pre-directive analysis works on the data available at the moment of issuance. For some campaign types — long sales cycles, multi-touch conversion paths, seasonal verticals — 30 days of lookback may not capture the underlying performance pattern. The retest, running 30 days forward, can land in a different regime than the pre-directive baseline. The engine's response to this pattern: progressive recalibration of confidence scores per directive type per vertical as the historical dataset accumulates.

Pattern 3: Threshold Edge Cases

HOLD directives, with the lowest confidence score in the directive set (typically 70 percent), are the most likely to come back incorrect simply because they apply in the middle band of the contribution margin range — the zone where small shifts in the underlying cost stack can move the campaign out of HOLD territory in either direction. The engine accepts this. A HOLD directive's role is not to be highly predictive — it is to flag the campaigns where the engine does not have enough confidence to recommend a directional move.

The aggregate directive accuracy figure feeds back into the engine's confidence calibration. As the historical dataset grows, the relationship between confidence score and measured outcome accuracy gets tighter. A 90 percent confidence SCALE directive in the engine's framework should produce the predicted result more frequently than a 70 percent confidence HOLD directive — and the retest mechanism is what makes that calibration empirical rather than theoretical.
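A calibration check of this kind amounts to grouping scored directives by their stated confidence and comparing that number with the observed hit rate. The sketch below is illustrative only: the function name and record format are assumptions, and the sample records are made up to exercise the code, not drawn from the scored set.

```python
from collections import defaultdict


def calibration_table(records):
    """Group (confidence_pct, outcome) pairs and compare stated vs. observed.

    Returns {confidence_pct: (n_scored, observed_accuracy_pct)}.
    """
    buckets = defaultdict(lambda: [0, 0])  # confidence -> [correct, scored]
    for confidence_pct, outcome in records:
        if outcome == "INCONCLUSIVE":
            continue  # excluded, as in the aggregate metric
        buckets[confidence_pct][1] += 1
        if outcome == "CORRECT":
            buckets[confidence_pct][0] += 1
    return {
        conf: (scored, correct / scored * 100)
        for conf, (correct, scored) in sorted(buckets.items())
    }


# Made-up records purely to exercise the function; not Allocera data.
records = (
    [(90, "CORRECT")] * 9 + [(90, "INCORRECT")]
    + [(70, "CORRECT")] * 7 + [(70, "INCORRECT")] * 3
)
table = calibration_table(records)
```

A well-calibrated engine would show observed accuracy close to the stated confidence in each bucket, with the gap narrowing as the per-bucket sample grows.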

What This Directive Accuracy Methodology Says About the Category

The 30-day retest is, as far as documented practice goes, the only automated self-validation methodology in production in the marketing analytics category. HubSpot's attribution reports do not retest themselves. Salesforce Marketing Cloud's campaign influence reporting does not retest itself. The major attribution platforms — Rockerbox, Northbeam, Triple Whale, ProfitMetrics — surface attribution models but do not measure whether their attributed values predicted closed revenue 30 days later. We compared the major attribution platforms in detail in our analysis of Triple Whale vs Rockerbox vs Allocera and the architectural gap with Salesforce specifically in our Allocera vs Salesforce Marketing Cloud comparison.

This is not because the other platforms could not build self-validation. It is because the category has not historically required it. Marketing analytics has been graded on dashboard sophistication, integration breadth, and attribution model flexibility — not on whether the platform's recommendations produced the outcomes they predicted when acted on. That standard was sufficient for a market in which marketing was treated as an expense category with directional ROI. It is not sufficient for a market in which CFOs are asking marketing teams to defend capital allocation decisions against the same scrutiny applied to any other operational deployment.

What the Number Will Do Over Time

The published 80 percent figure reflects the engine's current state. It is not a target. It is not a marketing claim. It is the result of the methodology applied to the directives the engine has actually issued and scored.

As Allocera's CDAI engine continues to issue directives across additional client engagements, the dataset grows and the directive accuracy figure updates in real time. Several things follow from that:

The figure will move. If the next 100 directives produce a different accuracy rate, the published figure will move to reflect that. There is no methodology adjustment that would suppress an unfavorable result.
Per-directive-type accuracy will become reportable. At the current sample size, the engine reports a single aggregate accuracy. As the per-type sample sizes grow, the engine will report accuracy by directive type — SCALE accuracy versus HOLD accuracy versus CUT accuracy — and per-vertical breakdowns will become statistically meaningful.
Confidence calibration will tighten. The relationship between the engine's pre-directive confidence score and the post-directive outcome will be measurable empirically. A 90 percent confidence directive that comes back correct 90 percent of the time across a large sample is the calibration the methodology is converging on.
The methodology itself is the moat. Any analytics platform can publish a number. Few publish a number that is measured against subsequent outcomes by a methodology the platform itself does not control. The 30-day retest gate is automated, deterministic, and immutable. That posture is the differentiator.
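How far the figure could plausibly move at the current sample size can be gauged with a standard Wilson score interval on 44 correct out of 55 scored. This is a generic statistical sketch, not part of the published methodology, and it assumes the directives are independent trials, which may not hold when several directives touch the same account.

```python
import math


def wilson_interval(correct: int, scored: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95 percent)."""
    if scored == 0:
        raise ValueError("no scored directives")
    p = correct / scored
    denom = 1 + z**2 / scored
    center = (p + z**2 / (2 * scored)) / denom
    half = z * math.sqrt(p * (1 - p) / scored + z**2 / (4 * scored**2)) / denom
    return center - half, center + half


low, high = wilson_interval(44, 55)
```

Under those assumptions the interval runs from roughly 68 to 88 percent, which is why per-type and per-vertical breakdowns need larger samples before they carry much weight.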

The Question This Methodology Forces

For any marketing analytics platform currently in use, the diagnostic question raised by the 30-day retest methodology is straightforward:

For the recommendations or attributed values your platform produced 30 days ago, what percentage turned out to be correct against the actual outcomes that followed — measured against pre-recommendation state, scored automatically, reported with the same audit trail you would expect from a financial system?

If the platform cannot answer that question — and almost none of them can, because the architecture to answer it was never built — then the platform's outputs are in the gut-feel layer. They are produced. They are not measured. The next decision made on those outputs will be made on the same unvalidated basis as the last one. We covered the underlying calculation methodology that produces the retestable outputs in our guide to calculating marketing contribution margin and the foundational dashboard-versus-reality framing in our True CAC analysis.

Allocera's CDAI engine reconciles all seven cost layers nightly, issues directives with confidence scoring and reason codes on every campaign whose data passes the health monitor, and validates itself through the automated 30-day retest documented above. The methodology is operational. The 80 percent measured accuracy is current as of May 2026. The dataset will grow. The number will move with it. That is the difference between an analytics claim and a measured analytics result.

See a Directive Sheet With Accuracy Lineage

A 30-day distortion audit reconciles your campaign data across all seven cost layers and delivers a directive for every active campaign within seven days — each one with a confidence score, reason codes, and the retest methodology that will score it 30 days later. $2,500. If we don't surface margin distortion you weren't tracking, you don't pay.
