The 30-Day Retest: How an Accuracy-Scored Engine Beats Gut-Feel | Allocera Intelligence

Every marketing analytics tool tells you what to do. None of them — not HubSpot, not Salesforce, not Rockerbox, not Triple Whale — measure whether they were right. Allocera's CDAI engine retests every directive 30 days after it issues, scoring itself CORRECT or INCORRECT against actual contribution margin outcomes. Here is the methodology, the math, and the 80 percent directive accuracy result.

The standard marketing analytics deal works like this. The tool surfaces data. The marketing team interprets the data. Decisions get made — scale this campaign, cut that one, hold the other. Thirty days later, the team looks at new data and makes new decisions. The cycle repeats.

What is structurally missing from that cycle is the loop back. Was the decision made 30 days ago correct? Did the SCALE actually produce higher margin? Did the CUT actually preserve capital? The platform that surfaced the data does not check. The team that made the call does not formally retest. The decision becomes ambient knowledge — somebody remembers it worked, somebody else thinks it didn't, and the next decision gets made on the same incomplete feedback loop.

This is the gut-feel layer. Even sophisticated marketing operations run on it. The reason is structural: the tools that issue recommendations were never built to score themselves. The recommendation is the output. What happens after the recommendation is somebody else's problem.

Every marketing tool tells you what to do. None of them go back 30 days later and check whether they were right.

Allocera was built around the opposite premise. Every directive the CDAI engine issues — SCALE, HOLD, CUT, PAUSE, FLAG — is recorded with its pre-directive state. Thirty days later, the engine automatically returns to that directive and measures, against actual contribution margin outcomes, whether the call was right. The result is stored. The accuracy score is updated. The engine that gave the advice is the engine that audits the advice, and the audit happens whether anybody wants to see the result or not.

This piece walks through the full methodology — why 30 days, how the retest works mathematically, what the outcome labels mean, why the engine refuses to score on incomplete data, and what the 80 percent measured directive accuracy result means for capital allocation decisions.

Why Directive Accuracy Is Missing From Marketing Analytics

The category never developed it. Multi-touch attribution platforms emerged in the late 2000s and early 2010s, focused on solving the "which touchpoint gets credit" problem. Marketing mix modeling came earlier and focused on top-down channel-level effects. Both categories produced numbers. Neither category, by design, validated its own numbers against subsequent outcomes.

The reasons are partly historical and partly practical. Historically, the attribution wars were about competing models — first-touch versus last-touch versus linear versus algorithmic — and the validation conversation was about choosing the right model rather than measuring whether the model produced correct calls. Practically, the post-recommendation reality lives in different systems than the attribution platforms — finance, payment processors, CRMs, partner ledgers — and joining that data back to the recommendation has historically required custom engineering nobody wanted to build.

The result is the industry's default posture: directional accuracy. Multi-touch attribution and marketing mix models are widely understood to operate at 60 to 70 percent directional accuracy when measured against closed revenue. That number is well-documented across the analytics literature and the practitioner community. What is less documented: most tools claiming any accuracy figure do not actually validate it post-decision. The accuracy is a model prediction, not a measured outcome.

60–70%: industry-standard directional accuracy for multi-touch attribution and marketing mix models

Allocera takes a different posture. Accuracy is measured, not claimed. The result is stored in the database with auditable lineage back to the directive that produced it. We covered the underlying engine architecture in our analysis of why dashboard CPL is structurally incomplete and the calculation methodology in our guide to calculating marketing contribution margin. This piece is the deep dive on the retest itself.

Why 30 Days

Thirty days is a deliberate choice, not a default. It is long enough to let the directive's effect play out across most variables that matter and short enough to act on before market conditions drift materially.

Within 30 days of a SCALE or CUT directive, the following have either resolved or have enough data to be measurable:

  • Volume changes from budget adjustments have flushed through the campaign — usually within 5 to 10 days for active campaigns, faster for high-velocity verticals
  • Quality signals from new lead volume — refund rate, chargeback rate, close rate — have produced enough data to compare against the pre-directive baseline
  • Refund cycles have begun to land in the data — most verticals see the first refund wave within 14 to 21 days of the original conversion, well within the 30-day window
  • Partner payouts and broker reconciliations have closed for the period, allowing the seven-cost stack to be reconciled with full data
  • Platform spend has settled to a new equilibrium at the adjusted budget level, removing the noise of the transition period

Thirty days is also short enough that the market conditions that produced the original directive are still recognizable. Macroeconomic shifts, seasonal swings, competitive ad spend changes — these effects compound over 60 to 90 days and start to confound the retest. At 30 days, the comparison is still apples-to-apples in most verticals. At 90 days, it usually isn't.

Some directive types have shorter natural windows. A PAUSE directive issued on a fraud spike resolves within 48 to 72 hours of execution — the fraud either stops or doesn't, and the engine can score the outcome much faster than 30 days. Conversely, some long-cycle verticals — solar, clinical trials, complex B2B — produce delayed revenue signals that argue for longer retest windows. The 30-day cycle is the default. The engine supports configurable retest windows where the vertical requires them.
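
A configurable window of this kind could be expressed as a simple lookup. The sketch below is illustrative only: the names `DIRECTIVE_WINDOWS`, `VERTICAL_WINDOWS`, and `retest_due` are hypothetical, and the override values (3 days for PAUSE, 60 days for long-cycle verticals) are assumptions chosen for the example, not the engine's actual configuration.

```python
from datetime import date, timedelta
from typing import Optional

DEFAULT_WINDOW_DAYS = 30  # the default retest cycle described above

# Hypothetical overrides; the article does not publish the real values.
DIRECTIVE_WINDOWS = {"PAUSE": 3}  # fraud pauses resolve within 48 to 72 hours
VERTICAL_WINDOWS = {"solar": 60, "clinical_trials": 60, "complex_b2b": 60}


def retest_due(issued: date, directive: str, vertical: Optional[str] = None) -> date:
    """Return the date on which a directive becomes eligible for retest."""
    if directive in DIRECTIVE_WINDOWS:
        # A directive-level override takes priority over the vertical.
        days = DIRECTIVE_WINDOWS[directive]
    else:
        # Fall back to the vertical's window, then to the 30-day default.
        days = VERTICAL_WINDOWS.get(vertical, DEFAULT_WINDOW_DAYS)
    return issued + timedelta(days=days)


issued = date(2026, 1, 1)
print(retest_due(issued, "SCALE"))           # default 30-day window
print(retest_due(issued, "PAUSE"))           # short fraud-pause window
print(retest_due(issued, "SCALE", "solar"))  # long-cycle vertical
```

Giving the directive-level override priority is a design choice made for this sketch; a real configuration might resolve the conflict differently.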

How the Retest Actually Works

The retest is a five-step automated process that runs nightly against every directive that has reached its 30-day mark. There is no manual review step. There is no human override. The outcome is deterministic based on stored state and current data.

  1. Directive issued. Engine evaluates the campaign across the full seven-cost stack and issues one of the directive types (SCALE, HOLD, CUT, PAUSE, FLAG, plus the additional directive types for specific risk signals). The directive is stored with campaign ID, action, confidence score, reason codes, and the pre-directive contribution margin percentage.
  2. Recorded. The full pre-directive state is locked into the directive_events table. This is the baseline the retest will measure against. It cannot be modified or overwritten.
  3. 30-day wait. The engine does not re-evaluate or modify the directive during the wait period. The directive lives. The campaign runs. Real-world outcomes accumulate.
  4. Retest. Thirty days after issue, the engine queries post-directive contribution margin for the campaign, calculates the delta against the pre-directive baseline, and applies the outcome classification rules for the directive type.
  5. Outcome scored. The result — CORRECT, INCORRECT, or INCONCLUSIVE — is stored in the directive_outcomes table along with the pre-CM%, post-CM%, CM delta, and the accuracy score. The aggregate directive accuracy metric for the engine updates automatically.
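
The five steps above can be sketched as two immutable records and a retest function. This is a minimal illustration, not the engine's code: the table names `directive_events` and `directive_outcomes` come from the text, but every field name, the `is_due` and `retest` helpers, and the toy classifier in the usage lines are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Tuple


@dataclass(frozen=True)  # frozen: the recorded baseline cannot be modified
class DirectiveEvent:
    """One row of the directive_events table (field names are hypothetical)."""
    campaign_id: str
    action: str                   # e.g. "SCALE", "HOLD", "CUT", "PAUSE", "FLAG"
    confidence: float
    reason_codes: Tuple[str, ...]
    pre_cm_pct: float             # pre-directive contribution margin %
    issued_on: date


@dataclass(frozen=True)
class DirectiveOutcome:
    """One row of the directive_outcomes table (field names are hypothetical)."""
    campaign_id: str
    pre_cm_pct: float
    post_cm_pct: float
    cm_delta: float               # percentage points, post minus pre
    label: str                    # "CORRECT", "INCORRECT", or "INCONCLUSIVE"


def is_due(event: DirectiveEvent, today: date, window_days: int = 30) -> bool:
    """Step 3: a directive is left alone until its window has elapsed."""
    return today >= event.issued_on + timedelta(days=window_days)


def retest(event: DirectiveEvent, post_cm_pct: float,
           classify: Callable[[str, float], str]) -> DirectiveOutcome:
    """Steps 4 and 5: compute the CM delta and store the classified outcome."""
    delta = post_cm_pct - event.pre_cm_pct
    return DirectiveOutcome(event.campaign_id, event.pre_cm_pct,
                            post_cm_pct, delta, classify(event.action, delta))


# Usage with a placeholder classifier (the per-directive rules are separate).
event = DirectiveEvent("cmp-1", "SCALE", 90.0, ("margin_up",), 18.0, date(2026, 1, 1))
if is_due(event, date(2026, 1, 31)):
    outcome = retest(event, 22.0, lambda action, d: "CORRECT" if d > 0 else "INCORRECT")
    print(outcome.label, outcome.cm_delta)
```

Passing the classifier in as a function keeps the retest mechanics separate from the per-directive outcome rules, which is one reasonable way to keep the outcome deterministic given stored state and current data.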

The retest formula at its simplest:

Accuracy Calculation
Accuracy % = (Correct Directives ÷ Total Scored Directives) × 100
Outcome Classification: based on directive type and CM% delta
Inconclusive: excluded from numerator and denominator
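
As a minimal sketch of that calculation, assuming outcomes are available as a list of label strings (the function name `directive_accuracy` is hypothetical, and the labels in the usage lines are made up for illustration):

```python
from collections import Counter
from typing import Iterable, Optional


def directive_accuracy(labels: Iterable[str]) -> Optional[float]:
    """Accuracy % over scored directives; INCONCLUSIVE labels are ignored."""
    counts = Counter(labels)
    scored = counts["CORRECT"] + counts["INCORRECT"]
    if scored == 0:
        return None  # nothing scored yet, so no accuracy figure exists
    return counts["CORRECT"] / scored * 100


labels = ["CORRECT"] * 3 + ["INCORRECT"] + ["INCONCLUSIVE"] * 2
print(directive_accuracy(labels))  # 75.0: 3 correct out of 4 scored
```

Returning `None` rather than 0 when nothing has been scored avoids reporting a misleading zero percent before any retest has run.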

What CORRECT, INCORRECT, and INCONCLUSIVE Mean

Each directive type has its own outcome classification logic. The engine does not apply a single rule across all directives, because the directives themselves are doing different things and producing different signals.

| Directive | CORRECT means | INCORRECT means |
| --- | --- | --- |
| SCALE | Contribution margin increased after budget increased | Contribution margin decreased after budget increased |
| CUT | Margin improved or campaign losses stopped after cut | Margin got worse after cutting |
| HOLD | Margin stayed within ±5% of original (stable as predicted) | Margin shifted significantly (more than 15% change) |
| PAUSE | Margin catastrophe stopped or fraud signal cleared | Continued or worsened after pause |
| FLAG | Human review identified and resolved the data issue | Data issue persisted past 30-day mark |
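
The margin-driven rows of the table can be sketched as a small classifier. Several things here are assumptions rather than documented behavior: the function name `classify`, reading HOLD's ±5% and 15% thresholds as relative change against the original margin, and the `UNSPECIFIED` sentinel returned for cases the table does not cover (a zero delta, or a HOLD shift between 5% and 15%). PAUSE and FLAG are left out because they depend on fraud and data-quality signals rather than on the margin delta alone.

```python
UNSPECIFIED = "UNSPECIFIED"  # sentinel for cases the published table does not define


def classify(action: str, pre_cm: float, post_cm: float) -> str:
    """Classify a SCALE, CUT, or HOLD directive from pre/post contribution margin %."""
    delta = post_cm - pre_cm

    if action in ("SCALE", "CUT"):
        # Both rows reduce to the sign of the margin change; "losses stopped"
        # after a CUT also shows up as a positive delta.
        if delta > 0:
            return "CORRECT"
        if delta < 0:
            return "INCORRECT"
        return UNSPECIFIED

    if action == "HOLD":
        if pre_cm == 0:
            return UNSPECIFIED  # relative change is undefined at a zero baseline
        relative_change = abs(delta) / abs(pre_cm) * 100
        if relative_change <= 5:
            return "CORRECT"      # stable as predicted
        if relative_change > 15:
            return "INCORRECT"    # margin shifted significantly
        return UNSPECIFIED        # the 5% to 15% band is not defined in the table

    raise ValueError(f"no margin-only rule for directive type {action!r}")


print(classify("SCALE", 18.0, 22.0))  # CORRECT
print(classify("HOLD", 20.0, 24.0))   # INCORRECT
```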

The third classification — INCONCLUSIVE — applies when the engine cannot fairly score the directive. The most common reasons:

  • Campaign paused by client before the 30-day window completed — the engine never got to see the outcome it predicted
  • Insufficient post-directive data to measure outcome reliably — typically when traffic volume dropped below the engine's minimum confidence threshold
  • External factor contaminated the measurement — platform outage, attribution gap, force majeure event that confounds the comparison

Inconclusive outcomes are excluded from both the numerator and denominator of the directive accuracy calculation. This matters: it prevents the engine from being scored wrong for outcomes it could not have predicted, and it prevents the engine from being scored right for outcomes that happened by chance. The directive accuracy metric reflects only the directives where the engine had a fair shot at being measured.
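
The three inconclusive conditions could be checked before any outcome rule is applied. The sketch below assumes a hypothetical `RetestContext` record and a made-up `MIN_POST_EVENTS` threshold; the article does not state the engine's actual minimum confidence threshold or how contamination is detected.

```python
from dataclasses import dataclass
from typing import Optional

MIN_POST_EVENTS = 100  # hypothetical minimum post-directive volume


@dataclass
class RetestContext:
    """Facts gathered at retest time (all field names are assumptions)."""
    paused_before_window_end: bool   # client paused the campaign early
    post_directive_events: int       # conversions or leads observed after the directive
    external_contamination: bool     # outage, attribution gap, force majeure


def inconclusive_reason(ctx: RetestContext) -> Optional[str]:
    """Return why a directive cannot be fairly scored, or None if it can."""
    if ctx.paused_before_window_end:
        return "campaign_paused_early"
    if ctx.post_directive_events < MIN_POST_EVENTS:
        return "insufficient_data"
    if ctx.external_contamination:
        return "external_factor"
    return None


ctx = RetestContext(paused_before_window_end=False,
                    post_directive_events=40,
                    external_contamination=False)
print(inconclusive_reason(ctx))  # insufficient_data
```

Running this gate first means a directive is labeled INCONCLUSIVE before its margin delta is ever compared, so the delta cannot influence whether it gets scored.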

The 80 Percent Measured Directive Accuracy Result

As of May 2026, the engine has scored 55 of 56 issued directives. One directive remains within the 30-day window and is pending retest. Of the 55 scored: 44 CORRECT, 11 INCORRECT. The resulting accuracy rate: 80 percent.
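
Written out with those figures, and with the one pending directive left out of both counts, the arithmetic is:

```latex
\[
\text{Accuracy \%}
  = \frac{\text{Correct}}{\text{Correct} + \text{Incorrect}} \times 100
  = \frac{44}{44 + 11} \times 100
  = \frac{44}{55} \times 100
  = 80\%
\]
```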

80% Measured directive accuracy · 55 of 56 directives scored · validated by automated 30-day retest

That figure is meaningfully above the 60 to 70 percent directional accuracy typical of multi-touch attribution and marketing mix models. Several things are worth naming about it.

It is measured, not claimed. The retest runs automatically. The outcome labels are deterministic. There is no human review step that could inflate the result. The engine cannot decide retroactively that an INCORRECT directive should have been INCONCLUSIVE.

It is auditable. Every one of the 55 scored directives traces to a specific campaign, a specific pre-directive state, a specific issue date, and a specific outcome record. The lineage is intact in the database. Any technically literate reviewer can replay the math.

It is structurally selective. The engine refuses to issue directives on bad data: the directive_safe gate ensures no directive issues when the underlying data is stale, incomplete, or missing attribution. The directives that do issue are the ones where the engine is most confident. That means the 80 percent figure is the directive accuracy of a curated set of high-confidence calls, not the accuracy the engine would achieve if it issued recommendations on every campaign regardless of data quality.

It will move. As Allocera's CDAI engine continues to issue directives across additional client engagements, the sample size grows and the directive accuracy metric updates in real time. The published number is the current measured result, not a target or a model projection. If the next 100 directives produce a different accuracy rate, the published figure will move to reflect that.

Why Directive Accuracy Matters for Capital Allocation

For any CFO making capital allocation decisions on a marketing platform's output, the validation layer is the difference between trust and skepticism. Three things change when the engine measures itself.

Confidence Has a Number Attached to It

Every Allocera directive carries a confidence score in the 70 to 97 percent range, derived from data volume, quality signals, and the strength of the underlying margin signal. The 80 percent measured directive accuracy across the historical retest set tells the CFO what those confidence scores have actually translated to in practice: across the scored set, directives issued with those scores produced the predicted outcome 80 percent of the time. That is a number a CFO can underwrite a budget decision against.
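
One way to see how confidence scores relate to measured outcomes is to group scored directives into confidence bands and compute accuracy per band. This is a sketch under stated assumptions: the function `accuracy_by_confidence`, the 10-point band width, and the sample records are all hypothetical, and the article does not say the engine reports accuracy this way.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_confidence(records: Iterable[Tuple[float, str]],
                           band_width: int = 10) -> Dict[int, float]:
    """Map each confidence band (by its lower bound) to measured accuracy %."""
    tallies = defaultdict(lambda: [0, 0])  # band -> [correct, scored]
    for confidence, label in records:
        if label not in ("CORRECT", "INCORRECT"):
            continue  # inconclusive outcomes are excluded, as in the headline metric
        band = int(confidence // band_width) * band_width
        tallies[band][1] += 1
        if label == "CORRECT":
            tallies[band][0] += 1
    return {band: correct / scored * 100
            for band, (correct, scored) in sorted(tallies.items())}


# Made-up records for illustration only; these are not Allocera results.
sample = [(72, "CORRECT"), (78, "INCORRECT"), (91, "CORRECT"),
          (95, "CORRECT"), (93, "INCONCLUSIVE")]
print(accuracy_by_confidence(sample))  # {70: 50.0, 90: 100.0}
```

If measured accuracy in a band sits well below the confidence scores issued in that band, the scores are overstated for that range, which is the kind of signal a calibration step could use.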

Wrong Directives Surface, Don't Hide

Most marketing tools never have to face the incorrect call. The recommendation gets made, the team acts on it, and if it doesn't work the post-mortem blames execution, market conditions, or "we should have known." Allocera's retest forces the wrong calls into the open. The engine that issued the INCORRECT directive is the engine that scores itself wrong on it. That information feeds back into the confidence calibration of subsequent directives.

The Engine Refuses to Fabricate

The validation layer works because the engine has a complementary property: it refuses to issue directives when the data does not support them. If incoming data is stale or campaign attribution is missing, the engine's health monitor flags the data set as not directive-safe and no directives issue. This behavior is independently verified across two real businesses in our published validation case study. A wrong directive is worse than no directive. The retest can score directives that did issue precisely because directives only issue when they have a fair shot at being correct.
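
A gate of this kind can be sketched as a single boolean check over the three conditions named in the text: stale data, incomplete data, and missing attribution. The `DataHealth` record, its field names, and the 24-hour `MAX_DATA_AGE` limit are assumptions for illustration; only the `directive_safe` name comes from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_DATA_AGE = timedelta(hours=24)  # hypothetical staleness limit


@dataclass
class DataHealth:
    """Health snapshot for one campaign's data set (field names are assumptions)."""
    last_refreshed: datetime
    is_complete: bool
    attributed_campaign_id: Optional[str]  # None when attribution is missing


def directive_safe(health: DataHealth, now: datetime) -> bool:
    """Return True only if the data is fresh, complete, and attributed."""
    fresh = now - health.last_refreshed <= MAX_DATA_AGE
    attributed = health.attributed_campaign_id is not None
    return fresh and health.is_complete and attributed


now = datetime(2026, 5, 1, 12, 0)
stale = DataHealth(now - timedelta(hours=48), True, "cmp-1")
print(directive_safe(stale, now))  # False: no directive would issue
```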

Accuracy is measured, not claimed. The engine that gave the advice is the engine that audits the advice, and the audit happens whether anybody wants to see the result or not.

What This Means for the Category

The 30-day retest is, as far as documented practice goes, the only automated self-validation methodology in the marketing analytics category. HubSpot's attribution reports do not retest themselves. Salesforce Marketing Cloud's campaign influence reporting does not retest itself. The major attribution platforms — Rockerbox, Northbeam, Triple Whale, ProfitMetrics — surface attribution models but do not measure whether their attributed values predicted closed revenue 30 days later. We compared the major attribution platforms in our analysis of Triple Whale vs Rockerbox vs Allocera and the architectural gap with Salesforce specifically in our Allocera vs Salesforce Marketing Cloud comparison.

This is not because the other platforms could not build it. It is because the category has not historically required it. Marketing analytics has been graded on dashboard sophistication, integration count, and attribution model flexibility — not on whether the platform's recommendations produced the outcomes they predicted. That standard was sufficient for a market in which marketing was treated as an expense category with directional ROI. It is not sufficient for a market in which CFOs are asking marketing teams to defend capital allocation decisions against the same scrutiny applied to any other operational deployment.

The 30-day retest is what changes that conversation. A CFO does not have to take the engine's word for it. The engine is built to grade itself, the grade is public, and the grade is auditable.

The Question That Reveals Whether Your Stack Measures Itself

For any marketing analytics platform currently in use, one diagnostic question separates the platforms that validate themselves from the platforms that report:

Show me, for the recommendations or attributed values your platform produced 30 days ago, what percentage turned out to be correct against the actual outcomes that followed.

If the platform cannot answer that question — and almost none of them can, because the architecture to answer it was never built — then the platform is in the gut-feel layer. It produces outputs. It does not measure them. The next decision made on those outputs will be made on the same unvalidated basis as the last one. That gap is what an accuracy-scored engine closes. We covered the directive framework that produces the retestable outputs in our Scale, Hold, Cut, Pause breakdown.

See a Directive Sheet With Accuracy Lineage

A 30-day distortion audit reconciles your campaign data across all seven cost layers and delivers a directive for every active campaign within seven days — each one with a confidence score, reason codes, and the retest methodology that will score it 30 days later. $2,500. If we don't surface margin distortion you weren't tracking, you don't pay.

Request a Distortion Audit