The Reputation Split Test: How to Prove If Deliverability Is the Bottleneck

If your cold email program is underperforming, it’s tempting to blame “deliverability” and start swapping tools, domains, or templates.
But low replies can come from four very different bottlenecks:
- Deliverability (you’re not landing in the inbox)
- Copy (you’re landing, but nobody cares)
- Offer (your pitch isn’t compelling enough to act)
- List quality (you’re emailing the wrong people)
A reputation split test is the fastest way to isolate the variable that’s actually holding you back: a controlled experiment that compares performance across two sending reputations while keeping everything else the same.
This article walks you through exactly how to run it, what to measure, and how to interpret the results.
What a reputation split test is (and why it works)
A reputation split test is a controlled A/B test where you send the same campaign to two comparable audience segments, but from two different sending “reputations.”
In practice, that usually means:
- Sender A: a “clean” reputation (freshly warmed inboxes/domains, conservative volume)
- Sender B: your current production reputation (the one you suspect is struggling)
Because the copy, offer, and list criteria stay constant, the only meaningful difference should be the sending reputation. That makes the outcome diagnostic.
If Sender A dramatically outperforms Sender B, deliverability is likely the bottleneck.
If they perform similarly (both good or both bad), deliverability probably isn’t the main issue and you should look at copy/offer/list.
When you should run this test
Run a reputation split test when you see any of these symptoms:
- Reply rates dropped suddenly after scaling volume
- Open rates are inconsistent or suspiciously low
- You’re seeing more bounces or provider blocks
- You’ve changed domains/inboxes/providers, and the results are still unclear
- You’re debating whether to “start over” with new infrastructure
It’s also useful before you invest in:
- New domains
- Dedicated IPs
- A new sending platform
- A major copy rewrite
In each of these cases, the test tells you what to fix first.
What you’re trying to prove
Your hypothesis should be explicit:
- H1 (Deliverability bottleneck): A higher-trust sender reputation produces materially better outcomes with the same list + copy.
- H0 (Not deliverability): Outcomes are similar across reputations; the bottleneck is likely copy, offer, or list quality.
“Materially better” depends on your baseline, but as a rule of thumb, you’re looking for a clear, repeatable gap, not a 0.2% difference.
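To judge whether a gap is “clear” rather than noise, a simple two-proportion z-test is enough. Here is a minimal sketch using only Python’s standard library; the reply and delivered counts are illustrative placeholders, not benchmarks.

```python
# Minimal sketch: check whether a reply-rate gap between Sender A and Sender B
# is likely real or just noise, using a two-proportion z-test.
# All counts below are illustrative placeholders -- swap in your own numbers.
from math import erf, sqrt

def two_proportion_z_test(replies_a, delivered_a, replies_b, delivered_b):
    """Return (z statistic, two-sided p-value) for the gap in reply rates."""
    p_a = replies_a / delivered_a
    p_b = replies_b / delivered_b
    # Pooled rate under the null hypothesis that both senders perform equally
    p_pool = (replies_a + replies_b) / (delivered_a + delivered_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / delivered_a + 1 / delivered_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Example: 38 replies / 982 delivered (A) vs 14 replies / 926 delivered (B)
z, p = two_proportion_z_test(38, 982, 14, 926)
print(f"z = {z:.2f}, p = {p:.4f}")  # small p + a healthy absolute gap = real signal
```

As a rough guide, a p-value well under 0.05 combined with a meaningful absolute gap is what “materially better” looks like in practice.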
The setup: what you need before you start
To run a clean test, you need:
- Two sender groups (A and B)
- A single campaign (same copy, same offer, same sequence)
- A single list definition (same ICP and filters)
- A way to split the list randomly and evenly
- Tracking for bounces, opens (optional), replies, and positive replies
Choose your two reputations
Sender A (control / “clean”):
- New or recently rotated domain(s)
- Proper DNS (SPF, DKIM, DMARC) configured (a quick verification sketch follows below)
- Warmed for at least 2–4 weeks
- Conservative sending volume (e.g., 20–40 emails/inbox/day)
Sender B (variant / “current”):
- Your existing production domain(s)/inboxes
- The same sending tool and sequence settings
- Your current volume
Important: Don’t change multiple variables at once. If you move Sender A to a different sending platform, you’re testing platform + reputation, not just reputation.
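As a pre-flight check on the DNS items listed for Sender A, a short script along these lines can confirm that SPF, DKIM, and DMARC records are actually published. It assumes the third-party dnspython package is installed; the domain and DKIM selector are placeholders you would swap for your own.

```python
# Minimal sketch: verify SPF, DKIM, and DMARC TXT records are published for a
# sending domain. Assumes dnspython is installed (pip install dnspython);
# the domain and DKIM selector are placeholders -- use your own values.
import dns.resolver

def fetch_txt(name):
    """Return all TXT record strings for a DNS name, or [] if none exist."""
    try:
        answers = dns.resolver.resolve(name, "TXT")
        return [b"".join(rdata.strings).decode() for rdata in answers]
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return []

domain = "example-sending-domain.com"  # placeholder
dkim_selector = "s1"                   # placeholder; your provider publishes the real one

spf = [r for r in fetch_txt(domain) if r.startswith("v=spf1")]
dkim = fetch_txt(f"{dkim_selector}._domainkey.{domain}")
dmarc = [r for r in fetch_txt(f"_dmarc.{domain}") if r.startswith("v=DMARC1")]

print("SPF:  ", spf or "MISSING")
print("DKIM: ", dkim or "MISSING")
print("DMARC:", dmarc or "MISSING")
```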
Keep the campaign identical
To isolate deliverability, both groups must use:
- Same subject line(s)
- Same email body
- Same personalization logic
- Same sequence timing
- Same CTA
If you want to test copy changes, do it after you’ve diagnosed deliverability.
Build a list that can be split fairly
Your list should be large enough to reduce noise.
Practical guidance:
- Aim for at least 500–1,000 prospects per group if possible
- Use the same ICP filters (industry, job title, company size, geography)
- Remove duplicates and obvious low-quality records
- Avoid mixing drastically different segments in one test
Then split it randomly into two equal groups.
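If your list lives in a CSV, a shuffle-and-halve script is all a fair split needs. The sketch below uses only Python’s standard library; the file names are placeholders, and deduplication is assumed to have happened upstream.

```python
# Minimal sketch: shuffle a prospect CSV and split it into two equal groups.
# File names are placeholders; duplicates should already be removed upstream.
import csv
import random

with open("prospects.csv", newline="") as f:
    reader = csv.DictReader(f)
    prospects = list(reader)
    fieldnames = reader.fieldnames

random.seed(42)              # fixed seed so the split is reproducible
random.shuffle(prospects)
midpoint = len(prospects) // 2
groups = {"group_a.csv": prospects[:midpoint], "group_b.csv": prospects[midpoint:]}

for filename, rows in groups.items():
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```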
What to measure (and what to ignore)
Deliverability is tricky because “opens” are increasingly unreliable (Apple Mail Privacy Protection, image blocking, etc.).
So focus on metrics that are harder to fake.
Primary metrics
- Bounce rate (hard bounces especially)
- Reply rate (total replies / delivered)
- Positive reply rate (qualified interest / delivered)
- Spam complaints (if your tooling exposes this)
Secondary signals
- Provider-level blocks (Gmail/Outlook throttling)
- Sudden delivery slowdowns
- Inbox placement tests (seed tests) if you have them
Metrics to treat carefully
- Open rate: Use only as a directional signal, not the final verdict.
Step-by-step: how to run the test
Step 1: Standardize sending settings
Make sure both sender groups match on:
- Daily volume per inbox
- Ramp schedule (if any)
- Sending windows
- Follow-up delays
- Reply handling (so replies don’t get lost)
If Sender B is already sending at high volume, consider temporarily matching Sender A’s volume for the test window. Otherwise, volume becomes a confounder.
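One way to keep the groups honest is to pin both configurations in code and compare them before launch. This is only an illustration and isn’t tied to any particular sending platform; the field names and values are assumptions.

```python
# Minimal sketch: pin both sender groups to the same settings and fail loudly
# if they drift. Field names and values are illustrative, not tied to any
# particular sending platform.
from dataclasses import dataclass

@dataclass(frozen=True)
class SendingSettings:
    emails_per_inbox_per_day: int
    ramp_per_week: int               # extra emails/inbox/day added each week
    sending_window: str              # local-time window, e.g. "09:00-16:00"
    followup_delays_days: tuple      # gaps between sequence steps, in days

sender_a = SendingSettings(30, 0, "09:00-16:00", (3, 4, 5))
sender_b = SendingSettings(30, 0, "09:00-16:00", (3, 4, 5))

# If this fails, volume or timing has become a confounder in the test.
assert sender_a == sender_b, "Sender groups must run identical settings during the test"
```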
Step 2: Launch the same sequence to both groups
Send to Group A and Group B at the same time (or as close as possible).
Avoid:
- Running Group A this week and Group B next week (seasonality and list drift)
- Changing copy mid-test
Step 3: Run long enough to capture follow-ups
Most reply volume comes after follow-ups.
A good minimum test window is 7–14 days, depending on your sequence length.
Step 4: Export results and compare on delivered emails
Always normalize by delivered volume:
$$ \text{Reply Rate} = \frac{\text{Replies}}{\text{Delivered}} $$
$$ \text{Positive Reply Rate} = \frac{\text{Positive Replies}}{\text{Delivered}} $$
If Group B has higher bounces, comparing replies per sent will unfairly penalize it.
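As a concrete example, here is a small sketch that computes the delivered-normalized rates for both groups. The counts are placeholders; substitute the numbers exported from your sending tool.

```python
# Minimal sketch: compare groups on delivered volume, not sent volume.
# Counts are placeholders -- export the real numbers from your sending tool.
def summarize(sent, bounced, replies, positive_replies):
    delivered = sent - bounced
    return {
        "bounce_rate": bounced / sent,
        "reply_rate": replies / delivered,
        "positive_reply_rate": positive_replies / delivered,
    }

group_a = summarize(sent=1000, bounced=18, replies=38, positive_replies=11)
group_b = summarize(sent=1000, bounced=74, replies=14, positive_replies=3)

for name, stats in (("Sender A", group_a), ("Sender B", group_b)):
    print(name, {metric: f"{value:.2%}" for metric, value in stats.items()})
```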
How to interpret results (the decision tree)
Here’s a practical way to read the outcome.
Scenario 1: Sender A wins by a lot
If Sender A has:
- Lower bounces
- Higher replies
- Higher positive replies
…then deliverability (reputation) is very likely the bottleneck.
What to do next:
- Reduce volume per inbox and per domain
- Rotate domains/inboxes
- Audit DNS and alignment (SPF/DKIM/DMARC)
- Improve list hygiene (bad lists damage reputation fast)
- Extend warm-up and avoid sudden spikes
Scenario 2: Both perform poorly
If both Sender A and Sender B have low replies and similar bounce rates, deliverability probably isn’t the primary constraint.
What to check next:
- Offer clarity: Is the value proposition specific and credible?
- CTA friction: Are you asking for too much?
- ICP fit: Are you targeting buyers with an urgent problem?
- Personalization: Are you relevant, or just inserting tokens?
Scenario 3: Both perform well
If both groups perform well, then deliverability isn’t your bottleneck right now.
Your next lever is usually:
- Scaling volume carefully without damaging reputation
- Expanding to adjacent segments
- Testing offers and angles to increase positive replies
Scenario 4: Sender B wins (rare, but possible)
If your “current” reputation outperforms the “clean” sender, it usually means:
- Sender A wasn’t actually warmed enough
- The new domain looks suspicious (too new, too little history)
- Your sending patterns differ more than you think
Double-check warm-up time, DNS, and sending behavior.
Common mistakes that ruin the test
- Changing multiple variables at once (new tool + new domain + new copy)
- Uneven list split (Group A gets better leads)
- Too small a sample size (noise looks like signal)
- Comparing on “sent” instead of “delivered”
- Using opens as the deciding metric
- Running the test across different weeks
A simple baseline to keep you safe while testing
If you’re scaling cold email, conservative guardrails protect reputation:
- Keep to ~20 emails/inbox/day while diagnosing
- Avoid more than 3 inboxes per domain (5 max)
- Warm for 3–4 weeks before pushing volume
- Prioritize list quality—bad data burns reputation quickly
These aren’t universal laws, but they’re solid defaults for most startups and sales teams.
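To see how these guardrails translate into capacity, a quick back-of-the-envelope calculation helps. The daily target below is a placeholder.

```python
# Minimal sketch: translate the guardrails above into capacity planning.
# The daily target is a placeholder.
from math import ceil

emails_per_inbox_per_day = 20
inboxes_per_domain = 3
target_emails_per_day = 600   # placeholder target

per_domain = emails_per_inbox_per_day * inboxes_per_domain       # 60 emails/day/domain
domains_needed = ceil(target_emails_per_day / per_domain)        # 10 domains

print(f"{domains_needed} domains x {inboxes_per_domain} inboxes x "
      f"{emails_per_inbox_per_day} emails/day = {domains_needed * per_domain} emails/day")
```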
Prove the bottleneck before you “fix” it
Most cold email teams don’t have a deliverability problem; they have a diagnosis problem.
A reputation split test gives you a clear answer:
- If a clean reputation wins, fix deliverability and infrastructure first.
- If results are similar, focus on copy, offer, and list quality.
Either way, you stop guessing and start improving the right thing.
Want help running a clean reputation split test?
If you want to run this test without burning domains or wasting weeks, book a demo and we’ll show you how to set up controlled sender groups, warm safely, and scale outreach while protecting deliverability.