The Reputation Split Test: How to Prove If Deliverability Is the Bottleneck

If your cold email program is underperforming, it’s tempting to blame “deliverability” and start swapping tools, domains, or templates.
But low replies can come from four very different bottlenecks:
- Deliverability (you’re not landing in the inbox)
- Copy (you’re landing, but nobody cares)
- Offer (your pitch isn’t compelling enough to act)
- List quality (you’re emailing the wrong people)
A reputation split test is the fastest way to isolate the variable that’s actually holding you back: a controlled experiment that compares performance across two sending reputations while keeping everything else the same.
This article walks you through exactly how to run it, what to measure, and how to interpret the results.
What a reputation split test is (and why it works)
A reputation split test is a controlled A/B test where you send the same campaign to two comparable audience segments, but from two different sending “reputations.”
In practice, that usually means:
- Sender A: a “clean” reputation (freshly warmed inboxes/domains, conservative volume)
- Sender B: your current production reputation (the one you suspect is struggling)
Because the copy, offer, and list criteria stay constant, the only meaningful difference should be the sending reputation. That makes the outcome diagnostic.
If Sender A dramatically outperforms Sender B, deliverability is likely the bottleneck.
If they perform similarly (both good or both bad), deliverability probably isn’t the main issue and you should look at copy/offer/list.
When you should run this test
Run a reputation split test when you see any of these symptoms:
- Reply rates dropped suddenly after scaling volume
- Open rates are inconsistent or suspiciously low
- You’re seeing more bounces or provider blocks
- You’ve changed domains/inboxes/providers, and the results are still unclear
- You’re debating whether to “start over” with new infrastructure
It’s also useful before you invest in:
- New domains
- Dedicated IPs
- A new sending platform
- A major copy rewrite
In each of these cases, the test tells you what to fix first.
What you’re trying to prove
Your hypothesis should be explicit:
- H1 (Deliverability bottleneck): A higher-trust sender reputation produces materially better outcomes with the same list + copy.
- H0 (Not deliverability): Outcomes are similar across reputations; the bottleneck is likely copy, offer, or list quality.
“Materially better” depends on your baseline, but as a rule of thumb, you’re looking for a clear, repeatable gap, not a 0.2% difference.
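To judge whether a gap is “clear” rather than noise, a simple two-proportion z-test is enough. Here is a minimal sketch using only Python’s standard library; the reply and delivered counts are illustrative placeholders, not benchmarks.

```python
# Minimal sketch: check whether a reply-rate gap between Sender A and Sender B
# is likely real or just noise, using a two-proportion z-test.
# All counts below are illustrative placeholders -- swap in your own numbers.
from math import erf, sqrt

def two_proportion_z_test(replies_a, delivered_a, replies_b, delivered_b):
    """Return (z statistic, two-sided p-value) for the gap in reply rates."""
    p_a = replies_a / delivered_a
    p_b = replies_b / delivered_b
    # Pooled rate under the null hypothesis that both senders perform equally
    p_pool = (replies_a + replies_b) / (delivered_a + delivered_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / delivered_a + 1 / delivered_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Example: 38 replies / 982 delivered (A) vs 14 replies / 926 delivered (B)
z, p = two_proportion_z_test(38, 982, 14, 926)
print(f"z = {z:.2f}, p = {p:.4f}")  # small p + a healthy absolute gap = real signal
```

As a rough guide, a p-value well under 0.05 combined with a meaningful absolute gap is what “materially better” looks like in practice.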
The setup: what you need before you start
To run a clean test, you need:
- Two sender groups (A and B)
- A single campaign (same copy, same offer, same sequence)
- A single list definition (same ICP and filters)
- A way to split the list randomly and evenly
- Tracking for bounces, opens (optional), replies, and positive replies
Choose your two reputations
Sender A (control / “clean”):
- New or recently rotated domain(s)
- Proper DNS (SPF, DKIM, DMARC) configured (a quick verification sketch follows below)
- Warmed for at least 2–4 weeks
- Conservative sending volume (e.g., 20–40 emails/inbox/day)
Sender B (variant / “current”):
- Your existing production domain(s)/inboxes
- The same sending tool and sequence settings
- Your current volume
Important: Don’t change multiple variables at once. If you move Sender A to a different sending platform, you’re testing platform + reputation, not just reputation.
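As a pre-flight check on the DNS items listed for Sender A, a short script along these lines can confirm that SPF, DKIM, and DMARC records are actually published. It assumes the third-party dnspython package is installed; the domain and DKIM selector are placeholders you would swap for your own.

```python
# Minimal sketch: verify SPF, DKIM, and DMARC TXT records are published for a
# sending domain. Assumes dnspython is installed (pip install dnspython);
# the domain and DKIM selector are placeholders -- use your own values.
import dns.resolver

def fetch_txt(name):
    """Return all TXT record strings for a DNS name, or [] if none exist."""
    try:
        answers = dns.resolver.resolve(name, "TXT")
        return [b"".join(rdata.strings).decode() for rdata in answers]
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return []

domain = "example-sending-domain.com"  # placeholder
dkim_selector = "s1"                   # placeholder; your provider publishes the real one

spf = [r for r in fetch_txt(domain) if r.startswith("v=spf1")]
dkim = fetch_txt(f"{dkim_selector}._domainkey.{domain}")
dmarc = [r for r in fetch_txt(f"_dmarc.{domain}") if r.startswith("v=DMARC1")]

print("SPF:  ", spf or "MISSING")
print("DKIM: ", dkim or "MISSING")
print("DMARC:", dmarc or "MISSING")
```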
Keep the campaign identical
To isolate deliverability, both groups must use:
- Same subject line(s)
- Same email body
- Same personalization logic
- Same sequence timing
- Same CTA
If you want to test copy changes, do it after you’ve diagnosed deliverability.
Build a list that can be split fairly
Your list should be large enough to reduce noise.
Practical guidance:
- Aim for at least 500–1,000 prospects per group if possible
- Use the same ICP filters (industry, job title, company size, geography)
- Remove duplicates and obvious low-quality records
- Avoid mixing drastically different segments in one test
Then split it randomly into two equal groups.
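If your list lives in a CSV, a shuffle-and-halve script is all a fair split needs. The sketch below uses only Python’s standard library; the file names are placeholders, and deduplication is assumed to have happened upstream.

```python
# Minimal sketch: shuffle a prospect CSV and split it into two equal groups.
# File names are placeholders; duplicates should already be removed upstream.
import csv
import random

with open("prospects.csv", newline="") as f:
    reader = csv.DictReader(f)
    prospects = list(reader)
    fieldnames = reader.fieldnames

random.seed(42)              # fixed seed so the split is reproducible
random.shuffle(prospects)
midpoint = len(prospects) // 2
groups = {"group_a.csv": prospects[:midpoint], "group_b.csv": prospects[midpoint:]}

for filename, rows in groups.items():
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```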
What to measure (and what to ignore)
Deliverability is tricky because “opens” are increasingly unreliable (Apple Mail Privacy Protection, image blocking, etc.).
So focus on metrics that are harder to fake.
Primary metrics
- Bounce rate (hard bounces especially)
- Reply rate (total replies / delivered)
- Positive reply rate (qualified interest / delivered)
- Spam complaints (if your tooling exposes this)
Secondary signals
- Provider-level blocks (Gmail/Outlook throttling)
- Sudden delivery slowdowns
- Inbox placement tests (seed tests) if you have them
Metrics to treat carefully
- Open rate: Use only as a directional signal, not the final verdict.
Step-by-step: how to run the test
Step 1: Standardize sending settings
Make sure both sender groups match on:
- Daily volume per inbox
- Ramp schedule (if any)
- Sending windows
- Follow-up delays
- Reply handling (so replies don’t get lost)
If Sender B is already sending at high volume, consider temporarily matching Sender A’s volume for the test window. Otherwise, volume becomes a confounder.
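One way to keep the groups honest is to pin both configurations in code and compare them before launch. This is only an illustration and isn’t tied to any particular sending platform; the field names and values are assumptions.

```python
# Minimal sketch: pin both sender groups to the same settings and fail loudly
# if they drift. Field names and values are illustrative, not tied to any
# particular sending platform.
from dataclasses import dataclass

@dataclass(frozen=True)
class SendingSettings:
    emails_per_inbox_per_day: int
    ramp_per_week: int               # extra emails/inbox/day added each week
    sending_window: str              # local-time window, e.g. "09:00-16:00"
    followup_delays_days: tuple      # gaps between sequence steps, in days

sender_a = SendingSettings(30, 0, "09:00-16:00", (3, 4, 5))
sender_b = SendingSettings(30, 0, "09:00-16:00", (3, 4, 5))

# If this fails, volume or timing has become a confounder in the test.
assert sender_a == sender_b, "Sender groups must run identical settings during the test"
```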
Step 2: Launch the same sequence to both groups
Send to Group A and Group B at the same time (or as close as possible).
Avoid:
- Running Group A this week and Group B next week (seasonality and list drift)
- Changing copy mid-test
Step 3: Run long enough to capture follow-ups
Most reply volume comes after follow-ups.
A good minimum test window is 7–14 days, depending on your sequence length.
Step 4: Export results and compare on delivered emails
Always normalize by delivered volume:
$$ \text{Reply Rate} = \frac{\text{Replies}}{\text{Delivered}} $$
$$ \text{Positive Reply Rate} = \frac{\text{Positive Replies}}{\text{Delivered}} $$
If Group B has higher bounces, comparing replies per sent will unfairly penalize it.
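As a concrete example, here is a small sketch that computes the delivered-normalized rates for both groups. The counts are placeholders; substitute the numbers exported from your sending tool.

```python
# Minimal sketch: compare groups on delivered volume, not sent volume.
# Counts are placeholders -- export the real numbers from your sending tool.
def summarize(sent, bounced, replies, positive_replies):
    delivered = sent - bounced
    return {
        "bounce_rate": bounced / sent,
        "reply_rate": replies / delivered,
        "positive_reply_rate": positive_replies / delivered,
    }

group_a = summarize(sent=1000, bounced=18, replies=38, positive_replies=11)
group_b = summarize(sent=1000, bounced=74, replies=14, positive_replies=3)

for name, stats in (("Sender A", group_a), ("Sender B", group_b)):
    print(name, {metric: f"{value:.2%}" for metric, value in stats.items()})
```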
How to interpret results (the decision tree)
Here’s a practical way to read the outcome.
Scenario 1: Sender A wins by a lot
If Sender A has:
- Lower bounces
- Higher replies
- Higher positive replies
…then deliverability (reputation) is very likely the bottleneck.
What to do next:
- Reduce volume per inbox and per domain
- Rotate domains/inboxes
- Audit DNS and alignment (SPF/DKIM/DMARC)
- Improve list hygiene (bad lists damage reputation fast)
- Extend warm-up and avoid sudden spikes
Scenario 2: Both perform poorly
If both Sender A and Sender B have low replies and similar bounce rates, deliverability probably isn’t the primary constraint.
What to check next:
- Offer clarity: Is the value proposition specific and credible?
- CTA friction: Are you asking for too much?
- ICP fit: Are you targeting buyers with an urgent problem?
- Personalization: Are you relevant, or just inserting tokens?
Scenario 3: Both perform well
If both groups perform well, then deliverability isn’t your bottleneck right now.
Your next lever is usually:
- Scaling volume carefully without damaging reputation
- Expanding to adjacent segments
- Testing offers and angles to increase positive replies
Scenario 4: Sender B wins (rare, but possible)
If your “current” reputation outperforms the “clean” sender, it usually means:
- Sender A wasn’t actually warmed enough
- The new domain looks suspicious (too new, too little history)
- Your sending patterns differ more than you think
Double-check warm-up time, DNS, and sending behavior.
Common mistakes that ruin the test
- Changing multiple variables at once (new tool + new domain + new copy)
- Uneven list split (Group A gets better leads)
- Too small a sample size (noise looks like signal)
- Comparing on “sent” instead of “delivered”
- Using opens as the deciding metric
- Running the test across different weeks
A simple baseline to keep you safe while testing
If you’re scaling cold email, conservative guardrails protect reputation:
- Keep to ~20 emails/inbox/day while diagnosing
- Avoid more than 3 inboxes per domain (5 max)
- Warm for 3–4 weeks before pushing volume
- Prioritize list quality—bad data burns reputation quickly
These aren’t universal laws, but they’re solid defaults for most startups and sales teams.
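To see how these guardrails translate into capacity, a quick back-of-the-envelope calculation helps. The daily target below is a placeholder.

```python
# Minimal sketch: translate the guardrails above into capacity planning.
# The daily target is a placeholder.
from math import ceil

emails_per_inbox_per_day = 20
inboxes_per_domain = 3
target_emails_per_day = 600   # placeholder target

per_domain = emails_per_inbox_per_day * inboxes_per_domain       # 60 emails/day/domain
domains_needed = ceil(target_emails_per_day / per_domain)        # 10 domains

print(f"{domains_needed} domains x {inboxes_per_domain} inboxes x "
      f"{emails_per_inbox_per_day} emails/day = {domains_needed * per_domain} emails/day")
```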
Prove the bottleneck before you “fix” it
Most cold email teams don’t have a deliverability problem; they have a diagnosis problem.
A reputation split test gives you a clear answer:
- If a clean reputation wins, fix deliverability and infrastructure first.
- If results are similar, focus on copy, offer, and list quality.
Either way, you stop guessing and start improving the right thing.
Want help running a clean reputation split test?
If you want to run this test without burning domains or wasting weeks, book a demo and we’ll show you how to set up controlled sender groups, warm safely, and scale outreach while protecting deliverability.