Most lifecycle teams should have a documented “Bible” of what works, but instead, they have a graveyard of old A/B tests. Everyone talks about experimentation, but few know how to do it well.

So I decided to consolidate my 15+ years of learning about experimentation, testing, and measurement into this post (I even included a copy of the actual testing framework I use so you can steal it for yourself).

Choosing the right testing framework

How should lifecycle teams think about experimentation and testing today?

Lifecycle teams should think about testing as a core part of their operations, not just a side project. How you execute that, though, really depends on your data quality, scale, experience with testing, and analytics support, but the principles are the same:

  • Prioritize with intent: Use a framework like ICE (impact, confidence, ease) to decide what's worth testing (see the scoring sketch below). Don't just test to test... test to shift a key business metric.

  • Start simple, scale deliberately: Subject lines and creative testing are great entry points because they are low risk and quick to learn from. Once you have established a rhythm for testing, look to expand into higher-leverage opportunities like offer testing, segmentation, and channel orchestration.

  • Make testing routine: Every send, journey, or message should have a hypothesis behind it. Build it into your campaign planning process.

  • Be disciplined with measurement: Ensure every test has a large enough sample size, avoid overlapping audiences (they muddy results), and set clear success metrics before the experiment begins.

  • Connect your tests to business goals: Testing isn't just about open and click rates. It should tie to lifts in revenue, retention, LTV, or increases in margin.

The best programs don't just optimize subject lines and creatives; they test entire experiences (how channels, timing, sequencing, etc., work together to maximize customer value).
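To make the ICE bullet above concrete, here is a minimal scoring sketch in Python. The test ideas, the 1-10 scales, and the multiply-the-three-scores convention are illustrative assumptions, not a prescribed formula:

```python
# Minimal ICE prioritization sketch: score each test idea on impact,
# confidence, and ease (1-10), multiply the three, and rank the backlog.
# The ideas and scores below are made up for illustration.
backlog = [
    {"idea": "Free shipping threshold test", "impact": 8, "confidence": 6, "ease": 7},
    {"idea": "Subject line emoji test", "impact": 3, "confidence": 8, "ease": 9},
    {"idea": "Winback offer ladder", "impact": 9, "confidence": 5, "ease": 4},
]

for item in backlog:
    item["ice"] = item["impact"] * item["confidence"] * item["ease"]

# Highest score first = the next test to run
for item in sorted(backlog, key=lambda x: x["ice"], reverse=True):
    print(f'{item["ice"]:>4}  {item["idea"]}')
```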

How do you choose between A/B testing and multivariate testing?

There are a few factors that help determine which testing approach is the best fit.

Start with A/B tests if: you have limited traffic (otherwise, you could end up waiting a long time for results), you're testing one clear hypothesis (for example, does free shipping increase conversion by more than 10%?), or you want the cleanest, most interpretable result.

You can use multivariate when: you have enough traffic to support multiple variations and reach statistical significance in a reasonable time, you want to understand how different variables interact with each other (for example, does headline A perform differently with image X than with image Y), or you're optimizing a high-volume funnel (checkout flow, homepage, or onboarding series).

At the end of the day, scale and volume determine feasibility, but your hypothesis determines whether you need an A/B or a multivariate test.

If you’ve only got 1,000 people hitting the cart per day, you can’t run a four-way multivariate test because it’ll take forever to get significance. But if you’ve got 10,000 people a day, then you can start splitting more granularly and get answers in two or three weeks. Scale drives the choice. With low volume, I’d rather just run a simple A/B test and get a clean answer fast. Otherwise, you’ll be waiting months.
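To put rough numbers on that tradeoff, here is a minimal sketch using statsmodels. The baseline conversion rate, the lift being detected, the 95%/80% confidence-and-power settings, and the even traffic split are all assumptions for illustration:

```python
# Rough "how long will this take?" estimate for a 2-way vs. a 4-way test.
# Assumptions (illustrative only): 5% baseline cart conversion, detecting a
# 5% relative lift, 95% confidence, 80% power, traffic split evenly across
# variants, and no multiple-comparison correction.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05
target = baseline * 1.05  # 5% relative lift

effect = proportion_effectsize(baseline, target)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)

for daily_traffic in (1_000, 10_000):
    for variants in (2, 4):
        days = n_per_variant * variants / daily_traffic
        print(f"{variants}-way test at {daily_traffic:,}/day: ~{days:.0f} days")
```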

And then sometimes it is just faster and more actionable to string together a few focused A/B tests than it is to run one big multivariate test, even if you have the volume to do it.

Running tests in the real world

How do you know when a test is statistically significant?

When something becomes statistically significant, it means you're confident that the result you see isn't just due to random chance, and there are a few factors that determine this:

  1. Sample size: You need enough users in each group to be confident in the result.

  2. Effect size: The bigger the lift you're trying to detect, the faster you will get significant results. The smaller the lift, the more data you will need.

  3. Variance: Noisy data makes things more difficult. For example, if you have large spikes in traffic from day to day, you'll need larger sample sizes.

  4. Confidence level: Most teams use a 95% confidence level, which means accepting only a 5% chance of calling a difference real when it's actually just random noise (a false positive).

It also comes down to power analysis. If you’re looking for a big change, you don’t need as much sample size. But if you’re looking for a tiny lift, then you need a lot more people and a lot more time. That’s why some tests are done in a week and others drag on for months. Ultimately, it all depends on the lift you’re trying to measure.
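For a concrete read on whether a result clears that 95% bar, here is a minimal sketch using statsmodels, with made-up conversion counts:

```python
# Checking whether an observed difference clears the 95% bar.
# The counts are made up: 10,000 users per arm, 500 vs. 560 conversions.
from statsmodels.stats.proportion import proportions_ztest

conversions = [500, 560]      # control, variant
exposures = [10_000, 10_000]  # users in each arm

z_stat, p_value = proportions_ztest(conversions, exposures)
alpha = 0.05  # 95% confidence level

print(f"p-value = {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant at 95% confidence")
else:
    print("Not significant yet -- keep collecting data or accept a larger detectable lift")
```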

Once you hit statistical significance, how long do results hold?

The winner should become your new control, but keep in mind that nothing is permanent. The environment is always changing (seasonality, competitors, and even your own product mix). You can compound gains over time, but you also have to keep validating, because what worked last year might not hold this year.

For example, short-term levers like offers and subject lines tend to wear off more quickly and should be re-tested every quarter or so. Longer-term levers like journey rebuilds and rebrands can persist for years.

The reality is that some tests are just better suited for the short term than others.

How do you confirm short-term wins don’t backfire long-term?

A solid lifecycle framework for tracking the long-term value of tests is cohort analysis. If you segment customers acquired or converted during the test and follow their journey over time, it can reveal whether the short-term gain is compounding or eroding.

Take note of tactical vs. structural improvements. Structural improvements like adding personalization logic tend to hold their value, while tactical improvements like aggressive discounting usually decay over time. I look beyond the upfront conversion and track revenue per user six months out. Did the test actually generate more long-term value, or did we just attract a bunch of one-and-dones who churned right after? Without that, you can’t really trust the result.
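Here is a minimal pandas sketch of that six-month follow-up. The file names, column names, and launch date are assumptions; the point is simply to average six-month revenue over every user in each variant, churners included:

```python
# Cohort follow-up sketch: compare six-month revenue per user by test variant
# to see whether the short-term "winner" actually created more long-term value.
# File and column names are assumed for illustration:
#   assignments.csv: user_id, variant              (everyone in the test)
#   orders.csv:      user_id, order_date, revenue  (all orders after launch)
import pandas as pd

assignments = pd.read_csv("assignments.csv")
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

test_start = pd.Timestamp("2024-01-15")  # assumed launch date of the test
window_end = test_start + pd.DateOffset(months=6)

# Revenue earned per user in the six months after launch
in_window = orders[(orders["order_date"] >= test_start) & (orders["order_date"] < window_end)]
revenue_6m = in_window.groupby("user_id")["revenue"].sum()

# Average over ALL test users, so one-and-dones who churned count as (near) zero
cohort = assignments.assign(revenue_6m=assignments["user_id"].map(revenue_6m).fillna(0))
print(cohort.groupby("variant")["revenue_6m"].mean().round(2))
```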

How do you make sure you’re testing the right thing?

You always need a structured hypothesis. Otherwise, you’re just throwing stuff at the wall. If your test wins, can you explain why it won? Was it urgency? Was it simplicity? Testing isn’t just about finding a winner; it’s about learning something you can actually apply again.

Focus on learning and not just winning. Even if a test loses, it can teach you something actionable. For example, your 50% off discount didn't improve profit, but you did learn that adding a discount can improve conversion rate. Now, you can optimize the right offer to get the win in both margin and conversion.

Lastly, ask yourself, "What will this unlock?" A good test isn't about short-term lift. It should help you make better decisions for future campaigns and strategies.

What metrics do you pay attention to when analyzing tests?

The metric should always match the question your test is asking. If you’re testing subject lines, opens or click-to-open rate are best because you’re isolating the impact on attention. But for creative, offers, or segmentation tests, clicks and conversions are better indicators. When you’re testing orchestration or channel strategy, you should be looking at downstream metrics like revenue per user, retention, or even lifetime value. It’s also important to layer in secondary signals like unsubscribes, spam complaints, or engagement decay so you know your “win” isn’t doing hidden damage.

The mistake many teams make is defaulting to opens for everything, but the truth is the best metric is the one that aligns most directly with the business outcome you’re trying to influence.

How do you avoid bias when running experiments?

Bias can creep in at every stage of an experiment, so it's important to be intentional. Start with randomization. If you're only testing on your most active users or you don't split groups properly, then your results won't generalize. Also, make sure your data infrastructure is ready before launch. I've witnessed tests fail simply because the tracking wasn't in place. Finally, watch for leakage between the test and control. If users can cross between both groups, then the results are instantly compromised. The safest way to avoid bias is to think about the test up front. Who is in the test? How are they assigned to their group? Can we reliably measure the impact?
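One common way to get clean randomization and keep users from drifting between groups is deterministic, hash-based bucketing: each user is assigned once, and the same inputs always give the same answer. A minimal sketch, with the experiment name and 50/50 split as assumptions:

```python
# Deterministic assignment: hash (experiment, user) so the same user always
# lands in the same group, across channels and sessions -- no leakage.
import hashlib

def assign_variant(experiment_id: str, user_id: str, treatment_share: float = 0.5) -> str:
    """Stable, pseudo-random 'control'/'treatment' assignment for one experiment."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000  # roughly uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# Same inputs always return the same answer, so re-sends can't flip anyone's group
print(assign_variant("winback_offer_q3", "user_12345"))
```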

Big levers vs. optimization

What are the biggest levers you’ve tested in lifecycle?

Offers are almost always the biggest lever. At Imperfect Foods, we tested countless variations of percentage off versus dollars off, free shipping versus no offer. And then we started layering in caps, like ‘20% off, max $30.’ That one in particular taught us a lot. It still drove urgency, but we weren’t giving away margin endlessly.

Those kinds of experiments moved the needle way more than a subject line or button color ever could. And once we started proving the gains, they compounded. The other big lever was segmentation, especially testing lapsed versus active customers. That’s how we figured out if our winback strategy was actually working or if we were just giving discounts to people who would have come back anyway. Together, offers and segmentation consistently delivered the biggest step-changes in performance.

How do you approach offer testing without eroding margin?

The biggest mistake teams make is running discount tests with no guardrails. You can give away way too much without realizing it. We had to be disciplined about capping. For example: ‘50% off, max savings $20.’ Customers mostly just saw the ‘50% off,’ and it converted, but the cap kept our profit and loss safe, which meant we weren’t sacrificing all of our margin just to get the sale.

But there’s a flip side. If you bury the cap in the fine print, people get upset. We saw call center complaints. Some customers even canceled because they felt tricked. So we learned to be upfront. You can still protect your economics, but you have to communicate clearly. Otherwise, you win the test and lose the customer.

Is there still value in small optimizations like buttons or copy tweaks?

There is value in small optimizations, but you have to keep it in perspective. Subject lines, button colors, and little copy tweaks are all optimizations. They add up, but they’re not going to save your business. The real step-change gains usually come from bigger levers like offers, segmentation, and end-to-end experience design. Once those are in place, optimizations are absolutely worth doing. But they shouldn’t distract you from the stuff that really drives revenue.

The culture of experimentation

How do you make testing part of the culture rather than one-off experiments?

Experimentation has to be the default. You don’t ask, ‘Should we test this?’ The assumption must be that everything is a test unless there’s a good reason not to. That’s how you normalize it.

And you have to celebrate the learnings, not just the wins. If a test loses, that’s fine, as long as you can explain why. Maybe that subject line didn’t work because the value prop wasn’t clear. Or maybe that winback offer failed because it wasn’t strong enough to re-engage churned users. The important part is: Did you learn something you can apply next time?

How do you get buy-in from leadership for a testing program?

You have to frame testing as risk reduction, not as slowing things down. Executives don’t want to wait weeks for results, but they do want to avoid wasted spend or rolling out a bad experience to the entire customer base.

That’s where testing shines. At Imperfect, we proved a point with a winback test that drove incremental revenue, not just orders pulled forward, but net-new dollars. Once leadership saw the financial impact, the conversation changed. The C-suite doesn’t care about the mechanics of the testing framework itself; they care about the numbers moving in the right direction. If you can show the incremental dollars, the buy-in follows.

How do you balance short-term wins with long-term learning?

If you only optimize for short-term revenue, you’ll never build a real testing culture. Those results can be misleading. That’s why I care about lifetime value, not just conversion rate. You might find that a 20% off discount boosts signups, but those customers churn after one order. Long-term, you’re worse off. So you need both views: short-term lift and long-term retention. Otherwise, you’re building a culture around sugar highs instead of sustainable growth.

What role do analytics teams play in lifecycle testing?

Analytics support is critical. A marketer can run a subject line test, sure. But once you get into measuring LTV, you need analysts who can model impact over months or even years. At Imperfect, we leaned heavily on analytics to validate whether a winback test actually drove incremental revenue, or if we were just pulling forward an order that would have happened anyway. Without that kind of support, you’re just guessing. And guessing doesn’t scale.
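Here is a minimal sketch of that incrementality math, with made-up numbers: hold out a random slice of the winback audience, compare revenue per user between treated and holdout groups, and scale the difference:

```python
# Incrementality sketch: did the winback campaign create net-new revenue,
# or just pull forward orders that would have happened anyway?
# All figures are made up for illustration.
treated_users = 50_000
treated_revenue = 420_000.0   # revenue from treated users over the window

holdout_users = 5_000         # random slice that received no winback campaign
holdout_revenue = 31_000.0    # what "doing nothing" earned over the same window

rev_per_treated = treated_revenue / treated_users  # $8.40 per user
rev_per_holdout = holdout_revenue / holdout_users  # $6.20 per user

# Incremental revenue = per-user lift over the holdout x treated audience size
incremental = (rev_per_treated - rev_per_holdout) * treated_users
print(f"Incremental revenue: ${incremental:,.0f}")  # $110,000 in this example
```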

The future of testing & experimentation

How do you see experimentation evolving?

Experimentation is not just subject lines or offers. The real next level is orchestration and decisioning. Do you send an email first or SMS first? What content and creative should you include? Should you include an offer or no offer? What time should you send the message? These things are really hard to coordinate, and most teams aren’t there yet, but that’s where the next big wins are, and new technologies like AI Decisioning are already making this easier.

What role will AI play in that shift?

Even if AI solves the scale and decisioning problem to help you run experiments and uncover new insights, you can’t outsource the creative side of the lifecycle. If you don’t have a clear hypothesis, then no amount of insights the AI generates for you is going to help you improve your program. The biggest unlock is when you stop thinking about conversion rate today and start thinking about who’s still a customer six months from now. That’s where testing gets really powerful.
