Enter your test results to calculate statistical significance, p-value, confidence level, and uplift percentage. Includes a sample size calculator to plan your next test. Uses a two-proportion Z-test for accurate results.
Last updated: March 2026 · Reading time: 9 min
Statistical significance means the observed difference between two variations is unlikely to be caused by random chance alone. At the standard 95% confidence level, a p-value below 0.05 indicates significance.

The sample size calculator uses the same Z-test framework in reverse. Given your baseline conversion rate, minimum detectable effect, confidence level, and statistical power, it calculates how many visitors you need per variation before starting the test. This prevents the common mistake of calling a test too early with insufficient data.
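The core calculation can be sketched in a few lines of Python. This is a minimal illustration of a two-proportion Z-test, not the calculator's actual code; the function name and the visitor/conversion counts are made-up examples:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed two-proportion Z-test; returns (z, p_value, relative uplift %)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    uplift = (p_b - p_a) / p_a * 100                   # relative uplift, %
    return z, p_value, uplift

# Hypothetical example: control 300/10,000 (3.0%) vs variant 345/10,000 (3.45%)
z, p, uplift = two_proportion_z_test(300, 10_000, 345, 10_000)
```

With these example numbers the variant shows a 15% relative uplift, yet the p-value comes out above 0.05 — a useful reminder that a big-looking uplift on 10,000 visitors per arm can still fail to reach significance.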
| Metric | What It Tells You | What to Look For |
|---|---|---|
| P-value | Probability of seeing a difference at least this large if there were no real difference | Below 0.05 for 95% confidence. Below 0.01 for 99% confidence. |
| Confidence Level | How sure you can be that the result isn’t random | 95%+ is the industry standard. Some teams use 90% for faster decisions. |
| Uplift % | Relative improvement of Variant B over Control A | Consider business impact. A 2% uplift on $1M revenue = $20K. A 2% uplift on $10K = $200. |
| Sample Size | Visitors needed per variation to detect a given effect | The smaller the expected uplift, the more traffic you need. A 5% MDE needs 4x more traffic than a 10% MDE. |
“I’ve seen teams celebrate a ‘winning’ test that had p=0.04 but only 800 visitors per arm. That’s not a win, that’s a coin flip dressed up in math. Calculate your sample size before you start, commit to running the full duration, and resist the temptation to peek early. The math only works if you follow the protocol.” — Hardik Shah, Founder of ScaleGrowth.Digital
Calculate return on investment for any marketing channel or campaign. Use Calculator →
Complete CRO guide covering research, testing frameworks, and quick wins. Read Guide →
30-point checklist for landing pages that convert. Get Checklist →
The standard threshold is p < 0.05, which corresponds to 95% confidence. This means there's less than a 5% probability of seeing a difference this large if there were no real effect. Some teams use p < 0.10 (90% confidence) for faster decision-making, while high-stakes tests may require p < 0.01 (99% confidence).
It depends on your baseline conversion rate and the minimum effect you want to detect. For a 3% baseline conversion rate and a 10% relative MDE at 95% confidence with 80% power, you need roughly 53,000 visitors per variation. Higher baseline rates and larger expected effects require smaller samples. Use the sample size calculator above to get your specific number.
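You can check numbers like this with the standard two-proportion sample size formula. The sketch below hardcodes the usual z-values for 95% confidence (two-tailed) and 80% power; the function name is illustrative, not from any particular library:

```python
import math

def sample_size_per_arm(baseline, rel_mde, z_alpha=1.959964, z_beta=0.841621):
    """Visitors per variation for a two-tailed test at 95% confidence / 80% power."""
    p1 = baseline
    p2 = baseline * (1 + rel_mde)   # rate the variant must reach
    p_bar = (p1 + p2) / 2
    delta = p2 - p1
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    return math.ceil(n)

n = sample_size_per_arm(0.03, 0.10)   # 3% baseline, 10% relative MDE
```

Because sample size scales with the inverse square of the effect size, halving the MDE to 5% roughly quadruples the required traffic — the same relationship noted in the table above.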
Minimum detectable effect is the smallest relative change in conversion rate your test is designed to detect. An MDE of 10% on a 5% baseline means you’re testing whether the variant can move the rate from 5.0% to at least 5.5%. Smaller MDEs require larger sample sizes. A practical MDE for most businesses is 5-15%.
Looking at results early (called “peeking”) inflates your false positive rate. If you check a test 10 times during its run, your actual false positive rate could be 30% instead of the intended 5%. Either commit to running the test to full sample size without checking, or use a sequential testing method that adjusts for multiple looks. Most standard A/B testing tools don’t account for peeking.
This calculator uses a two-tailed test, which is the standard for A/B testing. A two-tailed test checks whether the variant is different from the control in either direction (better or worse). A one-tailed test only checks one direction, which halves the p-value for the same data and requires a somewhat smaller sample (roughly 20% less at 95% confidence and 80% power), but it can’t detect if your variant is actually performing worse. Use two-tailed unless you have a strong statistical reason not to.
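The relationship between the two is simple to see numerically: for the same test statistic, the two-tailed p-value is exactly twice the one-tailed one. The z value below is an arbitrary example:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.801                                # an arbitrary example z-statistic
one_tailed = 1 - normal_cdf(z)           # tests "variant is better" only
two_tailed = 2 * (1 - normal_cdf(z))     # tests "variant is different"
```

Here the one-tailed p-value clears the 0.05 bar while the two-tailed one does not — exactly the situation where an analyst might be tempted to switch tests after the fact, which is the statistical sin this FAQ warns against.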
Our CRO practice handles test design, implementation, statistical analysis, and rollout. We find the tests that move revenue, not just conversion rate. Get a CRO Audit →