Enter your test results to calculate statistical significance, p-value, confidence level, and uplift percentage. Includes a sample size calculator to plan your next test. Uses a two-proportion Z-test for accurate results.
Last updated: March 2026 · Reading time: 9 min
This calculator uses a two-proportion Z-test to determine whether the difference in conversion rates between your control (A) and variant (B) is statistically significant. It calculates a pooled proportion from both groups, computes the standard error, derives a Z-score, and converts that to a two-tailed p-value. If the p-value falls below 0.05, the result is statistically significant at the 95% confidence level.
Statistical significance means the observed difference between two variations is unlikely to be caused by random chance alone. At the standard 95% confidence level, a p-value below 0.05 indicates significance.
The sample size calculator uses the same Z-test framework in reverse. Given your baseline conversion rate, minimum detectable effect, confidence level, and statistical power, it calculates how many visitors you need per variation before starting the test. This prevents the common mistake of calling a test too early with insufficient data.
The Z-score formula for comparing two conversion rates:

Z = (p̂B − p̂A) / SE, where SE = √( p̂pool (1 − p̂pool) (1/nA + 1/nB) ) and p-value = 2 (1 − Φ(|Z|))
Where p̂pool is the pooled conversion rate across both groups, SE is the standard error, and Φ is the standard normal cumulative distribution function. The two-tailed test checks for differences in either direction.
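The test described above can be sketched in a few lines of stdlib-only Python. This is an illustrative implementation of the standard two-proportion Z-test, not the calculator's actual source; the function and variable names are ours.

```python
from math import sqrt, erf

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function Φ, via erf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-tailed two-proportion Z-test.

    Returns (z_score, p_value) for the difference in conversion rates
    between control A and variant B.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate across both groups
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    # Standard error under the null hypothesis of no real difference
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value: differences in either direction count
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value

# Example: control 500/10,000 (5.0%) vs. variant 580/10,000 (5.8%)
z, p = two_proportion_z_test(500, 10_000, 580, 10_000)
```

For this example the p-value comes out around 0.012, below the 0.05 threshold, so the difference would be called significant at the 95% confidence level.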
For a two-proportion Z-test with equal group sizes:

n per variation = 2 p̄ (1 − p̄) (Zα/2 + Zβ)² / (p1 − p2)²
Where p1 is the baseline rate, p2 is the expected variant rate, p̄ is their average, Zα/2 corresponds to the confidence level, and Zβ corresponds to the statistical power. This is the standard formula used by Evan Miller’s calculator and Optimizely’s sample size tool.
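The same calculation can be sketched in Python using the pooled-variance approximation above. The function name and defaults are illustrative; exact outputs can differ slightly from tools that use the more exact unpooled-variance formula.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline: float, relative_mde: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variation (pooled-variance approximation).

    baseline:     control conversion rate, e.g. 0.03 for 3%
    relative_mde: minimum detectable effect as a relative lift, e.g. 0.10
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # expected variant rate
    p_bar = (p1 + p2) / 2                # average of the two rates
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at 80% power
    n = 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2
    return ceil(n)

# 3% baseline, 10% relative MDE, 95% confidence, 80% power
n = sample_size_per_variation(0.03, 0.10)
```

With these inputs the formula gives roughly 53,000 visitors per variation; doubling the MDE to 20% cuts the requirement to roughly a quarter of that, which is the 1/MDE² scaling at work.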
| Metric | What It Tells You | What to Look For |
|---|---|---|
| P-value | Probability of seeing a difference this large if there were no real difference between variations | Below 0.05 for 95% confidence. Below 0.01 for 99% confidence. |
| Confidence Level | How sure you can be that the result isn’t random | 95%+ is the industry standard. Some teams use 90% for faster decisions. |
| Uplift % | Relative improvement of Variant B over Control A | Consider business impact. A 2% uplift on $1M revenue = $20K. A 2% uplift on $10K = $200. |
| Sample Size | Visitors needed per variation to detect a given effect | The smaller the expected uplift, the more traffic you need. A 5% MDE needs 4x more traffic than a 10% MDE. |
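The uplift and traffic-scaling claims in the table are simple arithmetic, sketched here in Python (the function name is ours):

```python
def relative_uplift(rate_a: float, rate_b: float) -> float:
    """Relative improvement of variant B over control A, as a percentage."""
    return (rate_b - rate_a) / rate_a * 100

# A move from 5.0% to 5.8% is a 16% relative uplift
uplift = relative_uplift(0.050, 0.058)

# Required sample size scales with 1/MDE^2, so halving the MDE
# (10% -> 5%) quadruples the traffic you need per variation.
traffic_multiplier = (0.10 / 0.05) ** 2  # 4.0
```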
Statistical significance is necessary but not sufficient. A test can reach significance at p=0.04 and still mislead you. Here are the conditions that make a test result trustworthy.
First, you need an adequate sample size. Calculate it before the test starts, not after. CXL’s research shows that the most common A/B testing mistake is stopping tests early when results “look significant” (CXL, 2025). A p-value below 0.05 with only 200 visitors per variation is unreliable because the result can flip with the next 200 visitors.
Second, run the test for at least one full business cycle. For most websites, that’s 7 days to account for weekday vs. weekend behavior differences. Ecommerce sites with purchase cycles should run 2-4 weeks. Stopping a test on a Tuesday because it reached significance over the weekend is a recipe for false positives.
Third, consider practical significance alongside statistical significance. VWO’s research notes that a statistically significant 0.3% uplift in conversion rate rarely justifies the development resources needed to implement the change (VWO, 2025). Set a minimum meaningful uplift before starting: typically 5-10% relative lift for most businesses.
“I’ve seen teams celebrate a ‘winning’ test that had p=0.04 but only 800 visitors per arm. That’s not a win, that’s a coin flip dressed up in math. Calculate your sample size before you start, commit to running the full duration, and resist the temptation to peek early. The math only works if you follow the protocol.”
Hardik Shah, Founder of ScaleGrowth.Digital
Calculate return on investment for any marketing channel or campaign.
Complete CRO guide covering research, testing frameworks, and quick wins.
The standard threshold is p < 0.05, which corresponds to 95% confidence. This means there's less than a 5% probability of seeing a difference this large if there were no real difference between the variations. Some teams use p < 0.10 (90% confidence) for faster decision-making, while high-stakes tests may require p < 0.01 (99% confidence).
It depends on your baseline conversion rate and the minimum effect you want to detect. For a 3% baseline conversion rate and a 10% relative MDE at 95% confidence with 80% power, you need roughly 53,000 visitors per variation. Higher baseline rates and larger expected effects require smaller samples. Use the sample size calculator above to get your specific number.
Minimum detectable effect is the smallest relative change in conversion rate your test is designed to detect. An MDE of 10% on a 5% baseline means you’re testing whether the variant can move the rate from 5.0% to at least 5.5%. Smaller MDEs require larger sample sizes. A practical MDE for most businesses is 5-15%.
Looking at results early (called “peeking”) inflates your false positive rate. If you check a test 10 times during its run, your actual false positive rate could be 30% instead of the intended 5%. Either commit to running the test to full sample size without checking, or use a sequential testing method that adjusts for multiple looks. Most standard A/B testing tools don’t account for peeking.
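The inflation from peeking is easy to demonstrate with a small A/A simulation, where both arms share the same true rate so every "significant" result is a false positive. This is a sketch: the exact inflated rate depends on traffic and peeking schedule, so the 30% figure above will not be reproduced exactly.

```python
import random
from math import sqrt, erf

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed two-proportion Z-test p-value."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(42)            # fixed seed so the simulation is reproducible
TRUE_RATE = 0.05           # A/A test: both arms convert at the same true rate
N_PER_ARM = 2000
PEEKS = 10                 # check the test 10 times during its run
RUNS = 400

fp_at_end = 0              # false positives when only the final result counts
fp_with_peeking = 0        # false positives if any peek ever crossed p < 0.05

for _ in range(RUNS):
    a = [random.random() < TRUE_RATE for _ in range(N_PER_ARM)]
    b = [random.random() < TRUE_RATE for _ in range(N_PER_ARM)]
    peeked_significant = False
    for k in range(1, PEEKS + 1):
        n = N_PER_ARM * k // PEEKS
        if p_value(sum(a[:n]), n, sum(b[:n]), n) < 0.05:
            peeked_significant = True
    fp_with_peeking += peeked_significant
    fp_at_end += p_value(sum(a), N_PER_ARM, sum(b), N_PER_ARM) < 0.05
```

Declaring a winner at the first significant peek pushes the false positive rate well above the nominal 5% that the end-of-test check stays close to.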
This calculator uses a two-tailed test, which is the standard for A/B testing. A two-tailed test checks whether the variant is different from the control in either direction (better or worse). A one-tailed test only checks one direction and needs a somewhat smaller sample (roughly 20% less at 95% confidence and 80% power, since the critical Z-value drops from 1.96 to 1.645), but it can't detect whether your variant is actually performing worse. Use two-tailed unless you have a strong statistical reason not to.
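The relationship between the two p-values is simple: for the same Z-score, the one-tailed p-value is exactly half the two-tailed one. An illustrative Python snippet (the Z-score here is an invented example):

```python
from math import sqrt, erf

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function Φ."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 1.80  # hypothetical Z-score from a test
p_two_tailed = 2 * (1 - normal_cdf(abs(z)))  # ~0.072: not significant at 0.05
p_one_tailed = 1 - normal_cdf(z)             # ~0.036: "significant" one-tailed

# The same result can pass a one-tailed test while failing the two-tailed
# one, which is exactly why two-tailed is the safer default.
```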
Our CRO practice handles test design, implementation, statistical analysis, and rollout. We find the tests that move revenue, not just conversion rate.