Statistical evaluation can be a challenging aspect of peer review. While researchers are typically familiar with the methods used in their own field, assessing analyses that rely on different or more advanced approaches may require additional care.
If you’re new to peer review, feeling uncertain about evaluating statistical analyses is understandable. However, you do not need to be a statistician to provide useful feedback on the statistical content of a manuscript. What you need is a framework for identifying the most common problems and an honest approach to the limits of your evaluation. This post covers what to look for when assessing statistical analyses, and how to raise concerns or flag the limits of your expertise in a way that is useful to the editor.
Is the right test being used?
The most fundamental question about any statistical analysis is whether the approach chosen is appropriate for the data type, research design, and question being asked. Different data types require different types of statistical tests: continuous data, binary outcomes, count data, and time-to-event data each have methods suited to their characteristics.
You do not need to know every statistical test to spot mismatches between data and methods. If you can identify what the data looks like and what question the analysis is supposed to answer, you will often be able to assess whether the approach makes sense at a basic level. If it does not seem right, but you are not confident enough to specify the problem, say so in your report.
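To make this concrete, here is a minimal Python sketch, using SciPy with invented data, of how the data type drives the choice of test. The numbers are purely illustrative.

```python
# Minimal sketch: data type drives test choice. All data are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Continuous outcome, two independent groups -> two-sample t-test
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
t_stat, p_cont = stats.ttest_ind(group_a, group_b)

# Binary outcome, two groups -> chi-square test on a 2x2 table
table = np.array([[18, 12],   # group A: successes, failures
                  [10, 20]])  # group B: successes, failures
chi2, p_bin, dof, expected = stats.chi2_contingency(table)

print(f"continuous outcome: p = {p_cont:.3f}")
print(f"binary outcome:     p = {p_bin:.3f}")
```

Count data and time-to-event data call for yet other tools (Poisson or negative binomial regression, survival methods such as the log-rank test), which is exactly why a t-test applied to, say, event counts should give you pause.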
Sample size and statistical power
An underpowered study, one with too few participants or observations to reliably detect the effect it is looking for, is a well-recognized issue in research. When a study finds no significant effect, it may reflect a true null result or it may simply lack the statistical power to detect an effect that is genuinely there.
Look for a power calculation or a sample size justification in the methods section. If one is present, check that the assumed effect size is reasonable in light of the existing literature. If it is absent and the sample is small, make a note of it. An underpowered study that reports a positive finding may be overestimating the effect size due to sampling variability, a problem that has contributed significantly to the replication crisis across disciplines. The PubMed Central (PMC) overview of statistical power practices provides useful context on why this matters for the published record, especially in fields like psychology.
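To see what a power calculation actually involves, here is a short sketch using statsmodels. The effect size of 0.5 (Cohen’s d) is an assumed value chosen for illustration, not a recommendation.

```python
# Sketch: sample size needed to detect an assumed effect (Cohen's d = 0.5)
# with 80% power at alpha = 0.05. The effect size is illustrative only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"required n per group: {n_per_group:.0f}")  # about 64

# The reverse question is just as useful when reviewing: how much power
# did a study of a given size have for the effect it claims to detect?
achieved = analysis.solve_power(effect_size=0.5, nobs1=20, alpha=0.05)
print(f"power with n = 20 per group: {achieved:.2f}")  # roughly 0.34
```

A study with 20 participants per group has only about a one-in-three chance of detecting a medium-sized effect, and that kind of mismatch between sample size and claims is well worth pointing out in a review.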
Reporting standards: Beyond p-values
Probability values (p-values) are widely misunderstood and widely misused. A p-value below 0.05 does not tell you that the effect is large, important, or real; it tells you only that, if the null hypothesis were true, a result at least as extreme as the one observed would be unlikely, and even that interpretation assumes the analysis was planned in advance.
Check whether effect sizes are reported alongside the p-values. An effect size tells you how large the difference or relationship is, which is often more informative than whether it crosses a significance threshold. Confidence intervals are also important: they show the range within which the true population parameter is likely to fall, giving a much richer picture of the precision of the estimate. If p-values are reported without effect sizes or confidence intervals, note this as a concern.
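As a concrete illustration, here is what reporting all three quantities side by side might look like in Python, with simulated data standing in for a real two-group comparison:

```python
# Sketch: reporting an effect size and confidence interval alongside
# a p-value. Data are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, 40)
b = rng.normal(11.0, 2.0, 40)

t_stat, p_value = stats.ttest_ind(a, b)

# Cohen's d: mean difference scaled by the pooled standard deviation
pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                     (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
cohens_d = (b.mean() - a.mean()) / pooled_sd

# 95% CI for the difference in means, from the t distribution
se_diff = pooled_sd * np.sqrt(1 / len(a) + 1 / len(b))
dof = len(a) + len(b) - 2
margin = stats.t.ppf(0.975, dof) * se_diff
diff = b.mean() - a.mean()

print(f"p = {p_value:.3f}, d = {cohens_d:.2f}, "
      f"95% CI for difference: [{diff - margin:.2f}, {diff + margin:.2f}]")
```

A results section that reports only the first of these three numbers is leaving out most of the story, which is precisely the concern to raise.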
Multiple testing and data dredging
When researchers conduct multiple statistical tests in a single study, the probability of finding at least one false positive increases with each additional test. If a study reports a large number of comparisons without adjusting for multiple testing, the significant results may be an artifact of the testing process rather than genuine effects.
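A quick simulation makes the problem vivid. With 20 independent tests at alpha = 0.05, the chance of at least one false positive is 1 - 0.95^20, or about 64%, even when every null hypothesis is true. The sketch below runs 20 tests on pure noise and then applies the Benjamini-Hochberg correction via statsmodels:

```python
# Sketch: false positives accumulate across many tests. All p-values
# here come from null data, so any "significant" result is spurious.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)

# 20 tests on pure noise: at alpha = 0.05 we expect about one
# false positive by chance alone
p_values = [stats.ttest_ind(rng.normal(size=30),
                            rng.normal(size=30)).pvalue
            for _ in range(20)]

reject_raw = [p < 0.05 for p in p_values]
reject_adj, p_adj, _, _ = multipletests(p_values, alpha=0.05,
                                        method='fdr_bh')

print(f"uncorrected 'significant' results: {sum(reject_raw)}")
print(f"after Benjamini-Hochberg:          {reject_adj.sum()}")
```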
Related to this is a practice sometimes called data dredging or p-hacking: running analyses until a significant result is found and then reporting only that result. Signs of this include an unusual number of subgroup analyses, outcomes that do not match the study’s stated aims, or a pattern of results that are suspiciously consistent. If you suspect outcome switching, where the results reported do not match the primary outcomes specified in the methods, flag it clearly in your review.
When to flag an issue
If a statistical approach is beyond your expertise, say so in your report. Be specific. For example: “The mixed-effects modeling approach used in section 3 is outside my area of statistical expertise, and I would recommend the editor seek input from a reviewer with quantitative methods experience in this area.” This is more useful than either ignoring the issue or offering an uninformed opinion.
There are things you can almost always assess, regardless of statistical expertise: whether a power calculation is present, whether effect sizes are reported, whether the results section maps onto the methods, and whether the conclusions are proportionate to what the statistics actually show. Statistical concerns are a legitimate and important part of peer review, and raising them, even when you are not entirely certain, is better than staying silent.
We’d like to hear from you
Do you find peer reviewing statistics particularly challenging? What approaches have you found helpful? Share your experience in the comments. This is a topic where reviewers across disciplines can learn from each other.
For more guidance on evaluating manuscripts, including their statistical content, join the ReviewerOne community.
