Hypothesis Testing: Foundations and Classical Methods
- Jose Sanz
- Dec 2, 2025
- 5 min read
In our last post, we explored how to turn broad ideas into well-defined research questions and testable hypotheses. Now, it’s time to move from asking a good question to testing it. Statistical hypothesis testing sits at the heart of scientific inference, giving us a structured way to evaluate claims about populations using sample data.
This post introduces the foundations of hypothesis testing, surveys classical methods like t-tests, chi-square tests, and ANOVA, and explains why choosing the right test depends not just on our question but also on data type and assumptions.
And if you still have questions, do not hesitate to contact us. Outtadesk exists to help you extract more from your data!
Foundations of hypothesis testing
Statistical hypothesis testing is a cornerstone of scientific inference, providing a structured framework for evaluating claims about populations based on sample data.

Figure 1. Example of a sample (a set from a population), and inference (claims about the population based on the sample).
The process of hypothesis testing involves formulating a null hypothesis (typically representing no effect or no difference) and an alternative hypothesis, then using sample data to assess the plausibility of the null hypothesis through test statistics and p-values.
For example, based on Figure 1, we could have the following claim:
Black trees and white trees have different heights.
In this case, our hypothesis to be tested would be:
H0 (null hypothesis): Black trees and white trees have the same height.
Ha or H1 (alternative hypothesis): Black trees and white trees have different heights.
In the next step, we would collect height measurements from a sample of black trees and white trees and organize the data for analysis.
Types of errors in statistical inference
A Type I error means rejecting a true null hypothesis (a false positive), while a Type II error means failing to reject a false null hypothesis (a false negative). In our example with black and white trees, a Type I error would occur if we concluded that the two types of trees have different heights when, in reality, their true average heights are the same. In other words, we detect a difference that doesn’t actually exist. A Type II error, on the other hand, would happen if we concluded that black and white trees have the same height when they truly differ. Here, we fail to detect a real difference that is present in the population.
Both types of errors are possible whenever we make decisions based on sample data, which is why sample size, test choice, and assumptions play such an important role in hypothesis testing.
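To make Type I errors concrete, here is a small simulation sketch (assuming Python with NumPy and SciPy; the tree heights are made up for illustration). We repeatedly draw two samples from the same population, so the null hypothesis is true by construction, and count how often a t-test at α = 0.05 still “detects” a difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_sims = 2000

# Simulate experiments where H0 is TRUE: both "tree types"
# share the same mean height (10 m) and spread (2 m).
false_positives = 0
for _ in range(n_sims):
    black = rng.normal(loc=10.0, scale=2.0, size=30)
    white = rng.normal(loc=10.0, scale=2.0, size=30)
    _, p = stats.ttest_ind(black, white)
    if p < alpha:
        # Type I error: we "detected" a difference that isn't there
        false_positives += 1

type1_rate = false_positives / n_sims
print(f"Observed Type I error rate: {type1_rate:.3f} (expected about {alpha})")
```

As expected, the false-positive rate hovers around the chosen α: the significance level is precisely the Type I error rate we agree to tolerate.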
Brief historical background
The development of hypothesis testing has been shaped by foundational contributions from Fisher, Neyman, and Pearson, leading to widely used procedures such as t-tests, chi-square tests, and ANOVA, as well as alternative approaches like Bayesian inference and nonparametric methods.
The t-test was developed by William Sealy Gosset in 1908 while working at the Guinness Brewery. Due to company policy, he published under the pseudonym "Student," so the test is often called "Student’s t-test". Gosset’s work addressed the need for reliable inference with small samples, a common issue in industrial quality control at the time.
t-tests are statistical tests used to compare the means of two groups to determine if they are significantly different from each other. Common types include the independent samples t-test, paired samples t-test, and one-sample t-test. They are based on the t-distribution, which is appropriate for small sample sizes and when the population standard deviation is unknown.
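The three common variants can be sketched in a few lines with SciPy (the heights below are simulated stand-ins for real measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical height samples (metres) for the two tree types in Figure 1
black = rng.normal(loc=12.0, scale=2.0, size=25)
white = rng.normal(loc=10.0, scale=2.0, size=25)

# Independent samples t-test: compares the means of two unrelated groups
t_ind, p_ind = stats.ttest_ind(black, white)

# One-sample t-test: compares one group's mean to a fixed reference value
t_one, p_one = stats.ttest_1samp(black, popmean=10.0)

# Paired samples t-test: the same trees measured twice (e.g. two years apart)
remeasured = black + rng.normal(loc=0.5, scale=0.3, size=25)
t_pair, p_pair = stats.ttest_rel(remeasured, black)
```

Each call returns the t statistic and its p-value; which variant applies depends entirely on how the data were collected.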
The chi-square test was introduced by Karl Pearson in 1900. Pearson’s work provided a method for assessing goodness-of-fit and independence in categorical data, laying the foundation for modern inferential statistics involving categorical variables. The test has since evolved to address various problems, including independence and homogeneity.
Chi-square tests are used to examine the association between categorical variables or to test how well observed data fit an expected distribution. The most common forms are the test of independence, the test of homogeneity, and the goodness-of-fit test. These tests compare observed frequencies to expected frequencies under the null hypothesis.
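A quick sketch of two of these forms, using a hypothetical contingency table (tree type versus disease status; the counts are invented for illustration):

```python
from scipy import stats

# Hypothetical contingency table:
#                 diseased  healthy
observed = [[30, 70],   # black trees
            [55, 45]]   # white trees

# Test of independence: is tree type associated with disease status?
chi2, p, dof, expected = stats.chi2_contingency(observed)

# Goodness-of-fit: do 30 diseased vs 70 healthy counts match a 1:1 split?
chi2_gof, p_gof = stats.chisquare([30, 70], f_exp=[50, 50])
```

In both cases the statistic grows as the observed frequencies drift further from the frequencies expected under the null hypothesis.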
ANOVA was developed by Ronald Fisher in 1918. Fisher’s innovation addressed the problem of increasing Type I error rates when conducting multiple t-tests. ANOVA provided a unified approach for comparing multiple groups simultaneously, revolutionizing experimental design and analysis in many scientific fields.
ANOVA is a statistical method used to compare means across three or more groups to determine if at least one group mean is significantly different from the others. It analyzes the variance within and between groups and is an extension of the t-test for more than two groups.
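A minimal one-way ANOVA sketch with three simulated groups, where the third group is deliberately taller:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical heights for three tree types; the third has a higher mean
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.normal(10.0, 2.0, size=30)
group_c = rng.normal(13.0, 2.0, size=30)

# One-way ANOVA: does at least one group mean differ from the others?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
```

Note that a significant ANOVA only says *some* group differs; identifying *which* one requires post-hoc comparisons (e.g. Tukey’s HSD).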
Choosing a test
The choice of test depends on the research question, data type, and underlying assumptions, and the interpretation of results requires careful consideration of statistical significance, effect size, and potential errors. Despite its ubiquity, hypothesis testing is often misunderstood, with common pitfalls including misinterpretation of p-values and overreliance on statistical significance.
Returning to our black and white trees, the general workflow for the next steps would look like this:
1) Choose an appropriate statistical test.
Because we’re comparing the mean height of two groups, a two-sample t-test would typically be suitable, assuming the data meet the assumptions of normality and equal variances. If those assumptions don’t hold, we might instead use a nonparametric alternative.
2) Check assumptions.
Before running the test, we would explore the data through summary statistics and visualizations, and evaluate normality, variance homogeneity, and potential outliers. These checks ensure that the test results are reliable.
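One way to sketch these checks in code (the Shapiro–Wilk and Levene tests used here are common choices, but not the only ones; the data are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
black = rng.normal(12.0, 2.0, size=30)
white = rng.normal(10.0, 2.0, size=30)

# Normality: Shapiro-Wilk test on each group (H0: data are normally distributed)
_, p_norm_black = stats.shapiro(black)
_, p_norm_white = stats.shapiro(white)

# Equal variances: Levene's test (H0: the group variances are equal)
_, p_levene = stats.levene(black, white)

# Large p-values here mean the assumptions are NOT contradicted by the data
normality_ok = p_norm_black > 0.05 and p_norm_white > 0.05
variances_ok = p_levene > 0.05
```

These formal tests should complement, not replace, visual checks such as histograms and Q-Q plots.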
3) Calculate the test statistic and p-value.
Using the chosen method, we compute a test statistic that reflects how different the observed sample means are relative to what we’d expect if the null hypothesis were true. The p-value then quantifies the probability of observing a difference at least this large by chance, assuming the null hypothesis is true.
4) Make a decision.
We compare the p-value to a predetermined significance level (e.g., α = 0.05). If the p-value is below this threshold, we reject the null hypothesis and conclude that black and white trees differ in height. If not, we fail to reject the null and interpret the evidence accordingly.
5) Interpret results beyond significance.
Finally, we would evaluate the effect size, confidence intervals, and the biological or ecological relevance of the difference. Statistical significance alone is not enough; always remember that context matters.
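Putting the five steps together, an end-to-end sketch for the tree example might look like this (simulated data; the switch to Welch’s t-test when variances look unequal is one reasonable design choice, not the only one):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha = 0.05

# Hypothetical height measurements (metres)
black = rng.normal(12.0, 2.0, size=40)
white = rng.normal(10.0, 2.0, size=40)

# Step 2: check the equal-variance assumption before fixing the test form
_, p_levene = stats.levene(black, white)
equal_var = bool(p_levene > alpha)

# Steps 1 and 3: standard t-test if variances look equal, Welch's otherwise
t_stat, p_value = stats.ttest_ind(black, white, equal_var=equal_var)

# Step 4: decision at the chosen significance level
reject_h0 = p_value < alpha

# Step 5: effect size (Cohen's d) to judge practical relevance
pooled_sd = np.sqrt((black.var(ddof=1) + white.var(ddof=1)) / 2)
cohens_d = (black.mean() - white.mean()) / pooled_sd

print(f"p = {p_value:.4f}, reject H0: {reject_h0}, Cohen's d = {cohens_d:.2f}")
```

Reporting the effect size alongside the p-value is what lets a reader judge whether a “significant” two-metre difference actually matters ecologically.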
Conclusions
In summary, statistical hypothesis testing remains a powerful and essential tool for scientific research, but using it effectively requires more than running a test and reporting a p-value. It depends on thoughtful methodological choices, awareness of assumptions, clear interpretation of results, and an understanding of potential errors. As research questions become more complex, ongoing refinement of statistical approaches ensures that hypothesis testing continues to support robust and reliable scientific inference.
Further reading
Hogg, R. V., Tanis, E. A., & Zimmerman, D. L. (1977). Probability and statistical inference. New York: Macmillan.