Principal component analysis – what it is and how to run it
- Amanda Duim Ferreira
- Jan 19
- 5 min read
In the world of data science, having "too much information" can sometimes be as problematic as having too little. When you are dealing with a dataset containing dozens of variables, identifying patterns becomes a struggle. In this case, Principal Component Analysis (PCA) can be a powerful ally in reducing the dimensionality of your data. So, let’s get started with this amazing statistical technique that turns complexity into clarity.
Whether you are a student, a researcher, or a data analyst, this guide will walk you through what PCA is, how to check if your data is ready for it, and how to avoid common pitfalls. At the end, you can download a dataset and an R code to run a PCA!
If you want help running your PCA or having an expert team do it for you, check our Services page to learn more about our Data Analysis packages and how Outtadesk can help you turn data into publishable results.
Principal Component Analysis (PCA): what it is
Principal Component Analysis (PCA) is a multivariate statistical method that transforms many possibly correlated variables into a smaller number of new, uncorrelated variables (principal components) that capture most of the variation in the data.
These new variables, called principal components, are linear combinations of the original variables and are mutually uncorrelated (orthogonal). The components are ordered so that the first component explains the maximum possible variance, the second explains the next most, and so on. Mathematically, it is based on the eigen-decomposition of the covariance/correlation matrix or singular value decomposition (SVD).
Imagine you are photographing a teapot. If you take the photo from the top, you lose the information about its height. If you take it from the side, you see the height but might miss the width. PCA essentially finds the "best angle" to take the photo so that you capture the maximum amount of detail in a 2D image, minimizing the information lost.
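To make the mathematics above concrete: the downloadable script for this post is in R, but the same steps can be sketched in Python with NumPy. The two-variable dataset below is simulated purely for illustration.

```python
# Minimal PCA sketch via eigen-decomposition of the correlation matrix.
# The simulated data has two strongly correlated variables, so PC1
# should capture most of the variance.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=100)])

# Standardize, then eigen-decompose the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
corr = np.corrcoef(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# eigh returns eigenvalues in ascending order; reverse so PC1 comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Scores: project the standardized data onto the eigenvectors
scores = z @ eigenvectors
explained = eigenvalues / eigenvalues.sum()
print(explained)  # share of variance captured by each component
```

Note how the resulting component scores are mutually uncorrelated (orthogonal), exactly as described above, while the first component absorbs most of the shared variation.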

What Are the Assumptions?
Before running PCA, you must ensure your data meets specific requirements. PCA relies on mathematical geometry, and applying it to an unsuitable dataset can lead to misleading results.
Continuous Variables: PCA is best suited for continuous data (e.g., height, weight, temperature). It is not designed for categorical data.
Linearity: PCA assumes that the relationships between variables are linear, since it relies on Pearson correlation coefficients. Attention: your variables should be genuinely correlated. Avoid including variables that are calculated as functions of others (i.e., derived variables). Example: in Soil Science, base saturation (V%) is calculated from the sum of Ca2+, Mg2+, K+, and Na+ contents divided by the cation exchange capacity (CEC). Do not include base saturation in the spreadsheet for PCA along with those other variables.
Large Sample Size: You generally need a minimum of 5 observations per variable, though more is always better to ensure stability.
Sampling Adequacy: There must be enough correlation between variables to warrant combining them. If variables are completely independent, PCA will not work.
No Significant Outliers: Outliers can heavily distort the components because PCA minimizes the mean squared distance.
How to Check if My Data is Suitable?
Here are the specific statistical tests and checks you need to perform before running a PCA:
1. Correlation Matrix
Generate a correlation matrix for your variables. If the majority of correlation coefficients are below 0.3, PCA may not be useful because the variables do not share enough common variance to be grouped.
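This check is easy to automate. The NumPy sketch below simulates correlated data and reports what share of the pairwise correlations clear the 0.3 threshold mentioned above; the dataset and threshold are illustrative.

```python
# Screen the correlation matrix: what fraction of pairwise |r| exceed 0.3?
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = base + rng.normal(scale=0.7, size=(200, 4))  # 4 correlated variables

corr = np.corrcoef(data, rowvar=False)
off_diag = np.abs(corr[np.triu_indices_from(corr, k=1)])  # upper triangle only
share_above = (off_diag > 0.3).mean()
print(f"{share_above:.0%} of pairwise |r| exceed 0.3")
```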
2. Kaiser-Meyer-Olkin (KMO) Measure
The KMO test measures the proportion of variance among variables that might be common variance. It ranges from 0 to 1, and a value above 0.6 is generally considered adequate:
0.8 - 1.0: Excellent
0.7 - 0.8: Good
0.6 - 0.7: Acceptable
0.5 - 0.6: Poor
< 0.5: Unacceptable, because PCA will not yield reliable results.
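The overall KMO measure can be computed by hand from the correlation matrix and the matrix of partial correlations (obtained from the inverse of the correlation matrix). Here is a NumPy sketch on simulated one-factor data; in practice you would use a statistics package, so treat this as an illustration of the formula.

```python
# Hand-rolled overall KMO: sum of squared correlations divided by the sum
# of squared correlations plus squared partial correlations (off-diagonal).
import numpy as np

def kmo(data):
    r = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(r)
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d                 # partial correlation matrix
    np.fill_diagonal(partial, 0)       # keep off-diagonal terms only
    np.fill_diagonal(r, 0)
    return (r**2).sum() / ((r**2).sum() + (partial**2).sum())

rng = np.random.default_rng(1)
factor = rng.normal(size=(300, 1))
data = factor + rng.normal(scale=0.8, size=(300, 5))  # one common factor
kmo_value = kmo(data)
print(f"KMO = {kmo_value:.2f}")
```

Because every variable here shares one underlying factor, the partial correlations are small relative to the raw correlations, which pushes KMO toward the "excellent" range.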
3. Bartlett’s Test of Sphericity
This hypothesis test checks if your correlation matrix is significantly different from an identity matrix (a matrix in which all variables are completely uncorrelated). You want a significant result (p < 0.05). This confirms that your variables correlate enough to meaningfully reduce dimensions.
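Bartlett's statistic itself is simple to compute: chi-square = -(n - 1 - (2p + 5)/6) · ln|R|, with p(p - 1)/2 degrees of freedom, where R is the correlation matrix. A NumPy sketch on simulated data (a large statistic relative to its degrees of freedom leads to rejecting the identity-matrix hypothesis):

```python
# Bartlett's test of sphericity statistic, computed from the determinant
# of the correlation matrix. |R| near 1 means near-identity (bad for PCA);
# |R| near 0 means strong correlation structure (good for PCA).
import numpy as np

def bartlett_sphericity(data):
    n, p = data.shape
    r = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(r))
    dof = p * (p - 1) / 2
    return statistic, dof

rng = np.random.default_rng(2)
factor = rng.normal(size=(150, 1))
data = factor + rng.normal(scale=1.0, size=(150, 4))  # correlated variables
stat, dof = bartlett_sphericity(data)
print(f"chi-square = {stat:.1f} on {dof:.0f} degrees of freedom")
```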
How to Run a PCA Analysis
Running a PCA typically follows this workflow:
Data Selection: Choose your continuous numerical variables.
Assumption Checking: Run KMO and Bartlett’s tests.
Standardization (this is crucial): Scale and center your data. This ensures that variables with large ranges do not dominate variables with small ranges.
Compute Components: Calculate the eigenvectors and eigenvalues.
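To see why the standardization step is crucial, here is a small simulated demonstration: one variable has a tiny range and the other a huge one (both ranges are invented for illustration). Without scaling, the large-range variable swallows PC1 entirely.

```python
# Why standardize: compare PC1's share of variance before and after scaling.
import numpy as np

rng = np.random.default_rng(3)
small = rng.normal(scale=1, size=200)       # small-range variable
large = rng.normal(scale=1000, size=200)    # large-range variable
data = np.column_stack([small, large])

def first_pc_variance_share(x):
    cov = np.cov(x, rowvar=False)
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1]
    return eig[0] / eig.sum()

raw_share = first_pc_variance_share(data)
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
scaled_share = first_pc_variance_share(z)
print(f"PC1 share, raw: {raw_share:.3f}; standardized: {scaled_share:.3f}")
```

On the raw data, PC1 is essentially just the large-range variable; after standardization the two independent variables split the variance roughly evenly, which is the honest answer here.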
Main questions during the process of running a PCA
1) Is it worth it?
PCA is only useful when there is correlation between variables. If the variables are independent (no correlation), PCA cannot group them or reduce dimensions: each component simply mirrors the variance of a single variable, providing no new insight. PCA is effective when the data is correlated because the principal components can then capture and summarize the shared patterns, explaining more variance than any individual variable could on its own.
2) How many components to use?
With the Kaiser criterion, you keep components with eigenvalues > 1. Alternatively, you can use the scree plot and look for the "elbow" where the variance explained levels off. If your main goal is visualization, keep 2 or 3 components, since 2D graphs are easier to interpret.
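The Kaiser criterion is straightforward to apply in code. This NumPy sketch simulates six variables driven by two underlying factors, so exactly two eigenvalues of the correlation matrix should exceed 1.

```python
# Kaiser criterion: count eigenvalues of the correlation matrix above 1.
import numpy as np

rng = np.random.default_rng(4)
f1 = rng.normal(size=(250, 1))
f2 = rng.normal(size=(250, 1))
noise = lambda: rng.normal(scale=0.6, size=(250, 3))
data = np.hstack([f1 + noise(), f2 + noise()])  # 6 variables, 2 factors

eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
kept = (eigenvalues > 1).sum()
print(f"eigenvalues: {np.round(eigenvalues, 2)}; keep {kept} components")
```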
3) How to interpret the components?
Variables' loadings are the heart of interpretation. They represent the correlation between the original variables and the Principal Component. The closer the absolute value is to 1.0, the stronger the influence of that variable on the component. The direction (Positive vs. Negative) also tells you how variables move together.
You can also “name” the components. For instance, if PC1 has high loadings for Nitrogen, Phosphorus, and Potassium contents in the soil, you might name PC1 "Soil nutrient status." And if PC2 loads heavily on plant height, leaf area, and stem diameter, you might name it "Vegetative development".
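When PCA is run on the correlation matrix, a loading can be computed as eigenvector × √eigenvalue, which equals the correlation between the variable and the component. In the sketch below, the variable names N, P, and K are hypothetical, echoing the soil-nutrient example above.

```python
# Loadings = eigenvectors scaled by the square root of their eigenvalues.
import numpy as np

rng = np.random.default_rng(5)
nutrient = rng.normal(size=(200, 1))
data = nutrient + rng.normal(scale=0.5, size=(200, 3))
names = ["N", "P", "K"]  # hypothetical soil nutrient contents

corr = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])
for name, loading in zip(names, loadings[:, 0]):
    print(f"{name}: PC1 loading = {loading:+.2f}")
```

All three loadings come out with the same sign and large magnitude, which is the pattern that would justify naming PC1 something like "Soil nutrient status." (The overall sign may flip between runs or software packages; more on that below.)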
4) How to interpret the biplot?
The biplot overlays the points representing your individual samples and the vectors (arrows) representing your variables. To read it, analyze two things: the angle between the arrows and the direction of the arrows. Narrow angles indicate high correlation, a 90-degree angle indicates no correlation, and a 180-degree angle indicates a strong negative correlation. Samples located in the same direction as an arrow have high values for that variable.
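The angle rule can be checked numerically: when the arrows are built from the loadings on the first two components, the cosine of the angle between two arrows approximates the correlation between those variables. A NumPy sketch with simulated data (two correlated variables plus one independent one):

```python
# Angles between variable arrows in loading space mirror correlations.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=300)
y = 0.8 * x + rng.normal(scale=0.6, size=300)   # correlated with x
w = rng.normal(size=300)                        # independent of both
data = np.column_stack([x, y, w])

corr = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1][:2]       # keep the first two PCs
arrows = eigenvectors[:, order] * np.sqrt(eigenvalues[order])  # rows = variables

def angle_deg(u, v):
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1, 1)))

a_xy = angle_deg(arrows[0], arrows[1])
a_xw = angle_deg(arrows[0], arrows[2])
print(f"x vs y: {a_xy:.0f} degrees; x vs w: {a_xw:.0f} degrees")
```

The correlated pair ends up with a narrow angle, while the independent variable sits near 90 degrees from both, matching the reading rules above.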
Main Mistakes in Interpreting a PCA Analysis
Once you have your results, be careful how you read them:
Over-interpreting Low Variance: If PC1 and PC2 combined explain only 30% of the total variance, your PCA model is weak. Drawing strong conclusions from a model that leaves out 70% of the information is dangerous.
Confusing Correlation with Causality: Just because variables group in a component doesn't mean one causes the other; it just means they vary together.
Ignoring the Sign Arbitrariness: The sign of a component loading (positive or negative) is somewhat arbitrary. Focus on the magnitude and the direction relative to other variables, not just whether it is positive or negative in isolation.
Ready to try it yourself? We have prepared a clean dataset and an annotated R script that performs the Assumption Checks (KMO & Bartlett) and the full PCA visualization. Put your best email below to receive the code and the data! The file includes annotated code snippets you can drop into your project and adapt.
Recommended resources
Greenacre, M., Groenen, P., D’Enza, A., Markos, A., & Tuzhilina, E. (2022). Principal component analysis. Nature Reviews Methods Primers, 2. https://doi.org/10.1038/s43586-022-00184-w.
Gewers, F., Ferreira, G., Arruda, H., Silva, F., Comin, C., Amancio, D., & Costa, L. (2018). Principal Component Analysis. ACM Computing Surveys (CSUR), 54, 1 - 34. https://doi.org/10.1145/3447755.
Lever, J., Krzywinski, M., & Altman, N. (2017). Points of Significance: Principal component analysis. Nature Methods, 14, 641-642. https://doi.org/10.1038/nmeth.4346.