R Biplot Pca
biplot.princomp {stats} | R Documentation |
Biplot for Principal Components
Description
Produces a biplot (in the strict sense) from the output of princomp or prcomp.
Usage
Arguments
x | an object of class |
choices | length 2 vector specifying the components to plot. Only the default is a biplot in the strict sense. |
scale | The variables are scaled by |
pc.biplot | If true, use what Gabriel (1971) refers to as a 'principal component biplot', with |
... | optional arguments to be passed to |
Details
This is a method for the generic function biplot. There is considerable confusion over the precise definitions: those of the original paper, Gabriel (1971), are followed here. Gabriel and Odoroff (1990) use the same definitions, but their plots actually correspond to pc.biplot = TRUE.
Side Effects
a plot is produced on the current graphics device.
References
Gabriel, K. R. (1971). The biplot graphical display of matrices with applications to principal component analysis. Biometrika, 58, 453–467.
Gabriel, K. R. and Odoroff, C. L. (1990). Biplots in biomedical research. Statistics in Medicine, 9, 469–485.
See Also
biplot, princomp.
Examples
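The example code for this help page did not survive in the text above. As a minimal sketch, assuming the built-in USArrests data as a stand-in, the calls below show how the choices and pc.biplot arguments described under Arguments can be used with a princomp fit.

```r
# Sketch only: USArrests is a built-in data set used here as a stand-in.
pc <- princomp(USArrests, cor = TRUE)   # PCA on the correlation matrix

biplot(pc)                              # default: components 1 and 2
biplot(pc, choices = c(2, 3))           # plot components 2 and 3 instead
biplot(pc, pc.biplot = TRUE)            # Gabriel's "principal component biplot"
```

Each call draws on the current graphics device, so in an interactive session the three plots replace one another.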
Principal Component Analysis (PCA)
Wednesday, May 13, 2020
Principal Component Analysis (PCA) is widely used to explore data. It lets you visualize and understand how the variables in a dataset vary, which makes it particularly helpful when the dataset contains many variables. PCA is an unsupervised learning method that helps you understand the variability in the dataset and how different variables are related.
The components in PCA capture the underlying structure in the data. They indicate the directions of greatest variance, that is, the directions along which the data are most spread out. In other words, PCA finds the straight line that best spreads the data out when the data are projected onto it. This line is the first principal component: the direction that shows the most variance in the data.
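To make "the direction of greatest variance" concrete, here is a small sketch in base R. It assumes the built-in USArrests data as a stand-in for any numeric dataset, projects the standardized data onto the first principal component, and compares the resulting variance with the variance along 1000 random directions.

```r
# Sketch: USArrests (built in) stands in for any numeric data set.
X <- scale(USArrests)                 # centre and standardize each variable
pca <- prcomp(X)
w1 <- pca$rotation[, 1]               # unit vector giving the PC1 direction

# Variance of the data after projecting onto PC1
var_pc1 <- var(as.vector(X %*% w1))

# Variance along 1000 random unit directions, for comparison
set.seed(1)
random_dirs <- matrix(rnorm(1000 * ncol(X)), ncol = 1000)
random_dirs <- sweep(random_dirs, 2, sqrt(colSums(random_dirs^2)), "/")
var_random <- apply(X %*% random_dirs, 2, var)

var_pc1            # same value as pca$sdev[1]^2
max(var_random)    # no random direction spreads the data more than PC1
```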
PCA is a linear transformation of a dataset that holds values for a certain number of variables (coordinates) measured on a certain number of samples. The transformation fits the dataset to a new coordinate system in which the greatest variance lies along the first coordinate, and each subsequent coordinate is orthogonal to the preceding ones and has a smaller variance. In this way, you transform a set of x correlated variables over y samples into a set of p uncorrelated principal components over the same samples.
We need to load some packages into the R session for this post. I prefer loading packages with the require() function, but you can also use the library() function.
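The package-loading code is not shown in the post, so the following is only a sketch. The package list is an assumption based on the packages named later in the text (dplyr, ggplot2, ggpubr and ggbiplot). It also illustrates the practical difference between the two functions: require() returns TRUE or FALSE, while library() stops with an error if a package is missing.

```r
# Packages referred to in this post; whether they are installed on your
# machine is an assumption, so each one is checked rather than taken for granted.
pkgs <- c("dplyr", "ggplot2", "ggpubr", "ggbiplot")

# require() returns TRUE or FALSE (with a warning) instead of stopping,
# whereas library() raises an error when a package is missing.
loaded <- vapply(
  pkgs,
  function(p) {
    suppressWarnings(suppressPackageStartupMessages(
      require(p, character.only = TRUE, quietly = TRUE)
    ))
  },
  logical(1)
)

loaded   # named logical vector: which packages were attached
```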
Exploring the data
Before we dive into the analysis, we want to explore our dataset and become familiar with it. We use a simple, easy-to-understand dataset of 120 observations sampled in the Pemba and Zanzibar channels during the wet and dry seasons. Table 1 shows ten sampled observations from the dataset. There are nine variables: two are factors (channel and season) and the other seven are numerical.
channel | season | sst | pH | salinity | do | chl | po4 | nitrate |
---|---|---|---|---|---|---|---|---|
Pemba | Dry | 28.2 | 8.04 | 35.0 | 5.86 | 0.001 | 0.836 | 0.462 |
Zanzibar | Wet | 28.0 | 8.09 | 35.0 | 6.01 | 0.531 | 0.420 | 0.501 |
Pemba | Wet | 27.6 | 7.98 | 34.0 | 6.98 | 0.921 | 0.166 | 0.574 |
Zanzibar | Wet | 27.8 | 8.10 | 35.0 | 6.26 | 0.199 | 0.443 | 0.391 |
Pemba | Dry | 28.6 | 7.94 | 34.8 | 5.20 | 0.001 | 0.767 | 0.511 |
Zanzibar | Wet | 28.0 | 8.04 | 34.0 | 6.26 | 1.126 | 0.397 | 0.623 |
Pemba | Dry | 28.5 | 8.04 | 34.0 | 6.20 | 0.001 | 0.767 | 0.694 |
Pemba | Dry | 28.0 | 8.03 | 35.0 | 5.30 | 0.001 | 0.698 | 0.464 |
Zanzibar | Wet | 28.0 | 8.05 | 34.8 | 6.00 | 0.026 | 0.065 | 0.550 |
Pemba | Wet | 28.9 | 8.06 | 34.5 | 5.34 | 0.062 | 0.305 | 0.317 |
Figure 1 is a pairplot that compares each pair of variables, with scatterplots in the lower triangle, densities on the diagonal, and correlation coefficients in the upper triangle. I chose the Spearman rank correlation to evaluate how the environmental variables correlate with chlorophyll concentration in the dry and wet seasons. We notice that the physical and chemical variables are associated with chlorophyll-a either positively or negatively, depending on the season.
Figure 1: A pairplot showing the association of numerical variables sampled in the dry and wet seasons
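The code behind Figure 1 is not shown, so here is a hedged base-R sketch of the same idea. It uses only the ten rows printed in Table 1 and pools both seasons, so it will not reproduce the figure, which was drawn from all 120 observations. The object names are illustrative.

```r
# The ten rows of Table 1 (the full 120-row data set is not reproduced here)
obs <- read.table(header = TRUE, text = "
channel  season sst  pH   salinity do   chl   po4   nitrate
Pemba    Dry    28.2 8.04 35.0     5.86 0.001 0.836 0.462
Zanzibar Wet    28.0 8.09 35.0     6.01 0.531 0.420 0.501
Pemba    Wet    27.6 7.98 34.0     6.98 0.921 0.166 0.574
Zanzibar Wet    27.8 8.10 35.0     6.26 0.199 0.443 0.391
Pemba    Dry    28.6 7.94 34.8     5.20 0.001 0.767 0.511
Zanzibar Wet    28.0 8.04 34.0     6.26 1.126 0.397 0.623
Pemba    Dry    28.5 8.04 34.0     6.20 0.001 0.767 0.694
Pemba    Dry    28.0 8.03 35.0     5.30 0.001 0.698 0.464
Zanzibar Wet    28.0 8.05 34.8     6.00 0.026 0.065 0.550
Pemba    Wet    28.9 8.06 34.5     5.34 0.062 0.305 0.317
")

num_vars <- obs[, sapply(obs, is.numeric)]   # drop channel and season
rho <- cor(num_vars, method = "spearman")    # Spearman rank correlations
round(rho["chl", ], 2)                       # each variable against chl

pairs(num_vars)                              # base-R scatterplot matrix
```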
Compute the Principal Components
PCA works on numerical data, so the categorical variables channel and season cannot enter the analysis. We also want to analyse one season at a time, so we first keep only the observations for a particular season and then drop the categorical variables. I will start with the dry season. We use the filter function from the dplyr package (Wickham et al. 2018) to drop all observations collected during the rainy season.
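A minimal sketch of this step, assuming dplyr is installed. It uses only four of the rows printed in Table 1 (rows 1, 2, 5 and 6), and the name dry.season follows the post. Note that the rows have to be filtered on season before that column is dropped.

```r
library(dplyr)

# Rows 1, 2, 5 and 6 of Table 1, enough to show the two steps
obs <- data.frame(
  channel  = c("Pemba", "Zanzibar", "Pemba", "Zanzibar"),
  season   = c("Dry", "Wet", "Dry", "Wet"),
  sst      = c(28.2, 28.0, 28.6, 28.0),
  pH       = c(8.04, 8.09, 7.94, 8.04),
  salinity = c(35.0, 35.0, 34.8, 34.0),
  do       = c(5.86, 6.01, 5.20, 6.26),
  chl      = c(0.001, 0.531, 0.001, 1.126),
  po4      = c(0.836, 0.420, 0.767, 0.397),
  nitrate  = c(0.462, 0.501, 0.511, 0.623)
)

dry.season <- obs %>%
  filter(season == "Dry") %>%     # keep dry-season observations only
  select(-channel, -season)       # then drop the categorical columns

dry.season
```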
Our dataset is now reduced to seven numerical variables and 60 observations collected during the dry season in the Pemba and Zanzibar channels. To compute the PCA, we pass the dry.season data frame together with scale = TRUE to the prcomp() function, which performs a principal components analysis, and assign the output to dry.pca.
We can then summarize our PCA object with summary().
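The dry-season data are not available here, so the sketch below runs the same two calls on an assumed stand-in: the four numeric columns of the built-in iris dataset. The object names are illustrative. The formal argument of prcomp() is spelled scale. (with a trailing dot); scale = TRUE also works through partial matching.

```r
# Stand-in data: the four numeric columns of the built-in iris data set.
# The post's dry.season data frame is not available here.
dry.standin <- iris[, 1:4]

# scale. = TRUE standardizes each variable to unit variance before the PCA
standin.pca <- prcomp(dry.standin, center = TRUE, scale. = TRUE)

# Standard deviation, proportion of variance and cumulative proportion per PC
summary(standin.pca)
```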
We get seven principal components, called PC1 to PC7. Each of these explains a percentage of the total variation in the dataset. PC1 explains 32% of the total variance, which means that nearly one-third of the variance in the dataset can be captured by that one principal component alone. PC2 explains 25% of the variance. So, by knowing the position of a sample in relation to just PC1 and PC2, you get a useful first view of where it stands in relation to other samples, since PC1 and PC2 together explain 57% of the variance.
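The percentages reported by summary() can be recomputed by hand: each component's variance is its squared standard deviation, and its share is that variance divided by the total. In the post, 32% + 25% gives the cumulative 57% for PC1 and PC2. The sketch below shows the arithmetic on an assumed stand-in dataset (the built-in iris), so the numbers differ from those quoted above.

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)    # iris is an assumed stand-in data set

eigenvalues <- pca$sdev^2                    # variance carried by each component
prop_var <- eigenvalues / sum(eigenvalues)   # proportion of total variance
cum_var <- cumsum(prop_var)                  # cumulative proportion

round(rbind(proportion = prop_var, cumulative = cum_var), 3)
```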
Plotting PCA
Kassambara and Mundt (2020) developed the factoextra package, which provides tools to extract and visualize the output of exploratory multivariate data analyses, including PCA (R Core Team 2018). In this post, however, we will make a biplot using the ggbiplot package (Vu 2011). A biplot lets you visualize how the samples relate to one another in the PCA (which samples are similar and which are different) and simultaneously reveals how each variable contributes to each principal component.
The ggbiplot package is easy to use and offers a user-friendly function for drawing attractive biplots (Vu 2011). If the ggbiplot package is not yet installed on your machine, you can install it from GitHub.
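The installation code did not survive in the text, so this is a hedged sketch. The repository path vqv/ggbiplot comes from the reference list (Vu 2011). Using the remotes package is an assumption; devtools::install_github() takes the same repository path.

```r
# Install ggbiplot from GitHub only if it is missing.
# Using remotes for the installation is an assumption of this sketch.
if (!requireNamespace("ggbiplot", quietly = TRUE)) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("vqv/ggbiplot")
}
```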
Figure 2 is a biplot generated with the ggbiplot() function. The variable axes are drawn as arrows originating from the centre point. Here you can see that the variables \(PO_4^{3-}\), \(O_2\), Chl-a, and \(NO_3^{-}\) all contribute to PC1, with higher values of those variables moving the samples to the right on this plot. This lets you see how the data points relate to the axes, but it is not very informative without knowing which group each point belongs to.
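A minimal sketch of the basic call, assuming ggbiplot is installed. The post's dry.pca object is not available, so the built-in iris data stand in for it.

```r
library(ggbiplot)   # assumes ggbiplot and its ggplot2 dependency are installed

# iris is an assumed stand-in for the post's dry-season data
pca <- prcomp(iris[, 1:4], scale. = TRUE)

p <- ggbiplot(pca)  # PC1 against PC2, with the variables drawn as arrows
print(p)
```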
Since we know the channel in which each sample was collected, we can group the points by channel (Pemba or Zanzibar). We can further customize the biplot by passing the argument ellipse = TRUE, which draws an ellipse around each group. The result is Figure 3.
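As a sketch of the grouped version, again on the assumed iris stand-in: the groups argument takes one label per observation, in the same order as the rows passed to prcomp(). Here iris$Species plays the role of the channel variable.

```r
library(ggbiplot)   # assumes ggbiplot is installed

pca <- prcomp(iris[, 1:4], scale. = TRUE)   # iris is an assumed stand-in

# One group label per observation; ellipse = TRUE draws an ellipse per group
p <- ggbiplot(pca, groups = iris$Species, ellipse = TRUE)
print(p)
```

With the post's data, the grouping vector would be the channel column of the dry-season rows, saved before that column is dropped.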
Figure 3: Customized biplot
The customized Figure 3 reveals a distinct separation of the data for the two channels. Looking along PC1, the points and ellipse to the left belong purely to the Pemba channel, whereas those to the right belong to the Zanzibar channel. Looking at the variable axes, we also notice that the Pemba channel data are characterized by low values of SST, phosphate, and dissolved oxygen on PC1 and high values of SST on PC2. The Zanzibar channel, by contrast, is characterized by positive values of pH, nitrate, and chl on PC1. Salinity and chl fall somewhere in the middle.
Of course, we have many principal components available, each of which maps differently onto the original variables. We can ask ggbiplot to plot these other components by passing the choices argument. Figure 4 was generated using PC5 and PC6.
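A sketch of the choices argument on the assumed iris stand-in. That dataset has only four numeric variables and therefore only four components, so the example plots PC3 against PC4 rather than PC5 against PC6.

```r
library(ggbiplot)   # assumes ggbiplot is installed

pca <- prcomp(iris[, 1:4], scale. = TRUE)   # stand-in data with four PCs only

# choices selects which two components go on the x and y axes
p <- ggbiplot(pca, choices = c(3, 4), groups = iris$Species, ellipse = TRUE)
print(p)
```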
We don’t see much in Figure 4 because PC5 and PC6 explain only small percentages of the total variation, so it would be surprising if they separated the groups or revealed clear patterns.
Customize ggbiplot
Because ggbiplot is built on ggplot, you can alter the biplot with the same graphical tools you would use for any ggplot. For instance, in Figure 5 we simply added reference lines with geom_vline() and geom_hline(). We also changed the default theme to theme_pubclean() from ggpubr (Kassambara 2020), stripped the legend title, and moved the legend to the top of the plot with theme().
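A sketch of these customizations, assuming ggbiplot and ggpubr are installed and using the iris stand-in again. Placing the reference lines at zero and drawing them dashed are assumptions, since the post does not say where the lines were drawn.

```r
library(ggplot2)
library(ggbiplot)   # assumes ggbiplot is installed
library(ggpubr)     # provides theme_pubclean()

pca <- prcomp(iris[, 1:4], scale. = TRUE)   # iris is an assumed stand-in

p <- ggbiplot(pca, groups = iris$Species, ellipse = TRUE) +
  geom_vline(xintercept = 0, linetype = "dashed") +   # vertical reference line (position assumed)
  geom_hline(yintercept = 0, linetype = "dashed") +   # horizontal reference line (position assumed)
  theme_pubclean() +
  theme(legend.title = element_blank(),               # strip the legend title
        legend.position = "top")                      # move the legend above the plot
print(p)
```

The theme() call comes after theme_pubclean() so that its settings are not overwritten by the complete theme.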
References
Kassambara, Alboukadel. 2020. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. https://CRAN.R-project.org/package=ggpubr.
Kassambara, Alboukadel, and Fabian Mundt. 2020. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. https://CRAN.R-project.org/package=factoextra.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Vu, Vincent Q. 2011. Ggbiplot: A Ggplot2 Based Biplot. http://github.com/vqv/ggbiplot.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.