R Biplot PCA

biplot.princomp {stats} R Documentation

Biplot for Principal Components

Description

Produces a biplot (in the strict sense) from the output of princomp or prcomp.

Usage
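
The argument lists of both methods can be read from a running R session. This is a minimal sketch, assuming a standard R installation; the object names m.princomp and m.prcomp are only illustrative.

```r
# Look up the registered S3 methods for the biplot generic
m.princomp <- getS3method("biplot", "princomp")
m.prcomp   <- getS3method("biplot", "prcomp")

# Print the argument lists together with their defaults
args(m.princomp)
args(m.prcomp)
```

Both methods take the same arguments, which are described in the next section.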

Arguments

x

an object of class 'princomp'.

choices

length 2 vector specifying the components to plot. Only the default is a biplot in the strict sense.

scale

The variables are scaled by lambda ^ scale and the observations are scaled by lambda ^ (1-scale) where lambda are the singular values as computed by princomp. Normally 0 <= scale <= 1, and a warning will be issued if the specified scale is outside this range.

pc.biplot

If true, use what Gabriel (1971) refers to as a 'principal component biplot', with lambda = 1 and observations scaled up by sqrt(n) and variables scaled down by sqrt(n). Then inner products between variables approximate covariances and distances between observations approximate Mahalanobis distance.

...

optional arguments to be passed to biplot.default.

Details

This is a method for the generic function biplot. There is considerable confusion over the precise definitions: those of the original paper, Gabriel (1971), are followed here. Gabriel and Odoroff (1990) use the same definitions, but their plots actually correspond to pc.biplot = TRUE.
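
To see what the arguments change in practice, the following sketch draws four variants from the same fit. It assumes the built-in USArrests data and a correlation-based princomp fit; both are choices made for illustration only.

```r
# Fit a correlation-based PCA on a built-in dataset
pc <- princomp(USArrests, cor = TRUE)

# Default (scale = 1): variables scaled by lambda, observations by lambda^0
biplot(pc)

# scale = 0: variables scaled by lambda^0, observations by lambda
biplot(pc, scale = 0)

# Gabriel's 'principal component biplot'
biplot(pc, pc.biplot = TRUE)

# Components 1 and 3 instead of the default 1 and 2
biplot(pc, choices = c(1, 3))
```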

Side Effects

a plot is produced on the current graphics device.

References

Gabriel, K. R. (1971). The biplot graphical display of matrices with applications to principal component analysis. Biometrika, 58, 453–467.

Gabriel, K. R. and Odoroff, C. L. (1990). Biplots in biomedical research. Statistics in Medicine, 9, 469–485.

See Also

biplot, princomp.

Examples
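
A minimal example, assuming the built-in USArrests data. The second call is an added illustration that the same generic also accepts prcomp output.

```r
require(graphics)

# Biplot from princomp output (covariance-based PCA)
biplot(princomp(USArrests))

# The same generic works on prcomp output; here the variables are standardized
biplot(prcomp(USArrests, scale. = TRUE))
```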

Wednesday, May 13, 2020

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is widely used to explore data. The technique lets you visualize and understand how the variables in a dataset vary, which makes it particularly helpful when the dataset contains many variables. It is a method of unsupervised learning that helps you understand the variability in the dataset and how the different variables are related.

The components in PCA describe the underlying structure in the data. They indicate the directions of greatest variance, that is, the directions along which the data are most spread out. PCA therefore finds the straight line that best spreads the data out when they are projected onto it. This line is the first principal component: the direction that captures the largest share of the variance in the data.
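
The claim that the first principal component is the direction of greatest variance can be checked numerically. The sketch below uses a simulated pair of correlated variables (not data from this post) and compares the variance along PC1 with the variance along 181 other directions; all object names are illustrative.

```r
set.seed(1)

# Simulated, correlated two-variable data (illustration only)
x  <- rnorm(200)
xy <- cbind(x = x, y = 0.8 * x + rnorm(200, sd = 0.4))

# PCA on centred, unscaled data
pca      <- prcomp(xy)
centred  <- scale(xy, center = TRUE, scale = FALSE)
pc1      <- pca$rotation[, 1]

# Variance of the data projected onto PC1
var.pc1 <- var(drop(centred %*% pc1))

# Variance of the data projected onto unit vectors at angles 0..180 degrees
angles  <- seq(0, pi, length.out = 181)
var.dir <- sapply(angles, function(a) var(drop(centred %*% c(cos(a), sin(a)))))

# No direction beats PC1, and its variance equals the squared sdev from prcomp
max(var.dir) <= var.pc1 + 1e-12
all.equal(var.pc1, pca$sdev[1]^2)
```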

PCA is a linear transformation of a dataset that holds values for a certain number of variables (coordinates) measured on a certain number of samples. The transformation fits the dataset to a new coordinate system in which the greatest variance lies along the first coordinate, and each subsequent coordinate is orthogonal to all the previous ones and has a smaller variance. In this way, you transform a set of x correlated variables over y samples into a set of p uncorrelated principal components over the same samples.

We need to load the packages used in this post into the R session. I prefer loading packages with the require() function, but you can also use library().
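
A sketch of the loading step. The package list is an assumption based on the packages named in this post, and the loop only reports a missing package instead of stopping.

```r
# Packages named in this post (assumed list)
pkgs <- c("dplyr", "ggplot2", "ggpubr", "ggbiplot")

for (pkg in pkgs) {
  # require() returns FALSE rather than failing when a package is missing
  ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
  if (!ok) message("Package not installed: ", pkg)
}
```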

Exploring the data

Before we dive into the analysis, we want to explore the dataset and become familiar with it. We use a simple, easy-to-understand dataset of 120 observations sampled in the Pemba and Zanzibar channels during the wet and dry seasons. Table 1 shows ten sampled observations from the dataset. There are nine variables: two are factors (channel and season) and the other seven are numerical.

Table 1: A sample of the dataset
channel   season  sst   pH    salinity  do    chl    po4    nitrate
Pemba     Dry     28.2  8.04  35.0      5.86  0.001  0.836  0.462
Zanzibar  Wet     28.0  8.09  35.0      6.01  0.531  0.420  0.501
Pemba     Wet     27.6  7.98  34.0      6.98  0.921  0.166  0.574
Zanzibar  Wet     27.8  8.10  35.0      6.26  0.199  0.443  0.391
Pemba     Dry     28.6  7.94  34.8      5.20  0.001  0.767  0.511
Zanzibar  Wet     28.0  8.04  34.0      6.26  1.126  0.397  0.623
Pemba     Dry     28.5  8.04  34.0      6.20  0.001  0.767  0.694
Pemba     Dry     28.0  8.03  35.0      5.30  0.001  0.698  0.464
Zanzibar  Wet     28.0  8.05  34.8      6.00  0.026  0.065  0.550
Pemba     Wet     28.9  8.06  34.5      5.34  0.062  0.305  0.317
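
The full dataset is not included in this post, but the ten rows of Table 1 are enough to show the structure. The sketch below types them into a data frame; the name sample.tbl is illustrative.

```r
# The ten rows of Table 1, entered by hand
sample.tbl <- read.table(header = TRUE, stringsAsFactors = TRUE, text = "
channel   season  sst   pH    salinity  do    chl    po4    nitrate
Pemba     Dry     28.2  8.04  35.0      5.86  0.001  0.836  0.462
Zanzibar  Wet     28.0  8.09  35.0      6.01  0.531  0.420  0.501
Pemba     Wet     27.6  7.98  34.0      6.98  0.921  0.166  0.574
Zanzibar  Wet     27.8  8.10  35.0      6.26  0.199  0.443  0.391
Pemba     Dry     28.6  7.94  34.8      5.20  0.001  0.767  0.511
Zanzibar  Wet     28.0  8.04  34.0      6.26  1.126  0.397  0.623
Pemba     Dry     28.5  8.04  34.0      6.20  0.001  0.767  0.694
Pemba     Dry     28.0  8.03  35.0      5.30  0.001  0.698  0.464
Zanzibar  Wet     28.0  8.05  34.8      6.00  0.026  0.065  0.550
Pemba     Wet     28.9  8.06  34.5      5.34  0.062  0.305  0.317
")

# Two factors and seven numeric columns
str(sample.tbl)

# Number of sampled rows per channel and season
table(sample.tbl$channel, sample.tbl$season)
```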

Figure 1 is a pairplot that compares each pair of variables, with scatterplots in the lower triangle, densities on the diagonal and correlations in the upper triangle. I picked the Spearman rank correlation to evaluate how the environmental variables correlate with chlorophyll concentration in the dry and wet seasons. The physical and chemical variables are associated with chlorophyll-a either positively or negatively, depending on the season.

Figure 1: A pairplot showing the association of numerical variables sampled in the dry and wet seasons
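
A rough base R sketch of this exploration step. Because the real dataset is not included here, the code builds a random stand-in with the same column names (env and vars are hypothetical names, and the values are noise, not measurements). Base pairs() is used in place of whatever pairplot function produced Figure 1.

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)

# Spearman rank correlation of each variable with chlorophyll, by season
others  <- setdiff(vars, "chl")
chl.cor <- sapply(split(env, env$season), function(d) {
  cor(d[, others], d$chl, method = "spearman")[, 1]
})
round(chl.cor, 2)

# Scatterplot matrix, points coloured by season
pairs(env[, vars], pch = 19, cex = 0.5,
      col = ifelse(env$season == "Dry", "orange", "steelblue"))
```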

Compute the Principal Components

PCA works on numerical data, so we need to drop the channel and season variables, which are categorical. Because we analyse one season at a time, we first filter the observations for a particular season and then remove the categorical variables. I will start with the dry season. We use the filter() function from the dplyr package (Wickham et al. 2018) to drop all observations collected during the wet season.
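
The filtering step might look like the sketch below. It runs on a random stand-in with the same column names (env and vars are hypothetical, and the values are noise, not measurements) and falls back to base R subsetting when dplyr is not installed.

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Keep dry-season rows first, then drop the two categorical columns
  dry        <- dplyr::filter(env, season == "Dry")
  dry.season <- dplyr::select(dry, -channel, -season)
} else {
  # Base R equivalent
  dry.season <- env[env$season == "Dry", vars]
}

dim(dry.season)
```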

Our dataset is reduced to seven numerical variables and 60 observations collected during the dry season in the Pemba and Zanzibar channels. To compute the PCA, we pass dry.season as the first argument (x) of the prcomp() function and set scale. = TRUE so that each variable is standardized to unit variance, then assign the output to dry.pca.

We can then summarize our PCA object with summary().
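
A sketch of the two calls, run on a random stand-in with the same column names (env and vars are hypothetical, and the values are noise). The percentages it prints will therefore differ from those reported for the real data.

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)
dry.season <- env[env$season == "Dry", vars]

# Centre and standardize each variable, then compute the components
dry.pca <- prcomp(dry.season, center = TRUE, scale. = TRUE)

# Standard deviation, proportion and cumulative proportion of variance per PC
summary(dry.pca)
```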

We get seven principal components, called PC1 to PC7. Each of these explains a percentage of the total variation in the dataset. PC1 explains 32% of the total variance, which means that nearly one-third of the information in the dataset can be captured by that one principal component. PC2 explains 25% of the variance. So, by knowing the position of a sample in relation to just PC1 and PC2, you get a reasonable view of where it stands in relation to other samples, as PC1 and PC2 together explain 57% of the variance.
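
The percentages reported by summary() come from the squared standard deviations of the components. The sketch below recomputes them on a random stand-in (env, vars and the values are hypothetical), so the numbers will not match the 32% and 25% quoted above.

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)
dry.pca <- prcomp(env[env$season == "Dry", vars], scale. = TRUE)

# Share of the total variance carried by each component
var.explained <- dry.pca$sdev^2 / sum(dry.pca$sdev^2)
round(100 * var.explained, 1)

# Running total: the second entry is the share explained by PC1 and PC2 together
round(100 * cumsum(var.explained), 1)
```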

Plotting PCA

Kassambara and Mundt (2020) developed the factoextra package, which provides tools to extract and visualize the output of exploratory multivariate data analyses, including PCA (R Core Team 2018). However, in this post we will make a biplot using the ggbiplot package (Vu 2011). A biplot lets you visualize how the samples relate to one another in the PCA (which samples are similar and which are different) and simultaneously reveals how each variable contributes to each principal component.

The ggbiplot package is easy to use and offers a user-friendly and attractive function for plotting biplots (Vu 2011). If the ggbiplot package is not yet installed on your machine, you can install it from GitHub.
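
One way to do the installation is sketched below. It assumes the remotes package as the GitHub installer and the vqv/ggbiplot repository from the reference list; install_ggbiplot is a hypothetical helper name, and the call is left commented out so that nothing is downloaded unintentionally.

```r
# Hypothetical helper: install ggbiplot from GitHub only when it is missing
install_ggbiplot <- function() {
  if (requireNamespace("ggbiplot", quietly = TRUE)) {
    message("ggbiplot is already installed")
    return(invisible(TRUE))
  }
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("vqv/ggbiplot")
  invisible(requireNamespace("ggbiplot", quietly = TRUE))
}

# Run interactively when needed:
# install_ggbiplot()
```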

Figure 2 is a biplot generated with the ggbiplot() function. The variables are drawn as arrows originating from the center point. Here you see that phosphate (PO4), dissolved oxygen (O2), chlorophyll-a (Chl-a) and nitrate (NO3) all contribute to PC1, with higher values of those variables moving the samples to the right on the plot. This lets you see how the data points relate to the axes, but it is not very informative without knowing which point corresponds to which sample.
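
A basic call might look like the sketch below. It runs on a random stand-in (env and vars are hypothetical, and the values are noise), so the arrows will not resemble Figure 2, and it skips the plot when ggbiplot is not installed.

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)
dry     <- env[env$season == "Dry", ]
dry.pca <- prcomp(dry[, vars], scale. = TRUE)

if (requireNamespace("ggbiplot", quietly = TRUE)) {
  # Points are the samples, arrows are the variables
  print(ggbiplot::ggbiplot(dry.pca))
} else {
  message("ggbiplot is not installed; skipping the plot")
}
```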

Since we know the channel in which each sample was collected, we can group the points into the Pemba and Zanzibar channels. We can further customize the biplot by passing the argument ellipse = TRUE, which draws an ellipse around each group. The result is Figure 3.

Figure 3: Customized biplot
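
A sketch of the grouped version, again on a random stand-in (env and vars are hypothetical, and the values are noise). With noise the two ellipses will overlap rather than separate as in Figure 3.

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)
dry     <- env[env$season == "Dry", ]
dry.pca <- prcomp(dry[, vars], scale. = TRUE)

if (requireNamespace("ggbiplot", quietly = TRUE)) {
  # groups colours the points by channel; ellipse draws one ellipse per group
  print(ggbiplot::ggbiplot(dry.pca, groups = dry$channel, ellipse = TRUE))
} else {
  message("ggbiplot is not installed; skipping the plot")
}
```

The groups vector must have one entry per row used in the PCA, which is why it is taken from dry rather than from env.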

The customized Figure 3 reveals a distinct separation of the data for the two channels. Along PC1, the points and ellipse to the left belong purely to the Pemba channel, whereas those to the right belong to the Zanzibar channel. Looking at the axes, we also notice that the data from the Pemba channel are characterized by low values of SST, phosphate and dissolved oxygen on PC1 and high values of SST on PC2. The Zanzibar channel, in contrast, is characterized by positive values of pH, nitrate and chl on PC1. Salinity and chl are somewhere in the middle.

Of course, we have many principal components available, each of which maps differently to the original variables. We can ask ggbiplot to plot these other components by passing the choices argument. Figure 4 was generated using PC5 and PC6.

We don’t see much in figure 4 because PC5 and PC6 explain very small percentages of the total variation, so it would be surprising if we found that they were very informative and separated the groups or revealed apparent patterns.
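
Selecting other components is a one-argument change, sketched here on a random stand-in (env and vars are hypothetical, and the values are noise).

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)
dry     <- env[env$season == "Dry", ]
dry.pca <- prcomp(dry[, vars], scale. = TRUE)

if (requireNamespace("ggbiplot", quietly = TRUE)) {
  # Plot PC5 against PC6 instead of the default PC1 against PC2
  print(ggbiplot::ggbiplot(dry.pca, choices = c(5, 6),
                           groups = dry$channel, ellipse = TRUE))
} else {
  message("ggbiplot is not installed; skipping the plot")
}
```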

Customize ggbiplot

As ggbiplot is built on ggplot2, you can use the same layers and theme settings to alter the biplot as you would for any ggplot. For instance, in Figure 5 we simply added reference lines with geom_vline() and geom_hline(). We also changed the default theme to theme_pubclean() from ggpubr (Kassambara 2020), stripped off the legend title and moved the legend to the top of the plot with theme().
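
The customization described above could be written as in the sketch below. It runs on a random stand-in (env and vars are hypothetical, and the values are noise); the dashed line type and placing the reference lines at zero are assumptions.

```r
# Random stand-in with the post's column names; NOT the real measurements
set.seed(42)
vars <- c("sst", "pH", "salinity", "do", "chl", "po4", "nitrate")
env <- data.frame(
  channel = rep(c("Pemba", "Zanzibar"), each = 60),
  season  = rep(c("Dry", "Wet"), times = 60),
  matrix(rnorm(120 * 7), ncol = 7, dimnames = list(NULL, vars))
)
dry     <- env[env$season == "Dry", ]
dry.pca <- prcomp(dry[, vars], scale. = TRUE)

needed <- c("ggbiplot", "ggplot2", "ggpubr")
have   <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

if (all(have)) {
  p <- ggbiplot::ggbiplot(dry.pca, groups = dry$channel, ellipse = TRUE) +
    # Reference lines through the origin (assumed position and line type)
    ggplot2::geom_vline(xintercept = 0, linetype = "dashed") +
    ggplot2::geom_hline(yintercept = 0, linetype = "dashed") +
    # Publication-style theme from ggpubr
    ggpubr::theme_pubclean() +
    # No legend title, legend above the plot
    ggplot2::theme(legend.title = ggplot2::element_blank(),
                   legend.position = "top")
  print(p)
} else {
  message("Missing packages: ", paste(needed[!have], collapse = ", "))
}
```

The theme() call comes after theme_pubclean() so that the legend settings are not overwritten by the complete theme.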

References

Kassambara, Alboukadel. 2020. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. https://CRAN.R-project.org/package=ggpubr.

Kassambara, Alboukadel, and Fabian Mundt. 2020. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. https://CRAN.R-project.org/package=factoextra.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Vu, Vincent Q. 2011. Ggbiplot: A Ggplot2 Based Biplot. http://github.com/vqv/ggbiplot.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.