12 Practical: Principal component analysis
In this practical we will practice some of the ideas outlined in the lecture on Principal Component Analysis (PCA). This includes computing principal components, visualisation techniques, and an application to real data.
12.1 Data
For this practical we will use some data that is built into R, as well as two additional files that you will download: Pollen2014.txt and SupplementaryLabels.txt.
12.2 Introduction
We use PCA to explore complex datasets. By performing dimensionality reduction we can better visualise data that has many variables. PCA is probably the most popular such tool across bioscience problems (e.g. for gene expression data).
In many real-world datasets we deal with high-dimensional data, e.g. for a number of individuals we can take a number of health-related measurements (called variables). Having a large number of variables, however, means that it is difficult to plot the data as it is (in its "raw" format), and in turn it might be difficult to tell whether the dataset contains any interesting patterns, trends, or relationships across individuals. Using PCA we can visualise such data in a more "human friendly" fashion.
Recall:
- PCA performs a linear transformation of the data.
- This means that the input data can be represented in a new coordinate system. The largest variance is found along the first coordinate (PC1); each subsequent coordinate is orthogonal to the previous ones and captures the largest remaining variance.
- Each principal component is associated with a certain percentage of the total variation in the dataset.
- If variables are strongly correlated with one another, the first few principal components will let us visualise most of the relationships present in the dataset.
- Eigenvectors describe the new directions, whereas the accompanying eigenvalues tell us how much variance the data has in a given direction (the sketch after this list makes this concrete).
- The eigenvector with the highest eigenvalue is called the first principal component, the second-highest eigenvalue corresponds to the second principal component, and so on.
- There are \(d\) eigenvalues and eigenvectors, where \(d\) equals the dimensionality of the data (the number of variables).
- For the purpose of visualisation we keep only the first \(q\) components, where \(q < d\).
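To make these points concrete, here is a minimal sketch (our own illustration on a random toy matrix, not part of the practical's data) showing that prcomp() agrees with an explicit eigen-decomposition of the covariance matrix:
# toy data: 20 observations of 5 variables, centred and scaled
set.seed(1)
X <- scale(matrix(rnorm(100), nrow = 20, ncol = 5))
# eigenvectors give the new directions, eigenvalues the variance along them
eig <- eigen(cov(X))
eig$values / sum(eig$values)        # proportion of variance per component
# prcomp() returns the same variances: sdev^2 equals the eigenvalues
pca <- prcomp(X)
all.equal(pca$sdev^2, eig$values)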
12.3 Exercise I
There are many datasets built into R. We will look at the mtcars dataset. Type ?mtcars to get a description of the data. Then use the head() function to have a look at the first few rows, and dim() to get the dimensions of the data.
library(ggplot2)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
dim(mtcars)
## [1] 32 11
Here we have \(32\) examples (cars) and \(11\) features.
Now we can perform a principal component analysis; in R it is implemented as the prcomp() function. We can type ?prcomp to see a description of the function and some help on the possible arguments. Here we set the center and scale arguments to TRUE; recall from the lecture why this is important. You can try performing PCA without scaling and centering and compare the results (see the comparison sketch below).
cars_pca <- prcomp(mtcars, center = TRUE, scale = TRUE)
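To see why centering and scaling matter, one quick comparison (a sketch of our own; cars_pca_raw is a name we introduce here) is to run prcomp() on the raw data. Variables measured in large units, such as disp and hp, then dominate the first component:
cars_pca_raw <- prcomp(mtcars, center = FALSE, scale = FALSE)
summary(cars_pca_raw)   # compare with the scaled summary below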
We can use the summary() function to summarise the results of the PCA; it returns the standard deviation of each principal component, the proportion of variance explained by each component, and the cumulative proportion.
pca_summary <- summary(cars_pca)
print(pca_summary)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678 0.35057
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123 0.01117
## Cumulative Proportion 0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751 0.98626
## PC9 PC10 PC11
## Standard deviation 0.2776 0.22811 0.1485
## Proportion of Variance 0.0070 0.00473 0.0020
## Cumulative Proportion 0.9933 0.99800 1.0000
Note that the Proportion of Variance row will always add up to \(1\). Here PC1 explains \(60.08\%\) of the variance and PC2 explains \(24.09\%\), which means that together PC1 and PC2 account for \(84.17\%\) of the variance.
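If you want these percentages programmatically rather than reading them off the printout (our own addition), summary.prcomp() stores them in its importance matrix:
# the "Proportion of Variance" row of the importance matrix
pca_summary$importance["Proportion of Variance", 1:2]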
Using the str() function we can see the full structure of an R object; alternatively, the Value section of ?prcomp describes what the function returns. In this case the cars_pca variable is a list containing several elements; the element x holds the data represented in the new principal component coordinates. We can now plot the data in the first two principal components:
pca_df <- data.frame(cars_pca$x, make = stringr::word(rownames(mtcars), 1))
ggplot(pca_df, aes(x = PC1, y = PC2, col = make)) +
geom_point(size = 3) +
labs(x = "PC1 60.08%",
y = "PC2 24.09 %",
title = "Principal components for mtcars") +
theme(legend.position = "bottom")
Here we coloured the points by the make of each car. We can observe which samples (cars) cluster together. Have a look at the original variables and think about why certain cars or models cluster together.
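One way to explore this (a suggestion of our own) is to inspect the loadings, which prcomp() stores in the rotation element; they show how strongly each original variable contributes to each principal component:
# contribution of the original variables to PC1 and PC2
cars_pca$rotation[, 1:2]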
We created this plot using the ggplot2 package; it is also possible to do this using base R plotting if you prefer.
plot(pca_df$PC1, pca_df$PC2)
12.4 Exercise II
Next we look at another representation of the data, the biplot. This combines the score plot we saw in the previous section (the samples in principal component coordinates) with the variable loadings: in a biplot the original variable axes are added as arrows.
biplot(cars_pca)
We can see the original axes drawn as arrows from the origin. This lets us make observations about the original variables (e.g. cyl and mpg contribute strongly to PC1) and about how the data points relate to these axes.
12.5 Exercise III
Now try to perform PCA on the USArrests data, also built into R. Typing ?USArrests will give you further information about the data. Perform the analysis on the subset USArrests[, -3].
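If you want to check your approach, one possible sketch (simply repeating the steps we used for mtcars) looks like this:
arrests_pca <- prcomp(USArrests[, -3], center = TRUE, scale = TRUE)
summary(arrests_pca)   # proportion of variance per component
biplot(arrests_pca)    # states plotted with the three remaining variables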
12.6 Exercise IV: Single cell data
We can now try to apply what we have learned above to a more realistic dataset. You can download the data either from Canvas or using these links: Pollen2014.txt and SupplementaryLabels.txt. Here we will be dealing with single-cell RNA-Seq (scRNA-Seq) data, consisting of \(300\) single cells measured across \(8686\) genes.
<-read.table("Pollen2014.txt", sep=',', header = T,row.names=1)
pollen_df
<-read.table("SupplementaryLabels.txt", sep=',', header = T)
label_df
1:10, 1:6] pollen_df[
## Cell_2338_1 Cell_2338_10 Cell_2338_11 Cell_2338_12 Cell_2338_13
## MTND2P28 78 559 811 705 384
## MTATP6P1 2053 1958 4922 4409 2610
## NOC2L 1 125 126 0 487
## ISG15 2953 4938 580 523 2609
## CPSF3L 2 42 19 0 37
## MXRA8 0 0 0 0 0
## AURKAIP1 302 132 64 492 11
## CCNL2 0 235 0 84 13
## MRPL20 330 477 288 222 44
## SSU72 604 869 2046 158 530
## Cell_2338_14
## MTND2P28 447
## MTATP6P1 3709
## NOC2L 66
## ISG15 1
## CPSF3L 12
## MXRA8 0
## AURKAIP1 182
## CCNL2 11
## MRPL20 282
## SSU72 272
dim(pollen_df)
## [1] 8686 300
Measurements in scRNA-Seq data are integer counts. Raw counts have statistical properties that suit PCA poorly (their distribution is highly skewed), so we first transform the data; the most commonly used transformation for RNA-Seq count data is \(\log_2\), applied after adding a pseudocount of \(1\) to avoid taking the logarithm of zero. We will also transpose the data matrix so that rows represent cells and columns represent genes. This is the data we will use to perform PCA.
# scRNA-Seq data transformation: log2 with a pseudocount of 1
pollen_mat <- log2(as.matrix(pollen_df) + 1)
# transpose the data so rows are cells and columns are genes
pollen_mat <- t(pollen_mat)
We will now use the information that we read into the label_df variable to rename the cells.
# Check which columns we have available
colnames(label_df)
## [1] "Cell_Identifier" "Population" "Cell_names"
## [4] "TrueLabel_CellLevel" "Tissue_name" "TrueLabel_TissuelLevel"
# rename rows
rownames(pollen_mat) <- label_df$Cell_names
Next we perform PCA on the data and extract the proportion of variance explained by each component.
sc_pca <- prcomp(pollen_mat)

# variance is the square of the standard deviation
pr_var <- sc_pca$sdev^2

# compute the proportion of variance explained by each principal component
prop_var_exp <- pr_var / sum(pr_var)
Think about this calculation and what exactly it means. We can visualise the result:
var_exp <- data.frame(variance = prop_var_exp, pc = 1:length(prop_var_exp))
ggplot(var_exp[1:30, ], aes(x = pc, y = variance)) +
geom_bar(stat = "identity") +
labs(x = "Principal Component",
y = "Variance explained")
We see that the first few principal components explain a substantial amount of the variance, but beyond roughly PC10 each additional component contributes very little. Next we plot the data using the first two principal components, as before.
sc_pca_df <- data.frame(sc_pca$x, cell = rownames(sc_pca$x),
                        var_exp = prop_var_exp)
ggplot(sc_pca_df, aes(x = PC1, y = PC2, col = cell)) +
geom_point(size = 2) +
theme(legend.position = "bottom")
Why is it not useful to create a biplot for this example?
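As a hint (our own addition, not part of the original question), consider how many variable arrows a biplot would have to draw for this dataset:
ncol(pollen_mat)   # one arrow per gene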