do3PCA tutorial

Probabilistic PCA using do3PCA

The do3PCA package is able to do standard (non-phylogenetic) probabilistic PCA, probabilistic phylogenetic PCA (3PCA), and probabilistic PCA of phylogenetic independent contrasts (PIC-PCA). Here we generate a dataset using simulation tools and show how to perform probabilistic PCA analyses, interpret the results, and create plots.

library( do3PCA )

Non-phylogenetic PCA

We will perform regular PCA and probabilistic PCA and compare the results using the iris data. First the regular PCA:

data(iris)
iris_mat <- iris[,1:4]
pca <- prcomp(iris_mat)
# Use the Broken Stick function to select the number of principal components to retain:
BrokenStick(n_dim = 4, var_vec = pca$sdev)

## $broken_stick
## [1] 0.5208333 0.2708333 0.1458333 0.0625000
## 
## $n_pc_to_keep
## [1] 1
# The result is a single dimension.

Now let’s repeat the analysis with probabilistic PCA. The difference is that we need to set the number of dimensions to keep because it is part of the model. So here we will perform model selection using a probabilistic approach as implemented in the modelChoice function.

best <- modelChoice(x = iris_mat)
# The best model:
best$`best # PCs`
## 3 
## 3
# The distribution of log_likelihood:
best$`log(p(D|k)`
##        1        2        3 
## 360.0199 417.7342 439.5163

Note that the likelihood method selected 3 principal components to keep whereas the Broken Stick selected only a single PC. This is not an error! These two approaches are both valid, but they try to do different things. Let’s see what the Broken Stick method return when we use the probabilistic PCA. For that we need to estimate the model with the largest number of PCs, which is the total number of dimensions of the data -1. For that we will use the function ProbPCA that does non-phylogenetic probabilistic PCA:

full_model <- ProbPCA(x = as.matrix(iris_mat), ret_dim = 3)
BrokenStick(n_dim = 3, var_vec = full_model$percent_var_pcs)

## $broken_stick
## [1] 0.6111111 0.2777778 0.1111111
## 
## $n_pc_to_keep
## [1] 1

We find that the Broken Stick approach also recommends a single principal component with the results of the probabilistic PCA.

Now let’s plot the results of the probabilistic PCA using a biPlot:

doBiplot(x = full_model)

The data and labels can be non-optimal, so we can adjust things. We can also change the colors of the plot elements:

doBiplot(x = full_model, xlim = c(-0.3, 0.3), ylim = c(-0.3, 0.3), col = c("lightblue", "darkgreen", "pink"))

3PCA: probabilistic phylogenetic PCA

Here we will use data from the ratematrix package to perform a quick phylogenetic PCA. The steps are very similar to the one we did with regular PCA. Here we will also compare the scores of the do3PCA package with the results using phytools and the non-probabilistic phylogenetic PCA.

library( ratematrix )
data("anoles")
anole_dt <- anoles$data[,1:3]
anole_phy <- mergeSimmap(anoles$phy[[1]], drop.regimes = TRUE)

First let’s do a standard phylogenetic PCA:

library( phytools )
## Loading required package: ape
## Loading required package: maps
phypca <- phyl.pca(tree = anole_phy, Y = anole_dt)
# Check the best number of PCs using the Broken Stick function:
BrokenStick(n_dim = 3, var_vec = diag(phypca$Eval))

## $broken_stick
## [1] 0.6111111 0.2777778 0.1111111
## 
## $n_pc_to_keep
## [1] 1
# Again only a single dimension.
# We can plot the scores of the PCA using the do3PCA plotting function:
doBiplot(x = phypca)

Now let’s repeat this analysis using the 3PCA approach and the phylProbPCA function. We will set the option scale = FALSE because phytools does not scale the data prior to conducting a PCA and do3PCA package does.

pca3 <- phylProbPCA(phy = anole_phy, x = anole_dt, scale = FALSE, ret_dim = 2)
## Error in `phylProbPCA()`:
## ! x needs to be a matrix. Cannot be a data.frame.

Oh! We got an error! That is because the x parameter for the do3PCA package needs to be a matrix type object. Let’s use as.matrix to fix this:

pca3 <- phylProbPCA(phy = anole_phy, x = as.matrix(anole_dt), scale = FALSE, ret_dim = 2)
doBiplot(x = pca3)

The results are nearly identical!

However, note that PCA rotates the data to fit the principal axes. The direction of this rotation (e.g., if to the right or left of PC1) is arbitrary and can change depending on the implementation. Do some research about princomp and prcomp functions and you will learn that they can return inverted PC axes. Both are correct results.

Here we show that the scores are very similar:

head( pca3$scores)
##                [,1]       [,2]
## ahli      0.2854764  0.2091089
## alayoni   0.7666921  0.3264510
## alfaroi   0.8501569 -0.2343907
## aliniger  0.2554546  0.1911126
## allisoni -0.5639831  0.1245916
## allogus   0.2275666  0.1199435
## Plot the 3PCA scores with x axis inverted
plot(x = pca3$scores[,1], y = pca3$scores[,2])
## Add the phytools pPCA as red + symbols
points(x = phypca$S[,1], y = phypca$S[,2], pch = 3, col = "red")

PIC-PCA: Probabilistic PCA of independent contrasts

Doing a PIC-PCA is very similar to a 3PCA. We just need to use the function picProbPCA intead of the phylProbPCA function. The input objects and options are the same.

pic_pca <- picProbPCA(phy = anole_phy, x = as.matrix(anole_dt), scale = FALSE, ret_dim = 2)
doBiplot(x = pic_pca)