library(tidyverse)
library(ggmap) # for ggimage
library(ggfortify) # for autoplot
12 Principal Component Analysis
Goal
Introduce Principal Component Analysis (PCA), one of the most popular techniques to perform “dimensionality reduction” of complex data sets. If we see the data as points in a high-dimensional space, we can project them onto a new set of coordinates such that the first coordinate captures the largest share of the variance in the data, the second coordinate captures the largest share of the remaining variance, and so on. In this way, we can project high-dimensional data sets onto low-dimensional spaces while losing as little information as possible.
12.1 Input
We have collected the values of $m$ variables (the columns) for each of $n$ observations (the rows), and organized them in an $n \times m$ data matrix $X$.
For example, take the iris data set:
data("iris")
ir <- iris %>% dplyr::select(-Species)
sp <- iris %>% dplyr::select(Species)
pairs(ir, col = sp$Species)
We can separate the clusters better by finding the best projection in 2D:
autoplot(prcomp(ir, center = TRUE),
data = iris,
colour = "Species",
scale = FALSE) +
coord_equal()
12.2 Singular Value Decomposition
At the heart of PCA is a particular matrix decomposition (or factorization): we represent the matrix $X$ (with $n$ rows and $m$ columns) as the product of three matrices,

$$X = U \Sigma V^T$$

where $U$ is an $n \times n$ unitary matrix, $\Sigma$ is an $n \times m$ rectangular diagonal matrix with non-negative entries, and $V$ is an $m \times m$ unitary matrix.
(Note: this defines the “full” SVD; the function svd in R returns a “thin” SVD; read the details here).
The values on the diagonal of $\Sigma$ are the singular values of $X$, sorted from largest to smallest.
Through SVD, the matrix $X$ can be written as a sum of rank-one matrices:

$$X = \sum_{i} \sigma_i u_i v_i^T$$

where $\sigma_i$ is the $i$-th singular value, and $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$, respectively.
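As a quick sanity check (a minimal sketch on a small random matrix; the names A and A_svd are just illustrative), we can verify that the thin SVD returned by R reconstructs the original matrix:
set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3) # a small 4x3 matrix
A_svd <- svd(A) # thin SVD: u is 4x3, d has length 3, v is 3x3
max(abs(A - A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v))) # should be ~0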
Let’s look at a concrete example. A monochromatic image can be represented as a matrix where the entries are pixels taking non-negative values (for example, integers between 0 and 255 when using 8 bits per pixel):
stefano <- as.matrix(read.csv("data/stefano.txt"))
# invert y axis and transpose for visualization
stefano <- t(stefano[, ncol(stefano):1])
# rescale values to suppress warning from ggimage
stefano <- stefano / max(stefano)
ggimage(stefano)
Now let’s perform SVD, and show that indeed we have factorized the image:
s_svd <- svd(stefano)
U <- s_svd$u
V <- s_svd$v
Sigma <- diag(s_svd$d)
# this should be equal to the original matrix
stefano_2 <- U %*% Sigma %*% t(V)
# let's plot the difference
ggimage(round(stefano - stefano_2, 10))
Now we can visualize the approximation we’re making when we take only the first few singular values. We’re going to plot the rank-one matrices $\sigma_i u_i v_i^T$ for $i = 1, \ldots, r$, as well as their partial sums, which approximate the original image better and better as $r$ grows.
r <- 7
Xdec <- array(0, c(dim(stefano), r))
Xsum <- array(0, c(dim(stefano), r))
# store the first matrix
Xdec[,,1] <- (U[,1] %*% t(V[,1])) * Sigma[1,1]
# the first term in the sum is the matrix itself
Xsum[,,1] <- Xdec[,,1]
# store the other rank one matrices, along with the partial sum
for (i in 2:r){
  Xdec[,,i] <- (U[,i] %*% t(V[,i])) * Sigma[i,i]
  Xsum[,,i] <- Xsum[,,i - 1] + Xdec[,,i]
}
# now plot all matrices and their sum
plots <- list()
for (i in 1:r){
  plots[[length(plots) + 1]] <- ggimage(Xdec[,,i])
  plots[[length(plots) + 1]] <- ggimage(Xsum[,,i])
}
gridExtra::grid.arrange(grobs = plots, ncol = 2)
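A common heuristic (a small sketch, reusing the s_svd and r objects defined above) is to measure how much of the image the first r singular values capture as the share of the total sum of squared singular values:
# share of the sum of squared singular values captured by the first r terms
cumsum(s_svd$d^2)[1:r] / sum(s_svd$d^2)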
12.3 SVD and PCA
Let’s go back to our data matrix $X$, in which rows are observations and columns are variables. Before performing PCA, we center the data by subtracting the mean of each column, so that every column has mean zero.
For example, let’s take the Petal.Length and Petal.Width columns of iris:
X <- iris %>% dplyr::select(Petal.Length, Petal.Width) %>% as.matrix()
X <- scale(X, center = TRUE, scale = FALSE) # remove mean
colors <- iris$Species
plot(X, col = colors)
You can see that now the points are centered at (0,0).
In practice, we want to produce a new “data matrix” $Y = XW$, where $W$ is an $m \times m$ matrix of “weights” with orthonormal columns. The new columns (i.e., the transformed “measurements”) are therefore linear combinations of the original columns. We choose the first column of $W$, $w_1$, such that the variance of the first column of $Y$ is maximized:

$$\text{Var}(X w_1) = \frac{1}{n-1} w_1^T X^T X w_1 = w_1^T S w_1$$

Where $S = \frac{1}{n-1} X^T X$ is the sample covariance matrix of the (centered) data. Note that $S$ is symmetric and positive semi-definite, and therefore has real, non-negative eigenvalues and orthogonal eigenvectors. The quadratic form $w_1^T S w_1$ is maximized (over unit vectors $w_1$) when $w_1$ is the eigenvector of $S$ associated with its largest eigenvalue $\lambda_1$.
Therefore, the first column of $W$ is the leading eigenvector of $S$, the second column is the eigenvector associated with the second-largest eigenvalue, and so on.
Note that the first axis captures a proportion $\lambda_1 / \sum_i \lambda_i$ of the total variance in the data.
# build sample covariance matrix
S <- (1 / (nrow(X) - 1)) * t(X) %*% X
# compute eigenvalues and eigenvectors
eS <- eigen(S, symmetric = TRUE)
# W is the matrix of eigenvectors
W <- eS$vectors
# check
Y <- X %*% W
plot(Y, col = colors)
eS$values
[1] 3.66123805 0.03604607
apply(Y, 2, var)
[1] 3.66123805 0.03604607
Therefore, PCA amounts to simply taking the eigenvectors of the sample covariance matrix $S$. What does this have to do with SVD? Writing $X = U \Sigma V^T$, we have

$$S = \frac{1}{n-1} X^T X = \frac{1}{n-1} V \Sigma^T U^T U \Sigma V^T = V \left(\frac{\Sigma^T \Sigma}{n-1}\right) V^T$$

where the columns of $V$ are the eigenvectors of $S$ (i.e., $W = V$), and the eigenvalues of $S$ are the squared singular values divided by $n-1$. Therefore, we can perform PCA efficiently by decomposing $X$ via SVD: the transformed data are simply $Y = XW = XV = U \Sigma V^T V = U \Sigma$.
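As a quick check (a minimal sketch reusing the centered matrix X, the scores Y, and the eigenvalues eS$values computed above), we can verify numerically that the SVD route gives the same result, up to the sign of the columns:
sX <- svd(X)
Y_svd <- sX$u %*% diag(sX$d)     # scores via SVD
max(abs(abs(Y_svd) - abs(Y)))    # should be ~0 (columns can differ in sign)
sX$d^2 / (nrow(X) - 1)           # should match eS$values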
12.3.1 PCA in R—from scratch
<- read_csv("data/handwritten_digits.csv") %>%
dt arrange(id, x, y)
Rows: 123392 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): id, label, pixel, value, x, y
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dt)
# A tibble: 6 × 6
id label pixel value x y
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 1 1
2 1 0 16 0 1 2
3 1 0 32 0 1 3
4 1 0 48 0 1 4
5 1 0 64 0 1 5
6 1 0 80 0 1 6
# make into a data matrix with pixels as cols
dt_wide <- pivot_wider(dt %>% dplyr::select(-x, -y),
                       names_from = pixel,
                       values_from = value)
X <- as.matrix(dt_wide %>% dplyr::select(-id, -label))
# make col means = 0
Xs <- scale(X, center = TRUE, scale = FALSE)
# compute SVD
X_svd <- svd(Xs)
# Y = US is the transformed data
Y <- X_svd$u %*% diag(X_svd$d)
PCA_1 <- dt_wide %>%
  dplyr::select(id, label) %>%
  mutate(label = as.character(label)) %>%
  add_column(PC1 = Y[,1], PC2 = Y[,2])
ggplot(PCA_1) +
aes(x = PC1, y = PC2, label = id, group = label, colour = label) +
geom_text()
Pretty good! Let’s see some of the poorly classified points:
# This should be a 0
ggimage(matrix(X[122,], 16, 16, byrow = FALSE), fullpage = FALSE)
# This should be a 1
ggimage(matrix(X[141,], 16, 16, byrow = FALSE), fullpage = FALSE)
# This should be a 5
ggimage(matrix(X[322,], 16, 16, byrow = FALSE), fullpage = FALSE)
You can also scale the variables (dividing each centered column by its standard deviation), turning the sample covariance matrix into the sample correlation matrix; this is useful when the variables are measured on very different scales.
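As a small sketch of what this means in practice (using the ir data frame of numeric iris columns defined earlier), the covariance of the scaled data equals the correlation of the original data, and prcomp can do the scaling for us:
ir_scaled <- scale(as.matrix(ir), center = TRUE, scale = TRUE)
max(abs(cov(ir_scaled) - cor(as.matrix(ir)))) # should be ~0
# equivalently, prcomp(ir, scale. = TRUE) performs PCA on the correlation matrix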
12.3.2 PCA in R — the easy way
library(ggfortify)
# for prcomp, you need only numeric data
X <- dt_wide %>% dplyr::select(-id, -label)
PCA_3 <- prcomp(X)
autoplot(PCA_3,
data = dt_wide %>% mutate(label = as.character(label)),
colour = "label",
frame = TRUE, frame.type = 'norm')
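To see how much variance each principal component captures, we can inspect the fitted object (a quick usage note; summary is the standard method for prcomp objects):
summary(PCA_3)                          # standard deviation and proportion of variance per PC
head(PCA_3$sdev^2 / sum(PCA_3$sdev^2))  # the same proportions, computed by hand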
12.4 Multidimensional scaling
The input is the matrix of dissimilarities $D$, whose entry $D_{ij}$ measures how “far apart” objects $i$ and $j$ are. The dissimilarities are proper distances if they satisfy:
$D_{ij} \geq 0$ (non-negativity)
$D_{ij} = 0$ only if $i = j$ (identity)
$D_{ij} = D_{ji}$ (symmetry)
$D_{ij} \leq D_{ik} + D_{kj}$ (triangle inequality)
Given a set of dissimilarities, we can therefore ask whether they are distances, and particularly whether they represent Euclidean distances.
12.4.1 Goal of MDS
Given the $n \times n$ matrix of dissimilarities $D$, the goal of MDS is to find a set of $n$ points in a $k$-dimensional space whose pairwise Euclidean distances approximate the dissimilarities as closely as possible. As such, if we can find a perfect solution, then the dissimilarities can be mapped into Euclidean distances in a $k$-dimensional space.
12.4.2 Classic MDS
Suppose that the elements of $D$ measure the Euclidean distances between $n$ points in a $k$-dimensional space, whose coordinates are stored in an $n \times k$ matrix $Y$ with columns centered at zero. Then

$$D_{ij}^2 = \|y_i - y_j\|^2 = y_i \cdot y_i + y_j \cdot y_j - 2\, y_i \cdot y_j = b_{ii} + b_{jj} - 2\, b_{ij}$$

where $b_{ij} = y_i \cdot y_j$ are the elements of $B = Y Y^T$. Note that, because of the centering:

$$\sum_i b_{ij} = \left(\sum_i y_i\right) \cdot y_j = 0 \quad \text{for every } j.$$

Now we compute:

$$\sum_i D_{ij}^2 = \sum_i b_{ii} + n\, b_{jj} = T + n\, b_{jj}, \quad \text{where } T = \mathrm{tr}(B).$$

Similarly (distances are symmetric):

$$\sum_j D_{ij}^2 = n\, b_{ii} + T.$$

And, finally:

$$\sum_{ij} D_{ij}^2 = 2\, n\, T.$$

From these three equations, we obtain:

$$b_{ii} = \frac{1}{n} \sum_l D_{il}^2 - \frac{1}{2 n^2} \sum_{lm} D_{lm}^2$$

and

$$b_{jj} = \frac{1}{n} \sum_l D_{lj}^2 - \frac{1}{2 n^2} \sum_{lm} D_{lm}^2.$$

Therefore:

$$b_{ij} = -\frac{1}{2}\left(D_{ij}^2 - b_{ii} - b_{jj}\right) = -\frac{1}{2}\left(D_{ij}^2 - \frac{1}{n}\sum_l D_{il}^2 - \frac{1}{n}\sum_l D_{lj}^2 + \frac{1}{n^2}\sum_{lm} D_{lm}^2\right).$$

With some algebra, one can show that this is equivalent to:

$$B = -\frac{1}{2}\, C\, \Delta\, C$$

Where $\Delta$ is the matrix of squared dissimilarities ($\Delta_{ij} = D_{ij}^2$) and $C = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T$ is the centering matrix. Because $B = Y Y^T$ is symmetric and positive semi-definite, we can take its eigenvalue decomposition $B = E \Lambda E^T$ and recover the coordinates as $Y = E \Lambda^{1/2}$ (where $E$ contains the eigenvectors of $B$, $\Lambda$ the corresponding eigenvalues sorted from largest to smallest, and we keep only the first $k$ columns).
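Here is a minimal from-scratch sketch of this procedure on a small synthetic example (the object names pts, Delta, B, and Y_mds are just illustrative), which we can compare against the distances we started from:
set.seed(2)
pts <- matrix(rnorm(20), ncol = 2)        # 10 points in 2 dimensions
Delta <- as.matrix(dist(pts))^2           # matrix of squared Euclidean distances
n <- nrow(Delta)
C <- diag(n) - matrix(1 / n, n, n)        # centering matrix
B <- -0.5 * C %*% Delta %*% C             # double-centered matrix B = Y Y^T
eB <- eigen(B, symmetric = TRUE)
Y_mds <- eB$vectors[, 1:2] %*% diag(sqrt(eB$values[1:2]))
max(abs(dist(Y_mds) - dist(pts)))         # should be ~0: distances are recovered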
For example, let’s look at the driving distance in km between cities in the US:
# read distances US
<- read_csv("data/dist_US.csv") usa
Rows: 265356 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): from, to
dbl (1): dist
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# make into a matrix of distances
M <- usa %>% pivot_wider(names_from = to, values_from = `dist`) %>%
  dplyr::select(-from) %>%
  as.matrix()
M[is.na(M)] <- 0
rownames(M) <- colnames(M)
# make symmetric
M <- M + t(M)
M[1:2, 1:2]
                                       Abilene, TX, United States
Abilene, TX, United States                                   0.00
Ahwatukee Foothills, AZ, United States                    1487.19
                                       Ahwatukee Foothills, AZ, United States
Abilene, TX, United States                                            1487.19
Ahwatukee Foothills, AZ, United States                                   0.00
And perform classic MDS using two dimensions:
mds_fit <- cmdscale(M, k = 2) # k is the dimension of the embedding
mds_fit <- tibble(id = rownames(M),
                  x = mds_fit[,1], y = mds_fit[,2])
pl <- mds_fit %>%
  ggplot() +
aes(x = x, y = y) +
geom_point() +
xlim(2 * range(mds_fit$x))
show(pl)
# highlight some major cities
hh <- c(122, 175, 177, 373, 408, 445, 572, 596, 691)
mds_highlight <- mds_fit %>% slice(hh)
show(pl + geom_point(data = mds_highlight, aes(colour = rownames(M)[hh])))
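As a final check (a small sketch using the eig argument of cmdscale, which returns the eigenvalues of the doubly centered matrix), we can gauge how well a two-dimensional Euclidean embedding can represent driving distances:
mds_eig <- cmdscale(M, k = 2, eig = TRUE)
head(mds_eig$eig)                              # negative eigenvalues signal non-Euclidean structure
sum(mds_eig$eig[1:2]) / sum(abs(mds_eig$eig))  # rough share of structure captured by two axes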
12.5 Readings
SVD is the most important decomposition, but several interesting variations have been proposed for data science. Read this very cool paper on face recognition using Non-negative Matrix Factorization.
12.5.1 Exercise: PCA sommelier
The file Wine.csv contains several measures made on 178 wines from Piedmont, produced using three different grapes (column Grape, with 1 = Barolo, 2 = Grignolino, 3 = Barbera). Use the 13 measured variables (i.e., all but Grape) to perform a PCA. First, do it “the hard way” using SVD, and then, calling the prcomp function. Can you recover the right classification of grapes?