Title: | Multi-Omic Spectral Clustering using the Flag Manifold |
---|---|
Description: | Multi-omic (or any multi-view) spectral clustering methods often assume the same number of clusters across all datasets. We supply methods for multi-omic spectral clustering when the number of distinct clusters differs among the omics profiles (views). |
Authors: | Charlie Carpenter [aut, cre] |
Maintainer: | Charlie Carpenter <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1.0 |
Built: | 2024-11-03 04:05:58 UTC |
Source: | https://github.com/cran/MMOC |
Generates multiple data sets from a multivariate normal distribution using the mvrnorm function from the MASS
package.
clustStruct(n, p, k, noiseDat = "random", randNoise = 2)
clustStruct(n, p, k, noiseDat = "random", randNoise = 2)
n |
An integer, the sample size for all generated data sets |
p |
An integer, the number of columns (features) in each generated data set |
k |
An integer or vector, the number of distinct clusters in each generated data set. |
noiseDat |
Either the character string |
randNoise |
The value along the diagonal when |
The function accepts k
as a vector. It splits data into k
groups with means c(0, 2^( 1:(kk-1) ) )
, e.g., when k=3
the data will be split into 3 groups with means 0, 2, and 4, respectively. The covariance matrix is either a diagonal matrix with randNoise
(an integer) along the diagonal, or a given matrix.
A list of np data frames with the specified number of groups
## A single view with 30 variables and 3 groups s1 <- clustStruct(n=120, p=30, k=3, noiseDat='random')[[1]] ## Multiple views with 30 variables ## View 1 has 2 groups and View 2 has 3 groups s2 <- clustStruct(n=120, p=30, k=c(2,3), noiseDat='random') ## Multiple views with 30 variables ## View 1 has 2 groups, View 2 has 3, and View 3 has 3 groups s3 <- clustStruct(n=120, p=30, k=c(2,3,3), noiseDat='random') ## Three view study. # View 1: 2 groups, 30 variables, random noise = 5 # View 2: 3 groups, 60 variables, random noise = 2 # View 3: 4 groups, 45 variables, random noise = 4 s4 <- clustStruct(n=120, k=c(2,3,4), p=c(30,60,45), randNoise=c(5,2,4))
## A single view with 30 variables and 3 groups s1 <- clustStruct(n=120, p=30, k=3, noiseDat='random')[[1]] ## Multiple views with 30 variables ## View 1 has 2 groups and View 2 has 3 groups s2 <- clustStruct(n=120, p=30, k=c(2,3), noiseDat='random') ## Multiple views with 30 variables ## View 1 has 2 groups, View 2 has 3, and View 3 has 3 groups s3 <- clustStruct(n=120, p=30, k=c(2,3,3), noiseDat='random') ## Three view study. # View 1: 2 groups, 30 variables, random noise = 5 # View 2: 3 groups, 60 variables, random noise = 2 # View 3: 4 groups, 45 variables, random noise = 4 s4 <- clustStruct(n=120, k=c(2,3,4), p=c(30,60,45), randNoise=c(5,2,4))
Calculate the flag-mean of multiple subspaces. This method allows you to find the extrinsic mean of a finite set of subspaces. You can think of this as a median subspace. This method is also able to handle subspaces with different dimensions. See the references for more details
flagMean(LapList, k, laplacian = c("shift", "Ng", "sym", "rw"), plots = TRUE)
flagMean(LapList, k, laplacian = c("shift", "Ng", "sym", "rw"), plots = TRUE)
LapList |
A list of Laplacian matrices |
k |
A vector indicating how many eigenvectors to take from each Laplacian, i.e., the number of clusters in each view |
laplacian |
One of |
plots |
Whether or not to plot the singular values from SVD |
Despite the complex linear algebra to achieve this result, the opperation is very simple. This function concatonates (cbind) the given subspaces and then performs singular value decomposition on the resulting matrix. This gives the 'median' subspace of the given set of subspaces. We would then cluster on the columns of the U
matrix just as we do in standard spectral clustering
The output from a singular value decomposition. See svd
https://www.semanticscholar.org/paper/Flag-Manifolds-for-the-Characterization-of-in-Large-Marrinan-Beveridge/7d306512b545df98243f87cb8173df83b4672b18 https://www.sciencedirect.com/science/article/pii/S0024379514001669?via%3Dihub
## Generating data with 2 and 3 distinct clusters ## Note that 'clustStruct' returns a list n=120; k <- c(2,3) dd <- clustStruct(n=n, p=30, k=k, noiseDat='random') ## Laplacians L_list <- lapply(dd, kernelLaplacian, kernel="Spectrum", plots=FALSE, verbose=FALSE) ## Calculating the flag mean fm <- flagMean(L_list, k=k, laplacian='shift') ## Knowing the true structure makes it much easier to know how ## many right singular vectors to grab. There are 4 distinct ## groups in these data from 'clustStruct' trueGroups(n=n, k=k) kmeans(fm$u[, 1:4], centers=4)
## Generating data with 2 and 3 distinct clusters ## Note that 'clustStruct' returns a list n=120; k <- c(2,3) dd <- clustStruct(n=n, p=30, k=k, noiseDat='random') ## Laplacians L_list <- lapply(dd, kernelLaplacian, kernel="Spectrum", plots=FALSE, verbose=FALSE) ## Calculating the flag mean fm <- flagMean(L_list, k=k, laplacian='shift') ## Knowing the true structure makes it much easier to know how ## many right singular vectors to grab. There are 4 distinct ## groups in these data from 'clustStruct' trueGroups(n=n, k=k) kmeans(fm$u[, 1:4], centers=4)
Calculate the graph laplacian from a given data set with subjects as rows and features as columns.
kernelLaplacian( dat, kernel = c("Gaussian", "ZM", "Spectrum", "Linear"), laplacian = c("shift", "Ng", "sym", "rw"), grf.type = c("full", "knn", "e-graph"), k = 5, p = 5, rho = NULL, epsilon = 0, mutual = FALSE, binary.grf = FALSE, plots = TRUE, verbose = TRUE )
kernelLaplacian( dat, kernel = c("Gaussian", "ZM", "Spectrum", "Linear"), laplacian = c("shift", "Ng", "sym", "rw"), grf.type = c("full", "knn", "e-graph"), k = 5, p = 5, rho = NULL, epsilon = 0, mutual = FALSE, binary.grf = FALSE, plots = TRUE, verbose = TRUE )
dat |
A matrix like object with subjects as rows and features as columns. |
kernel |
The type of kernel used to calculate the graph's adjacency matrix: |
laplacian |
One of |
grf.type |
Type of graph to calculate: |
k |
An integer value for |
p |
An integer value for the p-nearest neighbor in the |
rho |
A value for the dispersion parameter in the Gaussian kernel. It is in the denominator of the exponent, so higher values correspond to lower similarity. By default it is the median pairwise Gaussian distance |
epsilon |
The cutoff value for the |
mutual |
Make a "mutual" knn graph. Only keeps edges when two nodes are both in each others k-nearest set |
binary.grf |
Set all edges >0 to 1 |
plots |
Whether or not to plot the final graph, a heatmap of calculated kernel, and the eigen values of the Laplacian |
verbose |
Whether or not to give some summary statistics of the pairwise distances |
The four Lapalacians are defined as ,
,
, and
. The shifted Laplacian,
, is recommended for multi-view spectral clustering.
An nn matrix where
n
is the number of rows in dat
.
https://academic.oup.com/bioinformatics/article/36/4/1159/5566508#199177546
## Generating data with 3 distinct clusters ## Note that 'clustStruct' returns a list dd <- clustStruct(n=120, p=30, k=3, noiseDat='random')[[1]] kernelLaplacian(dd, kernel="Spectrum")
## Generating data with 3 distinct clusters ## Note that 'clustStruct' returns a list dd <- clustStruct(n=120, p=30, k=3, noiseDat='random')[[1]] kernelLaplacian(dd, kernel="Spectrum")
Calculate the graph laplacian from a given kernel matrix that represents the full graph weighted adjacency matrix
Laplacian( A, laplacian = c("shift", "Ng", "sym", "rw"), grf.type = c("full", "knn", "e-graph"), k = 5, rho = NULL, epsilon = 0, mutual = FALSE, binary.grf = FALSE, plots = TRUE )
Laplacian( A, laplacian = c("shift", "Ng", "sym", "rw"), grf.type = c("full", "knn", "e-graph"), k = 5, rho = NULL, epsilon = 0, mutual = FALSE, binary.grf = FALSE, plots = TRUE )
A |
An n by n kernel matrix, where n is the sample size, that represents your initial adjacency matrix. Kernel matrices are symmetric, positive semi-definite distance matrices |
laplacian |
One of |
grf.type |
Type of graph to calculate: |
k |
An integer value for |
rho |
A value for the dispersion parameter in the Gaussian kernel. It is in the denominator of the exponent, so higher values correspond to lower similarity. By default it is the median pairwise Gaussian distance |
epsilon |
The cutoff value for the |
mutual |
Make a "mutual" knn graph. Only keeps edges when two nodes are both in each others k-nearest set |
binary.grf |
Set all edges >0 to 1 |
plots |
Whether or not to plot the final graph, a heatmap of calculated kernel, and the eigen values of the Laplacian |
The four Lapalacians are defined as ,
,
, and
. The shifted Laplacian,
, is recommended for multi-view spectral clustering.
An nn matrix where
n
is the number of rows in dat
.
https://academic.oup.com/bioinformatics/article/36/4/1159/5566508#199177546
## Generating data with 3 distinct clusters ## Note that 'clustStruct' returns a list dd <- clustStruct(n=120, p=30, k=3, noiseDat='random')[[1]] ## Gaussian kernel rho <- median(dist(dd)) A <- exp(-(1/rho)*as.matrix(dist(dd, method = "euclidean", upper = TRUE)^2)) Laplacian(A, laplacian='shift', grf.type = 'knn')
## Generating data with 3 distinct clusters ## Note that 'clustStruct' returns a list dd <- clustStruct(n=120, p=30, k=3, noiseDat='random')[[1]] ## Gaussian kernel rho <- median(dist(dd)) A <- exp(-(1/rho)*as.matrix(dist(dd, method = "euclidean", upper = TRUE)^2)) Laplacian(A, laplacian='shift', grf.type = 'knn')
k
approximation of a graph LaplacianThis function calculates the rank-k
approximation of a graph Laplacian (or any symmetric matrix). This function performs eigen decomposition on the given matrix L
and reconstructs it using only the LAST k
eigenvectors and eigenvalues.
Lapprox(LapList, k, laplacian = c("shift", "Ng", "sym", "rw"), plots = TRUE)
Lapprox(LapList, k, laplacian = c("shift", "Ng", "sym", "rw"), plots = TRUE)
LapList |
A list of Laplacian matrices |
k |
A vector indicating how many eigenvectors to take from each Laplacian, i.e., the number of clusters in each view |
laplacian |
One of |
plots |
Whether or not to plot the eigenvalues from the rank approximated Laplacians |
An nn matrix
## Generating data with 2 and 3 distinct clusters ## Note that 'clustStruct' returns a list n=120; k <- c(2,3) set.seed(23) dd <- clustStruct(n=n, p=30, k=k, noiseDat='random') ## Laplacians L_list <- lapply(dd, kernelLaplacian, kernel="Spectrum", laplacian='shift', plots=FALSE, verbose=FALSE) trueGroups(n,k) La <- Lapprox(L_list, k=k, plots=FALSE) kmeans(La$vectors[,1:4], centers=4)
## Generating data with 2 and 3 distinct clusters ## Note that 'clustStruct' returns a list n=120; k <- c(2,3) set.seed(23) dd <- clustStruct(n=n, p=30, k=k, noiseDat='random') ## Laplacians L_list <- lapply(dd, kernelLaplacian, kernel="Spectrum", laplacian='shift', plots=FALSE, verbose=FALSE) trueGroups(n,k) La <- Lapprox(L_list, k=k, plots=FALSE) kmeans(La$vectors[,1:4], centers=4)
Get the unique groups generated by the clustStruct function for a given k
. The number of rows of the resulting matrix gives the number of unique groups.
trueGroups(n, k)
trueGroups(n, k)
n |
An integer, the sample size for all generated data sets |
k |
An integer or vector, the number of distinct clusters in each generated data set. |
A matrix with the unique groups/clusters from the multi-view data generated from clustStruct. The final column Grps
enumerates these groups.
trueGroups(n=120, k=c(2,3,4))
trueGroups(n=120, k=c(2,3,4))