Background on unsupervised cluster analysis:
The heterogeneity of kidney disease, heart failure, and other chronic diseases suggest that multiple biomarkers reflecting different pathways may be needed to represent the spectrum of each condition. A challenge is how to understand the interrelationships of these markers and how to integrate them in risk prediction and patient management. Cluster analysis can be used to identify disease subtypes and to predict treatment outcomes. Our work has used unsupervised cluster analyses (i.e., blinded to the outcome variable) to categorize participants into separate subgroups based on the results of multiple biomarkers to obtain maximum prognostic utility. We hypothesize that in combination, a complementary, parsimonious set of biomarkers can identify separate clusters of risk for a disease.
Challenges in using cluster analysis include choosing an appropriate algorithm, determining the optimal number of clusters, and assessing the quality and stability of the resulting clusters. Sample code below (in R) is included to illustrate a few of these methods. (Special thanks to Sanjiv Shah for introducing me to R code to assess these issues.)
References:
Charrad M., Ghazzali N., Boiteau V., Niknafs A. (2014). "NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set." Journal of Statistical Software, 61(6), 1-36. http://www.jstatsoft.org/v61/i06/
Brock, G., Pihur, V., Datta, S., & Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4), 1 - 22. https://www.jstatsoft.org/article/view/v025i04
Scherzer R, Lin H, Abraham A, Thiessen-Philbrook H, Parikh CR, Bennett M, Cohen MH, Nowicki M, Gustafson DR, Sharma A, Young M, Tien P, Jotwani V, Shlipak MG. Use of urine biomarker-derived clusters to predict the risk of chronic kidney disease and all-cause mortality in HIV-infected women. Nephrol Dial Transplant. 2016 Sep;31(9):1478-85.
Shah SJ, Katz DH, Selvaraj S, Burke MA, Yancy CW, Gheorghiade M, et al. Phenomapping for novel classification of heart failure with preserved ejection fraction. Circulation 2015,131:269-279.
Helpful websites for further reading:
http://www.sthda.com/english/wiki/cluster-analysis-in-r-unsupervised-machine-learning
https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html
Sample code in R:
# Useful packages:
# stats, cluster: k-means, hierarchical, and partitioning methods of clustering
# mclust: model-based clustering based on finite normal mixture modelling
# NbClust: Determine optimal number of clusters
# clValid: Determine stability and quality of clusters
# Install and load each package:
http://www.r-bloggers.com/installing-r-packages/
# Then, import your data to R:
> mydata <- as.matrix(read.csv(file="c:/forcluster.csv", sep=",", header=TRUE))
# view first row of data
> mydata[1,]
# create a data frame (x) to store biomarkers (a-f)
> x<- mydata[,c("a","b","c","d","e","f")]
# use correlation matrix and heat map to visualize biomarkers
> corr<-cor(x, method = c("spearman"))
> heatmap(x = corr)
# determine optimal number of clusters, using k-means, Ward’s method, or another algorithm
> nb1<-NbClust(x, distance = "euclidean", min.nc=2, max.nc=6, method = "kmeans", index = "all")
> nb2<-NbClust(express, distance = "euclidean", min.nc=2, max.nc=6, method = "ward.D2", index = "all")
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 6 proposed 3 as the best number of clusters
* 4 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 4 proposed 6 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
# model based clustering uses Bayesian Information Criterion (BIC) to choose optimal method and number of clusters:
> BIC = mclustBIC(x)
> plot(BIC)
> BIC
Bayesian Information Criterion (BIC):
EII VII EEI VEI EVI VVI EEE EVE VEE VVE EEV VEV EVV VVV
1 -5693.351 -5693.351 -5722.091 -5722.091 -5722.091 -5722.091 -5613.773 -5613.773 -5613.773 -5613.773 -5613.773 -5613.773 -5613.773 -5613.773
...
9 -5690.616 -5677.422 -5720.826 -5698.038 -5861.489 -5805.836 -5763.887 -5830.149 -5748.894 -5906.638 -6106.605 -6197.244 -6299.658 -6344.029
Top 3 models based on the BIC criterion:
VII,2 VII,4 VII,3
-5546.589 -5559.009 -5561.427
> mod1 = Mclust(x)
> mod1
'Mclust' model object:
best model: spherical, varying volume (VII) with 2 components
# principal component plots can be used to visualize cluster separation:
> clusplot(x,mod1$classification)
> saveclust<-cbind(mydata$idno, mod1$classification)
> write.table(saveclust, "c:/mclust.csv", sep=",", row.names=FALSE)
## internal validation (assess quality of 2-6 clusters using different methods)
> intern <- clValid(x, 2:6, clMethods=c("hierarchical","kmeans","pam"), validation="internal")
> summary(intern)
> optimalScores(intern)
> plot(intern)
## stability measures (assess reproducibility of clusters. This is essentially leave-one-out validation. For this method, you may need to reorient your data structure to remove one participant at a time rather than one biomarker at a time.)
> stab <- clValid(x, 2:6, clMethods = c("hierarchical","kmeans","pam"), validation="stability")
>optimalScores(stab)
>plot(stab)