LASSO for Correlated Variable Selection

Background on LASSO:

We often wish to identify which of multiple serum or urine biomarkers are independently associated with kidney disease outcomes. Traditional variable selection methods (for example, stepwise selection) may perform poorly when evaluating multiple, inter-correlated biomarkers. An alternative is the LASSO (Least Absolute Shrinkage and Selection Operator), a penalized regression procedure that uses cross-validation to determine both the number of retained predictors and the degree of shrinkage, thereby avoiding over-fitting. LASSO is well suited to so-called high-dimensional data, where the number of predictors may be large relative to the sample size and the predictors may be correlated.
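To make this concrete, here is a minimal, self-contained sketch using simulated data (the sample size, correlation structure, and effect sizes below are illustrative assumptions, not part of the worked example later in this section):

# Illustrative only: 100 observations of 10 predictors that share a common factor
> set.seed(123)
> n <- 100; p <- 10
> z <- rnorm(n)
> x <- sapply(1:p, function(j) 0.7*z + rnorm(n))
> colnames(x) <- paste0("predictor", 1:p)
# only predictors 1 and 2 truly affect the outcome
> y <- 0.5*x[,1] - 0.5*x[,2] + rnorm(n)
> library(glmnet)
> cvfit <- cv.glmnet(x, y)
# most of the other coefficients are shrunk to exactly zero
> coef(cvfit, s="lambda.min")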

 

Reference:

Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58(1):267-288.

Helpful websites for further reading:

https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet_beta.html

http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

 

Sample code in R:

# Install and then load the glmnet package
# (see http://www.r-bloggers.com/installing-r-packages/ for general guidance on installing R packages):

> install.packages("glmnet")

> library(glmnet)

# Then, import your data to R:

> mydata <- as.matrix(read.csv(file="c:/forlasso.csv", sep=",", header=TRUE))
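Because as.matrix() silently coerces the whole matrix to character if any column of the CSV is non-numeric, a quick sanity check along these lines may be worthwhile:

# confirm dimensions and that the imported matrix is numeric
> dim(mydata)
> stopifnot(is.numeric(mydata))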

# view first row of data

> mydata[1,]

# outcome y is in column 1

> y<-mydata[,1]

# covariates start in column 2

> x<-mydata[,-1]
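Since the motivation for using LASSO here is inter-correlated biomarkers, it can be informative to inspect the pairwise correlations among the candidate predictors first:

# pairwise correlations among candidate predictors, rounded for readability
> round(cor(x), 2)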

# Optional penalty factor: 0 forces a variable into the model (no shrinkage); 1 applies the default shrinkage

# In this example we are forcing only predictor 1

> penalty<-c(0,1,1,1,1,1,1,1,1,1,1,1,1)
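Rather than typing the vector out by hand, the same penalty factor can be built programmatically (this assumes, as in the example, that the forced variable is the first column of x):

# 0 for the first (forced) predictor, 1 for all remaining predictors
> penalty <- c(0, rep(1, ncol(x) - 1))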

# Run LASSO

# (glmnet can fit linear, logistic, multinomial, Poisson, and Cox regression models)

> fit1 <- glmnet(x, y, family="gaussian", penalty.factor=penalty)
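Optionally, glmnet's coefficient-path plot shows how each coefficient shrinks toward zero as the penalty increases:

# plot coefficient paths against log(lambda), labeling each curve
> plot(fit1, xvar="lambda", label=TRUE)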

# cv.glmnet performs k-fold cross-validation for glmnet, produces a plot, and returns values of lambda.
# Use the same family and penalty factor here as in the glmnet fit:

> cvob1 <- cv.glmnet(x, y, family="gaussian", penalty.factor=penalty)

> plot(cvob1)

# The output from cvob1 will give us the lambda value associated with the smallest cross-validated error:

> cvob1$lambda.min

[1] 0.0344316
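cv.glmnet also reports lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum; substituting it for lambda.min below would give a sparser, more conservative model:

# a sparser alternative to lambda.min
> cvob1$lambda.1se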

# In this example, the minimum lambda value is 0.0344316. We will apply that lambda value to obtain the coefficients of the sparse solution to the model:

> coef(fit1,s=cvob1$lambda.min)

# same as this:

> coef(fit1,s=0.0344316)

14 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) -0.010844903
predictor1   0.046949347
predictor2   .
predictor3   .
predictor4   .
predictor5   0.022958117
predictor6   .
predictor7   .
predictor8   .
predictor9   0.007559065
predictor10  0.060811847
predictor11  .
predictor12 -0.031175263
predictor13 -0.034677144

By default, glmnet standardizes the predictors internally but returns the coefficients on their original scale. The output shows that predictors 12 and 13 are associated with lower values of the outcome, while predictors 1, 5, 9, and 10 are associated with higher values; the predictors shown as "." have been shrunk to exactly zero and are dropped from the model.
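To extract the selected predictor names programmatically rather than reading them off the printout, one approach is:

# names of predictors with nonzero coefficients, excluding the intercept
> b <- as.matrix(coef(fit1, s=cvob1$lambda.min))
> setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")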

These selected predictors can then be entered into an ordinary regression model to obtain adjusted estimates, confidence intervals, and p-values. As a next step, fit your adjusted linear regression model, retaining only the predictors selected by LASSO (plus any other important background covariates, as appropriate).
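As a sketch of that next step (the predictor names below are taken from the printout above; substitute your own selected variables):

# refit an ordinary linear model using only the LASSO-selected predictors
> keep <- c("predictor1", "predictor5", "predictor9", "predictor10", "predictor12", "predictor13")
> fit.ols <- lm(y ~ ., data=data.frame(y=y, x[, keep]))
# adjusted estimates and p-values
> summary(fit.ols)
# 95% confidence intervals
> confint(fit.ols)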