Introduction to ptLasso: drilling deeper

Background
Multinomial response data
Multi-response data
Multi-response data with time ordered columns
Wrap-up

Background

In Part 1, we learned about pretraining for the lasso, as it applies to “input grouped” data – data where each observation belongs to one of \(k\) distinct groups (e.g. cancer classes). With simulated data, we saw how to fit pretrained models in this context using the R package ptLasso.

Now, we will show how to do pretraining for three different settings:

data with a multinomial response
multi-response data, and
time series data.

For all of these examples, we’ll use the ptLasso package - so let’s load it now.

require(ptLasso)

In all cases, our modeling pipeline will look like this:

# Data: features X, response y,
# and "useCase" (targetGroups, multiresponse or timeSeries)

# Fit a model:
fit = ptLasso(X, y, use.case = "useCase")
# ...optionally using cross-validation for parameter selection:
fit = cv.ptLasso(X, y, use.case = "useCase")

# Visualize the model:
plot(fit)

# And make predictions:
predict(fit, Xtest)

# And make predictions:
coef(fit, Xtest)

Multinomial response data

Multinomial response data has a categorical response: each response is in one of three or more categories. For example, observations may be from people with cancer, and our goal can be to predict the type of cancer.
The pretraining philosophy for modeling data with a multinomial response is analogous to that for input grouped data.
We wish to learn across all classes and learn a specific model for each class.
We again use a two-stage training strategy.

Fit an overall model:

Fit a multinomial model using a group lasso penalty on the coefficients \(\beta\).

The group lasso penalty forces all responses to have the same support, but allows different coefficient values.

Fit group-specific models:

Train individual one-vs-rest models using the penalty factor and offset defined by the overall model (as in the input grouped setting).

This step allows us to discover features that are specific to each class.

Example with simulated data

First, simulate data:

set.seed(1234)

n = 900  # number of observations
p = 100   # number of features
k = 3    # number of classes
class.sizes = rep(n/k, k) # size of each class
ncommon = 15  # number of shared features
nindiv = 10    # number of individual features
base.seq = seq(-.5, .5, length.out = ncommon + 1)
shift.common = matrix(
  c(sample(base.seq, ncommon), sample(base.seq, ncommon), sample(base.seq, ncommon)),
  nrow = 3
)
base.seq = seq(-.1, .1, length.out = nindiv + 1)
shift.indiv = matrix(
  c(sample(base.seq, nindiv), sample(base.seq, nindiv), sample(base.seq, nindiv)),
  nrow = 3
)
x     = matrix(rnorm(n * p), n, p)
xtest = matrix(rnorm(n * p), n, p)
y = ytest = c(sapply(1:length(class.sizes), function(i) rep(i, class.sizes[i])))

# Simulate our data.
# Shift x according to our coefficients, which differ by class.
start = ncommon + 1
for (i in 1:k) {
  end = start + nindiv - 1
  
  for(j in 1:ncommon){
    x[y == i, j] = x[y == i, j] + shift.common[i, j]
    xtest[ytest == i, j] = xtest[ytest == i, j] + shift.common[i, j]
  }
  jj = 1
  for(j in start:end){
    x[y == i, j] = x[y == i, j] + shift.indiv[i, jj]
    xtest[ytest == i, j] = xtest[ytest == i, j] + shift.indiv[i, jj]
    jj = jj + 1
  }
  
  start = end + 1
}

Model fitting and visualization

Now we are ready to model.

By default, ptLasso uses type.measure = "deviance", but for ease of interpretability, we use type.measure = "class" (the misclassification rate).

ptLasso uses cv.glmnet to fit \(7\) models:

the overall model (one multinomial model trained using all \(3\) groups),
the 3 pretrained models (a one-vs-rest model for each group) and
the 3 individual models (a one-vs-rest model for each group).

fit = ptLasso(x = x, y = y, 
              use.case = "targetGroups", 
              type.measure = "class",
              alpha = 0.5)

predict(fit, xtest, ytest = ytest)

## 
## Call:  
## predict.ptLasso(object = fit, xtest = xtest, ytest = ytest) 
## 
## 
## 
## alpha =  0.5 
## 
## Performance (Misclassification error):
## 
##            overall   mean group_1 group_2 group_3
## Overall     0.3322                               
## Pretrain    0.3367 0.2267  0.1956  0.2633  0.2211
## Individual  0.3356 0.2337  0.2011  0.2633  0.2367
## 
## Support size:
##                                          
## Overall    28                            
## Pretrain   70 (16 common + 54 individual)
## Individual 50

Now, our performance metrics include:

the “overall” misclassification rate, computed by treating this as a multinomial problem, and
the “mean” and group-specific misclassification rates, computed by treating this as a set of one-vs-rest problems.

We can plot each of the models fitted:

plot(fit)

And extract coefficients:

coefs = coef(fit)

Choosing \(\alpha\) with cross-validation

To choose the pretraining hyperparameter \(\alpha\), we can use cross-validation. Plotting and predicting are done using the same syntax as before.

set.seed(2345)
cvfit = cv.ptLasso(x = x, y = y, 
              use.case = "targetGroups", 
              type.measure = "class")

plot(cvfit)

predict(cvfit, xtest, ytest = ytest)

## 
## Call:  
## predict.cv.ptLasso(object = cvfit, xtest = xtest, ytest = ytest) 
## 
## 
## 
## alpha =  0.9 
## 
## Performance (Misclassification error):
## 
##            overall   mean group_1 group_2 group_3
## Overall     0.3344                               
## Pretrain    0.3367 0.2341  0.2067  0.2589  0.2367
## Individual  0.3489 0.2348  0.2089  0.2611  0.2344
## 
## Support size:
##                                          
## Overall    34                            
## Pretrain   33 (14 common + 19 individual)
## Individual 76

Multi-response data

Multi-response data has multiple responses for each observation. Data like this may come from a survey, where each person answered multiple questions. In this case, the people would be the observations and their answers would be the responses.
ptLasso supports the special case where all responses are continuous (Gaussian).
Pretraining for multi-response data is quite similar to the multinomial setting.
We again wish to learn across all classes and learn a specific model for each class, and we use a two-stage training strategy.

Fit an overall model:

Fit a multi-response model using a group lasso penalty on the coefficients \(\beta\).

Fit response-specific models:

Train individual models for each response, using the penalty factor and offset defined by the overall model.

This is all done within ptLasso. Note that we only support data with multiple Gaussian responses.

Example with simulated data

We start by simulating data with two Gaussian responses. The first 5 features are predictive for both responses; there is an extra set of 5 coefficients that are predictive for each individual response.

set.seed(1234)

# Define constants
n = 1000         # Total number of samples
ntrain = 650     # Number of training samples
p = 500          # Number of features
sigma = 2        # Standard deviation of noise
     

# Generate covariate matrix
x = matrix(rnorm(n * p), n, p)

# Define coefficients for responses 1 and 2
beta1 = c(rep(1, 5), rep(0.5, 5), rep(0, p - 10))
beta2 = c(rep(1, 5), rep(0, 5), rep(0.5, 5), rep(0, p - 15))

mu = cbind(x %*% beta1, x %*% beta2)
y  = cbind(mu[, 1] + sigma * rnorm(n), 
           mu[, 2] + sigma * rnorm(n))

# Split into train and test
xtest = x[-(1:ntrain), ]
ytest = y[-(1:ntrain), ]

x = x[1:ntrain, ]
y = y[1:ntrain, ]

cat("Correlation between two tasks:", cor(y[, 1], y[, 2]))

## Correlation between two tasks: 0.5119585

Model fitting and visualization

fit = ptLasso(x = x, y = y, 
              use.case = "multiresponse",
              alpha = 0.5)

predict(fit, xtest, ytest = ytest)

## 
## Call:  
## predict.ptLasso(object = fit, xtest = xtest, ytest = ytest) 
## 
## 
## 
## alpha =  0.5 
## 
## Performance (Mean squared error):
## 
##            allResponses  mean response_1 response_2
## Overall           9.217 4.608      4.092      5.125
## Pretrain          9.072 4.536      4.132      4.940
## Individual        9.324 4.662      4.168      5.157
## 
## Support size:
##                                         
## Overall    57                           
## Pretrain   23 (20 common + 3 individual)
## Individual 75

plot(fit)

Choosing \(\alpha\) with cross-validation

cvfit = cv.ptLasso(x = x, y = y, 
                   use.case = "multiresponse")

predict(cvfit, xtest, ytest = ytest)

## 
## Call:  
## predict.cv.ptLasso(object = cvfit, xtest = xtest, ytest = ytest) 
## 
## 
## 
## alpha =  0.3 
## 
## Performance (Mean squared error):
## 
##            allResponses  mean response_1 response_2
## Overall           9.217 4.608      4.092      5.125
## Pretrain          9.006 4.503      4.149      4.857
## Individual        9.324 4.662      4.168      5.157
## 
## Support size:
##                                         
## Overall    57                           
## Pretrain   22 (20 common + 2 individual)
## Individual 75

plot(cvfit)

Multi-response data with time ordered columns

Here, we again have multiple responses for each observation. Now, however, these responses are measurements of the same outcome at different moments in time. For example, we may observe patients across many hospital visits, and measure the same biomarker at each visit. In this case, the time ordered biomarker measurements would be the responses.
ptLasso has built-in support for data like this – we’ll call it time series data – where all responses are either Gaussian or binomial.
We wish to leverage signal across time points and fit a model specifically tuned for each time point.
To do pretraining with time series data, we:

Fit an initial model:

Fit a lasso penalized model for the first time point.

Fit the next time-specific model:

Train a model for time 2 using the penalty factor and offset defined by previous model.

Continue:

Train a model for time 3 using the offset from time 2 and the union of supports from times 1 and 2.

Train a model for time 4 using the offset from time 3 and the union of supports from times 1, 2 and 3.

etc.

Example with simulated data

We will again start by simulating data; here we imagine data with two time points. These two time points have a shared support, with different coefficient values.

set.seed(1234)

# Define constants
n = 600          # Total number of samples
ntrain = 300     # Number of training samples
p = 100          # Number of features
sigma = 2        # Standard deviation of noise

# Generate covariate matrix
x = matrix(rnorm(n * p), n, p)

# Define coefficients for time points 1 and 2
beta1 = c(rnorm(10), rep(0, p - 10))  # Coefs at time 1
beta2 = 0.5 * beta1 + c(rep(0, 10), runif(5, min = 0, max = 2), rep(0, p-15))

# Generate response variables for times 1 and 2
y = cbind(
  x %*% beta1 + sigma * rnorm(n),
  x %*% beta2 + sigma * rnorm(n)
)

# Split data into training and testing sets
xtest = x[-(1:ntrain), ]  # Test covariates
ytest = y[-(1:ntrain), ]  # Test response

x = x[1:ntrain, ]  # Train covariates
y = y[1:ntrain, ]  # Train response

Model fitting and visualization

Modeling is just as before, only now we specify use.case = "timeSeries".

Notice that our performance metrics no longer include an “Overall” model – we imagine that the alternative here is to just treat each time point separately. For Gaussian responses, one could alternatively try use.case = "multiresponse". This ignores the time ordering of the responses, but is otherwise a valid approach to consider.

fit = ptLasso(x = x, y = y, 
              use.case = "timeSeries",
              alpha = 0.5)

plot(fit)

predict(fit, xtest, ytest = ytest)

## 
## Call:  
## predict.ptLasso(object = fit, xtest = xtest, ytest = ytest) 
## 
## 
## 
## alpha =  0.5 
## 
## Performance (Mean squared error):
## 
##             mean response_1 response_2
## Pretrain   4.706      4.554      4.859
## Individual 4.413      4.554      4.272
## 
## Support size:
##                                         
## Pretrain   35 (6 common + 29 individual)
## Individual 50

Choosing \(\alpha\) with cross-validation

set.seed(2345)
cvfit = cv.ptLasso(x = x, y = y, 
                   use.case = "timeSeries")

plot(cvfit)

predict(cvfit, xtest, ytest = ytest)

## 
## Call:  
## predict.cv.ptLasso(object = cvfit, xtest = xtest, ytest = ytest) 
## 
## 
## 
## alpha =  0.1 
## 
## Performance (Mean squared error):
## 
##             mean response_1 response_2
## Pretrain   4.343      4.573      4.112
## Individual 4.502      4.573      4.431
## 
## Support size:
##                                         
## Pretrain   37 (8 common + 29 individual)
## Individual 33

Wrap-up

Now we know how to do pretraining for all of the settings covered natively by ptLasso.

If you want to see more example use cases of pretraining – including conditional average treatment effect estimation and multi-response data with different outcome types – check out the ptLasso vignette!

Introduction to ptLasso: drilling deeper

Erin Craig and Rob Tibshirani

2024-06-25

Background

Multinomial response data

Example with simulated data

Model fitting and visualization

Choosing \(\alpha\) with cross-validation

Multi-response data

Example with simulated data

Model fitting and visualization

Choosing \(\alpha\) with cross-validation

Multi-response data with time ordered columns

Example with simulated data

Model fitting and visualization

Choosing \(\alpha\) with cross-validation

Wrap-up