
Introduction to pretraining

Suppose we have a dataset spanning ten cancers and we want to fit a lasso penalized Cox model to predict survival time. Some of the cancer classes in our dataset are large (e.g. breast, lung) and some are small (e.g. head and neck). There are two obvious approaches: (1) fit a “pancancer model” to the entire training set and use it to make predictions for all cancer classes and (2) fit a separate (class specific) model for each cancer and use it to make predictions for that class only.

Pretraining (Craig et al. (2024)) is a method that bridges these two options; it has a parameter that allows you to fit the pancancer model, the class specific models, and everything in between. ptLasso is a package that fits pretrained models using the glmnet package (Friedman, Tibshirani, and Hastie (2010)), including lasso, elasticnet and ridge models.

Our example dataset of ten different cancers illustrates the "input grouped" setting: there is a grouping on the rows of $X$, and each row belongs to one of the cancer classes. We want to fit a specific model for each of the ten cancers, but we also want to share information across all cancers. Importantly, pretraining is a general method to pass information from one model to another; it has many uses beyond those discussed here, including time series data, multi-response data and conditional average treatment effect estimation. Some of these modeling tasks are not supported by the ptLasso package, and the final section in this vignette shows how to do pretraining for them using the glmnet package. The remainder of this introduction describes the input grouped setting.

Before we describe pretraining in more detail, we will first give a quick review of the lasso.

Review of the lasso

For the Gaussian family with data $(x_i, y_i),\ i = 1, 2, \ldots, n$, the lasso has the form $$\begin{equation} {\rm argmin}_{\beta_0, \beta} \frac{1}{2} \sum_{i=1}^n \Bigl(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Bigr)^2 + \lambda \sum_{j=1}^p |\beta_j|. \end{equation}$$ Varying the regularization parameter $\lambda \ge 0$ yields a path of solutions: an optimal value $\hat\lambda$ is usually chosen by cross-validation, using for example the cv.glmnet function from the package glmnet.

Two extensions are useful for pretraining. First, as in GLMs and $\ell_1$-regularized GLMs, one can include an offset: a pre-specified $n$-vector that is included as an additional column of the feature matrix, but whose weight is fixed at 1. Second, one can generalize the $\ell_1$ norm to a weighted norm of the form $$\begin{equation} \sum_j {\rm pf}_j |\beta_j|, \end{equation}$$ where each ${\rm pf}_j \ge 0$ is a penalty factor for feature $j$. At the extremes, a penalty factor of zero implies no penalty and means that the feature will always be included in the model; a penalty factor of $+\infty$ leads to that feature being discarded (i.e., never entered into the model).
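These two ingredients are exactly what pretraining uses below. As a quick illustration, here is a minimal sketch of both arguments in glmnet (standalone simulated data, for illustration only):

library(glmnet)

# Illustration only: an offset and per-feature penalty factors in glmnet.
n = 100; p = 10
x = matrix(rnorm(n * p), n, p)
y = rnorm(n)

off = rep(0.5, n)               # a pre-specified n-vector whose weight is fixed at 1
pf  = c(0, rep(1, p - 2), Inf)  # feature 1 always included; feature p always excluded

fit = glmnet(x, y, offset = off, penalty.factor = pf)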

Details of pretraining

For the input grouped setting, pretraining model fitting happens in two steps. First, train a model using the full data: $$\begin{equation} \hat{\mu}_0, \hat{\theta}_1, \dots, \hat{\theta}_K, \hat{\beta}_0 = \arg \min_{\mu_0, \theta_1, \dots, \theta_K, \beta_0} \frac{1}{2} \sum_{k=1}^K \| y_k - \left(\mu_0 \mathbf{1} + \theta_k \mathbf{1} + X_k \beta_0\right) \|_2^2 + \lambda \|\beta_0\|_1, \end{equation}$$ where:

  • $X_k, y_k$ are the observations in group $k$,
  • $\theta_k$ is the group specific intercept for group $k$ (by convention, $\hat{\theta}_1 = 0$),
  • $\mu_0, \beta_0$ are the overall intercept and coefficients,
  • and $\lambda$ is a regularization parameter that has been chosen (perhaps the value minimizing the CV error).

Define $S(\hat{\beta}_0)$ to be the support set (the nonzero coefficients) of $\hat{\beta}_0$.

Then, for each group $k$, fit an individual model: find $\hat{\mu}_k$ and $\hat{\beta}_k$ such that $$\begin{eqnarray} && \hat{\mu}_k, \hat{\beta}_k = \arg \min_{\mu_k, \beta_k} \frac{1}{2} \| y_k - (1-\alpha) \left(\hat{\mu}_0 \mathbf{1} + \hat{\theta}_k \mathbf{1} + X_k \hat{\beta}_0\right) - (\mu_k \mathbf{1} + X_k \beta_k) \|_2^2 + \cr && \phantom{\hat{\mu}_k, \hat{\beta}_k} \lambda_2 \sum_{j=1}^p \Bigl[ I(j \in S(\hat{\beta}_0)) + \frac{1}{\alpha} I(j \notin S(\hat{\beta}_0)) \Bigr] |\beta_{kj}|, \label{eq:model} \end{eqnarray}$$ where $\lambda_2 > 0$ and $\alpha \in [0,1]$ are hyperparameters that may be chosen through cross validation.

This is a lasso linear regression model with two additional components: an offset $(1-\alpha)\left(\hat{\mu}_0 \mathbf{1} + \hat{\theta}_k \mathbf{1} + X_k \hat{\beta}_0\right)$, and a penalty factor for coefficient $j$ which is $1$ if $j \in S(\hat{\beta}_0)$ and $\frac{1}{\alpha}$ otherwise.

Notice that when $\alpha = 0$, this returns the overall model fine tuned for each group: the second stage model is only allowed to fit the residual $y_k - \left(\hat{\mu}_0 \mathbf{1} + \hat{\theta}_k \mathbf{1} + X_k \hat{\beta}_0\right)$, and the penalty factor only allows the use of $\beta_j$ if feature $j$ was already selected by the overall model.

At the other extreme, when $\alpha = 1$, this is equivalent to fitting a separate model for each class: there is no offset, and the lasso penalty factor is 1 for all features (the usual lasso penalty).

ptLasso under the hood

All model fitting in ptLasso is done with cv.glmnet. The first step of pretraining is a straightforward call to cv.glmnet; the second step is done by calling cv.glmnet with:

  1. offset $(1-\alpha)\left(\hat{\mu}_0 \mathbf{1} + \hat{\theta}_k \mathbf{1} + X_k \hat{\beta}_0\right)$ and
  2. penalty.factor, the $j^\text{th}$ entry of which is $1$ if $j \in S(\hat{\beta}_0)$ and $\frac{1}{\alpha}$ otherwise.
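To make this concrete, here is a minimal sketch of the second stage for a single group k (illustration only, not ptLasso's exact internals; it assumes overall is the first-stage cv.glmnet fit on x and y, and alpha is the pretraining parameter):

beta0   = coef(overall, s = "lambda.1se")[-1]  # first-stage coefficients, intercept dropped
support = which(beta0 != 0)                    # the support set S(beta0-hat)

pf = rep(1 / alpha, length(beta0))             # penalty 1/alpha off the support...
pf[support] = 1                                # ...and 1 on the support

xk  = x[groups == k, ]
yk  = y[groups == k]
off = (1 - alpha) * predict(overall, xk, s = "lambda.1se")

fitk = cv.glmnet(xk, yk, offset = off, penalty.factor = pf)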

Because ptLasso uses cv.glmnet, it inherits most of the virtues of the glmnet package: for example, it handles sparse input-matrix formats, as well as range constraints on coefficients.

Additionally, one call to ptLasso fits an overall model, pretrained class specific models, and class specific models for each group (without pretraining). The ptLasso package also includes methods for prediction and plotting, and a function that performs K-fold cross-validation.

Quick start

ptLasso uses the same syntax as glmnet

For those familiar with glmnet, ptLasso has a similar structure: ptLasso has functions to train, plot and predict, and it follows the syntax of glmnet.

Additionally, ptLasso has a parameter $\alpha$ that is analogous to the elasticnet parameter of the same name. To avoid confusion, we will refer to the elasticnet parameter as $\alpha_{\text{en}}$. As with $\alpha_{\text{en}}$ in glmnet, you must specify the value of $\alpha$ that you want to use when calling ptLasso; the default is $\alpha = 0.5$.

# The typical glmnet pipeline: train, plot and predict,
# using elasticnet parameter 0.2.
fit = glmnet(X, y, alpha = 0.2)
plot(fit)
test.preds = predict(fit, Xtest)

# The typical ptLasso pipeline: train, plot and predict,
# using pretraining parameter 0.5.
fit = ptLasso(X, y, groups, alpha = 0.5)
plot(fit)
test.preds = predict(fit, Xtest, groupstest)

There are two big differences between ptLasso and glmnet:

  • ptLasso calls cv.glmnet under the hood: cross validation over $\lambda$ is done automatically, and
  • the ptLasso package includes cv.ptLasso: a function to do cross validation over $\alpha$.

With cross validation, the typical ptLasso pipeline looks like:

fit = cv.ptLasso(X, y, groups, alpha = 0.5)
plot(fit)
test.preds = predict(fit, Xtest, groupstest)

The predict function uses the value of $\alpha$ that achieved the best average CV performance across groups. But it is possible to instead use a different $\alpha$ for each group (specifically, the $\alpha$ that achieved the best CV performance for each group). An example is at the end of this section.

An example

First, we load the ptLasso package:

require(ptLasso)
#> Loading required package: ptLasso
#> Loading required package: ggplot2
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loaded glmnet 4.1-8
#> Loading required package: gridExtra

To show how to use ptLasso, we’ll simulate data with 5 groups and a continuous response using the helper function gaussian.example.data. There are $n = 200$ observations in each group and $p = 120$ features. All groups share 10 informative features; though the features are shared, they have different coefficient values. Each group has 10 additional features that are specific to that group, and all other features are uninformative.

set.seed(1234)

out = gaussian.example.data()
x = out$x; y = out$y; groups = out$groups

outtest = gaussian.example.data()
xtest = outtest$x; ytest = outtest$y; groupstest = outtest$groups

Now we are ready to fit a model using ptLasso. We’ll use the pretraining parameter $\alpha = 0.5$ (chosen arbitrarily).

fit <- ptLasso(x, y, groups, alpha = 0.5)

The function ptLasso used cv.glmnet to fit 11 models:

  • the overall model (using all 5 groups),
  • the 5 pretrained models (one for each group) and
  • the 5 individual models (one for each group).

A call to plot displays the cross validation curves for each model. The top row shows the overall model, the middle row the pretrained models, and the bottom row the individual models.

plot(fit)

predict makes predictions from all 11 models. It returns a list containing:

  1. yhatoverall (predictions from the overall model),
  2. yhatpre (predictions from the pretrained models) and
  3. yhatind (predictions from the individual models).

By default, predict uses lambda.min for all 11 cv.glmnet models; you could instead specify s = "lambda.1se" or use a numeric value. Whatever value of $\lambda$ you choose will be used for all models (overall, pretrained and individual).

preds = predict(fit, xtest, groupstest=groupstest)
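To use the lambda.1se rule instead (a sketch of the alternative just described):

preds.1se = predict(fit, xtest, groupstest = groupstest, s = "lambda.1se")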

If you also provide ytest (for model validation), predict will additionally compute performance measures.

preds = predict(fit, xtest, groupstest=groupstest, ytest=ytest)
preds
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (Mean squared error):
#> 
#>            allGroups  mean group_1 group_2 group_3 group_4 group_5    r^2
#> Overall        755.7 755.7   836.0   554.9   565.4   777.9  1044.0 0.5371
#> Pretrain       503.2 503.2   550.6   443.3   553.5   505.6   462.9 0.6918
#> Individual     532.8 532.8   584.1   443.2   567.2   550.5   518.9 0.6736
#> 
#> Support size:
#>                                          
#> Overall    64                            
#> Pretrain   94 (21 common + 73 individual)
#> Individual 109

To access the coefficients of the fitted models, use coef as usual. This returns a list with the coefficients of the individual models, pretrained models and overall model, as returned by glmnet.

all.coefs = coef(fit, s = "lambda.min")
names(all.coefs)
#> [1] "individual" "pretrain"   "overall"

The entries for the individual and pretrained models are lists with one entry for each group. Because we have 5 groups, we’ll have 5 sets of coefficients.

length(all.coefs$pretrain)
#> [1] 5

The first few coefficients for group 1 from the pretrained model are:

head(all.coefs$pretrain[[1]])
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#>                     s1
#> (Intercept)  0.5088629
#> V1          -4.0203684
#> V2           .        
#> V3           .        
#> V4          -0.1923623
#> V5          -0.6581933

When we used ptLasso to fit a model, we chose $\alpha = 0.5$. In practice we recommend choosing $\alpha$ more thoughtfully by using (1) a validation set to measure performance for a few different choices of $\alpha$ (e.g. $0, 0.25, 0.5, 0.75, 1.0$) or (2) the function cv.ptLasso.
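Option (1) might look like the following sketch, assuming a held-out validation set xval, yval and groupsval (not created in this vignette):

# Sketch: choose alpha on a validation set (xval, yval, groupsval assumed).
alphas = c(0, 0.25, 0.5, 0.75, 1)
val.mse = sapply(alphas, function(a) {
  fit = ptLasso(x, y, groups, alpha = a)
  preds = predict(fit, xval, groupstest = groupsval)
  mean((yval - preds$yhatpre)^2)  # MSE of the pretrained predictions
})
alphas[which.min(val.mse)]        # the chosen alpha

Option (2), the function cv.ptLasso, automates a similar search.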

The call to cv.ptLasso is nearly identical to that for ptLasso. By default, cv.ptLasso will try $\alpha = 0, 0.1, 0.2, \dots, 1$, but this can be changed with the argument alphalist. After fitting, printing the cv.ptLasso object shows the cross validated mean squared error for all models.

cvfit <- cv.ptLasso(x, y, groups)
cvfit
#> 
#> Call:  
#> cv.ptLasso(x = x, y = y, groups = groups, family = "gaussian",  
#>     type.measure = "mse", use.case = "inputGroups", group.intercepts = TRUE) 
#> 
#> 
#> 
#> type.measure:  mse 
#> 
#> 
#>            alpha overall  mean wtdMean group_1 group_2 group_3 group_4 group_5
#> Overall            696.4 696.4   696.4   739.5   500.8   566.4   669.4  1005.9
#> Pretrain     0.0   523.2 523.2   523.2   511.2   475.1   513.8   521.5   594.3
#> Pretrain     0.1   512.9 512.9   512.9   417.7   471.3   554.8   537.5   583.4
#> Pretrain     0.2   501.0 501.0   501.0   415.3   449.9   539.1   496.4   604.2
#> Pretrain     0.3   494.5 494.5   494.5   409.4   432.8   536.9   512.2   581.3
#> Pretrain     0.4   486.9 486.9   486.9   390.7   420.2   536.0   522.4   565.1
#> Pretrain     0.5   507.2 507.2   507.2   411.3   451.4   577.4   532.5   563.7
#> Pretrain     0.6   506.9 506.9   506.9   382.7   448.0   573.1   497.0   633.6
#> Pretrain     0.7   504.9 504.9   504.9   377.7   485.2   582.1   507.0   572.7
#> Pretrain     0.8   496.4 496.4   496.4   395.9   471.5   573.7   488.8   552.1
#> Pretrain     0.9   526.0 526.0   526.0   384.4   482.8   605.6   522.9   634.2
#> Pretrain     1.0   538.8 538.8   538.8   422.4   506.6   604.4   533.8   626.8
#> Individual         538.8 538.8   538.8   422.4   506.6   604.4   533.8   626.8
#> 
#> alphahat (fixed) = 0.4
#> alphahat (varying):
#> group_1 group_2 group_3 group_4 group_5 
#>     0.7     0.4     0.0     0.8     0.8

Plotting the cv.ptLasso object visualizes performance as a function of $\alpha$.

plot(cvfit)

And, as with ptLasso, we can predict. By default, predict uses the $\alpha$ that minimized the cross validated MSE.

preds = predict(cvfit, xtest, groupstest=groupstest, ytest=ytest)
preds
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.4 
#> 
#> Performance (Mean squared error):
#> 
#>            allGroups  mean group_1 group_2 group_3 group_4 group_5    r^2
#> Overall        757.1 757.1   815.7   542.6   567.1   792.7  1067.5 0.5362
#> Pretrain       511.1 511.1   579.7   460.1   547.5   502.9   465.6 0.6869
#> Individual     527.9 527.9   563.5   441.8   567.2   548.0   518.9 0.6766
#> 
#> Support size:
#>                                          
#> Overall    50                            
#> Pretrain   86 (29 common + 57 individual)
#> Individual 109

We could instead use the argument alphatype = "varying" to use a different $\alpha$ for each group – we choose the $\alpha$ that minimizes the CV MSE for each group:

preds = predict(cvfit, xtest, groupstest=groupstest, ytest=ytest, 
                alphatype="varying")
preds
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest, alphatype = "varying") 
#> 
#> 
#> alpha:
#> [1] 0.7 0.4 0.0 0.8 0.8
#> 
#> 
#> Performance (Mean squared error):
#>            overall  mean wtdMean group_1 group_2 group_3 group_4 group_5
#> Overall      757.1 757.1   757.1   815.7   542.6   567.1   792.7  1067.5
#> Pretrain     505.0 505.0   505.0   502.6   460.1   542.4   537.9   481.8
#> Individual   527.9 527.9   527.9   563.5   441.8   567.2   548.0   518.9
#> 
#> 
#> Support size:
#>                                           
#> Overall    50                             
#> Pretrain   103 (29 common + 74 individual)
#> Individual 109

Other details

Choosing $\alpha$, the pretraining parameter

Selecting the parameter $\alpha$ is an important part of pretraining. The simplest way to do this is to use cv.ptLasso – this will automatically perform pretraining for a range of $\alpha$ values and return the CV performance for each. The default values for $\alpha$ are $0, 0.1, 0.2, \dots, 1$.

cvfit <- cv.ptLasso(x, y, groups)
cvfit
#> 
#> Call:  
#> cv.ptLasso(x = x, y = y, groups = groups, family = "gaussian",  
#>     type.measure = "mse", use.case = "inputGroups", group.intercepts = TRUE) 
#> 
#> 
#> 
#> type.measure:  mse 
#> 
#> 
#>            alpha overall  mean wtdMean group_1 group_2 group_3 group_4 group_5
#> Overall            699.7 699.7   699.7   748.4   501.9   575.6   663.0  1009.9
#> Pretrain     0.0   518.6 518.6   518.6   470.1   471.5   547.0   540.7   563.7
#> Pretrain     0.1   506.0 506.0   506.0   429.7   452.1   538.7   551.1   558.3
#> Pretrain     0.2   495.3 495.3   495.3   393.6   460.6   565.5   530.9   526.1
#> Pretrain     0.3   490.4 490.4   490.4   390.4   436.5   546.3   511.6   567.4
#> Pretrain     0.4   487.5 487.5   487.5   383.7   438.8   545.6   509.4   560.3
#> Pretrain     0.5   481.2 481.2   481.2   364.9   429.7   548.5   513.4   549.7
#> Pretrain     0.6   504.1 504.1   504.1   393.1   460.0   586.4   531.9   549.0
#> Pretrain     0.7   511.5 511.5   511.5   393.2   462.7   584.3   492.9   624.3
#> Pretrain     0.8   509.1 509.1   509.1   382.4   496.2   597.9   503.4   565.6
#> Pretrain     0.9   501.5 501.5   501.5   404.0   481.6   581.9   488.3   552.0
#> Pretrain     1.0   517.1 517.1   517.1   409.1   488.9   612.7   484.7   590.1
#> Individual         517.1 517.1   517.1   409.1   488.9   612.7   484.7   590.1
#> 
#> alphahat (fixed) = 0.5
#> alphahat (varying):
#> group_1 group_2 group_3 group_4 group_5 
#>     0.5     0.5     0.1     1.0     0.2

Of course, you can specify the values of $\alpha$ to consider:

cvfit <- cv.ptLasso(x, y, groups, alphalist = c(0, 0.5, 1))
cvfit
#> 
#> Call:  
#> cv.ptLasso(x = x, y = y, groups = groups, alphalist = c(0, 0.5,  
#>     1), family = "gaussian", type.measure = "mse", use.case = "inputGroups",  
#>     group.intercepts = TRUE) 
#> 
#> 
#> type.measure:  mse 
#> 
#> 
#>            alpha overall  mean wtdMean group_1 group_2 group_3 group_4 group_5
#> Overall            708.8 708.8   708.8   739.0   514.4   575.4   665.0  1050.1
#> Pretrain     0.0   524.4 524.4   524.4   481.7   485.7   529.4   526.8   598.4
#> Pretrain     0.5   496.3 496.3   496.3   365.0   448.5   569.3   507.5   591.1
#> Pretrain     1.0   526.4 526.4   526.4   399.4   513.5   611.8   492.9   614.6
#> Individual         526.4 526.4   526.4   399.4   513.5   611.8   492.9   614.6
#> 
#> alphahat (fixed) = 0.5
#> alphahat (varying):
#> group_1 group_2 group_3 group_4 group_5 
#>     0.5     0.5     0.0     1.0     0.5

At prediction time, cv.ptLasso uses the $\alpha$ that had the best CV performance on average across all groups. We could instead choose to use a different $\alpha$ for each group, as cv.ptLasso already figured out which $\alpha$ optimizes the CV performance for each group. To use group-specific values of $\alpha$, specify alphatype = "varying" at prediction time. In this example, most of the best group-specific $\alpha$ values happen to be $0.5$ – the same as the overall $\alpha$.

###############################################
# Common alpha for all groups:
###############################################
predict(cvfit, xtest, groupstest, ytest=ytest)
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (Mean squared error):
#> 
#>            allGroups  mean group_1 group_2 group_3 group_4 group_5    r^2
#> Overall        757.1 757.1   815.7   542.6   567.1   792.7  1067.5 0.5362
#> Pretrain       507.0 507.0   556.6   446.3   556.6   504.1   471.4 0.6894
#> Individual     527.9 527.9   572.6   443.2   562.4   550.5   510.7 0.6766
#> 
#> Support size:
#>                                          
#> Overall    50                            
#> Pretrain   95 (25 common + 70 individual)
#> Individual 110

###############################################
# Different alpha for each group:
###############################################
predict(cvfit, xtest, groupstest, ytest=ytest, alphatype = "varying")
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest, alphatype = "varying") 
#> 
#> 
#> alpha:
#> [1] 0.5 0.5 0.0 1.0 0.5
#> 
#> 
#> Performance (Mean squared error):
#>            overall  mean wtdMean group_1 group_2 group_3 group_4 group_5
#> Overall      757.1 757.1   757.1   815.7   542.6   567.1   792.7  1067.5
#> Pretrain     517.3 517.3   517.3   556.6   446.3   561.5   550.5   471.4
#> Individual   527.9 527.9   527.9   572.6   443.2   562.4   550.5   510.7
#> 
#> 
#> Support size:
#>                                          
#> Overall    50                            
#> Pretrain   99 (25 common + 74 individual)
#> Individual 110

Choosing $\lambda$, the lasso path parameter, for the first stage of pretraining

The first step of pretraining fits the overall model with cv.glmnet and selects a model along the λ\lambda path. The second stage uses the overall model’s support and predictions to train the group-specific models.

At train time, we need to choose a value of $\lambda$ to use for the first stage. This can be specified in ptLasso with the argument overall.lambda, which accepts "lambda.1se" (the default) or "lambda.min".

Whatever choice is made at train time will be automatically used at test time, and this cannot be changed. The fitted model from the second stage of pretraining expects the offset to have been computed using a particular model – it does not make sense to compute the offset using a model with a different $\lambda$.

# Default:
fit <- ptLasso(x, y, groups, alpha = 0.5, overall.lambda = "lambda.1se")

# Alternative:
fit <- ptLasso(x, y, groups, alpha = 0.5, overall.lambda = "lambda.min")

Fitting elasticnet or ridge models

By default, ptLasso fits lasso penalized models; in glmnet, this corresponds to the elasticnet parameter $\alpha_\text{en} = 1$ (where the subscript en stands for "elasticnet"). Fitting pretrained elasticnet or ridge models is also possible with ptLasso: use the argument en.alpha, which ranges between $0$ (ridge) and $1$ (lasso). Here is an example using the pretraining parameter alpha = 0.5 and the elasticnet parameter en.alpha = 0.2.

fit <- ptLasso(x, y, groups, 
               alpha = 0.5,    # pretraining parameter
               en.alpha = 0.2) # elasticnet parameter

Printing progress during model training

When models take a long time to train, it can be useful to print out progress during training. ptLasso has two ways to do this (and they can be combined). First, we can simply print out which model is being fitted using verbose = TRUE:

fit <- ptLasso(x, y, groups, alpha = 0.5, verbose = TRUE)
#> Fitting overall model
#> Fitting individual models
#>  Fitting individual model 1 / 5
#>  Fitting individual model 2 / 5
#>  Fitting individual model 3 / 5
#>  Fitting individual model 4 / 5
#>  Fitting individual model 5 / 5
#> Fitting pretrained lasso models
#>  Fitting pretrained model 1 / 5
#>  Fitting pretrained model 2 / 5
#>  Fitting pretrained model 3 / 5
#>  Fitting pretrained model 4 / 5
#>  Fitting pretrained model 5 / 5

We can also print out a progress bar for each model that is being fit – this functionality comes directly from cv.glmnet, and follows its notation. (To avoid cluttering this document, we do not run the following example.)

fit <- ptLasso(x, y, groups, alpha = 0.5, trace.it = TRUE)

And of course, we can combine these to print out (1) which model is being trained and (2) the corresponding progress bar.

fit <- ptLasso(x, y, groups, alpha = 0.5, verbose = TRUE, trace.it = TRUE)

Using individual and overall models that were previously trained

ptLasso will fit the overall and individual models. However, if you have already trained the overall or individual models, you can save compute time by passing them directly to ptLasso – they will not be refitted. ptLasso expects that these models were fitted using the same training data that you pass to ptLasso, and that they were fitted with the argument keep = TRUE. Here is an example. We will fit an overall model and individual models, and then we will show how to pass them to ptLasso. Using verbose = TRUE in the call to ptLasso shows us what models are being trained (and confirms that we are not refitting the overall and individual models).

overall.model = cv.glmnet(x, y, keep = TRUE)
individual.models = lapply(1:5, 
                           function(kk) cv.glmnet(x[groups == kk, ], 
                                                  y[groups == kk], 
                                                  keep = TRUE))

fit <- ptLasso(x, y, groups, 
               fitoverall = overall.model,
               fitind = individual.models,
               verbose = TRUE)
#> Fitting pretrained lasso models
#>  Fitting pretrained model 1 / 5
#>  Fitting pretrained model 2 / 5
#>  Fitting pretrained model 3 / 5
#>  Fitting pretrained model 4 / 5
#>  Fitting pretrained model 5 / 5

Of course, we could pass just the overall or just the individual models to ptLasso:

fit <- ptLasso(x, y, groups, fitoverall = overall.model, verbose = TRUE)
#> Fitting individual models
#>  Fitting individual model 1 / 5
#>  Fitting individual model 2 / 5
#>  Fitting individual model 3 / 5
#>  Fitting individual model 4 / 5
#>  Fitting individual model 5 / 5
#> Fitting pretrained lasso models
#>  Fitting pretrained model 1 / 5
#>  Fitting pretrained model 2 / 5
#>  Fitting pretrained model 3 / 5
#>  Fitting pretrained model 4 / 5
#>  Fitting pretrained model 5 / 5
fit <- ptLasso(x, y, groups, fitind = individual.models, verbose = TRUE)
#> Fitting overall model
#> Fitting pretrained lasso models
#>  Fitting pretrained model 1 / 5
#>  Fitting pretrained model 2 / 5
#>  Fitting pretrained model 3 / 5
#>  Fitting pretrained model 4 / 5
#>  Fitting pretrained model 5 / 5

Fitting the overall model without group-specific intercepts

When we fit the overall model with input grouped data, we solve the following: $$\begin{equation} \hat{\mu}_0, \hat{\theta}_2, \dots, \hat{\theta}_K, \hat{\beta}_0 = \arg \min_{\mu, \theta_2, \dots, \theta_K, \beta} \frac{1}{2} \sum_{k=1}^K \| y_k - \left(\mu \mathbf{1} + \theta_k \mathbf{1} + X_k \beta\right) \|_2^2 + \lambda \|\beta\|_1, \end{equation}$$ where $\hat{\theta}_1$ is defined to be $0$. We can instead omit $\theta_1, \dots, \theta_K$ and fit the following: $$\begin{equation} \hat{\mu}_0, \hat{\beta}_0 = \arg \min_{\mu, \beta} \frac{1}{2} \sum_{k=1}^K \| y_k - \left(\mu \mathbf{1} + X_k \beta\right) \|_2^2 + \lambda \|\beta\|_1. \end{equation}$$ This may be useful in settings where the groups are different between train and test sets (see “Different groups in train and test data” under “Input grouped data”). To do this, use the argument group.intercepts = FALSE.

cvfit <- cv.ptLasso(x, y, groups, group.intercepts = FALSE)
cvfit
#> 
#> Call:  
#> cv.ptLasso(x = x, y = y, groups = groups, group.intercepts = FALSE,  
#>     family = "gaussian", type.measure = "mse", use.case = "inputGroups") 
#> 
#> 
#> 
#> type.measure:  mse 
#> 
#> 
#>            alpha overall  mean wtdMean group_1 group_2 group_3 group_4 group_5
#> Overall            696.2 696.2   696.2   694.7   489.2   580.5   670.0  1046.4
#> Pretrain     0.0   508.9 508.9   508.9   463.3   457.0   525.2   536.0   562.8
#> Pretrain     0.1   491.0 491.0   491.0   431.3   456.8   526.8   497.9   542.1
#> Pretrain     0.2   487.7 487.7   487.7   395.9   457.7   522.4   510.4   552.3
#> Pretrain     0.3   486.6 486.6   486.6   398.3   451.9   522.6   511.7   548.6
#> Pretrain     0.4   494.9 494.9   494.9   384.1   475.8   529.1   516.0   569.6
#> Pretrain     0.5   494.7 494.7   494.7   382.9   444.1   553.7   488.4   604.5
#> Pretrain     0.6   480.4 480.4   480.4   352.8   455.2   545.6   489.1   559.4
#> Pretrain     0.7   512.8 512.8   512.8   381.0   517.0   573.2   501.6   591.4
#> Pretrain     0.8   509.6 509.6   509.6   389.9   470.0   602.6   510.4   574.9
#> Pretrain     0.9   512.4 512.4   512.4   393.5   477.9   600.7   497.6   592.1
#> Pretrain     1.0   526.8 526.8   526.8   409.5   502.1   622.5   492.3   607.6
#> Individual         526.8 526.8   526.8   409.5   502.1   622.5   492.3   607.6
#> 
#> alphahat (fixed) = 0.6
#> alphahat (varying):
#> group_1 group_2 group_3 group_4 group_5 
#>     0.6     0.5     0.2     0.5     0.1

Arguments for use in cv.glmnet

Because model fitting is done with cv.glmnet, ptLasso accepts and passes along arguments to cv.glmnet. Notable choices include penalty.factor, weights, upper.limits, lower.limits and en.alpha (known as alpha in glmnet). Please refer to the glmnet documentation for more information on their use.

ptLasso does not support the arguments intercept, offset, fit and check.args.
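For example, here is a sketch passing observation weights and coefficient bounds through to cv.glmnet (the argument values are arbitrary, for illustration):

fit <- ptLasso(x, y, groups, alpha = 0.5,
               weights = rep(1, nrow(x)),  # observation weights
               lower.limits = -1,          # lower bound for every coefficient
               upper.limits = 1)           # upper bound for every coefficient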

Parallelizing model fitting

For large datasets, we can parallelize model fitting within the calls to cv.glmnet. As in cv.glmnet, pass the argument parallel = TRUE, and register a parallel backend beforehand:

require(doMC)
registerDoMC(cores = 4)
fit = ptLasso(x, y, groups = groups, family = "gaussian", type.measure = "mse", 
              parallel=TRUE)

Input grouped data

Base case: input grouped data with a binomial outcome

In the Quick Start, we applied ptLasso to data with a continuous response. Here, we’ll use data with a binary outcome. The helper function binomial.example.data creates a dataset with $k = 3$ groups (each with $100$ observations), 5 shared coefficients, and 5 coefficients specific to each group.

set.seed(1234)

out = binomial.example.data()
x = out$x; y = out$y; groups = out$groups

outtest = binomial.example.data()
xtest = outtest$x; ytest = outtest$y; groupstest = outtest$groups

We can fit and predict as before. By default, predict.ptLasso will compute and return the deviance on the test set.

fit = ptLasso(x, y, groups, alpha = 0.5, family = "binomial")

predict(fit, xtest, groupstest, ytest = ytest)
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (Deviance):
#> 
#>            allGroups  mean wtdMean group_1 group_2 group_3
#> Overall        1.359 1.359   1.359   1.334   1.321   1.421
#> Pretrain       1.279 1.279   1.279   1.272   1.169   1.397
#> Individual     1.283 1.283   1.283   1.265   1.186   1.399
#> 
#> Support size:
#>                                        
#> Overall    7                           
#> Pretrain   12 (3 common + 9 individual)
#> Individual 20

We could instead compute the AUC by specifying the type.measure in the call to ptLasso. Note: type.measure is specified during model fitting and not prediction because it is used in each call to cv.glmnet.

fit = ptLasso(x, y, groups, alpha = 0.5, family = "binomial", 
              type.measure = "auc")

predict(fit, xtest, groupstest, ytest = ytest)
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (AUC):
#> 
#>            allGroups   mean wtdMean group_1 group_2 group_3
#> Overall       0.6026 0.6039  0.6039  0.6161  0.6877  0.5080
#> Pretrain      0.6407 0.6524  0.6524  0.6936  0.7447  0.5190
#> Individual    0.6442 0.6618  0.6618  0.6936  0.7732  0.5186
#> 
#> Support size:
#>                                         
#> Overall    15                           
#> Pretrain   39 (3 common + 36 individual)
#> Individual 40

To fit the overall and individual models, we can use elasticnet instead of lasso by setting the parameter en.alpha (as in glmnet, and described in the section “Fitting elasticnet or ridge models”).

fit = ptLasso(x, y, groups, alpha = 0.5, family = "binomial", 
              type.measure = "auc", 
              en.alpha = .5)
predict(fit, xtest, groupstest, ytest = ytest)
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (AUC):
#> 
#>            allGroups   mean wtdMean group_1 group_2 group_3
#> Overall       0.6041 0.6018  0.6018  0.5928  0.6704  0.5422
#> Pretrain      0.6270 0.6547  0.6547  0.6781  0.7720  0.5141
#> Individual    0.6387 0.6598  0.6598  0.6756  0.7820  0.5218
#> 
#> Support size:
#>                                         
#> Overall    3                            
#> Pretrain   39 (3 common + 36 individual)
#> Individual 36

Using cross validation is the same as in the Gaussian case:

##################################################
# Fit:
##################################################
fit = cv.ptLasso(x, y, groups, family = "binomial", type.measure = "auc")
#> Warning: from glmnet C++ code (error code -100); Convergence for 100th lambda
#> value not reached after maxit=100000 iterations; solutions for larger lambdas
#> returned
#> Warning: from glmnet C++ code (error code -100); Convergence for 100th lambda
#> value not reached after maxit=100000 iterations; solutions for larger lambdas
#> returned
#> Warning: from glmnet C++ code (error code -92); Convergence for 92th lambda
#> value not reached after maxit=100000 iterations; solutions for larger lambdas
#> returned
#> Warning: from glmnet C++ code (error code -90); Convergence for 90th lambda
#> value not reached after maxit=100000 iterations; solutions for larger lambdas
#> returned

##################################################
# Predict with a common alpha for all groups:
##################################################
predict(fit, xtest, groupstest, ytest = ytest)
#> 
#> Call:  
#> predict.cv.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.7 
#> 
#> Performance (AUC):
#> 
#>            allGroups   mean wtdMean group_1 group_2 group_3
#> Overall       0.5990 0.5960  0.5960  0.6030  0.6644  0.5206
#> Pretrain      0.6401 0.6640  0.6640  0.6965  0.7732  0.5222
#> Individual    0.6559 0.6707  0.6707  0.6936  0.7808  0.5377
#> 
#> Support size:
#>                                         
#> Overall    7                            
#> Pretrain   40 (3 common + 37 individual)
#> Individual 37

##################################################
# Predict with a different alpha for each group:
##################################################
predict(fit, xtest, groupstest, ytest = ytest, alphatype = "varying")
#> 
#> Call:  
#> predict.cv.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest, alphatype = "varying") 
#> 
#> 
#> alpha:
#> [1] 0.2 0.5 0.2
#> 
#> 
#> Performance (AUC):
#>            overall   mean wtdMean group_1 group_2 group_3
#> Overall     0.5990 0.5960  0.5960  0.6030  0.6644  0.5206
#> Pretrain    0.6359 0.6573  0.6573  0.6838  0.7736  0.5145
#> Individual  0.6559 0.6707  0.6707  0.6936  0.7808  0.5377
#> 
#> 
#> Support size:
#>                                         
#> Overall    7                            
#> Pretrain   40 (3 common + 37 individual)
#> Individual 37

Base case: input grouped survival data

require(survival)
#> Loading required package: survival

Now, we will simulate survival times with 3 groups; the three groups have overlapping support, with 5 shared features and 5 individual features per group. To compute survival time, we start by computing $\text{survival} = X \beta + \epsilon$, where $\beta$ is specific to each group and $\epsilon$ is noise. Because survival times must be positive, we modify this to be $\text{survival} = \text{survival} + 1.1 \cdot \text{abs}(\text{min}(\text{survival}))$.

set.seed(1234)

n = 600; ntrain = 300
p = 50
     
x = matrix(rnorm(n*p), n, p)
beta1 = c(rnorm(5), rep(0, p-5))

beta2 = runif(p) * beta1 # Shared support
beta2 = beta2 + c(rep(0, 5), rnorm(5), rep(0, p-10)) # Individual features

beta3 = runif(p) * beta1 # Shared support
beta3 = beta3 + c(rep(0, 10), rnorm(5), rep(0, p-15)) # Individual features

# Randomly split into groups
groups = sample(1:3, n, replace = TRUE)

# Compute survival times:
survival = x %*% beta1
survival[groups == 2] = x[groups == 2, ] %*% beta2
survival[groups == 3] = x[groups == 3, ] %*% beta3
survival = survival + rnorm(n)
survival = survival + 1.1 * abs(min(survival))

# Censoring times from a random uniform distribution:
censoring = runif(n, min = 1, max = 10)

# Did we observe survival or censoring?
y = Surv(pmin(survival, censoring), survival <= censoring)

# Split into train and test:
xtest = x[-(1:300), ]
ytest = y[-(1:300), ]
groupstest = groups[-(1:300)]

x = x[1:300, ]
y = y[1:300, ]
groups = groups[1:300]

Training with ptLasso is much the same as it was for the continuous and binomial cases; the only difference is that we specify family = "cox". By default, ptLasso uses the partial likelihood for model selection. We could instead use the C index.

############################################################
# Default -- use partial likelihood as the type.measure:
############################################################
fit = ptLasso(x, y, groups, alpha = 0.5, family = "cox")
predict(fit, xtest, groupstest, ytest = ytest)
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (Deviance):
#> 
#>            allGroups  mean wtdMean group_1 group_2 group_3
#> Overall        381.2 87.60   89.36   99.49  106.53   56.79
#> Pretrain       396.3 87.86   88.66   93.31   96.54   73.72
#> Individual     425.2 99.07   99.54  111.68  101.85   83.67
#> 
#> Support size:
#>                                         
#> Overall    10                           
#> Pretrain   20 (4 common + 16 individual)
#> Individual 24

############################################################
# Alternatively -- use the C index:
############################################################
fit = ptLasso(x, y, groups, alpha = 0.5, family = "cox", type.measure = "C")
#> Warning: from glmnet C++ code (error code -30075); Numerical error at 75th
#> lambda value; solutions for larger values of lambda returned
predict(fit, xtest, groupstest, ytest = ytest)
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (C-index):
#> 
#>            allGroups   mean wtdMean group_1 group_2 group_3
#> Overall       0.8545 0.8673  0.8608  0.9139  0.7746  0.9133
#> Pretrain      0.8359 0.8396  0.8393  0.9152  0.8173  0.7864
#> Individual    0.7925 0.7985  0.8008  0.9075  0.8007  0.6873
#> 
#> Support size:
#>                                         
#> Overall    6                            
#> Pretrain   35 (4 common + 31 individual)
#> Individual 37

The call to cv.ptLasso is again much the same; we only need to specify family (“cox”) and type.measure (if we want to use the C index instead of the partial likelihood).

##################################################
# Fit:
##################################################
fit = cv.ptLasso(x, y, groups, family = "cox", type.measure = "C")

##################################################
# Predict with a common alpha for all groups:
##################################################
predict(fit, xtest, groupstest, ytest = ytest)
#> 
#> Call:  
#> predict.cv.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0.2 
#> 
#> Performance (C-index):
#> 
#>            allGroups   mean wtdMean group_1 group_2 group_3
#> Overall       0.8527 0.8652  0.8586  0.9113  0.7711  0.9133
#> Pretrain      0.8501 0.8795  0.8742  0.9177  0.8043  0.9164
#> Individual    0.7865 0.8005  0.8033  0.9126  0.8078  0.6811
#> 
#> Support size:
#>                                        
#> Overall    8                           
#> Pretrain   13 (4 common + 9 individual)
#> Individual 31

##################################################
# Predict with a different alpha for each group:
##################################################
predict(fit, xtest, groupstest, ytest = ytest, alphatype = "varying")
#> 
#> Call:  
#> predict.cv.ptLasso(object = fit, xtest = xtest, groupstest = groupstest,  
#>     ytest = ytest, alphatype = "varying") 
#> 
#> 
#> alpha:
#> [1] 0.3 0.4 0.4
#> 
#> 
#> Performance (C-index):
#>            overall   mean wtdMean group_1 group_2 group_3
#> Overall     0.8527 0.8652  0.8586  0.9113  0.7711  0.9133
#> Pretrain    0.8081 0.8493  0.8475  0.9229  0.8078  0.8173
#> Individual  0.7865 0.8005  0.8033  0.9126  0.8078  0.6811
#> 
#> 
#> Support size:
#>                                         
#> Overall    8                            
#> Pretrain   28 (4 common + 24 individual)
#> Individual 31

Different groups in train and test data

Suppose we observe groups at test time that were unobserved at train time. For example, our training set may consist of $K$ people – each with many observations – and at test time, we wish to make predictions for observations from new people. We can still use pretraining in this setting: train a model using all data, and use this to guide the training of person-specific models.

Now, however, we also fit an extra model to predict the similarity of test observations to the observations from each of the training people. To train this model, we use the (training) observation matrix $X$ and the response $y_{\text{sim}}$, where $y_{\text{sim}} = k$ for all observations from the $k^\text{th}$ person. When used for prediction, this model gives us a similarity (or probability) vector of length $K$ that sums to 1, describing how similar an observation is to each training person.

At test time, we make predictions from (1) each pretrained person-specific model and (2) the person-similarity model, and we compute the weighted average of the pretrained predictions with respect to the similarity vector. Here is an example using simulated data.

set.seed(1234)

# Start with 5 people, each with 300 observations and 200 features.
# 3 people will be used for training, and 2 for testing.
n = 300*5; p = 200;
groups = sort(rep(1:5, n/5))

# We will have different coefficients for each of the 3 training people, 
# and the first 3 features are shared support.
beta.group1 = c(-1, 1, 1, rep(0.5, 3), rep(0, p-6)); 
beta.group2 = c(-1, 1, 1, rep(0, 3), rep(0.5, 3), rep(0, p-9)); 
beta.group3 = c(-1, 1, 1, rep(0, 6), rep(0.5, 3), rep(0, p-12)); 

# The two test people are each a combination of the training people.
# Person 4 will have observations drawn from classes 1 and 2, and
# Person 5 will have observations drawn from classes 1 and 3.
# The vector "hidden groups" is a latent variable - used to simulate data
# but unobserved in real data.
hidden.gps = groups
hidden.gps[hidden.gps == 4] = sample(c(1, 2), sum(groups == 4), replace = TRUE)
hidden.gps[hidden.gps == 5] = sample(c(1, 3), sum(groups == 5), replace = TRUE)

# We modify X according to group membership;
# we want X to cluster into groups 1, 2 and 3.
x = matrix(rnorm(n * p), nrow = n, ncol = p)
x[hidden.gps == 1, 1:3] = x[hidden.gps == 1, 1:3] + 1
x[hidden.gps == 2, 1:3] = x[hidden.gps == 2, 1:3] + 2
x[hidden.gps == 3, 1:3] = x[hidden.gps == 3, 1:3] + 3

# And now, we compute y using betas 1, 2 and 3: 
x.beta = rep(0, n)
x.beta[hidden.gps == 1] = x[hidden.gps == 1, ] %*% beta.group1 
x.beta[hidden.gps == 2] = x[hidden.gps == 2, ] %*% beta.group2 
x.beta[hidden.gps == 3] = x[hidden.gps == 3, ] %*% beta.group3
y = x.beta + 5 * rnorm(n)

We’re ready to split into train, validation and test sets. We will use people 1, 2 and 3 for training and validation (two-thirds train, one-third validation), and people 4 and 5 for testing.

trn.index = groups < 4
val.sample = sample(1:sum(trn.index), 1/3 * sum(trn.index), replace = FALSE)

xtrain = x[trn.index, ][-val.sample, ]
ytrain = y[trn.index][-val.sample]
gpstrain = groups[trn.index][-val.sample]

xval   = x[trn.index, ][val.sample, ] 
yval = y[trn.index][val.sample]
gpsval = groups[trn.index][val.sample]

xtest  = x[!trn.index, ]
ytest = y[!trn.index]
gpstest = groups[!trn.index]

We start with pretraining, where the person ID is the grouping variable.

cvfit = cv.ptLasso(xtrain, ytrain, gpstrain, 
                   type.measure = "mse", 
                   group.intercepts = FALSE,
                   overall.lambda = "lambda.1se")

Now, we train a model to predict the person ID from the covariates. Because this example is simulated, we can measure the performance of our model on test data (via the confusion matrix comparing predicted group labels to true labels). In real settings, this would be impossible.

simmod = cv.glmnet(xtrain, as.factor(gpstrain), family = "multinomial")

# Peek at performance on test data.
# Not possible with real data.
class.preds = predict(simmod, xtest, type="response")[, , 1]
table(apply(class.preds, 1, which.max), 
      hidden.gps[groups >= 4])
#>    
#>       1   2   3
#>   1 260  37   3
#>   2  39  82  29
#>   3   0  36 114

Finally we can make predictions: we have everything we need. For each test observation, we will get the pretrained prediction for all 3 training classes. Our final predictions are the weighted combination of the predictions from ptLasso and the class predictions from glmnet.

alphahat  = cvfit$alphahat
bestmodel = cvfit$fit[[which(cvfit$alphalist == alphahat)]]

offset = (1-alphahat) * predict(bestmodel$fitoverall, xtest, s = "lambda.1se")

# Get the prediction for all three classes for each test observation. 
# This will be a matrix with three columns; one for each class.
pretrained.preds = do.call(cbind, 
                        lapply(1:3, 
                               function(i) predict(bestmodel$fitpre[[i]], 
                                                   xtest,
                                                   newoffset = offset)
                      )
)

assess.glmnet( rowSums(pretrained.preds * class.preds), newy = ytest)$mse
#> [1] 28.17891
#> attr(,"measure")
#> [1] "Mean-Squared Error"

There are two reasonable baselines. The first is the overall model with no grouping at all, and the second is the set of individual models (one for each group).

########################################################
# Baseline 1: overall model
########################################################
overall.predictions = predict(cvfit$fitoverall, xtest)
assess.glmnet(overall.predictions, newy = ytest)$mse
#> lambda.1se 
#>   29.64747 
#> attr(,"measure")
#> [1] "Mean-Squared Error"

########################################################
# Baseline 2: individual models
########################################################
individual.preds = do.call(cbind, 
                           lapply(1:3, 
                                  function(i) predict(bestmodel$fitind[[i]], 
                                                      xtest,
                                                      type = "response")
                      )
)
assess.glmnet(rowSums(individual.preds * class.preds), newy = ytest)$mse
#> [1] 29.17333
#> attr(,"measure")
#> [1] "Mean-Squared Error"

What we have done – taking a weighted average of predictions with respect to similarity to each person – makes sense mathematically. However, we have found better empirical results if we instead train a supervised learning algorithm to make the final prediction $\hat{y}$ using the pretrained model predictions and the class similarity predictions as features. So, let’s do that here, using our so-far-untouched validation set.

val.offset = predict(bestmodel$fitoverall, xval, s = "lambda.1se")
val.offset = (1 - alphahat) * val.offset
val.preds = do.call(cbind, 
                    lapply(1:3, function(i) predict(bestmodel$fitpre[[i]], 
                                                    xval,
                                                    newoffset = val.offset,
                                                    type = "response")
                      )
)
val.class.preds = predict(simmod, xval, type = "response")[, , 1]

pred.data = cbind(val.preds, val.class.preds, val.preds * val.class.preds) 
final.model = cv.glmnet(pred.data, yval)  # response: the observed validation outcomes

pred.data.test = cbind(pretrained.preds, 
                       class.preds, 
                       pretrained.preds * class.preds)
assess.glmnet(predict(final.model, pred.data.test), newy = ytest)$mse
#> lambda.1se 
#>   28.28504 
#> attr(,"measure")
#> [1] "Mean-Squared Error"

Comparing performance of all models side-by-side shows that (1) using input groups improved performance – including for the individual models and (2) including the final model did not help performance dramatically (but we still recommend trying this with real data).

rd = function(x) round(x, 2)

cat("Overall model PSE: ", 
    rd(assess.glmnet(overall.predictions, newy = ytest)$mse))
#> Overall model PSE:  29.65
cat("Individual model PSE: ", 
    rd(assess.glmnet(rowSums(individual.preds*class.preds), newy = ytest)$mse))
#> Individual model PSE:  29.17
cat("Pretraining model PSE: ", 
    rd(assess.glmnet(rowSums(pretrained.preds*class.preds), newy = ytest)$mse))
#> Pretraining model PSE:  28.18
cat("Pretraining model + final prediction model PSE: ", 
    rd(assess.glmnet(predict(final.model, 
                          cbind(pretrained.preds, 
                                class.preds, 
                                pretrained.preds * class.preds)
                      ), 
              newy = ytest)$mse))
#> Pretraining model + final prediction model PSE:  28.29

Learning the input groups

Suppose we have a dataset with features $X$ and response $y$, and no input grouping. Suppose we also have a small set of meaningful features $Z$ that we expect to stratify observations (e.g. in biomedicine, $Z$ may consist of age and sex). In this setting, we can learn input groups using $Z$.

The steps to do this are as follows.

  1. Partition data into two sets: one to learn the grouping and one to do pretraining.
  2. With the first set, train a small CART tree using $Z$ and $y$.
  3. Make predictions for the remaining data; assign observations to groups according to their terminal nodes.
  4. Apply pretraining using the learned group assignments.

Here, we show an example using simulated data. We use rpart to train a CART tree. The package ODRF (Liu and Xia (2022)) is another good choice – it fits a linear model in each terminal node, which is closer to what pretraining does, and may therefore have better performance.

require(rpart)
#> Loading required package: rpart

Simulate data with a binary outcome: $X$ is drawn from a random normal (with $p = 50$ uncorrelated features), and $Z$ is simulated as age (uniform between 20 and 90) and sex (half 0, half 1). The true groups are (1) age under 50, (2) age over 50 and sex = 0 and (3) age over 50 and sex = 1.

set.seed(1234)

n = 1000; p = 50
groupvars = cbind(age = round(runif(n, min = 20, max = 90)), 
                  sex = sample(c(0, 1), n, replace = TRUE))
groups = rep(1, n)
groups[groupvars[, "age"] > 50 & groupvars[, "sex"] == 0] = 2
groups[groupvars[, "age"] > 50 & groupvars[, "sex"] == 1] = 3

Now, we’ll define coefficients $\beta_k$ such that $P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-x_i^T \beta_k)}$ for each group. Across groups, three coefficients are shared, three are group-specific and the rest are 0. Each group has a unique intercept to adjust its baseline risk.

beta.group1 = c(-0.5, 0.5, 0.1, c(0.1, 0.2, 0.3), rep(0, p-6)); 
beta.group2 = c(-0.5, 0.5, 0.1, rep(0, 3), c(0.1, 0.2, 0.3), rep(0, p-9)); 
beta.group3 = c(-0.5, 0.5, 0.1, rep(0, 6), c(0.1, 0.2, 0.3), rep(0, p-12)); 

x = matrix(rnorm(n * p), nrow = n, ncol = p)
x.beta = rep(0, n)
x.beta[groups == 1] = x[groups == 1, ] %*% beta.group1 - 0.75
x.beta[groups == 2] = x[groups == 2, ] %*% beta.group2 
x.beta[groups == 3] = x[groups == 3, ] %*% beta.group3 + 0.75

y = rbinom(n, size = 1, prob = 1/(1 + exp(-x.beta)))

# Now that we have our data, we will partition it into 3 datasets: 
# one to cluster, one to train models and one to test performance.
xcluster = x[1:250, ]; xtrain = x[251:750, ]; xtest = x[751:1000, ];
ycluster = y[1:250];   ytrain = y[251:750];   ytest = y[751:1000];

zcluster = groupvars[1:250, ]; 
ztrain = groupvars[251:750, ]; 
ztest = groupvars[751:1000, ];

# We will use this just to see how our clustering performed.
# Not possible with real data!
groupstrain = groups[251:750]; 

By design, $P(y = 1)$ is different across groups:

ggplot() + 
  geom_boxplot(aes(x=groups, y=1/(1 + exp(-x.beta)), group = groups)) +
  labs(x = "Group", y = "P(y = 1)") +
  theme_minimal()

We cluster using rpart. Note that we use maxdepth = 2: an obvious choice because we simulated the data and we know that there is a second-level interaction (age + sex) that determines outcome. In general, however, we recommend keeping this tree small (maxdepth smaller than 4) so that it is easily interpretable.

treefit = rpart(ycluster~., 
                data = data.frame(zcluster, ycluster), 
                control=rpart.control(maxdepth=2, minbucket=20))
treefit
#> n= 250 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#> 1) root 250 61.82400 0.4480000  
#>   2) age< 50.5 111 23.18919 0.2972973 *
#>   3) age>=50.5 139 34.10072 0.5683453  
#>     6) sex< 0.5 56 13.92857 0.4642857 *
#>     7) sex>=0.5 83 19.15663 0.6385542 *

We want our tree to return the ID of the terminal node for each observation instead of class probabilities. The following is a trick that causes predict to behave as desired.

leaf = treefit$frame[, 1] == "<leaf>"
treefit$frame[leaf, "yval"] = 1:sum(leaf)

predgroupstrain = predict(treefit, data.frame(ztrain))
predgroupstest  = predict(treefit, data.frame(ztest))

Finally, we are ready to apply pretraining using the predicted groups as our grouping variable.

cvfit = cv.ptLasso(xtrain, ytrain, predgroupstrain, family = "binomial", 
                   type.measure = "auc", nfolds = 10, 
                   overall.lambda = "lambda.min")
predict(cvfit, xtest, predgroupstest, ytest = ytest)
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, groupstest = predgroupstest,  
#>     ytest = ytest) 
#> 
#> 
#> alpha =  0 
#> 
#> Performance (AUC):
#> 
#>            allGroups   mean wtdMean group_1 group_2 group_3
#> Overall       0.7081 0.6448  0.6399  0.6085  0.6575  0.6684
#> Pretrain      0.7109 0.6590  0.6526  0.6147  0.6823  0.6800
#> Individual    0.7058 0.6525  0.6477  0.6085  0.6428  0.7063
#> 
#> Support size:
#>                                       
#> Overall    8                          
#> Pretrain   8 (8 common + 0 individual)
#> Individual 19

Note that the overall model trained by cv.ptLasso takes advantage of the clustering: it fits a unique intercept for each group. Performance would have been much worse if we hadn’t done any clustering at all:

baseline.model = cv.glmnet(xtrain, ytrain, family = "binomial", 
                           type.measure = "auc", nfolds = 5)
assess.glmnet(baseline.model, newx=xtest, newy=ytest)$auc
#> [1] 0.6050242
#> attr(,"measure")
#> [1] "AUC"

Target grouped data

Intuition

Now we turn to the target grouped setting, where we have a dataset with a multinomial outcome and no other grouping on the observations. For example, our data might look like the following:

set.seed(1234)

n = 500; p = 75; k = 3
X = matrix(rnorm(n * p), nrow = n, ncol = p)
y = sample(1:k, n, replace = TRUE)

Xtest = matrix(rnorm(n * p), nrow = n, ncol = p)

Each row in $X$ belongs to class 1, 2 or 3, and we wish to predict class membership. We could fit a single multinomial model to the data:

multinomial = cv.glmnet(X, y, family = "multinomial")

multipreds  = predict(multinomial, Xtest, s = "lambda.min")
multipreds.class = apply(multipreds, 1, which.max)

Or, we could fit 3 one-vs-rest models; at prediction time, we would assign observations to the class with the highest probability.

class1 = cv.glmnet(X, y == 1, family = "binomial")
class2 = cv.glmnet(X, y == 2, family = "binomial")
class3 = cv.glmnet(X, y == 3, family = "binomial")

ovrpreds = cbind(
  predict(class1, Xtest, s = "lambda.min"),
  predict(class2, Xtest, s = "lambda.min"),
  predict(class3, Xtest, s = "lambda.min"))
ovrpreds.class = apply(ovrpreds, 1, which.max)

Another alternative is to do pretraining, which fits something in between one model for all data and three separate models. ptLasso will do this for you, using the arguments family = "multinomial" and use.case = "targetGroups".

fit = ptLasso(X, y, groups = y, alpha = 0.5,
              family = "multinomial", 
              use.case = "targetGroups")

But what exactly is pretraining doing here? We’ll walk through an example, doing pretraining “by hand”. The steps are:

  1. Train an overall model: a multinomial model using a penalty on the coefficients $\beta$ so that each coefficient is either 0 or nonzero for all classes.
  2. Train individual one-vs-rest models using the penalty factor and offset defined by the overall model (as in the input grouped setting).

To train the overall model, we use cv.glmnet with type.multinomial = "grouped". This puts a penalty on $\beta$ to force coefficients to be in or out of the model for all classes. This is analogous to the overall model in the input grouped setting: we want to first learn shared information.

multinomial = cv.glmnet(X, y, family = "multinomial", 
                        type.multinomial = "grouped",
                        keep = TRUE)

Then, we fit 3 one-vs-rest models using the support and offset from the multinomial model.

# The support of the overall model:
nonzero.coefs = which((coef(multinomial, s = "lambda.1se")[[1]] != 0)[-1])

# The offsets - one for each class:
offset = predict(multinomial, X, s = "lambda.1se")
offset.class1 = offset[, 1, 1]
offset.class2 = offset[, 2, 1]
offset.class3 = offset[, 3, 1]

Now we have everything we need to train the one-vs-rest models. As always, we have the pretraining parameter $\alpha$ - for this example, let’s use $\alpha = 0.5$:

alpha = 0.5
penalty.factor = rep(1/alpha, p)
penalty.factor[nonzero.coefs] = 1

class1 = cv.glmnet(X, y == 1, family = "binomial", 
                   offset = (1-alpha) * offset.class1,
                   penalty.factor = penalty.factor)
class2 = cv.glmnet(X, y == 2, family = "binomial", 
                   offset = (1-alpha) * offset.class2,
                   penalty.factor = penalty.factor)
class3 = cv.glmnet(X, y == 3, family = "binomial", 
                   offset = (1-alpha) * offset.class3,
                   penalty.factor = penalty.factor)

And we’re done with pretraining! To predict, we compute offsets for the test data from the overall model, and again assign each row to the class with the highest prediction:

newoffset = predict(multinomial, Xtest, s = "lambda.1se")
ovrpreds = cbind(
  predict(class1, Xtest, s = "lambda.min", newoffset = (1-alpha) * newoffset[, 1, 1]),
  predict(class2, Xtest, s = "lambda.min", newoffset = (1-alpha) * newoffset[, 2, 1]),
  predict(class3, Xtest, s = "lambda.min", newoffset = (1-alpha) * newoffset[, 3, 1])
)
ovrpreds.class = apply(ovrpreds, 1, which.max)

This is all done automatically within ptLasso; we will now show an example using the ptLasso functions. The example above is intended only to show how pretraining works for multinomial outcomes, and some technical details have been omitted. (For example, ptLasso takes care of crossfitting between the first and second steps.)
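One of those details: the offsets above were computed by predicting on the training data with the overall model, but ptLasso instead uses cross-fitted offsets. Here is a sketch of how these could be obtained, reusing the multinomial fit from above (which was run with keep = TRUE):

# Cross-fitted offsets at lambda.1se, one column per class; each row comes
# from a model that was fit without that observation’s fold.
offset = multinomial$fit.preval[, , multinomial$lambda == multinomial$lambda.1se]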

Example

First, let’s simulate multinomial data with 5 classes. We start by drawing $X$ from a normal distribution (uncorrelated features), and then we shift the columns differently for each group.

set.seed(1234)

n = 500; p = 50; k = 5
class.sizes = rep(n/k, k)
ncommon = 10; nindiv = 5;
shift.common = seq(-.2, .2, length.out = k)
shift.indiv  = seq(-.1, .1, length.out = k)

x     = matrix(rnorm(n * p), n, p)
xtest = matrix(rnorm(n * p), n, p)
y = ytest = c(sapply(1:length(class.sizes), function(i) rep(i, class.sizes[i])))

start = ncommon + 1
for (i in 1:k) {
  end = start + nindiv - 1
  x[y == i, 1:ncommon] = x[y == i, 1:ncommon] + shift.common[i]
  x[y == i, start:end] = x[y == i, start:end] + shift.indiv[i]
  
  xtest[ytest == i, 1:ncommon] = xtest[ytest == i, 1:ncommon] + shift.common[i]
  xtest[ytest == i, start:end] = xtest[ytest == i, start:end] + shift.indiv[i]
  start = end + 1
}

The calls to ptLasso and cv.ptLasso are almost the same as in the input grouped setting, only now we specify use.case = "targetGroups". The call to predict does not require a groups argument because the groups are unknown at prediction time.

################################################################################
# Fit the pretrained model.
# By default, ptLasso uses type.measure = "deviance", but for ease of
# interpretability, we use type.measure = "class" (the misclassification rate).
################################################################################
fit = ptLasso(x = x, y = y, 
              use.case = "targetGroups", type.measure = "class")

################################################################################
# Predict
################################################################################
predict(fit, xtest, ytest = ytest)
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, ytest = ytest) 
#> 
#> 
#> 
#> alpha =  0.5 
#> 
#> Performance (Misclassification error):
#> 
#>            overall   mean group_1 group_2 group_3 group_4 group_5
#> Overall      0.738                                               
#> Pretrain     0.728 0.2000   0.200     0.2     0.2     0.2   0.200
#> Individual   0.736 0.1984   0.196     0.2     0.2     0.2   0.196
#> 
#> Support size:
#>                                         
#> Overall    29                           
#> Pretrain   23 (23 common + 0 individual)
#> Individual 32

################################################################################
# Fit with CV to choose the alpha parameter
################################################################################
cvfit = cv.ptLasso(x = x, y = y,
              use.case = "targetGroups", type.measure = "class")

################################################################################
# Predict using one alpha for all classes
################################################################################
predict(cvfit, xtest, ytest = ytest)
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, ytest = ytest) 
#> 
#> 
#> 
#> alpha =  0.9 
#> 
#> Performance (Misclassification error):
#> 
#>            overall   mean group_1 group_2 group_3 group_4 group_5
#> Overall      0.738                                               
#> Pretrain     0.722 0.1992     0.2     0.2     0.2     0.2   0.196
#> Individual   0.742 0.2000     0.2     0.2     0.2     0.2   0.200
#> 
#> Support size:
#>                                         
#> Overall    39                           
#> Pretrain   32 (23 common + 9 individual)
#> Individual 36

################################################################################
# Predict using a separate alpha for each class
################################################################################
predict(cvfit, xtest, ytest = ytest, alphatype = "varying")
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, ytest = ytest,  
#>     alphatype = "varying") 
#> 
#> 
#> alpha =  0.1 0 0.7 0 0.1 
#> 
#> Performance (Misclassification error):
#> 
#>            overall   mean group_1 group_2 group_3 group_4 group_5
#> Overall      0.738                                               
#> Pretrain     0.742 0.2016   0.208     0.2     0.2   0.202   0.198
#> Individual   0.742 0.2000   0.200     0.2     0.2   0.200   0.200
#> 
#> Support size:
#>                                          
#> Overall    39                            
#> Pretrain   36 (23 common + 13 individual)
#> Individual 36

Multi-response data with Gaussian responses

Multitask learning consists of data $X$ with two or more responses $y_1, \dots, y_j$. We usually assume that there is shared signal across the responses, and that performance can be improved by jointly fitting models for the responses.

Here, we suppose that we wish to predict multiple Gaussian responses. (If the goal is to predict multiple responses of a different type, see the section “Multi-response data with mixed response types”.)

Pretraining is a natural choice for multitask learning – it allows us to pass information between models for the different responses. The overview for our approach is to:

  1. fit a multi-response Gaussian model using a group lasso penalty (more below),
  2. extract the support (shared across responses) and offsets (one for each response), and
  3. fit a model for each response, using the shared support and appropriate offset.

Importantly, the group lasso penalty behaves like the lasso, but on the whole group of a feature’s coefficients across responses: they are either all zero, or else none are zero (see the glmnet documentation about family = "mgaussian" for more detail). As a result, the multi-response Gaussian model is forced to choose the same support for all responses $y_1, \dots, y_j$. This encourages learning across all responses in the first stage; in the second stage, we find features that are specific to each individual response $y_k$.
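To make the first stage concrete, here is a minimal sketch using cv.glmnet directly; it assumes a covariate matrix x and a multi-column response matrix y like the ones simulated below, and it omits the cross-fitting that ptLasso handles for you.

# Stage 1 (sketch): a multi-response Gaussian model with a group lasso
# penalty, so each feature is in or out of the model for all responses.
overall = cv.glmnet(x, y, family = "mgaussian")

# The shared support (the same features are nonzero for every response):
support = which(coef(overall, s = "lambda.1se")[[1]][-1] != 0)

# The offsets, one column per response:
offsets = predict(overall, x, s = "lambda.1se")[, , 1]

The second stage is then the usual pretraining: for each response, fit a Gaussian model with the penalty factor defined by this support and the corresponding column of offsets.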

This is all done with the function ptLasso, using the argument use.case = "multiresponse".

We will illustrate this with simulated data with two Gaussian responses; the two responses share the first 5 features, and they each have 5 features of their own. The two responses are quite related, with Pearson correlation around 0.5.

set.seed(1234)

# Define constants
n = 1000         # Total number of samples
ntrain = 650     # Number of training samples
p = 500          # Number of features
sigma = 2        # Standard deviation of noise
     

# Generate covariate matrix
x = matrix(rnorm(n * p), n, p)

# Define coefficients for responses 1 and 2
beta1 = c(rep(1, 5), rep(0.5, 5), rep(0, p - 10))
beta2 = c(rep(1, 5), rep(0, 5), rep(0.5, 5), rep(0, p - 15))

mu = cbind(x %*% beta1, x %*% beta2)
y  = cbind(mu[, 1] + sigma * rnorm(n), 
           mu[, 2] + sigma * rnorm(n))

cat("SNR for the two tasks:", round(diag(var(mu)/var(y-mu)), 2))
#> SNR for the two tasks: 1.6 1.44
cat("Correlation between two tasks:", cor(y[, 1], y[, 2]))
#> Correlation between two tasks: 0.5164748

# Split into train and test
xtest = x[-(1:ntrain), ]
ytest = y[-(1:ntrain), ]

x = x[1:ntrain, ]
y = y[1:ntrain, ]

Now, we are ready to call ptLasso with our covariates x and response matrix y, and we specify the argument use.case = "multiresponse". A call to plot shows the CV curves over the lasso parameter $\lambda$ for each model.

fit = ptLasso(x, y, use.case = "multiresponse")
plot(fit)

To choose the pretraining parameter $\alpha$, we can use cv.ptLasso. Using plot, we can view the CV curve for pretraining together with the overall model (the multi-response Gaussian model) and the individual models (a separate Gaussian model for each response).

fit = cv.ptLasso(x, y, use.case = "multiresponse")
plot(fit)

As in previous examples, we can predict using the predict function; if ytest is supplied, this will print the mean squared error as well as the support size for the pretrained, overall and individual models, using the single $\alpha$ that minimizes the average CV MSE across both responses.

preds = predict(fit, xtest, ytest = ytest)
preds
#> 
#> Call:  
#> predict.cv.ptLasso(object = fit, xtest = xtest, ytest = ytest) 
#> 
#> 
#> 
#> alpha =  0.3 
#> 
#> Performance (Mean squared error):
#> 
#>            allGroups  mean response_1 response_2
#> Overall        9.217 4.608      4.092      5.125
#> Pretrain       9.006 4.503      4.149      4.857
#> Individual     9.324 4.662      4.168      5.157
#> 
#> Support size:
#>                                         
#> Overall    57                           
#> Pretrain   22 (20 common + 2 individual)
#> Individual 75

Also as before, we can choose to use the value of $\alpha$ that minimizes the CV MSE for each response.

preds = predict(fit, xtest, ytest = ytest, alphatype = "varying")
preds
#> 
#> Call:  
#> predict.cv.ptLasso(object = fit, xtest = xtest, ytest = ytest,  
#>     alphatype = "varying") 
#> 
#> 
#> alpha:
#> [1] 0.3 0.3
#> 
#> 
#> Performance (Mean squared error):
#>            allGroups  mean response_1 response_2
#> Overall        9.217 4.608      4.092      5.125
#> Pretrain       9.006 4.503      4.149      4.857
#> Individual     9.324 4.662      4.168      5.157
#> 
#> 
#> Support size:
#>                                         
#> Overall    57                           
#> Pretrain   22 (20 common + 2 individual)
#> Individual 75

Time series data

We may have repeated measurements of $X$ and $y$ across time; for example, we may observe patients at two different points in time. We expect that the relationship between $X$ and $y$ will be different at time 1 and time 2, but not completely unrelated. Therefore, pretraining can be useful: we can use the model fitted at time 1 to inform the model for time 2.

ptLasso supports this setting, and below is an example. We first assume that $X$ is constant across time, and $y$ changes. Later, we will show an example where $X$ changes across time.

To do pretraining with time series data, we:

  1. fit a model for time 1 and extract its offset and support,
  2. use the offset and support (the usual pretraining) to train a model for time 2.

We could continue this for $k$ time points: after fitting a model for time 2, we would extract the offset and support. Now, the offset will include the offset from time 1 and the prediction from time 2; the support will be the union of supports from the first two models.
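For intuition, here is a minimal by-hand sketch of these two steps using cv.glmnet; it assumes a covariate matrix x, a two-column response matrix y (one column per time point) and a fixed $\alpha$, and it omits the cross-fitting that ptLasso handles for you.

# Step 1: fit a model at time 1 and extract its support and fitted offset.
fit1    = cv.glmnet(x, y[, 1])
support = which(coef(fit1, s = "lambda.1se")[-1] != 0)
offset1 = predict(fit1, x, s = "lambda.1se")

# Step 2: the usual pretraining for time 2, here with alpha = 0.5.
alpha = 0.5
pf = rep(1/alpha, ncol(x)); pf[support] = 1
fit2 = cv.glmnet(x, y[, 2],
                 offset = (1 - alpha) * offset1,
                 penalty.factor = pf)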

Example 1: covariates are constant over time

We’ll start by simulating data – more details in the comments.

set.seed(1234)

# Define constants
n = 600          # Total number of samples
ntrain = 300     # Number of training samples
p = 100          # Number of features
sigma = 3        # Standard deviation of noise

# Generate covariate matrix
x = matrix(rnorm(n * p), n, p)

# Define coefficients for time points 1 and 2
beta1 = c(rep(2, 10), rep(0, p - 10))  # Coefs at time 1
beta2 = runif(p, 0.5, 2) * beta1       # Coefs at time 2, shared support with time 1

# Generate response variables for times 1 and 2
y = cbind(
  x %*% beta1 + sigma * rnorm(n),
  x %*% beta2 + sigma * rnorm(n)
)

# Split data into training and testing sets
xtest = x[-(1:ntrain), ]  # Test covariates
ytest = y[-(1:ntrain), ]  # Test response

x = x[1:ntrain, ]  # Train covariates
y = y[1:ntrain, ]  # Train response

Having simulated data, we are ready to call ptLasso; the call to ptLasso looks much the same as in all our other examples, only now (1) $y$ is a matrix with one column for each time point and (2) we specify use.case = "timeSeries". After fitting, a call to plot shows the models fitted for both of the time points with and without using pretraining.

fit = ptLasso(x, y, use.case = "timeSeries", alpha = 0)
plot(fit)

And as before, we can predict with xtest. In this example, pretraining helps performance: the two time points share the same support, and pretraining discovers and leverages this.

preds = predict(fit, xtest, ytest = ytest)
preds
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, ytest = ytest) 
#> 
#> 
#> 
#> alpha =  0 
#> 
#> Performance (Mean squared error):
#> 
#>              mean response_1 response_2
#> Pretrain    9.604      10.78      8.428
#> Individual 10.428      10.78     10.076
#> 
#> Support size:
#>                                          
#> Pretrain   26 (10 common + 16 individual)
#> Individual 39

We specified alpha = 0 in this example, but cross validation would advise us to choose $\alpha = 0.2$. Plotting shows us the average performance across the two time points. Importantly, at time 1, the individual model and the pretrained model are the same; we do not see the advantage of pretraining until time 2 (when we use information from time 1).

cvfit = cv.ptLasso(x, y, use.case = "timeSeries")
plot(cvfit)

predict(cvfit, xtest, ytest = ytest)
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, ytest = ytest) 
#> 
#> 
#> 
#> alpha =  0.2 
#> 
#> Performance (Mean squared error):
#> 
#>              mean response_1 response_2
#> Pretrain    9.875      10.87      8.884
#> Individual 10.447      10.87     10.027
#> 
#> Support size:
#>                                          
#> Pretrain   28 (10 common + 18 individual)
#> Individual 40

Note that we could also have treated this as a multiresponse problem, and ignored the time-ordering of the responses. See more in the section called “Multi-response data with Gaussian responses”. (However, time ordering can be informative, and the multi-response approach does not make use of this.)

fit = ptLasso(x, y, use.case = "multiresponse")

Example 2: covariates change over time

Now, we’ll repeat what we did above, but we’ll simulate data where $x$ changes with time. In this setting, ptLasso expects $x$ to be a list with one covariate matrix for each time point.

set.seed(1234)  # Set seed for reproducibility

# Define constants
n = 600          # Total number of samples
ntrain = 300     # Number of training samples
p = 100          # Number of features
sigma = 3        # Standard deviation of noise

# Covariates for times 1 and 2
x1 = matrix(rnorm(n * p), n, p)
x2 = x1 + matrix(0.2 * rnorm(n * p), n, p)  # Perturbed covariates for time 2
x = list(x1, x2)

# Define coefficients for time points 1 and 2
beta1 = c(rep(2, 10), rep(0, p - 10))  # Coefs at time 1
beta2 = runif(p, 0.5, 2) * beta1       # Coefs at time 2, shared support with time 1

# Response variables for times 1 and 2:
y = cbind(
  x[[1]] %*% beta1 + sigma * rnorm(n),
  x[[2]] %*% beta2 + sigma * rnorm(n)
)

# Split data into training and testing sets
xtest = lapply(x, function(xx) xx[-(1:ntrain), ])  # Test covariates
ytest = y[-(1:ntrain), ]  # Test response

x = lapply(x, function(xx) xx[1:ntrain, ])  # Train covariates
y = y[1:ntrain, ]  # Train response

Now, $x$ is a list of length two:

str(x)
#> List of 2
#>  $ : num [1:300, 1:100] -1.207 0.277 1.084 -2.346 0.429 ...
#>  $ : num [1:300, 1:100] -1.493 0.303 1.172 -2.316 0.224 ...

We can call ptLasso, cv.ptLasso, plot and predict just as before:

fit = ptLasso(x, y, use.case = "timeSeries", alpha = 0)
plot(fit)  # Plot the fitted model

predict(fit, xtest, ytest = ytest)  # Predict using the fitted model
#> 
#> Call:  
#> predict.ptLasso(object = fit, xtest = xtest, ytest = ytest) 
#> 
#> 
#> 
#> alpha =  0 
#> 
#> Performance (Mean squared error):
#> 
#>             mean response_1 response_2
#> Pretrain   11.92       12.1      11.75
#> Individual 11.46       12.1      10.82
#> 
#> Support size:
#>                                          
#> Pretrain   36 (16 common + 20 individual)
#> Individual 61

# With cross validation:
cvfit = cv.ptLasso(x, y, use.case = "timeSeries")
plot(cvfit, plot.alphahat = TRUE)  # Plot cross-validated model

predict(cvfit, xtest, ytest = ytest)  # Predict using cross-validated model
#> 
#> Call:  
#> predict.cv.ptLasso(object = cvfit, xtest = xtest, ytest = ytest) 
#> 
#> 
#> 
#> alpha =  0.4 
#> 
#> Performance (Mean squared error):
#> 
#>             mean response_1 response_2
#> Pretrain   11.55      12.11      11.00
#> Individual 11.53      12.11      10.96
#> 
#> Support size:
#>                                          
#> Pretrain   54 (19 common + 35 individual)
#> Individual 65

More examples of pretraining using glmnet

Multi-response data with mixed response types

Multi-response data consists of datasets with covariates $X$ and multiple outcomes $y_1, y_2, y_3, \dots$. If these outcomes are all continuous, then it may be natural to treat this as a multitask learning problem (see the section “Multi-response data with Gaussian responses”). If the outcomes have mixed types however – e.g. $y_1$ is continuous, $y_2$ binary and $y_3$ survival – then the problem is slightly more challenging, because there are fewer methods developed for this setting.

Pretraining is a natural fit for this task: we often believe that there is shared information between $y_1$, $y_2$ and $y_3$. If we fit 3 separate models, we never get to take advantage of any shared information; further, because the outcomes have different types, there are very few methods to fit one model for all outcomes (an “overall model”).

So, we will use pretraining to pass information between models. We will:

  1. fit a model for $y_1$,
  2. extract the offset and support from this model,
  3. use the offset and support (the usual pretraining) to train models for $y_2$ and $y_3$.

There is one small detail here: we must choose the primary outcome $y_1$. This is an important choice because it will form the support and offset for the other two outcomes. We recommend making this selection using domain knowledge, but cross-validation (or a validation set) can of course be used.

Here, we walk through an example with simulated data with three outcomes $y_1, y_2$ and $y_3$. The three outcomes have overlapping support: the first 10 features are predictive for all of them. Outcomes 2 and 3 additionally have 5 features unique to each of them. We’ll define $y_1$ to be continuous, $y_2$ to be binomial and $y_3$ to be survival.

set.seed(1234)

# Define constants
n = 600          # Total number of samples
ntrain = 300     # Number of training samples
p = 50           # Number of features

# Define covariates     
x = matrix(rnorm(n*p), n, p)

# y1: continuous response
beta1 = c(rep(.5, 10), rep(0, p-10))
y1 = x %*% beta1 + rnorm(n)

# y2: binomial response
beta2 = runif(p, min = 0.5, max = 1) * beta1  # Shared with group 1
beta2 = beta2 + c(rep(0, 10), 
                  runif(5, min = 0, max = 0.5), 
                  rep(0, p-15)) # Individual
y2 = rbinom(n, 1, prob = 1/(1 + exp(-x %*% beta2)))

# y3: survival response
beta3 = beta1  # Shared with group 1
beta3 = beta3 + c(rep(0, 10), 
                  runif(5, min = -0.1, max = 0.1),
                  rep(0, p-15)) # Individual
y3.true = - log(runif(n)) / exp(x %*% beta3)
y3.cens = runif(n)
y3 = Surv(pmin(y3.true, y3.cens), y3.true <= y3.cens)

# Split into train and test
xtest = x[-(1:ntrain), ]
y1test = y1[-(1:ntrain)]
y2test = y2[-(1:ntrain)]
y3test = y3[-(1:ntrain), ]

x = x[1:ntrain, ]
y1 = y1[1:ntrain]
y2 = y2[1:ntrain]
y3 = y3[1:ntrain, ]

# Define training folds
nfolds = 10
foldid = sample(rep(1:10, trunc(nrow(x)/nfolds)+1))[1:nrow(x)]

For the first step of pretraining, train a model for the primary outcome ($y_1$) and record the offset and support – these will be used when training the models for $y_2$ and $y_3$.

y1_fit = cv.glmnet(x, y1, keep=TRUE, foldid = foldid)

train_offset = y1_fit$fit.preval[, y1_fit$lambda == y1_fit$lambda.1se]
support = which(coef(y1_fit, s = y1_fit$lambda.1se)[-1] != 0)

Now we have everything we need to train the models for $y_2$ and $y_3$. In the following code, we loop over $\alpha = 0, 0.1, \dots, 1$; in each step, we (1) train models for $y_2$ and $y_3$ and (2) record the CV error from both models. The CV error will be used to determine values of $\alpha$ to use for the final models.

cv.error.y2 = cv.error.y3 = NULL
alphalist = seq(0, 1, length.out = 11)

for(alpha in alphalist){
  pf = rep(1/alpha, p)
  pf[support] = 1
  
  offset = (1 - alpha) * train_offset
    
  y2_fit = cv.glmnet(x, y2, 
                     foldid = foldid,
                     offset = offset,
                     penalty.factor = pf,
                     family = "binomial",
                     type.measure = "auc")
  cv.error.y2 = c(cv.error.y2, max(y2_fit$cvm))
  
  y3_fit = cv.glmnet(x, y3, 
                     foldid = foldid,
                     offset = offset,
                     penalty.factor = pf,
                     family = "cox",
                     type.measure = "C")
  cv.error.y3 = c(cv.error.y3, max(y3_fit$cvm))
}

Plotting our CV performance suggests the value of $\alpha$ we should choose for each outcome:

par(mfrow = c(2, 1))
plot(alphalist, cv.error.y2, type = "b",
     main = bquote("Outcome 2: CV AUC vs " ~ alpha),
     xlab = expression(alpha),
     ylab = "CV AUC")
abline(v = alphalist[which.max(cv.error.y2)])

plot(alphalist, cv.error.y3, type = "b",
     main = bquote("Outcome 3: CV C index vs " ~ alpha),
     xlab = expression(alpha),
     ylab = "CV C index")
abline(v = alphalist[which.max(cv.error.y3)])

Now that we have selected our values of $\alpha$, we can fit the final models for $y_2$ and $y_3$:

############################################################
# Model for y2:
############################################################
best.alpha.y2 = alphalist[which.max(cv.error.y2)]
pf = rep(1/best.alpha.y2, p); pf[support] = 1

y2_fit = cv.glmnet(x, y2, 
                   foldid = foldid,
                   offset = (1-best.alpha.y2) * train_offset,
                   penalty.factor = pf,
                   family = "binomial",
                   type.measure = "auc")

############################################################
# Repeat for y3:
############################################################
best.alpha.y3 = alphalist[which.max(cv.error.y3)]
pf = rep(1/best.alpha.y3, p); pf[support] = 1

y3_fit = cv.glmnet(x, y3, 
                   foldid = foldid,
                   offset = (1-best.alpha.y3) * train_offset,
                   penalty.factor = pf,
                   family = "cox", 
                   type.measure = "C")

We will also train models for $y_2$ and $y_3$ without pretraining; this is a natural benchmark.

y2_fit_no_pretrain = cv.glmnet(x, y2, foldid = foldid, 
                               family = "binomial", type.measure = "auc")

y3_fit_no_pretrain = cv.glmnet(x, y3, 
                               foldid = foldid, 
                               family = "cox", type.measure = "C")

All of our models have been trained. Let’s compare performance with and without pretraining; we’ll start with the model for $y_2$.

testoffset = predict(y1_fit, xtest, s = "lambda.1se")

cat("Model 2 AUC with pretraining:", 
    round(assess.glmnet(y2_fit, xtest, newy = y2test, 
                        newoffset = (1 - best.alpha.y2) * testoffset)$auc, 2),
    fill=TRUE)
#> Model 2 AUC with pretraining: 0.76

cat("Model 2 AUC without pretraining:", 
    round(assess.glmnet(y2_fit_no_pretrain, xtest, newy = y2test)$auc, 2)
    )
#> Model 2 AUC without pretraining: 0.66

And now, the models for $y_3$:

cat("Model 3 C-index with pretraining:", 
    round(assess.glmnet(y3_fit, xtest, newy = y3test, 
                        newoffset = (1 - best.alpha.y3) * testoffset)$C, 2))
#> Model 3 C-index with pretraining: 0.8

cat("Model 3 C-index without pretraining:", 
    round(assess.glmnet(y3_fit_no_pretrain, xtest, newy = y3test)$C, 2)
    )
#> Model 3 C-index without pretraining: 0.78

For both $y_2$ and $y_3$, we saw a performance improvement using pretraining. We didn’t technically need to train the individual (non-pretrained) models for $y_2$ and $y_3$: during our CV loop to choose $\alpha$, we saw the cross validation performance for the individual models (the special case when $\alpha = 1$), and CV recommended a smaller value of $\alpha$ for both outcomes.

Note that, in this example, we trained a model using $y_1$, and then used this model to form the offset and support for the models for $y_2$ and $y_3$ in parallel. But using pretraining for multi-response data is flexible. Pretraining is simply a method to pass information from one model to another, and we are free to choose how information flows. For example, we chose to pass information from model 1 ($y_1$) to model 2 ($y_2$) and to model 3 ($y_3$). But we could have instead chained our models to pass information from model 1 to model 2, and then from model 2 to model 3, in the following way:

  1. fit a model for $y_1$,
  2. extract the offset and support from this model,
  3. use the offset and support (the usual pretraining) to train a model for $y_2$,
  4. extract the offset and support from this second model, and
  5. use them to train a model for $y_3$.

In this framework, the model for $y_3$ depends implicitly on both the models for $y_1$ and $y_2$, as the offset and support for the model for $y_2$ were informed by the model for $y_1$. Choosing how information should be passed between outcomes is context specific, and we recommend relying on domain knowledge for selecting an approach (though many options may be tried and compared with cross-validation or a validation set).
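Here is a minimal sketch of this chained variant, reusing x, y1, y2, y3, p and foldid from the example above and fixing $\alpha = 0.5$ at both steps for brevity (in practice, each $\alpha$ would be chosen by CV as before).

alpha = 0.5

# Steps 1-2: model for y1; record the cross-fitted offset and support.
fit1     = cv.glmnet(x, y1, keep = TRUE, foldid = foldid)
offset1  = fit1$fit.preval[, fit1$lambda == fit1$lambda.1se]
support1 = which(coef(fit1, s = "lambda.1se")[-1] != 0)

# Step 3: pretrained model for y2.
pf = rep(1/alpha, p); pf[support1] = 1
fit2 = cv.glmnet(x, y2, family = "binomial", foldid = foldid,
                 offset = (1 - alpha) * offset1,
                 penalty.factor = pf, keep = TRUE)

# Step 4: extract the offset and support from the model for y2.
# (The prevalidated fits were computed with the first-stage offset included.)
offset2  = fit2$fit.preval[, fit2$lambda == fit2$lambda.1se]
support2 = which(coef(fit2, s = "lambda.1se")[-1] != 0)

# Step 5: use them to train the model for y3.
pf = rep(1/alpha, p); pf[support2] = 1
fit3 = cv.glmnet(x, y3, family = "cox", foldid = foldid,
                 offset = (1 - alpha) * offset2,
                 penalty.factor = pf)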

Conditional average treatment effect estimation

Background: CATE estimation and pretraining

In causal inference, we are often interested in predicting the treatment effect for individual observations; this is called the conditional average treatment effect (CATE). For example, before prescribing a drug to a patient, we want to know whether the drug is likely to work well for that patient - not just whether it works well on average. One tool to model the CATE is the R-learner (Nie and Wager (2021)), which minimizes the R loss:

$$ \hat{\tau}(\cdot)=\arg \min_\tau \frac{1}{n}\sum_{i=1}^n\Bigl[ (y_i- m^*(x_i)) - (W_i-e^*(x_i))\tau(x_i) \Bigr]^2. $$ Here, $x_i$ and $y_i$ are the covariates and outcome for observation $i$, $e^*(x_i)$ is the treatment propensity and $W_i$ the treatment assignment, and $m^*(x_i)$ is the conditional mean outcome ($E[y_i \mid x = x_i]$). Then, $\hat\tau$ is the estimate of the heterogeneous treatment effect function.

This is fitted in stages: first, the R-learner fits $m^*$ and $e^*$ to get $\hat{m}^*$ and $\hat{e}^*$; then plugs in $\hat{m}^*(x_i)$ and $\hat{e}^*(x_i)$ to fit $\tau$. A minor detail is that cross-fitting (or prevalidation) is used in the first stage so that the plugin value for e.g. $\hat{m}^*(x_i)$ comes from a model trained without using $x_i$.

When $\tau$ is a linear function, then the second stage of fitting is straightforward. The values $\hat{m}^*(x_i)$ and $\hat{e}^*(x_i)$ are known, and we can use linear regression to model $y_i - \hat{m}^*(x_i)$ as a function of the weighted feature vector $(W_i-\hat{e}^*(x_i)) x_i$. This is what we will do in the following example.

How can pretraining be useful here? Well, we are separately fitting models for $m^*$ (the conditional mean) and $\tau$ (the heterogeneous treatment effect), and these two functions are likely to share support: it is sensible to assume that the features that modulate the mean outcome also modulate the heterogeneous treatment effect. We can use pretraining by (1) training a model for $m^*$ and (2) using the support from this model to guide the fitting of $\tau$. Note that the offset is not used in this case; $m^*$ and $\tau$ are designed to predict different outcomes.

A simulated example

Here is an example. We will simplify the problem by assuming treatment has been randomized – the true $e^*(x_i) = 0.5$ for all $i$.

set.seed(1234)

n = 600; ntrain = 300
p = 20
     
x = matrix(rnorm(n*p), n, p)

# Treatment assignment
w = rbinom(n, 1, 0.5)

# m^*
m.coefs = c(rep(2,10), rep(0, p-10))
m = x %*% m.coefs

# tau
tau.coefs = runif(p, 0.5, 1)*m.coefs 
tau = 1.5*m + x%*%tau.coefs

mu = m + w * tau
y  = mu + 10 * rnorm(n)
cat("Signal to noise ratio:", var(mu)/var(y-mu))
#> Signal to noise ratio: 2.301315

# Split into train/test
xtest = x[-(1:ntrain), ]
tautest = tau[-(1:ntrain)] 
wtest = w[-(1:ntrain)]

x = x[1:ntrain, ]
y = y[1:ntrain] 
w = w[1:ntrain]

# Define training folds
nfolds = 10
foldid = sample(rep(1:10, trunc(nrow(x)/nfolds)+1))[1:nrow(x)]

We begin model fitting, starting with our estimate of $e^*$ (the probability of receiving the treatment). To fit $\tau$, we will also need to record the cross-fitted $\hat{e}^*(x)$.


e_fit = cv.glmnet(x, w, foldid = foldid,
                  family="binomial", type.measure="deviance",
                  keep = TRUE)

e_hat = e_fit$fit.preval[, e_fit$lambda == e_fit$lambda.1se]
e_hat = 1/(1 + exp(-e_hat))

Now, stage 1 of pretraining: fit a model for $m^*$ and record the support. As before, we also record the cross-fitted $\hat{m}^*(x)$.


m_fit = cv.glmnet(x, y, foldid = foldid, keep = TRUE)

m_hat = m_fit$fit.preval[, m_fit$lambda == m_fit$lambda.1se]

bhat = coef(m_fit, s = m_fit$lambda.1se)
support = which(bhat[-1] != 0)

To fit $\tau$, we will regress $\tilde{y} = y_i - \hat{m}^*(x_i)$ on $\tilde{x} = (w_i - \hat{e}^*(x_i)) x_i$; we’ll define them here:

y_tilde = y - m_hat
x_tilde = cbind(as.numeric(w - e_hat) * cbind(1, x))

And now, pretraining for $\tau$. Loop over $\alpha = 0, 0.1, \dots, 1$; for each $\alpha$, fit a model for $\tau$ using the penalty factor defined by the support of $\hat{m}$ and $\alpha$. We’ll keep track of our CV MSE at each step so that we can choose the $\alpha$ that minimizes the MSE.

cv.error = NULL
alphalist = seq(0, 1, length.out = 11)

for(alpha in alphalist){
  pf = rep(1/alpha, p)
  pf[support] = 1
  pf = c(0, pf) # Don't penalize the intercept
  
  tau_fit = cv.glmnet(x_tilde, y_tilde, 
                      foldid = foldid,
                      penalty.factor = pf,
                      intercept = FALSE, # already included in x_tilde
                      standardize = FALSE)
  cv.error = c(cv.error, min(tau_fit$cvm))
}


plot(alphalist, cv.error, type = "b",
     xlab = expression(alpha), 
     ylab = "CV MSE", 
     main = bquote("CV mean squared error as a function of " ~ alpha))
abline(v = alphalist[which.min(cv.error)])

In the plot above, the value at $\alpha = 1$ corresponds to the usual R-learner, which makes no assumption about a shared support between $\tau$ and $m^*$. Based on the plot, we choose $\alpha = 0.2$ as our best performing model:


best.alpha = alphalist[which.min(cv.error)]
cat("Chosen alpha:", best.alpha)
#> Chosen alpha: 0.2

pf = rep(1/best.alpha, p)
pf[support] = 1
pf = c(0, pf)
tau_fit = cv.glmnet(x_tilde, y_tilde, foldid = foldid,
                    penalty.factor = pf,
                    intercept = FALSE,
                    standardize = FALSE)

To concretely compare the pretrained R-learner with the usual R-learner, we train the latter here:

tau_rlearner = cv.glmnet(x_tilde, y_tilde, foldid = foldid, 
                         penalty.factor = c(0, rep(1, ncol(x))),
                         intercept = FALSE,
                         standardize = FALSE)

As anticipated, pretraining improves the prediction squared error relative to the R-learner – this is how we designed our simulation:


rlearner_preds   = predict(tau_rlearner, cbind(1, xtest), s = "lambda.min")
cat("R-learner PSE: ", 
    round(mean((rlearner_preds - tautest)^2), 2))
#> R-learner PSE:  45.85

pretrained_preds = predict(tau_fit, cbind(1, xtest), s = "lambda.min")
cat("Pretrained R-learner PSE: ", 
    round(mean((pretrained_preds - tautest)^2), 2))
#> Pretrained R-learner PSE:  37.63

What if the pretraining assumption is wrong?

Here, we repeat everything from above, only now there is no overlap in the support of $m^*$ and $\tau$.

######################################################
# Simulate data
######################################################
x = matrix(rnorm(n*p), n, p)

# Treatment assignment
w = rbinom(n, 1, 0.5)

# m^*
m.coefs = c(rep(2,10), rep(0, p-10))
m = x %*% m.coefs

# tau
# Note these coefficients have no overlap with m.coefs!
tau.coefs = c(rep(0, 10), rep(2, 10), rep(0, p-20))
tau = x%*%tau.coefs

mu = m + w * tau
y  = mu + 10 * rnorm(n)
cat("Signal to noise ratio:", var(mu)/var(y-mu))
#> Signal to noise ratio: 0.6938152

# Split into train/test
xtest = x[-(1:ntrain), ]
tautest = tau[-(1:ntrain)] 
wtest = w[-(1:ntrain)]

x = x[1:ntrain, ]
y = y[1:ntrain] 
w = w[1:ntrain]

######################################################
# Model fitting: e^*
######################################################
e_fit = cv.glmnet(x, w, foldid = foldid,
                  family="binomial", type.measure="deviance",
                  keep = TRUE)
e_hat = e_fit$fit.preval[, e_fit$lambda == e_fit$lambda.1se]
e_hat = 1/(1 + exp(-e_hat))

######################################################
# Model fitting: m^*
######################################################
m_fit = cv.glmnet(x, y, foldid = foldid, keep = TRUE)

m_hat = m_fit$fit.preval[, m_fit$lambda == m_fit$lambda.1se]

bhat = coef(m_fit, s = m_fit$lambda.1se)
support = which(bhat[-1] != 0)

######################################################
# Pretraining: tau
######################################################
y_tilde = y - m_hat
x_tilde = cbind(as.numeric(w - e_hat) * cbind(1, x))

cv.error = NULL
alphalist = seq(0, 1, length.out = 11)

for(alpha in alphalist){
  pf = rep(1/alpha, p)
  pf[support] = 1
  pf = c(0, pf) # Don't penalize the intercept
  
  tau_fit = cv.glmnet(x_tilde, y_tilde, 
                      foldid = foldid,
                      penalty.factor = pf,
                      intercept = FALSE, # already included in x_tilde
                      standardize = FALSE)
  cv.error = c(cv.error, min(tau_fit$cvm))
}

# Our final model for tau:
best.alpha = alphalist[which.min(cv.error)]
cat("Chosen alpha:", best.alpha)
#> Chosen alpha: 1

pf = rep(1/best.alpha, p)
pf[support] = 1
pf = c(0, pf)
tau_fit = cv.glmnet(x_tilde, y_tilde, foldid = foldid,
                    penalty.factor = pf,
                    intercept = FALSE,
                    standardize = FALSE)

######################################################
# Fit the usual R-learner:
######################################################
tau_rlearner = cv.glmnet(x_tilde, y_tilde, foldid = foldid, 
                         penalty.factor = c(0, rep(1, ncol(x))),
                         intercept = FALSE,
                         standardize = FALSE)

######################################################
# Measure performance:
######################################################
rlearner_preds = predict(tau_rlearner, cbind(1, xtest), s = "lambda.min")
cat("R-learner prediction squared error: ", 
    round(mean((rlearner_preds - tautest)^2), 2))
#> R-learner prediction squared error:  31.11

pretrained_preds = predict(tau_fit, cbind(1, xtest), s = "lambda.min")
cat("Pretrained R-learner prediction squared error: ", 
    round(mean((pretrained_preds - tautest)^2), 2))
#> Pretrained R-learner prediction squared error:  31.11

Pretraining has not hurt our performance, even though the supports of $m^*$ and $\tau$ are not shared. Why? Recall that we defined $y = m^*(x) + W \tau(x) + \epsilon$, so the relationship between $y$ and $x$ is a function of the supports of both $m^*$ and $\tau$. In the first stage of pretraining, we fitted $m^*$ using y ~ x – so the estimated support of $m^*$ should include the support of $\tau$. As a result, using pretraining with the R-learner should not harm predictive performance.

Using non-linear bases

Suppose we have a dataset with features $X$ and response $y$, where the relationship between $X$ and $y$ is a nonlinear function of the columns of $X$. Can we still use the lasso? Yes! We can pretrain our linear model using xgboost to obtain basis functions (features). Let’s walk through an example.

Example 1: xgboost pretraining

require(xgboost)
#> Loading required package: xgboost

We start by simulating data ($n = 1800$, $p = 1000$) with a continuous response. Our coefficients $\beta$ are sparse; the first 200 entries will be drawn from a standard univariate normal, and the remainder are $0$. We define $y = 1(X > 0) \beta + \epsilon$, where $\epsilon$ is noise; we hope that xgboost will learn the splits corresponding to $X > 0$.

set.seed(1234)

n = 1800; p = 1000; noise = 5;

x     = matrix(rnorm(n * p), nrow=n, ncol=p)
xtest = matrix(rnorm(n * p), nrow=n, ncol=p)

x.model     = 1*(x > 0)     
xtest.model = 1*(xtest > 0) 

beta = c(rnorm(200), rep(0, p-200))

y     = x.model %*% beta + noise * rnorm(n)
ytest = xtest.model %*% beta + noise * rnorm(n)

train.folds = sample(rep(1:10, n/10))

Now, we run xgboost to get our basis functions:

xgbfit      = xgboost(data=x, label=y, nrounds=200, max_depth=1, verbose=0)

# predleaf = TRUE returns the index of the leaf each observation lands in,
# one column per tree; each depth-one tree is a stump with leaf indices 1
# and 2, so subtracting 1 yields binary (0/1) features.
x.boost     = predict(xgbfit, x, predleaf = TRUE) - 1
xtest.boost = predict(xgbfit, xtest, predleaf = TRUE) - 1

And we are ready for model fitting with cv.glmnet. Our two baselines are (1) a linear model that does not pretrain with xgboost, and (2) xgboost. We find that glmnet together with xgboost outperforms glmnet alone and xgboost alone.

cvfit = cv.glmnet(x.boost, y, type.measure = "mse", foldid = train.folds)
cvfit.noboost = cv.glmnet(x, y, type.measure = "mse", foldid = train.folds)

cat("Lasso with xgboost pretraining PSE: ", 
    assess.glmnet(cvfit, newx = xtest.boost, newy = ytest)$mse)
#> Lasso with xgboost pretraining PSE:  46.23225

cat("Lasso without xgboost pretraining PSE: ", 
    assess.glmnet(cvfit.noboost, newx = xtest, newy = ytest)$mse)
#> Lasso without xgboost pretraining PSE:  60.68818

cat("xgboost alone PSE: ", 
    assess.glmnet(predict(xgbfit, xtest), newy = ytest)$mse)
#> xgboost alone PSE:  49.47738

Example 2: xgboost pretraining with input groups

Now, let’s repeat the above supposing our data have input groups. The only difference here is that we will use cv.ptLasso for our model instead of cv.glmnet, and we will use the group indicators as features when fitting xgboost.

We start by simulating data with 3 groups ($600$ observations in each group) and a continuous response. As before, we will simulate $y = 1(X > 0) \beta + \epsilon$, only now we have a different $\beta$ for each group. The coefficients for the groups are shown in the table below.

Coefficients for simulating data for use with xgboost pretraining:

            1-50   51-100   101-150   151-200   201-500
  group 1    2       1         0         0         0
  group 2    2       0         1         0         0
  group 3    2       0         0         1         0

set.seed(1234)

n = 1800; p = 500; k = 3;
noise = 5;

groups = groupstest = sort(rep(1:k, n/k))

x     = matrix(rnorm(n * p), nrow=n, ncol=p)
xtest = matrix(rnorm(n * p), nrow=n, ncol=p)

x.model     = 1*(x > 0)     
xtest.model = 1*(xtest > 0)

common.beta = c(rep(2, 50), rep(0, p-50))
beta.1 = c(rep(0, 50),  rep(1, 50), rep(0, p-100)) 
beta.2 = c(rep(0, 100), rep(1, 50), rep(0, p-150)) 
beta.3 = c(rep(0, 150), rep(1, 50), rep(0, p-200)) 

y = x.model %*% common.beta + noise * rnorm(n)
y[groups == 1] = y[groups == 1] + x.model[groups == 1, ] %*% beta.1
y[groups == 2] = y[groups == 2] + x.model[groups == 2, ] %*% beta.2
y[groups == 3] = y[groups == 3] + x.model[groups == 3, ] %*% beta.3

ytest = xtest.model %*% common.beta + noise * rnorm(n)
ytest[groups == 1] = ytest[groups == 1] + xtest.model[groups == 1, ] %*% beta.1
ytest[groups == 2] = ytest[groups == 2] + xtest.model[groups == 2, ] %*% beta.2
ytest[groups == 3] = ytest[groups == 3] + xtest.model[groups == 3, ] %*% beta.3

Here are the dummy variables for our group indicators; we will use them to fit and predict with xgboost.

group.ids     = model.matrix(~as.factor(groups) - 1) 
grouptest.ids = model.matrix(~as.factor(groupstest) - 1) 
colnames(grouptest.ids) = colnames(group.ids)

Now, let’s train xgboost and predict to get our new features. Note that we now use max_depth = 2: this is intended to allow interactions between the group indicators and the other features.

xgbfit      = xgboost(data=cbind(x, group.ids), label=y, 
                      nrounds=200, max_depth=2, verbose=0)

x.boost     = predict(xgbfit, cbind(x, group.ids), predleaf = TRUE) - 1
xtest.boost = predict(xgbfit, cbind(xtest, grouptest.ids), predleaf = TRUE) - 1

Finally, we are ready to fit two models trained with cv.ptLasso: one uses the xgboost features and the other does not. As before, we find that pretraining with xgboost improves performance relative to (1) model fitting in the original feature space and (2) xgboost alone.

cvfit = cv.ptLasso(x.boost, y, groups=groups, type.measure = "mse")
preds = predict(cvfit, xtest.boost, groups=groupstest, alphatype = "varying")
preds = preds$yhatpre

cvfit.noboost = cv.ptLasso(x, y, groups=groups, type.measure = "mse")
preds.noboost = predict(cvfit.noboost, xtest, groups=groupstest, 
                        alphatype = "varying")
preds.noboost = preds.noboost$yhatpre

cat("ptLasso with xgboost pretraining PSE: ", 
    assess.glmnet(preds, newy = ytest)$mse)
#> ptLasso with xgboost pretraining PSE:  55.1535

cat("ptLasso without xgboost pretraining PSE: ", 
    assess.glmnet(preds.noboost, newy = ytest)$mse)
#> ptLasso without xgboost pretraining PSE:  66.37259

cat("xgboost alone PSE: ", 
    assess.glmnet(predict(xgbfit, xtest), newy = ytest)$mse)
#> xgboost alone PSE:  59.63781

Unsupervised pretraining

Suppose we have a dataset with features $X$ and response $y$. Suppose we also have a large set of unlabeled data $X^*$. Here, we show how to pretrain a model using $X^*$. The steps are:

  1. Do sparse PCA using $X^*$. Identify the nonzero features in the first principal component (PC).
  2. Use glmnet (or cv.glmnet) to train a model using $X$ and $y$. Define the penalty factor using the support identified by sparse PCA. Unlike the usual pretraining, there is no offset defined by sparse PCA.

In step 1, we may choose to use the nonzero features from the first $k$ PCs instead of just the first PC; in the examples that follow, we use only the first PC for simplicity.
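For instance, a hypothetical variant of the example below could take the union of the nonzero loadings from the first three PCs (using the spca function from the sparsepca package, loaded later in this section):

# Sketch: support from the first k = 3 sparse PCs,
# taken as the union of their nonzero loadings.
pcs = spca(xstar, k = 3, verbose = FALSE, alpha = 1e-2, beta = 1e-2)
nonzero.loadings = which(rowSums(pcs$loadings != 0) > 0)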

We will demonstrate unsupervised pretraining using simulated data. The covariates $X$ and $X^*$ are drawn from a multivariate normal distribution where the first 10 features describe most of the variance, and $y$ is defined as $X \beta + \epsilon$, where only the first 10 coefficients in $\beta$ are nonzero and $\epsilon$ is noise. In this example, we have 10 times as much unlabeled data as labeled data; this generally happens when labels are difficult to obtain.

require(MASS) # for mvrnorm
#> Loading required package: MASS

set.seed(1234)

n = 100; p = 150; 

mu = rep(0, p)
sigma <- matrix(runif(p^2)*2-1, ncol=p) 
sigma[, 11:p] = 1e-2 # The first 10 features are the most important
sigma <- t(sigma) %*% sigma
diag(sigma)[11:p] = 1

x     = mvrnorm(n = n, mu = mu, Sigma = sigma)
xtest = mvrnorm(n = n, mu = mu, Sigma = sigma)
xstar = mvrnorm(n = 10 * n, mu = mu, Sigma = sigma) # unlabeled

noise = 3
beta = c(rep(1, 10), rep(0, p - 10))
y     = x %*% beta     + noise * rnorm(n)
ytest = xtest %*% beta + noise * rnorm(n)

train.folds = sample(rep(1:10, 10))

Now, we do sparse PCA using $X^*$ and we identify the features with nonzero loadings in the first PC. The argument $k = 1$ means that we only obtain the first PC.

require(sparsepca)
#> Loading required package: sparsepca

pcs = spca(xstar, k = 1, verbose=FALSE, alpha=1e-2, beta=1e-2)
nonzero.loadings = which(pcs$loadings != 0)

We set ourselves up for success: because of how we simulated our data, we know that the first 10 features are those that explain the variance in $X$. These are also the features that define the relationship between $X$ and $y$. Let’s check that sparse PCA has found the right features:

nonzero.loadings
#>  [1]  1  2  3  4  5  6  7  8  9 10

Now, we are ready to model! We don’t need to call ptLasso here. All we need to do is call cv.glmnet across a grid of values of $\alpha$ with a different penalty.factor for each call. Note that offset is not used – sparse PCA identifies which features may be important, but it doesn’t suggest a value for the fitted coefficients.

To do model selection, we want to know which value of $\alpha$ gave us the best CV error. Fortunately, cv.glmnet will record the CV MSE for each model in a vector called cvm; we just need to keep track of the minimum error from each model.


alphalist = seq(0, 1, length.out = 11)

cvm = NULL
for(alpha in alphalist){
  # Define the penalty factor:
  pf = rep(1/alpha, p)
  pf[nonzero.loadings] = 1
  
  # Train a model:
  model = cv.glmnet(x, y, family = "gaussian", type.measure = "mse",
                              penalty.factor = pf, 
                              foldid = train.folds)
  
  # Record the minimum CV MSE for this model:
  cvm = c(cvm, min(model$cvm))
}

best.alpha = alphalist[which.min(cvm)]

# Plot performance as a function of alpha
# with a vertical line to show us the minimum mse:
plot(alphalist, cvm, 
     xlab = expression(alpha), 
     ylab = "Mean squared error (CV)"
)
abline(v = best.alpha)

So, using CV performance as a metric, we choose $\alpha = 0.2$. Now, we train our final model, predict on the held-out data, and measure performance. We find that pretraining gives us a boost in performance.

pf = rep(1/best.alpha, p)
pf[nonzero.loadings] = 1

selected.model =  cv.glmnet(x, y, family = "gaussian", type.measure = "mse",
                              penalty.factor = pf, 
                              foldid = train.folds)

# Prediction squared error with pretraining:
assess.glmnet(selected.model, xtest, newy = ytest, s = "lambda.min")["mse"]
#> $mse
#> lambda.min 
#>   9.914999 
#> attr(,"measure")
#> [1] "Mean-Squared Error"

without.pretraining =  cv.glmnet(x, y, family = "gaussian", type.measure = "mse",
                                 foldid = train.folds)

# Prediction squared error without pretraining:
assess.glmnet(without.pretraining, xtest, newy = ytest, s = "lambda.min")["mse"]
#> $mse
#> lambda.min 
#>   14.42903 
#> attr(,"measure")
#> [1] "Mean-Squared Error"

References

Craig, Erin, Mert Pilanci, Thomas Le Menestrel, Balasubramanian Narasimhan, Manuel Rivas, Roozbeh Dehghannasiri, Julia Salzman, Jonathan Taylor, and Robert Tibshirani. 2024. “Pretraining and the Lasso.” arXiv Preprint arXiv:2401.12911.
Friedman, Jerome, Robert Tibshirani, and Trevor Hastie. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.
Liu, Yu, and Yingcun Xia. 2022. “ODRF: Consistency of the Oblique Decision Tree and Its Random Forest.” arXiv Preprint arXiv:2211.12653.
Nie, Xinkun, and Stefan Wager. 2021. “Quasi-Oracle Estimation of Heterogeneous Treatment Effects.” Biometrika 108 (2): 299–319.