Summary

Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.

1. Introduction

With the recent advancements in biotechnology such as the use of genomewide microarrays and high throughput sequencing, regression-based modelling of high dimensional data in biology has never been more important. Two statistical problems commonly arise in regression analyses of modern biological data. The first is the selection of a set of important variables among a large number of predictors. Utilizing the sparsity principle, i.e. operating under the assumption that a small subset of the variables is driving the underlying process, through an L1-penalty has been promoted as an effective solution (Tibshirani, 1996; Efron et al., 2004). The second problem is that such a variable selection exercise often arises as an ill-posed problem where

  • (a)

    the sample size n is much smaller than the total number of variables (p) and

  • (b)

    covariates are highly correlated.

Dimension reduction techniques such as principal components analysis (PCA) or partial least squares (PLS) have recently gained much attention for addressing these issues within the context of genomic data (Boulesteix and Strimmer, 2006).

Although dimension reduction via PCA or PLS is a principled way of dealing with ill-posed problems, it does not automatically lead to selection of relevant variables. Typically, all or a large portion of the variables contribute to final direction vectors which represent linear combinations of original predictors. Imposing sparsity in the midst of the dimension reduction step might lead to simultaneous dimension reduction and variable selection. Recently, Huang et al. (2004) proposed a penalized PLS method that thresholds the final PLS estimator. Although this imposes sparsity on the solution itself, it does not necessarily lead to sparse linear combinations of the original predictors. Our goal is to impose sparsity in the dimension reduction step of PLS so that sparsity can play a direct principled role.

The rest of the paper is organized as follows. We review general principles of the PLS methodology in Section 2. We show that PLS regression for either a univariate or multivariate response provides consistent estimators only under restricted conditions, and the consistency property does not extend to the very large p and small n paradigm. We formulate sparse partial least squares (SPLS) regression by relating it to sparse principal components analysis (SPCA) (Jolliffe et al., 2003; Zou et al., 2006) in Section 3 and provide an efficient algorithm for solving the SPLS regression formulation in Section 4. Methods for tuning the sparsity parameter and the number of components are also discussed in this section. Simulation studies and an application to transcription factor activity analysis by integrating microarray gene expression and chromatin immuno-precipitation–microarray chip (CHIP–chip) data are provided in Sections 5 and 6.

2. Partial least squares regression

2.1. Description of partial least squares regression

PLS regression, which was introduced by Wold (1966), has been used as an alternative approach to ordinary least squares (OLS) regression in ill-conditioned linear regression models that arise in several disciplines such as chemistry, economics and medicine (de Jong, 1993). At the core of PLS regression is a dimension reduction technique that operates under the assumption of a basic latent decomposition of the response matrix $Y \in \mathbb{R}^{n \times q}$ and the predictor matrix $X \in \mathbb{R}^{n \times p}$: $Y = TQ^T + F$ and $X = TP^T + E$, where $T \in \mathbb{R}^{n \times K}$ is a matrix that produces $K$ linear combinations (scores), $P \in \mathbb{R}^{p \times K}$ and $Q \in \mathbb{R}^{q \times K}$ are matrices of coefficients (loadings), and $E \in \mathbb{R}^{n \times p}$ and $F \in \mathbb{R}^{n \times q}$ are matrices of random errors.

To specify the latent component matrix T such that T = XW, PLS requires finding the columns of W = (w1,w2,…,wK) from successive optimization problems. The criterion to find the kth direction vector wk for univariate Y is formulated as

$$w_k = \operatorname*{arg\,max}_{w} \; \operatorname{corr}^2(Y, Xw)\,\operatorname{var}(Xw) \quad \text{subject to} \quad w^T w = 1,\; w^T \Sigma_{XX} w_j = 0, \tag{1}$$

for j = 1,…,k−1, where ΣXX is the covariance of X. As evident from this formulation, PLS seeks direction vectors that not only relate X to Y but also capture the most variable directions in the X-space (Frank and Friedman, 1993).

There are two main formulations for finding PLS direction vectors in the context of multivariate Y. These vectors were originally derived from an algorithm, known as NIPALS (Wold, 1966), without a specific optimization problem formulation. Subsequently, a statistically inspired modification of PLS, known as SIMPLS (de Jong, 1993), was proposed with an algorithm by directly extending the univariate PLS formulation. Later, ter Braak and de Jong (1998) identified the ‘PLS2’ formulation which the NIPALS algorithm actually solves. The PLS2 formulation is given by

$$w_k = \operatorname*{arg\,max}_{w} \; w^T \sigma_{XY} \sigma_{XY}^T w \quad \text{subject to} \quad w^T (I_p - W_{k-1} W_{k-1}^{+}) w = 1 \text{ and } w^T \Sigma_{XX} w_j = 0, \tag{2}$$

for j = 1,…,k−1, where $\sigma_{XY}$ is the covariance of $X$ and $Y$, $I_p$ denotes a $p \times p$ identity matrix and $W_{k-1}^{+}$ is the unique Moore–Penrose inverse of $W_{k-1} = (w_1, \ldots, w_{k-1})$. The SIMPLS formulation is given by

$$w_k = \operatorname*{arg\,max}_{w} \; w^T \sigma_{XY} \sigma_{XY}^T w \quad \text{subject to} \quad w^T w = 1 \text{ and } w^T \Sigma_{XX} w_j = 0, \tag{3}$$

for j = 1,…,k−1. Both formulations have the same objective function but different constraints and thus yield different sets of direction vectors. Their prediction performances depend on the nature of the data (de Jong, 1993; ter Braak and de Jong, 1998). de Jong (1993) showed that both formulations become equivalent and yield the same set of direction vectors for univariate Y.

In the actual fitting of the PLS regression, either the NIPALS or the SIMPLS algorithm is used for obtaining the PLS estimator. The NIPALS algorithm produces the direction vector $d_{k+1}$ with respect to the deflated matrix $\tilde{X}_{k+1}$ at the $(k+1)$th step by solving

$$\max_d \; d^T \tilde{X}_{k+1}^T \tilde{Y}_{k+1} \tilde{Y}_{k+1}^T \tilde{X}_{k+1} d \quad \text{subject to} \quad d^T d = 1,$$

where $\tilde{X}_{k+1} = (I - T_k T_k^{+}) X$, $\tilde{Y}_{k+1} = (I - T_k T_k^{+}) Y$ and $T_k = (\tilde{X}_1 d_1, \ldots, \tilde{X}_k d_k)$. At the final $K$th step, $\hat{W}_K = (\hat{w}_1, \ldots, \hat{w}_K)$, the direction matrix with respect to the original matrix $X$, is computed as $\hat{W}_K = D_K (P_K^T D_K)^{-1}$, where $P_K = X^T T_K (T_K^T T_K)^{-1}$ and $D_K = (d_1, \ldots, d_K)$. In contrast, the SIMPLS algorithm produces the $(k+1)$th direction vector $\hat{w}_{k+1}$ directly with respect to the original matrix $X$ by solving

$$\max_w \; w^T \{I - P_k (P_k^T P_k)^{-1} P_k^T\}\, X^T Y Y^T X\, \{I - P_k (P_k^T P_k)^{-1} P_k^T\}\, w \quad \text{subject to} \quad w^T w = 1.$$

After estimating the latent components $T_K = X \hat{W}_K$ by using the $K$ direction vectors, the loadings $Q$ are estimated by solving $\min_Q \|Y - T_K Q^T\|^2$. This leads to the final estimator $\hat{\beta}^{PLS} = \hat{W}_K \hat{Q}^T$, where $\hat{Q}$ is the solution of this least squares problem.
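For concreteness, the fitting procedure just described can be sketched in a few lines of code. The following is a minimal NumPy sketch for a univariate response; the function and variable names are ours, and a production implementation would add centring, scaling and numerical safeguards.

```python
import numpy as np

def pls_nipals(X, y, K):
    """Minimal NIPALS-style PLS fit for a univariate response.

    Returns beta_hat = W_K Q_hat^T as described in the text; X is n x p,
    y has length n and K is the number of latent components.
    """
    Xk, yk = X.copy(), y.copy()
    D, T = [], []
    for _ in range(K):
        d = Xk.T @ yk                                # maximizes d^T Xk^T yk yk^T Xk d
        d /= np.linalg.norm(d)
        T.append(Xk @ d)                             # latent score X_tilde_k d_k
        D.append(d)
        Tk = np.column_stack(T)
        proj = Tk @ np.linalg.pinv(Tk)               # T_k T_k^+
        Xk, yk = X - proj @ X, y - proj @ y          # deflation
    D, T = np.column_stack(D), np.column_stack(T)
    P = X.T @ T @ np.linalg.inv(T.T @ T)             # loadings P_K
    W = D @ np.linalg.inv(P.T @ D)                   # directions w.r.t. the original X
    Q = np.linalg.lstsq(T, y, rcond=None)[0]         # solves min_Q ||y - T_K Q^T||^2
    return W @ Q                                     # beta_hat; predictions are X @ beta_hat
```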

2.2. An asymptotic property of partial least squares regression

2.2.1. Partial least squares regression for univariate Y

Stoica and Söderström (1998) derived asymptotic formulae for the bias and variance of the PLS estimator for the univariate case. These formulae are valid if the 'signal-to-noise ratio' is high or if n is large and the predictors are uncorrelated with the residuals. Naik and Tsai (2000) proved consistency of the PLS estimator under normality assumptions on both $Y$ and $X$, in addition to consistency of $S_{XY}$ and $S_{XX}$ and the following condition 1. This condition, which is known as the Helland and Almøy (1994) condition, implies that an integer $K$ exists such that exactly $K$ of the eigenvectors of $\Sigma_{XX}$ have non-zero components along $\sigma_{XY}$.

Condition 1. There are eigenvectors $v_j$ ($j = 1, \ldots, K$) of $\Sigma_{XX}$ corresponding to different eigenvalues, such that $\sigma_{XY} = \sum_{j=1}^{K} \alpha_j v_j$ and $\alpha_1, \ldots, \alpha_K$ are non-zero.

We note that the consistency proof of Naik and Tsai (2000) requires $p$ to be fixed. In many fields of modern genomic research, data sets contain a large number of variables with a much smaller number of observations (e.g. gene expression data sets where the variables are of the order of thousands and the sample size is of the order of tens). Therefore, we investigate the consistency of the PLS regression estimator under the very large $p$ and small $n$ paradigm and extend the result of Naik and Tsai (2000) to the case where $p$ is allowed to grow with $n$ at an appropriate rate. In this setting, we need additional assumptions on both $X$ and $Y$ to ensure the consistency of $S_{XX}$ and $S_{XY}$, which is the conventional assumption for fixed $p$. Recently, Johnstone and Lu (2004) proved that the leading PC of $S_{XX}$ is consistent if and only if $p/n \to 0$. Hence, we adopt their assumptions for $X$ to ensure consistency of $S_{XX}$ and $S_{XY}$. The assumptions for $X$ from Johnstone and Lu (2004) are as follows.

Assumption 1. Assume that each row of $X = (x_1, \ldots, x_n)^T$ follows the model $x_i = \sum_{j=1}^{m} v_{ij} \rho_j + \sigma_1 e_i$, for some constant $\sigma_1$, where

  • (a)

$\rho_j$, $j = 1, \ldots, m \leqslant p$, are mutually orthogonal PCs with norms $\|\rho_1\| \geqslant \|\rho_2\| \geqslant \ldots \geqslant \|\rho_m\|$,

  • (b)

the multipliers $v_{ij} \sim N(0, 1)$ are independent over the indices of both $i$ and $j$,

  • (c)

the noise vectors $e_i \sim N(0, I_p)$ are independent among themselves and of the random effects $v_{ij}$, and

  • (d)

$p(n)$, $m(n)$ and $\{\rho_j(n), j = 1, \ldots, m\}$ are functions of $n$, and the norms of the PCs converge as sequences: $\varrho(n) = (\|\rho_1(n)\|, \ldots, \|\rho_j(n)\|, \ldots) \to \varrho = (\varrho_1, \ldots, \varrho_j, \ldots)$. We also write $\varrho_+$ for the limiting $l_1$-norm: $\varrho_+ \equiv \sum_j \varrho_j$.

We remark that the above factor model for $X$ is similar to that of Helland (1990) except for having an additional random error term $e_i$. All properties of PLS in Helland (1990) will hold, as the eigenvectors of $\Sigma_{XX}$ and $\Sigma_{XX} - \sigma_1^2 I_p$ are the same. We take the assumptions for $Y$ from Helland (1990) with an additional norm condition on $\beta$.

Assumption 2. Assume that $Y$ and $X$ have the relationship $Y = X\beta + \sigma_2 f$, where $f \sim N(0, I_n)$, $\|\beta\|_2 < \infty$ and $\sigma_2$ is a constant.

We next show that, under the above assumptions and condition 1, the PLS estimator is consistent if and only if p grows much slower than n.

Theorem 1. Under assumptions 1 and 2, and condition 1,

  • (a)

if $p/n \to 0$, then $\|\hat{\beta}^{PLS} - \beta\|_2 \to 0$ in probability, and

  • (b)

if $p/n \to k_0$ for $k_0 > 0$, then $\|\hat{\beta}^{PLS} - \beta\|_2 > 0$ in probability.

The main implication of this theorem is that the PLS estimator is not suitable for very large $p$ and small $n$ problems in complete generality. Although PLS utilizes a dimension reduction technique by using a few latent factors, it cannot avoid the sample size issue since a reasonable size of $n$ is required to estimate sample covariances consistently, as shown in the proof of theorem 1 in Appendix A. A referee pointed out that a qualitatively equivalent result has been obtained by Nadler and Coifman (2005), where the root-mean-squared error of the PLS estimator has an additional error term that depends on $p^2/n^2$.

2.2.2. Partial least squares regression for multivariate Y

There are few, if any, results on the theoretical properties of PLS regression within the context of a multivariate response. Counterintuitive simulation results, in which multivariate PLS showed only a minor improvement in prediction error, were reported by Frank and Friedman (1993). Later, Helland (2000) argued by intuition that, since multivariate PLS achieves parsimonious models by using the same reduced model space for all the responses, the net gain of sharing the model space could be negative if, in fact, the responses require different reduced model spaces. Thus, we next introduce a specific setting for multivariate PLS regression in the light of Helland's (2000) intuition and extend the consistency result of univariate PLS to the multivariate case.

Assume that all the response variables have linear relationships with the same set of covariates: $Y_1 = Xb_1 + f_1, Y_2 = Xb_2 + f_2, \ldots, Y_q = Xb_q + f_q$, where $b_1, \ldots, b_q$ are $p \times 1$ coefficient vectors and $f_1, \ldots, f_q$ are independent error vectors from $N(0, \sigma^2 I_n)$. Since the shared reduced model space of each response is determined by the $b_i$s, we impose a restriction on these coefficients. Namely, we require the existence of eigenvectors $v_1, \ldots, v_K$ of $\Sigma_{XX}$ that span the solution space to which each $b_i$ belongs.

We have proved consistency of the PLS estimator for a univariate response using the facts that $S_{XY}$ is proportional to the first direction vector and that the solution space to which $\hat{\beta}^{PLS}$ belongs can be characterized explicitly by $S_{XY}, \ldots, S_{XX}^{K-1} S_{XY}$. However, for a multivariate response, PLS finds the first direction vector as the first left singular vector of $S_{XY}$. The presence of remaining directions in the column space of $S_{XY}$ makes it difficult to characterize the solution space explicitly. Furthermore, the solution space varies depending on the algorithm that is used to fit the model. If we further assume that $b_i = k_i b_1$ for constants $k_2, \ldots, k_q$, then $\Sigma_{XY}$ becomes a rank 1 matrix and these challenges are reduced, thereby leading to a setting where we can start to understand characteristics of multivariate PLS.

Condition 2 and assumption 3 below recapitulate these assumptions where the set of regression coefficients b1,b2,…,bq are represented by the coefficient matrix B.

Condition 2. There are eigenvectors $v_j$ ($j = 1, \ldots, K$) of $\Sigma_{XX}$ corresponding to different eigenvalues, such that $\sigma_{X Y_i} = \sum_{j=1}^{K} \alpha_{ij} v_j$ and $\alpha_{i1}, \ldots, \alpha_{iK}$ are non-zero for $i = 1, \ldots, q$.

Assumption 3. Assume that $Y = XB + F$, where the columns of $F$ are independent and from $N(0, \sigma^2 I_n)$. $B$ is a rank 1 matrix with singular value decomposition $\vartheta u v^T$, where $\vartheta$ denotes the singular value and $u$ and $v$ are the left and right singular vectors respectively. In addition, $\vartheta < \infty$ and $q$ is fixed.

Lemma 1 proves the convergence of the first direction vector, which plays a key role in forming the solution space of the PLS estimator. The proof is provided in Appendix A.

Lemma 1. Under assumption 3,

$$\|\hat{w}_1 - w_1\|_2 = O_p\{\sqrt{(p/n)}\},$$

where $\hat{w}_1$ is the estimate of the first direction vector $w_1$, which is given by $\Sigma_{XX} u / \|\Sigma_{XX} u\|_2$.

The main implication of lemma 1 is that, under the given conditions, the convergence rate of the first direction vector from multivariate PLS is the same as that of a single univariate PLS. Since the application of univariate PLS to a multivariate response requires estimating $q$ separate direction vectors, the advantage of multivariate PLS is immediate. The proof of lemma 1 relies on obtaining the left singular vector $s$ from the rank 1 approximation of $S_{XY}$, i.e. by minimizing $\|S_{XY} - \varsigma s t_1^T\|_F$. Here, $\|\cdot\|_F$ denotes the Frobenius norm, $\varsigma$ is the non-zero singular value of $S_{XY}$, and $s$ and $t_1$ are the left and right singular vectors respectively. As a result, $s$ can be represented by

$$s = \sum_{i=1}^{q} |t_{1i}|\,\operatorname{sgn}(t_{1i})\, S_{X Y_i} \Big/ \Big\| \sum_{i=1}^{q} t_{1i}\, S_{X Y_i} \Big\|_2,$$

where $t_{1i}$ is the $i$th element of $t_1$ and $\operatorname{sgn}(t_{1i}) = \operatorname{sgn}(s^T S_{X Y_i})$. This form of $s$ provides intuition for estimating the first multivariate PLS direction vector. Namely, the first direction vector can be interpreted as a weighted sum of sign-adjusted covariance vectors, where directions with stronger signals contribute more in a sign-adjusted manner.
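This representation of $s$ is easy to check numerically. The snippet below is our own illustration (not code from the paper): it computes the first left singular vector of the sample cross-covariance $S_{XY}$ and verifies that it coincides, up to sign, with the normalized weighted sum of sign-adjusted covariance vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 20, 5
X = rng.standard_normal((n, p))
Y = X[:, :3] @ rng.standard_normal((3, q)) + rng.standard_normal((n, q))

S_XY = X.T @ Y / n                               # p x q sample cross-covariance
U, _, Vt = np.linalg.svd(S_XY, full_matrices=False)
s, t1 = U[:, 0], Vt[0, :]                        # first left / right singular vectors

# weighted sum of the sign-adjusted covariance vectors S_XY_i (the columns of S_XY)
w_sum = sum(abs(t1[i]) * np.sign(t1[i]) * S_XY[:, i] for i in range(q))
w_sum /= np.linalg.norm(w_sum)

print(np.allclose(abs(s @ w_sum), 1.0))          # True: the two directions coincide
```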

The above discussion highlighted the advantage of multivariate PLS compared with univariate PLS in terms of estimation of the direction vectors. Next, we present the convergence result of the final PLS solution.

Theorem 2. Under assumptions 1 and 3, condition 2 and for fixed $K$ and $q$, $\|\hat{B}^{PLS} - B\|_2 \to 0$ in probability if and only if $p/n \to 0$.

Theorem 2 implies that, under the given conditions and for fixed K and q, the PLS estimator is consistent regardless of the algorithmic variant that is used if p/n→0. Although PLS solutions from algorithmic variants might differ for finite n, these solutions are consistent. Moreover, the fixed q case is practical in most applications because we can always cluster Ys into smaller groups before linking them to X. We refer to Chun and Keleş (2009) for an application of this idea within the context of expression quantitative loci mapping.

Our results for multivariate Y are based on the equal variance assumption on the components of the error matrix F. Even though the popular objective functions of multivariate PLS given in expressions (2) and (3) do not involve a scaling factor for each component of multivariate Y, in practice, Ys are often scaled before the analysis. Violation of the equal variance assumption will affect the performance of PLS regression (Helland, 2000). Therefore, if there are reasons to believe that the error levels in Y, not the signal strengths, are different, scaling will aid in satisfying the equal variance assumption of our theoretical result.

2.3. Motivation for the sparsity principle in partial least squares regression

To motivate the sparsity principle, we now explicitly illustrate, through a simple example, how a large number of irrelevant variables affects the PLS estimator. This observation is central to our methodological development. We utilize the closed form solution of Helland (1990) for univariate PLS regression, $\hat{\beta}^{PLS} = \hat{R}(\hat{R}^T S_{XX} \hat{R})^{-1} \hat{R}^T S_{XY}$, where $\hat{R} = (S_{XY}, \ldots, S_{XX}^{K-1} S_{XY})$.
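As a point of reference for the example that follows, Helland's closed form can be coded directly. The sketch below is our own, assumes a univariate response and centres the data internally; for the same number of components it should agree, up to numerical error, with an algorithmic PLS fit on the centred data.

```python
import numpy as np

def pls_helland(X, y, K):
    """Helland's (1990) closed form: beta = R (R^T S_XX R)^{-1} R^T S_XY,
    with R = (S_XY, S_XX S_XY, ..., S_XX^{K-1} S_XY), a Krylov basis."""
    n = X.shape[0]
    Xc, yc = X - X.mean(0), y - y.mean()
    S_xx, S_xy = Xc.T @ Xc / n, Xc.T @ yc / n
    R = np.column_stack([np.linalg.matrix_power(S_xx, j) @ S_xy for j in range(K)])
    return R @ np.linalg.solve(R.T @ S_xx @ R, R.T @ S_xy)
```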

Assume that $X$ is partitioned into $(X_1, X_2)$, where $X_1$ and $X_2$ denote the $p_1$ relevant and $p - p_1$ irrelevant variables respectively, and each column of $X_2$ follows $N(0, I_n)$. We assume the existence of a latent variable ($K = 1$) as well as a fixed number of relevant variables ($p_1$), and let $p$ grow at the rate $O(kn)$, where the constant $k$ is sufficiently large to have

$$\max\{\sigma_{X_1 Y}^T \sigma_{X_1 Y},\; \sigma_{X_1 Y}^T \Sigma_{X_1 X_1} \sigma_{X_1 Y}\} \leqslant k\, \sigma_1^2 \sigma_2^2, \tag{4}$$

where σ1 and σ2 are from Section 2.2.1.

It is not difficult to obtain a sufficiently large k to satisfy condition (4) for fixed p1. Then, the PLS estimator can be approximated by

$$\hat{\beta}^{PLS} = \frac{S_{X_1 Y}^T S_{X_1 Y} + S_{X_2 Y}^T S_{X_2 Y}}{S_{X_1 Y}^T S_{X_1 X_1} S_{X_1 Y} + 2\, S_{X_1 Y}^T S_{X_1 X_2} S_{X_2 Y} + S_{X_2 Y}^T S_{X_2 X_2} S_{X_2 Y}}\, S_{XY} \approx \frac{S_{X_2 Y}^T S_{X_2 Y}}{S_{X_2 Y}^T S_{X_2 X_2} S_{X_2 Y}}\, S_{XY} \tag{5}$$
$$= O(k^{-1})\, S_{XY}. \tag{6}$$

Approximation (5) follows from lemma 2 in Appendix A and condition (4). Approximation (6) is due to the fact that the largest and smallest eigenvalues of the Wishart matrix are $O(k)$ (Geman, 1980). In this example, the large number of noise variables forces the loadings in the direction of $S_{XY}$ to be attenuated and thereby causes inconsistency.
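The attenuation can also be seen numerically. The code below is our own illustration rather than a result from the paper: it evaluates the one-component PLS estimator appearing in approximation (5) with and without a large block of irrelevant predictors; the dimensions and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p_noise = 100, 5, 5000
X1 = rng.standard_normal((n, p1))                    # relevant variables
y = X1 @ np.ones(p1) + rng.standard_normal(n)
X2 = rng.standard_normal((n, p_noise))               # irrelevant variables
X = np.hstack([X1, X2])

def pls1_coef(X, y):
    """One-component PLS estimator (S_XY^T S_XY / S_XY^T S_XX S_XY) S_XY."""
    Xc, yc = X - X.mean(0), y - y.mean()
    s_xy = Xc.T @ yc / len(yc)
    Xs = Xc @ s_xy                                   # X S_XY, avoids forming S_XX
    return (s_xy @ s_xy) / (Xs @ Xs / len(yc)) * s_xy

print(pls1_coef(X1, y).round(2))                     # loadings on the relevant variables
print(pls1_coef(X, y)[:p1].round(2))                 # same loadings, attenuated by the noise block
```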

From a practical point of view, since latent factors of PLS have contributions from all the variables, the interpretation becomes difficult in the presence of large numbers of noise variables. Motivated by the observation that noise variables enter the PLS regression via direction vectors and attenuate estimates of the regression parameters, we consider imposing sparsity on the direction vectors.

3. Sparse partial least squares regression

3.1. Finding the first sparse partial least squares direction vector

We start with the formulation of the first SPLS direction vector and illustrate the main ideas within this simpler problem. We formulate the objective function for the first SPLS direction vector by adding an L1-constraint to problems (2) and (3):

$$\max_w \; w^T M w \quad \text{subject to} \quad w^T w = 1, \; |w|_1 \leqslant \lambda, \tag{7}$$

where $M = X^T Y Y^T X$ and $\lambda$ determines the amount of sparsity. The same approach has been used in SPCA. By specifying $M$ to be $X^T X$ in expression (7), this objective function coincides with that of the simplified component technique–lasso, 'SCOTLASS' (Jolliffe et al., 2003), and both SPLS and SPCA correspond to the same class of maximum eigenvalue problems with a sparsity constraint.

Jolliffe et al. (2003) pointed out that the solution of this formulation tends not to be sufficiently sparse and the problem is not convex. This convexity issue was revisited by d’Aspremont et al. (2007) in direct SPCA by reformulating the criterion in terms of W = wwT, thereby producing a semidefinite programming problem that is known to be convex. However, the sparsity issue remained.

To obtain a sufficiently sparse solution, we reformulate the SPLS criterion (7) by generalizing the regression formulation of SPCA (Zou et al., 2006). This formulation promotes the exact zero property by imposing an L1-penalty onto a surrogate of the direction vector (c) instead of the original direction vector (w), while keeping w and c close to each other:

$$\min_{w, c} \; -\kappa\, w^T M w + (1 - \kappa)(c - w)^T M (c - w) + \lambda_1 |c|_1 + \lambda_2 |c|_2^2 \quad \text{subject to} \quad w^T w = 1. \tag{8}$$

In this formulation, the L1-penalty encourages sparsity on $c$ whereas the L2-penalty addresses the potential singularity in $M$ when solving for $c$. We shall rescale $c$ to have norm 1 and use this scaled version as the estimated direction vector. We note that this problem becomes that of SCOTLASS when $w = c$ and $M = X^T X$, SPCA when $\kappa = 1/2$ and $M = X^T X$, and the original maximum eigenvalue problem of PLS when $\kappa = 1$. We aim to reduce the effect of the concave part (hence the local solution issue) by using a small $\kappa$.

3.2. Solution for the generalized regression formulation of sparse partial least squares

We solve the generalized regression formulation of SPLS given in expression (8) by iterating alternately between solving for $w$ with $c$ fixed and solving for $c$ with $w$ fixed.

For the problem of solving w for fixed c, the objective function in problem (8) becomes

$$\min_w \; -\kappa\, w^T M w + (1 - \kappa)(c - w)^T M (c - w) \quad \text{subject to} \quad w^T w = 1. \tag{9}$$

For $0 < \kappa < 1/2$, problem (9) can be rewritten as

$$\min_w \; (Z^T w - \kappa' Z^T c)^T (Z^T w - \kappa' Z^T c) \quad \text{subject to} \quad w^T w = 1,$$

where $Z = X^T Y$ and $\kappa' = (1 - \kappa)/(1 - 2\kappa)$. This constrained least squares problem can be solved via the method of Lagrange multipliers, and the solution is given by $w = \kappa'(M + \lambda^* I)^{-1} M c$, where the multiplier $\lambda^*$ is the solution of $c^T M (M + \lambda^* I)^{-2} M c = \kappa'^{-2}$. For $\kappa = 1/2$, the objective function in problem (9) reduces to $-w^T M c$ and the solution is $w = U V^T$, where $U$ and $V$ are obtained from the singular value decomposition of $M c$ (Zou et al., 2006).

When solving for c for fixed w, problem (8) becomes

$$\min_c \; (Z^T c - Z^T w)^T (Z^T c - Z^T w) + \lambda_1 |c|_1 + \lambda_2 |c|_2^2. \tag{10}$$

This problem, which is equivalent to the naive elastic net (EN) problem of Zou and Hastie (2005) when $Y$ in the naive EN is replaced with $Z^T w$, can be solved efficiently via the least angle regression (LARS) algorithm (Efron et al., 2004). SPLS often requires a large $\lambda_2$-value to solve problem (10) because $Z^T$ is a $q \times p$ matrix with usually small $q$, i.e. $q = 1$ for univariate $Y$. As a remedy, we use an EN formulation with $\lambda_2 = \infty$, which yields a solution in the form of a soft thresholded estimator (Zou and Hastie, 2005). This concludes our solution of the regression formulation for general $Y$ (univariate or multivariate). We further have the following simplification for univariate $Y$ ($q = 1$).

Theorem 3. For univariate $Y$, the solution of problem (8) is $\hat{c} = (|\tilde{Z}| - \lambda_1/2)_{+} \operatorname{sgn}(\tilde{Z})$, where $\tilde{Z} = X^T Y / \|X^T Y\|$ is the first direction vector of PLS.

Proof. For a given $c$ and $\kappa = 0.5$, it follows that $\hat{w} = \tilde{Z}$, since the singular value decomposition of $Z Z^T c$ yields $U = \tilde{Z}$ and $V = 1$. For a given $c$ and $0 < \kappa < 0.5$, the solution is given by $w = \{Z^T c / (\|Z\|^2 + \lambda^*)\} Z$ by the Woodbury formula (Golub and van Loan, 1987). Noting that $Z^T c / (\|Z\|^2 + \lambda^*)$ is a scalar and using the norm constraint, we have $\hat{w} = \tilde{Z}$. Since $\hat{w}$ does not depend on $c$, we have $\hat{c} = (|\tilde{Z}| - \lambda_1/2)_{+} \operatorname{sgn}(\tilde{Z})$ for large $\lambda_2$.
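The alternating scheme of this section can be sketched compactly. The code below is our own sketch under the $\kappa = 1/2$ branch and the $\lambda_2 = \infty$ soft thresholding discussed above; thresholding the normalized vector $Mw/\|Mw\|$ so that $\lambda_1$ is on the scale of theorem 3 is a choice of ours rather than a prescription from the paper. For univariate $Y$ the loop reduces to the soft thresholded direction vector of theorem 3, and in practice $\lambda_1$ would be set by the $\eta$-fraction or FDR rules of Section 4.2.

```python
import numpy as np

def soft_threshold(z, delta):
    """Componentwise soft thresholding (|z| - delta)_+ sgn(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - delta, 0.0)

def spls_direction(X, Y, lam1, n_iter=100, tol=1e-8):
    """Sketch of the first SPLS direction vector via the alternating updates."""
    Y = Y.reshape(X.shape[0], -1)                    # allow univariate or multivariate responses
    Z = X.T @ Y                                      # p x q
    M = Z @ Z.T                                      # M = X^T Y Y^T X
    c = Z[:, 0] / np.linalg.norm(Z[:, 0])            # crude starting value
    for _ in range(n_iter):
        w = M @ c                                    # kappa = 1/2: maximize w^T M c with ||w|| = 1
        w /= np.linalg.norm(w)
        u = M @ w
        u /= np.linalg.norm(u)
        c_new = soft_threshold(u, lam1 / 2)          # lambda_2 = infinity gives soft thresholding
        if np.linalg.norm(c_new) > 0:
            c_new /= np.linalg.norm(c_new)           # rescale the surrogate to norm 1
        if np.linalg.norm(c_new - c) < tol:
            return c_new
        c = c_new
    return c
```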

4. Implementation and algorithmic details

4.1. Sparse partial least squares algorithm

In this section, we present the complete SPLS algorithm which encompasses the formulation of the first SPLS direction vector from Section 3.1 as well as an efficient algorithm for obtaining all the other direction vectors and coefficient estimates.

In principle, the objective function for the first SPLS direction vector can be utilized at each step of the NIPALS or SIMPLS algorithm to obtain the rest of the direction vectors. We call this idea the naive SPLS algorithm. However, this naive SPLS algorithm loses the conjugacy of the direction vectors. A similar issue appears in SPCA, where none of the methods proposed (Jolliffe et al., 2003; Zou et al., 2006; d’Aspremont et al., 2007) produces orthogonal sparse principal components. Although conjugacy can be obtained by the Gram–Schmidt conjugation of the derived sparse direction vectors, these post-conjugated vectors do not inherit the property of Krylov subsequences which is known to be crucial for the convergence of the algorithm (Krämer, 2007). Essentially, such a post-orthogonalization does not guarantee the existence of the solution among the iterations.

To address this concern, we propose an SPLS algorithm which leads to a sparse solution by keeping the Krylov subsequence structure of the direction vectors in a restricted $X$-space of selected variables. Specifically, at each step of either the NIPALS or the SIMPLS algorithm, it searches for relevant variables, the so-called active variables, by optimizing expression (8), and updates all direction vectors to form a Krylov subsequence on the subspace of the active variables. This is simply achieved by conducting PLS regression with the selected variables. Let $\mathcal{A}$ be an index set for the active variables and $K$ the number of components. Denote by $X_{\mathcal{A}}$ the submatrix of $X$ whose column indices are contained in $\mathcal{A}$. The SPLS algorithm can utilize either the NIPALS or the SIMPLS algorithm as described below.

Step 1: set $\hat{\beta}^{PLS} = 0$, $\mathcal{A} = \{\}$ and $k = 1$. For the NIPALS algorithm, set $Y_1 = Y$; for the SIMPLS algorithm, set $X_1 = X$.

Step 2: while $k \leqslant K$,

  • (a)

find $\hat{w}$ by solving objective (8) of Section 3.1 with $M = X^T Y_1 Y_1^T X$ for the NIPALS algorithm and $M = X_1^T Y Y^T X_1$ for the SIMPLS algorithm,

  • (b)

update $\mathcal{A}$ as $\{i : \hat{w}_i \neq 0\} \cup \{i : \hat{\beta}_i^{PLS} \neq 0\}$,

  • (c)

fit PLS with $X_{\mathcal{A}}$ by using $k$ latent components, and

  • (d)

update $\hat{\beta}^{PLS}$ by using the new PLS estimates of the direction vectors and update $k$ with $k \leftarrow k + 1$; for the NIPALS algorithm, update $Y_1$ through $Y_1 \leftarrow Y - X \hat{\beta}^{PLS}$, and for the SIMPLS algorithm, update $X_1$ through $X_{1, \mathcal{A}} \leftarrow X_{\mathcal{A}} \{I - P_{\mathcal{A}} (P_{\mathcal{A}}^T P_{\mathcal{A}})^{-1} P_{\mathcal{A}}^T\}$, where $P_{\mathcal{A}} = X_{\mathcal{A}}^T X_{\mathcal{A}} W_{\mathcal{A}} (W_{\mathcal{A}}^T X_{\mathcal{A}}^T X_{\mathcal{A}} W_{\mathcal{A}})^{-1}$ (a code sketch of this loop follows the list).
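The sketch below covers the univariate NIPALS variant of this loop; it reuses pls_nipals() and soft_threshold() from the earlier sketches, applies the $\eta$-fraction thresholding of Section 4.2 to the current direction vector, and assumes that at least $k$ variables are active at step $k$. The names and the exact bookkeeping are ours.

```python
import numpy as np

def spls_nipals(X, y, K, eta):
    """Sketch of the SPLS-NIPALS loop: select active variables by thresholding
    the direction vector for the deflated response, then refit PLS on them."""
    p = X.shape[1]
    beta = np.zeros(p)
    active = np.zeros(p, dtype=bool)
    for k in range(1, K + 1):
        y1 = y - X @ beta                                  # NIPALS: deflate Y
        w = X.T @ y1
        w /= np.linalg.norm(w)                             # direction vector for the residual
        w = soft_threshold(w, eta * np.max(np.abs(w)))     # eta-fraction soft thresholding
        active |= (w != 0)                                 # union with {i: beta_i != 0}
        beta = np.zeros(p)
        beta[active] = pls_nipals(X[:, active], y, K=k)    # PLS refit with k components
    return beta, np.where(active)[0]
```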

The original NIPALS algorithm includes deflation steps for both the $X$- and the $Y$-matrix, but the same $M$-matrix can be computed via the deflation of either $X$ or $Y$ owing to the idempotency of the projection matrix. In our SPLS–NIPALS algorithm, we chose to deflate the $Y$-matrix because, in that case, the eigenvector $X^T Y_1 / \|X^T Y_1\|$ of $M$ is proportional to the current correlations in the LARS algorithm for univariate $Y$. Hence, the LARS and SPLS–NIPALS algorithms use the same criterion to select active variables in this case. However, the SPLS–NIPALS algorithm differs from LARS in that it selects more than one variable at a time and utilizes the conjugate gradient (CG) method to compute the coefficients at each step (Friedman and Popescu, 2004). This, in particular, implies that the SPLS–NIPALS algorithm can select a group of correlated variables simultaneously. The cost of computing coefficients at each step of the SPLS algorithm is less than or equal to that of LARS as the CG method avoids matrix inversion.

The SPLS–SIMPLS algorithm has similar attributes to the SPLS–NIPALS algorithm. It also uses the CG method, selects more than one variable at each step and handles multivariate responses. However, its $M$-matrix is no longer proportional to the current correlations of the LARS algorithm. SIMPLS yields direction vectors that directly satisfy the conjugacy constraint, which may hamper its ability to reveal relevant variables. In contrast, the direction vectors at each step of the NIPALS algorithm are derived to maximize the current correlations on the basis of residual matrices, and conjugated direction vectors are computed at the final stage. Thus, the SPLS–NIPALS algorithm is more likely to choose the correct set of relevant variables when the signals of the relevant variables are weak. A small simulation study investigating this point is presented in Section 5.1.

4.2. Choosing the thresholding parameter and the number of hidden components

Although the SPLS regression formulation in expression (8) has four tuning parameters ($\kappa$, $\lambda_1$, $\lambda_2$ and $K$), only two of these are key tuning parameters, namely the thresholding parameter $\lambda_1$ and the number of hidden components $K$. As we discussed in theorem 3 of Section 3.2, the solution does not depend on $\kappa$ for univariate $Y$. For multivariate $Y$, we show with a simulation study in Section 5.2 that setting $\kappa$ smaller than $1/2$ generally avoids local solution issues. Different $\kappa$-values have the effect of starting the algorithm with different starting values. Since the algorithm is computationally inexpensive (the average run time including the tuning is only 9 min for a sample size of $n = 100$ with $p = 5000$ predictors on a 64-bit machine with a 2.66 GHz central processor unit), users are encouraged to try several $\kappa$-values. Finally, as described in Section 3.2, setting the $\lambda_2$-parameter to $\infty$ yields the thresholded estimator which depends only on $\lambda_1$. Therefore, we proceed with the tuning mechanisms for the two key parameters $\lambda_1$ and $K$. We start with univariate $Y$, since imposing an L1-penalty has the simple form of thresholding, and then we discuss multivariate $Y$.

We start by describing a form of the soft thresholded direction vector $\tilde{w}$:

$$\tilde{w} = \left(|\hat{w}| - \eta \max_{1 \leqslant i \leqslant p} |\hat{w}_i|\right) I\left(|\hat{w}| \geqslant \eta \max_{1 \leqslant i \leqslant p} |\hat{w}_i|\right) \operatorname{sgn}(\hat{w}),$$

where $0 \leqslant \eta \leqslant 1$. Here, $\eta$ plays the role of the sparsity parameter $\lambda_1$ in theorem 3. This form of soft thresholding retains components that are greater than some fraction of the maximum component. A similar approach was utilized in Friedman and Popescu (2004), with hard thresholding as opposed to our soft thresholding scheme. The single tuning parameter $\eta$ is tuned by cross-validation (CV) for all the direction vectors. We do not use separate sparsity parameters for individual directions because tuning multiple parameters is computationally prohibitive and may not produce a unique minimum for the CV criterion.

Next, we describe a hard thresholding approach based on control of the false discovery rate (FDR). SPLS selects variables which exhibit high correlations with $Y$ in the first step and adds variables with high partial correlations in the subsequent steps. Although we are imposing sparsity on direction vectors via an L1-penalty, the thresholded form of our solution for univariate $Y$ allows us to compare and contrast our approach directly with the supervised PC approach of Bair et al. (2006), which operates by an initial screening of the predictor variables. Selecting related variables on the basis of correlations has been utilized in supervised PCs, and, in a way, we further extend this approach by utilizing partial correlations in the later steps. Owing to uniform consistency of correlations (or partial correlations after taking into account the effect of relevant variables), FDR control is expected to work well even in the large $p$ and small $n$ scenario (Kosorok and Ma, 2007). As described above, the components of the direction vectors for univariate $Y$ have the form of a correlation coefficient (or a partial correlation coefficient after the first step) between the individual covariate and the response, and a thresholding parameter can be determined by controlling the FDR at a prespecified level $\alpha$. Let $\hat{r}_{Y X_i \mid T_1^{k-1}}$ denote the sample partial correlation of the $i$th variable $X_i$ with $Y$ given $T_1^{k-1}$, where $T_1^{k-1}$ denotes the set of the first $k - 1$ latent variables included in the model. Under the normality assumption on $X$ and $Y$ and the null hypothesis $H_{0i} : r_{Y X_i \mid T_1^{k-1}} = 0$, the $z$-transformed (partial) correlation coefficients have the distribution (Bendel and Afifi, 1976)

$$\sqrt{n - |T_1^{k-1}| - 3}\; \frac{1}{2} \ln\!\left(\frac{1 + \hat{r}_{Y X_i \mid T_1^{k-1}}}{1 - \hat{r}_{Y X_i \mid T_1^{k-1}}}\right) \sim N(0, 1).$$

We compute the corresponding p-values $\tilde{p}_i$, for $i = 1, \ldots, p$, for the (partial) correlation coefficients by using this statistic and arrange them in ascending order: $\tilde{p}_{[1]} \leqslant \cdots \leqslant \tilde{p}_{[p]}$. After defining $\hat{m} = \max\{m : \tilde{p}_{[m]} \leqslant (m/p)\alpha\}$, the hard thresholded direction vector becomes $\tilde{w} = \hat{w}\, I(|\hat{w}| > |\hat{w}|_{[p - \hat{m} + 1]})$ based on the Benjamini and Hochberg (1995) FDR procedure.
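A hedged sketch of this FDR-based hard thresholding is given below. It assumes two-sided p-values for the Fisher $z$-statistic above and ignores ties among the $|\hat{w}_i|$; r_hat holds the sample (partial) correlations of the covariates with $Y$, df is the number of latent variables already in the model, and all names are ours.

```python
import numpy as np
from scipy import stats

def fdr_threshold(w_hat, r_hat, n, df, alpha):
    """Hard thresholding of the direction vector via Benjamini-Hochberg FDR control."""
    z = np.sqrt(n - df - 3) * 0.5 * np.log((1 + r_hat) / (1 - r_hat))  # Fisher z-transform
    pvals = 2 * stats.norm.sf(np.abs(z))                               # two-sided p-values
    order = np.sort(pvals)
    below = np.where(order <= alpha * np.arange(1, len(pvals) + 1) / len(pvals))[0]
    m_hat = below[-1] + 1 if below.size else 0                         # m_hat = max{m: p_[m] <= (m/p) alpha}
    if m_hat == 0:
        return np.zeros_like(w_hat)
    cutoff = np.sort(np.abs(w_hat))[-m_hat]                            # keep the m_hat largest |w_i|
    return w_hat * (np.abs(w_hat) >= cutoff)
```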

We remark that the solution from FDR control is minimax optimal if $\alpha \in (0, 1/2]$ and $\alpha > \gamma / \log(p)$ ($\gamma > 0$) under independence among the tests. As long as $\alpha$ decreases at an appropriate rate as $p$ increases, thresholding by FDR control is optimal without knowledge of the level of sparsity and hence reduces computation considerably. Although we do not have this independence, this adaptivity may work since the argument for minimax optimality depends mainly on marginal properties (Abramovich et al., 2006).

As discussed in Section 3.2, for multivariate $Y$, the solution for SPLS is obtained through iterations and the resulting solution has the form of soft thresholding. Although hard thresholding with FDR control is no longer applicable, we can still employ soft thresholding based on CV. The number of hidden components, $K$, is tuned by CV as in the original PLS. We note that CV is a function of two arguments for soft thresholding and of one argument for hard thresholding, which makes hard thresholding computationally much cheaper than soft thresholding.

5. Simulation studies

5.1. Comparison between SPLS–NIPALS and SPLS–SIMPLS algorithms

We conducted a small simulation study to compare the variable selection performances of the two SPLS variants, SPLS–NIPALS and SPLS–SIMPLS. The data-generating mechanism is set as follows. Columns of $X$ are generated by $X_i = H_j + \epsilon_i$ for $n_{j-1} + 1 \leqslant i \leqslant n_j$, where $j = 1, \ldots, 3$ and $(n_0, n_1, n_2, n_3) = (0, 6, 13, 30)$. Here, $H_1$, $H_2$ and $H_3$ are independent random vectors from $N(0, 25 I_{100})$ and the $\epsilon_i$s are from $N(0, I_{100})$. Columns of $Y$ are generated by $Y_1 = 0.1 H_1 - 2 H_2 + f_1$ and $Y_{i+1} = 1.2 Y_i + f_i$, where the $f_i$s are from $N(0, I_{100})$, $i = 1, \ldots, q = 10$. We generated 100 simulated data sets and analysed them using both the SPLS–NIPALS and the SPLS–SIMPLS algorithms. Table 1 reports the first quartile, median and third quartile of the numbers of correctly and incorrectly selected variables. We observe that the SPLS–NIPALS algorithm performs better, identifying larger numbers of correct variables with a smaller number of false positive results compared with the SPLS–SIMPLS algorithm. Further investigation reveals that the relevant variables that the SPLS–SIMPLS algorithm misses are typically from the $H_1$-component with the weaker signal.

Table 1

Variable selection performances of SPLS–NIPALS versus SPLS–SIMPLS algorithms

| Method | Number of correct variables | Number of incorrect variables |
| --- | --- | --- |
| SPLS–NIPALS | 9.75 / 12 / 13 | 0 / 0 / 2 |
| SPLS–SIMPLS | 7 / 9 / 13 | 0 / 2 / 5 |

First quartile/median/third quartile.

5.2. Setting the weight factor κ in the general regression formulation of problem (8)

We ran a small simulation study to examine how the generalization of the regression formulation given in expression (8) helps to avoid the local solution issue. The data-generating mechanism is set as follows. Columns of $X$ are generated by $X_i = H_j + \epsilon_i$ for $n_{j-1} + 1 \leqslant i \leqslant n_j$, where $j = 1, \ldots, 4$ and $(n_0, \ldots, n_4) = (0, 4, 8, 10, 100)$. Here, $H_1$ is a random vector from $N(0, 290 I_{1000})$, $H_2$ is a random vector from $N(0, 300 I_{1000})$, $H_3 = 0.3 H_1 + 0.925 H_2$ and $H_4 = 0$. The $\epsilon_i$s are independent identically distributed random vectors from $N(0, I_{1000})$. For illustration, we use $M = X^T X$. When $\kappa = 0.5$, the algorithm becomes stuck at a local solution in 27 out of 100 simulation runs. When $\kappa = 0.1, 0.3, 0.4$, the correct solution is obtained in all runs. This indicates that a slight imbalance giving less weight to the concave part of the objective function in formulation (8) might lead to a numerically easier optimization problem.

5.3. Comparisons with recent variable selection methods in terms of prediction power and variable selection

In this section, we compare SPLS regression with other popular methods in terms of prediction and variable selection performances in various correlated covariates settings. We include OLS and the lasso, which are not particularly tailored for correlated variables. We also consider dimension reduction methods such as PLS, principal component regression (PCR) and supervised PCs, which ought to be appropriate for highly correlated variables. The EN is also included in these comparisons since it can handle highly correlated variables.

We first consider the case where there is a reasonable number of observations (i.e. $n > p$) and set $n = 400$ and $p = 40$. We vary the number of spurious variables as $q = 10$ and $q = 30$, and the noise-to-signal ratios as 0.1 and 0.2. Hidden variables $H_1, \ldots, H_3$ are from $N(0, 25 I_n)$, and the columns of the covariate matrix $X$ are generated by $X_i = H_j + \epsilon_i$ for $n_{j-1} + 1 \leqslant i \leqslant n_j$, where $j = 1, \ldots, 3$, $(n_0, \ldots, n_3) = (0, (p - q)/2, p - q, p)$ and $\epsilon_1, \ldots, \epsilon_p$ are drawn independently from $N(0, I_n)$. $Y$ is generated by $Y = 3 H_1 - 4 H_2 + f$, where $f$ is normally distributed with mean 0. This mechanism generates covariates, subsets of which are highly correlated.
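A sketch of this data-generating mechanism is given below. The way in which the noise-to-signal ratio ns fixes the variance of $f$ is our reading of the description above rather than a detail stated in the text, and the function name is ours.

```python
import numpy as np

def simulate_design(n=400, p=40, q=10, ns=0.1, seed=0):
    """Covariates built from three hidden variables; the last q columns are spurious."""
    rng = np.random.default_rng(seed)
    H = [rng.normal(0, 5, size=n) for _ in range(3)]           # H_j ~ N(0, 25 I_n)
    bounds = [0, (p - q) // 2, p - q, p]                       # (n_0, ..., n_3)
    X = np.empty((n, p))
    for j in range(3):
        for i in range(bounds[j], bounds[j + 1]):
            X[:, i] = H[j] + rng.standard_normal(n)            # X_i = H_j + eps_i
    signal = 3 * H[0] - 4 * H[1]
    f = rng.normal(0, np.sqrt(ns) * signal.std(), size=n)      # noise level set from ns (our reading)
    return X, signal + f

X, y = simulate_design()
```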

We then consider the case where the sample size is smaller than the number of variables (i.e. $n < p$) and set $n = 40$ and $p = 80$. The numbers of spurious variables are set to $q = 20$ and $q = 40$, and the noise-to-signal ratios to 0.1 and 0.2. $X$ and $Y$ are generated similarly to the above $n > p$ case.

We select the optimal tuning parameters for most of the methods by using tenfold CV. Since the CV curve tends to be flat in this simulation study, we first identify the parameters whose CV scores are less than 1.1 times the minimum of the CV scores. We then select the smallest $K$ and the largest $\eta$ among these parameters for SPLS, the largest $\lambda_2$ and the smallest step size for the EN, and the smallest step size for the lasso. We use the $F$-statistic (the default CV score in the R package superpc) from the fitted model as a CV score for supervised PC. We then use the same procedure to generate an independent test data set and predict $Y$ on this test data set on the basis of the fitted models. For each parameter setting, we perform 30 runs of simulations and compute the mean and standard deviation of the mean-squared prediction errors. The averages of the sensitivities and specificities are computed across the simulations to compare the accuracy of variable selection. The results are presented in Tables 2 and 3.
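The selection rule among near-optimal CV scores can be summarized in a few lines. The sketch below treats 'smallest $K$, then largest $\eta$' lexicographically, which is our reading of the rule above; cv_scores maps hypothetical $(K, \eta)$ pairs to mean CV errors.

```python
def select_spls_params(cv_scores, tol=1.1):
    """Pick the (K, eta) pair within tol x the minimum CV score:
    smallest K first, then the largest eta."""
    best = min(cv_scores.values())
    admissible = [ke for ke, score in cv_scores.items() if score <= tol * best]
    return min(admissible, key=lambda ke: (ke[0], -ke[1]))

# example: select_spls_params({(1, 0.9): 1.05, (2, 0.8): 1.00, (2, 0.9): 1.02})
# returns (1, 0.9), the smallest K among settings within 1.1 times the minimum
```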

Table 2

Mean-squared prediction error for simulations I and II

| p/n/q/ns setting | PLS (SE) | PCR (SE) | OLS (SE) | Lasso (SE) | SPLS1 (SE) | SPLS2 (SE) | Supervised PCs (SE) | EN (SE) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 40/400/10/0.1 | 31417.9 (552.5) | 15717.1 (224.2) | 31444.4 (554.0) | 208.3 (10.4) | 199.8 (9.0) | 201.4 (11.2) | 198.6 (9.5) | 200.1 (10.0) |
| 40/400/10/0.2 | 31872.0 (544.4) | 16186.5 (231.4) | 31956.9 (548.9) | 697.3 (15.7) | 661.4 (13.9) | 658.7 (15.7) | 658.8 (14.2) | 685.5 (17.7) |
| 40/400/30/0.1 | 31409.1 (552.5) | 20914.2 (1324.4) | 31431.7 (554.2) | 205.0 (9.5) | 203.3 (10.1) | 205.5 (11.1) | 202.7 (9.4) | 203.1 (9.7) |
| 40/400/30/0.2 | 31863.7 (544.1) | 21336.0 (1307.6) | 31939.3 (549.1) | 678.6 (13.6) | 661.2 (14.4) | 663.5 (15.6) | 663.5 (14.4) | 684.9 (19.3) |
| 80/40/20/0.1 | 29121.4 (1583.2) | 15678.0 (652.9) |  | 485.2 (48.4) | 538.4 (70.5) | 494.6 (63.0) | 720.0 (240.0) | 533.9 (75.3) |
| 80/40/20/0.2 | 30766.9 (1386.0) | 16386.5 (636.8) |  | 1099.2 (86.0) | 1019.5 (74.6) | 965.5 (74.7) | 2015.8 (523.6) | 1050.7 (84.5) |
| 80/40/40/0.1 | 29116.2 (1591.7) | 17416.1 (924.2) |  | 502.4 (54.0) | 506.9 (66.9) | 497.7 (62.8) | 522.7 (69.4) | 545.3 (77.1) |
| 80/40/40/0.2 | 29732.4 (1605.8) | 17940.8 (932.2) |  | 1007.2 (82.9) | 1013.3 (78.7) | 964.4 (74.6) | 1080.6 (165.6) | 1018.7 (74.9) |

Cells show mean-squared prediction errors with standard errors in parentheses; OLS entries are blank for the n < p settings.

p, the number of covariates; n, the sample size; q, the number of spurious variables; ns, noise-to-signal ratio; SPLS1, SPLS tuned by FDR control (FDR = 0.1); SPLS2, SPLS tuned by CV; SE, standard error.

Table 3

Model accuracy for simulations I and II

| p/n/q/ns setting | Lasso (Sens / Spec) | SPLS1 (Sens / Spec) | SPLS2 (Sens / Spec) | SuperPC (Sens / Spec) | EN (Sens / Spec) |
| --- | --- | --- | --- | --- | --- |
| 40/400/10/0.1 | 0.76 / 1.00 | 1.00 / 0.83 | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 0.95 |
| 40/400/10/0.2 | 0.67 / 1.00 | 1.00 / 0.80 | 1.00 / 1.00 | 1.00 / 1.00 | 0.94 / 0.97 |
| 40/400/30/0.1 | 1.00 / 0.98 | 1.00 / 0.83 | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 0.95 |
| 40/400/30/0.2 | 0.96 / 1.00 | 1.00 / 0.80 | 1.00 / 1.00 | 1.00 / 1.00 | 1.00 / 0.95 |
| 80/40/20/0.1 | 0.15 / 1.00 | 1.00 / 0.80 | 1.00 / 1.00 | 0.97 / 0.93 | 0.72 / 0.99 |
| 80/40/20/0.2 | 0.12 / 1.00 | 1.00 / 0.67 | 1.00 / 1.00 | 0.86 / 0.83 | 0.80 / 0.98 |
| 80/40/40/0.1 | 0.21 / 1.00 | 1.00 / 0.80 | 1.00 / 1.00 | 1.00 / 0.93 | 0.72 / 0.99 |
| 80/40/40/0.2 | 0.15 / 1.00 | 1.00 / 0.80 | 1.00 / 1.00 | 0.97 / 0.90 | 0.80 / 0.98 |

Each cell shows sensitivity / specificity.

p, the number of covariates; n, the sample size; q, the number of spurious variables; ns, noise-to-signal ratio; SPLS1, SPLS tuned by FDR control (FDR = 0.1); SPLS2, SPLS tuned by CV.

Not surprisingly, the methods with an intrinsic variable selection property show smaller prediction errors than the methods lacking this property. For $n > p$, the lasso, SPLS, supervised PCs and the EN show similar prediction performances in all four scenarios. This also holds for the $n < p$ case, except that supervised PC shows a slight increase in prediction error for dense models ($p = 80$ and $q = 20$). In terms of model selection accuracy, SPLS, supervised PCs and the EN show excellent performances, whereas the lasso performs poorly by missing relevant variables. SPLS performs better than the other methods in the $n < p$ and high noise-to-signal ratio scenarios. We observe that the EN misses relevant variables in the $n < p$ scenario, even though its L2-penalty aims to handle these cases specifically. Moreover, the EN performs well for the right size of the regularization parameter $\lambda_2$, but finding the optimal size objectively through CV seems to be a challenging task.

In general, both SPLS–CV and SPLS–FDR perform at least as well as the other methods (Table 3). In particular, when $n < p$, the lasso fails to identify important variables, whereas SPLS regression succeeds. This is because, although the number of SPLS latent components is limited by $n$, the actual number of variables that make up the latent components can exceed $n$.

5.4. Comparisons of predictive power among methods that handle multicollinearity

In this section, we compare SPLS regression with some of the popular methods that handle multicollinearity: PLS, PCR, ridge regression, a mixed variance–covariance approach, gene shaving (Hastie et al., 2000) and supervised PCs (Bair et al., 2006). These comparisons are motivated by those presented in Bair et al. (2006). We compare only prediction performances since all methods except gene shaving and supervised PCs lack variable selection. For the dimension reduction methods, we allow only one latent component for a fair comparison.

Throughout these simulations, we set $p = 5000$ and $n = 100$. All the scenarios follow the general model $Y = X\beta + f$, but the underlying data generation for $X$ varies. We devise simulation scenarios where the multicollinearity is due to the presence of one main latent variable (simulations 1 and 2), the presence of multiple latent variables (simulation 3) and the presence of a correlation structure that is not induced by latent variables but by some other mechanism (simulation 4). We select the optimal tuning parameters and compute the prediction errors as in Section 5.3. The results are summarized in Table 4.

Table 4

Mean-squared prediction errors

| Method | Simulation 1 | Simulation 2 | Simulation 3 | Simulation 4 |
| --- | --- | --- | --- | --- |
| PCR1 | 320.67 (8.07) | 308.93 (7.13) | 241.75 (5.62) | 2730.53 (75.82) |
| PLS1 | 301.25 (7.32) | 292.70 (7.69) | 209.19 (4.58) | 1748.53 (47.47) |
| Ridge regression | 304.80 (7.47) | 296.36 (7.81) | 211.59 (4.70) | 1723.58 (46.41) |
| Supervised PC | 252.01 (9.71) | 248.26 (7.68) | 134.90 (3.34) | 263.46 (14.98) |
| SPLS1(FDR) | 256.22 (13.82) | 246.28 (7.87) | 139.01 (3.74) | 290.78 (13.29) |
| SPLS1(CV) | 257.40 (9.66) | 261.14 (8.11) | 120.27 (3.42) | 195.63 (7.59) |
| Mixed variance–covariance | 301.05 (7.31) | 292.46 (7.67) | 209.45 (4.58) | 1748.65 (47.58) |
| Gene shaving | 255.60 (9.28) | 292.46 (7.67) | 119.39 (3.31) | 203.46 (7.95) |
| True | 224.13 (5.12) | 218.04 (6.80) | 96.90 (3.02) | 99.12 (2.50) |

Cells show mean-squared prediction errors with standard errors in parentheses.

PCR1, PCR with one component; PLS1, PLS with one component; SPLS1(FDR), SPLS with one component tuned by FDR control (FDR = 0.4); SPLS1(CV), SPLS with one component tuned by CV; True, true model.

The first simulation scenario is the same as the 'simple simulation' that was utilized by Bair et al. (2006), where the hidden components $H_1$ and $H_2$ are defined as follows: $H_{1j} = 3$ for $1 \leqslant j \leqslant 50$, $H_{1j} = 4$ for $51 \leqslant j \leqslant n$ and $H_{2j} = 3.5$ for $1 \leqslant j \leqslant n$. Columns of $X$ are generated by $X_i = H_1 + \epsilon_i$ for $1 \leqslant i \leqslant 50$ and $X_i = H_2 + \epsilon_i$ for $51 \leqslant i \leqslant p$, where the $\epsilon_i$ are independent identically distributed random vectors from $N(0, I_n)$. $\beta$ is a $p \times 1$ vector whose $i$th element is $1/25$ for $1 \leqslant i \leqslant 50$ and 0 for $51 \leqslant i \leqslant p$. $f$ is a random vector from $N(0, 1.5^2 I_n)$. Although this scenario is ideal for supervised PCs in that $Y$ is related to one main hidden component, SPLS regression shows a performance that is comparable with supervised PCs and gene shaving.

The second simulation was referred to as the 'hard simulation' by Bair et al. (2006), where more complicated hidden components are generated and the rest of the data generation remains the same as in the simple simulation. $H_1, \ldots, H_5$ are generated by $H_{1j} = 3\, I(j \leqslant 50) + 4\, I(j > 50)$, $H_{2j} = 3.5 + 1.5\, I(u_{1j} \leqslant 0.4)$, $H_{3j} = 3.5 + 0.5\, I(u_{2j} \leqslant 0.7)$, $H_{4j} = 3.5 - 1.5\, I(u_{3j} \leqslant 0.3)$ and $H_{5j} = 3.5$, for $1 \leqslant j \leqslant n$, where $u_{1j}$, $u_{2j}$ and $u_{3j}$ are independent identically distributed random variables from Unif(0, 1). Columns of $X$ are generated by $X_i = H_j + \epsilon_i$ for $n_{j-1} + 1 \leqslant i \leqslant n_j$, where $j = 1, \ldots, 5$ and $(n_0, \ldots, n_5) = (0, 50, 100, 200, 300, p)$. As seen in Table 4, when there are complex latent components, SPLS and supervised PCs show the best performance. These two simulation studies illustrate that both SPLS and supervised PCs have good prediction performances under the latent component model with few relevant variables.

The third simulation is designed to compare the prediction performances of the methods when all methods are allowed to use only one latent component, even though more than one hidden component is related to $Y$. This scenario aims to illustrate the differences between the derived latent components depending on whether they are guided by the response $Y$. $H_1$ and $H_2$ are generated as $H_{1j} = 2.5\, I(j \leqslant 50) + 4\, I(j > 50)$ and $H_{2j} = 2.5\, I(1 \leqslant j \leqslant 25 \text{ or } 51 \leqslant j \leqslant 75) + 4\, I(26 \leqslant j \leqslant 50 \text{ or } 76 \leqslant j \leqslant 100)$. $(H_3, \ldots, H_6)$ are defined in the same way as $(H_2, \ldots, H_5)$ in the second simulation. Columns of $X$ are generated by $X_i = H_j + \epsilon_i$ for $n_{j-1} + 1 \leqslant i \leqslant n_j$, $j = 1, \ldots, 6$, and $(n_0, \ldots, n_6) = (0, 25, 50, 100, 200, 300, p)$. $f$ is a random vector from $N(0, I_n)$. Gene shaving and SPLS both exhibit good predictive performance in this scenario. In a way, when the number of components in the model is fixed, the methods which utilize $Y$ when deriving latent components can achieve better predictive performances than methods that utilize only $X$ when deriving these vectors. This agrees with the prior observation that PLS typically requires a smaller number of latent components than PCA (Frank and Friedman, 1993).

The fourth simulation is designed to compare the prediction performances of the methods when the relevant variables are not governed by a latent variable model. We generate the first 50 columns of $X$ from a multivariate normal distribution with an auto-regressive covariance, and the remaining 4950 columns of $X$ are generated from hidden components as before. Five hidden components are generated as follows: $H_{1j}$ equals 1 for $1 \leqslant j \leqslant 50$ and 6 for $51 \leqslant j \leqslant n$, and $H_2, \ldots, H_5$ are the same as in the second simulation. Denoting $X = (X^{(1)}, X^{(2)})$ as a partitioned matrix, we generate the rows of $X^{(1)}$ from $N(0, \Sigma_{50 \times 50})$, where $\Sigma_{50 \times 50}$ is from an AR(1) process with auto-correlation $\rho = 0.9$. Columns of $X^{(2)}$ are generated by $X_i^{(2)} = H_j + \epsilon_i$ for $n_{j-1} + 1 \leqslant i \leqslant n_j$, where $j = 1, \ldots, 5$ and $(n_0, \ldots, n_5) = (0, 50, 100, 200, 300, p - 50)$. $\beta$ is a $p \times 1$ vector whose $i$th element is given by $\beta_i = k_j$ for $n_{j-1} + 1 \leqslant i \leqslant n_j$, where $j = 1, \ldots, 6$, $(n_0, \ldots, n_6) = (0, 10, 20, 30, 40, 50, p)$ and $(k_1, \ldots, k_6) = (8, 6, 4, 2, 1, 0)/25$. SPLS regression and gene shaving perform well, indicating that they have the ability to handle such a correlation structure. As in the third simulation, these two methods may gain some advantage in handling more general correlation structures by utilizing the response $Y$ when deriving direction vectors.

6. Case-study: application to yeast cell cycle data set

Transcription factors (TFs) play an important role in interpreting a genome's regulatory code by binding to specific sequences to induce or repress gene expression. It is of general interest to identify TFs that are related to regulation of the cell cycle, which is one of the fundamental processes in a eukaryotic cell. Recently, Boulesteix and Strimmer (2005) performed an integrative analysis of gene expression and CHIP–chip data, which measure the amount of transcription and the physical binding of TFs respectively, to address this question. Their analysis focused on estimation rather than variable selection. In this section, we focus on identifying cell cycle regulating TFs.

We utilize a yeast cell cycle gene expression data set from Spellman et al. (1998). This experiment measures messenger ribonucleic acid levels every 7 min for 119 min with a total of 18 measurements covering two cell cycle periods. The second data set, CHIP–chip data of Lee et al. (2002), contains binding information of 106 TFs which elucidates which transcriptional regulators bind to promoter sequences of genes across the yeast genome. After excluding genes with missing values in either of the experiments, 542 cell-cycle-related genes are retained.

We analyse these data sets with our proposed multivariate (SPLS–NIPALS) and univariate SPLS regression methods, and also with the lasso for comparison, and summarize the results in Table 5. Since the CHIP–chip data provide a proxy for the binary outcome of binding, we scale the CHIP–chip data and use tenfold CV for tuning. Multivariate SPLS selects the smallest number of TFs (32), and univariate SPLS selects 70 TFs. The lasso selects the largest number of TFs, 100 out of 106. There are a total of 21 experimentally confirmed cell-cycle-related TFs (Wang et al., 2007), and we report the number of confirmed TFs among those selected as a guideline for performance comparisons. In Table 5, we also report a hypergeometric probability calculation quantifying the chance occurrence of the observed number of confirmed TFs among the variables selected by each method. A comparison of these probabilities indicates that, for multivariate SPLS, there is the strongest evidence that the selection of a large number of confirmed TFs is not due to chance.

Table 5

Comparison of the number of selected TFs

Method               Number of TFs selected (s)   Number of confirmed TFs (k)   Prob(K ≥ k)
Multivariate SPLS    32                           10                            0.034
Univariate SPLS      70                           17                            0.058
Lasso                100                          21                            0.256
Total                106                          21

Prob(K ≥ k) denotes the probability of observing at least k confirmed variables out of 85 unconfirmed and 21 confirmed variables in a random draw of s variables.
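The Prob(K ≥ k) column is a hypergeometric tail probability and can be checked directly; the following R lines, using the confirmed and unconfirmed counts given in Table 5, should reproduce the reported values up to rounding.

    # P(K >= k) when s of the 106 TFs (21 confirmed, 85 unconfirmed) are drawn at random
    p_at_least <- function(k, s, confirmed = 21, unconfirmed = 85)
      phyper(k - 1, m = confirmed, n = unconfirmed, k = s, lower.tail = FALSE)
    p_at_least(10, 32)    # multivariate SPLS: approximately 0.034
    p_at_least(17, 70)    # univariate SPLS: approximately 0.058
    p_at_least(21, 100)   # lasso: approximately 0.256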


We next compare the results from multivariate and univariate SPLS. A total of 28 TFs are selected by both methods, and nine of these are experimentally verified according to the literature. The estimated activities of the selected TFs generally show periodicity. This is a desirable property since the 18 time points cover two periods of the cell cycle. Interestingly, as depicted in Fig. 1, multivariate SPLS regression obtains smoother estimates of TF activities than univariate SPLS. A total of four TFs are selected only by multivariate SPLS regression; their coefficients are small but consistent across the time points (Fig. 2). A total of 42 TFs are selected only by univariate SPLS, and eight of these are among the confirmed TFs. These TFs do not show periodicity or have nonzero coefficients at only a few time points (data not shown). In general, multivariate SPLS regression can capture weak effects that are consistent across the time points.

Fig. 1

Estimated TF activities for the 21 confirmed TFs (plots for ABF-1, CBF-1, GCR2 and SKN7 are not displayed since their TF activities were estimated as zero by both the univariate and the multivariate SPLS; the y-axis denotes estimated coefficients and the x-axis is time; multivariate SPLS regression yields smoother estimates and exhibits periodicity): one line type shows the estimated TF activities from the multivariate SPLS regression and the other the estimated TF activities from univariate SPLS

Fig. 2

Estimated TF activities for the TFs selected only by the multivariate SPLS regression; the magnitudes of the estimated TF activities are small but consistent across the time points

7. Discussion

PLS regression has been successfully utilized in ill-conditioned linear regression problems that arise in several scientific disciplines. Goutis (1996) showed that PLS yields shrinkage estimators. Butler and Denham (2000) argued that it may provide peculiar shrinkage in the sense that some components of the regression coefficient vector may expand instead of shrinking. However, as argued by Rosipal and Krämer (2006), this does not necessarily lead to worse shrinkage because PLS estimators are highly non-linear. We showed that both univariate and multivariate PLS regression estimators are consistent under the latent model assumption only with strong restrictions on the number of variables relative to the sample size. This makes the suitability of PLS for the contemporary very large p and small n paradigm questionable. We argued and illustrated that imposing sparsity on the direction vectors helps to avoid sample size problems in the presence of large numbers of irrelevant variables. We further developed a regression technique called SPLS. SPLS regression is also likely to yield shrinkage estimators since the methodology can be considered a form of PLS regression on a restricted set of predictors; analysis of its shrinkage properties is among our current investigations. SPLS regression is computationally efficient since it solves a linear system by employing a conjugate gradient (CG) algorithm rather than a matrix inversion at each step.
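To make the CG step concrete, the following is a generic conjugate gradient sketch in R for a symmetric positive definite system Ax = b; it is a textbook routine given for illustration only and is not the internal implementation of our software.

    cg_solve <- function(A, b, tol = 1e-8, maxit = length(b)) {
      x <- numeric(length(b))           # start from the zero vector
      r <- b                            # residual b - A x
      p <- r                            # search direction
      rs <- sum(r * r)
      for (i in seq_len(maxit)) {
        Ap <- drop(A %*% p)
        alpha <- rs / sum(p * Ap)       # step length
        x <- x + alpha * p
        r <- r - alpha * Ap
        rs_new <- sum(r * r)
        if (sqrt(rs_new) < tol) break   # converged
        p <- r + (rs_new / rs) * p      # new conjugate direction
        rs <- rs_new
      }
      x
    }
    # example: cg_solve(crossprod(matrix(rnorm(200), 20, 10)) + diag(10), rnorm(10))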

We presented the solution of the SPLS criterion for the direction vectors and proposed an accompanying SPLS regression algorithm. Our SPLS regression algorithm has connections to other variable selection algorithms, including the EN (Zou and Hastie, 2005) and the threshold gradient method (Friedman and Popescu, 2004). The EN method deals with collinearity in variable selection by incorporating ridge regression into the LARS algorithm; SPLS handles the same issue by fusing the PLS technique into the LARS algorithm. SPLS can also be related to the threshold gradient method in that both algorithms use only the thresholded gradient and not the Hessian. However, SPLS achieves faster convergence by using CG steps.

We presented proof-of-principle simulation studies with combinations of small and large numbers of predictors and sample sizes. These illustrated that SPLS regression achieves both high predictive power and accuracy in finding the relevant variables. Moreover, it can select more relevant variables than the available sample size, since the number of variables that contribute to the direction vectors is not limited by the sample size.

Our application of SPLS involved two recent genomic data types, namely gene expression data and genomewide binding data of TFs. The response variable was continuous, and a linear modelling framework followed naturally. Extensions of SPLS to other modelling frameworks such as generalized linear models and survival models are exciting future directions. Our integrative analysis of expression and TF binding data highlighted the use of SPLS within the context of a multivariate response. We expect that several genomic problems with multivariate responses, e.g. linking expression of a cluster of genes to genetic marker data, might lend themselves to the multivariate SPLS framework. We provide an implementation of the SPLS regression methodology as an R package at http://cran.r-project.org/web/packages/spls.
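A minimal usage sketch is given below. It assumes the spls() and cv.spls() interface of the CRAN spls package as we recall it (arguments eta and K, returned fields eta.opt and K.opt); these names should be checked against the current package manual, and the toy data are purely illustrative.

    library(spls)                                    # install.packages("spls") if needed
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)            # toy predictors (n = 100, p = 20)
    y <- x[, 1:3] %*% c(2, -1, 1) + rnorm(100)       # toy response driven by three predictors
    cv  <- cv.spls(x, y, eta = seq(0.1, 0.9, 0.1), K = 1:3)  # tune sparsity (eta) and components (K) by CV
    fit <- spls(x, y, eta = cv$eta.opt, K = cv$K.opt)
    coef(fit)                                        # sparse coefficient estimates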

Acknowledgements

This research has been supported by National Institutes of Health grant H6003747 and National Science Foundation grant DMS 0804597 to SK.

References

Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006) Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34, 584–653.
D’Aspremont, A., Ghaoui, L. E., Jordan, M. I. and Lanckriet, G. R. G. (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev., 49, 434–448.
Bair, E., Hastie, T., Paul, D. and Tibshirani, R. (2006) Prediction by supervised principal components. J. Am. Statist. Ass., 101, 119–137.
Bendel, R. B. and Afifi, A. A. (1976) A criterion for stepwise regression. Am. Statistn, 30, 85–87.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289–300.
Boulesteix, A.-L. and Strimmer, K. (2005) Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor. Biol. Med. Modllng, 2.
Boulesteix, A.-L. and Strimmer, K. (2006) Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform., 7, 32–44.
Ter Braak, C. J. F. and De Jong, S. (1998) The objective function of partial least squares regression. J. Chemometr., 12, 41–54.
Butler, N. A. and Denham, M. C. (2000) The peculiar shrinkage properties of partial least squares regression. J. R. Statist. Soc. B, 62, 585–593.
Chun, H. and Keleş, S. (2009) Expression quantitative trait loci mapping with multivariate sparse partial least squares. Genetics, 182, 79–90.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. Ann. Statist., 32, 407–499.
Frank, I. E. and Friedman, J. H. (1993) A statistical view of some chemometrics regression tools. Technometrics, 35, 109–135.
Friedman, J. H. and Popescu, B. E. (2004) Gradient directed regularization for linear regression and classification. Technical Report. Department of Statistics, Stanford University, Stanford.
Geman, S. (1980) A limit theorem for the norm of random matrices. Ann. Probab., 8, 252–261.
Golub, G. H. and Van Loan, C. F. (1987) Matrix Computations. Baltimore: Johns Hopkins University Press.
Goutis, C. (1996) Partial least squares algorithm yields shrinkage estimators. Ann. Statist., 24, 816–824.
Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., Botstein, D. and Brown, P. (2000) Identifying distinct sets of genes with similar expression patterns via ‘gene shaving’. Genome Biol., 1, 1–21.
Helland, I. S. (1990) Partial least squares regression and statistical models. Scand. J. Statist., 17, 97–114.
Helland, I. S. (2000) Model reduction for prediction in regression models. Scand. J. Statist., 27, 1–20.
Helland, I. S. and Almoy, T. (1994) Comparison of prediction methods when only a few components are relevant. J. Am. Statist. Ass., 89, 583–591.
Huang, X., Pan, W., Park, S., Han, X., Miller, L. W. and Hall, J. (2004) Modeling the relationship between LVAD support time and gene expression changes in the human heart by penalized partial least squares. Bioinformatics, 20, 888–894.
Johnstone, I. M. and Lu, A. Y. (2004) Sparse principal component analysis. Technical Report. Department of Statistics, Stanford University, Stanford.
Jolliffe, I. T., Trendafilov, N. T. and Uddin, M. (2003) A modified principal component technique based on the lasso. J. Computnl Graph. Statist., 12, 531–547.
De Jong, S. (1993) SIMPLS: an alternative approach to partial least squares regression. Chemometr. Intell. Lab. Syst., 18, 251–263.
Kosorok, M. R. and Ma, S. (2007) Marginal asymptotics for the ‘large p, small n’ paradigm: with applications to microarray data. Ann. Statist., 35, 1456–1486.
Krämer, N. (2007) An overview on the shrinkage properties of partial least squares regression. Computnl Statist., 22, 249–273.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thomson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B., Wyrick, J. J., Tagne, J.-B., Volkert, T. L., Fraenkel, E., Gifford, D. K. and Young, R. A. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804.
Nadler, B. and Coifman, R. R. (2005) The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration. J. Chemometr., 19, 107–118.
Naik, P. and Tsai, C.-L. (2000) Partial least squares estimator for single-index models. J. R. Statist. Soc. B, 62, 763–771.
Pratt, J. W. (1960) On interchanging limits and integrals. Ann. Math. Statist., 31, 74–77.
Rosipal, R. and Krämer, N. (2006) Overview and recent advances in partial least squares. In Subspace, Latent Structure and Feature Selection Techniques (eds C. Saunders, M. Grobelnik, S. Gunn and J. Shawe-Taylor), pp. 34–51. New York: Springer.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molec. Biol. Cell, 9, 3273–3279.
Stoica, P. and Soderstrom, T. (1998) Partial least squares: a first-order analysis. Scand. J. Statist., 25, 17–24.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.
Wang, L., Chen, G. and Li, H. (2007) Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics, 23, 1486–1494.
Wold, H. (1966) Estimation of Principal Components and Related Models by Iterative Least Squares. New York: Academic Press.
Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. J. R. Statist. Soc. B, 67, 301–320.
Zou, H., Hastie, T. and Tibshirani, R. (2006) Sparse principal component analysis. J. Computnl Graph. Statist., 15, 265–286.

Appendix A: Proofs of the theorems

We first introduce lemmas 2 and 3 and then utilize these in the proof of theorem 1. For a matrix $A \in \mathbb{R}^{n \times k}$, $\|A\|_2$ is defined as the largest singular value of $A$.

Lemma 2. Under assumptions 1 and 2, and $p/n \to 0$,
$$\|S_{XX} - \Sigma_{XX}\|_2 = O_p\{\sqrt{p/n}\}, \qquad \|S_{XY} - \sigma_{XY}\|_2 = O_p\{\sqrt{p/n}\}.$$

Proof. The first part of lemma 2 was proved by Johnstone and Lu (2004), and we show the second part on the basis of their argument. We decompose $S_{XY} - \sigma_{XY}$ as $(A_n + B_n + C_n)\beta + D_n$, where
$$A_n = \sum_{j,k \leq m} \Bigl(n^{-1}\sum_{i=1}^n v_{ij}v_{ik} - \delta_{jk}\Bigr)\rho_j\rho_k^{\mathrm T}, \qquad B_n = \sum_{j=1}^m \sigma_1 n^{-1}\bigl(\rho_j v_j^{\mathrm T} E + E^{\mathrm T} v_j \rho_j^{\mathrm T}\bigr),$$
$$C_n = \sigma_1^2 \bigl(n^{-1} E^{\mathrm T} E - I_p\bigr), \qquad D_n = \sigma_1\sigma_2 n^{-1}\Bigl(\sum_{j=1}^m \rho_j v_j^{\mathrm T} f + E^{\mathrm T} f\Bigr).$$
We remark that here $E$ is defined to be an $n \times p$ matrix whose $i$th row is $e_i$, whereas the corresponding matrix $Z$ in Johnstone and Lu (2004) is a $p \times n$ matrix. We aim to show that the norm of each component of the decomposition is $O_p\{\sqrt{p/n}\}$. Johnstone and Lu (2004) showed that, if $p/n \to k_0 \in [0,\infty)$, then $\|A_n\|_2 \to 0$, $\|B_n\|_2 \to \sigma_1 \sqrt{k_0} \sum_j \varrho_j$ and $\|C_n\|_2 \to \sigma_1^2(k_0 + 2\sqrt{k_0})$ almost surely. Hence, we examine $\|D_n\|_2$, the components of which have the distributions $v_j^{\mathrm T} f =_d \chi_{(n)}\chi_{(1)} U_j$ for $1 \leq j \leq m$ and $E^{\mathrm T} f =_d \chi_{(n)}\chi_{(p)} U_{m+1}$, where $\chi_{(n)}^2$, $\chi_{(1)}^2$ and $\chi_{(p)}^2$ are $\chi^2$ random variables and the $U_j$ are random vectors, uniform on the surface of the unit sphere $S^{p-1}$ in $\mathbb{R}^p$. After denoting $a_j = v_j^{\mathrm T} f$ for $1 \leq j \leq m$ and $a_{m+1} = E^{\mathrm T} f$, we have that $\sigma_1^2 n^{-2}\|a_j\|_2^2 \to 0$ almost surely, for $1 \leq j \leq m$, and $\sigma_2^2\sigma_1^2 n^{-2}\|a_{m+1}\|_2^2 \to k_0\sigma_1^2\sigma_2^2$ almost surely from the previous results on the distributions. By using a version of the dominated convergence theorem (Pratt, 1960), the results follow: $\sigma_1\sigma_2 n^{-1}\|\sum_{j=1}^m \rho_j v_j^{\mathrm T} f\|_2 \to 0$ almost surely, $\|D_n\|_2 \to \sqrt{k_0}\,\sigma_1\sigma_2$ almost surely and $\|S_{XY} - \sigma_{XY}\|_2 \to \{\sigma_1\sqrt{k_0}\sum_j \varrho_j + \sigma_1^2(k_0 + 2\sqrt{k_0})\}\|\beta\|_2 + \sqrt{k_0}\,\sigma_1\sigma_2$ almost surely, and thus the lemma is proved.
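As an informal numerical illustration of the rate in lemma 2 (not part of the proof), the following R sketch estimates $\|S_{XX} - \Sigma_{XX}\|_2$ by Monte Carlo for $\Sigma_{XX} = I_p$ over a few $(n, p)$ pairs with $p/n$ held fixed; the average operator norm stays roughly constant, which is consistent with the $O_p\{\sqrt{p/n}\}$ rate.

    op_norm <- function(A) svd(A)$d[1]                  # largest singular value
    rate_check <- function(n, p, reps = 20) {
      mean(replicate(reps, {
        X <- matrix(rnorm(n * p), n, p)                 # Sigma_XX = I_p
        op_norm(crossprod(X) / n - diag(p))             # ||S_XX - Sigma_XX||_2
      }))
    }
    set.seed(1)
    c(rate_check(100, 10), rate_check(400, 40), rate_check(1600, 160))   # p/n = 0.1 throughout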

Lemma 3. Under assumptions 1 and 2, and $p/n \to 0$,
$$\|S_{XX}^k S_{XY} - \Sigma_{XX}^k \sigma_{XY}\|_2 = O_p\{\sqrt{p/n}\}, \qquad (11)$$
$$\|S_{XY}^{\mathrm T} S_{XX}^k S_{XY} - \sigma_{XY}^{\mathrm T} \Sigma_{XX}^k \sigma_{XY}\|_2 = O_p\{\sqrt{p/n}\}. \qquad (12)$$

Proof. Both bounds (11) and (12) are direct consequences of lemma 2. By using the triangular inequality, Hölder’s inequality and lemma 2, we have that
$$\|S_{XX}^k S_{XY} - \Sigma_{XX}^k \sigma_{XY}\|_2 \leq \|S_{XX}^k - \Sigma_{XX}^k\|_2 \,\|\sigma_{XY}\|_2 + \|\Sigma_{XX}^k\|_2 \,\|S_{XY} - \sigma_{XY}\|_2 = O_p\{\sqrt{p/n}\}\, k_1 + k_2\, O_p\{\sqrt{p/n}\}$$
for some constants $k_1$ and $k_2$, and
$$\|S_{XY}^{\mathrm T} S_{XX}^k S_{XY} - \sigma_{XY}^{\mathrm T} \Sigma_{XX}^k \sigma_{XY}\|_2 \leq \|S_{XY}^{\mathrm T} - \sigma_{XY}^{\mathrm T}\|_2 \,\|S_{XX}^k S_{XY}\|_2 + \|\sigma_{XY}^{\mathrm T}\|_2 \,\|S_{XX}^k S_{XY} - \Sigma_{XX}^k \sigma_{XY}\|_2 = O_p\{\sqrt{p/n}\}.$$

A.1. Proof of theorem 1

We start by proving the first part of theorem 1. We use the closed form solution
$$\hat{\beta}_{\mathrm{PLS}} = \hat{R}\bigl(\hat{R}^{\mathrm T} S_{XX} \hat{R}\bigr)^{-1} \hat{R}^{\mathrm T} S_{XY},$$
where $\hat{R} = (S_{XY}, \ldots, S_{XX}^{K-1} S_{XY})$. First, we establish that
$$\hat{\beta}_{\mathrm{PLS}} \to R\bigl(R^{\mathrm T} \Sigma_{XX} R\bigr)^{-1} R^{\mathrm T} \sigma_{XY} \quad \text{in probability.}$$
By using the triangular inequality and Hölder’s inequality,
$$\bigl\|\hat{R}(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1} \hat{R}^{\mathrm T} S_{XY} - R(R^{\mathrm T} \Sigma_{XX} R)^{-1} R^{\mathrm T} \sigma_{XY}\bigr\|_2 \leq \|\hat{R} - R\|_2 \,\bigl\|(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1} \hat{R}^{\mathrm T} S_{XY}\bigr\|_2$$
$$\qquad + \|R\|_2 \,\bigl\|(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1} - (R^{\mathrm T} \Sigma_{XX} R)^{-1}\bigr\|_2 \,\|\hat{R}^{\mathrm T} S_{XY}\|_2 + \|R\|_2 \,\bigl\|(R^{\mathrm T} \Sigma_{XX} R)^{-1}\bigr\|_2 \,\|\hat{R}^{\mathrm T} S_{XY} - R^{\mathrm T} \sigma_{XY}\|_2.$$
It is sufficient to show that $\|\hat{R} - R\|_2 \to 0$, $\|(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1} - (R^{\mathrm T} \Sigma_{XX} R)^{-1}\|_2 \to 0$ and $\|\hat{R}^{\mathrm T} S_{XY} - R^{\mathrm T} \sigma_{XY}\|_2 \to 0$ in probability.

The first claim is proved by using the definition of a matrix norm and lemmas 2 and 3 as
$$\|\hat{R} - R\|_2 \leq K \max_{1 \leq k \leq K} \bigl\|S_{XX}^{k-1} S_{XY} - \Sigma_{XX}^{k-1} \sigma_{XY}\bigr\|_2 = O_p\{\sqrt{p/n}\}.$$

For the second claim, we focus on $\|\hat{R}^{\mathrm T} S_{XX} \hat{R} - R^{\mathrm T} \Sigma_{XX} R\|_2 \,\|(R^{\mathrm T} \Sigma_{XX} R)^{-1}\|_2 \,\|(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1}\|_2$, since
$$\bigl\|(A + E)^{-1} - A^{-1}\bigr\|_2 \leq \|E\|_2 \,\|A^{-1}\|_2 \,\bigl\|(A + E)^{-1}\bigr\|_2$$
(Golub and Van Loan, 1987). Here, $\|(R^{\mathrm T} \Sigma_{XX} R)^{-1}\|_2$ and $\|(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1}\|_2$ are finite as $R^{\mathrm T} \Sigma_{XX} R$ and $\hat{R}^{\mathrm T} S_{XX} \hat{R}$ are non-singular for a given $K$. Using this fact as well as the triangular and Hölder’s inequalities, we can easily show the second claim. The third claim follows from the fact that $\|\hat{R} - R\|_2 \to 0$ in probability, lemma 2 and the triangular and Hölder’s inequalities.

Next, we can establish that $\beta = \Sigma_{XX}^{-1} \sigma_{XY} = R(R^{\mathrm T} \Sigma_{XX} R)^{-1} R^{\mathrm T} \sigma_{XY}$ by using the same argument as proposition 1 of Naik and Tsai (2000).

We now prove the second part of theorem 1. Since $\hat{R}^{\mathrm T}(S_{XX}\hat{\beta}_{\mathrm{PLS}} - S_{XY}) = 0$ almost surely,
$$\lim P\bigl\{\|\hat{R}^{\mathrm T}(S_{XX}\hat{\beta}_{\mathrm{PLS}} - S_{XY})\|_2 = 0\bigr\} = 1. \qquad (13)$$
If $\|\hat{\beta}_{\mathrm{PLS}} - \beta\|_2 \to 0$ in probability for $p/n \to k_0\;(>0)$, then
$$\lim P\bigl\{\|\hat{R}^{\mathrm T} E^{\mathrm T} f / n\|_2 = 0\bigr\} = 1. \qquad (14)$$
Since $\|E^{\mathrm T} f / n\|_2 \neq 0$ almost surely, equation (14) implies that $P\{E^{\mathrm T} f / n \in \mathrm{null}(\hat{R}^{\mathrm T})\} \to 1$ as $n \to \infty$.

This contradicts the fact that $E^{\mathrm T} f =_d \chi_{(n)}\chi_{(p)} U_p$, where $U_p$ is a vector uniform on the surface of the unit sphere $S^{p-1}$, as the dimension of $\mathrm{null}(\hat{R}^{\mathrm T})$ is $p - K$.

A.2. Proof of lemma 1

We remark that $\sum_{i=1}^q h_i S_{XY_i} = \bigl(\sum_i h_i \nu_i\bigr)\lambda\, S_{XX} u + \sum_i h_i X^{\mathrm T} f_i / n$, and
$$\Bigl\|\sum_{i=1}^q h_i X^{\mathrm T} f_i / n \Bigr\|_2 \Big/ \sum_{i=1}^q h_i \;\leq\; \sum_{i=1}^q h_i \bigl\|X^{\mathrm T} f_i / n\bigr\|_2 \Big/ \sum_{i=1}^q h_i \;=\; O_p\{\sqrt{p/n}\}$$
from the triangular inequality, the proof of theorem 1 and $1 \leq \sum_{i=1}^q h_i \leq q$. Then, we have
$$\biggl\|\sum_{i=1}^q h_i S_{XY_i} \Big/ \Bigl\|\sum_{i=1}^q h_i S_{XY_i}\Bigr\|_2 - \Sigma_{XX} u \big/ \|\Sigma_{XX} u\|_2\biggr\|_2 = \bigl\|S_{XX} u / \|S_{XX} u\|_2 - \Sigma_{XX} u / \|\Sigma_{XX} u\|_2\bigr\|_2 + O_p\{\sqrt{p/n}\} = O_p\{\sqrt{p/n}\}.$$

A.3. Proof of theorem 2

We start with the sufficient condition for convergence. We shall first characterize the space that is generated by the direction vectors of each algorithm. For the NIPALS algorithm, we denote by $\hat{W}_K^{\mathrm{NIP}} = (\hat{w}_1, \ldots, \hat{w}_K)$ and $D_K = (d_1, \ldots, d_K)$ the direction vectors for the original covariates and the deflated covariates respectively. The first direction vector $d_1 = \hat{w}_1$ is obtained as $S_{XY} t_1 / \|S_{XY} t_1\|_2$, where $t_1$ is the right singular vector of $S_{XY}$. We denote $s_{i,1} = S_{XY} t_i / \|S_{XY} t_i\|_2$, as this form of vector recurs in the remaining steps. Then $\mathrm{span}(\hat{W}_1^{\mathrm{NIP}}) = \mathrm{span}(s_{1,1})$. Define $\psi_i$ as the step size vector at the $i$th step, and the $i$th current correlation matrix $C_i$ as $(1/n)\, X^{\mathrm T}\bigl(Y - X \sum_{j=1}^{i-1} \hat{w}_j \psi_j^{\mathrm T}\bigr)$. The current correlation matrix at the second step is given by $C_2 = S_{XY} - S_{XX} \hat{w}_1 \psi_1^{\mathrm T}$ and thus the second direction vector $d_2$ is proportional to $S_{XY} t_2 - (\psi_1^{\mathrm T} t_2)\, S_{XX} \hat{w}_1$, where $t_2$ is the right singular vector of $C_2$. Then $\mathrm{span}(\hat{W}_2^{\mathrm{NIP}}) = \mathrm{span}(s_{1,1},\, s_{2,1} + l_{2,1} S_{XX} s_{1,1})$, where $l_{2,1} = \psi_1^{\mathrm T} t_2 / \|S_{XY} t_2\|_2$. Similarly, we can obtain
$$\mathrm{span}\bigl(\hat{W}_K^{\mathrm{NIP}}\bigr) = \mathrm{span}\Bigl(s_{1,1},\; s_{2,1} + l_{2,1} S_{XX} s_{1,1},\; \ldots,\; s_{K,1} + \sum_{i=1}^{K-1} l_{K,i}\, S_{XX}^i\, s_{K-i,1}\Bigr).$$

Now, we observe that $\mathrm{span}(\hat{W}_i^{\mathrm{NIP}})$ does not form a Krylov space, because $s_{i,1}$ is not the same as $s_{1,1}$ for multivariate $Y$. However, it forms a Krylov space for large $n$, since $\|S_{XY} t / \|S_{XY} t\|_2 - w_1\|_2 \to 0$ almost surely for any $q$-dimensional random vector $t$ subject to $\|t\|_2 = 1$, following lemma 1.

For the SIMPLS algorithm, using the fact that the $i$th direction vector of SIMPLS is obtained sequentially from the left singular vector of $D_i = (I - \Pi_{P_{i-1}})\, S_{XY}$, where $P_{i-1} = S_{XX} \hat{W}_{i-1}^{\mathrm{SIM}}\bigl(\hat{W}_{i-1}^{\mathrm{SIM\,T}} S_{XX} \hat{W}_{i-1}^{\mathrm{SIM}}\bigr)^{-1}$, we can characterize
$$\mathrm{span}\bigl(\hat{W}_K^{\mathrm{SIM}}\bigr) = \mathrm{span}\Bigl(s_{1,1},\; s_{2,1} + l_{2,1} S_{XX} s_{1,1},\; \ldots,\; s_{K,1} + \sum_{i=1}^{K-1} l_{K,i}\, S_{XX}^i\, s_{K-i,1}\Bigr).$$

We note that the $s_{i,1}$ and $l_{i,j}$ from the NIPALS and SIMPLS algorithms are different, because the $t_i$ are obtained from $C_i$ and $D_i$ for the NIPALS and SIMPLS algorithms respectively.

Next, we shall focus on the convergence of the NIPALS estimator, because the convergence of the SIMPLS estimator can be proved by the same argument owing to the structural similarity of $\mathrm{span}(\hat{W}_K^{\mathrm{NIP}})$ and $\mathrm{span}(\hat{W}_K^{\mathrm{SIM}})$.

Denoting $\tilde{W} = (s_{1,1},\, s_{2,1} + l_{2,1} S_{XX} s_{1,1},\, \ldots,\, s_{K,1} + \sum_{i=1}^{K-1} l_{K,i} S_{XX}^i s_{K-i,1})$ and $\tilde{\tilde{W}} = (s_{1,1},\, s_{1,1} + l_{2,1} S_{XX} s_{1,1},\, \ldots,\, s_{1,1} + \sum_{i=1}^{K-1} l_{K,i} S_{XX}^i s_{1,1})$, one can show that $\|\tilde{W} - \tilde{\tilde{W}}\|_2 = O_p\{\sqrt{p/n}\}$ by using the fact that $\|s_{i,1} - w_1\|_2 = O_p\{\sqrt{p/n}\}$ for $i = 1, \ldots, K$. Since $\mathrm{span}(\tilde{\tilde{W}})$ can also be represented as $\mathrm{span}(s_{1,1},\, S_{XX} s_{1,1},\, \ldots,\, S_{XX}^{K-1} s_{1,1})\;(= \mathrm{span}(\hat{R}))$, we have that $\|\hat{B}_{\mathrm{NIP}} - \hat{R}(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1} \hat{R}^{\mathrm T} S_{XY}\|_2 = O_p\{\sqrt{p/n}\}$. Thus, we now deal with the convergence of $\hat{R}(\hat{R}^{\mathrm T} S_{XX} \hat{R})^{-1} \hat{R}^{\mathrm T} S_{XY}$, which has a form similar to that of the univariate response case.

Since $\|S_{XX}^{i-1} s_{1,1} - \Sigma_{XX}^{i-1} w_1\|_2 = O_p\{\sqrt{p/n}\}$ for $i = 1, \ldots, K$, one can show that $\|\hat{R} - R\|_2 = O_p\{\sqrt{p/n}\}$, where $R = (w_1,\, \Sigma_{XX} w_1,\, \ldots,\, \Sigma_{XX}^{K-1} w_1)$. The convergence of the estimator can be established similarly to the argument in theorem 1 with the following additional argument:
$$\|\hat{R}^{\mathrm T} S_{XY} - R^{\mathrm T} \Sigma_{XY}\|_2 \leq \|\hat{R}^{\mathrm T} S_{XY} - R^{\mathrm T} S_{XY}\|_2 + \|R^{\mathrm T} S_{XY} - R^{\mathrm T} \Sigma_{XY}\|_2 \leq \|S_{XY}\|_2 \,\|\hat{R} - R\|_2 + \|R^{\mathrm T} X^{\mathrm T} F / n\|_2$$
$$= O_p\{\sqrt{p/n}\} + \|R^{\mathrm T} X^{\mathrm T} F / n\|_2 = O_p\{\sqrt{p/n}\} + O_p\{\sqrt{qK/n}\} = O_p\{\sqrt{p/n}\}. \qquad (15)$$

Inequality (15) follows from observing that each column of the matrix $(R^{\mathrm T} \Sigma_{XX} R)^{-1/2}\, n^{-1/2}\, R^{\mathrm T} X^{\mathrm T}$ follows $N(0, I_K)$ independently. The remainder of the proof is a simple extension of the proof of theorem 1.

The necessity of the condition for convergence is proved as follows. Assume that $\|\hat{B}_{\mathrm{PLS}} - B\|_2 \to 0$ in probability when $p/n \to k_0\;(>0)$. Following the argument in the proof of theorem 1, we have
$$\lim P\bigl\{\|\hat{R}^{\mathrm T} E^{\mathrm T} F / n\|_2 = 0\bigr\} = 1.$$
Since $\|E^{\mathrm T} F / n\|_2 \neq 0$ almost surely, this equation implies that $P\{\mathrm{range}(E^{\mathrm T} F / n) \subset \mathrm{null}(\hat{R}^{\mathrm T})\} \to 1$ as $n \to \infty$.

If $p/n \to k_0\;(>0)$, this contradicts the fact that $E^{\mathrm T} F_i =_d \chi_{(n)}\chi_{(p)} U_p$, where $F_i$ denotes the $i$th column of $F$ and $U_p$ is a vector uniform on the surface of the unit sphere $S^{p-1}$, as the dimension of $\mathrm{null}(\hat{R}^{\mathrm T})$ is $p - K$.
