| Title: | Selecting Variables in Regression Models |
|---|---|
| Description: | A simple method to select the best model or best subset of variables using different types of data (binary, Gaussian or Poisson) and applying it in different contexts (parametric or non-parametric). Implemented methodology described in: M. Sestelo, N. M. Villanueva, L. Meira-Machado and J. Roca-Pardiñas (2016). FWDselect: an R package for variable selection in regression models. The R Journal, 8 (1), 132-148. <doi:10.32614/RJ-2016-009>. |
| Authors: | Marta Sestelo [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-4284-6509>), Nora M. Villanueva [aut, cph], Javier Roca-Pardinas [aut] |
| Maintainer: | Marta Sestelo <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.1.1 |
| Built: | 2026-05-22 08:57:38 UTC |
| Source: | https://github.com/sestelo/fwdselect |
The diabetes data is a data frame with 11 variables and 442 measurements. These are the data used in the Efron et al. (2004) paper. The data has been standardized to have unit L2 norm in each column and zero mean.
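As a small illustration of what "unit L2 norm in each column and zero mean" means, the following base R sketch standardizes a single synthetic column. The values are simulated and are not taken from the `diabetes` data; the variable names are hypothetical.

```r
# Synthetic stand-in for one covariate column (hypothetical values,
# not the real diabetes measurements).
set.seed(1)
raw <- rnorm(442, mean = 50, sd = 10)

# Center to zero mean, then scale so the sum of squares equals 1.
centered <- raw - mean(raw)
standardized <- centered / sqrt(sum(centered^2))

mean(standardized)    # zero up to floating-point error
sum(standardized^2)   # one up to floating-point error
```

The same two checks could be applied to each covariate column of a data frame with `sapply()`.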
diabetes
diabetes is a data frame with 11 variables (columns).
The first column of the data frame contains the response variable
(diabetes$y) which is a quantitative measure of disease progression
one year after baseline. The rest of the columns contain the measurements of
the ten explanatory variables (age, sex, body mass index, average blood
pressure and six blood serum registers) obtained from each of the 442
diabetes patients.
The original data are available in the lars package, see http://cran.r-project.org/web/packages/lars/.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32:407–499.
    library(FWDselect)
    data(diabetes)
    head(diabetes)
Registered values of SO2 at different time instants. Each column of the
dataset corresponds to the value obtained by the series of bi-hourly means
for SO2 at instant t (5-min time instants). The values in this dataset are
greater than the maximum value permitted for atmospheric SO2.
episode
episode is a data frame with 19 variables (columns).
Response variable: registered value of SO2 at a specific time instant, in microg/m3N. This is the value that we want to predict.
Registered value of SO2 at a specific time instant, in this case instant zero, in microg/m3N.
Registered value of SO2 at a specific time instant, in this case 5 minutes before, in microg/m3N.
Registered value of SO2 at a specific time instant, in this case 10 minutes before, in microg/m3N.
...
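The column layout described above (a response plus the series value at instant zero, 5 minutes before, 10 minutes before, and so on) can be sketched in base R with `embed()`. This is only an illustration on a simulated series: the column names `lag0`, `lag1`, ..., the number of lags, and the prediction horizon are all assumptions and are not taken from the dataset.

```r
set.seed(7)
so2 <- cumsum(rnorm(60))   # synthetic series, not real SO2 measurements
n_lags <- 3                # hypothetical: only 3 lagged columns for brevity
horizon <- 6               # hypothetical: predict 6 steps ahead

# embed() returns one row per time point: column 1 is the current value
# (lag 0), column 2 the previous value (lag 1), and so on.
lags <- embed(so2, n_lags)

# Drop the last rows, for which no future response is available.
keep <- seq_len(nrow(lags) - horizon)
lagged <- data.frame(
  Y = so2[keep + n_lags - 1 + horizon],   # value 'horizon' steps after lag 0
  lags[keep, , drop = FALSE]
)
names(lagged) <- c("Y", paste0("lag", 0:(n_lags - 1)))

head(lagged)
```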
    data(episode)
    head(episode)
qselection object
This function plots the cross-validation information criterion for several
subsets of size q chosen by the user.
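To make the idea of a cross-validated criterion concrete, here is a minimal base R sketch that computes a k-fold cross-validated mean squared prediction error for one candidate subset fitted with `lm()`. It is an illustration under stated assumptions, not the package's internal computation: the helper name `cv_error`, the fold assignment, and the use of squared error as the criterion are all hypothetical choices.

```r
# k-fold cross-validated squared prediction error for one candidate subset.
# 'x' is a data frame of covariates, 'y' the response vector.
cv_error <- function(x, y, nfolds = 5) {
  dat <- data.frame(resp = y, x)
  # Randomly assign each observation to one of 'nfolds' folds.
  fold <- sample(rep(seq_len(nfolds), length.out = nrow(dat)))
  total <- 0
  for (k in seq_len(nfolds)) {
    fit <- lm(resp ~ ., data = dat[fold != k, , drop = FALSE])   # train
    pred <- predict(fit, newdata = dat[fold == k, , drop = FALSE])
    total <- total + sum((dat$resp[fold == k] - pred)^2)         # test error
  }
  total / nrow(dat)
}

set.seed(1)
x <- data.frame(x1 = rnorm(100), x2 = rnorm(100))   # synthetic covariates
y <- 2 * x$x1 + rnorm(100, sd = 0.5)

err_signal <- cv_error(x["x1"], y)   # subset with the informative covariate
err_noise  <- cv_error(x["x2"], y)   # subset with a pure-noise covariate
c(signal = err_signal, noise = err_noise)
```

Evaluating such a criterion for the best subset of each size and plotting it against the size gives the kind of curve this plot method displays.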
    ## S3 method for class 'qselection'
    plot(x = object, y = NULL, ylab = NULL, ...)
x |
|
y |
NULL |
ylab |
NULL |
... |
Other options. |
Simply returns a plot.
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
    library(FWDselect)
    data(diabetes)
    x = diabetes[ ,2:11]
    y = diabetes[ ,1]
    obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
                      criterion = "variance", cluster = FALSE)
    plot(obj2)
Registered values of SO2 at different time instants. Each column of the
dataset corresponds to the value obtained by the series of bi-hourly means
for SO2 at instant t (5-min time instants).
pollution
pollution is a data frame with 19 variables (columns).
Response variable: registered value of SO2 at a specific time instant, in microg/m3N. This is the value that we want to predict.
Registered value of SO2 at a specific time instant, in this case instant zero, in microg/m3N.
Registered value of SO2 at a specific time instant, in this case 5 minutes before, in microg/m3N.
Registered value of SO2 at a specific time instant, in this case 10 minutes before, in microg/m3N.
...
    data(pollution)
    head(pollution)
qselection summary
    ## S3 method for class 'qselection'
    print(x = object, ...)
x |
|
... |
Other options. |
The function returns a summary table with the subsets of size
q, their information criterion values and the chosen variables for
each one. Additionally, an asterisk is shown next to the subset size
that minimizes the information criterion.
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
    library(FWDselect)
    data(diabetes)
    x = diabetes[ ,2:11]
    y = diabetes[ ,1]
    obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
                      criterion = "variance", cluster = FALSE)
    obj2
selection summary
    ## S3 method for class 'selection'
    print(x = model, ...)
x |
|
... |
Other options. |
The function returns the best subset of size q and its information
criterion value. In the case of seconds=TRUE this information is
returned for each alternative model.
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
    library(FWDselect)
    data(diabetes)
    x = diabetes[ ,2:11]
    y = diabetes[ ,1]
    obj1 = selection(x, y, q = 1, method = "lm",
                     criterion = "variance", cluster = FALSE)
    obj1
Function that obtains the best variables for more than one subset size. It returns a table with the covariates chosen for each model and the corresponding information criterion values. Additionally, an asterisk is shown next to the subset size that minimizes the information criterion.
    qselection(x, y, qvector, criterion = "deviance", method = "lm",
               family = "gaussian", nfolds = 5, cluster = TRUE, ncores = NULL)
x |
A data frame containing all the covariates. |
y |
A vector with the response values. |
qvector |
A vector with more than one variable-subset size to be selected. |
criterion |
The information criterion to be used.
Default is the deviance. Other functions provided
are the coefficient of determination ( |
method |
A character string specifying which regression method is used,
i.e., linear models ( |
family |
A description of the error distribution and link function to be
used in the model: ( |
nfolds |
Number of folds for the cross-validation procedure, for
|
cluster |
A logical value. If |
ncores |
An integer value specifying the number of cores to be used
in the parallelized procedure. If |
q |
A vector of subset sizes. |
criterion |
A vector of Information criterion values. |
selection |
Selected variables for each size. |
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
    library(FWDselect)
    data(diabetes)
    x = diabetes[ ,2:11]
    y = diabetes[ ,1]
    obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
                      criterion = "variance", cluster = FALSE)
    obj2
q variables
Main function for selecting the best subset of q variables.
Note that the selection procedure can be used with lm, glm or gam functions.
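The general idea of searching for a subset of q covariates can be sketched with a plain greedy forward search in base R. This is a simplified illustration and is not claimed to be the algorithm that `selection()` runs internally: the helper name `forward_select` is hypothetical, the criterion is the in-sample residual deviance rather than a cross-validated one, and the fit uses `glm()` so that the same loop covers Gaussian, binomial and Poisson responses.

```r
# Greedy forward search: at each step add the covariate that gives the
# smallest residual deviance. Illustrative only.
forward_select <- function(x, y, q, family = gaussian()) {
  chosen <- integer(0)
  for (step in seq_len(q)) {
    candidates <- setdiff(seq_len(ncol(x)), chosen)
    dev <- sapply(candidates, function(j) {
      dat <- data.frame(resp = y, x[, c(chosen, j), drop = FALSE])
      deviance(glm(resp ~ ., data = dat, family = family))
    })
    chosen <- c(chosen, candidates[which.min(dev)])
  }
  names(x)[chosen]
}

set.seed(42)
x <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                x3 = rnorm(200), x4 = rnorm(200))   # synthetic covariates

# Gaussian response driven by x1 (strongest) and x3.
y_gauss <- 3 * x$x1 - 2 * x$x3 + rnorm(200, sd = 0.1)
sel_gauss <- forward_select(x, y_gauss, q = 2)

# Poisson response driven by x2.
y_count <- rpois(200, lambda = exp(0.5 + 0.8 * x$x2))
sel_count <- forward_select(x, y_count, q = 1, family = poisson())

sel_gauss
sel_count
```

A purely greedy search never revisits earlier choices, so it can miss the best subset when covariates are correlated.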
    selection(x, y, q, prevar = NULL, criterion = "deviance", method = "lm",
              family = "gaussian", seconds = FALSE, nmodels = 1, nfolds = 5,
              cluster = TRUE, ncores = NULL)
x |
A data frame containing all the covariates. |
y |
A vector with the response values. |
q |
An integer specifying the size of the subset of variables to be selected. |
prevar |
A vector containing the number of the best subset of
|
criterion |
The information criterion to be used.
Default is the deviance. Other functions provided
are the coefficient of determination ( |
method |
A character string specifying which regression method is used,
i.e., linear models ( |
family |
A description of the error distribution and link function to be
used in the model: ( |
seconds |
A logical value. By default, |
nmodels |
Number of secondary models to be returned. |
nfolds |
Number of folds for the cross-validation procedure, for
|
cluster |
A logical value. If |
ncores |
An integer value specifying the number of cores to be used
in the parallelized procedure. If |
Best model |
The best model. If |
Variable name |
Names of the variable. |
Variable number |
Number of the variables. |
Information criterion |
Information criterion used and its value. |
Prediction |
The prediction of the best model. |
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
    library(FWDselect)
    data(diabetes)
    x = diabetes[ ,2:11]
    y = diabetes[ ,1]
    obj1 = selection(x, y, q = 1, method = "lm",
                     criterion = "variance", cluster = FALSE)
    obj1

    # second models
    obj11 = selection(x, y, q = 1, method = "lm", criterion = "variance",
                      seconds = TRUE, nmodels = 2, cluster = FALSE)
    obj11

    # prevar argument
    obj2 = selection(x, y, q = 2, method = "lm",
                     criterion = "variance", cluster = FALSE)
    obj2
    obj3 = selection(x, y, q = 3, prevar = obj2$Variable_numbers,
                     method = "lm", criterion = "variance", cluster = FALSE)
Function that applies a bootstrap-based test for covariate selection. It helps to determine the number of variables to be included in the model.
    test(x, y, method = "lm", family = "gaussian", nboot = 50, speedup = TRUE,
         qmin = NULL, unique = FALSE, q = NULL, bootseed = NULL,
         cluster = TRUE, ncores = NULL)
x |
A data frame containing all the covariates. |
y |
A vector with the response values. |
method |
A character string specifying which regression method is used,
i.e., linear models ( |
family |
A description of the error distribution and link function to be
used in the model: ( |
nboot |
Number of bootstrap repeats. |
speedup |
A logical value. If |
qmin |
By default |
unique |
A logical value. By default |
q |
By default |
bootseed |
Seed to be used in the bootstrap procedure. |
cluster |
A logical value. If |
ncores |
An integer value specifying the number of cores to be used
in the parallelized procedure. If |
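In a regression framework, let $X_1, X_2, \ldots, X_p$ be a set of $p$
initial variables and $Y$ the response variable. We propose a
procedure to test the null hypothesis of $q$ significant variables in
the model ($q$ effects not equal to zero) versus the alternative in
which the model contains more than $q$ variables. Based on the general
model

$$Y = m(\mathbf{X}) + \varepsilon, \qquad m(\mathbf{X}) = m_1(X_1) + m_2(X_2) + \ldots + m_p(X_p),$$

the following strategy is considered: for a subset of size $q$, a test is
considered for the null hypothesis

$$H_0(q): \sum_{j=1}^{p} I_{\{m_j \neq 0\}} \le q$$

vs. the general hypothesis

$$H_1: \sum_{j=1}^{p} I_{\{m_j \neq 0\}} > q.$$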
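The logic of testing whether a given subset already contains all relevant covariates can be sketched with a simplified bootstrap test in base R. This is a rough illustration under several assumptions, not the procedure implemented in `test()`: the subset `chosen` is fixed rather than re-selected in each bootstrap sample, the statistic (sum of absolute fitted values when the null-model residuals are regressed on the remaining covariates) and the Rademacher wild-bootstrap weights are choices made here for simplicity, and the helper name `boot_test` is hypothetical.

```r
# Simplified bootstrap test of "the covariates in 'chosen' are enough".
boot_test <- function(x, y, chosen, nboot = 200) {
  stat <- function(resp) {
    # Fit the null model with the chosen covariates only.
    null_fit <- lm(resp ~ ., data = x[, chosen, drop = FALSE])
    res <- residuals(null_fit)
    # Check how much of the residuals the remaining covariates explain.
    extra_fit <- lm(res ~ ., data = x[, -chosen, drop = FALSE])
    list(value = sum(abs(fitted(extra_fit))),
         fit = fitted(null_fit), res = res)
  }
  obs <- stat(y)
  boot_values <- replicate(nboot, {
    # Wild bootstrap: flip the sign of each null residual at random,
    # which generates responses that satisfy the null model.
    w <- sample(c(-1, 1), length(y), replace = TRUE)
    stat(obs$fit + obs$res * w)$value
  })
  mean(boot_values >= obs$value)   # bootstrap p-value
}

set.seed(123)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 + 1.5 * x$x2 + rnorm(n)   # two relevant covariates

p_too_few <- boot_test(x, y, chosen = 1)         # omits x2
p_enough  <- boot_test(x, y, chosen = c(1, 2))   # contains both
c(too_few = p_too_few, enough = p_enough)
```

A small p-value for a subset suggests that more covariates are needed, which mirrors how the number of variables is increased until the null hypothesis is no longer rejected.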
A list with two objects. The first one is a table containing
Hypothesis |
Number of the null hypothesis tested |
Statistic |
Value of the T statistic |
pvalue |
pvalue obtained in the testing procedure |
Decision |
Result of the test for a significance level of 0.05 |
The second element, nvar, indicates the number of variables that
have to be included in the model.
The detailed expressions of the formulas are described in the HTML help.
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
Sestelo, M., Villanueva, N. M. and Roca-Pardinas, J. (2013). FWDselect: an R package for selecting variables in regression models. Discussion Papers in Statistics and Operation Research, University of Vigo, 13/01.
    library(FWDselect)
    data(diabetes)
    x = diabetes[ ,2:11]
    y = diabetes[ ,1]
    test(x, y, method = "lm", cluster = FALSE, nboot = 5)

    ## for speedup = FALSE
    # obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
    #                   cluster = FALSE)
    # plot(obj2) # we choose q = 7 for the argument qmin
    # test(x, y, method = "lm", cluster = FALSE, nboot = 5,
    #      speedup = FALSE, qmin = 7)