Title: | Selecting Variables in Regression Models |
---|---|
Description: | A simple method to select the best model or best subset of variables using different types of data (binary, Gaussian or Poisson) and applying it in different contexts (parametric or non-parametric). |
Authors: | Marta Sestelo [aut, cre], Nora M. Villanueva [aut], Javier Roca-Pardinas [aut] |
Maintainer: | Marta Sestelo <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.1.0.9000 |
Built: | 2024-11-20 03:03:34 UTC |
Source: | https://github.com/sestelo/fwdselect |
The diabetes data is a data frame with 11 variables and 442 measurements. These are the data used in the Efron et al. (2004) paper. The data has been standardized to have unit L2 norm in each column and zero mean.
diabetes
diabetes is a data frame with 11 variables (columns). The first column of the data frame contains the response variable (diabetes$y), which is a quantitative measure of disease progression one year after baseline. The remaining columns contain the measurements of the ten explanatory variables (age, sex, body mass index, average blood pressure and six blood serum measurements) obtained from each of the 442 diabetes patients.
The original data are available in the lars package, see http://cran.r-project.org/web/packages/lars/.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32:407–499.
library(FWDselect)
data(diabetes)
head(diabetes)
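The description states that the columns were standardized to zero mean and unit L2 norm. As a minimal sketch of what that means, the base R code below applies the same kind of standardization to a small synthetic matrix. The matrix `raw` and its column names are made up for illustration and are not the diabetes data.

```r
# Synthetic stand-in for a covariate matrix (hypothetical, not the diabetes set)
set.seed(1)
raw <- matrix(rnorm(50 * 3, mean = 10, sd = 4), ncol = 3,
              dimnames = list(NULL, c("a", "b", "c")))

# Center each column, then divide it by its Euclidean (L2) norm
centered <- sweep(raw, 2, colMeans(raw))
l2_norms <- sqrt(colSums(centered^2))
standardized <- sweep(centered, 2, l2_norms, "/")

# Each column now has mean 0 and sum of squares 1 (up to rounding)
round(colMeans(standardized), 10)
round(colSums(standardized^2), 10)
```

Running `colSums(x^2)` on a covariate matrix is a quick way to check whether it was standardized in this way.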
Recorded SO2 values at different time instants. Each column of the dataset corresponds to the value of the series of bi-hourly SO2 means at instant t (5-minute time instants). The values in this dataset are greater than the maximum value permitted for atmospheric SO2.
episode
episode is a data frame with 19 variables (columns).
Response variable: SO2 value recorded at a specific time instant, in microg/m3N. This is the value to be predicted.
SO2 value recorded at a specific time instant, in this case instant zero, in microg/m3N.
SO2 value recorded at a specific time instant, in this case 5 minutes earlier, in microg/m3N.
SO2 value recorded at a specific time instant, in this case 10 minutes earlier, in microg/m3N.
...
data(episode)
head(episode)
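The dataset pairs a response with the values recorded at instant zero, 5 minutes earlier, 10 minutes earlier and so on. A rough sketch of how such a layout can be built from a single series with base R's `embed()` follows. The series, the column names (`response`, `lag0`, ...), the number of lags and the one-step prediction horizon are all assumptions made for illustration; they are not taken from the episode data.

```r
# Hypothetical 5-minute series (synthetic numbers, not the episode data)
series <- c(10, 12, 15, 14, 18, 21, 25, 24)

n_lags <- 3    # number of columns holding current and past values (assumption)
horizon <- 1   # how many steps ahead the response lies (assumption)

# embed() puts the most recent value first: columns are t, t-1, t-2
lagged <- embed(series, n_lags)
colnames(lagged) <- paste0("lag", 0:(n_lags - 1))

# Align each row of predictors with the value 'horizon' steps later
keep <- seq_len(nrow(lagged) - horizon)
design <- data.frame(response = series[(n_lags + horizon):length(series)],
                     lagged[keep, , drop = FALSE])
design
```

Each row of `design` then holds one value to predict together with the values that preceded it.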
FWDselect: Selecting Variables in Regression Models. This package introduces a simple method to select the best model using different types of data (binary, Gaussian or Poisson) and applying it in different contexts (parametric or non-parametric). The proposed method is a new forward stepwise-based selection procedure that selects a model containing a subset of variables according to an optimal criterion (obtained by cross-validation) and also takes into account the computational cost. Additionally, bootstrap resampling techniques are used to implement tests capable of detecting whether significant effects of the unselected variables are present in the model.
Package: | FWDselect |
Type: | Package |
Version: | 2.1.0 |
Date: | 2015-12-18 |
License: | MIT + file LICENSE |
FWDselect is short for “Forward selection”, which summarizes one of the package's major functionalities: providing a forward stepwise-based selection procedure. This software helps the user select relevant variables and evaluate how many of these need to be included in a regression model. In addition, it enables both numerical and graphical outputs to be displayed. The package includes several functions that enable users to select the variables to be included in linear, generalized linear or generalized additive regression models. Users can obtain the best combinations of q variables by means of the main function selection. Additionally, to obtain the results for more than one subset size, the qselection function can be applied; it returns a summary table showing the different subsets, selected variables and information criterion values. The object returned by qselection is the argument required by plot.qselection, which provides a graphical output. Finally, to determine the number of variables that should be introduced into the model, only the test function needs to be applied.
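To make the idea of a forward search driven by a cross-validated criterion concrete, here is a minimal base R sketch. It is a plain greedy forward search with a k-fold cross-validated residual variance for linear models, written only for illustration; it is not the package's algorithm, and the helper names `cv_variance` and `forward_select` as well as the synthetic data are hypothetical.

```r
# k-fold cross-validated residual variance of lm(y ~ chosen columns of x)
cv_variance <- function(x, y, vars, nfolds = 5) {
  folds <- rep(seq_len(nfolds), length.out = length(y))
  sq_err <- numeric(length(y))
  for (k in seq_len(nfolds)) {
    test_rows <- folds == k
    train <- data.frame(y = y[!test_rows], x[!test_rows, vars, drop = FALSE])
    fit <- lm(y ~ ., data = train)
    pred <- predict(fit, newdata = x[test_rows, vars, drop = FALSE])
    sq_err[test_rows] <- (y[test_rows] - pred)^2
  }
  mean(sq_err)
}

# Greedy forward search: at each step add the variable that lowers the criterion most
forward_select <- function(x, y, q, nfolds = 5) {
  chosen <- character(0)
  for (step in seq_len(q)) {
    candidates <- setdiff(names(x), chosen)
    scores <- vapply(candidates,
                     function(v) cv_variance(x, y, c(chosen, v), nfolds),
                     numeric(1))
    chosen <- c(chosen, names(which.min(scores)))
  }
  chosen
}

# Synthetic data: only x1 and x3 drive the response
set.seed(42)
x <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200), x4 = rnorm(200))
y <- 3 * x$x1 - 2 * x$x3 + rnorm(200, sd = 0.5)

forward_select(x, y, q = 2)
```

Scoring candidates on held-out folds rather than on the fitting data keeps the criterion from always favoring larger models.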
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
Burnham, K., Anderson, D. (2002). Model selection and multimodel inference: a practical information-theoretic approach. 2nd Edition Springer.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7:1-26.
Efron, B. and Tibshirani, R. J. (1993). An introduction to the Bootstrap. Chapman and Hall, London.
Miller, A. (2002). Subset selection in regression. Chapman and Hall.
Sestelo, M., Villanueva, N. M. and Roca-Pardinas, J. (2013). FWDselect: Variable selection algorithm in regression models. Discussion Papers in Statistics and Operation Research, University of Vigo, 13/02.
qselection object
This function plots the cross-validation information criterion for several subsets of size q chosen by the user.
## S3 method for class 'qselection'
plot(x = object, y = NULL, ylab = NULL, ...)
x | |
y | NULL |
ylab | NULL |
... | Other options. |
Simply returns a plot.
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
obj2 = qselection(x, y, qvector = c(1:9), method = "lm", criterion = "variance", cluster = FALSE)
plot(obj2)
Recorded SO2 values at different time instants. Each column of the dataset corresponds to the value of the series of bi-hourly SO2 means at instant t (5-minute time instants).
pollution
pollution is a data frame with 19 variables (columns).
Response variable: SO2 value recorded at a specific time instant, in microg/m3N. This is the value to be predicted.
SO2 value recorded at a specific time instant, in this case instant zero, in microg/m3N.
SO2 value recorded at a specific time instant, in this case 5 minutes earlier, in microg/m3N.
SO2 value recorded at a specific time instant, in this case 10 minutes earlier, in microg/m3N.
...
data(pollution)
head(pollution)
qselection summary
## S3 method for class 'qselection'
print(x = object, ...)
x | |
... | Other options. |
The function returns a summary table with the subsets of size q, their information criterion values and the chosen variables for each one. Additionally, an asterisk is shown next to the subset size that minimizes the information criterion.
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
obj2 = qselection(x, y, qvector = c(1:9), method = "lm", criterion = "variance", cluster = FALSE)
obj2
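The summary layout described above (one row per subset size, with an asterisk at the size that minimizes the criterion) can be mimicked in a few lines of base R. The sketch below uses made-up criterion values and variable names purely to illustrate the marking rule; it does not reproduce the package's print method or any real output.

```r
# Hypothetical criterion values for subset sizes 1 to 5 (made-up numbers)
sizes <- 1:5
criterion <- c(4.10, 3.25, 2.90, 2.95, 3.05)
selected <- c("x2", "x2, x5", "x2, x5, x1", "x2, x5, x1, x4", "x2, x5, x1, x4, x3")

# Mark the size whose criterion value is smallest
best <- which.min(criterion)
marker <- ifelse(seq_along(sizes) == best, "*", "")

summary_table <- data.frame(q = paste0(sizes, marker),
                            criterion = criterion,
                            selection = selected)
print(summary_table, row.names = FALSE)
```

Note that the rule assumes a criterion for which smaller is better; a criterion such as the coefficient of determination would need the opposite ordering.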
selection summary
## S3 method for class 'selection'
print(x = model, ...)
x | |
... | Other options. |
The function returns the best subset of size q and its information criterion value. In the case of seconds=TRUE this information is returned for each alternative model.
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
obj1 = selection(x, y, q = 1, method = "lm", criterion = "variance", cluster = FALSE)
obj1
Function that obtains the best variables for more than one subset size. It returns a table with the covariates chosen for each model and their information criterion values. Additionally, an asterisk is shown next to the subset size that minimizes the information criterion.
qselection(x, y, qvector, criterion = "deviance", method = "lm",
family = "gaussian", nfolds = 5, cluster = TRUE, ncores = NULL)
x | A data frame containing all the covariates. |
y | A vector with the response values. |
qvector | A vector with more than one variable-subset size to be selected. |
criterion | The information criterion to be used. Default is the deviance. Other functions provided are the coefficient of determination ( |
method | A character string specifying which regression method is used, i.e., linear models ( |
family | A description of the error distribution and link function to be used in the model: ( |
nfolds | Number of folds for the cross-validation procedure, for |
cluster | A logical value. If |
ncores | An integer value specifying the number of cores to be used in the parallelized procedure. If |
q | A vector of subset sizes. |
criterion | A vector of information criterion values. |
selection | Selected variables for each size. |
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
obj2 = qselection(x, y, qvector = c(1:9), method = "lm", criterion = "variance", cluster = FALSE)
obj2
q variables
Main function for selecting the best subset of q variables. Note that the selection procedure can be used with lm, glm or gam functions.
selection(x, y, q, prevar = NULL, criterion = "deviance", method = "lm",
family = "gaussian", seconds = FALSE, nmodels = 1, nfolds = 5,
cluster = TRUE, ncores = NULL)
x | A data frame containing all the covariates. |
y | A vector with the response values. |
q | An integer specifying the size of the subset of variables to be selected. |
prevar | A vector containing the number of the best subset of |
criterion | The information criterion to be used. Default is the deviance. Other functions provided are the coefficient of determination ( |
method | A character string specifying which regression method is used, i.e., linear models ( |
family | A description of the error distribution and link function to be used in the model: ( |
seconds | A logical value. By default, |
nmodels | Number of secondary models to be returned. |
nfolds | Number of folds for the cross-validation procedure, for |
cluster | A logical value. If |
ncores | An integer value specifying the number of cores to be used in the parallelized procedure. If |
Best model | The best model. If |
Variable name | Names of the variables. |
Variable number | Numbers of the variables. |
Information criterion | Information criterion used and its value. |
Prediction | The prediction of the best model. |
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
obj1 = selection(x, y, q = 1, method = "lm", criterion = "variance", cluster = FALSE)
obj1

# second models
obj11 = selection(x, y, q = 1, method = "lm", criterion = "variance",
seconds = TRUE, nmodels = 2, cluster = FALSE)
obj11

# prevar argument
obj2 = selection(x, y, q = 2, method = "lm", criterion = "variance", cluster = FALSE)
obj2
obj3 = selection(x, y, q = 3, prevar = obj2$Variable_numbers,
method = "lm", criterion = "variance", cluster = FALSE)
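The examples above use a Gaussian response with method = "lm". For a count response the same comparison of candidate covariates can be made with a Poisson GLM and its deviance. The sketch below uses only base R's `glm()` and `deviance()` on synthetic data; the covariate names are hypothetical, and it uses the in-sample deviance for brevity, which is an assumption of this sketch rather than how the package evaluates its criterion.

```r
# Synthetic count data: only 'dose' affects the Poisson mean (hypothetical names)
set.seed(7)
n <- 300
covariates <- data.frame(dose = runif(n), noise1 = runif(n), noise2 = runif(n))
counts <- rpois(n, lambda = exp(0.5 + 1.5 * covariates$dose))

# Residual deviance of a one-covariate Poisson GLM, for each candidate covariate
deviances <- vapply(names(covariates), function(v) {
  fit <- glm(counts ~ covariates[[v]], family = poisson)
  deviance(fit)
}, numeric(1))

deviances
names(which.min(deviances))
```

The covariate with the smallest deviance is the natural first pick for a subset of size one under this criterion.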
Function that applies a bootstrap-based test for covariate selection. It helps to determine the number of variables to be included in the model.
test(x, y, method = "lm", family = "gaussian", nboot = 50, speedup = TRUE,
qmin = NULL, unique = FALSE, q = NULL, bootseed = NULL, cluster = TRUE,
ncores = NULL)
x | A data frame containing all the covariates. |
y | A vector with the response values. |
method | A character string specifying which regression method is used, i.e., linear models ( |
family | A description of the error distribution and link function to be used in the model: ( |
nboot | Number of bootstrap repeats. |
speedup | A logical value. If |
qmin | By default |
unique | A logical value. By default |
q | By default |
bootseed | Seed to be used in the bootstrap procedure. |
cluster | A logical value. If |
ncores | An integer value specifying the number of cores to be used in the parallelized procedure. If |
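In a regression framework, let $X_1, X_2, \ldots, X_p$ be a set of $p$ initial variables and $Y$ the response variable. We propose a procedure to test the null hypothesis of $q$ significant variables in the model ($q$ effects not equal to zero) versus the alternative in which the model contains more than $q$ variables. Based on the general model

$$Y = m(\mathbf{X}) + \varepsilon \quad \text{where} \quad m(\mathbf{X}) = m_1(X_1) + m_2(X_2) + \ldots + m_p(X_p),$$

the following strategy is considered: for a subset of size $q$, a test is considered for the null hypothesis

$$H_0(q): \sum_{j=1}^{p} I_{\{m_j \neq 0\}} \le q$$

vs. the general hypothesis

$$H_1: \sum_{j=1}^{p} I_{\{m_j \neq 0\}} > q$$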
A list with two objects. The first one is a table containing
Hypothesis | Number of the null hypothesis tested |
Statistic | Value of the T statistic |
pvalue | p-value obtained in the testing procedure |
Decision | Result of the test for a significance level of 0.05 |
The second object, nvar, indicates the number of variables that have to be included in the model.
The detailed expressions of the formulas are given in the HTML help: http://cran.r-project.org/web/packages/FWDselect/FWDselect.pdf
Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.
Sestelo, M., Villanueva, N. M. and Roca-Pardinas, J. (2013). FWDselect: an R package for selecting variables in regression models. Discussion Papers in Statistics and Operation Research, University of Vigo, 13/01.
library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
test(x, y, method = "lm", cluster = FALSE, nboot = 5)

## for speedup = FALSE
# obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
# cluster = FALSE)
# plot(obj2) # we choose q = 7 for the argument qmin
# test(x, y, method = "lm", cluster = FALSE, nboot = 5,
# speedup = FALSE, qmin = 7)
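For readers who want intuition for a bootstrap test of "q variables are enough", the following base R sketch may help. It is only an illustration under stated assumptions: the statistic (how much the left-out covariates still explain of the null-model residuals), the sign-flipping wild bootstrap, and the helper names `leftover_stat` and `boot_test` are choices made here, not the statistic or resampling scheme implemented by the package.

```r
# Statistic: how much the left-out covariates still explain of the null-model residuals
leftover_stat <- function(res, x_rest) {
  fit <- lm(res ~ ., data = data.frame(res = res, x_rest))
  sum(fitted(fit)^2)
}

boot_test <- function(x, y, selected, nboot = 200) {
  x_sel <- x[, selected, drop = FALSE]
  x_rest <- x[, setdiff(names(x), selected), drop = FALSE]
  null_fit <- lm(y ~ ., data = data.frame(y = y, x_sel))
  res0 <- residuals(null_fit)
  t_obs <- leftover_stat(res0, x_rest)

  t_boot <- replicate(nboot, {
    # Wild bootstrap: flip residual signs at random to generate data under the null model
    y_star <- fitted(null_fit) + res0 * sample(c(-1, 1), length(y), replace = TRUE)
    fit_star <- lm(y_star ~ ., data = data.frame(y_star = y_star, x_sel))
    leftover_stat(residuals(fit_star), x_rest)
  })
  list(statistic = t_obs, pvalue = mean(t_boot >= t_obs))
}

# Synthetic data: x1 and x2 matter, x3 does not
set.seed(123)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 + 2 * x$x2 + rnorm(n)

# H0 "x1 alone is enough" should be rejected, since x2 also matters
res_q1 <- boot_test(x, y, selected = "x1")
# H0 "x1 and x2 are enough" holds for this synthetic data
res_q2 <- boot_test(x, y, selected = c("x1", "x2"))
c(q1 = res_q1$pvalue, q2 = res_q2$pvalue)
```

Testing increasing values of q until the null hypothesis is no longer rejected mirrors how the test output is meant to guide the choice of the number of variables.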