Title: | Nonparametric Multiple Expectile Regression via ER-Boost |
---|---|
Description: | Expectile regression is a nice tool for estimating the conditional expectiles of a response variable given a set of covariates. This package implements a regression tree based gradient boosting estimator for nonparametric multiple expectile regression, proposed by Yang, Y., Qian, W. and Zou, H. (2018) <doi:10.1080/00949655.2013.876024>. The code is based on the 'gbm' package originally developed by Greg Ridgeway. |
Authors: | Yi Yang [aut, cre] (http://www.math.mcgill.ca/yyang/), Hui Zou [aut] (http://users.stat.umn.edu/~zouxx019/), Greg Ridgeway [ctb, cph] |
Maintainer: | Yi Yang <[email protected]> |
License: | GPL-3 |
Version: | 1.5 |
Built: | 2025-03-26 05:34:09 UTC |
Source: | https://github.com/cran/erboost |
Fits ER-Boost Expectile Regression models.
erboost(formula = formula(data),
        distribution = list(name = "expectile", alpha = 0.5),
        data = list(),
        weights,
        var.monotone = NULL,
        n.trees = 3000,
        interaction.depth = 3,
        n.minobsinnode = 10,
        shrinkage = 0.001,
        bag.fraction = 0.5,
        train.fraction = 1.0,
        cv.folds = 0,
        keep.data = TRUE,
        verbose = TRUE)

erboost.fit(x, y,
        offset = NULL,
        misc = NULL,
        distribution = list(name = "expectile", alpha = 0.5),
        w = NULL,
        var.monotone = NULL,
        n.trees = 3000,
        interaction.depth = 3,
        n.minobsinnode = 10,
        shrinkage = 0.001,
        bag.fraction = 0.5,
        train.fraction = 1.0,
        keep.data = TRUE,
        verbose = TRUE,
        var.names = NULL,
        response.name = NULL)

erboost.more(object,
        n.new.trees = 3000,
        data = NULL,
        weights = NULL,
        offset = NULL,
        verbose = NULL)
formula |
a symbolic description of the model to be fit. The formula may
include an offset term (e.g. y~offset(n)+x). If keep.data=FALSE in the initial call to erboost then it is the user's responsibility to resupply the offset to erboost.more. |
distribution |
a list with a component name specifying the distribution and any additional parameters needed. For expectile regression use list(name="expectile",alpha=...), where alpha is the desired expectile level in (0,1); the default is 0.5. |
data |
an optional data frame containing the variables in the model. By
default the variables are taken from environment(formula), typically the environment from which erboost is called. If keep.data=TRUE in the initial call to erboost then erboost stores a copy with the object. |
weights |
an optional vector of weights to be used in the fitting process.
Must be positive but do not need to be normalized. If keep.data=FALSE in the initial call to erboost then it is the user's responsibility to resupply the weights to erboost.more. |
var.monotone |
an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome. |
n.trees |
the total number of trees to fit. This is equivalent to the
number of iterations and the number of basis functions in the additive
expansion. The default is 3000. Users should not always use the default value; instead, they should choose a value of n.trees appropriate for their data (see Details). |
cv.folds |
Number of cross-validation folds to perform. If cv.folds>1 then erboost, in addition to the usual fit, performs a cross-validation and calculates an estimate of generalization error, returned in cv.error. |
interaction.depth |
The maximum depth of variable interactions. 1 implies
an additive model, 2 implies a model with up to 2-way interactions, etc.
The default value is 3. Users should not always use the default value; instead, they should choose a value of interaction.depth appropriate for their data (see Details). |
n.minobsinnode |
minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight. |
shrinkage |
a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction. |
bag.fraction |
the fraction of the training set observations randomly
selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1 then running the same model twice will result in similar but different fits. erboost uses the R random number generator, so set.seed can be used to make the fit reproducible; alternatively, save the returned erboost.object with save. |
train.fraction |
The fraction of observations used for training: the first train.fraction * nrows(data) observations are used to fit the model and the remainder are used to compute out-of-sample estimates of the loss function. |
keep.data |
a logical variable indicating whether to keep the data and
an index of the data stored with the object. Keeping the data and index makes
subsequent calls to erboost.more faster, at the cost of storing an extra copy of the dataset. |
object |
an erboost object created from an initial call to erboost. |
n.new.trees |
the number of additional trees to add to object. |
verbose |
If TRUE, erboost will print out progress and performance indicators.
If this option is left unspecified for erboost.more then it uses verbose from object. |
x , y
|
For erboost.fit: x is a data frame or data matrix containing the predictor variables and y is a vector of outcomes. The number of rows in x must be the same as the length of y. |
offset |
a vector of values for the offset |
misc |
For erboost.fit: an R object that is simply passed on to the erboost engine. |
w |
For erboost.fit: a vector of weights with length equal to the number of observations in y. |
var.names |
For erboost.fit: a vector of strings containing the names of the predictor variables, with length equal to the number of columns of x. |
response.name |
For erboost.fit: a character string giving the name of the response variable. |
Expectile regression (Newey & Powell 1987) is a useful tool for estimating the conditional expectiles of a response variable given a set of covariates. This package implements a regression tree based gradient boosting estimator for nonparametric multiple expectile regression. The code is a modified version of the gbm package (https://cran.r-project.org/package=gbm) originally written by Greg Ridgeway.
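For reference, the asymmetric least squares loss that defines the alpha-expectile (Newey & Powell 1987) can be written as a small R function; this is only an illustrative sketch of the loss, not the package's internal implementation, and the name als.loss is hypothetical:

# asymmetric least squares (ALS) loss for the alpha-expectile:
# residuals above the fit are weighted by alpha, residuals below by 1 - alpha
als.loss <- function(y, f, alpha = 0.5) {
  r <- y - f
  mean(ifelse(r >= 0, alpha, 1 - alpha) * r^2)
}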
Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001).
In addition to many of the features documented in the Gradient Boosting Machine, erboost offers additional features, including an out-of-bag estimator for the optimal number of iterations and the ability to store and manipulate the resulting erboost object.
Concerning tuning parameters, interaction.depth and n.trees are two of the most important tuning parameters in erboost. Users should not always use their default values; instead, they should choose values of interaction.depth and n.trees appropriate for their data. For example, if n.trees, the maximal number of trees to fit, is set too small, the actual optimal number of trees for a particular dataset (best.iter as selected by erboost.perf in the Examples section) may exceed it, resulting in a sub-optimal model. Therefore, users should always fit the model with n.trees large enough that it exceeds the potential optimal number of trees. The same principle also applies to interaction.depth.
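As an illustration of this advice, the following sketch fits with a generous n.trees and lets erboost.perf pick the optimal iteration; dat is a hypothetical data frame with response Y, and the parameter values are only examples:

library(erboost)

# fit with a deliberately large n.trees so the optimum is not truncated
fit <- erboost(Y ~ ., data = dat,
               distribution = list(name = "expectile", alpha = 0.5),
               n.trees = 5000, interaction.depth = 3,
               shrinkage = 0.005, cv.folds = 5)

# estimated optimal number of trees; if this is close to n.trees,
# refit with a larger n.trees
best.iter <- erboost.perf(fit, method = "cv")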
erboost.fit
provides the link between R and the C++ erboost engine. erboost
is a front-end to erboost.fit
that uses the familiar R modeling formulas.
However, model.frame is very slow when there are many predictor variables; power users with many variables may prefer erboost.fit. For general practice, erboost is preferable.
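A minimal sketch of calling erboost.fit directly; here x (a predictor matrix or data frame) and y (a numeric response) are hypothetical, and the parameter values are only examples:

# bypass the formula interface and pass predictors and response directly
fit <- erboost.fit(x, y,
                   distribution = list(name = "expectile", alpha = 0.5),
                   n.trees = 3000, interaction.depth = 3,
                   shrinkage = 0.005,
                   var.names = colnames(x),
                   response.name = "y")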
erboost, erboost.fit, and erboost.more return an erboost.object.
Yi Yang [email protected] and Hui Zou [email protected]
Yang, Y. and Zou, H. (2015), “Nonparametric Multiple Expectile Regression via ER-Boost,” Journal of Statistical Computation and Simulation, 84(1), 84-95.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
https://cran.r-project.org/package=gbm
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
erboost.object, erboost.perf, plot.erboost, predict.erboost, summary.erboost
N <- 200
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=50)] <- NA
X4[sample(1:N,size=30)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
erboost1 <- erboost(Y~X1+X2+X3+X4+X5+X6,       # formula
    data=data,                                 # dataset
    var.monotone=c(0,0,0,0,0,0),               # -1: monotone decrease,
                                               # +1: monotone increase,
                                               #  0: no monotone restrictions
    distribution=list(name="expectile",alpha=0.5), # expectile
    n.trees=3000,                              # number of trees
    shrinkage=0.005,                           # shrinkage or learning rate,
                                               # 0.001 to 0.1 usually work
    interaction.depth=3,                       # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,                        # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,                      # fraction of data for training,
                                               # first train.fraction*N used for training
    n.minobsinnode = 10,                       # minimum total weight needed in each node
    cv.folds = 5,                              # do 5-fold cross-validation
    keep.data=TRUE,                            # keep a copy of the dataset with the object
    verbose=TRUE)                              # print out progress

# check performance using a 50% heldout test set
best.iter <- erboost.perf(erboost1,method="test")
print(best.iter)

# check performance using 5-fold cross-validation
best.iter <- erboost.perf(erboost1,method="cv")
print(best.iter)

# plot the performance
# plot variable influence
summary(erboost1,n.trees=1)         # based on the first tree
summary(erboost1,n.trees=best.iter) # based on the estimated best number of trees

# make some new data
N <- 20
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)

data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale
f.predict <- predict.erboost(erboost1,data2,best.iter)

# least squares error
print(sum((data2$Y-f.predict)^2))

# create marginal plots
# plot variable X1 after "best" iterations
plot.erboost(erboost1,1,best.iter)

# contour plot of variables 1 and 3 after "best" iterations
plot.erboost(erboost1,c(1,3),best.iter)

# do another 20 iterations
erboost2 <- erboost.more(erboost1,20,
                         verbose=FALSE) # stop printing detailed progress
These are objects representing fitted erboost models.
initF |
the "intercept" term, the initial predicted value to which trees make adjustments |
fit |
a vector containing the fitted values on the scale of regression function |
train.error |
a vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the training data |
valid.error |
a vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the validation data |
cv.error |
if cv.folds > 1 in the call to erboost, this component is a vector of length equal to the number of fitted trees containing a cross-validated estimate of the loss function for each boosting iteration; otherwise it is NULL. |
oobag.improve |
a vector of length equal to the number of fitted trees
containing an out-of-bag estimate of the marginal reduction in the expected
value of the loss function. The out-of-bag estimate uses only the training
data and is useful for estimating the optimal number of boosting iterations.
See erboost.perf. |
trees |
a list containing the tree structures. |
c.splits |
a list of all the categorical splits in the collection of
trees. If a tree has a split on a categorical variable, the splitting value refers to a component of c.splits. That component is a vector of length equal to the number of levels of the categorical variable: -1 sends the level to the left node, +1 sends it to the right node, and 0 indicates that the level was not present in the training data. |
The components listed above must be included in a legitimate erboost object.
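As a brief illustration of working with these components, a sketch assuming fit is an erboost object returned with train.fraction < 1 (so that valid.error is populated):

# training and validation loss at each boosting iteration
plot(fit$train.error, type = "l", xlab = "iteration", ylab = "loss")
lines(fit$valid.error, col = "red")

# initial ("intercept") prediction and fitted values on the training data
fit$initF
head(fit$fit)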
Yi Yang [email protected] and Hui Zou [email protected]
Estimates the optimal number of boosting iterations for an erboost object and optionally plots various performance measures.
erboost.perf(object, plot.it = TRUE, oobag.curve = FALSE, overlay = TRUE, method)
object |
an erboost object created from an initial call to erboost. |
plot.it |
an indicator of whether or not to plot the performance measures.
Setting plot.it=TRUE creates two plots: the first plots object$train.error (in black) and object$valid.error (in red) against the iteration number. |
oobag.curve |
indicates whether to plot the out-of-bag performance measures in a second plot. |
overlay |
if TRUE and oobag.curve=TRUE then a right y-axis is added to the training and test error plot and the estimated cumulative improvement in the loss function is plotted versus the iteration number. |
method |
indicates the method used to estimate the optimal number of boosting iterations. method="OOB" computes the out-of-bag estimate, method="test" uses the held-out test (validation) dataset to compute an out-of-sample estimate, and method="cv" extracts the optimal number of iterations using cross-validation, provided erboost was called with cv.folds > 1. |
erboost.perf
returns the estimated optimal number of iterations. The method
of computation depends on the method
argument.
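A minimal usage sketch, assuming fit was produced by erboost with bag.fraction < 1, train.fraction < 1, and cv.folds > 1, so that all three estimates are available:

best.oob  <- erboost.perf(fit, method = "OOB")   # out-of-bag estimate
best.test <- erboost.perf(fit, method = "test")  # held-out test set estimate
best.cv   <- erboost.perf(fit, method = "cv")    # cross-validation estimate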
Yi Yang [email protected] and Hui Zou [email protected]
Yang, Y. and Zou, H. (2015), “Nonparametric Multiple Expectile Regression via ER-Boost,” Journal of Statistical Computation and Simulation, 84(1), 84-95.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
https://cran.r-project.org/package=gbm
Plots the marginal effect of the selected variables by "integrating" out the other variables.
## S3 method for class 'erboost'
plot(x, i.var = 1, n.trees = x$n.trees, continuous.resolution = 100,
     return.grid = FALSE, ...)
x |
an erboost object fitted using a call to erboost. |
i.var |
a vector of indices or the names of the variables to plot. If
using indices, the variables are indexed in the same order that they appear
in the initial erboost formula. |
n.trees |
the number of trees used to generate the plot. Only the first
n.trees trees will be used. |
continuous.resolution |
The number of equally spaced points at which to evaluate continuous predictors. |
return.grid |
if TRUE then plot.erboost produces no graphics and only returns the grid of evaluation points and their average predictions. This is useful for customizing the graphics for special variable types or for higher-dimensional plots. |
... |
other arguments passed to the plot function |
plot.erboost
produces low dimensional projections of the
erboost.object
by integrating out the variables not included in the
i.var
argument. The function selects a grid of points and uses the
weighted tree traversal method described in Friedman (2001) to do the
integration. Based on the variable types included in the projection,
plot.erboost
selects an appropriate display choosing amongst line plots,
contour plots, and lattice
plots. If the default graphics
are not sufficient the user may set return.grid=TRUE
, store the result
of the function, and develop another graphic display more appropriate to the
particular example.
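For example, a sketch of using return.grid=TRUE to build a custom display; fit is assumed to be a fitted erboost object and best.iter an iteration count chosen with erboost.perf, both hypothetical here:

# get the evaluation grid for the first predictor instead of a plot
grid1 <- plot.erboost(fit, i.var = 1, n.trees = best.iter, return.grid = TRUE)

# first column: predictor values; second column: averaged predictions
plot(grid1[, 1], grid1[, 2], type = "l",
     xlab = names(grid1)[1], ylab = "partial dependence")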
Nothing, unless return.grid=TRUE, in which case plot.erboost produces no graphics and only returns the grid of evaluation points and their average predictions.
Yi Yang [email protected] and Hui Zou [email protected]
Yang, Y. and Zou, H. (2015), “Nonparametric Multiple Expectile Regression via ER-Boost,” Journal of Statistical Computation and Simulation, 84(1), 84-95.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
https://cran.r-project.org/package=gbm
J.H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics 29(5):1189-1232.
Predicted values based on an ER-Boost Expectile regression model object
## S3 method for class 'erboost'
predict(object, newdata, n.trees, single.tree = FALSE, ...)
object |
Object of class inheriting from erboost.object. |
newdata |
Data frame of observations for which to make predictions |
n.trees |
Number of trees used in the prediction. |
single.tree |
If single.tree=TRUE then predict.erboost returns only the predictions from the tree(s) specified in n.trees. |
... |
further arguments passed to or from other methods |
predict.erboost produces predicted values for each observation in newdata using the first n.trees iterations of the boosting sequence. If n.trees is a vector, then the result is a matrix in which each column represents the predictions from the erboost model with n.trees[1] iterations, n.trees[2] iterations, and so on.
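For illustration, a short sketch in which fit, newdata, and best.iter are hypothetical placeholders:

# single vector of predictions at the chosen iteration
pred <- predict(fit, newdata, n.trees = best.iter)

# matrix of predictions: one column per requested number of trees
pred.mat <- predict(fit, newdata, n.trees = c(500, 1000, best.iter))
dim(pred.mat)  # nrow(newdata) rows, 3 columns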
The predictions from erboost
do not include the offset term. The user may add the value of the offset to the predicted value if desired.
If object
was fit using erboost.fit
there will be no
Terms
component. Therefore, the user has greater responsibility to make
sure that newdata
is of the same format (order and number of variables)
as the one originally used to fit the model.
Returns a vector of predictions. By default the predictions are on the scale of f(x).
Yi Yang [email protected] and Hui Zou [email protected]
Helper functions for computing the relative influence of each variable in the erboost object.
relative.influence(object, n.trees)
permutation.test.erboost(object, n.trees)
erboost.loss(y, f, w, offset, dist, baseline)
object |
an erboost object created from an initial call to erboost. |
n.trees |
the number of trees to use for computations. |
y , f , w , offset , dist , baseline
|
For erboost.loss: the outcome, the predicted values, the observation weights, the offset, the distribution object, and a baseline loss value, respectively. |
This is not intended for end-user use. These functions offer the different
methods for computing the relative influence in summary.erboost
.
erboost.loss
is a helper function for permutation.test.erboost
.
Returns an unprocessed vector of estimated relative influences.
Yi Yang [email protected] and Hui Zou [email protected]
Yang, Y. and Zou, H. (2015), “Nonparametric Multiple Expectile Regression via ER-Boost,” Journal of Statistical Computation and Simulation, 84(1), 84-95.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
https://cran.r-project.org/package=gbm
J.H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics 29(5):1189-1232.
Computes the relative influence of each variable in the erboost object.
## S3 method for class 'erboost'
summary(object, cBars = length(object$var.names), n.trees = object$n.trees,
        plotit = TRUE, order = TRUE, method = relative.influence,
        normalize = TRUE, ...)
object |
an erboost object created from an initial call to erboost. |
cBars |
the number of bars to plot. If order=TRUE then only the variables with the cBars largest relative influences appear in the barplot; if order=FALSE then the first cBars variables appear in the plot. |
n.trees |
the number of trees used to generate the plot. Only the first
n.trees trees will be used. |
plotit |
an indicator as to whether the plot is generated. |
order |
an indicator as to whether the plotted and/or returned relative influences are sorted. |
method |
The function used to compute the relative influence. relative.influence is the default and is the same as that described in Friedman (2001). The other current (and experimental) choice is permutation.test.erboost, which randomly permutes each predictor variable, one at a time, and computes the associated reduction in predictive performance. |
normalize |
if FALSE then summary.erboost returns the unnormalized influence. |
... |
other arguments passed to the plot function. |
This returns the reduction attributable to each variable in the sum of squared error used to predict the gradient on each iteration. It describes the relative influence of each variable in reducing the loss function. See the references below for exact details on the computation.
Returns a data frame where the first component is the variable name and the second is the computed relative influence, normalized to sum to 100.
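A brief sketch comparing the two influence measures; fit and best.iter are hypothetical placeholders for a fitted erboost object and a chosen iteration count:

# default: Friedman's relative influence, normalized to sum to 100
infl <- summary(fit, n.trees = best.iter)

# experimental permutation-based measure, without normalization
infl.perm <- summary(fit, n.trees = best.iter,
                     method = permutation.test.erboost,
                     normalize = FALSE)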
Yi Yang [email protected] and Hui Zou [email protected]
Yang, Y. and Zou, H. (2015), “Nonparametric Multiple Expectile Regression via ER-Boost,” Journal of Statistical Computation and Simulation, 84(1), 84-95.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
https://cran.r-project.org/package=gbm
J.H. Friedman (2001). "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics 29(5):1189-1232.