Boosted Forest Regressor
-
class skgrf.ensemble.GRFBoostedForestRegressor(n_estimators=100, equalize_cluster_weights=False, sample_fraction=0.5, mtry=None, min_node_size=5, honesty=True, honesty_fraction=0.5, honesty_prune_leaves=True, alpha=0.05, imbalance_penalty=0, ci_group_size=2, tune_params=None, tune_n_estimators=50, tune_n_reps=100, tune_n_draws=1000, boost_steps=None, boost_error_reduction=0.97, boost_max_steps=5, boost_trees_tune=10, n_jobs=-1, seed=42)
GRF Boosted Regression implementation for scikit-learn.
Provides a sklearn regressor interface to the GRF C++ library using Cython.
- Parameters
n_estimators (int) – The number of tree regressors to train.
equalize_cluster_weights (bool) – Weight the samples such that clusters have equal weight. If False, larger clusters will have more weight. If True, the number of samples drawn from each cluster is equal to the size of the smallest cluster. If True, sample weights should not be passed on fitting.
sample_fraction (float) – Fraction of samples used in each tree. If ci_group_size > 1, the maximum allowed fraction is 0.5.
mtry (int) – The number of features to split on each node. The default is sqrt(p) + 20, where p is the number of features.
min_node_size (int) – The minimum number of observations in each tree leaf.
honesty (bool) – Use honest splitting (subsample splitting).
honesty_fraction (float) – The fraction of data used for subsample splitting.
honesty_prune_leaves (bool) – Prune estimation sample tree such that no leaves are empty. If False, trees with empty leaves are skipped.
alpha (float) – The maximum imbalance of a split.
imbalance_penalty (float) – Penalty applied to imbalanced splits.
ci_group_size (int) – The quantity of trees grown on each subsample. At least 2 is required to provide confidence intervals.
tune_params (list(str)) – A list of parameter names on which to perform tuning. Valid strings are “sample_fraction”, “mtry”, “min_node_size”, “honesty_fraction”, “honesty_prune_leaves”, “alpha”, “imbalance_penalty”.
tune_n_estimators (int) – The number of estimators to use in the tuning model.
tune_n_reps (int) – The number of forests used in the tuning model.
tune_n_draws (int) – The number of random parameter values for tuning model selection.
boost_steps (int) – The number of boosting iterations.
boost_error_reduction (float) – The fraction of the previous step’s error that must be estimated by the current boost step.
boost_max_steps (int) – The maximum number of boosting iterations
boost_trees_tune (int) – The number of trees used to test a new boosting step.
n_jobs (int) – The number of threads. Default is number of CPU cores.
seed (int) – Random seed value.
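The two numeric rules stated in the parameter list (the default mtry and the cap on sample_fraction) can be sketched in plain Python. This is only an illustration, not skgrf's own validation code: the helper names default_mtry and check_sample_fraction are hypothetical, and rounding up, capping mtry at the number of features, and requiring sample_fraction to lie in (0, 1] are assumptions.

```python
import math


def default_mtry(n_features):
    """Documented default: sqrt(p) + 20.

    Rounding up and capping at p are assumptions, not taken from the docs.
    """
    return min(math.ceil(math.sqrt(n_features) + 20), n_features)


def check_sample_fraction(sample_fraction, ci_group_size):
    """Raise ValueError if sample_fraction breaks the documented limit."""
    # Assumption: a fraction only makes sense in the interval (0, 1].
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    # Documented rule: with ci_group_size > 1 the maximum allowed is 0.5.
    if ci_group_size > 1 and sample_fraction > 0.5:
        raise ValueError("sample_fraction must be <= 0.5 when ci_group_size > 1")
    return sample_fraction


few_features = default_mtry(10)     # capped at the number of features
many_features = default_mtry(900)   # sqrt(900) + 20
```

With the defaults in the signature above (sample_fraction=0.5, ci_group_size=2), check_sample_fraction passes; raising sample_fraction would require setting ci_group_size=1.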
- Variables
n_features_in_ (int) – The number of features (columns) from the fit input X.
boosted_forests_ (dict) – The boosted regression forests.
mtry_ (int) – The mtry value determined by validation.
outcome_index_ (int) – The index of the grf train matrix holding the outcomes.
samples_per_cluster_ (list) – The number of samples to train per cluster.
clusters_ (list) – The cluster labels determined from the fit input cluster.
n_clusters_ (int) – The number of unique cluster labels from the fit input cluster.
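To make the cluster-related attributes concrete, here is a small sketch of the bookkeeping they describe. It assumes nothing about how skgrf computes these values internally: summarize_clusters and its dictionary keys are hypothetical names, and the only rule taken from the documentation is that, with equalize_cluster_weights=True, the number of samples drawn from each cluster equals the size of the smallest cluster.

```python
from collections import Counter


def summarize_clusters(cluster):
    """Summarize a sequence of cluster labels (hypothetical helper)."""
    if len(cluster) == 0:
        raise ValueError("cluster must not be empty")
    sizes = Counter(cluster)
    labels = sorted(sizes)
    return {
        "labels": labels,                          # unique cluster labels
        "n_clusters": len(labels),                 # number of unique labels
        "sizes": [sizes[label] for label in labels],
        # Documented rule for equalize_cluster_weights=True:
        # draw as many samples per cluster as the smallest cluster holds.
        "equalized_draw": min(sizes.values()),
    }


summary = summarize_clusters([0, 0, 0, 1, 1, 2, 2, 2, 2])
```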
-
fit(X, y, sample_weight=None, cluster=None)
Fit the grf forest using training data.
- Parameters
X (array2d) – training input features
y (array1d) – training input targets
sample_weight (array1d) – optional weights for input samples
cluster (array1d) – optional cluster assignments for input samples
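How many boosting steps a fit takes depends on boost_steps, boost_error_reduction and boost_max_steps. The sketch below is one plausible reading of those parameter descriptions, modeled on the stopping rule of the upstream grf boosted regression forest; it is an assumption, not skgrf's code, and the actual behavior may differ. The function choose_boost_steps and its step_errors input (an estimated error for each candidate step) are hypothetical.

```python
def choose_boost_steps(step_errors, boost_steps=None,
                       boost_error_reduction=0.97, boost_max_steps=5):
    """Return how many boosting steps to keep (illustrative sketch).

    step_errors[i] is a hypothetical error estimate after step i + 1.
    """
    # A fixed number of steps overrides the automatic rule.
    if boost_steps is not None:
        return boost_steps
    steps = 1  # the first forest is always fitted
    while steps < boost_max_steps and steps < len(step_errors):
        # Assumed rule: keep a new step only if its error is at most
        # boost_error_reduction times the previous step's error.
        if step_errors[steps] > boost_error_reduction * step_errors[steps - 1]:
            break
        steps += 1
    return steps


early_stop = choose_boost_steps([1.0, 0.9, 0.89, 0.5])
capped = choose_boost_steps([1.0, 0.5, 0.25, 0.1, 0.05, 0.01])
fixed = choose_boost_steps([1.0, 0.9], boost_steps=3)
```

Under this reading, boost_trees_tune would be the size of the small forest used to produce each trial error estimate before a full step is fitted.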
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
-
predict(X, boost_predict_steps=None)
Predict regression target for X.
- Parameters
X (array2d) – prediction input features
boost_predict_steps (int) – number of boost prediction steps
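The role of boost_predict_steps can be illustrated with the usual additive structure of boosting, in which each step's forest is fitted to the residuals of the steps before it and the final prediction is the sum of the per-step predictions. The sketch below assumes that structure and assumes that boost_predict_steps simply limits how many steps are summed; boosted_predict and step_predictions are hypothetical names, not part of skgrf.

```python
def boosted_predict(step_predictions, boost_predict_steps=None):
    """Sum per-step predictions, optionally using only the first steps.

    step_predictions is a list with one list of predictions per boosting
    step; every inner list has one value per sample.
    """
    n_steps = len(step_predictions)
    if boost_predict_steps is not None:
        # Assumption: never use more steps than were fitted.
        n_steps = min(boost_predict_steps, n_steps)
    used = step_predictions[:n_steps]
    # zip(*used) groups the values belonging to the same sample.
    return [sum(values) for values in zip(*used)]


per_step = [[1.0, 2.0], [0.5, -0.5], [0.25, 0.25]]
all_steps = boosted_predict(per_step)
first_two = boosted_predict(per_step, boost_predict_steps=2)
```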
-
score(X, y, sample_weight=None)
Return the coefficient of determination \(R^2\) of the prediction.
The coefficient \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters
X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
- Returns
score – \(R^2\) of self.predict(X) wrt. y.
- Return type
float
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
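The definition of \(R^2\) given for score can be checked by hand with a few lines of plain Python. This sketch covers only the unweighted, single-output case and does not guard against a zero total sum of squares; r2_from_definition is a hypothetical helper written for illustration, not the function scikit-learn calls.

```python
def r2_from_definition(y_true, y_pred):
    """Compute 1 - u / v for unweighted, single-output data."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares around the mean of y_true
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - u / v


y_true = [1.0, 2.0, 3.0]
perfect = r2_from_definition(y_true, [1.0, 2.0, 3.0])
constant = r2_from_definition(y_true, [2.0, 2.0, 2.0])
reversed_fit = r2_from_definition(y_true, [3.0, 2.0, 1.0])
```

The three cases match the text: a perfect fit scores 1.0, a constant prediction at the mean of y scores 0.0, and a worse-than-constant fit goes negative.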
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
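The <component>__<parameter> naming used by set_params, and the deep flag of get_params, can be illustrated with a toy class. This is a simplified sketch of the convention, not scikit-learn's implementation: ToyEstimator is hypothetical and stores its parameters in a plain dictionary.

```python
class ToyEstimator:
    """Minimal stand-in showing nested parameter names (illustrative only)."""

    def __init__(self, **params):
        self.params = dict(params)

    def get_params(self, deep=True):
        out = dict(self.params)
        if deep:
            # Expose parameters of nested estimators as name__subname.
            for name, value in self.params.items():
                if isinstance(value, ToyEstimator):
                    for sub_name, sub_value in value.get_params().items():
                        out[f"{name}__{sub_name}"] = sub_value
        return out

    def set_params(self, **params):
        for key, value in params.items():
            # Split only at the first "__": component, then its parameter.
            name, delim, sub_key = key.partition("__")
            if delim:
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self  # returning self allows chained calls


forest = ToyEstimator(n_estimators=100, seed=42)
wrapper = ToyEstimator(forest=forest, verbose=False)
result = wrapper.set_params(forest__n_estimators=200, verbose=True)
```

After the call, the nested forest holds n_estimators=200 and wrapper.get_params() reports it under the key forest__n_estimators.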