Week 11, Lecture 1#
This note is completed with the assistance of ChatGPT
Multivariate Statistics for Data Science - Detailed Lecture Summary:
1. Bagging in Decision Trees:#
Concept: Bagging (Bootstrap Aggregating) fits a tree to each of many bootstrap samples of the training data and aggregates their predictions to enhance accuracy and reduce variance.
Key Points:
The aggregated classifier is no longer a single tree, yet it typically outperforms any individual tree.
Bagging reduces variance without increasing bias.
Example: varying B (the number of bootstrap samples) and tracking the classification error shows the error typically decreasing and then stabilizing as B grows.
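The bagging procedure above can be sketched in a few lines. This is an illustrative toy example, not code from the lecture: trees are reduced to one-split stumps on a single predictor, and all names (fit_stump, bagged_predict) are invented for the sketch.

```python
# Minimal sketch of bagging with decision stumps on a toy 1-D dataset.
import random
from collections import Counter

random.seed(0)

# Toy data: x < 4 -> class 0, x >= 4 -> class 1.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

def fit_stump(Xb, yb):
    """Pick the threshold minimising training error on the bootstrap sample."""
    best = None
    for t in set(Xb):
        err = sum((x >= t) != c for x, c in zip(Xb, yb))
        if best is None or err < best[0]:
            best = (err, t)
    return best[1]

def bagged_predict(x, thresholds):
    """Majority vote over the B bootstrap stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

B = 25  # number of bootstrap samples
thresholds = []
for _ in range(B):
    idx = [random.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
    thresholds.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

preds = [bagged_predict(x, thresholds) for x in X]
```

Each stump has high variance (it depends heavily on which points were resampled), but the majority vote over B stumps is much more stable, which is exactly the variance reduction bagging aims for.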
2. Random Forests for Classification:#
Concept: An enhancement over bagging, random forests aim to decorrelate individual trees to further reduce variance.
Key Points:
Trees in bagging aren’t independent; random forests address this by selecting a random subset of predictors for each split.
The default number of predictors considered at each split is \( m = \sqrt{p} \), but this can be tuned.
The algorithm involves:
Drawing bootstrap samples.
Growing trees by selecting random predictors and making splits.
Aggregating results for predictions, either by averaging (regression) or majority voting (classification).
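The three steps above can be sketched as follows. Again a toy illustration rather than the lecture's code: each "tree" is a single split, and the decorrelation step is the random.sample call that restricts each split to m of the p predictors.

```python
# Sketch of the random-forest loop: bootstrap sample + random predictor
# subset per split. Trees are reduced to one-split stumps for brevity.
import math
import random

random.seed(1)

# Toy data: p = 4 predictors, only predictor 0 is informative.
p = 4
X = [[i, random.random(), random.random(), random.random()] for i in range(20)]
y = [int(row[0] >= 10) for row in X]

m = int(math.sqrt(p))  # predictors considered at each split (default m = sqrt(p))

def fit_tree(Xb, yb):
    """Grow a one-split 'tree': best (predictor, threshold) pair
    among a random subset of m predictors."""
    candidates = random.sample(range(p), m)  # the decorrelation step
    best = None
    for j in candidates:
        for t in {row[j] for row in Xb}:
            err = sum((row[j] >= t) != c for row, c in zip(Xb, yb))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best[1], best[2]

B = 50
forest = []
for _ in range(B):
    idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
    forest.append(fit_tree([X[i] for i in idx], [y[i] for i in idx]))

def predict(x):
    """Majority vote over the forest (classification)."""
    votes = sum(int(x[j] >= t) for j, t in forest)
    return int(votes > len(forest) / 2)
```

Because each split only sees a random subset of predictors, trees that would otherwise all latch onto the same strong predictor end up using different ones, making their errors less correlated than under plain bagging.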
3. Variable Importance in Decision Trees & Random Forests:#
Concept: Not all predictors contribute equally to the model’s decision-making. Their importance can be quantified.
Key Points:
Trees split data based on predictors, ideally focusing on the most informative ones.
Importance can be measured by:
The decrease in node impurity (often using the Gini impurity).
The decrease in prediction accuracy when a predictor's values are randomly permuted, evaluated on the out-of-bag (OOB) samples.
In Random Forests, the importance of a predictor is averaged across all trees.
Importance scores are often normalized, with higher values indicating more important predictors.
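The permutation idea can be shown directly: shuffle one predictor's column and measure how much accuracy drops. A minimal sketch with an invented two-predictor "model" (a fixed rule standing in for a fitted forest), assuming held-out rather than OOB data for simplicity:

```python
# Sketch of permutation importance: permute one column, measure accuracy drop.
import random

random.seed(2)

# Toy "fitted model": a fixed rule that only uses predictor 0.
def model(x):
    return int(x[0] >= 0.5)

X = [[random.random(), random.random()] for _ in range(200)]
y = [model(x) for x in X]  # labels generated by the rule, so baseline accuracy is 1.0

def accuracy(rows):
    return sum(model(x) == c for x, c in zip(rows, y)) / len(y)

def permutation_importance(j, n_repeats=10):
    """Average accuracy drop when column j is permuted."""
    drops = []
    for _ in range(n_repeats):
        col = [x[j] for x in X]
        random.shuffle(col)
        permuted = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, col)]
        drops.append(accuracy(X) - accuracy(permuted))
    return sum(drops) / n_repeats

imp0 = permutation_importance(0)  # informative predictor: large drop
imp1 = permutation_importance(1)  # ignored predictor: no drop
```

Permuting the informative predictor destroys its relationship with the response and the accuracy falls sharply; permuting a predictor the model never uses changes nothing, so its importance score is zero. In a random forest this drop is computed per tree on that tree's OOB samples and then averaged.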
4. Implementing Random Forests in R:#
Concept: Practical steps to compute a random forest in R.
Key Points:
The randomForest package in R is used. Data input involves specifying a formula indicating the response variable and the predictors.
The response variable should be formatted as a factor for classification tasks.
By default, the package uses \( m = \lfloor\sqrt{p}\rfloor \) predictors for each split in classification tasks (tunable via the mtry argument).
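Putting the points above together, a minimal R sketch might look like this; `mydata` and `Class` are placeholder names for the data frame and response column, not objects from the lecture.

```r
# Minimal sketch using the randomForest package (placeholder data names).
library(randomForest)

mydata$Class <- as.factor(mydata$Class)        # response as factor => classification
fit <- randomForest(Class ~ ., data = mydata,  # formula: response ~ predictors
                    ntree = 500,               # number of trees (B)
                    importance = TRUE)         # compute variable importance

print(fit)          # OOB error estimate and confusion matrix
importance(fit)     # importance scores per predictor
varImpPlot(fit)     # plot the importance scores
```

Because the response is a factor, randomForest runs in classification mode and uses majority voting; with a numeric response it would switch to regression and average the trees' predictions instead.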