[ML] XGBoost

yozzum · February 20, 2025

Machine Learning

Overview

Efficient computation through parallel processing and data binning.
Overfitting prevention through configurable regularization settings.
Constructs level-wise trees.
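
As a rough illustration of how these points map onto configuration, here is a minimal sketch assuming the Python `xgboost` package and its scikit-learn style parameter names. The values are illustrative, not tuned recommendations.

```python
# Illustrative settings only; the keys follow the parameter names used by
# the xgboost scikit-learn wrapper (e.g. XGBClassifier).
params = {
    "n_jobs": 4,                 # parallel processing: number of threads
    "tree_method": "hist",       # bin feature values before split finding
    "max_bin": 256,              # maximum number of bins per feature
    "reg_lambda": 1.0,           # L2 regularization on leaf weights
    "reg_alpha": 0.0,            # L1 regularization on leaf weights
    "gamma": 0.0,                # minimum gain required to keep a split
    "grow_policy": "depthwise",  # grow the tree level by level
}

for name, value in params.items():
    print(f"{name} = {value}")
```

With `xgboost` installed, a dictionary like this could be passed as `xgboost.XGBClassifier(**params)`.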

Splitting Thresholds

By default, XGBoost considers all the features for each split. However, by using hyperparameters such as colsample_bytree, colsample_bylevel, and colsample_bynode, it is possible to sample only a subset of features.
When searching for a split point (threshold), XGBoost does not examine every possible value individually. Instead, the feature values are grouped into several bins, and the optimal split point is chosen from the bin boundaries. This method is called the Histogram-based Split Finding Algorithm.
The split point is chosen based on Gain, which is calculated from the Similarity Scores of the nodes.
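
The points above can be sketched in plain Python. This is a simplified illustration, not XGBoost's actual implementation: it assumes the commonly presented classification formulas, Similarity = (sum of residuals)^2 / (sum of p(1 - p) + lambda) and Gain = Left + Right - Root, and it uses equal-width bins, whereas XGBoost builds its bins from quantiles. The names `similarity`, `best_split`, `n_bins`, and `colsample` are hypothetical helpers for this example.

```python
import random


def similarity(residuals, probs, lam=1.0):
    """Similarity score: (sum of residuals)^2 / (sum of p*(1-p) + lambda)."""
    total = sum(residuals)
    cover = sum(p * (1.0 - p) for p in probs)
    return total * total / (cover + lam)


def best_split(X, residuals, probs, n_bins=4, colsample=1.0, lam=1.0, seed=0):
    """Return (gain, feature_index, threshold) for the best binned split."""
    n_features = len(X[0])
    rng = random.Random(seed)
    # Sample a subset of features, in the spirit of colsample_bynode.
    k = max(1, round(colsample * n_features))
    features = sorted(rng.sample(range(n_features), k))

    root = similarity(residuals, probs, lam)
    best = (float("-inf"), None, None)
    for j in features:
        values = [row[j] for row in X]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # a constant feature cannot be split
        # Only the inner bin boundaries are tried as thresholds
        # (equal-width bins here; an assumption made for simplicity).
        edges = [lo + (hi - lo) * b / n_bins for b in range(1, n_bins)]
        for t in edges:
            left = [i for i, v in enumerate(values) if v < t]
            right = [i for i, v in enumerate(values) if v >= t]
            if not left or not right:
                continue
            gain = (
                similarity([residuals[i] for i in left],
                           [probs[i] for i in left], lam)
                + similarity([residuals[i] for i in right],
                             [probs[i] for i in right], lam)
                - root
            )
            if gain > best[0]:
                best = (gain, j, t)
    return best


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = [0, 0, 1, 1]
probs = [0.5] * len(y)                       # initial prediction
residuals = [yi - pi for yi, pi in zip(y, probs)]
gain, feature, threshold = best_split(X, residuals, probs)
print(gain, feature, threshold)
```

On this toy data the search picks feature 0 with threshold 2.5, because that boundary puts all negative residuals on one side and all positive residuals on the other.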

Steps

Compute the residuals

Compute the predicted value (initial default = 0.5)

Find the candidate split points

Compute the "Similarity Score" for each node

Compute the "Gain" for each split using the similarity scores

Select the feature and threshold with the largest "Gain"

Prune with "Gamma": remove the split if Gain - Gamma is negative

Update the predicted probability for each observation

Convert the previous predicted probability into an output value using log(odds)

Update the output value for each node

Convert the output value into the predicted probability using logistic function

Build the second tree using the new predicted probability
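
The steps above can be sketched as one boosting round with a single split. This is a minimal illustration under stated assumptions, not XGBoost's internal code: it assumes a binary classification setup, the leaf output formula Output = sum of residuals / (sum of p(1 - p) + lambda), and a learning rate `eta` that scales each leaf output before it is added to the log(odds). The function names and the fixed `threshold` argument are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: converts log(odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Converts a probability into log(odds)."""
    return math.log(p / (1.0 - p))


def similarity(residuals, probs, lam):
    return sum(residuals) ** 2 / (sum(p * (1 - p) for p in probs) + lam)


def leaf_output(residuals, probs, lam):
    return sum(residuals) / (sum(p * (1 - p) for p in probs) + lam)


def boost_once(x, y, probs, threshold, eta=0.3, lam=1.0, gamma=0.0):
    """One round: residuals -> gain -> prune -> updated probabilities."""
    residuals = [yi - pi for yi, pi in zip(y, probs)]
    everyone = list(range(len(x)))
    left = [i for i in everyone if x[i] < threshold]
    right = [i for i in everyone if x[i] >= threshold]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    gain = (
        similarity(pick(left, residuals), pick(left, probs), lam)
        + similarity(pick(right, residuals), pick(right, probs), lam)
        - similarity(residuals, probs, lam)
    )
    # Prune: if Gain - Gamma is negative, the node stays a single leaf.
    leaves = [everyone] if gain - gamma < 0 else [left, right]

    new_probs = list(probs)
    for idx in leaves:
        out = leaf_output(pick(idx, residuals), pick(idx, probs), lam)
        for i in idx:
            # Previous probability -> log(odds), add the scaled leaf output,
            # then convert back to a probability with the logistic function.
            new_probs[i] = sigmoid(log_odds(probs[i]) + eta * out)
    return new_probs, gain


x = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]
probs = [0.5] * len(y)  # initial prediction
new_probs, gain = boost_once(x, y, probs, threshold=2.5)
print(new_probs, gain)
```

The returned probabilities would be the starting point for the second tree: their residuals are computed and the same procedure is repeated.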