Decision Trees and Random Forests

더기덕 · April 2, 2022

Elements of a decision tree

  • Nodes
    - Split for the value of a certain attribute

  • Edges
    - Outcome of a split to next node

  • Root
    - The node that performs the first split

  • Leaves
    - Terminal nodes that predict the outcome

Intuitions behind the split

  • We try to choose the variable that splits the data most cleanly

Concept of Impurity

  • How do you define "clean"?
    - Entropy and information gain are the mathematical methods for choosing the best split. Refer to the reading assignment.
    - Information gain is the beginning (parent) entropy minus the weighted sum of the entropies of the terminal nodes

  • Example of Entropy
    - There's a node with 3 reds and 3 greens

    - Entropy is calculated as below:
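Since each class has probability 1/2, the entropy reaches its two-class maximum:

$$H = -\sum_i p_i \log_2 p_i = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$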

  • The Gini index could also be used:
    - How to calculate the Gini index, with a worked example, is shown below
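The Gini index of a node is one minus the sum of its squared class probabilities:

$$\mathrm{Gini} = 1 - \sum_i p_i^2$$

Applied to the same 3-red, 3-green node as above (reusing that node is an assumption for illustration): $\mathrm{Gini} = 1 - (0.5^2 + 0.5^2) = 0.5$.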

Information Gain

  • Calculating Information Gain (see the Python sketch below)

    - Beginning entropy: 0.815

    - Weighted entropy of the leaf nodes: 0.6075

    - Information gain: 0.815 - 0.6075 = 0.2075

  • We repeat this process until the information gain is less than a certain threshold (e.g. 0.1)
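A minimal Python sketch of this calculation; the class counts below are hypothetical, chosen only to illustrate the formula, not taken from the original example:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a node, given its class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()          # drop empty classes, normalize
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical split: a 9-vs-5 parent node split into two leaves
parent = [9, 5]
children = [[8, 1], [1, 4]]
print(entropy(parent))                     # ~0.940
print(information_gain(parent, children))  # ~0.359
```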

Information Gain Ratio (GR)

  • The more nodes you create, the higher the information gain gets. But is that a good model?
    - That's where GR comes in

  • How you calculate GR: divide the information gain by the split information, as shown below
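This is the standard C4.5 gain ratio; the split information (the denominator) is the entropy of the split itself, which penalizes splits that create many small branches:

$$GR = \frac{IG}{\mathrm{SplitInfo}}, \qquad \mathrm{SplitInfo} = -\sum_i \frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$

where the $D_i$ are the subsets of the data $D$ produced by the split.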

  • Example
    - An example model

    - The denominator (split information) of each model

    - Calculating the GR
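As a hypothetical stand-in for the example: suppose two candidate splits of 10 rows both give $IG = 0.4$. One splits the rows 5/5, the other into ten 1-row branches. Then $\mathrm{SplitInfo}(5/5) = 1$, so $GR = 0.4$, while the ten-branch split has $\mathrm{SplitInfo} = \log_2 10 \approx 3.32$, so $GR \approx 0.12$. The many-branch split is penalized even though its raw information gain is the same.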

Random Forest

  • Decision trees tend to overfit. Therefore, you create multiple decision trees and let them vote on the final prediction
  • This is one of the ensemble machine learning methods

Bagging

  • Selecting random features/rows from the data (the rows are sampled with replacement)

  • Bagging Features
    - If you choose among all features to split on, as in the traditional decision tree model, the trees are likely to start splitting on the same feature (the strongest feature)

    • In this case, the trees are likely to be highly correlated
    • Therefore, you randomly choose the candidate features at each split
    • A common choice for the number of features is the square root of the total number of features
  • Bagging Rows
    - You also do the same for rows, sampling them with replacement (bootstrap sampling); see the sketch below
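A minimal scikit-learn sketch of these ideas; the toy dataset is a hypothetical placeholder for real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=16, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees that vote on the prediction
    criterion="gini",     # impurity measure; "entropy" is also available
    bootstrap=True,       # each tree sees rows sampled with replacement
    max_features="sqrt",  # each split considers sqrt(n_features) random features
    random_state=42,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # accuracy of the majority vote on held-out rows
```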

The contents of this post belong to Jose Portilla's Python for Data Science and Machine Learning Bootcamp and 유나의 공부.
