Decision Tree to Classify Human Activity

Ji Kim·2021년 1월 24일

Machine Learning

목록 보기

Today, we will be using decision tree to perform classification prediction on UCI Machine Learning Repository's Human Activity Recognition dataset. The dataset is a compilation of various feature data collected after installing smartphone sensor to 30-participants. Based on the collected feature-set we will classify the type of action by each participant.

UCI Repository

After clicking Data Folder download UCI HAR and change its name to human_activity.

Now that we have downloaded the dataset, we will begin the preprocessing of the data.


# UCI Human Activity Recognition Using Smartphones
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

features_name_df = pd.read_csv('./human_activity/features.txt', sep='\s+',
                               header=None, names=['column_index', 'column_name'])

feature_name = feature_name_df.iloc[:, 1].values.tolist()

print('Extract 10 Feature Names : \n', feature_name[:10])


Extract 10 Feature Names : 
 ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z', 'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X']

Now let us check whether or not there are duplicate column names.


feature_dup_df = feature_name_df.groupby('column_name').count()

print(feature_dup_df[feature_dup_df['column_index'] > 1].count())
feature_dup_df[feature_dup_df['column_index'] > 1].head()


column_index    42
dtype: int64

There are total 42 duplicate feature-names, we will simply add _1 or _2 to initial feature names to create distinctive values.


def get_new_feature_name_df(old_feature_name_df):
    feature_dup_df = pd.DataFrame(data=old_feature_name_df.groupby('column_name').cumcount(),

    feature_dup_df = feature_dup_df.reset_index()

    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how='outer')
    new_feature_name_df['column_name'] = new_feature_name_df[['column_name', 'dup_cnt']].apply(lambda x: x[0]+'_'+str(x[1])
                                                                                                  if x[1] > 0 else x[0], axis=1)

    new_feature_name_df = new_feature_name_df.drop(['index'], axis=1)
    return new_feature_name_df

Now, we will create a function to manage dataset.


import pandas as pd

def get_human_dataset( ):
    feature_name_df = pd.read_csv('./human_activity/features.txt',sep='\s+',
    new_feature_name_df = get_new_feature_name_df(feature_name_df)
    feature_name = new_feature_name_df.iloc[:, 1].values.tolist()
    X_train = pd.read_csv('./human_activity/train/X_train.txt',sep='\s+', names=feature_name )
    X_test = pd.read_csv('./human_activity/test/X_test.txt',sep='\s+', names=feature_name)
    y_train = pd.read_csv('./human_activity/train/y_train.txt',sep='\s+',header=None,names=['action'])
    y_test = pd.read_csv('./human_activity/test/y_test.txt',sep='\s+',header=None,names=['action'])
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = get_human_dataset()


print('### Train Dataset Info \n')


### Train Dataset Info 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7352 entries, 0 to 7351
Columns: 561 entries, tBodyAcc-mean()-X to angle(Z,gravityMean)
dtypes: float64(561)
memory usage: 31.5 MB

There are total 561 float-type feature data which do not need any categorical encoding.




6    1407
5    1374
4    1286
1    1226
2    1073
3     986
Name: action, dtype: int64

Label data are fairly distributed without any bias.

Now, we will use DecisionTreeClassifier to classify the activity. We will set the hyper-parameters to default value and extract those afterwards.


from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt_clf = DecisionTreeClassifier(random_state=156), y_train)

pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print('Accuracy of DecisionTreeClassifier : {0:4f} \n'.format(accuracy))

# extract hyper parameter of DecisionTreeClassifier
print('DecisionTreeClassifier Hyper Parameter : \n', dt_clf.get_params())


Accuracy of DecisionTreeClassifier : 0.854768 

DecisionTreeClassifier Hyper Parameter : 
 {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 156, 'splitter': 'best'}

The result has 85.48% of accuracy. Now we will see how the tree-depth of Decision Tree influences the accuracy of the model. We studied previously that the tree extends its depth until the tree satisfies the condition. Hence, we will gradually change the maximum depth of the tree to and observe the shift in accuracy.


from sklearn.model_selection import GridSearchCV

params = {
    'max_depth' : [6,8,10,12,16,20,24]

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1), y_train)

print('GridSearchCV Best Accuracy Score : {0:4f}'.format(grid_cv.best_score_))
print('GridSearchCV Best Hyperparamter : ', grid_cv.best_params_)


GridSearchCV Best Accuracy Score : 0.852557
GridSearchCV Best Hyperparamter :  {'max_depth': 8}

GridSearchCV returns that the best hyper parameter is at maximum depth of 8 with accuracy of 85.26%. We will see how the accuracy changed based on the change in CV set.

Using GridSearchCV's cvresults, we will extract the mean test score of each steps.


cv_results_df = pd.DataFrame(grid_cv.cv_results_)

cv_results_df[['param_max_depth', 'mean_test_score']]

Using the result from GridSearchCV, we will estimate the accuracy of the Decision Tree in separate train-set.


max_depths = [6,8,10,12,16,20,24]

for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=156), y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print('{0} Max Depth Accuracy {1:4f}'.format(depth, accuracy))


6 Max Depth Accuracy 0.855786
8 Max Depth Accuracy 0.870716
10 Max Depth Accuracy 0.867323
12 Max Depth Accuracy 0.864608
16 Max Depth Accuracy 0.857482
20 Max Depth Accuracy 0.854768
24 Max Depth Accuracy 0.854768

As we have expected, maximum depth at 8 returns the highest accuracy score of 87.07%.

Now, let us also try tuning the accuracy by changing minimum amount of samples to split the leaf-node.


params = {
    'max_depth' : [8,12,16,20],
    'min_samples_split' : [16,24],

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1), y_train)

print('Grid Search CV Best Accuracy : {0:4f}'.format(grid_cv.best_score_))
print('Grid Search Best Hyper Parameter : ', grid_cv.best_params_)


Grid Search CV Best Accuracy : 0.855005
Grid Search Best Hyper Parameter :  {'max_depth': 8, 'min_samples_split': 16}

Now, we will use identical trained estimator to perform classification in separate test-set.


best_dt_clf = grid_cv.best_estimator_
pred1 = best_dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred1)

print('Decision Tree Classifier Accuracy {0:4f}'.format(accuracy))


Decision Tree Classifier Accuracy 0.871734

Now using barplot, we will visualize the importances of feature in the model.


import seaborn as sns

ftr_importances_values = best_df_clf.feature_importances_

ftr_importances = pd.Series(ftr_importances_values, index=X_train.columns)
ftr_top20 = ftr_importances.sort_values(ascending=False)[:20]

plt.title('Feature Importances Top 20')

sns.barplot(x=ftr_top20, y=ftr_top20.index)


if this then that

0개의 댓글