Decision Trees for Beginners — Full Python Tutorial
This tutorial walks you through a complete Decision Tree project using a loan prediction dataset. Copy each code block into a Jupyter Notebook cell and run the cells in order. Dataset file: loan_dataset.csv
1. Setup & Imports
<![CDATA[
# Import essential libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib
sns.set_style('whitegrid')
pd.set_option('display.max_columns', None)
]]>
Voice-over: "We begin by importing all necessary Python libraries for data handling, visualization, preprocessing, and modeling. Make sure these libraries are installed in your environment."
2. Load the Dataset
<![CDATA[
# Load dataset
file_name = "loan_dataset.csv"
if not os.path.exists(file_name):
    raise FileNotFoundError(f"Dataset '{file_name}' not found.")
df = pd.read_csv(file_name)
display(df.head())
]]>
Voice-over: "We load the CSV dataset into a Pandas DataFrame and display the first rows to verify that it loaded correctly."
3. Inspect the Data
<![CDATA[
# Check data types and missing values
df.info()
display(df.isnull().sum())
display(df.describe(include='all').T)
]]>
Voice-over: "We inspect the data types, missing values, and get basic statistics for both numeric and categorical features to understand our dataset."
4. Target Distribution
<![CDATA[
# Check target distribution
plt.figure(figsize=(6,4))
sns.countplot(x='loan_status', data=df)
plt.title('Loan Status Distribution (0 = Rejected, 1 = Approved)')
plt.show()
]]>
Voice-over: "We plot the target column 'loan_status' to visualize how many loans were approved or rejected."
5. Clean Column Names & Strings
<![CDATA[
# Strip whitespace and ensure target is integer
df.columns = [c.strip() for c in df.columns]
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip()
df['loan_status'] = df['loan_status'].astype(int)
df.head(3)
]]>
Voice-over: "We remove extra spaces from column names and string values, and make sure the target column is integer type."
6. Define Features and Target
<![CDATA[
# Define X and y
target = 'loan_status'
X = df.drop(columns=[target])
y = df[target]
# Numeric and categorical columns
numeric_cols = ['person_age', 'person_income', 'person_emp_exp', 'loan_amnt', 'loan_int_rate',
                'loan_percent_income', 'cb_person_cred_hist_length', 'credit_score']
categorical_cols = ['person_gender', 'person_education', 'person_home_ownership', 'loan_intent',
                    'previous_loan_defaults_on_file']
print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
]]>
Voice-over: "We separate the dataset into features X and target y. We also define numeric and categorical columns based on the dataset metadata."
7. Preprocessing Pipeline
<![CDATA[
# Create preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # sparse_output replaces the deprecated 'sparse' argument (scikit-learn >= 1.2)
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])
X_prepared = preprocessor.fit_transform(X)
print("Prepared X shape:", X_prepared.shape)
]]>
Voice-over: "We set up a preprocessing pipeline that imputes missing values and one-hot encodes categorical variables. This ensures the data is ready for modeling."
8. Train/Test Split
<![CDATA[
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_prepared, y, test_size=0.2, stratify=y, random_state=42
)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
]]>
Voice-over: "We split the data into training and test sets while preserving the class distribution using stratification."
9. Train Baseline Decision Tree
<![CDATA[
# Baseline Decision Tree
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print("Train accuracy:", accuracy_score(y_train, y_train_pred))
print("Test accuracy:", accuracy_score(y_test, y_test_pred))
]]>
Voice-over: "We train a baseline Decision Tree with default parameters and check the training and test accuracy."
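An unconstrained tree usually memorizes the training data (train accuracy near 1.0), so the single test score above can be noisy. A cross-validated estimate on the training set gives a more stable baseline; a minimal sketch:
<![CDATA[
# Cross-validated accuracy is a more stable estimate than one split
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
]]>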
10. Evaluation
<![CDATA[
# Classification report and confusion matrix
print(classification_report(y_test, y_test_pred))
cm = confusion_matrix(y_test, y_test_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
]]>
Voice-over: "We evaluate the model using precision, recall, and F1-score, and visualize the confusion matrix to analyze the errors it makes."
11. Hyperparameter Tuning
<![CDATA[
# Grid search for hyperparameters
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy']
}
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=cv, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train)
best_clf = grid.best_estimator_
print("Best parameters:", grid.best_params_)
]]>
Voice-over: "We tune the tree using grid search with cross-validation to optimize F1-score, selecting the best combination of depth, leaf size, and splitting criterion."
12. Final Evaluation
<![CDATA[
# Evaluate tuned model
y_pred_tuned = best_clf.predict(X_test)
print("Test accuracy (tuned):", accuracy_score(y_test, y_pred_tuned))
print(classification_report(y_test, y_pred_tuned))
]]>
Voice-over: "We evaluate the tuned Decision Tree on the test set, checking metrics and confirming improved performance compared to the baseline."
13. Tree Visualization
<![CDATA[
# Visualize top of the tree
num_features = numeric_cols
cat_features = preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_cols).tolist()
feature_names = num_features + cat_features
plt.figure(figsize=(20,10))
plot_tree(best_clf, feature_names=feature_names, class_names=['Rejected','Approved'], filled=True, max_depth=3, fontsize=10)
plt.show()
]]>
Voice-over: "We plot the upper levels of the tree to see the main decision paths that lead to loan approval or rejection."
14. Feature Importance
<![CDATA[
# Feature importances of the tuned tree (pandas and seaborn were imported in step 1)
importances = best_clf.feature_importances_
feat_imp = pd.DataFrame({'feature': feature_names, 'importance': importances})
feat_imp = feat_imp.sort_values('importance', ascending=False).head(20)
sns.barplot(data=feat_imp, x='importance', y='feature')
plt.title('Top Feature Importances')
plt.show()
display(feat_imp)
]]>
Voice-over: "We rank features by importance to understand which variables most influence loan decisions."
15. Save Model & Preprocessor
<![CDATA[
# Save model
joblib.dump(best_clf, 'decision_tree_model.joblib')
joblib.dump(preprocessor, 'preprocessor.joblib')
]]>
Voice-over: "We save the trained Decision Tree and preprocessing pipeline for future predictions or deployment."
16. Predict a New Applicant
<![CDATA[
# Example prediction (category strings must match the training data exactly;
# unseen categories are encoded as all zeros because of handle_unknown='ignore')
example_raw = {
    'person_age': 34,
    'person_gender': 'Male',
    'person_education': 'Bachelor',
    'person_income': 48000,
    'person_emp_exp': 5,
    'person_home_ownership': 'Rent',
    'loan_amnt': 12000,
    'loan_intent': 'Debt Consolidation',
    'loan_int_rate': 12.5,
    'loan_percent_income': 25,
    'cb_person_cred_hist_length': 8,
    'credit_score': 680,
    'previous_loan_defaults_on_file': 'No'
}
example_df = pd.DataFrame([example_raw])
example_df = example_df[X.columns]
example_transformed = preprocessor.transform(example_df)
pred = best_clf.predict(example_transformed)[0]
print("Predicted class (1=approved):", int(pred))
]]>
Voice-over: "Finally, we demonstrate predicting a new applicant's loan status. You can modify the values to test different scenarios."
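Beyond the hard 0/1 label, predict_proba reports the model's estimated approval probability; note that for a deep tree the leaf probabilities are often exactly 0 or 1:
<![CDATA[
# Probability of class 1 (approved) for the same applicant
proba = best_clf.predict_proba(example_transformed)[0, 1]
print(f"Estimated approval probability: {proba:.2f}")
]]>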
17. Outro
"And that’s it! You now have a complete Decision Tree model from raw dataset to predictions. The full notebook and dataset link are provided in the description. Experiment with new data and explore feature importance for deeper understanding."