6  Day 5: Model Evaluation

6.1 Basic Concepts

  • When evaluating a classification model, predictions are compared with the actual (ground truth) labels.

6.1.0.1 True Positive (TP)

  • The model correctly predicts the positive class.
    • Actual: 1 (positive).
    • Predicted: 1 (positive).

6.1.0.2 False Positive (FP) (Type I Error)

  • The model predicts the positive class, but the actual class is negative.
    • Actual: 0 (negative).
    • Predicted: 1 (positive).
    • Example: in disease detection, flagging a healthy person as a patient.

6.1.0.3 True Negative (TN)

  • The model correctly predicts the negative class.
    • Actual: 0 (negative).
    • Predicted: 0 (negative).
  • Indicates how well the model identifies the negative class.

6.1.0.4 False Negative (FN) (Type II Error)

  • The model predicts the negative class, but the actual class is positive.
    • Actual: 1 (positive).
    • Predicted: 0 (negative).
    • Example: in disease detection, failing to identify a patient who has the disease.

6.1.0.5 Confusion Matrix

  • The four outcomes above are typically summarized in a 2×2 confusion matrix (as in the standard Wikipedia figure), with the actual classes on one axis and the predicted classes on the other.
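
  • A minimal sketch of building this matrix with scikit-learn’s confusion_matrix, using the same toy labels as the examples later in this chapter:

from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and predictions (reused in the examples below)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# With labels ordered [0, 1], rows are actual classes and columns are
# predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Unpack the four cells
TN, FP, FN, TP = cm.ravel()
print(f"TP={TP}, FP={FP}, TN={TN}, FN={FN}")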

6.2 Accuracy

  • Accuracy is one of the most straightforward and commonly used metrics to evaluate the performance of a classification model.
  • It measures the proportion of correctly classified samples out of the total number of samples.

\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]

  • In terms of the confusion matrix, accuracy can also be expressed as:

\[ \text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Number of Samples (TP + TN + FP + FN)}} \]

  • Good
    • When the dataset is balanced, meaning the number of positive and negative samples is roughly equal.
    • When all misclassification errors (false positives and false negatives) are equally costly.
    • Easy to understand and implement.
  • No good
    • When the dataset is imbalanced: for example, if 95% of the samples belong to class 1, a model that always predicts the majority class achieves 95% accuracy, which is misleading (see the sketch after the code example below).
    • Accuracy treats all errors equally, which is not suitable when false positives and false negatives have different costs.

from sklearn.metrics import accuracy_score

# Ground truth labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# Count the four confusion-matrix cells by hand
TP = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
TN = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)
FP = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
FN = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Same result using scikit-learn
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
Accuracy: 80.00%
Accuracy: 80.00%
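
  • To make the imbalanced-data caveat above concrete, here is a minimal sketch using a hypothetical dataset in which 95% of the samples are class 1 and a trivial model that always predicts the majority class:

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 positives, 5 negatives
y_true = [1] * 95 + [0] * 5
# A trivial "model" that always predicts the majority class (1)
y_pred = [1] * 100

# Accuracy looks impressive (95%) ...
print(f"Accuracy: {accuracy_score(y_true, y_pred) * 100:.2f}%")
# ... yet the minority class is never identified (recall for class 0 is 0)
print(f"Minority-class recall: {recall_score(y_true, y_pred, pos_label=0):.2f}")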

6.3 Precision

  • Of all the samples predicted as positive, how many are truly positive?
  • It measures the proportion of true positive predictions out of all the samples that were predicted as positive.

\[ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \]

  • Precision is critical in scenarios where false positives (incorrectly predicted positives) have significant consequences, e.g., spam filtering, where a legitimate email flagged as spam is a costly mistake.

  • High precision ensures the model is not making too many false positive predictions, thus reducing “false alarms.”

  • Good
    • Ensures positive predictions are reliable.
    • Helps reduce “false alarms” in critical systems.
    • Precision is especially useful for imbalanced datasets where the positive class is rare.
  • No good
    • Precision does not account for cases where the model misses actual positives (FN).
    • In situations where identifying all positives is crucial (e.g., disease detection), precision alone is insufficient.

from sklearn.metrics import precision_score

# True labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# Predicted labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# Count TP and FP by hand
TP = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
FP = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
precision = TP / (TP + FP)
print(f"Precision: {precision * 100:.2f}%")

# Compute precision
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision * 100:.2f}%")
Precision: 80.00%
Precision: 80.00%

6.4 Recall

  • Of all the actual positive samples, how many did the model correctly identify?
  • Also known as sensitivity or the true positive rate (TPR).

\[ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \]

  • It focuses on the model’s ability to avoid missing actual positives, i.e., to avoid false negatives.
  • Recall is critical in scenarios where missing positive samples (FN) can have severe consequences.
    • Medical Diagnosis: Failing to identify a disease could lead to fatal consequences.
  • Recall and precision often trade off against each other (a threshold sweep illustrating this appears at the end of this section):
    • Pushing for high recall (predicting positive more liberally) tends to add false positives, lowering precision.
    • Pushing for high precision (predicting positive only when confident) tends to miss some actual positives, lowering recall.
  • Good
    • Recall ensures that the model does not miss positive samples.
    • In applications like medical diagnosis, recall is a priority since false negatives are costly.
  • No good
    • Recall does not account for how many false positives the model predicts.
    • This can lead to overly lenient models that classify many samples as positive to maximize recall.
    • In applications like spam detection, focusing solely on recall might result in excessive false positives (e.g., legitimate emails marked as spam).

from sklearn.metrics import recall_score

# True labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]

# Predicted labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]


# Calculate recall
TP = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
FN = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
recall = TP / (TP + FN)
print(f"Recall: {recall:.2f}")

# Compute recall
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2f}")
Recall: 0.80
Recall: 0.80
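
  • A minimal sketch of the precision-recall trade-off mentioned above, sweeping the decision threshold over hypothetical predicted probabilities (the scores are made up for illustration):

from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predicted probabilities for the positive class
y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.4, 0.8, 0.35, 0.2, 0.75, 0.55, 0.1, 0.65, 0.6]

# Lower thresholds favor recall; higher thresholds favor precision
for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in y_scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")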