Faithfulness Scores: Bridging the Gap Between Model Explanations and Ground Truth

Introduction

Modern machine learning models, particularly deep neural networks and Large Language Models (LLMs), are frequently described as “black boxes.” When these models make a prediction, we rarely understand the internal logic behind the decision. To solve this, researchers developed explainability methods like SHAP, LIME, and Integrated Gradients. However, there is a dangerous pitfall: an explanation can look convincing while being completely disconnected from how the model actually functions.

This is where faithfulness scores come in. A faithfulness score quantifies the extent to which an explanation accurately reflects the model’s true decision-making process. If an explanation is unfaithful, it is essentially a “hallucination” of the interpreter—a persuasive narrative that doesn’t actually describe the underlying logic. Understanding and measuring faithfulness is not just an academic exercise; it is a fundamental requirement for building reliable, auditable, and safe AI systems in fields like medicine, law, and finance.

Key Concepts

At its core, faithfulness (often referred to as “fidelity”) asks a simple question: If I change the parts of the input that the explanation marks as important, does the model’s output change in a corresponding way?

An explanation is considered faithful if it consistently tracks the model’s internal reasoning. If an explanation claims that a specific word in a document caused the model to classify it as “Spam,” then removing that word should cause the model’s prediction score to drop significantly. If the score remains unchanged, the explanation was unfaithful—it identified a feature that wasn’t actually driving the model’s behavior.

Faithfulness scores are typically calculated through perturbation analysis. You systematically mask or modify the features the explanation ranks as important and measure the change in the model’s output. If the model’s confidence collapses when those “important” features are removed, the explanation is highly faithful. If the model is unfazed, the explanation is misleading.
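
A minimal sketch of a single deletion test may make this concrete. The toy spam scorer, its word weights, and the helper names below are hypothetical stand-ins invented for illustration; a real pipeline would call your own model instead.

```python
import math

# Hypothetical toy "spam" model: a logistic score over made-up word weights.
WEIGHTS = {"free": 2.0, "winner": 1.5, "meeting": -1.0, "the": 0.0}
BIAS = -1.0

def spam_probability(words):
    """Return the toy model's spam probability for a list of words."""
    logit = BIAS + sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-logit))

def deletion_drop(words, target):
    """Confidence drop caused by deleting every occurrence of `target`."""
    original = spam_probability(words)
    perturbed = spam_probability([w for w in words if w != target])
    return original - perturbed

message = ["free", "winner", "the", "meeting"]

# A word a faithful explanation should flag: deleting it lowers the score a lot.
drop_free = deletion_drop(message, "free")
# A word with no influence on this toy model: deleting it changes nothing.
drop_the = deletion_drop(message, "the")

print(f"drop after deleting 'free': {drop_free:.3f}")
print(f"drop after deleting 'the':  {drop_the:.3f}")
```

An explanation that ranked “the” above “free” for this message would be unfaithful to this particular model, because the deletion test shows the ranking does not match what moves the score.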

Step-by-Step Guide: Measuring Faithfulness in Your Models

To implement faithfulness scoring in your pipeline, follow these practical steps:

Generate Explanations: Run your chosen interpretability method (e.g., LIME or SHAP) on your model to identify feature importance scores for a given input.
Establish a Baseline: Record the model’s prediction confidence for the original, unmodified input.
Perform Perturbation: Systematically remove features in descending order of their importance scores. Start by masking the most important feature, then the second, and so on.
Track the Output Drop: Observe how the model’s confidence score changes as you remove these features.
Calculate the Area Over the Perturbation Curve (AOPC): A faithful explanation will typically produce a sharp early decline in confidence as its top-ranked features are removed. Average the drop from the baseline confidence across the perturbation steps to quantify how well the explanation aligns with the model’s actual response; larger values indicate a more faithful ranking.
Compare against Randomization: Always compare your results to a “random” perturbation baseline in which features are removed in random order. If your explanation performs no better than random removal, its ranking carries little information about what drives the model.
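
The steps above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the logistic model, its weights, the zero mask value, and the use of weight times value as the importance score are all hypothetical choices, and AOPC is computed here using one common definition (the mean drop from the baseline over all perturbation steps).

```python
import math
import random

def model_confidence(x, weights, bias=0.0):
    """Hypothetical stand-in model: logistic regression over a feature list."""
    logit = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-logit))

def perturbation_curve(x, order, predict, mask_value=0.0):
    """Confidence after cumulatively masking features in the given order."""
    masked = list(x)
    curve = [predict(masked)]          # step 0: unmodified input (the baseline)
    for idx in order:
        masked[idx] = mask_value       # assumed masking strategy: replace with 0
        curve.append(predict(masked))
    return curve

def aopc(curve):
    """Mean drop from the baseline confidence over all perturbation steps."""
    baseline = curve[0]
    drops = [baseline - c for c in curve[1:]]
    return sum(drops) / len(drops)

def random_aopc(x, predict, trials=200, seed=0):
    """Average AOPC over random removal orders, used as the comparison baseline."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        order = list(range(len(x)))
        rng.shuffle(order)
        total += aopc(perturbation_curve(x, order, predict))
    return total / trials

# Made-up weights and input for illustration.
weights = [3.0, 0.1, 1.5, 0.0, 0.5]
x = [1.0, 1.0, 1.0, 1.0, 1.0]

def predict(v):
    return model_confidence(v, weights, bias=-2.0)

# Assumed importance scores. In practice these would come from SHAP, LIME, etc.
importance = [w * v for w, v in zip(weights, x)]
ranked = sorted(range(len(x)), key=lambda i: importance[i], reverse=True)

explanation_aopc = aopc(perturbation_curve(x, ranked, predict))
baseline_aopc = random_aopc(x, predict)

print(f"AOPC (explanation order): {explanation_aopc:.3f}")
print(f"AOPC (random order):      {baseline_aopc:.3f}")
```

The gap between the two numbers is the quantity of interest: the explanation earns credit only for the portion of the drop that random removal does not already achieve.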

Examples and Real-World Applications

The necessity of faithfulness becomes clear when examining high-stakes deployments.

Healthcare Diagnostics: Imagine an AI tool that predicts the risk of lung cancer from X-rays. The model may have learned to key on a watermark that happens to appear in many positive cases, while an unfaithful explanation highlights a plausible-looking region of lung tissue as the “reason” for the diagnosis. A clinician relying on this unfaithful explanation would be misled. By using faithfulness scores, developers can check that the highlighted regions really drive the prediction, which in turn reveals whether the model is looking at relevant clinical markers, like nodules or opacities, rather than noise or metadata.

Credit Scoring: In financial services, lenders are often subject to rules that require them to explain adverse decisions, sometimes described as a “Right to Explanation.” If a model denies a loan, the institution must explain why. If the explanation is unfaithful (claiming income was the primary factor when the model was actually biased by neighborhood demographics), the institution risks legal action and systemic unfairness. Faithfulness scoring acts as a diagnostic guardrail that helps verify the explanations provided to consumers reflect what the model actually did.

Common Mistakes

Confusing Interpretability with Faithfulness: Just because an explanation is easy for a human to read doesn’t mean it’s true. A visually appealing heat map can still be a complete fabrication of the model’s logic.
Ignoring Feature Interaction: Many simple perturbation tests remove features in isolation. However, deep learning models rely on complex interactions between features. If you remove features one by one without considering how they influence each other, your faithfulness score can be distorted in either direction: redundant features may each look unimportant on their own, while features that only matter in combination may be over- or under-credited.
Over-Reliance on Global Metrics: Faithfulness can vary drastically across different subsets of data. An explanation method might be faithful when the model classifies “clear” cases but unfaithful on noisy or edge-case data. Always report faithfulness scores across different data strata.
Ignoring the “OOD” Problem: When you mask or delete features, you often push the model into “Out-of-Distribution” (OOD) space. The model might behave erratically simply because it hasn’t seen the “masked” input before, not because the explanation was wrong. Use masking techniques that keep the perturbed input closer to the training distribution (like blurring or substituting reference values), and interpret large drops with this caveat in mind.
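
One way the feature-interaction pitfall can show up is with redundant features. The sketch below uses a deliberately artificial model (hypothetical, defined only for this example) that fires if either of two features is present, so single-feature removal sees no drop at all while joint removal sees a large one.

```python
def redundant_model(x):
    """Hypothetical model that fires if EITHER of the first two features is present."""
    return 0.9 if max(x[0], x[1]) > 0.5 else 0.1

def drop_when_masked(x, indices, predict, mask_value=0.0):
    """Confidence drop after masking all features in `indices` at once."""
    masked = list(x)
    for i in indices:
        masked[i] = mask_value
    return predict(x) - predict(masked)

x = [1.0, 1.0, 1.0]

# Removing features one at a time: each redundant feature covers for the other.
single_drops = [drop_when_masked(x, [i], redundant_model) for i in range(3)]

# Removing the two redundant features together exposes their joint importance.
joint_drop = drop_when_masked(x, [0, 1], redundant_model)

print("single-feature drops:", single_drops)
print("joint drop for features 0 and 1:", round(joint_drop, 3))
```

A test that only removes features in isolation would score an explanation highlighting features 0 and 1 as unfaithful here, even though those two features fully determine the output.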

Advanced Tips

To move beyond basic metrics, integrate these advanced strategies into your evaluation framework:

Use Sensitivity Analysis: Perform local sensitivity analysis to check if small, “meaningless” changes to the input lead to massive changes in the explanation. A stable, faithful explanation should not jump wildly due to minor image noise or tiny perturbations in text tokens.
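
A rough sketch of such a stability check follows. It assumes a toy logistic model and a gradient-times-input attribution, both hypothetical choices made for this example, and estimates the largest change in the explanation under small random input noise. The noise radius and sample count are arbitrary illustration values.

```python
import math
import random

WEIGHTS = [2.0, -1.0, 0.5]  # made-up weights for a toy logistic model

def predict(x):
    """Toy model output: sigmoid of a weighted sum."""
    logit = sum(w * v for w, v in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-logit))

def gradient_times_input(x):
    """Attribution for the toy model: d(output)/d(x_i) * x_i."""
    p = predict(x)
    return [w * p * (1.0 - p) * v for w, v in zip(WEIGHTS, x)]

def max_sensitivity(x, explain, radius=0.01, samples=50, seed=0):
    """Largest L2 change in the explanation under small uniform input noise."""
    rng = random.Random(seed)
    reference = explain(x)
    worst = 0.0
    for _ in range(samples):
        noisy = [v + rng.uniform(-radius, radius) for v in x]
        diff = [a - b for a, b in zip(explain(noisy), reference)]
        worst = max(worst, math.sqrt(sum(d * d for d in diff)))
    return worst

x = [1.0, 0.5, -0.2]
sensitivity = max_sensitivity(x, gradient_times_input)
print(f"max explanation change within radius 0.01: {sensitivity:.5f}")
```

What counts as “too sensitive” depends on the scale of the attributions, so it is usually more informative to compare this number across explanation methods on the same inputs than to read it in isolation.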

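Leverage “Faithfulness Metrics” Benchmarks: Rather than building your own metrics from scratch, utilize established tooling. Captum (for PyTorch) ships a metrics module with infidelity and max-sensitivity measures, and SHAP includes benchmarking utilities for its explainers. Standardized metrics like Sufficiency and Comprehensiveness, popularized by the ERASER benchmark for NLP rationales, are complementary: Sufficiency measures whether the identified features alone are enough to reproduce the prediction, and Comprehensiveness measures how much the prediction degrades once those features are removed, which indicates whether all important features were captured.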
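
To show what Comprehensiveness and Sufficiency compute, here is a small hand-rolled sketch rather than any library’s API. The logistic model, its weights, the zero mask value, and the chosen rationale (the set of features the explanation selected) are all assumptions made for illustration.

```python
import math

WEIGHTS = [3.0, 0.1, 1.5, 0.0, 0.5]  # made-up weights
BIAS = -2.0

def predict(x):
    """Hypothetical stand-in model: logistic regression."""
    logit = BIAS + sum(w * v for w, v in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-logit))

def comprehensiveness(x, rationale, predict, mask_value=0.0):
    """Confidence drop when the rationale features are REMOVED (higher is better)."""
    without = [mask_value if i in rationale else v for i, v in enumerate(x)]
    return predict(x) - predict(without)

def sufficiency(x, rationale, predict, mask_value=0.0):
    """Confidence drop when ONLY the rationale features are kept (lower is better)."""
    only = [v if i in rationale else mask_value for i, v in enumerate(x)]
    return predict(x) - predict(only)

x = [1.0, 1.0, 1.0, 1.0, 1.0]
rationale = {0, 2}  # features the (assumed) explanation marked as important

comp = comprehensiveness(x, rationale, predict)
suff = sufficiency(x, rationale, predict)

print(f"comprehensiveness: {comp:.3f}")
print(f"sufficiency:       {suff:.3f}")
```

A rationale with high comprehensiveness and low sufficiency both matters when removed and nearly reproduces the prediction on its own, which is the pattern a faithful explanation is expected to show.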

Human-AI Alignment Testing: Combine algorithmic faithfulness scores with human-in-the-loop testing. If your faithfulness score is high but humans still find the explanation confusing or illogical, it suggests that while the explanation is truthful, it is not communicative. Aim for a balance where the explanation is both accurate to the model and legible to the stakeholder.

Conclusion

Faithfulness scores are the “truth serum” for AI explainability. In an era where trust is the primary currency of AI adoption, we can no longer settle for explanations that are merely persuasive. We must demand that our models be accountable for their reasoning.

By implementing perturbation-based testing, avoiding common pitfalls like feature-isolation bias, and prioritizing rigorous evaluation, you transform interpretability from a marketing buzzword into a robust engineering practice. Whether you are building medical diagnostic tools or financial risk assessments, remember this: an explanation is only as good as its fidelity to the model’s internal reality. If you aren’t measuring faithfulness, you aren’t really explaining your model—you are just guessing.