steering – BossMind

Interpretability tools allow engineers to map internal activations to human-understandable concepts or features.

Steven Haynes | April 29, 2026 | May 9, 2026

Outline

Introduction: The “Black Box” problem and the shift toward mechanistic interpretability.
Key Concepts: Understanding neurons, features, and the dictionary…