Abstract
Mechanistic interpretability is the program of explaining what AI systems are doing in
terms of their internal mechanisms. I analyze some aspects of the program, set out some
concrete challenges, and assess progress to date. I argue for the importance of
propositional interpretability, which involves interpreting a system’s mechanisms and
behavior in terms of propositional attitudes: attitudes (such as belief, desire, or subjective
probability) to propositions (e.g. the proposition that it is hot outside). Propositional
attitudes are the central way that we interpret and explain human beings, and they are likely
to be central in AI
too. A central challenge is what I call thought logging: creating systems that log all of the
relevant propositional attitudes in an AI system over time. I examine currently popular methods
of interpretability (such as probing, sparse auto-encoders, and chain of thought methods) as
well as philosophical methods of interpretation (including those grounded in
psychosemantics) to assess their strengths and weaknesses as methods of propositional
interpretability.