Propositional interpretability in artificial intelligence

Abstract

Mechanistic interpretability is the program of explaining what AI systems are doing in terms of their internal mechanisms. I analyze some aspects of the program, along with setting out some concrete challenges and assessing progress to date. I argue for the importance of propositional interpretability, which involves interpreting a system’s mechanisms and behavior in terms of propositional attitudes: attitudes (such as belief, desire, or subjective probability) to propositions (e.g. the proposition that it is hot outside). Propositional attitudes are the central way that we interpret and explain human beings and they are likely to be central in AI too. A central challenge is what I call thought logging: creating systems that log all of the relevant propositional attitudes in an AI system over time. I examine currently popular methods of interpretability (such as probing, sparse auto-encoders, and chain of thought methods) as well as philosophical methods of interpretation (including those grounded in psychosemantics) to assess their strengths and weaknesses as methods of propositional interpretability.
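For readers unfamiliar with the probing methods the abstract mentions, the sketch below illustrates the basic idea: train a linear classifier on a model's hidden activations to test whether a given proposition is decodable from them. This is a minimal illustration, not code from the paper; the activations here are synthetic stand-ins, and the "belief direction" is a hypothetical construct used only to generate separable data.

```python
# Hypothetical sketch of a linear "probe": a classifier trained on a model's
# hidden activations to test whether a proposition (e.g. "it is hot outside")
# is linearly decodable from them. Activations below are synthetic stand-ins;
# in practice they would be extracted from a real network on labeled inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend hidden states: 1000 examples of 512-dimensional activations.
n, d = 1000, 512
labels = rng.integers(0, 2, size=n)      # 1 = proposition holds, 0 = it does not
direction = rng.normal(size=d)           # hypothetical "belief direction" in activation space
activations = rng.normal(size=(n, d)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
# High accuracy suggests the proposition is linearly encoded in the
# activations. It does not by itself identify the attitude taken toward
# the proposition (belief vs. desire vs. mere representation), which is
# part of the challenge propositional interpretability raises.
```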

Author

David Chalmers
New York University
