Abstract
Recent breakthroughs in artificial intelligence have primarily resulted from training deep neural networks (DNNs) with vast numbers of adjustable parameters on enormous datasets. Due to their complex internal structure, DNNs are frequently characterized as inscrutable ``black boxes,'' making it challenging to interpret the mechanisms underlying their impressive performance. This opacity creates difficulties for explanation, safety assurance, trustworthiness, and comparisons to human cognition, leading to divergent perspectives on these systems. This chapter examines recent developments in interpretability methods for DNNs, with a focus on interventionist approaches inspired by causal explanation in philosophy of science. We argue that these methods offer a more promising avenue for understanding how DNNs process information than purely behavioral benchmarking or correlational probing can provide. We review key interventionist methods and illustrate their application through practical case studies. These methods allow researchers to identify and manipulate specific computational components within DNNs, providing insights into their causal structure and internal representations. We situate these approaches within the broader framework of causal abstraction, which aims to align low-level neural computations with high-level interpretable models. While acknowledging current limitations, we contend that interventionist methods offer a path towards more rigorous and theoretically grounded interpretability research, potentially informing both AI development and computational cognitive neuroscience.