Abstract
Recent breakthroughs in artificial intelligence have primarily resulted from training deep neural networks (DNNs) with vast numbers of adjustable parameters on enormous datasets. Due to their complex internal structure, DNNs are frequently characterized as inscrutable ``black boxes,'' making it challenging to interpret the mechanisms underlying their impressive performance. This opacity creates difficulties for explanation, safety assurance, trustworthiness, and comparisons to human cognition, leading to divergent perspectives on these systems. This chapter examines recent developments in interpretability methods for DNNs, with a focus on interventionist approaches inspired by causal explanation in philosophy of science. We argue that these methods offer a more promising avenue for understanding how DNNs process information than purely behavioral benchmarking or correlational probing can provide. We review key interventionist methods and illustrate their application through practical case studies. These methods allow researchers to identify and manipulate specific computational components within DNNs, providing insights into their causal structure and internal representations. We situate these approaches within the broader framework of causal abstraction, which aims to align low-level neural computations with high-level interpretable models. While acknowledging current limitations, we contend that interventionist methods offer a path towards more rigorous and theoretically grounded interpretability research, potentially informing both AI development and computational cognitive neuroscience.