Disagreement, AI alignment, and bargaining

Philosophical Studies:1-31 (forthcoming)

Abstract

New AI technologies have the potential to cause unintended harms in diverse domains including warfare, judicial sentencing, biomedicine and governance. One strategy for realising the benefits of AI whilst avoiding its potential dangers is to ensure that new AIs are properly ‘aligned’ with some form of ‘alignment target.’ One danger of this strategy is that, depending on the alignment target chosen, our AIs might optimise for objectives that reflect the values of only a certain subset of society, and that do not take into account alternative views about what constitutes desirable and safe behaviour for AI agents. In response to this problem, several AI ethicists have suggested alignment targets that are designed to be sensitive to widespread normative disagreement amongst the relevant stakeholders. Authors inspired by voting theory have suggested that AIs should be aligned with the verdicts of actual or simulated ‘moral parliaments’ whose members represent the normative views of the relevant stakeholders. Other authors inspired by decision theory and the philosophical literature on moral uncertainty have suggested that AIs should maximise socially expected choiceworthiness. In this paper, I argue that both of these proposals face several important problems. In particular, they fail to select attractive ‘compromise options’ in cases where such options are available. I go on to propose and defend an alternative, bargaining-theoretic alignment target, which avoids the problems associated with the voting- and decision-theoretic approaches.

Author's Profile

Harry R. Lloyd
Yale University

Analytics

Added to PP
2024-11-19
