AI Alignment Problem: “Human Values” don’t Actually Exist

Download Edit this record How to cite View on PhilPapers
Abstract
Abstract. The main current approach to the AI safety is AI alignment, that is, the creation of AI whose preferences are aligned with “human values.” Many AI safety researchers agree that the idea of “human values” as a constant, ordered sets of preferences is at least incomplete. However, the idea that “humans have values” underlies a lot of thinking in the field; it appears again and again, sometimes popping up as an uncritically accepted truth. Thus, it deserves a thorough deconstruction, which I will do by listing and analyzing comprehensively the hidden assumptions of the idea that “humans have values.” This deconstruction of human values will be centered around the following ideas: “Human values” are useful descriptions, but not real objects; “human values” are bad predictors of behavior; the idea of a “human value system” has flaws; “human values” are not good by default; and human values cannot be separated from human minds. The method of analysis is listing hidden assumptions on which the idea of “human values” is built. I recommend that either the idea of “human values” should be replaced with something better for the goal of AI safety, or at least be used very cautiously. The approaches to AI safety which don’t use the idea of human values at all may require more attention, like the use of full brain models, boxing, and capability limiting.
PhilPapers/Archive ID
TURAAP
Upload history
Archival date: 2019-04-22
View other versions
Added to PP index
2019-04-22

Total views
312 ( #21,538 of 64,211 )

Recent downloads (6 months)
72 ( #9,526 of 64,211 )

How can I increase my downloads?

Downloads since first upload
This graph includes both downloads from PhilArchive and clicks on external links on PhilPapers.