AI Alignment Problem: “Human Values” don’t Actually Exist


Abstract. The main current approach to AI safety is AI alignment, that is, the creation of AI whose preferences are aligned with “human values.” Many AI safety researchers agree that the idea of “human values” as a constant, ordered set of preferences is at least incomplete. However, the idea that “humans have values” underlies much of the thinking in the field; it appears again and again, sometimes as an uncritically accepted truth. It therefore deserves a thorough deconstruction, which I undertake by comprehensively listing and analyzing the hidden assumptions on which the idea that “humans have values” is built. The deconstruction centers on the following claims: “human values” are useful descriptions, but not real objects; “human values” are bad predictors of behavior; the idea of a “human value system” has flaws; “human values” are not good by default; and human values cannot be separated from human minds. I recommend that the idea of “human values” either be replaced with something better suited to the goal of AI safety or, at least, be used very cautiously. Approaches to AI safety that do not use the idea of human values at all, such as full brain models, boxing, and capability limiting, may deserve more attention.
