Bigger Isn’t Better: The Ethical and Scientific Vices of Extra-Large Datasets in Language Models

WebSci '21: Proceedings of the 13th Annual ACM Web Science Conference (Companion Volume) (2021)

Abstract

The use of language models in Web applications and other areas of computing and business has grown significantly over the last five years. One reason for this growth is the improvement in the performance of language models on a number of benchmarks; a side effect of these advances, however, has been the adoption of a "bigger is always better" paradigm for the size of training, testing, and challenge datasets. Drawing on previous criticisms of this paradigm as applied to large training datasets crawled from pre-existing text on the Web, we extend the critique to challenge datasets custom-created by crowdworkers. We present several sets of criticisms in which ethical and scientific issues in language model research reinforce each other: labour injustices in crowdwork, dataset quality and inscrutability, inequities in the research community, and centralized corporate control of the technology. We also present a new type of tool that researchers can use to examine large datasets when evaluating them for quality.

Author Profiles

Trystan S. Goetze
Cornell University
Darren Abramson
Dalhousie University
