Bigger Isn’t Better: The Ethical and Scientific Vices of Extra-Large Datasets in Language Models

WebSci '21 Proceedings of the 13th Annual ACM Web Science Conference (Companion Volume) (2021)
Abstract
The use of language models in Web applications and other areas of computing and business has grown significantly over the last five years. One reason for this growth is the improved performance of language models on a number of benchmarks, but a side effect of these advances has been the adoption of a "bigger is always better" paradigm for the size of training, testing, and challenge datasets. Drawing on previous criticisms of this paradigm as applied to large training datasets crawled from pre-existing text on the Web, we extend the critique to challenge datasets custom-created by crowdworkers. We present several sets of criticisms, in which ethical and scientific issues in language model research reinforce each other: labour injustices in crowdwork, dataset quality and inscrutability, inequities in the research community, and centralized corporate control of the technology. We also present a new type of tool for researchers to use in examining large datasets when evaluating them for quality.
PhilPapers/Archive ID: GOEBIB
Archival date: 2021-06-22
