Abstract
Statistical fairness criteria are widely used for diagnosing and ameliorating algorithmic bias. However, these criteria are controversial, as their use raises several difficult questions. I argue that the major problems for statistical algorithmic fairness criteria stem from an incorrect understanding of their nature. These criteria are primarily used for two purposes: first, evaluating AI systems for bias, and second, constraining machine learning optimization problems in order to ameliorate such bias. The first purpose typically involves treating each criterion as a necessary condition for fairness. The second involves treating the criteria as sufficient conditions for fairness. Since the criteria are used in both roles, some researchers have treated them as both necessary and sufficient conditions, i.e., as definitions of algorithmic fairness. However, serious problems have been raised for these uses. Under ordinary circumstances, it is impossible to satisfy multiple criteria at the same time. Moreover, there are counterexamples to both the sufficiency and the necessity of each criterion for fairness. I argue that we should instead understand fairness criteria as merely providing evidence of fairness. In other words, satisfaction (or violation) of these criteria should be understood as potential evidence of fairness (or bias). Whether a criterion counts as evidence in a particular case will depend on stakeholders' background knowledge and the specific features of the system's task. This evidence account of fairness criteria provides guidance for recognizing both their appropriate uses and their limitations.