
What Does 'Fair' Actually Mean?

Fairness definitions, their incompatibilities, bias mitigation limits, and the emerging accountability landscape

Learning Objectives

By the end of this module you will be able to:

  • Define demographic parity, equalized odds, and calibration as distinct formal fairness criteria.
  • Explain why multiple fairness definitions cannot be simultaneously satisfied in most real-world settings.
  • Evaluate the effectiveness and known limitations of common bias mitigation techniques.
  • Describe the key elements of existing algorithmic accountability laws, including NYC Local Law 144 and U.S. fair lending frameworks.
  • Identify why "color-blind" algorithmic design can perpetuate rather than eliminate discrimination.

Core Concepts

The Three Main Fairness Definitions

Researchers and regulators use several competing mathematical definitions of algorithmic fairness. Understanding them separately is a prerequisite to understanding why they conflict.

Demographic Parity (also called statistical parity or group fairness) requires that an algorithm produce positive outcomes at equal rates across demographic groups — regardless of underlying differences in the population. A hiring algorithm satisfies demographic parity if it selects the same percentage of men and women. A criminal-justice algorithm satisfies it if it classifies equal proportions of Black and white defendants as low-risk. The definition is explicitly outcome-focused: it does not care whether one group has a higher base rate of the predicted characteristic.

Equalized Odds conditions on actual qualifications or outcomes rather than just final rates. It requires that an algorithm have equal true positive rates and equal false positive rates across groups. In a hiring context, equalized odds demands that the model is equally good at identifying qualified candidates from every demographic — it penalizes a model that achieves high overall accuracy by performing well on the majority group while making errors on the minority group. As FairLearn's documentation puts it, equalized odds "ensures that the algorithm is equally accurate at identifying qualified candidates from both advantaged and disadvantaged groups."

Calibration (predictive parity) requires that among all individuals who receive a given risk score, the proportion who actually have the predicted outcome is the same across groups. A risk score of 70% means the same thing for everyone, regardless of their race or gender.
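All three criteria reduce to simple conditional frequencies, so they can be made concrete in a few lines of Python. This is an illustrative sketch on hand-made toy data — the function names (`selection_rate`, `error_rates`, `outcome_rate_at_score`) are invented here, not taken from FairLearn or any other library.

```python
def selection_rate(preds, groups, group):
    """Demographic parity compares this across groups: the fraction of
    `group` members who receive a positive prediction."""
    members = [p for p, g in zip(preds, groups) if g == group]
    return sum(members) / len(members)

def error_rates(preds, labels, groups, group):
    """Equalized odds compares these across groups: (TPR, FPR) for one group."""
    tp = fn = fp = tn = 0
    for p, y, g in zip(preds, labels, groups):
        if g != group:
            continue
        if y == 1:
            tp += p
            fn += 1 - p
        else:
            fp += p
            tn += 1 - p
    return tp / (tp + fn), fp / (fp + tn)

def outcome_rate_at_score(scores, labels, groups, group, score):
    """Calibration compares this across groups: among `group` members who
    received `score`, the fraction who actually have the outcome."""
    hits = [y for s, y, g in zip(scores, labels, groups) if g == group and s == score]
    return sum(hits) / len(hits)

# Toy data: binary predictions for two groups of four applicants each.
preds  = [1, 0, 1, 0, 1, 0, 1, 0]
labels = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Demographic parity holds: both groups have a 0.5 selection rate.
print(selection_rate(preds, groups, "a"), selection_rate(preds, groups, "b"))
# Equalized odds fails: TPR is 2/3 vs 1.0, FPR is 0.0 vs 1/3.
print(error_rates(preds, labels, groups, "a"), error_rates(preds, labels, groups, "b"))
```

Note that the same eight predictions satisfy demographic parity while violating equalized odds — a first hint of the incompatibilities discussed below.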

Why three definitions?

Each definition encodes a different moral intuition about what discrimination means. Demographic parity asks: are outcomes equal? Equalized odds asks: is the algorithm equally accurate for everyone? Calibration asks: is the score equally meaningful for everyone? These are distinct questions, and they can give opposite verdicts on the same system.

The Mathematical Incompatibility

Here is the uncomfortable result: in most real-world settings, these definitions cannot all be satisfied at the same time.

Research by Corbett-Davies and colleagues, published as "Algorithmic Decision Making and the Cost of Fairness," demonstrated formally that when demographic groups differ in their underlying base rates — which is almost always the case for historically disadvantaged groups — achieving one fairness definition generally forces a violation of another. The Parity Measures chapter of the Fairness & Algorithmic Decision Making textbook states the formal condition precisely: if group membership (A) and the true outcome (Y) are not independent, demographic parity and equalized odds cannot both hold simultaneously.

This is not a software bug or a problem to be fixed with more data. It is a mathematical theorem. The choice of fairness metric is therefore a value choice, not a technical one: it encodes a policy commitment about what kind of discrimination society is most concerned with preventing.
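The theorem can be verified by hand with a small numeric instance. The numbers below are entirely hypothetical (not COMPAS data): a risk score that is perfectly calibrated for both groups still yields unequal true and false positive rates once the groups' base rates differ (0.50 vs. 0.26 here).

```python
# Each tuple is (risk_score, actual_outcome, group, number_of_people).
# By construction, a score of 0.8 means 80% reoffend and 0.2 means 20%,
# in BOTH groups -- i.e., the score is calibrated for both.
population = [
    (0.8, 1, "A", 40), (0.8, 0, "A", 10), (0.2, 1, "A", 10), (0.2, 0, "A", 40),
    (0.8, 1, "B", 8),  (0.8, 0, "B", 2),  (0.2, 1, "B", 18), (0.2, 0, "B", 72),
]

def calibration(group, score):
    """Fraction of `group` members with `score` who actually have the outcome."""
    total = sum(n for s, y, g, n in population if g == group and s == score)
    pos = sum(n for s, y, g, n in population if g == group and s == score and y == 1)
    return pos / total

def rates(group, threshold=0.5):
    """(TPR, FPR) for `group` when scores above `threshold` count as positive."""
    tp = sum(n for s, y, g, n in population if g == group and y == 1 and s >= threshold)
    fn = sum(n for s, y, g, n in population if g == group and y == 1 and s < threshold)
    fp = sum(n for s, y, g, n in population if g == group and y == 0 and s >= threshold)
    tn = sum(n for s, y, g, n in population if g == group and y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Calibration holds exactly: 0.8 and 0.2 mean the same thing in both groups.
print(calibration("A", 0.8), calibration("B", 0.8))   # 0.8 and 0.8
print(calibration("A", 0.2), calibration("B", 0.2))   # 0.2 and 0.2
# Equalized odds fails badly: TPR 0.8 vs ~0.31, FPR 0.2 vs ~0.03.
print(rates("A"), rates("B"))
```

No choice of threshold repairs this: because group B's base rate is lower, any calibrated score concentrates its false positives differently, which is exactly the structure of the COMPAS dispute discussed later in this module.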

Auditors and regulators must explicitly choose which definition of fairness to enforce, each encoding different policy commitments about what discrimination means.

There is a further tension between fairness and predictive accuracy. Research on criminal justice algorithms shows that the algorithm that maximizes predictive accuracy without fairness constraints is structurally different from the algorithm that achieves equal error rates across groups — and the practical cost of imposing fairness constraints can be substantial.

Color-Blindness Does Not Solve the Problem

A common intuition is that removing protected attributes — race, gender, zip code — from the input features should produce a neutral algorithm. This intuition is wrong in practice.

When an algorithm is "blinded" to race, it typically falls back on proxy variables: zip codes, credit scores, educational histories. These variables are themselves shaped by historical discrimination and residential segregation. Research on algorithmic fairness and color-blind racism demonstrates that "selecting the best candidates using race-blind algorithms requires incorporating, rather than ignoring, the biasing factor" — meaning removing race as a variable can increase discrimination by preventing the system from accounting for how historical inequities are embedded in the proxies it still uses.
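A toy simulation makes the proxy mechanism visible. All numbers here are synthetic and chosen only for illustration: two groups are unevenly distributed across two zip codes, and a decision rule that never consults group membership still produces a large outcome gap.

```python
# Synthetic applicant pool: group membership is correlated with residence
# (e.g., through historical residential segregation). 80% of group X lives
# in zip Z1; 80% of group Y lives in zip Z2.
applicants = (
    [{"group": "X", "zip": "Z1"}] * 80 + [{"group": "X", "zip": "Z2"}] * 20 +
    [{"group": "Y", "zip": "Z1"}] * 20 + [{"group": "Y", "zip": "Z2"}] * 80
)

def blind_rule(applicant):
    # The rule is "color-blind": it reads only the zip code, never the group.
    # But zip code is a proxy shaped by historical discrimination.
    return applicant["zip"] == "Z1"

def approval_rate(group):
    members = [a for a in applicants if a["group"] == group]
    return sum(blind_rule(a) for a in members) / len(members)

# Despite never seeing `group`, the rule approves 80% of X and 20% of Y.
print(approval_rate("X"), approval_rate("Y"))
```

The rule's author could truthfully say the algorithm "does not use race" — and the outcome gap would persist anyway, because the proxy carries the correlation.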

The proxy problem appears across domains: zip codes and credit scores in lending, educational histories in hiring, and cost-based proxies for health need in healthcare.


Compare & Contrast

Demographic Parity vs. Equalized Odds

|  | Demographic Parity | Equalized Odds |
| --- | --- | --- |
| What it measures | Equal selection rates across groups | Equal true positive and false positive rates |
| Conditions on | Nothing — just outcome rates | Actual qualification / true outcome |
| Intuition | "Equal representation in results" | "Equal accuracy for everyone" |
| When it's useful | When historical underrepresentation is the core concern | When predictive accuracy parity is the core concern |
| Risk | May require accepting lower predictive validity for one group | May require different thresholds per group, which some view as discriminatory |

The Corbett-Davies et al. paper showed that achieving demographic parity in criminal justice risk assessment requires applying race-specific risk thresholds — a result that many find ethically problematic in itself, illustrating the impossibility of a purely neutral solution.

Internal vs. External Audits

Regulatory accountability frameworks increasingly distinguish between audits conducted internally by an organization and those conducted by independent third parties. The distinction matters because:

  • Internal audits may have conflicts of interest, limited scope, and no obligation to publish results.
  • External independent audits reduce conflicts of interest and provide more credible evaluation. Brookings Institution research notes that "employers cannot rely solely on vendors' self-assessments of algorithmic bias."

The tradeoff is that external audits require access to proprietary systems — which creates its own accountability barriers (discussed below).


Worked Example

The COMPAS Risk Assessment Tool

COMPAS is a commercial recidivism-prediction tool used in U.S. criminal sentencing and bail decisions. It provides a 1–10 risk score using 137 input features. This tool illustrates several of the concepts in this module at once.

The accuracy question. A validation study by Dressel and Farid found that a simple linear model using only two features — defendant age and prior criminal history — achieved the same 67% predictive accuracy as COMPAS's 137-feature system. The added complexity of the commercial tool does not translate to better predictions.

The fairness conflict. ProPublica's 2016 investigation argued COMPAS violated equalized odds: it assigned higher false positive rates (incorrectly flagging non-reoffenders as high risk) to Black defendants and higher false negative rates to white defendants. Northpointe (the vendor) responded that the tool satisfied calibration: a score of 7 means roughly the same probability of reoffending regardless of race.

Both claims were mathematically correct. As Corbett-Davies and colleagues demonstrated, this is not a contradiction — it is a direct consequence of the fact that equalized odds and calibration cannot both be satisfied when base rates differ between groups. The debate was not about facts; it was about which fairness definition should govern criminal justice algorithms.

The accountability barrier. COMPAS's inner workings are proprietary and protected as a trade secret. Defendants subject to COMPAS scores have been denied the ability to challenge the algorithm's methodology in court. This is an example of what researchers call algorithmic opacity: corporate secrecy, technical complexity, and managerial invisibility combining to block meaningful oversight.

The lesson from COMPAS

When someone claims an algorithm is "fair," the next question to ask is: fair by which definition? A system can satisfy one definition perfectly while violating another. The choice of metric encodes a value judgment that is not visible in the headline claim.


Boundary Conditions

Where Bias Mitigation Works — and Where It Doesn't

Technical bias mitigation is real and has documented effects:

  • Fairness-aware techniques (re-weighting, adversarial debiasing, post-processing recalibration) applied to credit scoring models reduce bias metrics with average accuracy losses of less than 1.5% AUC, suggesting bias reduction and predictive accuracy are not fundamentally incompatible.
  • In healthcare, a scoping review found that 67% of studies implementing mitigation methods successfully increased fairness. Optum's Impact Pro algorithm showed an 84% reduction in racial bias after refactoring to remove cost-based health proxies.
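One of the techniques named above, re-weighting, can be sketched in a few lines. This follows the reweighing idea of Kamiran and Calders — assign each (group, label) cell a training weight that makes group membership and outcome look statistically independent — though the function name and data here are illustrative, not any library's API.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """For each (group, label) cell, return the weight
    P(group) * P(label) / P(group, label). Training on these weights makes
    the weighted data satisfy independence between group and outcome."""
    n = len(labels)
    p_group = Counter(groups)
    p_label = Counter(labels)
    p_joint = Counter(zip(groups, labels))
    return {
        (g, y): (p_group[g] / n) * (p_label[y] / n) / (p_joint[(g, y)] / n)
        for (g, y) in p_joint
    }

# Toy data: group "a" has a 75% positive rate, group "b" only 25%.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
# Over-represented cells (a,1) and (b,0) are down-weighted to 2/3;
# under-represented cells (a,0) and (b,1) are up-weighted to 2.
print(weights)
```

After applying these weights, each group's weighted positive rate equals the overall rate (0.5 here) — which is precisely why the technique targets demographic parity and, per the cross-metric caveats below, may leave other fairness metrics unimproved.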

But mitigation has hard limits:

1. Brittleness across contexts. Fairness-preserving algorithms are sensitive to small fluctuations in dataset composition. Survey research shows that gains in fairness tend to be unstable when the system is deployed in a context that differs from the training environment.

2. Cross-metric gaming. Optimizing for one fairness metric can worsen another. A study can report success at reducing bias — using one metric — while bias persists or worsens under a different metric. This is why healthcare research documents significant inconsistency in which metrics are used across studies, making cross-study comparison unreliable.

3. Aggregate metrics miss individual harm. Research on bias measurement limitations notes that standard aggregate statistical metrics may optimize for statistical parity while leaving substantive inequality in place. A model with equal false positive rates by group can still systematically harm the individuals in a community who are misclassified.

4. Structural causes require structural solutions. Survey research on debiasing makes the point plainly: "algorithmic bias fundamentally reflects broader structural injustices, meaning technical interventions alone cannot resolve what are ultimately societal problems requiring combined technical and political solutions."

5. Timing matters. Healthcare algorithm lifecycle research shows that bias is more easily mitigated when addressed early in the algorithm lifecycle — during problem formulation and data selection — than after a model has been built and deployed.

6. External validation gaps. Approximately 75% of deployed healthcare AI models have only been internally validated, never tested on diverse external populations. Bias that would be visible in diverse cohorts remains invisible.

What mitigation cannot do

Technical debiasing can reduce measurable bias on a chosen metric. It cannot resolve the value conflict between incompatible fairness definitions. It cannot compensate for structurally biased training data. And it cannot substitute for governance.


Quiz

Q1. A hiring algorithm selects 30% of female applicants and 30% of male applicants. Which fairness definition does this satisfy?

Answer Demographic parity — equal selection rates regardless of underlying qualification differences.

Q2. Two researchers analyze the same criminal justice algorithm. Researcher A concludes it is biased; Researcher B concludes it is fair. Both are using valid evidence. What is the most likely explanation?

Answer They are using different fairness definitions (e.g., equalized odds vs. calibration). Because these definitions are mathematically incompatible when base rates differ, a system can satisfy one while violating another. Neither researcher is wrong — they are answering different questions.

Q3. An employer removes race, gender, and nationality from the inputs of their resume-screening algorithm. Does this guarantee the algorithm will not discriminate?

Answer No. The algorithm will likely rely on proxy variables — educational institutions, zip codes, career histories — that are themselves correlated with race and gender through historical discrimination. Research shows that color-blind algorithmic design can increase disparity by preventing the system from accounting for how historical inequities are embedded in those proxies.

Q4. NYC Local Law 144 requires employers to do what before deploying an automated hiring tool?

Answer Conduct an independent third-party bias audit evaluating disparate impact across gender and race/ethnicity categories (including intersectional categories). Employers must also publicly disclose the audit report and give candidates at least 10 business days' notice before being evaluated, with an option to request an alternative assessment.

Q5. Research on the COMPAS tool found that equivalent predictive accuracy could be achieved with how many features, compared to the 137 features COMPAS uses?

Answer Two features: defendant age and prior criminal history. This finding raises questions about the legitimacy of commercially complex risk assessment tools.

Key Takeaways

  1. Fairness is not a single concept. Demographic parity, equalized odds, and calibration each measure something real — and something different. They encode distinct value commitments about what discrimination means.
  2. These definitions are mathematically incompatible in most real-world settings. Satisfying one often guarantees violating another, particularly when demographic groups differ in base rates. This is a theorem, not a gap to engineer around.
  3. Color-blind design does not produce neutral outcomes. Removing protected attributes shifts reliance to proxy variables that are themselves shaped by historical discrimination. In several documented cases, it increases rather than decreases disparity.
  4. Bias mitigation works within limits. Technical interventions can meaningfully reduce bias on specific metrics — but gains are brittle, metric-specific, and insufficient on their own when the root causes are structural.
  5. The governance landscape is emerging but has real gaps. NYC Local Law 144 is the first jurisdictional mandate requiring independent audits of hiring algorithms, but research has documented that vague definitions and employer discretion enable "null compliance" — avoiding scrutiny without achieving fairness.

Further Exploration

Technical Foundations

Policy and Governance

Historical and Sociological Context