By Tom Slee
My last post introduced the idea of algorithmic accountability. Now let’s break it down a bit and see what kinds of problems it tries to address.
Two powerful and accessible books that explore algorithmic accountability are “The Black Box Society” by law professor Frank Pasquale and, more recently, “Weapons of Math Destruction” by data scientist Cathy O’Neil. One Human Capital Management-related example from “Weapons of Math Destruction” sketches out many of the problems that emerged as machine-learning techniques were applied, too carelessly perhaps, to important decisions. O’Neil describes the use of algorithmic “value-added models” to rate the performance of American high-school teachers. The models sound plausible. First, use student exam grades to measure teacher effectiveness objectively. Second, acknowledging that not all classes are equal, scale the results by how well the students did the year before entering the class. If students come in with very high scores and leave the class with mediocre results, the teacher has done badly; if the students enter the class with poor results and improve, the teacher deserves a high rating.
Sounding plausible is one thing, but O’Neil presents strong evidence that value-added models have consistently failed as assessment techniques for several reasons:
- Bad statistics. Teachers teach only a small number of students each year, so the sample size is small. Measuring changes in achievement, rather than achievement itself, magnifies the need for a large sample if the results are to be statistically significant. One teacher O’Neil discusses got a rating of only 6 (out of 100) one year, and 96 the next: the kind of variation that happens if sample sizes are too small.
- Ill-defined scales. Before machine learning models make predictions or assessments, they are typically trained on a data set with known outcomes. But there is no absolute measure of teacher quality to use as a yardstick, so the model is not grounded in a solid foundation of well-calibrated data.
- Bad incentives. Management theorist Peter Drucker famously said “if you can’t measure it, you can’t improve it”. But for every memorable adage there is an equal and opposite memorable adage, and here it takes the form of Goodhart’s Law: “when a measure becomes a target, it ceases to be a good measure” (see also Campbell’s Law). Once the tests become the measure of success, teachers have strong incentives to teach to the test or, even more seriously, to cheat.
- Lack of transparency. The fact that the model was privately owned intellectual property meant that teachers could not appeal the score. Teachers were told that the model was outsourced to experts and algorithms and so was beyond appeal.
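The small-sample problem in the first bullet can be made concrete with a short simulation. This sketch is not the model O’Neil describes. It assumes a hypothetical setup in which every teacher is equally effective, each class has 25 students, and each student’s year-to-year score change is pure random noise. Teachers are then given a 0 to 100 percentile rating based on their class’s average gain, for two separate years. The function name and all parameter values are illustrative assumptions.

```python
import random
import statistics

def simulate_percentile_ratings(n_teachers=200, class_size=25, noise_sd=10.0, seed=0):
    """Return per-teacher percentile ratings (0-100) for two simulated years.

    Assumption: all teachers are identical (true effect = 0), so any
    difference between ratings comes from sampling noise alone.
    """
    rng = random.Random(seed)

    def one_year():
        # Mean score gain of each teacher's class for this year.
        gains = [
            statistics.fmean([rng.gauss(0.0, noise_sd) for _ in range(class_size)])
            for _ in range(n_teachers)
        ]
        # Convert mean gains into percentile ranks among all teachers.
        order = sorted(range(n_teachers), key=lambda t: gains[t])
        ratings = [0.0] * n_teachers
        for rank, teacher in enumerate(order):
            ratings[teacher] = 100.0 * rank / (n_teachers - 1)
        return ratings

    return one_year(), one_year()

year1, year2 = simulate_percentile_ratings()
swings = [abs(a - b) for a, b in zip(year1, year2)]
print(f"median year-to-year swing: {statistics.median(swings):.0f} percentile points")
print(f"largest swing: {max(swings):.0f} percentile points")
```

Because the simulated teachers are identical by construction, every swing in their ratings is noise. A jump like the one from 6 to 96 is the sort of result one would expect to see whenever noise of this kind dominates the measurement.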
The value-added model, at least as implemented in the cases O’Neil describes, displays several of the problems that are now becoming common concerns as we talk about algorithmic decision-making.
The roots of machine-learning bias
One additional problem, not mentioned yet, is the problem of bias. Machines may not suffer from the same biases that we humans have, but they have their own problems. Machine learning techniques may exacerbate bias in decision-making because of badly conceived models, because of unrecognized biases in training data, or because of disparate sample sizes across subgroups. It’s a challenge recently addressed by SAP’s Yvonne Baur, Brenda Reid, Steve Hunt, and Fawn Fitter in The Digitalist. An influential attempt to list the ways in which bias can creep into machine learning algorithms is in a paper called “Big Data’s Disparate Impact” by Solon Barocas and Andrew Selbst. They identify five mechanisms by which Big Data and the algorithms that process it may unfairly affect different groups:
- Target variables. As with the value-added model for teacher assessments, the ultimate goal of “quality” is not accessible, so a proxy is chosen instead. How is a successful hire to be identified? If performance reviews are chosen as a measure, then any bias in the company will naturally be propagated by the hiring algorithm. Longevity with the employer, and other combinations of measures, each have their own ambiguities.
- Training data. Just as target variables may have inbuilt bias, so may data used to train the model. In a dramatic but perhaps not surprising demonstration (not related to HCM), Aylin Caliskan, Joanna Bryson and Arvind Narayanan showed that an algorithm that learns word associations from training data containing large volumes of text mimics the stereotypes found in that text. Using social media data builds in other sources of bias. There is no easy escape.
- Feature selection. Features are the variables or attributes that an organization may build into a model. Do you include the reputation of the applicant’s university in the score for a job applicant? Or the ZIP code of their home address? Both may correlate with protected categories such as race.
- Proxies. Criteria that are genuinely relevant in making rational and well-informed decisions may also happen to serve as reliable proxies for membership in a protected class. As the authors put it: “Employers may find that they subject members of protected groups to consistently disadvantageous treatment because the criteria that determine the attractiveness of employees happen to be held at systematically lower rates by members of these groups.”
- Masking. All the above mechanisms can happen inadvertently and without intent, but they can also happen with intent if the employer has a prejudice, and the algorithm may then serve to mask their bias.
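The feature-selection mechanism above can be illustrated with a toy example. The sketch below is not taken from Barocas and Selbst. The applicants, ZIP codes, bonus, and threshold are all invented, and the scoring rule is a hypothetical one. The two groups have identical skill scores, and the rule never looks at group membership, but it gives a bonus for a ZIP code that happens to be more common in one group.

```python
from collections import defaultdict

# Hypothetical applicants: (group, zip_code, skill_score). Invented for illustration.
# Both groups have the same skill scores; only the ZIP code mix differs.
APPLICANTS = [
    ("A", "11111", 72), ("A", "11111", 62), ("A", "11111", 66), ("A", "22222", 58),
    ("B", "22222", 72), ("B", "22222", 62), ("B", "22222", 66), ("B", "11111", 58),
]

# Assumed scoring rule: skill plus a bonus for a "preferred" ZIP code.
ZIP_BONUS = {"11111": 10, "22222": 0}
THRESHOLD = 70

def selected(zip_code, skill):
    """Decide on an applicant without ever looking at group membership."""
    return skill + ZIP_BONUS[zip_code] >= THRESHOLD

def selection_rates(applicants):
    """Return the fraction of applicants selected within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [selected, total]
    for group, zip_code, skill in applicants:
        counts[group][0] += int(selected(zip_code, skill))
        counts[group][1] += 1
    return {group: chosen / total for group, (chosen, total) in counts.items()}

rates = selection_rates(APPLICANTS)
for group, rate in sorted(rates.items()):
    print(f"group {group}: {rate:.0%} selected")
```

In this invented data set the rule selects three of the four applicants in group A but only one of the four in group B, even though the skill scores are the same in both groups. Comparing selection rates across groups in this way is also one simple check that an audit of a scoring rule might run.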
Algorithmic accountability remedies
In response to these problems, calls for algorithmic accountability appeal to several related remedies.
- Explanation. At a minimum, there is the “right to explanation”. Experts such as Bryce Goodman and Seth Flaxman of Oxford University have argued that the European General Data Protection Regulation (GDPR), which is to take effect in 2018, establishes such a right, although other scholars disagree.
- Transparency. Closely related is the notion of transparency: that decision-makers cannot use the complexity and proprietary nature of many algorithmic models as a shield against inquiry.
- Audits. Also related is the notion of audits: the idea that algorithmic techniques, like other aspects of company operations, could be checked by independent third parties.
- Fairness. From within computer science, there have been attempts to address problems of unfairness and bias by building fairness requirements into the algorithms themselves, in a program that pioneer Cynthia Dwork has called Fairness Through Awareness.
Each of these concepts and programs has its own complexities. What’s more, there is of course no single set of rules that applies to all algorithms, industries, or jurisdictions. The US has taken a broadly sectoral approach to the problem (different rules for different industries) while the EU is pursuing a more general approach. Medical information, social media information, and data that is obviously “personally identifiable”, such as names and addresses, all introduce their own tangles and twists. There is no one set of rules about fairness and accountability that applies to HCM decisions, advertising, education, and the many other sectors where algorithms are becoming ubiquitous.
Machine learning algorithms are developing rapidly, but the social context in which they operate is also going to change as societies (and sections of society) debate the limits on how these algorithms should be used. The use and development of novel techniques carries promise, but it is inevitably accompanied by risks: legal risks from individual cases, policy risks where limits may be placed on business models, and reputation risks where even legal algorithms may be perceived as unfair by the public. Just as businesses have had to invest in understanding the norms, rules, and policy frameworks that shape privacy expectations, so businesses should invest in understanding the norms, rules, and policy frameworks that govern how machine learning can be applied.