Why Amazon's AI-driven high volume hiring project failed

June 28, 2020

Viktor Nordmark

Envision the dream of a well-functioning recruiting AI: one that supports you, a recruiter, with every monotonous task you might have, all while making hiring pure bliss. You’d be able to operate quicker than ever and hire all the right people on a whim.

At least that’s what Amazon was hoping for as they set up a team of engineers in Edinburgh, Scotland in 2014.

The team had one goal: finding a way to automate recruiting. Three years later, Amazon deemed the project a complete failure and abandoned it. What happened during these three years, and what went wrong? And, even more importantly, how can it be avoided?

Why Amazon wanted to use AI in recruiting

Amazon has long been big on automation. Their e-commerce runs on different types of automation tools and products, ranging from Alexa (virtual assistant AI technology) and their recommendation engine (“People who bought this also bought”) all the way down to their warehouses, where they recently installed new packing machines that pack at four to five times the speed of a human packer. So when the timing was right, i.e., computing power costs were low and machine learning (ML) was starting to become a viable technology, Amazon decided it was time to automate recruitment, mainly for internal use to start with. Amazon was growing rapidly and a hiring spree was imminent.

Amazon started off in 1995 as an online bookshop with 11 employees; fast forward to 2020 and they have 840,000 employees, plus an additional 100,000 new hires in recent months due to the coronavirus. This exponential growth in hiring helps one understand the eagerness to automate their hiring process. Despite their good intent, the project was executed too quickly and too sloppily to yield any value other than becoming a globally acknowledged disaster from which, hopefully, many in the AI recruitment field have learned a lot.

The development of the Amazon AI recruiter

A key feature of their e-commerce is ratings. Any product can be rated from one to five stars, which gives a good clue as to how well-liked it is. When Amazon built their experimental hiring tool, they created something similar, but for people.

The AI was supposed to rank job candidates from one to five stars in order to quickly identify top talent and discard the rest. To achieve this, they created approximately 500 computer models that would trawl through past candidates’ resumés and pick up on about 50,000 key terms that showed up in them. The computer models were focused on specific job functions and locations, and the engineers trained the AI system by using the top-performing employees’ resumés as models for what they sought in new hires. They then matched those resumés against a large number of resumés from the past ten years. But the outcome simply wasn’t that good.

According to Reuters, “the algorithms learned to assign little significance to skills that were common across IT applicants, such as the ability to write various computer codes.” In practice, it turned out that the algorithms favoured verbs more commonly found on male engineers’ resumés, such as “captured” and “executed”. The underlying problem was that the majority of the resumés from successful employees had one decisive thing in common: almost all of them came from men. This leads us to the main issue: Amazon’s AI technology favoured men and thus discriminated against women.

As AI and machine learning became a “thing”, many following the development hoped for an unbiased technology, one that would save humans from discriminating against each other. The hope was for AI to deliver unbiased results in virtually all areas in which human decision-making and its attached bias can have discriminatory consequences. As the Amazon engineers progressed in their creation of the AI recruiter, they didn’t consider what implicit bias the data they fed the AI might hold. As a result, the resumés they fed the AI came mainly from men.

It’s no news that men are hired most frequently within most STEM professions as well as in tech companies, and Amazon is no exception.

As the company began to use the AI for its purpose, they soon noticed that hardly any women were suggested by it. They also saw that the AI downgraded resumés containing the word “women’s” and filtered out women who had attended women-only colleges.

Amazon abandoned the project in 2017 and, as late as October 2018, was only using a “watered-down” version of it. Whether the software is still in use today is unclear. But if a leading tech company can’t use AI as a bulletproof way of hiring, which industry could? And should we? Let’s dive deeper into how bias affects humans and, in turn, AI, and into our attempts at working with AI to our benefit.

Human bias in AI training data

Let’s leave Amazon for a moment and instead cross paths with AI in the US judicial system. COMPAS is an AI tool created by Northpointe and used to help judges, probation officers and parole officers assess the likelihood that a criminal defendant will become a recidivist (a criminal who reoffends). In short, this tool determines the future of people. ProPublica investigated the risk-assessment AI to see whether the algorithm was biased against certain groups. They found “that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk”.
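A comparison like ProPublica’s comes down to measuring error rates separately for each group. Here is a minimal sketch of that kind of check. The records, the group labels "A" and "B" and the function name are all made up for illustration; this is not COMPAS data and not ProPublica’s actual method.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """Compute false positive and false negative rates per group.

    records: iterable of (group, predicted_high_risk, reoffended) tuples.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, predicted, actual in records:
        if predicted and actual:
            counts[group]["tp"] += 1
        elif predicted and not actual:
            counts[group]["fp"] += 1  # wrongly flagged as high risk
        elif not predicted and actual:
            counts[group]["fn"] += 1  # wrongly flagged as low risk
        else:
            counts[group]["tn"] += 1

    rates = {}
    for group, c in counts.items():
        did_not_reoffend = c["fp"] + c["tn"]
        did_reoffend = c["fn"] + c["tp"]
        rates[group] = {
            "false_positive_rate": c["fp"] / did_not_reoffend if did_not_reoffend else None,
            "false_negative_rate": c["fn"] / did_reoffend if did_reoffend else None,
        }
    return rates

# Hypothetical toy records: (group, predicted_high_risk, reoffended)
records = (
    [("A", True, False)] * 2 + [("A", False, False)] * 2
    + [("A", True, True)] * 3 + [("A", False, True)] * 1
    + [("B", True, False)] * 1 + [("B", False, False)] * 3
    + [("B", True, True)] * 2 + [("B", False, True)] * 2
)

rates = error_rates_by_group(records)
for group, r in sorted(rates.items()):
    print(group, r)
```

In this toy data, group A is wrongly flagged as high risk more often, while group B is wrongly flagged as low risk more often. That is the same shape of imbalance the quote describes, even though both groups contain equal numbers of reoffenders.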

In one case, a Wisconsin man, Mr. Loomis, claimed that he was incorrectly sentenced to six years in prison because the judge had consulted COMPAS. As part of Loomis’s appeal, he requested to view the source code of the algorithm, a request Northpointe declined for proprietary reasons. The case was appealed to the Supreme Court, which declined to hear it in 2017. Loomis’s lawyer, Mr. Rosenberg, wrote in his brief: “The COMPAS system is full of holes and violates the requirement that a sentence be individualized.”

Both COMPAS and Amazon’s AI recruiter are examples of ADM (Algorithmic Decision-Making) systems disrupting our ethics and changing our morals for the worse. The main issues with allowing ADM systems to decide people’s fates stem primarily from:

1. Our innate trust in science, which clearly becomes a problem when applied to machine learning. Cathy O’Neil, a mathematician and author of 'Weapons of Math Destruction', a book that investigates the risk of algorithmic bias in different settings, says: “People are often too willing to trust in mathematical models because they believe it will remove human bias. Algorithms replace human processes, but they’re not held to the same standards. People trust them too much.”

2. A seemingly global reluctance to hold anyone accountable for the possible negative outcomes, which in combination with that trust makes for a sour aftertaste. If an AI oversight system were in place, based on people’s value judgements (not algorithm creators’), one that could determine whether an AI is fit to be applied to the general public, usage would be safer for all. A small organization attempting to uphold such an initiative is AlgorithmWatch. Their manifesto reads: “AlgorithmWatch is a non-profit research and advocacy organisation to evaluate and shed light on algorithmic decision making processes that have a social relevance, meaning they are used either to predict or prescribe human action or to make decisions automatically.”

The underlying design of the systems is also to blame in many cases. When programming an ADM system, the programmer can choose to optimize it for true positives. This means that the AI will identify as many people as possible who are at high risk of, for example, slacking off at work or committing a crime. The problem with this is that it increases the risk of false positives, i.e., those who are wrongly accused of slacking. To avoid this, the dial can be adjusted to deliver as few false positives as possible, but this can lead to more false negatives: people who actually do slack but slip through. We’ll take a look at how this can be fixed in a bit.
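One way to picture that dial is as a score threshold. The sketch below assumes the system outputs a risk score between 0 and 1 and flags everyone at or above a chosen threshold. The scores, labels and the `confusion_at` helper are invented for illustration and are not taken from any real ADM system.

```python
from typing import NamedTuple, Sequence

class Confusion(NamedTuple):
    true_pos: int
    false_pos: int
    false_neg: int
    true_neg: int

def confusion_at(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> Confusion:
    """Count outcomes when everyone scoring at or above `threshold` is flagged."""
    tp = fp = fn = tn = 0
    for score, actually_slacks in zip(scores, labels):
        flagged = score >= threshold
        if flagged and actually_slacks:
            tp += 1
        elif flagged and not actually_slacks:
            fp += 1  # wrongly accused
        elif not flagged and actually_slacks:
            fn += 1  # slipped through
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)

# Hypothetical risk scores and ground truth for eight people
scores = [0.95, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, True, False, False]

lenient = confusion_at(scores, labels, threshold=0.35)  # optimized for true positives
strict = confusion_at(scores, labels, threshold=0.75)   # optimized for few false positives

print("lenient:", lenient)
print("strict: ", strict)
```

With the low threshold, every actual slacker in this toy data is caught, but two people are wrongly accused. With the high threshold, nobody is wrongly accused, but two slackers slip through.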

We can conclude that using ADM systems in the public sector or in big corporations comes with potentially damaging results, and a poorly designed system that is biased against minorities and those in poverty could be devastating for them. As an example, banks can use AI to decide whether to grant individual consumers credit or a loan, and to set their interest rates. This means that if an AI is trained on previous reports holding information about low-income earners’ ability or inability to repay their debts, a person in the same situation might be denied a loan. The AI has then not considered the individual’s potential, but rather looked at statistics spanning several years.

ML is a relatively new technology and, judging by the evidence, it is still too immature to decide people’s fates. Although Amazon managed to notice their mistake before much damage was done, the possible negative consequences could’ve been devastating for the company. To ponder a bit about what could’ve happened: the share of women among their employees would’ve been close to non-existent, which is a big no-no in this day and age. Women across the globe, in the name of feminism, would have shunned the company, making Amazon look like old dinosaurs and, most likely, decreasing their sales (women account for 70-80% of all consumer purchases).

The mistakes aside, what can be done to prevent this from happening? AI is, after all, when done and used right, a source of great efficiency and support for many sectors.

Debiasing Algorithmic Decision-Making and how to mitigate the risks during usage

Adjusting the balance between true and false positives can improve ADM outcomes. By raising the incidence of true positives or lowering the incidence of false positives, you can improve a statistical measure known as positive predictive value (PPV). PPV is the percentage of all positive predictions that are true positives. However, no matter how the dials are set, all algorithms today will hold bias since they’re based on generalized statistics, not on someone’s individual situation. This fact alone supports the idea that the results coming out of ADM systems should be used as advice and direction, not as truth.
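As a small worked example, PPV is the number of true positives divided by the number of all positive predictions (true positives plus false positives). The counts below are made up, and the `ppv` helper is only an illustrative sketch.

```python
from typing import Optional

def ppv(true_positives: int, false_positives: int) -> Optional[float]:
    """Positive predictive value: the share of flagged cases that were correct.

    Returns None when nothing was flagged, since the ratio is undefined.
    """
    flagged = true_positives + false_positives
    if flagged == 0:
        return None
    return true_positives / flagged

# Hypothetical counts: 40 correct flags and 20 wrong ones
baseline = ppv(true_positives=40, false_positives=20)

# Lowering false positives or raising true positives both improve PPV
fewer_false_positives = ppv(true_positives=40, false_positives=10)
more_true_positives = ppv(true_positives=60, false_positives=20)

print(f"baseline:              {baseline:.0%}")
print(f"fewer false positives: {fewer_false_positives:.0%}")
print(f"more true positives:   {more_true_positives:.0%}")
```

Note that a high PPV says nothing about how many true cases were missed, so it is only one of several measures worth tracking.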

Despite this, ADM is widely used in important decision-making by two major powers, the UK and US governments. There is an evident discrepancy between governments’ usage of AI and what they communicate to the public about that usage. For example, the public sector in the UK is based on the Nolan Principles, the basis of the ethical standards expected of public office holders. Four of the principles are objectivity, accountability, openness, and honesty. However, these principles do not apply to AI in the UK as of today, even though AI is used by insurance companies, public schools and the Department for Work and Pensions. Whether the principles should or shouldn’t be applied to AI is, thankfully, under discussion.

Alexander Woolfson, writing for the Article, asks: “Can we simply apply the same ethical standards when thinking about a different form of computational cognition?”

A very legitimate question. Human decision-making does differ from computational decision-making and should probably be overseen differently. Until a decision is reached on this, the Committee on Standards in Public Life proposes multiple ways of mitigating the risks of AI (to mention a few):

Greater transparency by public bodies in use of algorithms
New guidance to ensure algorithmic decision-making abides by equalities law
The creation of a single coherent regulatory framework to govern this area

The ethics of AI is a whole topic in itself and there are, of course, two sides to the coin. While some argue for regulatory systems because of issues with the democratization of computing infrastructure (which allows anyone to build an algorithm however they want), others believe regulation will hinder the development of AI.

Dr. Rumman Chowdhury, Accenture’s lead for responsible AI, told Forbes: “What we don’t want is every AI project at a company having to be judged by some governance group, that’s not going to make projects go forward. I call that model ‘police patrol,’ where we have the police going around trying to stop and arrest criminals. That doesn’t create a good culture for ethical behavior.” But when it comes to AI used for public purposes, Dr. Chowdhury agrees that since AI is democratized, we need to be ready to democratize the overview and regulation of AI ethics.

Another way to tackle this problem, first reported by Forbes, is to do what one big tech corporation has done: “Researchers at IBM are working on automated bias-detection algorithms, which are trained to mimic human anti-bias processes we use when making decisions, to mitigate against our own inbuilt biases.” The other big leaders in tech (Google, Apple, Facebook) also formed the Partnership on AI in 2016 as a way of ensuring that research on the ethics of AI, including issues of bias, progresses and is sustained.

As we globally venture deeper into the field of machine learning, it becomes imperative for companies as well as governments to ensure the AI they use is tested, iterated on and verified by an initiative that can act as a safety net and thereby validate AI made for the public. If there’s even a slight chance of someone being discriminated against or treated unjustly, the AI should be heavily scrutinized before it is used. On the other hand, considering the substantial positive impact AI can have on our world when used rightly and justly, it’s equally important that companies are supported in their development of it.
