In his 1951 lecture "Intelligent machinery: A heretical theory", Alan Turing, who is widely recognized as the father of computer science, offered several ideas that foreshadowed much of the cutting-edge AI safety discussion of the early 21st century. The following ominous words have become especially well-known.
“My contention is that machines can be constructed which will simulate the behaviour of the human mind very closely. […] Let us now assume, for the sake of argument, that these machines are a genuine possibility, and look at the consequences of constructing them. […] It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. There would be no question of the machines dying, and they would be able to converse with each other to sharpen their wits. At some stage therefore we should have to expect the machines to take control.”
Those last few words about machines taking control suggest that once AI machines at the capability level that Turing hints at – today known as superintelligence – exist, our fate will depend on what they are inclined to do. This highlights the importance of making sure they are programmed to have the right goals and motivations – a task that is central to much of present-day AI safety research and has been dubbed AI Alignment. In particular, we need to figure out how to avoid scenarios where the first superintelligent machine has goals that are incompatible with human flourishing and goes on to destroy our entire future, such as (to invoke an oft-cited thought experiment) by turning everything into paperclips.
A tempting reaction here is that avoiding such scenarios must be easy. More than one prominent thinker has succumbed to that temptation. As an example, let me mention the panel discussion on AI risk in Brussels in 2017 where I found myself face to face with best-selling author and cognitive scientist Steven Pinker. Confronted with my sketch of so-called Omohundro–Bostrom theory suggesting how easily we might end up with a superintelligent AI destroying us, his body language clearly demonstrated how silly he found everything I had said, and he stated his simple solution in just eleven words: “The way to avoid this is: don’t build such stupid systems!”
This, however, is easier said than done. (If there is a single conclusion that current research on AI Alignment has gravitated towards, this is it.)
There are several reasons for this difficulty, and the one I wish to highlight in the present blog post is that an AI system does not always do as we intended. This is a recurrent theme in science fiction writer Isaac Asimov’s robot stories from the 1940s onwards. Robots in his world are programmed to obey his Three Laws of Robotics, namely that a robot must (1) not allow a human to come to harm, (2) obey orders from humans, and (3) protect its own existence, where (1) has priority over (2) and (3), and (2) has priority over (3). This all sounds benign and fail-safe, but what happens time and again in Asimov’s stories is that, due to some unforeseen loophole in the Three Laws, something unexpected and bad happens.
This showed deep insight on Asimov’s part. Yet, in order to convincingly argue that an AI system does not always do as we intended, it helps if I can go beyond science fiction stories and offer real-life examples. In doing so, I should take care to ensure that the examples I give are from the real world, rather than from the realm of urban legends.
In particular, I should avoid what has become one of the most oft-cited examples of an AI system failing to do as we intended, namely the tank neural network. A neural network is trained using supervised learning to distinguish photos with tanks from photos without tanks. It learns to do this perfectly on photos from the training set, but performs no better than chance when exposed to other photos. The reason (so the story goes) is that all photos with tanks were taken on a cloudy day, and all photos without tanks on a sunny day. Instead of learning to recognize tanks, the neural network learned to recognize sunny weather. This story, with countless minor variations, appears in equally countless articles, textbooks, popular accounts, blog posts and so on. As shown in the ambitious study by Gwern (2019), all attempts to trace it back to an original source quickly encounter dead ends. Versions of the story appear as early as the 1960s, but with no original source in sight. It’s an urban legend.
A similar anecdote that has gained popularity lately is that of a neural network trained to distinguish between photos of wolves and photos of huskies, but which instead learned to detect whether the photo had snow on the ground. In this case, the neural network does actually exist and is reported on in a 2016 paper by Ribeiro, Singh and Guestrin, but it was intentionally trained on a bad data set, for the purpose of an experimental study on how human subjects react to various AI systems.
So neither the tank example nor the huskies example holds water as a cautionary tale about real-life AI systems failing to do as we intended. But are there other examples that do survive scrutiny? Gwern (2019) looks into this as well, and the answer is yes. If we wish to stick with image classification examples, there is a 2017 study by Kuehlkamp, Becker and Bowyer on how earlier neural networks trained to determine a person’s gender based on iris texture were actually heavily influenced by whether or not the person wears mascara. A recent example with greater practical importance is the commercially used neural network-based skin cancer diagnostics tool studied by Winkler et al. (2019). It turns out that the classifier is heavily influenced by the presence or absence of the kind of purple markings that doctors often use to indicate what they suspect to be malignant skin cancers, and that without this clue its performance is far worse. The fact that the tool performs on the level of a professional dermatologist becomes rather less impressive once we account for it having incorporated information provided by such professionals.
Hence, if we want to exhibit image classification examples of AI systems doing unintended things, then we should avoid tanks and huskies, and instead talk about irises or melanoma, or any of the half dozen similarly convincing examples offered by Gwern (2019).
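The mechanism shared by the mascara and skin-marking cases (and by the tank legend) can be made concrete with a toy sketch in Python. Everything in it is my own illustrative assumption rather than anything taken from the cited studies: the two made-up features (brightness and shape_score), the hand-written data, and the deliberately simple learner, which picks the single feature and threshold that best fit the training set.

```python
# Toy illustration of a learner latching onto a spurious feature.
# All data and names here are hypothetical, invented for this sketch.

# Each sample: ((brightness, shape_score), label), with label 1 = tank.
# In TRAIN, every tank photo is dark (cloudy) and every non-tank photo
# is bright (sunny), while the genuine shape cue is slightly noisy.
TRAIN = [
    ((0.20, 0.9), 1), ((0.30, 0.8), 1), ((0.25, 0.7), 1), ((0.20, 0.4), 1),
    ((0.80, 0.1), 0), ((0.70, 0.2), 0), ((0.75, 0.3), 0), ((0.80, 0.6), 0),
]

# In TEST, weather no longer correlates with the presence of a tank.
TEST = [
    ((0.80, 0.90), 1), ((0.70, 0.80), 1), ((0.20, 0.85), 1), ((0.30, 0.75), 1),
    ((0.20, 0.10), 0), ((0.30, 0.20), 0), ((0.80, 0.15), 0), ((0.70, 0.25), 0),
]


def accuracy(data, feature, threshold, below_means_tank):
    """Fraction of samples classified correctly by a one-feature rule."""
    correct = 0
    for features, label in data:
        is_below = features[feature] <= threshold
        predicted = int(is_below == below_means_tank)
        if predicted == label:
            correct += 1
    return correct / len(data)


def fit_stump(data):
    """Return the (feature, threshold, below_means_tank) rule that
    maximizes accuracy on the given data."""
    best_rule, best_acc = None, -1.0
    for feature in range(2):
        values = sorted({features[feature] for features, _ in data})
        # Candidate thresholds: midpoints between neighboring values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for below_means_tank in (True, False):
                acc = accuracy(data, feature, threshold, below_means_tank)
                if acc > best_acc:
                    best_rule, best_acc = (feature, threshold, below_means_tank), acc
    return best_rule


rule = fit_stump(TRAIN)
print("chosen feature index:", rule[0])            # 0, i.e. brightness
print("training accuracy:", accuracy(TRAIN, *rule))  # 1.0
print("test accuracy:", accuracy(TEST, *rule))       # 0.5, i.e. chance
```

The learner did exactly what it was asked to do, namely fit the training data, and the brightness rule fits it perfectly. A rule based on the shape cue (for instance, "tank if shape_score is above 0.65") scores only 7 out of 8 on the training set, yet classifies every photo in the test set correctly. What we intended and what we specified came apart.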
Going beyond image classifiers, there are plenty of even more striking and sometimes disquieting examples to be found. An especially rich source is the crowd-sourced 2018 paper The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities by Lehman et al. One of my favorite examples from that paper is the so-called tic-tac-toe memory bomb, which arose in a computer tournament between programs playing Gomoku (five in a row tic-tac-toe) on a potentially unbounded board. The normal way to play Gomoku is to play moves in the immediate vicinity of earlier moves, but one of the programs evolved a radically different and (as it turned out) highly successful strategy: to play moves extremely far away. This exploited the fact that opponents dynamically expanded their board representations to include the moves played. When the opponents encountered the very distant move, they expanded the board so much that they ran out of memory, crashed, and lost by default.
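To see why a single distant move can be so devastating, consider the following rough sketch in Python. It is not the tournament code, which I have not seen. It merely assumes (hypothetically) that an opponent stores the board as a dense rectangular grid just large enough to cover all moves played so far, and it counts the cells such a grid would need instead of actually allocating them. The coordinates and the cell budget are made up for illustration.

```python
# Hypothetical sketch of the memory-bomb effect; no real allocation is done.

def dense_cells_needed(moves):
    """Cells in the smallest rectangular grid covering all moves."""
    xs = [x for x, _ in moves]
    ys = [y for _, y in moves]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


def opponent_survives(moves, cell_budget):
    """True if a dense-grid opponent can still fit the board in memory."""
    return dense_cells_needed(moves) <= cell_budget


# Ordinary play: moves cluster close to one another.
NORMAL_MOVES = [(0, 0), (1, 0), (1, 1), (2, 1)]

# The "bomb": one move placed extremely far away (made-up coordinates).
BOMB = (10**9, 10**9)

# Assumed budget of one hundred million cells for the opponent.
CELL_BUDGET = 10**8

print(dense_cells_needed(NORMAL_MOVES))                      # 6
print(dense_cells_needed(NORMAL_MOVES + [BOMB]))             # (10**9 + 1)**2
print(opponent_survives(NORMAL_MOVES, CELL_BUDGET))          # True
print(opponent_survives(NORMAL_MOVES + [BOMB], CELL_BUDGET)) # False

# A sparse representation would store only the occupied squares.
print(len(set(NORMAL_MOVES + [BOMB])))                       # 5
```

Under these assumptions, one move takes the dense grid from 6 cells to roughly a billion billion, while a representation that stores only occupied squares would need five entries. The evolved program of course knew nothing of this; it was simply rewarded for winning, and crashing the opponent turned out to be a way to win.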
Yet another example: In 2002, computer engineers Jon Bird and Paul Layzell experimented with evolutionary methods for automated hardware design. They set up an optimization criterion meant to produce an oscillator. An oscillating signal they got, but the evolved circuit did not look at all like an oscillator. Instead, the process had produced a kind of radio receiver that picked up a signal from a nearby computer. (This example is particularly worrisome for those of us thinking about whether an alternative to AI Alignment might be to keep a superintelligent AI physically confined from the rest of the world other than via a carefully controlled low-bandwidth communications channel.)
I could go on and on. Perhaps the most momentous example the world has seen so far is how the AI algorithms that Facebook and other Internet platforms designed for attention-grabbing produce polarization and anger as unintended and largely unforeseen side-effects. The full consequences of this global experiment are yet to be seen. It is clear from this and other examples that AI systems not doing as we intended can cause major trouble long before the superintelligence scenario envisioned by Alan Turing in 1951.