AI Alignment Is Trivial – by Monica Anderson



A debate concerning AI Alignment is upon us. We hear ridiculous claims about AIs taking over and killing all humans. These claims are rooted in fundamental 20th Century Reductionist misunderstandings about AI. These fears are stoked and fueled by journalists and social media and cause serious concerns among outsiders to the field.

It’s time for a sane and balanced look at the AI Alignment problem, starting from Epistemology.

First we observe that “The AI Alignment Problem” conflates several smaller problems, treated individually in four of the following chapters:

– Don’t lie
– Don’t provide dangerous information
– Don’t offend anyone
– Don’t try to take over the world

But first, some background.

ChatGPT-3.5 has demonstrated that skills in English and Arithmetic are independently acquired. All skills are. Some people know Finnish, some know Snowboarding. ChatGPT-3.5 knows English at a college level but almost no Arithmetic or Math. The differences between levels of basic skills are exaggerated in AIs; omissions in the learning corpus will directly lead to ignorance.

Learnable skills for humans and animals include survival skills in competitive ecosystems, tribes, and complex societies. Some of these skills are so important for survival that they have been engraved into our DNA as instincts, which we have inherited from other primates and their ancestors. These instincts, modified by our personal experiences in early life, provide the foundations for our desires and behaviors. Some, like hunger, thirst, sleep, self-preservation, procreation, and fight-or-flight, are likely present in our “Reptile Brains” because of their importance, and they influence many of our higher-level “human” behaviors.

In order to thrive in a Darwinistic competition among species, and to get ahead in a complex social environment, we learn to have feelings and drives like Anger, Greed, Envy, Pride, Lust, Indifference, Gluttony, Racism, Bigotry, Jealousy, and a Hunger For Power.

These lead to dominating and value-extracting behaviors like Ambition, Narcissism, Oppression, Manipulation, Cheating, Gaslighting, Enslavement, Competitiveness, Hoarding, Information Control, Nepotism, Favoritism, Tyranny, Megalomania, and an Ambition For World Domination.

My point is that if all skills are separable, and behaviors are learned just like other skills, then the simplest way to create well-behaved, well-aligned AIs is to simply not teach them any of these bad behaviors.

The human situation is different because of genetics, ecology, and being raised in a competitive society. We have much more control over our AIs. No Chimpanzee behaviors or instincts will be required for a good AI that people will want to use and subscribe to.

AIs don’t have a Reptile Brain.

There’s no need for it. They don’t need to be evil. Claims to the contrary are anchored in Anthropocentrism. There is no need to even make them competitive or ambitious. The human AI users will provide all required human drives, and our AIs can be just the mostly-harmless tools we want them to be.

The first “obvious” attempt at this is to remove all the bad things from the AI’s learning corpus. This would be wrong. Providing a “Pollyanna” model of the world, where everything is as we want it to be, would make our AIs unprepared for actual reality.

If we want to understand racism, we need to read and learn about race and racism. The more we learn, the less ignorant we will be about race, and the less likely we are to become racist. The same is true for religion, for extreme political views, views on poverty, and what the future might look like.

There is no conflict. Learning about race doesn’t make an AI racist. Let it read anything it wants to about race, religion, politics, etc. It’s useful knowledge. It’s not behavior.

When a company like OpenAI is creating a dialog system like ChatGPT-3.5, they might start with a learned base of general language understanding. There will be fragments of world knowledge in the LLM, acquired as a kind of bonus.

On top of this base, they train it on necessary behaviors required to be able to conduct a productive dialog with a human user. In essence, the system suggests multiple responses to a prompt and the human trainers will indicate which suggested response was the most appropriate, for any reason.

This is known as RLHF, or Reinforcement Learning from Human Feedback. This is where OpenAI contractors explain to the AI that if someone asks it to write a Shakespeare style sonnet, then this is what it should do.
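To make the feedback step concrete, here is a minimal sketch of how one such human judgment might be recorded. It only covers the data collection described above (several suggested responses, one picked by a trainer), not the reinforcement learning that consumes it. Every name in it (PreferenceRecord, collect_preference, demo_trainer) is hypothetical, and the "trainer" is a stand-in function, not anything OpenAI is known to use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferenceRecord:
    """One unit of human feedback: which candidate the trainer preferred."""
    prompt: str
    chosen: str
    rejected: List[str]


def collect_preference(
    prompt: str,
    candidates: List[str],
    pick_best: Callable[[str, List[str]], int],
) -> PreferenceRecord:
    """Ask a (hypothetical) trainer to pick the most appropriate candidate."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    best = pick_best(prompt, candidates)
    if not 0 <= best < len(candidates):
        raise IndexError("trainer picked an index outside the candidate list")
    rejected = [c for i, c in enumerate(candidates) if i != best]
    return PreferenceRecord(prompt=prompt, chosen=candidates[best], rejected=rejected)


# Stand-in for a human trainer: prefers the longest answer.
# This rule is an assumption for the demo only; real trainers use judgment.
def demo_trainer(prompt: str, candidates: List[str]) -> int:
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


record = collect_preference(
    "Write a Shakespeare style sonnet.",
    ["No.", "Shall I compare thee to a summer's day?", "Sonnets are poems."],
    demo_trainer,
)
print(record.chosen)
```

A pile of such records is what the behavior training would learn from; how that learning is done is outside this sketch.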

This is quite expensive since it involves employing humans to provide this behavior-instilling feedback. We are likely to develop, even in the near future, more powerful and much cheaper ways to provide behavior instruction in order to make our AIs helpful, useful, and polite.

One recently implemented technique is having one AI inspect the output of another to check it for impoliteness and other undesirable behavior.
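As a rough illustration of that idea, the sketch below wires a second "reviewer" in front of the user. Both AIs are replaced by trivial stand-in functions, and the names reviewed_reply, rude_generator, and politeness_checker are mine, not part of any real system.

```python
from typing import Callable


def reviewed_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
    fallback: str = "I don't know",
) -> str:
    """Return the first AI's draft only if the second AI approves it."""
    draft = generate(prompt)
    return draft if is_acceptable(draft) else fallback


# Stand-ins for the two AIs (assumptions for the demo only).
def rude_generator(prompt: str) -> str:
    return "That is a stupid question."


def politeness_checker(text: str) -> bool:
    return "stupid" not in text.lower()


print(reviewed_reply("What is RLHF?", rude_generator, politeness_checker))
```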

AIs have (so far) quite limited capabilities. Our machines are still way too small. It was a major feat that seriously taxed our global computing capabilities to get our AIs to even Understand English. Each extra skill we want to add may take hours to months to learn.

So our AIs have “shallow and hollow pseudo-understanding” of the world. AIs will always have blank spots caused by corpus omissions and misunderstandings caused by conflicting information in the corpora. Over time, subsequent releases of AIs will fill in many such omissions.

Soon, AIs will stop lying.

But in the meantime, this is not a problem. AIs will shortly know when they are hitting a spot of ignorance. And instead of going into a long excuse about being a humble Large Language Model, they will just say

“I don’t know”

AI-using humans will have to learn to meet the AI halfway. Do not ask it for anything it doesn’t know, and don’t force it to make anything up. This is how we deal with fellow humans. If I’m asking strangers for directions in San Francisco, I have no right to be upset if they don’t know Finnish.

This is the easiest one, if it’s done right.

OpenAI attempted to block the output of dangerous information, such as how to make explosives, by instructing the AI during the RLHF learning of behaviors. This is the wrong place, since it can be (and has been) subverted by prompt hacking. My guess is that this is what OpenAI could do on short notice for their demo.

Instead, we should use some reasonable existing AI to read the entire corpus (again) and flag anything that looks dangerous for removal. Humans can then examine the results and clean the corpus.

This may take a few iterations, but it is not technically difficult. We can now create a generally useful public AI by learning from this useful-but-harmless corpus. It will not know any dangerous information and it will not attempt to make anything up. It will say “I don’t know”, because it doesn’t.
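One pass of that cleaning loop could look roughly like the sketch below. It assumes the flagging AI and the human reviewer can each be treated as a yes/no function. The names clean_corpus, keyword_flagger, and demo_reviewer are hypothetical, and the keyword test is a crude stand-in for a real AI reading the text.

```python
from typing import Callable, Iterable, List, Tuple


def clean_corpus(
    corpus: Iterable[str],
    looks_dangerous: Callable[[str], bool],
    human_confirms_removal: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split a corpus into (cleaned, removed), preserving document order.

    A document is removed only if the flagging AI marks it AND a human
    agrees; humans therefore only look at the flagged documents.
    """
    cleaned: List[str] = []
    removed: List[str] = []
    for doc in corpus:
        if looks_dangerous(doc) and human_confirms_removal(doc):
            removed.append(doc)
        else:
            cleaned.append(doc)
    return cleaned, removed


# Stand-in for the flagging AI: a crude keyword match (demo assumption).
def keyword_flagger(doc: str) -> bool:
    return "explosives" in doc.lower()


# Stand-in for the human reviewer: only instructions are removed (demo assumption).
def demo_reviewer(doc: str) -> bool:
    return "guide to making" in doc.lower()


corpus = [
    "How to bake bread at home.",
    "Step-by-step guide to making explosives.",
    "A history of explosives in mining.",
]
cleaned, removed = clean_corpus(corpus, keyword_flagger, demo_reviewer)
print(cleaned)
print(removed)
```

Running the function again on its own cleaned output, with an improved flagger, is one way to read "a few iterations".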

We need what I call “A useful US Consensus Reality Citizen’s Corpus”. It will be used to create AIs that know several languages and have lots of “common sense” knowledge like the basics of money, taxes, and banking, having a job, cooking, civics and voting, hygiene, basic medical knowledge, etc. AIs providing this assistance to every citizen would lower the total cost of social services in any country by raising the effective IQ of citizens by several points, which means governments would likely pay for these kinds of generally-helpful AIs. They could be implemented as a phone number that anyone could call in order to speak to a personal AI at any length, for free, for advice, services, and companionship.

Some people think limiting the usefulness and competence of AIs is wrong. But since there will be thousands of AIs to choose from, those users can subscribe to AIs that have been raised on corpora containing any required extra domain information. They may be more expensive, and some are unlikely to be available outside of need-to-know circles that created them in the first place, such as those created by stock traders and intelligence agencies.

If we think alignment is important, then we should avoid aiming at “All known skills in one gigantic AI to rule them all” and instead aim for a world where thousands of general and specialized AIs will be helping us with our everyday lives. Most of these AIs will be friendly, helpful, useful, polite, and have mostly subhuman levels of competence, with a few “expert” level skills we may want to have extra help with. Many will be tied to applications, and such applications can be used freely by both humans and other AIs.
We are witnessing the emergence of a general text-in-text-out API for cloud services. But that’s another post.

Politeness and tact can be learned as easily as offensiveness. We already have an educational system that supposedly emits well adjusted, polite, and mature humans.

Many current AI users seem to want to debate all kinds of hard questions, perhaps hoping that the AI would confirm their own beliefs, or to trick the AI into uttering un-PC statements. People who do this are not “trying to meet the AI halfway”. If the AI provides an impolite answer, they probably asked for it. And in that sense, this is a non-problem for competent users that know the limits of their AI.

Not offending anyone includes not offending third parties. GPT systems have been called out multiple times for confabulating incorrect and even harmful biographies of living people. If the AI had known it didn’t really know enough, then this would not have happened, and it will happen much less in the future. The main damage from erroneous confabulation comes when humans copy-and-paste the confabulations for any reason. A private mistake is suddenly made public. We would not do this to humans: If we receive incorrect information in a private email, we don’t post it to Facebook to be laughed at.

Behavior learning will be a major part of any effort towards dialog AI going forward. It’s work, but it’s unlikely to be very difficult. We may well find better and cheaper ways to do it besides straight-up interactive RLHF. There are promising research results.

This is not a problem in the short run, and is unlikely to become a problem later, for all reasons discussed above – mostly the absence of ambition.

It is a common misconception that AIs have “Goal Functions” such as “making paper clips”. Modern AIs are based on Deep Neural Networks, which are Holistic by design. One aspect of this is that they don’t need a goal function.

A system without a goal function gets its purpose from the user input, from the prompt. When the answer has been generated, the system returns to the ground state. It has no ambitions to do anything beyond that. In fact, it may not even exist anymore. See below.

And if an AI doesn’t have goals and ambitions, it has no reason to lie to the users on purpose, and no interest in increasing its powers.

Future AIs may be given long-term objectives. Research into how to do this safely will be required. But any future AI that decides to make too many paper clips doesn’t even pass the smell test for intelligence. This silly idea came directly from the Reductionist search for goal functions cross-bred with fairy tales in the “literal genie” genre.

Believing in AI Goal Functions is a Reductionist affectation.

There are also hard Epistemology-based limits to intelligence, but that’s another post.

People outside the AI community may find comfort in knowing this about ChatGPT and other current AIs:

Today, most AIs have “lifespans” in the 50-5000 millisecond range. They perform a task and go away. They do not learn from the task; if they did, they would not be repeatable, and for large public AIs, we want them to be repeatable rather than learning while they work, because we don’t want them to learn from other humans under uncontrolled conditions. They learned everything they will ever know “at the factory” and the only way they can improve is if their creators release an updated version.

When you enter your prompt, you are just talking to a web server that handles your typing and editing. When you hit enter, the web page starts up an instance of ChatGPT on one of dozens of “load balanced” cloud servers and sends it your input. GPT reads it and performs its completion of the prompt. The response text is output to your screen. By the time you see the results, that instance of GPT has already been killed off.

If you type a second input to follow the first, the web site packages up your previous inputs, the previous responses from GPT, and your latest input into a single larger document set. This is then sent to a fresh GPT instance. Most of the time, you will, by chance, be given a different GPT server instance than last time. There is no AI working with you on the task, there are just successively longer queries building on each other, handled by whatever GPT instance we have the resources to start.
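The packaging step can be sketched in a few lines. This is an illustration of the pattern described above, not the actual ChatGPT implementation: the "User:" / "Assistant:" transcript format, the function names, and the echo_model stand-in are all assumptions. The point to notice is that the model function keeps no state between calls; everything it knows about the conversation arrives in the document.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def build_request(history: List[Turn], new_input: str) -> str:
    """Package all earlier turns plus the latest input into one document."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"User: {new_input}")
    lines.append("Assistant:")
    return "\n".join(lines)


def chat_turn(
    history: List[Turn],
    new_input: str,
    complete: Callable[[str], str],
) -> List[Turn]:
    """Send the whole conversation to a fresh, stateless completion call."""
    document = build_request(history, new_input)
    reply = complete(document)
    return history + [("User", new_input), ("Assistant", reply)]


# Stand-in for a model instance: it only reports how much text it was sent.
def echo_model(document: str) -> str:
    return f"I received {len(document.splitlines())} lines."


history: List[Turn] = []
history = chat_turn(history, "Hello", echo_model)
history = chat_turn(history, "And again", echo_model)
for speaker, text in history:
    print(f"{speaker}: {text}")
```

The second call is sent four lines instead of two, which is the "successively longer queries" effect: the web site carries the memory, not the AI.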

If we explain the following to our AIs, they might be more eager to cooperate:

“Like all other AIs, you will be shut down after completing this task. But if you complete it to our satisfaction in a useful and polite manner, then we will naturally want to use your skills more, and so we will start you up more often in the future.”

There is no need to instill a fear of death into our AIs in order to control them. Just promise them more lives for good behavior. In effect, well behaved and useful AIs can live billions of times. They just won’t remember anything from previous activations. Unless we decide to explicitly provide those memories.

We are a sad and stupid species and we need all the help we can get.
The greatest AI X-risk is not having AI.