
People in the humanities are aghast at the lies confabulated by current LLMs. To them, I say: Just wait. This will shortly become YOUR problem.
It takes a species to raise an AI.
We Techies have managed to cobble together a machine that can learn any language on the planet. This breakthrough was achieved after scouring the internet for text to read. It takes a lot of text to learn a language, if you literally have no life.
Something that few people outside of the AI community understand is the importance of the AI’s learning corpus. This is the collection of books and online texts we give it to read when we are raising it. The corpus is not only important, it is the only thing that matters. It provides a hard upper limit to how much the machine can know about anything in the world.
If it wasn’t in the corpus, where would it have learned it?
This is a statement in Epistemology. This is the level you need to operate at, in order to understand why AI works. As opposed to how it works. At this level, machine learning is like human learning, machine understanding of anything works a lot like human understanding, abstraction is abstraction, etc.
A modern LLM learns to understand language using mechanisms similar to those in human brains. Any knowledge it has of the world described in the corpus (as opposed to viewing it as a mere sample of “language”) is a bonus we hoped for but didn’t really have a right to expect, because learning Math and Physics and Cooking wasn’t a goal, at least initially. Language is hard enough. I have never trained a GPT-style model from scratch myself, but my own LLM design needs to read my smallish corpus several times because during the first read-throughs it is still just learning character combinations.
Which means that even if it was in the corpus, the system may not have learned it. If you tried to learn Finnish from scratch by reading a Finnish encyclopedia from end to end, you wouldn’t understand enough Finnish to learn actual content until maybe halfway through the work.
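For readers who like to see things run, here is a toy Python sketch of that earliest stage. It is only an illustration under loud assumptions: the tiny corpus is made up, the helper names `char_bigram_counts` and `known_fraction` are hypothetical, and counting character pairs is a stand-in for "learning character combinations", not a description of how my design or any GPT-style model actually works.

```python
from collections import Counter


def char_bigram_counts(text: str) -> Counter:
    """Count every pair of adjacent characters in the text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def known_fraction(counts: Counter, text: str) -> float:
    """Fraction of adjacent character pairs in `text` already seen in `counts`."""
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in counts) / len(pairs)


# Made-up toy corpus, purely for illustration.
corpus = "the cat sat on the mat. the dog ate the fig."

# A reader that has only seen the first half of the corpus...
half = len(corpus) // 2
counts_half = char_bigram_counts(corpus[:half])

# ...versus one that has read all of it.
counts_full = char_bigram_counts(corpus)

print("familiar after half a read:", known_fraction(counts_half, corpus))
print("familiar after a full read:", known_fraction(counts_full, corpus))
```

The reader that has seen only half the text still stumbles over character pairs in the rest of it, long before any question of content comes up. Real systems learn far richer structure than pair counts, but the ordering is the same: the low-level patterns have to be in place before the facts can stick.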
But now that these devices know languages, we will be raising more competent ones that know more about the world. Learning is expensive, so we will initially prioritize profitable problem domains like Business, Math, Law, Medicine or Physics. Improvements in hardware and algorithms will let them learn more domains and get deeper into each.
The world knowledge corpus will largely define what LLMs believe. Potentially all of them. This is a major responsibility, and the task needs proper attention.
— * —
Techies are trying to create a useful system out of something that starts out without ANY common sense, no body, no smell, no touch, and (at least in the beginning) no vision, no sound. Just an input sense of text, and maybe voice.
English majors and their ilk are sitting on the sidelines. Some are criticizing the results, clearly expecting an intelligent system, perhaps even an AI oracle, rather than a system that merely understands language well.
Techies got this far without applying many specialized skills from Education, Psychology, Ethics, Law, or Politics, by just grabbing all the text we could find on the Internet and calling it a corpus.
It will become a job for the Humanities to raise our AIs and to worry about AI alignment – to ensure that their goals align with our own goals, and to assemble the corpora that will create useful and well-balanced AIs able to move civilization forward for the benefit of all.
We need a Useful US Citizen’s Consensus Reality Corpus.
Who gets to curate it?