Artists and authors are complaining that Generative AI is copying their works, and some systems will provide direct excerpts from published documents without providing a source reference.
This is a problem that can be solved with “regular programming” by the LLM provider. Anthropic Claude has been providing real and valid references for some time whenever I ask for something in the style of a research report, and other LLMs are following suit. This is not exactly the same as a learning source reference, but it’s a step in the right direction.
But this does not seem to be enough for some artists and authors. They feel that corpus inclusion is theft.
LLM producers can remove any artist’s or author’s work from the corpus to be used for the next release. That is trivial to do and can easily be documented, for instance by providing a public table of contents, with source links, for the entire learning corpus.
Once near-future LLM++ systems select our input media and answer our questions, an artist who has asked to have their works removed from the corpus will find that, within a few years, the world knows nothing about them or their works.
If you want to be famous, known, and admired in a few years, then you need to let AIs read and admire your stuff today.
Taking your works out of all learning corpora is a direct trip to oblivion.
This is not “a threat voiced by LLM providers”. Rather, it is a simple consequence of individual decisions made by authors and artists. LLM providers like OpenAI and Google have enough pictures and text to learn hundreds of languages and create images of anything we can imagine and many things we can’t. These companies don’t care much about any individual document or artwork.
In this context it might be worth mentioning that if you want an LLM to create a fantasy painting of a cat, like a Puss in Boots, most of the information about what cats look like comes from pictures of real cats, rather than artworks of cats. Art styles come from specific artists, but if you prompt for a cat in a box in the style of Rembrandt, the results are original art. I discuss this more in my post about AI and creativity.
— * —
Anyone selling something on the web, including blog entries, would be a fool to block Google and other search engines from indexing their stuff so that it can be found. The web server file “robots.txt” can be used to block indexing; be careful about what you put in there, if you want others to find you.
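If you are unsure what your own robots.txt actually blocks, you can check it with a few lines of Python. This is only a minimal sketch using the standard library's `urllib.robotparser`. The crawler name `ExampleAIBot`, the helper name `crawl_permissions`, and the sample file are made up for illustration; substitute the user-agent strings that the crawlers you care about actually publish.

```python
from urllib.robotparser import RobotFileParser


def crawl_permissions(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report, for each user agent, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical robots.txt: blocks one (made-up) AI crawler, allows everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

result = crawl_permissions(
    SAMPLE_ROBOTS_TXT,
    "https://example.com/blog/post.html",
    ["ExampleAIBot", "Googlebot"],
)
print(result)  # {'ExampleAIBot': False, 'Googlebot': True}
```

A `False` next to a crawler's name means your file asks that crawler to stay away from the page. Note that robots.txt is a request, not an enforcement mechanism, so this only tells you what you are asking for.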
LLMs are not search engines. For factual queries, the service they provide is a single, simple answer rather than hundreds of documents for you to read and evaluate yourself. Many people lack the competence to evaluate the veracity, usefulness, and applicability of dozens of search results. These people are the main target audience for LLM-produced search summaries, such as those now provided by Microsoft, Google, and others.
It will take just a couple more generations of LLM releases before their result summaries become so good that people stop reading the regular search result page, and therefore stop clicking on result links. This means we need to rethink search monetization, and probably search as a whole. What we have today will just stop working. One of the few things we can say for certain is that LLM corpora will continue to matter. So make sure your works are in every one of them.
Longer term, AI will change everything. Today we are discussing compensation for artists and authors, but in a decade or two, there is no guarantee we will even use money.