Artificial intelligence (AI) large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Gemini are devouring “high-quality text data” from the Internet at such a rapid clip that they could soon run out of it, a shortage that is pushing the field toward AI “inbreeding.”
Data-hungry AI firms are already using “AI-generated, or synthetic, data as training material” to beat the looming shortage, but researchers warn that this approach “could actually cause crippling malfunctions.”
Training AI on text that is itself generated by AI is what the Wall Street Journal describes as “the computer-science version of inbreeding”: each new model amplifies its predecessor’s errors and loses the rarer patterns found in the original human-written data, generally ending in gibberish and “model collapse.”
In one experiment along these lines, an LLM that was supposed to discuss 14th-century English architecture instead spat out a diatribe about a species of jackrabbit that does not exist.
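Why recursive training degrades is easiest to see with a toy stand-in. The sketch below is a hypothetical illustration, not the experiment described above: it repeatedly fits a one-dimensional Gaussian to samples drawn from the previous generation’s fit. Because each finite sample slightly under-represents the true spread, the distribution narrows generation after generation, a statistical analogue of an LLM gradually forgetting rare human-written patterns.

```python
import numpy as np

# Toy "model collapse": each generation fits a Gaussian to samples drawn
# from the PREVIOUS generation's fit rather than from the original data.
# Finite sampling loses a little of the true spread each round, so the
# model's output steadily narrows. Averaging many independent runs makes
# the drift visible above the noise of any single run.
rng = np.random.default_rng(0)
n_runs, n_generations, n_samples = 200, 50, 100

stds = np.zeros((n_runs, n_generations))
for run in range(n_runs):
    data = rng.normal(0.0, 1.0, n_samples)        # generation 0: "real" data
    for gen in range(n_generations):
        mu, sigma = data.mean(), data.std()       # "train" on current data
        stds[run, gen] = sigma
        data = rng.normal(mu, sigma, n_samples)   # next gen: synthetic only

for gen in range(0, n_generations, 10):
    print(f"generation {gen:2d}: mean fitted std = {stds[:, gen].mean():.3f}")
```

Averaged over many runs, the fitted spread decays steadily even though no single step looks dramatic, which is one reason the damage from synthetic training data tends to surface only after several model generations.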
The supply of high-quality text available for AI to train on is running out so fast that OpenAI is already looking to scour YouTube videos for more, transcribing their audio with its Whisper speech-recognition model.
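Whisper is openly available as a Python package, so the basic transcription step in such a pipeline is straightforward to sketch. In the snippet below, the checkpoint choice and file name are placeholders, and it assumes the audio has already been extracted from a video beforehand (for example, with a tool such as yt-dlp):

```python
import whisper  # pip install openai-whisper (also requires ffmpeg)

# Load a pretrained checkpoint; "base" is small and fast, while larger
# checkpoints ("small", "medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a locally saved audio file. The file name is a placeholder
# for audio previously extracted from a video.
result = model.transcribe("lecture_audio.mp3")

print(result["text"])  # recovered text, ready to join a training corpus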
Even with such workarounds, it could become difficult for AI to keep progressing at its current pace once the available online resources are exhausted.