The battle over the future of artificial intelligence flared again this week as YouTube’s CEO weighed in on a controversial topic: whether leading AI companies are improperly using YouTube videos to train their powerful new AI models.
In an exclusive interview with Bloomberg, YouTube CEO Neal Mohan didn’t mince words. If companies like OpenAI are scraping YouTube videos to train AI systems like the new Sora video generator, that would constitute “a clear violation” of YouTube’s rules, he said.
“From a creator’s perspective, when a creator uploads their hard work to our platform, they have certain expectations,” Mohan told Bloomberg’s Emily Chang. “One of those expectations is that the terms of service is going to be abided by. It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform.”
Mohan’s stern remarks throw new fuel on the smoldering debate around training data for generative AI – the technology that can create shockingly realistic content like videos, images, audio and prose by digesting massive datasets. As companies race to build ever-more capable AI assistants, they are aggressively vacuuming up as much online content as possible to feed their data-hungry AI models.
But much of that content was created by humans who may not have agreed to have their work repurposed for AI training. YouTube videos are a particularly tantalizing target given the platform’s ultra-viral nature and the difficulty of sourcing quality video datasets elsewhere.
Neither Mohan nor YouTube’s parent company Alphabet has accused OpenAI of any wrongdoing outright. But the comments mark a rare public rebuke from a major tech platform over this simmering AI training data controversy.
OpenAI has stayed mum on exactly what data sources were used to train Sora, the company’s new AI video generator. In an interview last month, OpenAI’s Chief Technology Officer Mira Murati told The Wall Street Journal she wasn’t certain whether Sora ingested YouTube user videos during the training process.
However, the WSJ reported this week that OpenAI has mulled training its upcoming GPT-5 language model on transcripts of public YouTube videos, citing people familiar with the matter. Such a move could run afoul of YouTube’s terms if executed without proper licensing.
Mohan said YouTube’s parent Alphabet, for its part, is careful to honor the rights of creators when using their videos to train the company’s own generative AI, like the PaLM model. Although Mohan acknowledged Google may use “some portion” of YouTube’s video corpus for AI training, he said that is done “in concert with whatever the terms of service or the contract that that creator has signed.”
The AI training data dilemma shows no signs of going away as generative AI soars in popularity and capability. Even as companies pledge to develop AI responsibly, questions will persist around whether Internet users truly consented to have their content reused to develop commercial AI products and services.
