Last month, it was reported that Apple, NVIDIA, and other major players in the AI industry had trained their AI products on a public data set containing YouTube video transcripts, in violation of YouTube’s terms of service.
YouTube has stated that any unauthorized scraping or downloading of its content is strictly prohibited, especially when used for commercial projects.
According to recent investigations, the academic data set that NVIDIA, Apple, and other AI companies used to train their models contains subtitles from over 170,000 YouTube videos.
A report from 404 Media now accuses NVIDIA of instructing employees to scrape videos from Netflix, YouTube, and other sources to add to data sets used for its Omniverse 3D world generator, self-driving car systems, a digital human AI avatar product, and the Cosmos deep learning model.
The report also claims NVIDIA attempted to hide its actions by using multiple virtual machines to avoid detection.
“We are finalizing the v1 data pipeline and securing the necessary computing resources,” Ming-Yu Liu, NVIDIA’s VP of Research and a leader on the Cosmos project, wrote in a May email, according to 404 Media, “to build a video data factory that can yield a human lifetime visual experience worth of training data per day.”
According to internal communications reviewed by 404 Media, when employees raised concerns about the origin of the data and the ethics of how it was obtained, managers assured them that they had approval to use the content for training from the company’s top leadership.
“This is an executive decision,” Liu wrote to a hesitant subordinate on one such occasion, according to Slack messages reviewed by 404 Media. “We have an umbrella approval for all of the data.”
NVIDIA stated that its AI training practices fully comply with copyright law, both in letter and in spirit.