Meta recently deployed new bots that crawl the web, gathering data for its AI models and related products. These bots include features that make it more difficult for website owners to prevent their content from being scraped and collected.
According to the company, the Meta-ExternalAgent bot is intended “for use cases such as training AI models or enhancing products by directly indexing content.”
Another bot, named Meta-ExternalFetcher, is associated with the company’s AI assistant services and gathers links to support specific product functions.
According to Originality.ai, a startup that specializes in detecting AI-generated content, these bots first emerged in July. Startups and tech giants are competing to develop the most advanced AI models, and a crucial component is high-quality training data.
One of the primary methods of acquiring this data is to deploy bots that crawl and scrape online content, an approach Google, OpenAI, Anthropic, and several other AI companies have adopted.
Content owners use robots.txt, a standard file that tells crawlers which parts of a site they may not access, to prevent automated scraping of their websites, but AI companies such as OpenAI and Anthropic have been found to ignore or circumvent it. Meta may also be attempting to bypass robots.txt with its new Meta-ExternalFetcher bot.
The Meta-ExternalAgent bot serves two purposes: gathering AI training data and indexing content. Website owners may want to prevent Meta from collecting their data for AI model training, but they may desire the tech giant to index their sites to attract more human visitors.
Combining these functions in a single bot makes it harder to block. According to Originality.ai, only 1.5% of top websites block the new Meta-ExternalAgent bot. By contrast, an earlier Meta crawler called FacebookBot, which has scraped online data for years to train Meta’s large language models and AI speech-recognition technology, is blocked by almost 10% of top websites, including X and Yahoo.
The new Meta bot, Meta-ExternalFetcher, is being blocked by less than 1% of the top websites. The CEO of Originality.ai, Jon Gillham, stated that companies should allow websites to block their data from being used for training without reducing the visibility of the websites’ content in their products.
Meta is not carrying over website owners’ earlier blocking decisions from its older bots: any website that previously blocked FacebookBot must now also block the new Meta-ExternalAgent crawler to keep its data out of Meta’s AI model training.
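For site owners who want to opt out, a minimal robots.txt sketch along these lines would block all three crawlers by their published user-agent tokens, assuming, per the concerns above, that each bot actually honors the directive:

```
# Example robots.txt entries opting out of Meta's crawlers.
# User-agent tokens are those named in Meta's public documentation;
# Meta-ExternalFetcher may not honor robots.txt in all cases.
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

User-agent: FacebookBot
Disallow: /
```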
A Meta spokesperson stated the company aims “to facilitate publishers in expressing their preferences.”
The spokesperson also wrote in an email to the reporter, “Similar to other companies, we train our generative AI models on publicly accessible online content. We acknowledge that some publishers and web domain owners desire options regarding their websites and generative AI.”
Additionally, the spokesperson mentioned that Meta has multiple web-crawling bots to avoid “consolidating all use cases under a single agent, offering more flexibility for web publishers.”