Cloudflare recently launched a free tool that blocks AI companies’ bots from scraping its clients’ websites for content to train large language models.
The tool is available to all Cloudflare customers, including those on free plans. According to the company, the feature will be updated continuously as it identifies new fingerprints of bots that scrape the web at scale for model training.
Cloudflare’s team announced this update in a blog post and also provided data on how their clients are dealing with the increase in bots that scrape content for training generative AI models.
Based on the company’s internal data, 85.2 percent of customers have opted to block AI bots, even those that properly identify themselves, from accessing their websites.
Cloudflare has also identified the most active bots over the past year. Bytespider, operated by ByteDance, attempted to access 40 percent of the websites protected by Cloudflare, while OpenAI’s GPTBot attempted to access 35 percent.
These two bots were among the top four AI bot crawlers based on the number of requests on Cloudflare’s network, alongside Amazonbot and ClaudeBot.
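For crawlers that identify themselves, the simplest form of blocking is a check on the User-Agent header. The sketch below is illustrative only: the token list is taken from the four bots named above, while the function name and the sample User-Agent strings are hypothetical, and this is not a description of how Cloudflare's own rule works.

```python
# Minimal sketch: flag requests whose User-Agent names a self-identifying AI crawler.
# Assumption: substring matching on the User-Agent header. This is NOT Cloudflare's
# detection method, which the company says relies on bot fingerprints and ML models.

# Tokens for the four crawlers named in the article, lowercased for matching.
AI_BOT_TOKENS = ("bytespider", "gptbot", "amazonbot", "claudebot")


def is_declared_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known AI crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_BOT_TOKENS)


if __name__ == "__main__":
    # Hypothetical User-Agent strings, for illustration only.
    samples = [
        "ExampleAgent/1.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    ]
    for ua in samples:
        print(ua, "->", "block" if is_declared_ai_bot(ua) else "allow")
```

A check like this only catches bots that announce themselves honestly; a scraper that spoofs a browser User-Agent passes straight through, which is why fingerprint-based detection matters.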
Blocking AI bots from accessing content is proving difficult. The race to build models faster has led some companies to bypass or ignore the existing rules meant to keep scrapers out.
Perplexity AI, for example, was recently accused of scraping websites without permission. A major infrastructure provider like Cloudflare taking serious action against this behavior could make a real difference.
The company said it fears that some AI companies intent on circumventing the rules to access content will keep adapting their methods to evade bot detection.
Cloudflare said it will keep watch, add more bot blocks to its AI Scrapers and Crawlers rule, and improve its machine learning models so that content creators can thrive and keep full control over which models their content is used to train or run inference on.