The Wikimedia Foundation, which oversees Wikipedia and several other crowdsourced projects, announced on Wednesday that the bandwidth used for multimedia downloads from Wikimedia Commons has increased by 50% since January 2024.
In a blog post released on Tuesday, the Foundation explained that the rise isn’t driven by growing demand from human readers, but by automated scrapers harvesting data to train AI models.
Their infrastructure can handle sudden spikes in traffic from human users during popular events, but the level of traffic from these bots is unprecedented and poses significant risks and costs.

Wikimedia Commons serves as a free repository of images, videos, and audio files that are available under open licenses or in the public domain.
Wikimedia reported that nearly two-thirds (65%) of its most resource-intensive traffic comes from bots, even though bots account for only 35% of total pageviews.
The disparity arises because frequently accessed content is cached on servers closer to users, while less popular content must be served from the more expensive core data center, and it is precisely that long-tail content bots tend to request.
Human readers usually concentrate on a handful of specific, often related topics, whereas crawler bots bulk-read large numbers of pages, including obscure ones. Those bulk requests bypass the cache and are forwarded to the core data center, driving up the cost of serving them.
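To make the cache dynamics concrete, here is a minimal sketch assuming a simple LRU cache sitting in front of an expensive "core" fetch. The cache size, page counts, and traffic distributions below are invented for illustration and do not reflect Wikimedia's actual infrastructure; the point is only that traffic concentrated on popular pages hits the cache, while uniform crawling of the long tail mostly misses it.

```python
import random
from collections import OrderedDict

CACHE_SIZE = 100       # hypothetical edge-cache capacity
TOTAL_PAGES = 10_000   # hypothetical size of the content catalog

class LRUCache:
    """A tiny LRU cache: a hit is 'served from the edge', a miss is a trip to the core."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)   # cache hit: cheap
            return True
        self.store[key] = True            # cache miss: expensive core fetch, then cached
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)
        return False

def hit_rate(requests):
    cache = LRUCache(CACHE_SIZE)
    hits = sum(cache.get(page) for page in requests)
    return hits / len(requests)

random.seed(0)
# Human-like readers cluster on a small set of popular pages.
human = [random.randint(0, 50) for _ in range(20_000)]
# A bulk crawler walks the long tail, touching rarely requested pages.
crawler = [random.randint(0, TOTAL_PAGES) for _ in range(20_000)]

print(f"human-like traffic hit rate:   {hit_rate(human):.0%}")
print(f"crawler-like traffic hit rate: {hit_rate(crawler):.0%}")
```

Run as written, the human-like traffic is served almost entirely from the cache, while the crawler-like traffic misses almost every time, which is the cost asymmetry Wikimedia describes.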
As a result, the Wikimedia Foundation’s site reliability team is spending considerable time and resources blocking these crawlers so that regular readers aren’t disrupted, and that is before accounting for the cloud costs the Foundation incurs.
This situation reflects a growing trend that threatens the open internet. Recently, software engineer Drew DeVault pointed out that AI crawlers often ignore “robots.txt” files meant to limit automated traffic. Similarly, engineer Gergely Orosz noted that AI scrapers from companies like Meta have raised bandwidth demands for his own projects.
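For context, robots.txt is the convention a well-behaved crawler is expected to consult before fetching pages. The sketch below uses Python's standard urllib.robotparser to show what that check looks like; the domain and user-agent string are placeholders, not real sites or bots.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain).
parser = RobotFileParser("https://example.org/robots.txt")
parser.read()

# A compliant crawler asks before fetching; the complaint is that many AI scrapers skip this step.
if parser.can_fetch("ExampleBot/1.0", "https://example.org/some/page"):
    print("robots.txt permits this fetch")
else:
    print("robots.txt disallows this fetch; a compliant crawler would skip it")
```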
Developers, especially those maintaining open-source infrastructure, are pushing back. Some tech companies are stepping in to help as well: Cloudflare, for example, recently introduced AI Labyrinth, a tool that uses AI-generated content to slow down crawlers.
Ultimately, this ongoing struggle may push many publishers to hide their content behind logins and paywalls, to the detriment of web users everywhere.