Reddit has sued Perplexity and three data scraping companies for allegedly using its content without permission.
Rather than entering into a licensing agreement with Reddit, Perplexity is accused of collaborating with at least one data scraping firm that managed to access Reddit’s information by bypassing Google’s protective measures against scraping.
Reddit has become one of the sources most frequently cited by AI chatbots, and an essential source of training data for language models.

While Reddit has agreed to license its extensive data to companies like OpenAI and Google, it is now pursuing legal action against AI firms that refuse to compensate for its content.
Earlier this year, Reddit sued Anthropic for using its data to develop its chatbot, Claude. In October, Reddit filed another lawsuit, this time against Perplexity and several data scraping businesses: SerpApi from Texas, Lithuania’s Oxylabs UAB, and AWM Proxy, which has ties to a former Russian botnet. The lawsuit claims that SerpApi has publicly promoted its partnership with Perplexity.
The lawsuit accuses Perplexity of utilizing Reddit’s content for commercial purposes without proper authorization. Reddit asserts that Perplexity should be required to pay for data licensing, similar to agreements made with OpenAI and Google.
In contrast, Perplexity argues that it is within its rights to summarize and reference Reddit discussions, claiming these are public data.
According to Reddit’s legal team, the defendants allegedly took content from Google search results, which is an indirect method of accessing Reddit’s material. Reddit has strict rules against commercial use of its content without a formal agreement.
Both Reddit and Google have implemented strategies to block unauthorized scraping, but the companies involved in scraping have found ways around Google’s defenses.
Reddit’s lawyers likened the actions of the defendants to bank robbers who, unable to break into a bank vault, instead target an armored truck carrying cash. This analogy highlights the seriousness of the allegations.
Perplexity is said to have been aware that its actions were illegal, as Reddit sent a cease-and-desist letter to the company in May 2024.
Perplexity has denied the allegations of scraping Reddit’s data and claims to respect robots.txt files, which are used to indicate which areas of a site are off-limits to scrapers.
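For readers unfamiliar with robots.txt, the following is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the bot name `ExampleBot`, and the `example.com` URLs are hypothetical and chosen only for illustration; they are not Reddit's actual robots.txt or any company's real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# ExampleBot is barred from the whole site; other bots only from /private/.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/r/some-thread"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/public/page"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/page"))
```

A robots.txt file is a voluntary convention: nothing technically prevents a scraper from ignoring it, which is part of why disputes like this one end up in court.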
Following the cease-and-desist, Reddit noticed a dramatic increase in the presence of its content within Perplexity’s search results.
To verify that Perplexity was indeed scraping its data, Reddit posted content that was only visible to Google. Within just a few hours, Reddit staff found that the post had appeared in Perplexity’s search results.
In the past, website publishers might have hired data scraping firms to enhance their visibility on Google.
However, the rise of AI companies, including Google, has disrupted this practice by scraping and reusing content from other sources, diverting traffic away from the original publishers toward their own AI products.
Numerous online publishers, such as The New York Times, Dow Jones, and Getty Images, have taken legal action against AI companies for allegedly stealing their copyrighted material. Social media platforms like Meta, X, and LinkedIn have also filed lawsuits against data scraping firms for misusing their data.
Online publishers have experienced significant drops in traffic due to Google’s AI Overview product, which generates AI-driven responses that appear above traditional search results.
Since the launch of Google AI Overview in May 2024, the rate of searches resulting in no clicks or visits has climbed from 56% to 69%, according to a study by Similarweb. Concurrently, organic traffic has plummeted from 2.3 billion visits to 1.7 billion.
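To put those figures in proportion, the short calculation below works out the size of each change from the numbers quoted above. It is only an arithmetic sketch; the helper name `relative_change` is ours and not part of the Similarweb study.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after (negative means a decline)."""
    return (after - before) / before


# Zero-click searches: 56% -> 69% of all searches.
zero_click_points = 69 - 56                       # change in percentage points
zero_click_relative = relative_change(56, 69)     # relative increase

# Organic traffic: 2.3 billion -> 1.7 billion visits.
traffic_relative = relative_change(2.3, 1.7)

print(f"Zero-click share rose {zero_click_points} percentage points "
      f"({zero_click_relative:.1%} relative increase)")
print(f"Organic traffic changed by {traffic_relative:.1%}")
```

In other words, the share of zero-click searches grew by roughly a quarter, while organic visits fell by about a quarter.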
Many publishers that have sued AI companies for unauthorized scraping have also entered into paid licensing agreements with those same companies. Notable examples include The New York Times and News Corp, along with several other publishers, including Reddit itself.
Recognizing the potential value of its content for generative AI tools, Reddit began blocking access and negotiating licensing agreements in 2023.
Since the introduction of ChatGPT in late 2022, many companies have invested heavily in computing resources and AI talent to gain a competitive edge in the AI market. However, obtaining large datasets for training AI models remains a challenge.
Reddit has become an invaluable source of knowledge for AI companies, with its users sharing candid insights and experiences on a wide range of topics for two decades.
The platform’s upvoting system helps highlight the most valuable contributions within each community, or subreddit. An AI executive once described Reddit’s collection of content as a treasure trove, suggesting that all that is needed is to organize the dataset and market it effectively.
According to a report from an AI search analytics firm, Reddit is the most cited source on both Perplexity and Google AI Overviews, and it ranks as the second most cited source on ChatGPT.
Instead of pursuing a licensing deal for Reddit’s data, the lawsuit claims that Perplexity is willing to go to great lengths to acquire the data it needs.
In a response posted on Reddit regarding the lawsuit, Perplexity argued that it would be “impossible” to agree to a licensing deal with Reddit, as it does not train foundational models.
The company maintains that its chatbot summarizes and cites discussions from Reddit, asserting that Reddit’s stance contradicts the principles of an open internet by attempting to limit user access to information through the chatbot.