Stack Overflow: We’ll Charge Companies For Training LLMs

The Q&A website for programmers has joined Reddit in requesting payment for the usage of its data in the training of algorithms and bots similar to ChatGPT.

The move by Stack Overflow to demand payment from companies using its data for generative AI purposes is part of a broader strategy and has not been previously disclosed. This development comes after Reddit’s recent announcement that it will also start charging specific AI developers for scraping its content beginning in June.

According to external analyses and the companies’ own disclosures, OpenAI (the creator of ChatGPT), Google, and Meta have all developed AI systems using datasets curated from various online sources, including Stack Overflow and Reddit.

AI text generators or chatbots can become more fluent and knowledgeable by incorporating text from online discussions among experts or casual conversations into machine learning algorithms referred to as large language models (LLMs). Utilizing LLMs to generate programming code is regarded as one of the technology’s most significant prospects.

Prashanth Chandrasekar, CEO of Stack Overflow, says that community platforms whose content fuels the development of LLMs must be compensated so that they can reinvest in their communities and keep them thriving. He says he supports Reddit’s approach.

According to Chandrasekar, the potential additional revenue is crucial to ensure that Stack Overflow continues to attract users and maintain its status as a source of high-quality information. He also asserts that this would help train future chatbots, which require new knowledge to advance.

However, restricting valuable data could discourage some AI training efforts and slow the progress of LLMs, which themselves threaten any platform people turn to for information and conversation. Chandrasekar believes that proper licensing would only hasten the development of high-quality LLMs.

AI developers aim to reduce the massive expenses of creating large-scale AI systems, which require significant amounts of costly computing power. The need to pay for data they once obtained for free could prolong the already uncertain timelines for achieving profitability with their emerging technologies.

Large language models generate strings of text by drawing on the word patterns they have learned from web pages, books, and other text in their training data. They power ChatGPT as well as search chatbots such as Microsoft’s Bing chat and Google’s Bard.

They are the foundation of an expanding array of applications that instantaneously generate professional and creative content. The corresponding models that generate AI-composed illustrations and videos rely on patterns extracted from image datasets, such as photographs collected from Pinterest and Flickr.

Datasets used to develop AI are frequently compiled through unofficial methods, such as deploying software that scrapes content from websites. In the US, this is generally considered lawful, although copyright concerns and websites’ terms of use have led to disputes over the practice.

Some websites, including Reddit and Stack Overflow, have been more receptive to AI developers. They provide downloadable “data dumps” or real-time data portals in the form of APIs to help software access their content. According to Chandrasekar, LLM developers have been obtaining Stack Overflow’s data through a combination of dumps, APIs, and scraping, all of which are presently free.

Chandrasekar claims that LLM developers are violating Stack Overflow’s terms of service. According to the site’s TOS, users own the content they post on Stack Overflow. However, it all falls under a Creative Commons license that requires anyone who later uses the data to acknowledge where it came from.

When AI firms sell their models to clients, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar notes.

Stack Overflow and Reddit have not yet disclosed any pricing information. According to Reddit spokesperson Tim Rathschmidt, the company is still working on it and will share more details with partners in the coming weeks. Chandrasekar says Stack Overflow will study Reddit’s approach and hold discussions with potential customers who have already expressed interest in data access.

Stack Overflow and Reddit will continue offering free data access to some users and companies. Chandrasekar says Stack Overflow seeks compensation only from companies using its data to develop LLMs for commercial purposes. In his view, it becomes unfair when companies start charging for products built on community-driven sites like Stack Overflow.

Similarly, Reddit CEO Steve Huffman has said the company is unwilling to give large corporations free data access. He expressed concern about companies generating value from crawling Reddit’s content without returning any of that value to its users.

As demand for LLM-based products grows and expectations of profit rise, other companies holding the vast amounts of data needed to train machine learning algorithms are also seeking compensation. Some news publishers have raised concerns over Microsoft’s Bing chatbot and its use of their content.

Although there are growing expectations that products built on large language models (LLMs) will generate huge profits, so far, there have been only a few public deals over access to training data. For example, Shutterstock agreed to license content to OpenAI, while Getty Images is suing Stability AI, an OpenAI competitor, for allegedly using over 12 million photos without seeking a license. The AI startup’s response is due in US federal court next week.

At present, AI developers are not under significant pressure to pay for access to data. Some companies with large volumes of academic text or casual conversations have indicated that they have no plans to start charging for their APIs or similar data portals.

For instance, according to spokesperson David Knutson, PLOS, a publisher of scientific research whose content has been used in AI training, is “not likely” to change its fairly unrestrictive terms of use. Meanwhile, online community platform Discord has no plans to modify its free API offerings, which are provided under terms that forbid AI training, says spokesperson Swaleha Carlson.

Charging for its API is just one part of Stack Overflow’s broader AI strategy, which it expects to unveil in a few months and which also includes developing its own generative AI services. About 10 percent of the nearly 600 staff at Stack Overflow are focused on this initiative, which includes an assistant feature that could help guide people as they compose questions to post.

The main action the Stack Overflow community has taken so far has been to prohibit users from posting AI-generated responses. According to Chandrasekar, the release of ChatGPT led to a surge in incorrect answers, posing a challenge for the site’s few hundred moderators.

If Stack Overflow succeeds in licensing the questions and answers that its users contribute for free to AI makers, these users may rightly demand compensation. Chandrasekar states that considerable thought is being given to ensure that the community members and contributors who make the site what it is today are taken care of within the context of these developments.
