Artificial intelligence chatbots like ChatGPT gather data from Reddit

Chatbots derive data mostly from Reddit comments, followed by Wikipedia, YouTube, Google, data shows

ISTANBUL 

Artificial intelligence (AI) chatbots, also known as large language models (LLMs), like ChatGPT, were found to draw on Reddit comments more than any other source, according to the Statista statistics portal.

LLMs surged in popularity with the emergence of ChatGPT and have become integral to daily life with the launch of several models, such as Google’s Gemini, China’s DeepSeek, Meta’s Llama and Grok from the X social media platform.

The data showed that many LLMs, including ChatGPT, refer to publicly available websites to generate responses.

Reddit is at the top of the list of sources LLMs cite, with a 40.11% share, according to Statista.

Experts said the use of Reddit, where users discuss a myriad of subjects in topic-specific forums that the platform calls “subreddits,” shows that the development of AI chatbots prioritizes natural conversations between real people over official information.

Wikipedia was the second most-cited platform with a 26.3% share. It lags significantly behind Reddit, as it features edited articles rather than the social media model that Reddit operates on.

YouTube was found to have a 23.5% share, followed by Google with 23.2%, yelp.com with 21%, Facebook with 19.9%, Amazon with 18.7%, Tripadvisor with 12.4%, mapbox.com with 11.2% and openstreetmap.com with 11.2%.

Meanwhile, some deals between social media companies and AI makers have come to the fore.

Google and Reddit inked a deal in 2024 under which Reddit’s data is fed to Google’s AI for $60 million per year, according to a report by Reuters. Reddit signed a similar data-sharing agreement with OpenAI for ChatGPT.