To put that in perspective: 15 websites, out of roughly 1.1 billion on the internet, control more than two-thirds of what AI engines recommend to billions of users every day. This concentration is far more extreme than anything Google's PageRank algorithm produced during its 25-year reign over web discovery.
These domains consistently appear at the top of citation rankings across ChatGPT, Google AI Mode, Gemini, Perplexity, and AI Overviews:
Peec AI's analysis of 30 million sources found the top 10 most-cited domains across all platforms to be: Reddit, YouTube, LinkedIn, Wikipedia, Forbes, Facebook, Yelp, Amazon, TechRadar, and Healthline.
Reddit's user-generated discussions and forums provide a vast, diverse dataset of conversational and problem-solving content. In one Statista study from June 2025, Reddit captured 40.1% of all cited references, far ahead of second-place Wikipedia at 26.3%. On Perplexity, Reddit can account for roughly 1 in 5 citations.
Analysts point to Reddit's ability to answer long-tail, opinion-based, and how-to questions that traditional encyclopedic sources struggle with, which makes it especially valuable for conversational AI.
While Reddit leads overall, individual engine rankings reveal important differences:
Only 7 websites appear in the top 50 most-cited domains across all three major engines (ChatGPT, Perplexity, Google AI Overviews), and only 11% of domains are cited by both ChatGPT and Perplexity.
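If you want to run the same kind of overlap check on your own citation exports, a minimal sketch might look like the following. The domain lists are hypothetical toy data, not figures from any study, and the function names are illustrative. The overlap share here is defined as shared domains divided by all domains cited by either engine, which is an assumption; the studies quoted above may define their percentages differently.

```python
def shared_domains(cited_by_engine):
    """Return the domains cited by every engine in the mapping."""
    domain_sets = [set(domains) for domains in cited_by_engine.values()]
    return set.intersection(*domain_sets) if domain_sets else set()


def overlap_share(domains_a, domains_b):
    """Share of all domains cited by either engine that both engines cite."""
    a, b = set(domains_a), set(domains_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data, not taken from any study.
sample = {
    "engine_a": ["reddit.com", "wikipedia.org", "example-news.com"],
    "engine_b": ["reddit.com", "youtube.com", "example-shop.com"],
    "engine_c": ["reddit.com", "wikipedia.org", "youtube.com"],
}

common = shared_domains(sample)
share = overlap_share(sample["engine_a"], sample["engine_b"])
print(sorted(common))   # domains every engine cites
print(f"{share:.0%}")   # pairwise overlap between engine_a and engine_b
```

With real data, a small intersection and a low pairwise share would point to the same pattern described above: each engine draws on its own short list, with only a handful of domains common to all of them.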
It's important to separate what LLMs cite in their outputs from what they are trained on. For training data, the dominant source by volume is Common Crawl, an open repository of petabytes of raw web data that feeds models like GPT-3, LLaMA, and T5. OpenAI's GPT-3, for example, drew 60% of its training tokens from a filtered version of Common Crawl.
The citation lists above, by contrast, reflect what LLMs reference when generating responses: a much smaller, more curated set of sources that the model has learned to treat as authoritative.
If your goal is to be cited by AI engines, the data is clear: you need to earn a place in the short list of trusted domains. The long tail of the web is functionally invisible to most AI outputs outside of niche queries.
Strategies that work include contributing to Wikipedia, getting coverage on Forbes or Healthline, building a strong YouTube and LinkedIn presence, and earning citations on Reddit. Formats that boost citation success include listicles (which account for approximately 50% of top AI citations) and pages with ordered or unordered lists (present on 80% of AI-cited pages).
In short: Reddit, Wikipedia, and YouTube are the three most-cited domains across major LLM engines today, with a small cluster of authoritative media, health, and reference sites rounding out the top tier. Getting cited by AI means getting cited by these domains first.