This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Reddit Doesn't Get Cited, But It Shapes What Does: Training Data Influence on AI Brand Recommendations

2026 · 0 citations · Zenodo (CERN European Organization for Nuclear Research) · Open Access


Abstract

AI chatbots functionally never cite Reddit. In a companion study of 6,699 URLs cited by ChatGPT and Perplexity across 120 product recommendation queries, we observed zero Reddit citations in our sample — despite Reddit occupying 38.3% of Google's Top-3 organic positions for those same queries. This paper investigates whether Reddit's influence on AI operates through an alternative pathway: absorption into training data rather than retrieval during inference. We collected 12,187 posts and 103,696 comments from 60 subreddits spanning 12 consumer product categories and extracted brand mentions using an upvote-weighted scoring system that accounts for community engagement signals. We then correlated Reddit's brand consensus rankings against AI brand recommendation rankings derived from four major platforms — ChatGPT, Claude, Perplexity, and Gemini — each queried three times across 50 product recommendation queries. The correlation was strong, consistent, and statistically significant across every category tested. The mean Spearman rank correlation was *ρ* = .554 across all 12 consumer categories, with all 12 reaching significance at *p* < .05 and 8 of 12 surviving Bonferroni correction. Fisher's combined probability test confirmed the aggregate effect (χ²(22) = 188.42, *p* < 10⁻⁸). The strongest correlations emerged in Office and Workspace (*ρ* = .746), Outdoor and Camping (*ρ* = .674), and Automotive (*ρ* = .665). Per-platform analysis revealed heterogeneous sensitivity to Reddit consensus, with Gemini and ChatGPT showing stronger alignment than Perplexity and Claude. 
Three robustness analyses confirmed the reliability of these findings: a weighting sensitivity analysis demonstrated that all five alternative scoring schemes produced significant correlations (mean *ρ* ranging from .487 to .555); an independent brand extraction using named entity recognition rather than AI-derived dictionaries replicated the correlation (*ρ* = .430, 7/12 categories significant); and a partial correlation analysis controlling for market popularity via Google Trends and Wikipedia page views showed minimal attenuation (mean *ρ*_partial = .554 controlling for Google Trends, .534 for Wikipedia, .529 for both, vs. raw *ρ* = .554). These findings support the training data pathway hypothesis: Reddit functions as what we term a *shadow corpus* — a source whose influence on AI outputs is mediated through pre-training absorption rather than real-time citation. For Generative Engine Optimization practitioners, this means that community consensus shapes AI recommendations in ways that cannot be observed through citation analysis alone.
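The statistical steps named in the abstract (a Spearman rank correlation per category, a Bonferroni threshold, and Fisher's combined probability test) can be illustrated with a minimal standard-library sketch. This is not the author's code: the function names are hypothetical and the input values are made up for illustration. Fisher's statistic is -2 Σ ln p_i on 2k degrees of freedom for k p-values, and because that count is always even, the chi-square tail probability has a closed form that needs no external library.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def bonferroni_survivors(p_values, alpha=0.05):
    """Flag p-values that stay below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

def fisher_combined(p_values):
    """Fisher's method: returns (statistic, degrees of freedom, combined p)."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # Chi-square survival function for even df = 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, 2 * k, math.exp(-half) * total

# Illustrative inputs only: hypothetical brand scores for one category.
reddit_scores = [10, 20, 30, 40]   # e.g. upvote-weighted mention scores
ai_scores = [1, 3, 2, 4]           # e.g. recommendation counts
rho = spearman_rho(reddit_scores, ai_scores)

# Illustrative per-category p-values (not the paper's values).
p_values = [0.001, 0.02, 0.04]
survivors = bonferroni_survivors(p_values)
stat, df, combined_p = fisher_combined(p_values)
```

In practice a statistics library would normally replace these helpers; the hand-rolled versions only make each step explicit.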

Authors

Anthony M. Lee

Institutions

Institute of Automation (DE)

Topics

AI in Service Interactions · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education
