OpenAlex · Updated hourly · Last updated: 04.04.2026, 12:15

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Reddit Doesn't Get Cited (Through the API): Training Data Influence, Access-Channel Divergence, and the Shadow Corpus in AI Brand Recommendations

2026 · 0 citations · Zenodo (CERN European Organization for Nuclear Research) · Open Access

Citations: 0 · Authors: 1 · Year: 2026

Abstract

AI chatbots functionally never cite Reddit — through their APIs. In a companion study of 6,699 URLs cited by ChatGPT and Perplexity across 120 product recommendation queries, we observed zero Reddit citations in our sample — despite Reddit occupying 38.3% of Google's Top-3 organic positions for those same queries. This paper investigates Reddit's influence on AI through two complementary analyses: a training data correlation study and a systematic comparison of Reddit citation behavior across API and web UI access channels. For the training data analysis, we collected 12,187 posts and 103,696 comments from 60 subreddits spanning 12 consumer product categories and extracted brand mentions using an upvote-weighted scoring system. We then correlated Reddit's brand consensus rankings against AI brand recommendation rankings derived from four major platforms — ChatGPT, Claude, Perplexity, and Gemini — each queried three times across 50 product recommendation queries. The correlation was strong, consistent, and statistically significant across every category tested. The mean Spearman rank correlation was *ρ* = .554 across all 12 consumer categories, with all 12 reaching significance at *p* < .05 and 8 of 12 surviving Bonferroni correction. Fisher's combined probability test confirmed the aggregate effect (χ²(22) = 188.42, *p* < 10⁻⁸). Three robustness analyses — weighting sensitivity, independent brand extraction via NER, and partial correlation controlling for market popularity — confirmed the reliability of these findings. For the access-channel analysis, we built browser automation scrapers that collected citation data from the web UIs of four platforms (Google AI Mode, Perplexity, ChatGPT, and Claude) across 100 queries spanning 13 domains and five intent types, then compared these against API results for the same queries. 
The divergence was stark: APIs produced 0% Reddit citation rates across all platforms, while web UIs produced 44% (Google AI Mode), 20% (Perplexity), and 17% (ChatGPT). Validation queries — those seeking opinions and comparisons — surfaced Reddit at the highest rates (71% on Google AI Mode, 46% on Perplexity). Only Claude maintained zero Reddit citations across both access channels. These findings support a three-channel model of Reddit's influence on AI: (1) a *training data pathway* through which Reddit's community consensus is absorbed into model weights during pre-training (*ρ* = .554); (2) a *web UI citation pathway* through which Reddit is actively retrieved and cited in consumer-facing interfaces (27% aggregate rate); and (3) an *API citation pathway* that categorically suppresses Reddit (0% rate). Reddit functions as what we term a *shadow corpus* — a source whose influence is partially invisible depending on which access channel is examined. For Generative Engine Optimization practitioners, this means that community consensus shapes AI recommendations through both absorbed training signal and selective real-time retrieval, and that studying only API outputs dramatically underestimates Reddit's role in AI-generated responses.
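
The statistical steps named in the abstract (Spearman rank correlation per category, Bonferroni correction, and Fisher's combined probability test) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy brand scores, and the example p-values are all hypothetical, and the abstract does not specify how the upvote-weighted scores or the per-category p-values were computed. The sketch uses only the Python standard library.

```python
import math


def average_ranks(values):
    """Rank values ascending from 1..n; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def bonferroni_survivors(p_values, alpha=0.05):
    """Flag which tests stay significant at alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def fisher_combined(p_values):
    """Fisher's method: X = -2 * sum(ln p_i), chi-square with 2k df.

    For even df = 2k the chi-square survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!, so no external library is needed.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, 2 * k, math.exp(-half) * total


# Hypothetical toy data for one category: five brands, made-up scores.
reddit_scores = [120.0, 95.5, 60.0, 33.0, 10.0]  # assumed upvote-weighted scores
ai_mentions = [9, 11, 6, 2, 3]                   # assumed AI recommendation counts
rho = spearman_rho(reddit_scores, ai_mentions)   # 0.8 up to floating-point rounding

# Hypothetical per-category p-values, not taken from the paper.
example_p = [0.001, 0.02, 0.04]
survivors = bonferroni_survivors(example_p)      # [True, False, False]
stat, df, combined_p = fisher_combined(example_p)
```

In practice, `scipy.stats.spearmanr` and `scipy.stats.combine_pvalues` cover the correlation and the combined test, and also supply the per-category p-values that this sketch takes as given.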

Topics

AI in Service Interactions · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI