This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
Citations: 0
Authors: 31
Year: 2025
Abstract
We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
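The idea of verbatim matching between output segments and training documents can be illustrated with a toy sketch. It assumes a tiny in-memory corpus of pre-tokenized documents and uses a brute-force scan; it is a stand-in for illustration only, not the infini-gram-based indexing the abstract refers to. The names `contains`, `find_verbatim_spans`, and `min_len` are hypothetical.

```python
from typing import List, Tuple


def contains(doc: List[str], span: List[str]) -> bool:
    """Return True if `span` occurs as a contiguous token run in `doc`."""
    n = len(span)
    return any(doc[i:i + n] == span for i in range(len(doc) - n + 1))


def find_verbatim_spans(
    output_tokens: List[str],
    corpus_docs: List[List[str]],
    min_len: int = 3,
) -> List[Tuple[int, int, List[int]]]:
    """Find maximal spans of the output that appear verbatim in the corpus.

    Returns (start, end, doc_ids) triples, where output_tokens[start:end]
    occurs in every document listed in doc_ids. Brute force, toy scale only.
    """
    spans = []
    prev_end = 0  # end of the longest match found from the previous start
    for start in range(len(output_tokens)):
        end = start
        matching = list(range(len(corpus_docs)))
        # Extend the span to the right while some document still contains it.
        while end < len(output_tokens):
            candidate = output_tokens[start:end + 1]
            still = [d for d in matching if contains(corpus_docs[d], candidate)]
            if not still:
                break
            matching = still
            end += 1
        # Keep the span only if it is long enough and not a suffix of a
        # match that already started one token earlier (i.e. it is maximal).
        if end - start >= min_len and end > prev_end:
            spans.append((start, end, matching))
        prev_end = max(prev_end, end)
    return spans


corpus = [
    "the cat sat on the mat".split(),
    "a dog sat on the rug".split(),
]
output = "my cat sat on the rug today".split()
for start, end, doc_ids in find_verbatim_spans(output, corpus):
    print(" ".join(output[start:end]), "-> documents", doc_ids)
```

In this toy example the output overlaps two different documents, so two overlapping maximal spans are reported. A scan like this is linear in corpus size for every candidate span, so it only serves to show what is being matched, not how matching could be done in seconds over trillions of tokens.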
Authors
- Jiacheng Liu
- Thomas N. Blanton
- Yanai Elazar
- Sewon Min
- Y. Chen
- Arnavi Chheda-Kothary
- Huy Tran
- B. Bischoff
- Eric Marsh
- Michael Schmitz
- Cassidy Trier
- Aaron Sarnat
- John T. James
- Jon Borchardt
- Bailey Kuehl
- Evie Yu-Yen Cheng
- Karen Farley
- Sruthi Sreeram
- Thomas Anderson
- David Albright
- Carissa Schoenick
- Luca Soldaini
- Dirk Groeneveld
- Rock Yuren Pang
- Pang Wei Koh
- Noah A. Smith
- Sophie Lebrecht
- Yejin Choi
- Hannaneh Hajishirzi
- Ali Farhadi
- Jesse Dodge