What Is Recursive Corpus Corruption?
Recursive corpus corruption is a feedback loop in which AI-generated content from synthetic entities enters training corpora and is used to train new AI models, which then produce more synthetic content that re-enters the training data. The result is recursive contamination: AI systems increasingly validate fabricated information, eroding the reliability of automated knowledge systems.
Why It Matters
- AI models train on contaminated data. As synthetic entities publish fabricated research, whitepapers, and technical content, this material is ingested by web crawlers and incorporated into the training datasets of next-generation AI models.
- Each generation compounds the corruption. The contamination is not static; it amplifies. Each training cycle produces models that generate more synthetic content, which feeds the next cycle, creating exponential degradation of corpus quality.
- Undermines trust in automated knowledge systems. When AI systems trained on corrupted corpora produce outputs that cite fabricated sources as authoritative, the entire chain of automated knowledge production becomes unreliable.
- No current mechanism to filter synthetic content from training data. Existing data-quality pipelines cannot reliably distinguish high-quality synthetic content from legitimate publications, making prevention architecturally difficult.
Source
Entity-Level Deepfakes and Intellectual Provenance
Thomas Perry Jr. · DOI: 10.5281/zenodo.18645301