
What Is Recursive Corpus Corruption?

Recursive corpus corruption is a feedback loop in which AI-generated content from synthetic entities enters training corpora, is used to train new AI models, and those models then produce more synthetic content that re-enters the training data. This creates recursive contamination in which AI systems increasingly validate fabricated information, eroding the reliability of automated knowledge systems.

Why It Matters

  • AI models train on contaminated data. As synthetic entities publish fabricated research, whitepapers, and technical content, this material is ingested by web crawlers and incorporated into the training datasets of next-generation AI models.

  • Each generation compounds the corruption. The contamination is not static; it amplifies. Each training cycle produces models that generate more synthetic content, which feeds the next cycle, creating exponential degradation of corpus quality.

  • Undermines trust in automated knowledge systems. When AI systems trained on corrupted corpora produce outputs that cite fabricated sources as authoritative, the entire chain of automated knowledge production becomes unreliable.

  • No current mechanism filters synthetic content out of training data. Existing data quality pipelines cannot reliably distinguish high-quality synthetic content from legitimate human-authored publications, making prevention architecturally difficult.
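The compounding dynamic described above can be sketched as a toy recurrence. Assume, purely for illustration, that each training generation a fixed share of the corpus is refreshed from freshly crawled text, and that the synthetic share of that crawled text grows each cycle as more model-generated content is published. All function names, parameters, and values below are hypothetical assumptions, not figures from the source:

```python
def simulate(generations, c0=0.01, refresh=0.2, inflow0=0.05, growth=1.5):
    """Toy model: synthetic fraction of a training corpus per generation.

    Hypothetical parameters (not from the source):
      c0      -- initial synthetic fraction of the corpus
      refresh -- share of the corpus replaced by new crawled data each cycle
      inflow0 -- initial synthetic fraction of newly crawled data
      growth  -- per-generation growth in synthetic output by deployed models
    """
    c, inflow = c0, inflow0
    history = [c]
    for _ in range(generations):
        # Blend the retained corpus with newly crawled data,
        # whose synthetic share is capped at 100%.
        c = (1 - refresh) * c + refresh * min(1.0, inflow)
        # Each model generation publishes more synthetic content,
        # so the synthetic share of the next crawl is higher.
        inflow *= growth
        history.append(c)
    return history
```

Under these assumptions the synthetic fraction rises monotonically toward 1: once the crawled inflow is mostly synthetic, every refresh pushes the corpus further toward full contamination, matching the qualitative "each cycle feeds the next" behavior the list describes.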

Source

Entity-Level Deepfakes and Intellectual Provenance

Thomas Perry Jr. · DOI: 10.5281/zenodo.18645301
