LLMTrace

Welcome to the official page for LLMTrace, a large-scale, bilingual (English and Russian) dataset designed to advance the state of the art in AI-generated text detection. Our work addresses the critical need for modern, diverse, and fine-grained data to train robust classifiers and segment-level detectors.

This resource is ideal for researchers, developers, and practitioners working on identifying AI-generated content, ensuring academic integrity, and combating misinformation.

Key Features:

Bilingual: Parallel corpora for both English and Russian.
Modern & Diverse LLMs: Includes texts from a wide range of cutting-edge proprietary and open-source models.
Two Critical Tasks: Supports both standard text classification (Human vs. AI) and novel AI interval detection.
Rich Annotations: Provides detailed metadata on domains, generation methods, and character-level spans for mixed texts.
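
To make the idea of character-level spans concrete, here is a minimal sketch of how such annotations can be consumed. It assumes each AI-written interval is stored as a `(start, end)` pair of character offsets with an exclusive end. The field names `text` and `ai_char_spans` are hypothetical and chosen only for illustration; check the dataset card for the actual schema.

```python
from typing import List, Tuple

# Assumed convention: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def ai_segments(text: str, spans: List[Span]) -> List[str]:
    """Return the substrings of `text` covered by the AI spans."""
    return [text[start:end] for start, end in merge_spans(spans)]


def ai_fraction(text: str, spans: List[Span]) -> float:
    """Share of characters in `text` that fall inside an AI span."""
    if not text:
        return 0.0
    covered = sum(end - start for start, end in merge_spans(spans))
    return covered / len(text)


# Hypothetical mixed-authorship record; offsets are computed, not hard-coded.
human_part = "I wrote this opening myself."
ai_part = "The rest was produced by a model."
record = {
    "text": human_part + " " + ai_part,
    "ai_char_spans": [(len(human_part) + 1, len(human_part) + 1 + len(ai_part))],
}

if __name__ == "__main__":
    print(ai_segments(record["text"], record["ai_char_spans"]))
    print(round(ai_fraction(record["text"], record["ai_char_spans"]), 3))
```

Merging spans before use guards against overlapping annotations being counted twice when the AI-written share of a text is computed.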

Resources

📄 Research Paper

For a detailed description of our dataset creation methodology, statistics, and benchmark experiments, please read our paper.

Read the paper on arXiv: arXiv:2509.21269

💾 Datasets

The dataset is split into two primary components, tailored for classification and detection tasks. Both are available for download.

Classification Dataset

Use this for training standard Human vs. AI text classifiers.

Download from Hugging Face 🤗

Detection Dataset

A versatile dataset for more advanced tasks. Use it for:

Training models to localize AI-generated spans in mixed-authorship texts.
Training 3-way classifiers to distinguish between human, ai, and mixed texts.

Download from Hugging Face 🤗
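
The two uses listed above can both be derived from the same span annotations. The sketch below is one possible way to do that, not the authors' pipeline: it assumes AI-written intervals are given as `(start, end)` character offsets with an exclusive end, derives a 3-way label from span coverage, and projects the spans onto whitespace-delimited tokens to obtain per-token targets for a span-localization model. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed convention: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def three_way_label(text: str, ai_spans: List[Span]) -> str:
    """Return 'human', 'ai', or 'mixed' based on how much of the text is covered."""
    covered = [False] * len(text)
    for start, end in ai_spans:
        for i in range(max(start, 0), min(end, len(text))):
            covered[i] = True
    # Ignore whitespace so that gaps between spans do not turn 'ai' into 'mixed'.
    content = [flag for ch, flag in zip(text, covered) if not ch.isspace()]
    if not any(content):
        return "human"
    if all(content):
        return "ai"
    return "mixed"


def token_labels(text: str, ai_spans: List[Span]) -> List[Tuple[str, int]]:
    """Label each whitespace-delimited token 1 if it overlaps an AI span, else 0."""
    labels = []
    for match in re.finditer(r"\S+", text):
        overlaps = any(match.start() < end and start < match.end()
                       for start, end in ai_spans)
        labels.append((match.group(), int(overlaps)))
    return labels


# Hypothetical example; offsets are computed from the parts to avoid miscounting.
human_part = "Human part."
ai_part = "AI part."
example_text = human_part + " " + ai_part
example_spans = [(len(human_part) + 1, len(example_text))]

if __name__ == "__main__":
    print(three_way_label(example_text, example_spans))
    print(token_labels(example_text, example_spans))
```

A real training setup would typically align spans to subword tokens using the tokenizer's offset mapping rather than whitespace splitting; the overlap test stays the same.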

🤖 Pre-trained Models

We provide several baseline models trained on LLMTrace. These models can be used directly for inference or as a starting point for your own research.

Explore and download models on Hugging Face 🤗

Dataset Statistics

Classification Dataset

The dataset contains a substantial number of examples for both languages.

Detection Dataset

The dataset contains a substantial number of examples for both languages.

Citation

If you use our datasets or models in your research, please cite our papers.

[1] Tolstykh, I., Tsybina, A., Yakubson, S., & Kuprashevich, M. (2025). LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text. arXiv preprint arXiv:2509.21269. [Paper]

[2] Tolstykh, I., Tsybina, A., Yakubson, S., Gordeev, A., Dokholyan, V., & Kuprashevich, M. (2024). GigaCheck: Detecting LLM-generated Content. arXiv preprint arXiv:2410.23728. [Paper]

BibTeX

            
@article{Layer2025LLMTrace,
    title={{LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text}},
    author={Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Maksim Kuprashevich},
    journal={arXiv preprint arXiv:2509.21269},
    year={2025},
    eprint={2509.21269},
    archivePrefix={arXiv}
}

            
@article{tolstykh2024gigacheck,
    title={{GigaCheck: Detecting LLM-generated Content}},
    author={Irina Tolstykh and Aleksandra Tsybina and Sergey Yakubson and Aleksandr Gordeev and Vladimir Dokholyan and Maksim Kuprashevich},
    journal={arXiv preprint arXiv:2410.23728},
    year={2024},
    eprint={2410.23728},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}