
Zyphra/Zyda-2

Text Generation · Zyphra · 174.7K downloads
odc-by · 13 TB · task_categories:text-generation · language:en · license:odc-by · size_categories:n>1T · region:us

Zyda-2 is a 5-trillion-token language modeling dataset built by collecting open, high-quality datasets and combining them through cross-deduplication and model-based quality filtering. It comprises diverse sources of web data, highly educational content, math, code, and scientific papers.
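
To make the two processing steps named above more concrete, here is a toy sketch of cross-deduplication followed by score-based filtering. It is not Zyphra's pipeline: the exact-hash matching, the `toy_score` heuristic, the 0.6 threshold, and the source names `web_a` and `web_b` are all assumptions chosen for illustration. A production pipeline would more likely use fuzzy matching and a trained classifier.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def cross_deduplicate(sources: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Keep the first occurrence of each document across ALL sources.

    Sources are visited in insertion order, so earlier sources win ties.
    Assumption: exact matching on a hash of the normalized text.
    """
    seen = set()
    kept: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        kept[name] = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept[name].append(doc)
    return kept


def quality_filter(
    docs: Iterable[str], score: Callable[[str], float], threshold: float
) -> List[str]:
    """Keep documents whose score is at least the threshold."""
    return [doc for doc in docs if score(doc) >= threshold]


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a learned quality model: share of distinct words."""
    words = doc.lower().split()
    return len(set(words)) / len(words) if words else 0.0


# Two made-up sources that share one near-identical document.
sources = {
    "web_a": ["The cat sat on the mat.", "buy buy buy buy now"],
    "web_b": ["the cat  sat on the mat.", "Gradient descent minimizes a loss function."],
}

deduped = cross_deduplicate(sources)
filtered = {name: quality_filter(docs, toy_score, 0.6) for name, docs in deduped.items()}
print(filtered)
```

In this toy run the copy of the cat sentence in `web_b` is dropped because `web_a` was seen first, and the repetitive "buy" line falls below the threshold. Ordering the sources therefore decides which copy of a duplicate survives.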

# download instantly
mlforge datasets pull Zyphra/Zyda-2

Dataset details

Task: Text Generation
Language: en
License: odc-by
Size: 13 TB
Creator: Zyphra
Downloads: 174.7K
Source: huggingface_datasets
Updated: 2025-08-06
