jobs-git/Zyda-2

Text Generation · jobs-git · 160.7K
odc-by · 1.5 TB · task_categories:text-generation · language:en · license:odc-by · size_categories:n>1T · region:us

Zyda-2 is a 5-trillion-token language modeling dataset built by collecting open, high-quality datasets, combining them, and applying cross-deduplication and model-based quality filtering. It comprises diverse sources of web data, highly educational content, math, code, and scientific papers.
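
The two processing steps named above, cross-deduplication and quality filtering, can be illustrated with a small sketch. This is not the Zyda-2 pipeline: exact SHA-256 hashing of normalized text stands in for whatever deduplication method the authors used, and `toy_score` is a hypothetical placeholder for a real model-based quality classifier. All function names and the threshold are assumptions made for illustration.

```python
import hashlib
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def cross_deduplicate(sources: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Keep the first copy of each document across ALL sources.

    Illustrative only: exact-match hashing, not the authors' method.
    """
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for name, docs in sources.items():
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate of a document already kept
            seen.add(digest)
            kept.append((name, doc))
    return kept


def quality_filter(
    docs: list[tuple[str, str]],
    score: Callable[[str], float],
    threshold: float,
) -> list[tuple[str, str]]:
    """Keep documents whose score meets the threshold."""
    return [(name, doc) for name, doc in docs if score(doc) >= threshold]


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a quality model: unique-word ratio."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


# Two hypothetical sources that share one near-identical document.
sources = {
    "web": ["The cat sat on the mat.", "buy buy buy buy now"],
    "edu": [
        "the  cat sat on the MAT.",
        "Photosynthesis converts light into chemical energy.",
    ],
}

deduped = cross_deduplicate(sources)
filtered = quality_filter(deduped, toy_score, threshold=0.5)
for name, doc in filtered:
    print(f"{name}: {doc}")
```

Deduplicating across sources rather than within each one is what makes the step "cross": a document that appears in two component datasets is kept only once in the combined corpus.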

# download instantly
mlforge datasets pull jobs-git/Zyda-2

Dataset details

Task: Text Generation
Language: en
License: odc-by
Size: 1.5 TB
Creator: jobs-git
Downloads: 160.7K
Source: huggingface_datasets
Updated: 2025-03-07
