
allenai/dolma3_mix-6T

Text Generation · allenai · 78.6K downloads
odc-by · 4.0 TB · task_categories:text-generation · language:en · license:odc-by · arxiv:2512.13961 · region:us

Dolma 3 Mix (6T) is the collection of data used during the pretraining stage of the Olmo-3-1125-32B model. It contains roughly 6 trillion tokens drawn from a diverse mix of web content, academic publications, code, and more. Most of the data comes from Common Crawl.

# download instantly
mlforge datasets pull allenai/dolma3_mix-6T
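
Because the listing gives huggingface_datasets as the source, one alternative to pulling the full 4.0 TB is to stream a few records with the Hugging Face `datasets` library. The following is a minimal sketch, not the maintainers' recommended workflow. It assumes the `datasets` package is installed, that the repository loads with its default configuration, and that a split named "train" exists. The helper names `open_stream` and `take_records` are hypothetical and chosen only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def open_stream(repo_id: str = "allenai/dolma3_mix-6T", split: str = "train") -> Iterable[dict]:
    """Open the dataset lazily so that no shards are downloaded up front."""
    # Imported here so take_records() stays usable without the `datasets` package.
    from datasets import load_dataset

    # ASSUMPTION: a "train" split exists and the default configuration loads as-is.
    return load_dataset(repo_id, split=split, streaming=True)


def take_records(stream: Iterable[dict], n: int) -> Iterator[dict]:
    """Yield at most the first n records from any iterable of records."""
    return islice(stream, n)


if __name__ == "__main__":
    for record in take_records(open_stream(), 3):
        # ASSUMPTION: each record is a dict. The field names are not stated in
        # the listing, so print the keys instead of assuming a "text" field.
        print(sorted(record.keys()))
```

Streaming keeps disk usage near zero while inspecting the schema. A full local copy is still the better choice for repeated passes over the data.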

Dataset details

Task: Text Generation
Language: en
License: odc-by
Size: 4.0 TB
Creator: allenai
Downloads: 78.6K
Source: huggingface_datasets
Updated: 2026-01-15
