allenai/dolma3_mix-6T

Name: allenai/dolma3_mix-6T
Creator: allenai
License: odc-by
Keywords: huggingface, task_categories:text-generation, language:en, license:odc-by, arxiv:2512.13961, region:us, text-generation

Dolma 3 Mix (6T)

Dolma 3 Mix (6T) is the collection of data used during the pretraining stage to train the Olmo-3-1125-32B model. It comprises roughly 6 trillion tokens drawn from a diverse mix of web content, academic publications, code, and more. Most of the data comes from Common Crawl.
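
Because the dataset is listed at about 4.0 TB, it may be more practical to stream a few records than to download everything first. The following is a minimal sketch using the Hugging Face `datasets` library, not an official loading recipe. The split name `train` and the record field `text` are assumptions about the repository layout, and `preview` and `stream_dolma3_mix` are hypothetical helper names introduced here for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def preview(records: Iterable[Dict[str, Any]], n: int = 3, width: int = 80) -> List[str]:
    """Return a truncated text preview for the first n records."""
    previews = []
    for record in islice(records, n):
        # "text" is an assumed field name; fall back to the whole record if absent.
        text = str(record.get("text", record))
        previews.append(text[:width])
    return previews


def stream_dolma3_mix():
    """Open the dataset in streaming mode (needs network and `pip install datasets`)."""
    # Imported lazily so preview() works without the `datasets` package installed.
    from datasets import load_dataset

    # streaming=True iterates over remote shards instead of downloading them all.
    # split="train" is an assumption about how the repository is organized.
    return load_dataset("allenai/dolma3_mix-6T", split="train", streaming=True)


# Usage (not executed here because it requires network access):
#     for line in preview(stream_dolma3_mix()):
#         print(line)
```

If the repository defines named subsets or a different split, `load_dataset` will need the corresponding `name` or `split` argument. The dataset page on Hugging Face is the place to check.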


# download instantly
mlforge datasets pull allenai/dolma3_mix-6T

Dataset details

Task: Text Generation
Language: en
License: odc-by
Size: 4.0 TB
Creator: allenai
Downloads: 78.6K
Source: huggingface_datasets
Updated: 2026-01-15

About allenai/dolma3_mix-6T