mlfoundations/dclm-baseline-1.0

General · mlfoundations · 551.2K downloads
License: cc-by-4.0 · arXiv: 2406.11794 · Region: US

DCLM-baseline

DCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks. Below are comparisons of a model trained on DCLM-baseline with other models in the 7B regime.

Model        Params  Tokens  Open dataset?  CORE  MMLU  EXTENDED
Open weights, closed datasets
Llama2       7B      2T      ✗              49.2  45.8  34.1
DeepSeek     7B      2T      ✗              50.7  48.5  35.3
Mistral-0.3  7B      ?       ✗              57.0  62.7  45.1
QWEN-2       7B      ?       ✗              57.5  71.9  50.5
Llama3       8B      15T     ✗              57.6…

See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0.

# download instantly
mlforge datasets pull mlfoundations/dclm-baseline-1.0

Dataset details

Task: General
License: cc-by-4.0
Classes: 8
Creator: mlfoundations
Downloads: 551.2K
Source: huggingface_datasets
Updated: 2024-07-22

About mlfoundations/dclm-baseline-1.0

DCLM-baseline is a 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.
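
To give a sense of scale, the 4T token / 3B document figures imply an average document length of roughly 1,300 tokens. The Python sketch below works that out and, as an optional extra, shows one way a few records might be sampled through the Hugging Face `datasets` library, which this page lists as the source. The helper names `avg_tokens_per_document` and `peek` are hypothetical, the token and document counts are the rounded values quoted on this page, and the existence of a `train` split is an assumption that this page does not confirm.

```python
from itertools import islice

# Rounded headline figures quoted in the dataset description.
TOTAL_TOKENS = 4e12      # "4T token"
TOTAL_DOCUMENTS = 3e9    # "3B document"


def avg_tokens_per_document(tokens: float = TOTAL_TOKENS,
                            documents: float = TOTAL_DOCUMENTS) -> float:
    """Return the mean number of tokens per document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return tokens / documents


def peek(n: int = 3, repo: str = "mlfoundations/dclm-baseline-1.0"):
    """Yield the first n records without downloading the whole dataset.

    Assumes the `datasets` package is installed and that the repository
    exposes a `train` split (an assumption, not confirmed by this page).
    """
    from datasets import load_dataset  # imported lazily; only needed here

    # streaming=True iterates over records on demand instead of
    # fetching the full dataset to disk first.
    stream = load_dataset(repo, split="train", streaming=True)
    yield from islice(stream, n)


if __name__ == "__main__":
    print(f"~{avg_tokens_per_document():,.0f} tokens per document")
```

Because the inputs are rounded, the result is only an order-of-magnitude guide. Streaming is suggested here because a corpus of this size is unlikely to fit comfortably on a typical workstation disk.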