Name: mlfoundations/dclm-baseline-1.0
Creator: mlfoundations
License: cc-by-4.0
Keywords: huggingface, license:cc-by-4.0, arxiv:2406.11794, region:us

DCLM-baseline

DCLM-baseline is a pretraining dataset of 4T tokens and 3B documents that achieves strong performance on language model benchmarks. Below are comparisons of models trained on DCLM-baseline with other models in the 7B regime.

Model        Params  Tokens  Open dataset?  CORE  MMLU  EXTENDED
Open weights, closed datasets
Llama2       7B      2T      ✗              49.2  45.8  34.1
DeepSeek     7B      2T      ✗              50.7  48.5  35.3
Mistral-0.3  7B      ?       ✗              57.0  62.7  45.1
QWEN-2       7B      ?       ✗              57.5  71.9  50.5
Llama3       8B      15T     ✗              57.6…

See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0.
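For a rough sense of scale, the 4T-token and 3B-document figures above imply an average document length of about 1,300 tokens. The sketch below works that out and also shows one way to sample a few records without downloading the full corpus. It is a minimal illustration and not part of the dataset card: it assumes the Hugging Face `datasets` library is installed and that the repository exposes a split named "train". The helper names `mean_tokens_per_document` and `peek` are hypothetical.

```python
from itertools import islice

DATASET_ID = "mlfoundations/dclm-baseline-1.0"  # repository name from this card
APPROX_TOKENS = 4e12     # "4T token" figure quoted in the description
APPROX_DOCUMENTS = 3e9   # "3B document" figure quoted in the description


def mean_tokens_per_document(tokens=APPROX_TOKENS, documents=APPROX_DOCUMENTS):
    """Average document length implied by the card's rounded totals."""
    return tokens / documents


def peek(n=3, split="train"):
    """Return the first n records via streaming (hypothetical helper).

    Assumptions: `datasets` is installed and a split named "train" exists.
    Streaming avoids downloading the multi-terabyte corpus up front.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(islice(stream, n))


print(f"~{mean_tokens_per_document():.0f} tokens per document on average")
```

Calling `peek()` returns a list of dictionaries, one per document. The field names are not stated in this excerpt, so inspect `record.keys()` on a sample rather than assuming a schema.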
