HuggingFaceTB/smollm-corpus

General · HuggingFaceTB · 36.2K downloads
odc-by · 735 GB
Tags: language:en, license:odc-by, size_categories:100M<n<1B, format:parquet, modality:tabular

This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

# download instantly
mlforge datasets pull HuggingFaceTB/smollm-corpus

Dataset details

Task: General
Language: en
License: odc-by
Size: 735 GB
Rows / images: 237.0M
Creator: HuggingFaceTB
Downloads: 36.2K
Source: huggingface_datasets
Updated: 2024-09-06
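
Because the listing names huggingface_datasets as its source and the full corpus is 735 GB, it may be more practical to stream a few rows than to pull everything first. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository is split into named subsets. The subset name `cosmopedia-v2` in the commented example and the helper names `stream_subset` and `preview` are assumptions for illustration and are not stated in this listing.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

REPO_ID = "HuggingFaceTB/smollm-corpus"


def stream_subset(config_name: str, split: str = "train"):
    """Open one subset lazily so that nothing is downloaded up front.

    `config_name` must be a subset that exists in the repository; this
    listing does not enumerate them, so check the dataset card first.
    """
    # Imported here so that `preview` works even without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split, streaming=True)


def preview(rows: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first `n` rows of any row iterable without consuming the rest."""
    return list(islice(rows, n))


# Example (needs network access and `pip install datasets`):
# rows = stream_subset("cosmopedia-v2")  # subset name is an assumption
# for row in preview(rows, 2):
#     print(sorted(row.keys()))
```

Streaming keeps disk use near zero while you inspect column names, which helps decide whether the full 735 GB download is worth it.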
