allenai/c4

Text Generation · allenai · 827.1K downloads
License: odc-by · Size: 488 KB
Tags: task_categories:text-generation, task_categories:fill-mask, task_ids:language-modeling, task_ids:masked-language-modeling, annotations_creators:no-annotation

C4 Dataset Summary

A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset.

We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, these are the sizes of the variants:

  en: 305GB
  en.noclean: 2.3TB
  en.noblocklist: 380GB
  realnewslike: 15GB
  multilingual (mC4): 9.7TB (108 subsets, one per…

See the full description on the dataset page: https://huggingface.co/datasets/allenai/c4.
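Because the variants range from 15GB to several terabytes, it can help to check which ones fit a given disk budget and to stream records rather than download a whole variant. The following is a minimal sketch, not part of the dataset card: the helper names `variants_within` and `stream_examples` are hypothetical, the sizes are copied from the summary above and converted assuming decimal units (1TB = 1000GB), and the streaming helper assumes the Hugging Face `datasets` library is installed and that the variant names are accepted as configuration names on the Hub.

```python
from itertools import islice

# Sizes as listed in the dataset summary, in GB.
# Assumption: decimal units, so 2.3TB is treated as 2300GB.
VARIANT_SIZES_GB = {
    "en": 305,
    "en.noclean": 2300,
    "en.noblocklist": 380,
    "realnewslike": 15,
    "multilingual (mC4)": 9700,
}


def variants_within(budget_gb):
    """Return the variant names whose listed size fits in budget_gb, sorted by name."""
    return sorted(
        name for name, size in VARIANT_SIZES_GB.items() if size <= budget_gb
    )


def stream_examples(variant="realnewslike", n=3):
    """Return the first n training records of a variant without a full download.

    Requires the `datasets` package and network access; the import is kept
    local so the size helper above works without it.
    """
    from datasets import load_dataset

    dataset = load_dataset("allenai/c4", variant, split="train", streaming=True)
    return list(islice(dataset, n))


if __name__ == "__main__":
    # Which variants fit on a 400GB disk, going by the listed sizes?
    print(variants_within(400))
```

Calling `stream_examples()` would fetch a few records from the smallest listed variant; it is left out of the `__main__` block so the script runs offline.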

# download instantly
mlforge datasets pull allenai/c4

Dataset details

Task: Text Generation
Language: af
License: odc-by
Size: 488 KB
Creator: allenai
Downloads: 827.1K
Source: huggingface_datasets
Updated: 2024-01-09