allenai/MADLAD-400

Text Generation · allenai · 41.6K downloads
odc-by · 35 TB
Tags: task_categories:text-generation, license:odc-by, size_categories:n>1T, arxiv:2309.04662, arxiv:2010.14571

# download instantly
mlforge datasets pull allenai/MADLAD-400

Dataset details

Task: Text Generation
License: odc-by
Size: 35 TB
Creator: allenai
Downloads: 41.6K
Source: huggingface_datasets
Updated: 2024-09-09

About allenai/MADLAD-400

MADLAD-400 (Multilingual Audited Dataset: Low-resource And Document-level) is a document-level multilingual dataset based on Common Crawl, covering 419 languages in total. It uses all Common Crawl snapshots available as of August 1, 2022. Its primary advantages over similar datasets are that it is more multilingual, it is audited and more heavily filtered, and it is document-level. The main disadvantage is also its strength: being more heavily filtered, it may lack the recall needed for some applications.
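
The filtering trade-off described above can be illustrated with a toy calculation. This is only a sketch: the scores, labels, and thresholds below are invented for illustration and do not represent MADLAD-400's actual filtering pipeline. It shows how a stricter cutoff on a hypothetical quality score raises precision (fewer bad documents kept) while lowering recall (more good documents discarded).

```python
# Hypothetical toy corpus: (quality_score, is_actually_good) pairs.
# Scores and labels are invented; they are not taken from MADLAD-400.
docs = [
    (0.95, True),
    (0.85, True),
    (0.70, True),
    (0.60, False),
    (0.55, True),
    (0.40, False),
    (0.30, True),
    (0.10, False),
]


def precision_recall(docs, threshold):
    """Keep documents scoring at or above `threshold`; return (precision, recall)."""
    kept = [good for score, good in docs if score >= threshold]
    total_good = sum(good for _, good in docs)
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / total_good if total_good else 0.0
    return precision, recall


loose = precision_recall(docs, 0.25)   # permissive filter
strict = precision_recall(docs, 0.65)  # aggressive filter

print("loose  filter: precision=%.2f recall=%.2f" % loose)
print("strict filter: precision=%.2f recall=%.2f" % strict)
```

In this toy setup the strict filter keeps only good documents but drops two of the five good ones, which is the kind of recall loss that may matter for applications needing broad coverage of a low-resource language.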