
jobs-git/HPLT2.0_cleaned

Fill Mask · jobs-git · 157.3K downloads
cc0-1.0 · 1.7 TB · task_categories:fill-mask · task_categories:text-generation · task_ids:language-modeling · multilinguality:multilingual · language:ace

This is a large-scale collection of web-crawled documents in 191 languages, produced by the HPLT project. The data comes mostly from the Internet Archive, with some additions from Common Crawl.

# download instantly
mlforge datasets pull jobs-git/HPLT2.0_cleaned

Dataset details

Task: Fill Mask
Language: ace
License: cc0-1.0
Size: 1.7 TB
Creator: jobs-git
Downloads: 157.3K
Source: huggingface_datasets
Updated: 2025-03-07
