
PleIAs/common_corpus

Task: General · Creator: PleIAs · Downloads: 84.8K
License: Unknown · Size: 7.5 TB · Languages: en, fr, de, zh, it

Common Corpus is the largest openly licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is diverse, spanning books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners.

# download instantly
mlforge datasets pull PleIAs/common_corpus
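
Pulling all 7.5 TB is not always practical, so here is a minimal sketch for previewing a few records without a full download. It assumes the dataset is also reachable on the Hugging Face Hub under the same ID (the page lists huggingface_datasets as the source), that the `datasets` Python package is installed, and that a split named "train" exists. The helper names `take`, `stream_common_corpus`, and `preview` are hypothetical and not part of any tool named on this page.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n records from any iterable."""
    return list(islice(records, n))


def stream_common_corpus(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open the dataset lazily; records are fetched only as they are iterated.

    Assumptions: the Hub ID matches the page title and the split name
    "train" exists. Neither is confirmed by this page.
    """
    # Imported here so that `take` stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("PleIAs/common_corpus", split=split, streaming=True)


def preview(n: int = 3) -> None:
    """Print the column names of the first n records (requires network access)."""
    for record in take(stream_common_corpus(), n):
        # The schema is not described on this page, so only the keys are shown.
        print(sorted(record.keys()))
```

Calling `preview()` prints the column names of the first three records. Streaming mode is used so that only the records actually read are transferred.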

Dataset details

Task: General
Language: en
License: Unknown
Size: 7.5 TB
Rows / images: 69.9K
Creator: PleIAs
Downloads: 84.8K
Source: huggingface_datasets
Updated: 2026-05-06
