Helsinki-NLP/nemotron-cc-translated

Translation · Helsinki-NLP · 43.6K downloads
cc0-1.0 · 8.3 TB · task_categories:translation · task_categories:text-generation · language:bos · language:bul · language:cat

# download instantly
mlforge datasets pull Helsinki-NLP/nemotron-cc-translated

Dataset details

Task: Translation
Language: bos
License: cc0-1.0
Size: 8.3 TB
Rows / images: 7.4B
Creator: Helsinki-NLP
Downloads: 43.6K
Source: huggingface_datasets
Updated: 2026-04-27

About Helsinki-NLP/nemotron-cc-translated

nemotron-cc-translated is a collection of automatically translated documents drawn from the high-quality subset of nemotron-cc. The translations were produced with OPUS-MT and HPLT-MT models. Version 1.0 covers 156,431,999 documents, comprising over 70 billion space-separated tokens of English text, translated into 36 languages. In total, the v1.0 data set contains over 2.4 trillion tokens, and the translated documents are aligned across all languages.
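
To put the quoted figures in proportion, here is a rough back-of-the-envelope sketch. It uses only the numbers stated in the description; because the token counts are given as "over" a value, they are treated as lower bounds, and the results are approximate. The function and constant names are illustrative and not part of the dataset or any library.

```python
# Figures quoted in the dataset description. The two token counts are
# stated as "over N", so they are treated as lower bounds here.
DOCUMENTS = 156_431_999
ENGLISH_TOKENS_MIN = 70_000_000_000       # space-separated English tokens
TOTAL_TOKENS_MIN = 2_400_000_000_000      # all of v1.0
TARGET_LANGUAGES = 36


def summarize() -> dict:
    """Derive rough proportions from the quoted lower-bound figures."""
    return {
        # Average length of an English source document, in tokens.
        "avg_english_tokens_per_doc": ENGLISH_TOKENS_MIN / DOCUMENTS,
        # How many times larger the whole data set is than the English part.
        "total_to_english_ratio": TOTAL_TOKENS_MIN / ENGLISH_TOKENS_MIN,
        # Number of target languages, as stated in the description.
        "target_languages": TARGET_LANGUAGES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

Since both token counts are lower bounds, the ratio in particular is only indicative and should not be read as an exact per-language figure.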