Helsinki-NLP/nemotron-cc-translated
mlforge datasets pull Helsinki-NLP/nemotron-cc-translated
Dataset details
About Helsinki-NLP/nemotron-cc-translated
nemotron-cc-translated is a collection of automatically translated documents taken from the high-quality subset of nemotron-cc. Translations are based on OPUS-MT and HPLT-MT models. The data in v1.0 covers 156,431,999 documents with over 70 billion space-separated tokens of English data translated into 36 languages. The total v1.0 data set includes over 2.4 trillion tokens, and the translated documents are aligned across all languages.
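To make the quoted figures more concrete, the following is a minimal sketch of what "space-separated tokens" means and what the v1.0 numbers imply per document. It assumes that token counts come from plain whitespace splitting, which the description suggests but does not spell out. The function and constant names are illustrative only and are not part of the dataset or any tooling around it. Because the token totals are stated as "over" a rounded value, the results are rough lower-bound estimates.

```python
# Figures quoted in the dataset description for v1.0.
NUM_DOCUMENTS = 156_431_999             # exact document count
ENGLISH_TOKENS_MIN = 70_000_000_000     # "over 70 billion" English tokens
TOTAL_TOKENS_MIN = 2_400_000_000_000    # "over 2.4 trillion" tokens overall


def count_space_separated_tokens(text: str) -> int:
    """Count tokens by splitting on whitespace (assumed counting method)."""
    return len(text.split())


def average_english_tokens_per_document() -> float:
    """Rough lower-bound average of English tokens per source document."""
    return ENGLISH_TOKENS_MIN / NUM_DOCUMENTS


def total_to_english_ratio() -> float:
    """Ratio of the quoted total token count to the quoted English count."""
    return TOTAL_TOKENS_MIN / ENGLISH_TOKENS_MIN


if __name__ == "__main__":
    sample = "Translations are based on OPUS-MT and HPLT-MT models."
    print(count_space_separated_tokens(sample))
    print(round(average_english_tokens_per_document(), 1))
    print(round(total_to_english_ratio(), 1))
```

Under these assumptions, a source document averages about 447 English tokens, and the quoted total is roughly 34 times the quoted English token count.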