nvidia/Nemotron-Pretraining-Code-v2

Text Generation · nvidia · 6.1K
Tags: task_categories:text-generation, license:other, size_categories:100M<n<1B, format:parquet, modality:text

Nemotron-Pre-Training-Dataset-v2.1

Dataset Description

The Nemotron-Pre-Training-Dataset-v2.1 extends the previously released Nemotron pretraining datasets with refreshed, higher-quality, and more diverse data across math, code, English Common Crawl, and large-scale synthetic corpora. Designed for the NVIDIA Nemotron 3 family of LLMs, the dataset introduces new Common Crawl code extraction, 2.5T new English web tokens…

See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v2.

# download instantly
mlforge datasets pull nvidia/Nemotron-Pretraining-Code-v2

Dataset details

Task: Text Generation
License: other
Size: 845 GB
Creator: nvidia
Downloads: 6.1K
Source: huggingface_datasets
Updated: 2025-12-22
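
Since the listing gives the source as Hugging Face and the format as parquet, one way to inspect a few records without pulling all 845 GB is to stream them with the Hugging Face `datasets` library. The following is a minimal sketch, not an official recipe: the helper name `peek_records` is hypothetical, the `train` split and the default configuration are assumptions that this listing does not confirm, and because the license is listed as "other", access may require accepting terms and authenticating on the dataset page first.

```python
from itertools import islice

# Repository name as given in the listing.
REPO_ID = "nvidia/Nemotron-Pretraining-Code-v2"


def peek_records(repo_id=REPO_ID, config=None, split="train", n=3):
    """Return the first n records by streaming, without a full download.

    `config` and `split` are assumptions: the listing does not say which
    configurations or splits exist, so adjust them to match the dataset page.
    """
    # Imported lazily so this module loads even if `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields an iterable dataset that reads shards on demand.
    ds = load_dataset(repo_id, name=config, split=split, streaming=True)
    return list(islice(ds, n))


if __name__ == "__main__":
    # Print only the field names of each record to keep the output short.
    for record in peek_records():
        print(sorted(record.keys()))
```

If the repository defines several configurations, `load_dataset` will ask for one by name; pass it through the `config` argument.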