Task
Text Generation
Nemotron-Pre-Training-Dataset-v2.1
Dataset Description
The Nemotron-Pre-Training-Dataset-v2.1 extends the previously released Nemotron pretraining datasets with refreshed, higher-quality, and more diverse data across math, code, English Common Crawl, and large-scale synthetic corpora. Designed for the NVIDIA Nemotron 3 family of LLMs, the dataset introduces new Common Crawl code extraction, 2.5T new English web tokens… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v2.