Name: nvidia/Nemotron-Pretraining-Code-v2
Creator: nvidia
License: other
Keywords: huggingface, task_categories:text-generation, license:other, size_categories:100M<n<1B, format:parquet, modality:text, library:datasets, library:dask, library:polars, text-generation

Nemotron-Pre-Training-Dataset-v2.1

Dataset Description

The Nemotron-Pre-Training-Dataset-v2.1 extends the previously released Nemotron pretraining datasets with refreshed, higher-quality, and more diverse data across math, code, English Common Crawl, and large-scale synthetic corpora. Designed for the NVIDIA Nemotron 3 family of LLMs, the dataset introduces new Common Crawl code extraction, 2.5T new English web tokens…

See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v2.
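Since the keywords list `library:datasets` and `format:parquet`, a few records can probably be inspected by streaming instead of downloading the whole corpus. The following is a minimal sketch, not an official recipe. It assumes that the `datasets` package is installed, that the repository exposes a split named `train`, and that any access terms on the dataset page have already been accepted. The helper name `peek_records` and its parameters are hypothetical; the card excerpt does not name any configurations, so `config_name` is left for the reader to fill in if the repository requires one.

```python
from itertools import islice

# Repository id taken from the card above.
REPO_ID = "nvidia/Nemotron-Pretraining-Code-v2"


def peek_records(n=3, config_name=None, split="train"):
    """Return the first n records of the dataset as a list of dicts.

    Assumptions (not confirmed by the card excerpt):
    - a split called "train" exists;
    - config_name, if the repository needs one, is supplied by the caller.
    """
    # Imported lazily so that this module can be loaded without `datasets`.
    from datasets import load_dataset

    # streaming=True yields records lazily instead of downloading every file.
    stream = load_dataset(REPO_ID, name=config_name, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_records(2)` and printing the keys of each returned record is a quick way to discover the column names before committing to a full download. If the call fails with a message about a missing configuration, the dataset page lists the available ones.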


Dataset details