Task
Text Generation
Nemotron-Pre-Training-Dataset-v1 Release

👩‍💻 Authors: Rabeeh Karimi Mahabadi, Sanjeev Satheesh
📘 Paper: Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
📝 Blog: Nemotron-CC-Math blog

Data Overview

We’re excited to introduce Nemotron-CC-Math, a large-scale, high-quality math corpus extracted from Common Crawl that was used in Nemotron pre-training. This dataset is built to preserve and surface high-value mathematical and code content…

See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1.