Name: nvidia/Nemotron-CC-Math-v1
Creator: nvidia
License: other
Keywords: huggingface, task_categories:text-generation, license:other, size_categories:100M<n<1B, format:parquet, modality:text, library:datasets, library:dask, library:polars, text-generation

Nemotron-Pre-Training-Dataset-v1 Release

👩‍💻 Authors: Rabeeh Karimi Mahabadi, Sanjeev Satheesh
📘 Paper: Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
📝 Blog: Nemotron-cc-math blog

Data Overview

We're excited to introduce Nemotron-CC-Math, a large-scale, high-quality math corpus extracted from Common Crawl that was used in Nemotron pre-training. This dataset is built to preserve and surface high-value mathematical and code content… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1.


Dataset details