Name: approximatelabs/tablib-v1-full
Creator: approximatelabs
License: other
Keywords: huggingface, license:other, size_categories:10B<n<100B, format:parquet, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars

TabLib is a minimally preprocessed dataset of 627M tables (69 TiB) extracted from HTML, PDF, CSV, TSV, Excel, and SQLite files from GitHub and Common Crawl. It includes 867B tokens of "context metadata": each table comes with provenance information and table context such as filename, text before/after, HTML metadata, etc. A smaller 0.1% sample of this dataset is also available. For more information, read the paper and announcement blog. See the full description on the dataset page: https://huggingface.co/datasets/approximatelabs/tablib-v1-full.
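At 69 TiB, a full download is rarely practical, so a first look at the data is usually done by streaming. The following is a minimal sketch using the Hugging Face `datasets` library (listed in the keywords above). The split name `"train"` is an assumption, as is the idea that the repository can be read without extra steps; it may require accepting terms on the dataset page and logging in first. The helper names `take`, `stream_tablib`, and `preview` are hypothetical, introduced only for illustration.

```python
from itertools import islice
from typing import Any, Iterable, List

# Repository id taken from the dataset page URL.
REPO_ID = "approximatelabs/tablib-v1-full"


def take(rows: Iterable[Any], n: int) -> List[Any]:
    """Return at most the first n items of an iterable, leaving the rest unread."""
    return list(islice(rows, n))


def stream_tablib(split: str = "train"):
    """Open the dataset lazily so that rows are fetched only when iterated.

    Assumption: a split called "train" exists and the caller already has
    access to the repository (it may be gated).
    """
    # Imported here so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


def preview(n: int = 3) -> None:
    """Print the column names of the first n rows (requires network access)."""
    for row in take(stream_tablib(), n):
        # Column names are not assumed here; they are read from the data.
        print(sorted(row.keys()))
```

Calling `preview()` prints the field names of a few rows, which is a cheap way to discover the actual schema before committing to any larger job. For experimentation, the 0.1% sample mentioned above is likely the more convenient starting point.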

Dataset details