Task
General
TabLib: a minimally preprocessed dataset of 627M tables (69 TiB) extracted from HTML, PDF, CSV, TSV, Excel, and SQLite files from GitHub and Common Crawl. It includes 867B tokens of "context metadata": each table comes with provenance information and surrounding context such as the filename, the text before and after the table, and HTML metadata. A smaller 0.1% sample of the dataset is also available. For more information, see the paper and the announcement blog post. See the full description on the dataset page: https://huggingface.co/datasets/approximatelabs/tablib-v1-full.
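To put the headline figures in perspective, the short sketch below derives per-table averages from the numbers quoted in the description (627M tables, 69 TiB, 867B context tokens). The constant and function names are illustrative and not part of any TabLib tooling. The inputs are rounded figures, so only the first two or three digits of each result are meaningful, and an average says nothing about how table sizes are actually distributed.

```python
# Back-of-the-envelope averages from the figures quoted in the description.
# All names here are hypothetical helpers, not part of any TabLib API.

TABLES = 627e6            # 627M tables
SIZE_TIB = 69             # total dataset size in TiB
CONTEXT_TOKENS = 867e9    # 867B tokens of context metadata

BYTES_PER_TIB = 2 ** 40   # 1 TiB = 2^40 bytes


def average_bytes_per_table(size_tib=SIZE_TIB, tables=TABLES):
    """Mean storage footprint of one table, in bytes."""
    return size_tib * BYTES_PER_TIB / tables


def average_context_tokens_per_table(tokens=CONTEXT_TOKENS, tables=TABLES):
    """Mean number of context-metadata tokens attached to one table."""
    return tokens / tables


if __name__ == "__main__":
    kib = average_bytes_per_table() / 1024
    print(f"average size per table: {kib:.0f} KiB")
    print(f"average context tokens per table: {average_context_tokens_per_table():.0f}")
```

Under these assumptions, an average table takes up roughly 118 KiB and carries about 1,380 tokens of context metadata.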