approximatelabs/tablib-v1-full

Tags: license:other, size_categories:10B<n<100B, format:parquet, modality:text, library:datasets

TabLib is a minimally preprocessed dataset of 627M tables (69 TiB) extracted from HTML, PDF, CSV, TSV, Excel, and SQLite files from GitHub and Common Crawl. It includes 867B tokens of "context metadata": each table comes with provenance information and table context such as the filename, the text before and after the table, and HTML metadata. A smaller 0.1% sample of the dataset is also available. For more information, see the paper and the announcement blog post. The full description is on the dataset page: https://huggingface.co/datasets/approximatelabs/tablib-v1-full.
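The description gives its size in TiB (binary units), while the listing reports sizes in TB (decimal units). The short Python sketch below puts the headline figures on a common scale and derives two rough per-table numbers. It uses only the figures quoted above; treating the 0.1% sample as a fraction of the table count is an assumption, since the description does not say whether the sample is measured in tables or bytes.

```python
# Figures quoted in the dataset description.
TABLES = 627_000_000               # 627M tables
CONTEXT_TOKENS = 867_000_000_000   # 867B tokens of context metadata
SIZE_TIB = 69                      # 69 TiB, binary units

# 1 TiB = 2**40 bytes; 1 TB = 10**12 bytes.
size_bytes = SIZE_TIB * 2**40
size_tb = size_bytes / 10**12

# Average context-metadata tokens per table.
tokens_per_table = CONTEXT_TOKENS / TABLES

# ASSUMPTION: the 0.1% sample is taken over the table count.
sample_tables = TABLES // 1000

print(f"69 TiB is about {size_tb:.2f} TB")
print(f"about {tokens_per_table:.0f} context tokens per table")
print(f"0.1% sample: about {sample_tables:,} tables (if sampled by table count)")
```

These are back-of-the-envelope estimates for planning storage and bandwidth, not figures reported by the dataset authors.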

# download instantly
mlforge datasets pull approximatelabs/tablib-v1-full

Dataset details

Task: General
License: other
Size: 27 TB
Creator: approximatelabs
Downloads: 14.2K
Source: huggingface_datasets
Updated: 2023-10-13