HomeModelsImage Text To TextHuggingFaceTB/SmolVLM2-500M-Video-Instruct
S

HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Image Text To Text·HuggingFaceTB· 670.7K· 155
transformers apache-2.0 507.5M params dataset:HuggingFaceM4/the_cauldrondataset:HuggingFaceM4/Docmatixdataset:lmms-lab/LLaVA-OneVision-Datadataset:lmms-lab/M4-Instruct-Datadataset:HuggingFaceFV/finevideo

SmolVLM2-500M-Video is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 1.8GB of GPU RAM for video inference, it delivers robust performance on complex mult

Open in MLForge Sign up free Desktop app Source ↗
# pull & run locally
pip install mlforge-sdk && mlforge pull HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Model details

Task
Image Text To Text
Provider
HuggingFaceTB
Framework
transformers
Parameters
507.5M
Size
9.4 GB
License
apache-2.0
Downloads
670.7K
Likes
155
Paper
arXiv:2504.05299
Updated
2025-04-08

About HuggingFaceTB/SmolVLM2-500M-Video-Instruct

SmolVLM2-500M-Video is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 1.8GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well-suited for on-device applications where computational resources may be limited. Model Summary

Related Image Text To Text

G google/gemma-4-26B-A4B-it Image Text To Text ·26.5B params 13.1M 1.2K 🤗 HF G google/gemma-4-31B-it Image Text To Text ·32.7B params 11.2M 3.1K 🤗 HF Q Qwen/Qwen3.5-9B Image Text To Text ·9.7B params 9.8M 1.6K 🤗 HF Q Qwen/Qwen3.5-4B Image Text To Text ·4.7B params 9.6M 683 🤗 HF Q Qwen/Qwen2.5-VL-7B-Instruct Image Text To Text ·8.3B params 9.4M 1.6K 🤗 HF Q Qwen/Qwen3.6-35B-A3B-FP8 Image Text To Text ·36.0B params 5.8M 284 🤗 HF