HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Image Text To Text·HuggingFaceTB· 670.7K· 155

transformers apache-2.0 507.5M params dataset:HuggingFaceM4/the_cauldrondataset:HuggingFaceM4/Docmatixdataset:lmms-lab/LLaVA-OneVision-Datadataset:lmms-lab/M4-Instruct-Datadataset:HuggingFaceFV/finevideo

SmolVLM2-500M-Video is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 1.8GB of GPU RAM for video inference, it delivers robust performance on complex mult

Open in MLForge Sign up free Desktop app Source ↗

# pull & run locally
pip install mlforge-sdk && mlforge pull HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Model details

Task

Image Text To Text

Provider

HuggingFaceTB

Framework

transformers

Parameters

507.5M

Size

9.4 GB

License

apache-2.0

Downloads

670.7K

Likes

155

Paper

arXiv:2504.05299

Updated

2025-04-08

About HuggingFaceTB/SmolVLM2-500M-Video-Instruct

Related Image Text To Text

G google/gemma-4-26B-A4B-it Image Text To Text ·26.5B params 13.1M 1.2K 🤗 HF G google/gemma-4-31B-it Image Text To Text ·32.7B params 11.2M 3.1K 🤗 HF Q Qwen/Qwen3.5-9B Image Text To Text ·9.7B params 9.8M 1.6K 🤗 HF Q Qwen/Qwen3.5-4B Image Text To Text ·4.7B params 9.6M 683 🤗 HF Q Qwen/Qwen2.5-VL-7B-Instruct Image Text To Text ·8.3B params 9.4M 1.6K 🤗 HF Q Qwen/Qwen3.6-35B-A3B-FP8 Image Text To Text ·36.0B params 5.8M 284 🤗 HF