HuggingFaceTB/SmolVLM-256M-Instruct

Image Text To Text·HuggingFaceTB· 979.4K· 372

transformers apache-2.0 256.5M params dataset:HuggingFaceM4/the_cauldrondataset:HuggingFaceM4/Docmatixarxiv:2504.05299base_model:HuggingFaceTB/SmolLM2-135M-Instructbase_model:quantized:HuggingFaceTB/SmolLM2-135M-Instruct

SmolVLM-256M is the smallest multimodal model in the world. It accepts arbitrary sequences of image and text inputs to produce text outputs. It's designed for efficiency. SmolVLM can answer questions about images, describe visual content, or transcribe text. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks. It can ru

Open in MLForge Sign up free Desktop app Source ↗

# pull & run locally
pip install mlforge-sdk && mlforge pull HuggingFaceTB/SmolVLM-256M-Instruct

Model details

Task

Image Text To Text

Provider

HuggingFaceTB

Framework

transformers

Parameters

256.5M

Size

5.3 GB

License

apache-2.0

Downloads

979.4K

Likes

372

Paper

arXiv:2504.05299

Updated

2025-04-08

About HuggingFaceTB/SmolVLM-256M-Instruct

Related Image Text To Text

G google/gemma-4-26B-A4B-it Image Text To Text ·26.5B params 13.1M 1.2K 🤗 HF G google/gemma-4-31B-it Image Text To Text ·32.7B params 11.2M 3.1K 🤗 HF Q Qwen/Qwen3.5-9B Image Text To Text ·9.7B params 9.8M 1.6K 🤗 HF Q Qwen/Qwen3.5-4B Image Text To Text ·4.7B params 9.6M 683 🤗 HF Q Qwen/Qwen2.5-VL-7B-Instruct Image Text To Text ·8.3B params 9.4M 1.6K 🤗 HF Q Qwen/Qwen3.6-35B-A3B-FP8 Image Text To Text ·36.0B params 5.8M 284 🤗 HF