microsoft/Phi-3.5-vision-instruct

Image Text To Text·microsoft· 1.6M· 736

transformers mit 4.1B params arxiv:2404.14219

image text to text · transformers model

Open in MLForge Sign up free Desktop app Source ↗

# pull & run locally
pip install mlforge-sdk && mlforge pull microsoft/Phi-3.5-vision-instruct

Model details

Task

Image Text To Text

Provider

microsoft

Framework

transformers

Parameters

4.1B

License

mit

Downloads

1.6M

Likes

736

Paper

arXiv:2404.14219

Updated

2025-12-10

About microsoft/Phi-3.5-vision-instruct

Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision. The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Related Image Text To Text

G google/gemma-4-26B-A4B-it Image Text To Text ·26.5B params 13.1M 1.2K 🤗 HF G google/gemma-4-31B-it Image Text To Text ·32.7B params 11.2M 3.1K 🤗 HF Q Qwen/Qwen3.5-9B Image Text To Text ·9.7B params 9.8M 1.6K 🤗 HF Q Qwen/Qwen3.5-4B Image Text To Text ·4.7B params 9.6M 683 🤗 HF Q Qwen/Qwen2.5-VL-7B-Instruct Image Text To Text ·8.3B params 9.4M 1.6K 🤗 HF Q Qwen/Qwen3.6-35B-A3B-FP8 Image Text To Text ·36.0B params 5.8M 284 🤗 HF