
zai-org/GLM-OCR


# pull & run locally
pip install mlforge-sdk && mlforge pull zai-org/GLM-OCR

Model details

Task: Image Text To Text
Provider: zai-org
Framework: transformers
Parameters: 1.3B
License: MIT
Downloads: 3.2M
Likes: 1.9K
Paper: arXiv:2603.10910
Updated: 2026-05-19

About zai-org/GLM-OCR

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
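
The two-stage flow described above (layout analysis first, then recognition of each region in parallel) can be sketched roughly as follows. This is only an illustration of the control flow, not the GLM-OCR or PP-DocLayout-V3 API: `detect_layout`, `recognize_region`, `run_pipeline`, and the `page` dictionary are hypothetical stand-ins, and the thread pool is just one assumed way to run the recognition step in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A layout region: a label plus a bounding box (x0, y0, x1, y1)."""
    label: str
    box: tuple


def detect_layout(page):
    """Stage 1 stand-in (hypothetical): a real system would run a layout
    detector here. This stub reads pre-annotated regions from a dict."""
    return [Region(label, box) for label, box, _ in page["regions"]]


def recognize_region(page, region):
    """Stage 2 stand-in (hypothetical): a real system would crop the region
    and run the OCR model on the crop. This stub looks up stored text."""
    for label, box, text in page["regions"]:
        if (label, box) == (region.label, region.box):
            return text
    return ""


def run_pipeline(page, max_workers=4):
    regions = detect_layout(page)
    # Sort into a simple top-to-bottom, left-to-right reading order.
    regions.sort(key=lambda r: (r.box[1], r.box[0]))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results stay in reading order.
        texts = list(pool.map(lambda r: recognize_region(page, r), regions))
    return "\n".join(texts)


# Toy page with regions listed out of reading order (assumed data).
page = {
    "regions": [
        ("paragraph", (0, 40, 100, 80), "Body text."),
        ("title", (0, 0, 100, 30), "Report"),
    ]
}
result = run_pipeline(page)
print(result)
```

Splitting the page into regions before recognition means each region can be processed independently, which is what makes the second stage easy to parallelize; the reading-order sort here is a deliberately naive assumption.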

Related Image Text To Text

google/gemma-4-26B-A4B-it · Image Text To Text · 26.5B params · 13.1M downloads · 1.2K likes
google/gemma-4-31B-it · Image Text To Text · 32.7B params · 11.2M downloads · 3.1K likes
Qwen/Qwen3.5-9B · Image Text To Text · 9.7B params · 9.8M downloads · 1.6K likes
Qwen/Qwen3.5-4B · Image Text To Text · 4.7B params · 9.6M downloads · 683 likes
Qwen/Qwen2.5-VL-7B-Instruct · Image Text To Text · 8.3B params · 9.4M downloads · 1.6K likes
Qwen/Qwen3.6-35B-A3B-FP8 · Image Text To Text · 36.0B params · 5.8M downloads · 284 likes