cl-nagoya/ruri-v3-310m
pip install mlforge-sdk && mlforge pull cl-nagoya/ruri-v3-310m
Model details
About cl-nagoya/ruri-v3-310m
Ruri v3 is a general-purpose Japanese text embedding model built on top of ModernBERT-Ja. Ruri v3 offers several key technical advantages:

- State-of-the-art performance for Japanese text embedding tasks.
- Supports sequence lengths up to 8192 tokens.
  - Previous versions of Ruri (v1, v2) were limited to 512 tokens.
- Expanded vocabulary of 100K tokens, compared to 32K in v1 and v2.
  - The larger vocabulary makes input sequences shorter, improving efficiency.
- Integrated FlashAttention, following ModernBERT's architecture.
  - Enables faster inference and fine-tuning.
- Tokenizer based solely on SentencePiece.
  - Unlike previous versions, which relied on Japanese-specific BERT tokenizers and required pre-tokenized input, Ruri v3 performs tokenization with SentencePiece only, so no external word segmentation tool is required.
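As a rough illustration of how an embedding model like this is typically used, the sketch below encodes texts and ranks documents by cosine similarity to a query. It assumes the checkpoint can be loaded through the `sentence-transformers` library, which this page does not state. The helper names `embed_texts`, `cosine_similarity`, and `rank_by_similarity` are hypothetical and exist only for this example.

```python
import math
from typing import List, Sequence

MODEL_ID = "cl-nagoya/ruri-v3-310m"  # model name taken from this page


def embed_texts(texts: List[str]):
    """Encode texts into embedding vectors (downloads the weights on first use).

    Assumption: the checkpoint loads through sentence-transformers.
    """
    # Imported lazily so the pure-Python helpers below work without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_ID)
    return model.encode(texts, convert_to_numpy=True)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

A possible usage, under the same assumption: `vecs = embed_texts(["瑠璃色はどんな色?", "瑠璃色は紫みを帯びた濃い青。", "今日は雨です。"])` followed by `rank_by_similarity(vecs[0], vecs[1:])` would order the two candidate sentences by their similarity to the first one.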