cl-nagoya/ruri-v3-310m

Sentence Similarity · cl-nagoya · 624.8K · 81

apache-2.0 · 314.6M params · dataset:cl-nagoya/ruri-v3-dataset-ft · arxiv:2409.07737 · base_model:cl-nagoya/ruri-v3-pt-310m · base_model:finetune:cl-nagoya/ruri-v3-pt-310m · license:apache-2.0 · region:us

# pull & run locally
pip install mlforge-sdk && mlforge pull cl-nagoya/ruri-v3-310m

Model details

Task: Sentence Similarity
Provider: cl-nagoya
Parameters: 314.6M
Size: 1.2 GB
License: apache-2.0
Downloads: 624.8K
Likes: 81
Paper: arXiv:2409.07737
Updated: 2025-04-17
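
The Parameters and Size entries can be cross-checked with a little arithmetic. The sketch below assumes the weights are stored as 32-bit floats (4 bytes per parameter); the page does not state the dtype, so treat that as an assumption rather than a documented fact.

```python
# Rough consistency check between the listed parameter count and file size.
# ASSUMPTION: weights are stored as float32 (4 bytes each); the page does not say.
PARAMS = 314.6e6        # "314.6M" from the model details
BYTES_PER_PARAM = 4     # assumed float32


def weights_size_bytes(params: float, bytes_per_param: int) -> float:
    """Raw size of the weights in bytes, ignoring file-format overhead."""
    return params * bytes_per_param


size_bytes = weights_size_bytes(PARAMS, BYTES_PER_PARAM)
size_gb = size_bytes / 1e9        # decimal gigabytes
size_gib = size_bytes / 2**30     # binary gibibytes

print(f"{size_gb:.2f} GB (decimal), {size_gib:.2f} GiB (binary)")
```

Under that assumption the weights come to roughly 1.26 GB in decimal units or 1.17 GiB in binary units, which is in line with the listed 1.2 GB if the site reports binary units.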

About cl-nagoya/ruri-v3-310m

Ruri v3 is a general-purpose Japanese text embedding model built on top of ModernBERT-Ja. It offers several key technical advantages:

- State-of-the-art performance on Japanese text embedding tasks.
- Supports sequence lengths of up to 8192 tokens.
  - Previous versions of Ruri (v1, v2) were limited to 512.
- Expanded vocabulary of 100K tokens, compared to 32K in v1 and v2.
  - The larger vocabulary makes input sequences shorter, improving efficiency.
- Integrated FlashAttention, following ModernBERT's architecture.
  - Enables faster inference and fine-tuning.
- Tokenizer based solely on SentencePiece.
  - Unlike previous versions, which relied on Japanese-specific BERT tokenizers and required pre-tokenized input, Ruri v3 tokenizes with SentencePiece only; no external word segmentation tool is required.
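
For the sentence-similarity task this model is listed under, a minimal sketch of embedding texts and ranking them by cosine similarity might look like the following. It assumes the `sentence-transformers` package is installed and that the model loads through its standard `SentenceTransformer` interface; the helper names (`embed`, `cosine_similarity`, `rank_by_similarity`) are hypothetical and introduced only for illustration. The upstream model card may recommend input prefixes or other settings that are not reflected here.

```python
import math
from typing import List, Sequence

MODEL_ID = "cl-nagoya/ruri-v3-310m"


def embed(texts: Sequence[str]) -> List[List[float]]:
    """Embed texts with the model (hypothetical helper).

    ASSUMPTION: the model loads via sentence-transformers. The first call
    downloads the weights (about 1.2 GB according to the listing).
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(MODEL_ID)
    return model.encode(list(texts)).tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[int]:
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

In use, one would call `embed([...])` on a query followed by candidate sentences, then pass the first vector and the remaining vectors to `rank_by_similarity`. The ranking helpers are plain Python, so they can be tried with any vectors without downloading the model.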

Related Sentence Similarity

- sentence-transformers/all-MiniLM-L6-v2 · Sentence Similarity · 22.7M params · 245.3M · 5.0K · 🤗 HF
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 · Sentence Similarity · 117.7M params · 50.8M · 1.3K · 🤗 HF
- sentence-transformers/all-mpnet-base-v2 · Sentence Similarity · 109.5M params · 33.9M · 1.3K · 🤗 HF
- BAAI/bge-m3 · Sentence Similarity · 31.1M · 3.1K · 🤗 HF
- nomic-ai/nomic-embed-text-v1.5 · Sentence Similarity · 136.7M params · 18.3M · 856 · 🤗 HF
- intfloat/multilingual-e5-small · Sentence Similarity · 117.7M params · 10.0M · 344 · 🤗 HF