mistralai/Voxtral-Mini-4B-Realtime-2602

About mistralai/Voxtral-Mini-4B-Realtime-2602

Voxtral Mini 4B Realtime 2602 is a multilingual, realtime speech-transcription model and among the first open-source solutions to achieve accuracy comparable to offline systems with a delay of = 3600 / 0.8 = 45000. In theory, you should be able to record with no limit; in practice, pre-allocations of RoPE parameters among other things limits --max-model-len. For the best user experience, we recommend to simply instantiate vLLM with the default parameters which will automatically set a maximum model length of 131072 (~ca. 3h). - We strongly recommend using websockets to set up audio streaming sessions. For more info on how to do so, check Usage. - We recommend using a delay of 480ms as we found it to be the sweet spot of performance and low latency. If, however, you want to adapt the delay, you can change the "transcriptiondelayms": 480 parameter in the tekken.json file to any multiple of 80ms between 80 and 1200, as well as 2400 as a standalone value.

Model details

About mistralai/Voxtral-Mini-4B-Realtime-2602

Related Automatic Speech Recognition