Model Gallery

6 models from 1 repository

voxtral-mini-4b-realtime
Voxtral Mini 4B Realtime is a speech-to-text model from Mistral AI. It is a 4B parameter model optimized for fast, accurate audio transcription with low latency, making it ideal for real-time applications. The model uses the Voxtral architecture for efficient audio processing.

Repository: localai
License: apache-2.0

lfm2.5-audio-1.5b-realtime
LFM2.5-Audio-1.5B is LiquidAI's any-to-any audio foundation model. The 1.2B LFM2.5 backbone plus a FastConformer audio encoder and an LFM2-based audio detokenizer give real-time speech-to-speech with text + audio output interleaved at 12.5 Hz / 24 kHz. This entry runs in S2S (speech-to-speech) mode and is the model the LocalAI realtime API any-to-any path consumes. Switch to ASR, TTS, or chat by picking the sibling gallery entries.
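The "12.5 Hz / 24 kHz" figures can be related with a little arithmetic. The sketch below assumes that 12.5 Hz is the rate at which audio frames are emitted and 24 kHz is the sample rate of the output waveform; the entry does not spell this out, so treat the interpretation and the variable names as illustrative.

```python
# Assumed interpretation: 12.5 audio frames per second, 24 kHz output audio.
FRAME_RATE_HZ = 12.5
SAMPLE_RATE_HZ = 24_000

# Duration of audio covered by one frame, in milliseconds.
frame_ms = 1000 / FRAME_RATE_HZ

# Number of waveform samples the detokenizer would produce per frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

print(f"{frame_ms:.0f} ms per frame, {samples_per_frame:.0f} samples per frame")
```

Under that assumption, each frame covers 80 ms of audio, which is the granularity at which text and audio output could be interleaved.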

Repository: localai
License: LFM-Open-License-v1.0

localvqe-v1.1-1.3m
LocalVQE v1.1 (1.3 M parameters, F32) — joint acoustic echo cancellation, noise suppression, and dereverberation for 16 kHz mono speech. DeepVQE-style architecture with an S4D bottleneck and an in-graph DCT-II filterbank. ~9.6× realtime on a desktop CPU; 16 ms algorithmic latency. ~5 MB on disk. v1.1 ships the v16 echoaware checkpoint with improved double-talk and near-end single-talk AECMOS scores.

Repository: localai
License: apache-2.0

localvqe-v1.2-1.3m
LocalVQE v1.2 (1.3 M parameters, F32) — compact joint acoustic echo cancellation, noise suppression, and dereverberation for 16 kHz mono speech. Shares the same DeepVQE-style architecture (arch_version 3) as v1.3 but with narrower encoder/decoder widths, so it runs at ~9.7× realtime (~1.6 ms per 16 ms frame on a 4-thread Zen4 CPU) — about ¼ the per-hop cost of v1.3. Widens the echo-search window to 1024 ms (v1.1 used 512 ms). ~5 MB on disk. The budget-friendly choice for low-core or power-constrained devices.

Repository: localai
License: apache-2.0

localvqe-v1.3-4.8m
LocalVQE v1.3 (4.8 M parameters, F32) is the current default release. It performs joint acoustic echo cancellation, noise suppression, and dereverberation for 16 kHz mono speech, using a wider encoder/decoder trained from scratch under a noise-floor-aware loss recipe. ~4.7× realtime (~3.3 ms per 16 ms frame on a 4-thread Zen4 CPU); ~19 MB on disk. Improves double-talk speech quality (+0.25 deg MOS) and far-end echo cancellation (ERLE +5.2–9.3 dB) over v1.2; on far-end-only scenes some users may still prefer v1.2's gentler trade-off. Same 16 ms algorithmic latency as the compact models.
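The realtime factors quoted for the LocalVQE entries follow from the per-frame timings. A minimal sketch of that arithmetic, using the 16 ms frame and the approximate per-frame times from the descriptions (the function names are illustrative and not part of LocalVQE):

```python
def realtime_factor(frame_ms: float, compute_ms: float) -> float:
    """Return how many times faster than real time one frame is processed."""
    return frame_ms / compute_ms


def samples_per_frame(sample_rate_hz: int, frame_ms: float) -> int:
    """Return the number of audio samples covered by one frame."""
    return round(sample_rate_hz * frame_ms / 1000)


FRAME_MS = 16.0        # frame length quoted in the entries
SAMPLE_RATE_HZ = 16_000  # 16 kHz mono speech

# Approximate per-frame compute times quoted for a 4-thread Zen4 CPU.
timings_ms = {"v1.2": 1.6, "v1.3": 3.3}

print(f"samples per frame: {samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)}")
for version, compute_ms in timings_ms.items():
    factor = realtime_factor(FRAME_MS, compute_ms)
    print(f"{version}: {factor:.1f}x realtime")
```

Because the quoted per-frame times are themselves rounded, the computed factors land close to, but not exactly on, the ~9.7× and ~4.7× figures in the descriptions.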

Repository: localai
License: apache-2.0

parakeet-cpp-realtime_eou_120m-v1
Cache-aware streaming RNNT FastConformer with end-of-utterance (EOU) detection, 120M parameters. Intended for streaming transcription. F16 GGUF for the parakeet-cpp backend (a C++/ggml port of NVIDIA NeMo Parakeet); its transcripts are byte-identical to NeMo's (WER 0 relative to NeMo). Faster than NeMo on CPU and GPU.

Repository: localai
License: cc-by-4.0