Model Gallery

15 models from 1 repository

voxtral-mini-4b-realtime

Voxtral Mini 4B Realtime is a speech-to-text model from Mistral AI. It is a 4B parameter model optimized for fast, accurate audio transcription with low latency, making it ideal for real-time applications. The model uses the Voxtral architecture for efficient audio processing.

Repository: localai
License: apache-2.0

ced-base-f16

CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0
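The entry above names POST /v1/audio/classification as the way to call the model. A minimal sketch of assembling such a request follows. The base URL (http://localhost:8080), the multipart field names "file" and "model", and the helper name build_classification_request are assumptions for illustration, modeled on OpenAI-style audio endpoints; this listing does not document the request schema, so check the LocalAI API documentation before relying on them.

```python
import urllib.request
import uuid


def build_classification_request(audio_bytes, model="ced-base-f16",
                                 base_url="http://localhost:8080"):
    """Build (but do not send) a multipart POST for /v1/audio/classification.

    The field names "model" and "file" are assumed, not a documented schema.
    """
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="clip.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + closing
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/classification",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Placeholder bytes stand in for a real WAV file read from disk.
req = build_classification_request(b"RIFF-placeholder", model="ced-base-f16")
print(req.get_method(), req.full_url)
# Sending requires a running server: urllib.request.urlopen(req)
```

The request is only built here, not sent, so the sketch runs without a server. Swapping the model argument for any other ced-* entry in this gallery should be the only change needed, under the same assumptions.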

ced-base-q8

CED (Consistent Ensemble Distillation, Xiaomi) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). This is the q8_0 GGUF for the ced backend: smallest footprint (~88 MB, ~6.5x less memory than the PyTorch reference) and near-lossless (identical top-5 tags). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0

ced-tiny-f16

CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0

ced-tiny-q8

CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0

ced-mini-f16

CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0

ced-mini-q8

CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0

ced-small-f16

CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0

ced-small-q8

CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

Repository: localai
License: apache-2.0
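The eight CED entries above follow a regular naming pattern: a size (tiny, mini, small, base) and a quantization (f16, described as fastest on CPU, or q8, described as the smallest footprint). A small sketch of picking an entry name from those two choices is shown below. The helper name ced_model_name and its prefer_small_footprint flag are hypothetical and exist only for this illustration; they are not part of LocalAI.

```python
# Sizes and quantizations as they appear in the gallery entry names.
CED_SIZES = ("tiny", "mini", "small", "base")


def ced_model_name(size, prefer_small_footprint=False):
    """Return the gallery entry name for a CED size and quantization choice.

    f16 is the listing's recommended default (fastest on CPU);
    q8 is listed as the smallest footprint.
    """
    if size not in CED_SIZES:
        raise ValueError(f"unknown CED size: {size!r}")
    quant = "q8" if prefer_small_footprint else "f16"
    return f"ced-{size}-{quant}"


print(ced_model_name("tiny"))                                 # ced-tiny-f16
print(ced_model_name("small", prefer_small_footprint=True))   # ced-small-q8
```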

vibevoice-cpp

VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24 kHz mono TTS with a selectable precomputed voice prompt. Default voice prompt: en-Carter_man. This realtime variant does not accept raw Voice Library reference WAVs.

Repository: localai
License: mit

lfm2.5-audio-1.5b-realtime

LFM2.5-Audio-1.5B is LiquidAI's any-to-any audio foundation model. The 1.2B LFM2.5 backbone plus a FastConformer audio encoder and an LFM2-based audio detokenizer give real-time speech-to-speech with text + audio output interleaved at 12.5 Hz / 24 kHz. This entry runs in S2S (speech-to-speech) mode and is the model the LocalAI realtime API any-to-any path consumes. Switch to ASR, TTS, or chat by picking the sibling gallery entries.

Repository: localai
License: LFM-Open-License-v1.0
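The "12.5 Hz / 24 kHz" figure in the entry above can be unpacked with a little arithmetic. The sketch below assumes that 12.5 Hz is the rate of audio token frames and 24 kHz is the sample rate of the output waveform; the listing does not spell this out, so treat the interpretation as an assumption.

```python
FRAME_RATE_HZ = 12.5       # assumed: audio token frames per second
SAMPLE_RATE_HZ = 24_000    # assumed: output waveform samples per second

# Duration covered by one frame, in milliseconds.
frame_ms = 1000.0 / FRAME_RATE_HZ
# Waveform samples covered by one frame.
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)

print(frame_ms)            # 80.0
print(samples_per_frame)   # 1920
```

Under that reading, each frame covers 80 ms of audio, or 1920 output samples.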

localvqe-v1.1-1.3m

LocalVQE v1.1 (1.3 M parameters, F32) — joint acoustic echo cancellation, noise suppression, and dereverberation for 16 kHz mono speech. DeepVQE-style architecture with an S4D bottleneck and an in-graph DCT-II filterbank. ~9.6× realtime on a desktop CPU; 16 ms algorithmic latency. ~5 MB on disk. v1.1 ships the v16 echoaware checkpoint with improved double-talk and near-end single-talk AECMOS scores.

Repository: localai
License: apache-2.0

localvqe-v1.2-1.3m

LocalVQE v1.2 (1.3 M parameters, F32) — compact joint acoustic echo cancellation, noise suppression, and dereverberation for 16 kHz mono speech. Shares the same DeepVQE-style architecture (arch_version 3) as v1.3 but with narrower encoder/decoder widths, so it runs at ~9.7× realtime (~1.6 ms per 16 ms frame on a 4-thread Zen4 CPU) — about ¼ the per-hop cost of v1.3. Widens the echo-search window to 1024 ms (v1.1 used 512 ms). ~5 MB on disk. The budget-friendly choice for low-core or power-constrained devices.

Repository: localai
License: apache-2.0

localvqe-v1.3-4.8m

LocalVQE v1.3 (4.8 M parameters, F32) — current default release. Joint acoustic echo cancellation, noise suppression, and dereverberation for 16 kHz mono speech, with a wider encoder/decoder trained from scratch under a noise-floor-aware loss recipe. ~4.7× realtime (~3.3 ms per 16 ms frame on a 4-thread Zen4 CPU); ~19 MB on disk. Improves doubletalk speech quality (+0.25 deg MOS) and far-end echo cancellation (ERLE +5.2–9.3 dB) over v1.2; on far-end-only scenes some users may still prefer v1.2's gentler trade-off. Same 16 ms algorithmic latency as the compact models.

Repository: localai
License: apache-2.0
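The realtime figures quoted for LocalVQE v1.2 and v1.3 relate frame duration to per-frame compute time. The sketch below recomputes them from the numbers in the listing (16 kHz audio, 16 ms frames, ~1.6 ms and ~3.3 ms per frame). The function names are made up for this illustration, and because the per-frame times are rounded in the listing, the results only approximate the stated ~9.7× and ~4.7×.

```python
SAMPLE_RATE_HZ = 16_000   # from the listing: 16 kHz mono speech
FRAME_MS = 16.0           # from the listing: 16 ms frame


def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of audio samples covered by one frame."""
    return int(round(sample_rate_hz * frame_ms / 1000.0))


def realtime_factor(frame_ms, compute_ms):
    """How many times faster than real time a frame is processed."""
    return frame_ms / compute_ms


print(samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS))   # 256
print(round(realtime_factor(FRAME_MS, 1.6), 2))      # v1.2: 10.0
print(round(realtime_factor(FRAME_MS, 3.3), 2))      # v1.3: 4.85
```

A factor above 1 means the model finishes each frame before the next one arrives, which is the condition for keeping up with a live stream on the quoted hardware.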

parakeet-cpp-realtime_eou_120m-v1

Cache-aware streaming RNNT FastConformer with end-of-utterance (EOU) detection, 120M parameters. Intended for streaming transcription. F16 GGUF for the parakeet-cpp backend (C++/ggml port of NVIDIA NeMo Parakeet); transcripts are byte-identical to NeMo's (WER 0). Faster than NeMo on CPU and GPU.

Repository: localai
License: cc-by-4.0