Model Gallery

36 models from 1 repository

vibevoice-cpp
VibeVoice Realtime 0.5B (C++ / GGML, Q8_0) - native C++ port of Microsoft VibeVoice via the vibevoice-cpp backend. 24kHz mono TTS with voice cloning from a single reference voice prompt. Default voice prompt: en-Carter_man.

Repository: localai | License: mit

vibevoice-cpp-asr
VibeVoice ASR 7B (C++ / GGML, Q4_K) - long-form speech-to-text with speaker diarization. Returns per-speaker JSON segments with start/end timestamps. English-only. ~10 GB download.

Repository: localai | License: mit
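The per-speaker segments described for vibevoice-cpp-asr lend themselves to simple post-processing. The sketch below totals talk time per speaker. The field names `speaker`, `start`, `end`, and `text` are assumptions for illustration, since this page does not give the response schema; adjust them to whatever the backend actually returns.

```python
from collections import defaultdict


def talk_time_per_speaker(segments):
    """Sum segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        # Assumed keys: "speaker", "start", "end" (hypothetical schema).
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Hand-written sample data, not real model output.
sample_segments = [
    {"speaker": "speaker_0", "start": 0.0, "end": 4.5, "text": "Hello."},
    {"speaker": "speaker_1", "start": 4.5, "end": 6.0, "text": "Hi."},
    {"speaker": "speaker_0", "start": 6.0, "end": 10.0, "text": "Shall we begin?"},
]

totals = talk_time_per_speaker(sample_segments)
print(totals)
```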

rfdetr-cpp-nano
RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (ggml + purego, no Python dependencies). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20 MB on disk, and the lowest CPU latency. Drop-in for the /v1/detection endpoint.

Repository: localai | License: apache-2.0
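A minimal sketch of how a client might address the /v1/detection endpoint with this model. Only the endpoint path and the model name come from the entry above. The payload fields `model` and `image`, the base URL, and the image URL are assumptions, so check them against the server's API documentation. The request is built but not sent.

```python
import json
import urllib.request


def build_detection_request(base_url, model, image):
    """Build (but do not send) a POST request for /v1/detection."""
    # Assumed payload shape: a model name plus an image reference.
    payload = {"model": model, "image": image}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/detection",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_detection_request(
    "http://localhost:8080",           # hypothetical server address
    "rfdetr-cpp-nano",                 # model name from the gallery entry
    "https://example.com/street.jpg",  # placeholder image URL
)
print(req.get_method(), req.full_url)
```

To send it against a running server, pass `req` to `urllib.request.urlopen` and decode the JSON body of the response.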

rfdetr-cpp-small
RF-DETR Small object detection model (DINOv2-small backbone, 512px input, 3 decoder layers), served via the native rfdetr.cpp backend (ggml + purego, no Python). A step up from Nano in accuracy while staying lightweight on CPU. F16 quantization is the recommended default: identical accuracy to F32 at roughly half the size. Drop-in for the /v1/detection endpoint.

Repository: localai | License: apache-2.0

wan-2.1-t2v-1.3b-ggml
Wan 2.1 T2V 1.3B — text-to-video diffusion model, GGUF-quantized for the stable-diffusion.cpp backend. Generates short (33-frame) 832x480 clips from a text prompt. Cheapest Wan variant, suitable for CPU-offloaded inference with ~10 GB of usable RAM.

Repository: localai | License: apache-2.0

wan-2.1-i2v-14b-480p-ggml
Wan 2.1 I2V 14B 480P — image-to-video diffusion, GGUF Q4 quantization. Animates a reference image into a 33-frame 480p clip. Requires more RAM than the 1.3B T2V variant; CPU offload enabled by default.

Repository: localai | License: apache-2.0

wan-2.1-flf2v-14b-720p-ggml
Wan 2.1 FLF2V 14B 720P - first-last-frame-to-video diffusion, GGUF Q4_K_M. Takes a start and an end reference image and interpolates a 33-frame clip between them. Unlike the plain I2V variant, this model also feeds the end frame through clip_vision, so it conditions semantically (not just in pixel space) on both endpoints. That makes it the right choice for seamless loops (start_image == end_image) and clean narrative cuts. Native 720p but accepts 480p resolutions; shares the same VAE, umt5_xxl text encoder, and clip_vision_h as I2V 14B.

Repository: localai | License: apache-2.0

wan-2.1-i2v-14b-720p-ggml
Wan 2.1 I2V 14B 720P — image-to-video diffusion, GGUF Q4_K_M. Native 720p sibling of the 480p I2V model: animates a single reference image into a 33-frame clip at up to 1280x720. Trained purely as image-to-video (no first-last-frame interpolation path), so motion is freer and better-suited to single-anchor animation than repurposing the FLF2V 720P variant for i2v. Shares the same VAE, umt5_xxl text encoder, and clip_vision_h as the I2V 14B 480P and FLF2V 14B 720P entries.

Repository: localai | License: apache-2.0

sd-1.5-ggml
Stable Diffusion 1.5

Repository: localai | License: creativeml-openrail-m

sd-3.5-medium-ggml
Stable Diffusion 3.5 Medium is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.

Repository: localai | License: stabilityai-ai-community

sd-3.5-large-ggml
Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer (MMDiT) text-to-image model that features improved performance in image quality, typography, complex prompt understanding, and resource-efficiency.

Repository: localai | License: stabilityai-ai-community

flux.1-dev-ggml
FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. Key features: cutting-edge output quality, second only to our state-of-the-art model FLUX.1 [pro]; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license. This model is quantized with GGUF.

Repository: localai | License: flux-1-dev-non-commercial-license

flux.1-dev-ggml-q8_0
FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. Key features: cutting-edge output quality, second only to our state-of-the-art model FLUX.1 [pro]; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes as described in the flux-1-dev-non-commercial-license.

Repository: localai | License: flux-1-dev-non-commercial-license
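As a rough guide to what the q8_0 suffix means for a 12 billion parameter model, the sketch below estimates weight storage. It assumes the standard ggml Q8_0 layout (blocks of 32 8-bit weights plus one 16-bit scale, so 8.5 bits per weight) and that every weight is stored in that format. Real GGUF files usually keep some tensors at higher precision, and the text encoders and VAE are separate, so treat the numbers as an order-of-magnitude estimate rather than the actual download size.

```python
# Bits per weight for a few storage formats.
# Q8_0: 32 int8 weights + one fp16 scale per block = (32*8 + 16) / 32 = 8.5.
BITS_PER_WEIGHT = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5}


def estimated_size_gb(num_params, fmt):
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for a uniform format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


flux_params = 12_000_000_000  # "12 billion parameter" from the description

for fmt in ("F32", "F16", "Q8_0"):
    print(f"{fmt}: ~{estimated_size_gb(flux_params, fmt):.2f} GB")
```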

flux.1-dev-ggml-abliterated-v2-q8_0
An abliterated version (v2) of FLUX.1 [dev], quantized to Q8_0.

Repository: localai | License: flux-1-dev-non-commercial-license

flux.1-krea-dev-ggml
FLUX.1 Krea [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. Key features: cutting-edge output quality, with a focus on aesthetic photography; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 Krea [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes, as described in the flux-1-dev-non-commercial-license.

Repository: localai | License: flux-1-dev-non-commercial-license

flux.1-krea-dev-ggml-q8_0
FLUX.1 Krea [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. Key features: cutting-edge output quality, with a focus on aesthetic photography; competitive prompt following, matching the performance of closed-source alternatives; trained using guidance distillation, making FLUX.1 Krea [dev] more efficient; open weights to drive new scientific research and empower artists to develop innovative workflows. Generated outputs can be used for personal, scientific, and commercial purposes, as described in the flux-1-dev-non-commercial-license.

Repository: localai | License: flux-1-dev-non-commercial-license

whisper-1
Port of OpenAI's Whisper model in C/C++

Repository: localai | License: mit

whisper-base
Port of OpenAI's Whisper model in C/C++

Repository: localai | License: mit

whisper-tiny
Port of OpenAI's Whisper model in C/C++

Repository: localai | License: mit

silero-vad-ggml
Silero VAD - pre-trained enterprise-grade Voice Activity Detector.

Repository: localai

ltx-2.3-22b-dev-ggml
LTX-2.3 22B dev - DiT-based audio-video foundation model from Lightricks, GGUF-quantized for the stable-diffusion.cpp backend. Generates synchronized video and audio from a text prompt (T2V), a reference image (I2V), or first/last frame pairs (FLF2V). Uses gemma-3-12b-it as the text encoder and ships dedicated video and audio VAEs plus an embeddings_connectors safetensors that bridges the LLM hidden states to the diffusion model. This entry uses the dynamic (UD) Q4_K_M quantization of the 22B model (~16 GB) paired with the UD-Q4_K_XL QAT Gemma encoder (~7.4 GB). Recommended generation: width=1280, height=720, video_frames=33, fps=24, sampler=euler, cfg_scale=6.0.

Repository: localai | License: ltx-2-community-license-agreement
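The recommended generation settings for ltx-2.3-22b-dev-ggml can be collected in one place, and the clip length they imply follows from simple arithmetic. The sketch below is illustrative: the dictionary keys mirror the parameter names in the description, but how they are passed to the backend (request fields or model configuration) is not specified here and is left as an assumption.

```python
# Recommended settings as listed in the gallery description.
recommended = {
    "width": 1280,
    "height": 720,
    "video_frames": 33,
    "fps": 24,
    "sampler": "euler",
    "cfg_scale": 6.0,
}


def clip_duration_seconds(video_frames, fps):
    """Playback length of a clip with the given frame count and frame rate."""
    return video_frames / fps


duration = clip_duration_seconds(recommended["video_frames"], recommended["fps"])
pixels_per_frame = recommended["width"] * recommended["height"]

print(f"duration: {duration} s, pixels per frame: {pixels_per_frame}")
```

With these values a generated clip runs for 33 / 24 = 1.375 seconds at 1280x720.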
