LocalAI - Models

laguna-xs-2.1

Laguna XS 2.1 is Poolside's 33B-parameter, 3B-active Mixture-of-Experts model for agentic coding and long-horizon work on local machines. It supports tool use, interleaved reasoning, and a native 262K-token context window. This default entry uses the official 20.3 GB Q4_K_M GGUF. License: OpenMDW 1.1.
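
As a rough sanity check on the figures above, the average number of bits stored per parameter can be estimated from the file size and the total parameter count. This is a back-of-the-envelope sketch: it assumes "GB" means 10^9 bytes and ignores GGUF metadata, and the function name is purely illustrative.

```python
def bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Approximate average bits stored per parameter.

    Assumes 1 GB = 1e9 bytes and ignores metadata and tokenizer overhead.
    """
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (params_billion * 1e9)


if __name__ == "__main__":
    # Figures quoted in the entry above: 20.3 GB file, 33B total parameters.
    print(f"{bits_per_weight(20.3, 33):.2f} bits per weight")
```

The estimate covers all 33B parameters, not only the 3B that are active per token, because every expert has to be stored on disk.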

laguna-s-2.1

Laguna S 2.1 is Poolside's 118B-parameter, 8B-active Mixture-of-Experts model for agentic software engineering. It supports tool use and a native one-million-token context window; the official GGUF recommends 256K context for best output quality. This default entry uses the current 96 GB Q4_K_M artifact, with imatrix-quantized routed experts and a Q8_0 signal path. License: OpenMDW 1.1.

secret-filter

A pattern-based PII detector for high-entropy, highly regular secrets (API keys, tokens, and private-key blocks) that the NER tier cannot catch: the NER tier has no credential class, so it fragments a key and may leave the secret part exposed. Detection uses a bounded, restricted regex subset compiled to RE2 (linear time, no backtracking) and runs entirely in-process, with no model download, no backend, and zero VRAM. Install it, then reference it under another model's pii.detectors (or set it as the instance-wide default detector on the Middleware page) to block leaks of known credential formats out of the box. Add your own patterns under pii_detection.patterns in the restricted regex subset (e.g. "tok-\\w{32,}"); each must carry a fixed literal anchor of at least 3 characters, so open-ended shapes such as email addresses are rejected and left to the NER tier.
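
The literal-anchor rule can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, the anchor check is a conservative heuristic rather than LocalAI's actual validator, and Python's `re` module (a backtracking engine, not RE2) stands in for the matcher. The sketch also assumes that the doubled backslash in the quoted example is string escaping, so the underlying regex is `tok-\w{32,}`.

```python
import re

QUANTIFIERS = "*+?{"
OTHER_META = ".^$()|[]}"


def longest_literal_run(pattern: str) -> str:
    """Return the longest run of unquantified literal characters in a regex.

    Conservative heuristic: escapes, character classes, and metacharacters
    end the current run. It does not analyze alternation or optional groups.
    """
    best = current = ""
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            step = 2  # skip an escape such as \w or \.
        elif ch in "[{":
            closer = "]" if ch == "[" else "}"
            end = pattern.find(closer, i + 1)
            step = end - i + 1 if end != -1 else len(pattern) - i
        else:
            step = 1
        if step == 1 and ch not in QUANTIFIERS and ch not in OTHER_META:
            current += ch
        else:
            if ch in QUANTIFIERS:
                current = current[:-1]  # a quantified literal is not fixed
            best = max(best, current, key=len)
            current = ""
        i += step
    return max(best, current, key=len)


def has_literal_anchor(pattern: str, min_length: int = 3) -> bool:
    """Check the 'fixed literal anchor of at least 3 characters' rule."""
    return len(longest_literal_run(pattern)) >= min_length


TOKEN_PATTERN = r"tok-\w{32,}"          # the example pattern from the text
EMAIL_PATTERN = r"[\w.]+@[\w.]+\.\w+"   # hypothetical open-ended shape


def find_secrets(text: str, pattern: str = TOKEN_PATTERN) -> list:
    """Return every substring of text that matches the pattern."""
    return [m.group(0) for m in re.finditer(pattern, text)]
```

Under this heuristic the token pattern passes because of its `tok-` prefix, while the email shape contributes only the single literal `@` and fails the check.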

qwen3.5-27b-claude-4.6-opus-reasoning-distilled-heretic-i1

Links

https://huggingface.co/mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-heretic-i1-GGUF

ced-base-f16

CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

lfm2.5-audio-1.5b-tts

LFM2.5-Audio-1.5B in TTS mode. It ships with four built-in voices: us_male, us_female, uk_male, and uk_female. Pick the default at load time via the `voice:` option, or override it per request via the `voice` field of the OpenAI-compatible `/v1/audio/speech` endpoint.
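
A per-request voice override might look like the sketch below, which builds a request in the shape of the OpenAI speech API (`model`, `input`, `voice`). The base URL, the assumption that the installed model is addressed by its gallery name, and the helper names are all illustrative and should be checked against your own instance.

```python
import json
import urllib.request

VOICES = ("us_male", "us_female", "uk_male", "uk_female")


def build_speech_request(
    text: str,
    voice: str,
    model: str = "lfm2.5-audio-1.5b-tts",     # assumed: gallery name as model id
    base_url: str = "http://localhost:8080",  # assumed local instance address
) -> urllib.request.Request:
    """Build a POST request for /v1/audio/speech without sending it."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_speech_request(req: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

Calling `send_speech_request(build_speech_request("Hello", "uk_female"), "hello.audio")` requires a running server; the audio container format of the response is not specified here.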

Links

https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B

allenai_olmo-3.1-32b-think

**Olmo-3.1-32B-Think** is a large language model (LLM). This entry provides quantized GGUF builds of the original **allenai/Olmo-3.1-32B-Think** model, produced by **bartowski** using **imatrix** quantization.

### Key Features:

- **Base Model**: `allenai/Olmo-3.1-32B-Think` (unquantized version).
- **Quantized Versions**: Available in multiple formats and precisions (e.g., `Q8_0`, `Q6_K_L`, `Q5_K_M`, `Q4_1`, `bf16`), derived from the original model using the **imatrix calibration dataset**.
- **Performance**: Optimized for low memory usage and efficient inference on GPUs and CPUs.
- **Downloads**: Available via the Hugging Face CLI. Large models are split into multiple files if needed.
- **License**: Apache-2.0.

### Recommended Quantization:

- Use `Q6_K_L` for the highest quality (near-perfect).
- Use `Q4_K_M` (the default) for a balance of quality and size.
- Avoid lower-quality options (e.g., `Q3_K_S`) unless specific hardware constraints apply.

This model suits deployment on GPUs or CPUs with limited memory.

Links

https://huggingface.co/bartowski/allenai_Olmo-3.1-32B-Think-GGUF

qwen3-vl-8b-instruct

Qwen3-VL-8B-Instruct is the 8B-parameter instruction-tuned model of the Qwen3-VL series. This entry uses the default parameters recommended in the Unsloth documentation for Qwen3-VL.

Links

https://huggingface.co/unsloth/Qwen3-VL-8B-Instruct-GGUF

qwen3-vl-8b-thinking

Qwen3-VL-8B-Thinking is the 8B-parameter thinking (reasoning) model of the Qwen3-VL series. This entry uses the default parameters recommended in the Unsloth documentation for Qwen3-VL.

Links

https://huggingface.co/unsloth/Qwen3-VL-8B-Thinking-GGUF

rfdetr-cpp-nano

RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (pure C++/ggml runtime + purego, no Python dependencies). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20 MB on disk, and the fastest CPU latency. Drop-in for the /v1/detection endpoint.
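
A call to the /v1/detection endpoint could be prepared as in the following sketch. The JSON field names (`model`, `image`), the base64 encoding of the image, the base URL, and the use of the gallery name as the model id are assumptions for illustration only; consult the LocalAI API reference for the actual request and response schema.

```python
import base64
import json
import urllib.request


def build_detection_request(
    image_bytes: bytes,
    model: str = "rfdetr-cpp-nano",           # assumed: gallery name as model id
    base_url: str = "http://localhost:8080",  # assumed local instance address
) -> urllib.request.Request:
    """Build a POST request for /v1/detection without sending it."""
    payload = {
        "model": model,
        # Assumed field name and encoding; verify against the API reference.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        f"{base_url}/v1/detection",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request is only constructed here; passing it to `urllib.request.urlopen` would send it to a running server.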

locate-anything-3b

NVIDIA LocateAnything-3B open-vocabulary object detection (visual grounding), served via the native locate-anything.cpp backend (C++/ggml + purego, no Python). Describe what to find in a text prompt and get labeled boxes back; separate multiple categories with . Q8_0 is the recommended default: box-identical to F16/F32, ~6.3GB, fastest CPU latency. Drop-in for the /v1/detection endpoint (pass the prompt).

depth-anything-3-base

Depth Anything 3 (base) monocular metric depth + camera pose, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense depth map plus the recovered camera extrinsics (3x4) and intrinsics (3x3). Use GenerateImage (src -> normalized depth PNG at dst) or Predict (JSON depth stats + pose). q4_k is the recommended CPU default.
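
To show how a metric depth value and the 3x3 intrinsics fit together, the sketch below back-projects one pixel into a 3D point in the camera frame. It assumes a standard pinhole model with zero skew, intrinsics laid out as [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], and depth measured along the optical axis; none of this is confirmed by the entry itself. The 3x4 extrinsics are deliberately left out, because their direction convention (world-to-camera or camera-to-world) is not stated above.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def backproject(u: float, v: float, depth: float, K: Matrix3) -> Tuple[float, float, float]:
    """Map pixel (u, v) with metric depth to a camera-frame point (x, y, z).

    Assumes a pinhole camera with zero skew and depth along the z axis.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 image, used only for illustration.
EXAMPLE_K = [
    [500.0, 0.0, 320.0],
    [0.0, 500.0, 240.0],
    [0.0, 0.0, 1.0],
]
```

With these example intrinsics, the principal point (320, 240) at depth 2.0 maps to a point on the optical axis, and a pixel 100 columns to its right lands 0.4 units along x.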

depth-anything-3-base-q8_0

Depth Anything 3 (base), q8_0 — near-lossless 8-bit quant (~149 MB). Same depth + camera pose output as the q4_k default at higher fidelity.

depth-anything-2-base

Depth Anything V2 (base / ViT-B) monocular depth, served via the native depth-anything.cpp backend (C++/ggml + purego, no Python at inference). Given an image it returns a dense monocular depth map only — no camera pose, no confidence. This is the relative variant (relative inverse depth). Use GenerateImage (src -> normalized depth PNG at dst) or the Depth endpoint. q4_k is the recommended CPU default.
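
Because the output is relative rather than metric, a depth map only becomes viewable after it is rescaled. The sketch below uses min-max scaling to 8-bit values as one plausible reading of "normalized depth PNG"; the backend's actual normalization is not documented here and may differ, and the function name is hypothetical.

```python
from typing import List, Sequence


def normalize_to_uint8(depth_rows: Sequence[Sequence[float]]) -> List[List[int]]:
    """Min-max scale a 2D relative depth map to integers in 0..255."""
    flat = [value for row in depth_rows for value in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no relative information; return all zeros.
        return [[0 for _ in row] for row in depth_rows]
    return [[round((value - lo) / span * 255) for value in row] for row in depth_rows]
```

Only the ordering and relative spacing of values survive this scaling, which matches the nature of a relative inverse-depth output.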

depth-anything-2-base-q8_0

Depth Anything V2 (base / ViT-B), q8_0 — near-lossless 8-bit quant. Same relative monocular depth output as the q4_k default at higher fidelity. Use GenerateImage (src -> depth PNG) or the Depth endpoint.

rfdetr-cpp-small

RF-DETR Small object detection model (DINOv2-small backbone, 512px input, 3 decoder layers), served via the native rfdetr.cpp backend (ggml + purego, no Python). A step up from Nano in accuracy while staying lightweight on CPU. F16 quantization is the recommended default: identical accuracy to F32 at roughly half the size. Drop-in for the /v1/detection endpoint.

rfdetr-cpp-medium

RF-DETR Medium object detection model (DINOv2-small backbone, 576px input, 4 decoder layers), served via the native rfdetr.cpp backend. Balanced detection quality vs. CPU latency — recommended when Base is not accurate enough but Large is too slow. F16 quantization is the recommended default: identical accuracy to F32, half the size. Drop-in for the /v1/detection endpoint.

rfdetr-cpp-large

RF-DETR Large object detection model (DINOv2-small backbone, 704px input, 4 decoder layers), served via the native rfdetr.cpp backend. Highest-accuracy detection variant — best for offline workflows and high-resolution inputs where CPU latency is secondary to recall. F16 quantization is the recommended default: identical accuracy to F32, half the size. Drop-in for the /v1/detection endpoint.

rfdetr-cpp-seg-nano

RF-DETR Seg-Nano instance segmentation model (DINOv2-small backbone, 312px input, 4 decoder layers, 100 queries), served via the native rfdetr.cpp backend. Smallest segmentation variant — fastest CPU latency, ideal for edge deployment. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

rfdetr-cpp-seg-small

RF-DETR Seg-Small instance segmentation model (DINOv2-small backbone, 384px input, 4 decoder layers, 100 queries), served via the native rfdetr.cpp backend. Step up from Seg-Nano in mask quality while staying CPU-friendly. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

rfdetr-cpp-seg-medium

RF-DETR Seg-Medium instance segmentation model (DINOv2-small backbone, 432px input, 5 decoder layers, 200 queries), served via the native rfdetr.cpp backend. Balanced segmentation quality vs. CPU latency — recommended for everyday segmentation workloads. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default.
