LocalAI - Models

pocket-35b-q2

POCKET-35B is an Apache-2.0 Qwen3.5-family mixture-of-experts model from FINAL-Bench/VIDRAFT, derived from Darwin-36B-Opus and packaged for stock llama.cpp. This entry uses the smaller Q2_K GGUF quantization.

pocket-26b-q2

POCKET-26B is an Apache-2.0 Gemma 4 26B-A4B mixture-of-experts model from FINAL-Bench/VIDRAFT, tuned for Korean and packaged for stock llama.cpp. This entry uses the smaller Q2_K GGUF quantization.

bonsai-8b-1bit

Bonsai 8B (PrismML) is an end-to-end 1-bit language model built on the Qwen3-8B dense architecture (GQA, SwiGLU, RoPE, RMSNorm, 36 layers, 65,536 context). Every weight is a single sign bit (`-scale` / `+scale`) with one FP16 scale per group of 128 weights, for an effective 1.125 bits/weight and a ~1.15 GB footprint (14.2x smaller than FP16) while matching full-precision 8B instruct models at ~70.5 average across 6 benchmark categories. The Q1_0 quantization is only decodable by the PrismML llama.cpp fork, so this entry runs on LocalAI's `bonsai` backend (that fork), not the stock `llama-cpp` backend. License: Apache 2.0.
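
The 1.125 bits/weight and 14.2x figures follow from simple arithmetic. The sketch below assumes only the layout described in this entry (one sign bit per weight plus one 16-bit scale per group of 128 weights) and ignores any tensors that might be kept at higher precision. The function names are illustrative and not part of any PrismML or llama.cpp API.

```python
def bits_per_weight(payload_bits: float, scale_bits: int = 16, group_size: int = 128) -> float:
    """Effective bits per weight: per-weight payload plus the amortized group scale."""
    return payload_bits + scale_bits / group_size


def compression_vs_fp16(effective_bits: float) -> float:
    """How many times smaller the weights are than 16-bit floats."""
    return 16.0 / effective_bits


one_bit = bits_per_weight(1)                     # 1 + 16/128
print(one_bit)                                   # 1.125
print(round(compression_vs_fp16(one_bit), 1))    # 14.2
```

The scale overhead is what separates the nominal "1-bit" label from the effective 1.125 bits: a larger group size would shrink it, at the cost of coarser scaling.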

ternary-bonsai-8b

Ternary Bonsai 8B (PrismML) is a 1.58-bit ternary language model on the Qwen3-8B dense architecture. Each weight takes a value from {-1, 0, +1} with one shared FP16 scale per group of 128 weights (GGUF Q2_0, ~2.18 GB deployed, 7.5x smaller than FP16). The extra zero state recovers more of the full-precision model than the 1-bit build: it ranks 2nd among compared 6-9B models at 75.5 average despite being ~1/8th their size. Q2_0 is the recommended, ternary-lossless variant. The Q2_0 kernels are only in the PrismML llama.cpp fork, so this runs on LocalAI's `bonsai` backend. License: Apache 2.0.
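
To make the ternary format concrete, here is a hedged sketch of how one group could be decoded, and why the stored size differs from the 1.58-bit information content. It assumes 2 stored bits per weight plus one 16-bit scale per group of 128, which is one reading of the Q2_0 description above rather than a statement about the actual file layout. `dequantize_group` is a hypothetical helper, not the PrismML kernel.

```python
import math


def dequantize_group(codes: list[int], scale: float) -> list[float]:
    """Map ternary codes in {-1, 0, +1} to real weights using the group's shared scale."""
    if any(c not in (-1, 0, 1) for c in codes):
        raise ValueError("ternary codes must be -1, 0, or +1")
    return [c * scale for c in codes]


# Information content of one ternary symbol: log2(3) bits, the "1.58-bit" figure.
ideal_bits = math.log2(3)

# Storage cost under the assumed layout: 2 bits per weight + a 16-bit scale per 128 weights.
stored_bits = 2 + 16 / 128

print(round(ideal_bits, 2))                    # 1.58
print(stored_bits)                             # 2.125
print(round(16 / stored_bits, 1))              # 7.5
print(dequantize_group([1, 0, -1, 1], 0.5))    # [0.5, 0.0, -0.5, 0.5]
```

Under these assumptions the 7.5x ratio quoted above comes from the stored 2.125 bits, not from the theoretical 1.58 bits.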

bonsai-27b-1bit

Bonsai 27B (PrismML) is a full 27B-class reasoning model in end-to-end 1-bit weights, derived from the Qwen3.6-27B hybrid-attention backbone (~75% linear attention, 262K context). At a true 1.125 bits/weight it deploys in ~3.9 GB (~14.2x smaller than FP16) while retaining 89.5% of FP16 performance across 15 thinking-mode benchmarks (math 91.66, coding 81.88). It ships with an optional 4-bit vision tower (mmproj) for image input, included here. The Q1_0_g128 weights and hybrid-attention kernels are only in the PrismML llama.cpp fork, so this runs on LocalAI's `bonsai` backend. A GPU is recommended. License: Apache 2.0.

laguna-xs-2.1-apex-i-mini

Laguna XS 2.1 in the 12.8 GB APEX-I Mini format, the smallest importance-matrix APEX build for llama.cpp. License: OpenMDW 1.1.

qwen3.6-35b-a3b-dflash

Qwen3.6-35B-A3B (Mixture-of-Experts, ~3B active per token) paired with its DFlash block-diffusion drafter for speculative decoding on the llama.cpp backend. DFlash speedups on MoE targets are smaller than on dense models, but still useful. DFlash produces a whole block of draft tokens in a single forward pass and injects the target model's hidden states into the drafter's attention, which keeps the drafter tiny while making drafting GPU-friendly. The UD-Q4_K_M file carries the full Qwen3.6-35B-A3B target; the ~0.4 GB Q8_0 drafter (`draft-dflash`) accelerates generation without changing the target's outputs. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. Flash attention is required for DFlash and is enabled in this config. A GPU is recommended. License: Apache 2.0.
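
Why a drafter can speed up generation without changing the target's outputs is easiest to see with greedy decoding. The toy sketch below is not DFlash or llama.cpp code: `toy_target` and `toy_draft` are hypothetical stand-ins, tokens are plain integers, and verification is written as a loop where a real engine would score the whole drafted block in one batched pass. It only covers greedy (temperature 0) decoding; sampling needs a different acceptance rule.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]
DraftBlock = Callable[[List[Token], int], List[Token]]


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: DraftBlock,
                       prompt: List[Token], n_new: int, block: int = 4) -> List[Token]:
    """Draft a block, keep the prefix the target agrees with, then emit the target's own token."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        for tok in draft(out, block):
            expected = target(out)      # the target's greedy choice at this position
            out.append(expected)        # only the target's choice is ever emitted
            if tok != expected or len(out) >= goal:
                break                   # first disagreement discards the rest of the block
        else:
            out.append(target(out))     # whole block accepted: verification yields a bonus token
    return out[:goal]


def toy_target(prefix: List[Token]) -> Token:
    """Hypothetical deterministic 'model': next token depends on the last one."""
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[Token], k: int) -> List[Token]:
    """Imperfect drafter: follows the target's rule but is wrong at every third position."""
    ctx, proposal = list(prefix), []
    for _ in range(k):
        tok = toy_target(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 7
        proposal.append(tok)
        ctx.append(tok)
    return proposal


prompt = [2]
assert speculative_decode(toy_target, toy_draft, prompt, 12) == greedy_decode(toy_target, prompt, 12)
```

Because every emitted token is the target's own choice, a poor drafter costs speed but never changes the result; a good drafter lets several positions be confirmed per target pass.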

dark-scarlett-v0.3-26b-a4b

License: Apache 2.0 | Authors: Google DeepMind

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key **capability and architectural advancements**:

* **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.

...

Links

https://huggingface.co/ReadyArt/Dark-Scarlett-v0.3-26B-A4B-GGUF

gemma-4-e2b-it-qat-mtp

Gemma 4 E2B IT QAT (Google DeepMind) paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-E2B-it` head predicts several tokens ahead, which the target verifies in parallel, accelerating generation with no change to output quality. E2B is a MatFormer "effective 2B" elastic variant, well suited to lightweight and on-device deployments. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)

gemma-4-e4b-it-qat-mtp

Gemma 4 E4B IT QAT (Google DeepMind) paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-E4B-it` head predicts several tokens ahead, which the target verifies in parallel, accelerating generation with no change to output quality. E4B is a MatFormer "effective 4B" elastic variant, balancing quality and footprint for on-device and edge deployments. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)

gemma-4-12b-it-qat-mtp

Gemma 4 12B IT QAT (Google DeepMind) paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-12B-it` head predicts several tokens ahead, which the target verifies in parallel, accelerating generation with no change to output quality. As a dense model, Gemma 4 12B is among the sizes that benefit most from MTP, with the llama.cpp PR reporting a decode speedup of well over 1.4x. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)
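
How much a drafter helps depends on how often the target accepts its guesses. A common back-of-the-envelope model, sketched below, assumes each drafted token is accepted independently with the same probability and ignores the drafter's own cost, so it gives an optimistic ceiling rather than a prediction. The acceptance rate and draft length used here are illustrative inputs, not values measured for this model or taken from the llama.cpp PR.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently with
    probability accept_rate, and that every pass emits at least one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)     # every draft accepted, plus the bonus token
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


print(expected_tokens_per_step(0.5, 3))   # 1.875
print(expected_tokens_per_step(0.0, 3))   # 1.0
```

With no accepted drafts the result falls back to one token per pass, which is plain decoding; real speedups sit below this ceiling because drafting and verification are not free.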

gemma-4-31b-it-qat-mtp

Gemma 4 31B IT QAT (Google DeepMind), the largest dense model in the family, paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-31B-it` head predicts several tokens ahead, which the target verifies in parallel, accelerating generation with no change to output quality. Dense models such as the 31B are the ones that benefit most from MTP. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)

privacy-filter-nemotron

A fine-grained English PII token-classification model: a fine-tune of openai/privacy-filter by OpenMed on NVIDIA's Nemotron-PII dataset. It labels every token with a BIOES tag over 55 PII categories (221 classes), trading the multilingual sibling's language breadth for category depth: identity, contact, address, dates, government IDs, financial, healthcare, enterprise, vehicle and digital entities (including api_key, ipv4/ipv6 and mac_address). For multilingual text prefer privacy-filter-multilingual instead.

In LocalAI this is a PII detector for the NER redactor tier: set known_usecases to [token_classify], and any model opts into redaction by listing this one under pii.detectors. The detection policy (which categories to mask vs block, and the score threshold) lives on this model's own pii_detection block; see this entry's overrides. It runs locally with no Python, served by the standalone privacy-filter backend's TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset entity spans).

Architecture: gpt-oss-style sparse MoE (8 layers, d_model 640, 128 experts top-4, ~1.5B total / ~50M active per token), bidirectional banded attention, o200k tokenizer and a 221-way token-classification head; served via the openai-privacy-filter architecture. F16, ~2.8 GB. (A smaller Q8_0 quant exists on the GGUF repo for RAM-constrained use; validate it on your own data, since for PII a single dropped span is a leak.)
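
The 221-class figure and the idea of turning per-token BIOES tags into entity spans can be sketched in a few lines. This is a simplified illustration, not the privacy-filter backend's decoder: it assumes tag strings of the form `B-api_key` (the exact label spelling is a guess), works on token indices rather than UTF-8 byte offsets, and expects an already valid tag sequence instead of running a constrained Viterbi search.

```python
def bioes_label_count(n_categories: int) -> int:
    """B-, I-, E- and S- labels per category, plus a single O (outside) label."""
    return 4 * n_categories + 1


def bioes_to_spans(tags: list[str]) -> list[tuple[str, int, int]]:
    """Turn a valid BIOES tag sequence into (category, start, end_exclusive) token spans."""
    spans: list[tuple[str, int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        if prefix == "S":                       # single-token entity
            spans.append((category, i, i + 1))
        elif prefix == "B":                     # entity begins here
            start = i
        elif prefix == "E" and start is not None:
            spans.append((category, start, i + 1))
            start = None
        elif prefix == "O":                     # outside any entity
            start = None
    return spans


print(bioes_label_count(55))   # 221

tags = ["O", "B-api_key", "I-api_key", "E-api_key", "O", "S-ipv4"]
print(bioes_to_spans(tags))    # [('api_key', 1, 4), ('ipv4', 5, 6)]
```

The count explains the head size: 55 categories times four positional tags, plus the shared outside label.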

carnice-v2-27b

# Carnice-V2-27B for Hermes Agent

Carnice-V2-27B is a full merged BF16 SFT of `Qwen/Qwen3.6-27B` for Hermes-style agent traces. This repository contains the standalone merged model weights, not only a LoRA adapter.

## BF16 Transformers Loading Fix

The BF16 safetensors were republished with corrected `Qwen3_5ForConditionalGeneration` tensor prefixes. The original merge artifact accidentally serialized an extra Unsloth wrapper prefix, which caused direct HF Transformers loads to report the real weights as unexpected keys and initialize expected layers randomly. GGUF files were not affected because the GGUF conversion path normalized those prefixes.

## Benchmarks

The benchmark artifact bundle is included under `benchmarks/`. It contains the rendered graph, extracted `metrics.json`, benchmark scripts, and raw result files used to make the chart.

Scope note: the IFEval run is a short `limit=20` A/B smoke benchmark, not an official full leaderboard score. Held-out loss/perplexity is the exact assistant-only training-format validation metric from the SFT script. The raw BFCL two-case smoke files are included for auditability, but they are too small to use as a model-quality claim.

...

Links

https://huggingface.co/kai-os/Carnice-V2-27b-GGUF

supergemma4-26b-uncensored-v2

License: Apache 2.0 | Authors: Google DeepMind

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: **E2B**, **E4B**, **26B A4B**, and **31B**. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key **capability and architectural advancements**:

* **Reasoning** – All models in the family are designed as highly capable reasoners, with configurable thinking modes.

...

Links

https://huggingface.co/Jiunsong/supergemma4-26b-uncensored-gguf-v2

nanbeige4.1-3b-q8

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and is an enhanced iteration of the earlier reasoning model Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors.

Key features:

* Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
* Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge.
* Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

nanbeige4.1-3b-q4

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and is an enhanced iteration of the earlier reasoning model Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors.

Key features:

* Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
* Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge.
* Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

ced-base-q8

CED (Consistent Ensemble Distillation, Xiaomi) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). This is the q8_0 GGUF for the ced backend: smallest footprint (~88 MB, ~6.5x less memory than the PyTorch reference) and near-lossless (identical top-5 tags). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
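
One way to read "identical top-5 tags" is that the quantized model and the reference agree on which five labels score highest for a clip. The sketch below checks exactly that, treating the top five as an unordered set, which is an assumption about how the comparison was made. The scores are made up for illustration and are not measurements from either model; `top_k_labels` and `same_top_k` are hypothetical helpers.

```python
def top_k_labels(scores: dict[str, float], k: int = 5) -> set[str]:
    """Return the k highest-scoring labels as a set (order within the top k is ignored)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def same_top_k(reference: dict[str, float], quantized: dict[str, float], k: int = 5) -> bool:
    """True when both score tables agree on which k labels rank highest."""
    return top_k_labels(reference, k) == top_k_labels(quantized, k)


# Illustrative, invented scores for one clip.
reference = {"dog bark": 0.91, "footsteps": 0.40, "alarms": 0.22,
             "baby cry": 0.10, "speech": 0.08, "glass breaking": 0.01}
quantized = {"dog bark": 0.90, "footsteps": 0.42, "alarms": 0.20,
             "baby cry": 0.11, "speech": 0.07, "glass breaking": 0.02}

print(same_top_k(reference, quantized))   # True
```

Running a check like this over your own audio is a cheap way to confirm that a quantized build behaves like the reference for the classes you care about.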

ced-tiny-q8

CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

ced-mini-q8

CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.

ced-small-f16

CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended; fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
