LocalAI - Models

pocket-35b

POCKET-35B is an Apache-2.0 Qwen3.5-family mixture-of-experts model from FINAL-Bench/VIDRAFT, derived from Darwin-36B-Opus and packaged for stock llama.cpp. This entry uses the quality-oriented Q4_K_M GGUF quantization.

Links

Tags

pocket-35b-q2

POCKET-35B is an Apache-2.0 Qwen3.5-family mixture-of-experts model from FINAL-Bench/VIDRAFT, derived from Darwin-36B-Opus and packaged for stock llama.cpp. This entry uses the smaller Q2_K GGUF quantization.

Links

Tags

pocket-35b-iq1

POCKET-35B is an Apache-2.0 Qwen3.5-family mixture-of-experts model from FINAL-Bench/VIDRAFT, derived from Darwin-36B-Opus and packaged for stock llama.cpp. This entry uses the most compact IQ1_M GGUF quantization.

Links

Tags

pocket-26b

POCKET-26B is an Apache-2.0 Gemma 4 26B-A4B mixture-of-experts model from FINAL-Bench/VIDRAFT, tuned for Korean and packaged for stock llama.cpp. This entry uses the quality-oriented Q4_K_M GGUF quantization.

Links

Tags

pocket-26b-q2

POCKET-26B is an Apache-2.0 Gemma 4 26B-A4B mixture-of-experts model from FINAL-Bench/VIDRAFT, tuned for Korean and packaged for stock llama.cpp. This entry uses the smaller Q2_K GGUF quantization.

Links

Tags

minicpm5-1b-claude-opus-fable5-v2-thinking

# MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking GGUF quantizations for local deployment: **MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking-GGUF** 中文说明 **MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking** is a compact 1B **Thinking** language model built on openbmb/MiniCPM5-1B. Compared with V1, this V2 release is further fine-tuned on **Fable 5** data with a stronger focus on **tool calling / function calling**, while also improving **coding** and **instruction-following**. It keeps MiniCPM5's native Thinking chat template and XML tool-call format. Previous version: **MiniCPM5-1B-Claude-Opus-Fable5-Thinking** (V1) For llama.cpp / Ollama / LM Studio deployment, see the **GGUF repository**. ## Overview ## Capabilities - **Tool calling (enhanced in V2)** — more reliable XML / function-calling style tool use on top of MiniCPM5's native format - **Coding** — code generation, debugging, and software-engineering-style tasks - **Instruction following** — more reliable adherence to user prompts and structured constraints - **Thinking mode** — chain-of-thought reasoning via the MiniCPM5 chat template - **Long context** — up to **128K tokens** (131,072 tokens per `config.json`) ...

Links

https://huggingface.co/GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking-GGUF

Tags

bonsai-8b-1bit

Bonsai 8B (PrismML) is an end-to-end 1-bit language model built on the Qwen3-8B dense architecture (GQA, SwiGLU, RoPE, RMSNorm, 36 layers, 65,536 context). Every weight is a single sign bit (`-scale` / `+scale`) with one FP16 scale per group of 128 weights, for an effective 1.125 bits/weight and a ~1.15 GB footprint (14.2x smaller than FP16) while matching full-precision 8B instruct models at ~70.5 average across 6 benchmark categories. The Q1_0 quantization is only decodable by the PrismML llama.cpp fork, so this entry runs on LocalAI's `bonsai` backend (that fork), not the stock `llama-cpp` backend. License: Apache 2.0.

Links

Tags

ternary-bonsai-8b

Ternary Bonsai 8B (PrismML) is a 1.58-bit ternary language model on the Qwen3-8B dense architecture. Each weight takes a value from {-1, 0, +1} with one shared FP16 scale per group of 128 weights (GGUF Q2_0, ~2.18 GB deployed, 7.5x smaller than FP16). The extra zero state recovers more of the full-precision model than the 1-bit build: it ranks 2nd among compared 6-9B models at 75.5 average despite being ~1/8th their size. Q2_0 is the recommended, ternary-lossless variant. The Q2_0 kernels are only in the PrismML llama.cpp fork, so this runs on LocalAI's `bonsai` backend. License: Apache 2.0.

Links

Tags

ternary-bonsai-8b-q2-g64

Ternary Bonsai 8B (PrismML), GGUF Q2_0 with group-64 packing (each FP16 scale shared across 64 weights instead of 128). Slightly larger (~2.31 GB) but matches llama.cpp's native 64-value Q2_0 block layout. Runs on LocalAI's `bonsai` backend. License: Apache 2.0.

Links

Tags

bonsai-27b-1bit

Bonsai 27B (PrismML) is a full 27B-class reasoning model in end-to-end 1-bit weights, derived from the Qwen3.6-27B hybrid-attention backbone (~75% linear attention, 262K context). At a true 1.125 bits/weight it deploys in ~3.9 GB (~14.2x smaller than FP16) while retaining 89.5% of FP16 intelligence across 15 thinking-mode benchmarks (math 91.66, coding 81.88). Ships an optional 4-bit vision tower (mmproj) for image input, included here. The Q1_0_g128 weights and hybrid-attention kernels are only in the PrismML llama.cpp fork, so this runs on LocalAI's `bonsai` backend. A GPU is recommended. License: Apache 2.0.

Links

Tags

ternary-bonsai-27b

Ternary Bonsai 27B (PrismML) is the quality-oriented operating point of the Bonsai 27B family: full 27B-class reasoning in ternary {-1, 0, +1} weights on the Qwen3.6-27B hybrid-attention backbone (262K context). At a true 1.71 bits/weight it deploys in ~7.2 GB (GGUF Q2_0_g128) and retains 95% of FP16 intelligence (80.49 average across 15 thinking-mode benchmarks) - a higher score than a conventional IQ2_XXS build at less than two-thirds its footprint. Ships an optional 4-bit vision tower (mmproj), included. The Q2_0 weights and hybrid-attention kernels are only in the PrismML llama.cpp fork, so this runs on LocalAI's `bonsai` backend. A GPU is recommended. License: Apache 2.0.

Links

Tags

ternary-bonsai-27b-q2-g64

Ternary Bonsai 27B (PrismML), GGUF Q2_0 with group-64 packing (~7.59 GB), matching llama.cpp's native 64-value Q2_0 block layout, with the 4-bit vision tower (mmproj) included. Runs on LocalAI's `bonsai` backend. License: Apache 2.0.

Links

Tags

minicpm5-1b-claude-opus-fable5-thinking

# MiniCPM5-1B-Claude-Opus-Fable5-Thinking GGUF quantizations for local deployment: **MiniCPM5-1B-Claude-Opus-Fable5-Thinking-GGUF** 中文说明 **MiniCPM5-1B-Claude-Opus-Fable5-Thinking** is a compact 1B **Thinking** language model built on openbmb/MiniCPM5-1B. It is further fine-tuned on **Fable 5** data to improve **coding** and **instruction-following** while keeping MiniCPM5's native Thinking chat template and tool-call format. For llama.cpp / Ollama / LM Studio deployment, see the **GGUF repository**. ## Overview ## Capabilities - **Coding** — code generation, debugging, and software-engineering-style tasks - **Instruction following** — more reliable adherence to user prompts and structured constraints - **Thinking mode** — chain-of-thought reasoning via the MiniCPM5 chat template - **Tool calling** — inherits MiniCPM5's XML tool-call format - **Long context** — up to **128K tokens** (131,072 tokens per `config.json`) ## Quick start ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch model_id = "GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-Thinking" ...

Links

https://huggingface.co/GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-Thinking-GGUF

Tags

gemmable-4-12b-mtp

## Gemmable 4 12B Gemmable 4 12B is a GGUF export of Gemma 4 12B fine-tuned on Fable-5 style reasoning and assistant traces. ## Highlights - Base model: `google/gemma-4-12B` - Format: GGUF - Training style: Fable-5 style reasoning and assistant traces - Distribution: fp16 GGUF plus matching assistant GGUFs for each quant - Intended use: local inference, coding, reasoning, and assistant workflows ## How to use ### llama.cpp Standard load: ```bash llama-server -m "gemmable-4-12b-fp16.gguf" ``` Speculative / draft-MTP load: ```bash llama-server -m "gemmable-4-12b-Q4_K_M.gguf" \ --spec-draft-model "gemmable-4-12b-Q4_K_M-mtp.gguf" \ --spec-type draft-mtp \ --spec-draft-n-max 4 ``` Use the matching fp16 or quantized main file with its `-mtp` companion. ### LM Studio 1. Search this repo, download target + mtp file. 2. Load target. 3. Load settings → Speculative Decoding → select mtp file file. (Requires a llama.cpp runtime with Gemma 4 MTP support from ggml-org/llama.cpp#23398. LocalAI's pinned llama.cpp backend already carries it, so this entry runs draft-mtp out of the box.) ## GGUF / local inference notes ...

Links

https://huggingface.co/Mia-AiLab/Gemmable-4-12B-MTP-GGUF

Tags

laguna-xs-2.1-apex-i-quality

Laguna XS 2.1 in the 21.8 GB APEX-I Quality format. This is the highest-fidelity importance-matrix APEX build for llama.cpp. License: OpenMDW 1.1.

Links

Tags

laguna-xs-2.1-apex-i-balanced

Laguna XS 2.1 in the 24.3 GB APEX-I Balanced format, an importance-matrix build balancing fidelity and memory use for llama.cpp. License: OpenMDW 1.1.

Links

Tags

laguna-xs-2.1-apex-i-compact

Laguna XS 2.1 in the 15.8 GB APEX-I Compact format, an importance-matrix build tuned for lower memory use in llama.cpp. License: OpenMDW 1.1.

Links

Tags

laguna-xs-2.1-apex-i-mini

Laguna XS 2.1 in the 12.8 GB APEX-I Mini format, the smallest importance-matrix APEX build for llama.cpp. License: OpenMDW 1.1.

Links

Tags

laguna-xs-2.1-apex-quality

Laguna XS 2.1 in the 21.8 GB APEX Quality format, the highest-fidelity non-imatrix APEX build for llama.cpp. License: OpenMDW 1.1.

Links

Tags

laguna-xs-2.1-apex-balanced

Laguna XS 2.1 in the 24.3 GB APEX Balanced format, balancing fidelity and memory use for llama.cpp. License: OpenMDW 1.1.

Links

Tags

laguna-xs-2.1-apex-compact

Laguna XS 2.1 in the 15.8 GB APEX Compact format, tuned for lower memory use in llama.cpp. License: OpenMDW 1.1.

Links

Tags

Model Gallery

Filter by type:

Filter by tags:

pocket-35b

pocket-35b-q2

pocket-35b-iq1

pocket-26b

pocket-26b-q2

minicpm5-1b-claude-opus-fable5-v2-thinking

bonsai-8b-1bit

ternary-bonsai-8b

ternary-bonsai-8b-q2-g64

bonsai-27b-1bit

ternary-bonsai-27b

ternary-bonsai-27b-q2-g64

minicpm5-1b-claude-opus-fable5-thinking

gemmable-4-12b-mtp

laguna-xs-2.1-apex-i-quality

laguna-xs-2.1-apex-i-balanced

laguna-xs-2.1-apex-i-compact

laguna-xs-2.1-apex-i-mini

laguna-xs-2.1-apex-quality

laguna-xs-2.1-apex-balanced

laguna-xs-2.1-apex-compact