LocalAI - Models

kimi-k3

## 1. Model Introduction

Kimi K3 is an open-weight, native multimodal agentic model and our most capable model to date. It is a 2.8T-parameter model built on Kimi Delta Attention (KDA) and Attention Residuals (AttnRes), with native vision capabilities and a 1-million-token context window. It is the world's first open 3T-class model, designed for frontier intelligence across long-horizon coding, knowledge work, and reasoning. ...

Links

https://huggingface.co/unsloth/Kimi-K3-GGUF

qwythos-9b-v2

Empero AI

# Qwythos-9B-v2: the new and improved Qwythos

The next iteration of Qwythos: **all the reasoning of Qwythos-9B, with the looping behavior fixed.** v2 keeps the deep chain-of-thought, the uncensored research posture, and the 1M-token context of its predecessor, and cleans up the rough edges that showed up in real use.

- 🔁 **Looping behavior eliminated**: repetition/degeneration under greedy or low-temperature decoding dropped from **6.7% → 0%**. You can serve it *without* leaning on `repetition_penalty` as a band-aid.
- 🧠 **Reasoning fully preserved**: MMLU, GSM8K, GPQA, ARC and HumanEval are all held at (or above) the v1 level. This is a *hygiene* upgrade, not a capability regression.
- 🧩 **MTP head restored**: the native multi-token-prediction module (dropped in the previous export) is back, so config and weights agree and speculative-decoding setups work.
- 🪪 **Cleaner identity**: the model no longer prefaces unrelated answers with its identity; it introduces itself only when you actually ask.
- 🔓 **Still intentionally uncensored** for research, cybersecurity, red-teaming, biology, chemistry, pharmacology and clinical work.
- 📜 **St ...

Links

https://huggingface.co/empero-ai/Qwythos-9B-v2-GGUF

minicpm5-1b-claude-opus-fable5-v2-thinking

# MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking

GGUF quantizations for local deployment: **MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking-GGUF**

**MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking** is a compact 1B **Thinking** language model built on openbmb/MiniCPM5-1B. Compared with V1, this V2 release is further fine-tuned on **Fable 5** data with a stronger focus on **tool calling / function calling**, while also improving **coding** and **instruction-following**. It keeps MiniCPM5's native Thinking chat template and XML tool-call format.

Previous version: **MiniCPM5-1B-Claude-Opus-Fable5-Thinking** (V1)

For llama.cpp / Ollama / LM Studio deployment, see the **GGUF repository**.

## Overview

## Capabilities

- **Tool calling (enhanced in V2)**: more reliable XML / function-calling style tool use on top of MiniCPM5's native format
- **Coding**: code generation, debugging, and software-engineering-style tasks
- **Instruction following**: more reliable adherence to user prompts and structured constraints
- **Thinking mode**: chain-of-thought reasoning via the MiniCPM5 chat template
- **Long context**: up to **128K tokens** (131,072 tokens per `config.json`)

...

Links

https://huggingface.co/GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-V2-Thinking-GGUF

ternary-bonsai-8b-q2-g64

Ternary Bonsai 8B (PrismML), GGUF Q2_0 with group-64 packing (each FP16 scale shared across 64 weights instead of 128). This makes the file slightly larger (~2.31 GB), but it matches llama.cpp's native 64-value Q2_0 block layout. Runs on LocalAI's `bonsai` backend. License: Apache 2.0.
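The size difference between group-64 and group-128 packing comes from the per-group scale overhead. A minimal sketch of that arithmetic, assuming 2 bits per packed weight (inferred from the Q2_0 name, not stated in the entry) and one 16-bit scale per group, and ignoring every other tensor and all metadata in the file. The function name is hypothetical.

```python
def bits_per_weight(group_size: int, weight_bits: int = 2, scale_bits: int = 16) -> float:
    """Average storage per weight when one scale is shared by `group_size` weights.

    Assumption: 2-bit weights plus one FP16 scale per group, nothing else.
    """
    return weight_bits + scale_bits / group_size


g64 = bits_per_weight(64)    # 2 + 16/64
g128 = bits_per_weight(128)  # 2 + 16/128
overhead = g64 / g128 - 1    # relative growth when halving the group size

print(f"group-64: {g64} bits/weight")
print(f"group-128: {g128} bits/weight")
print(f"relative increase: {overhead:.1%}")
```

Under these assumptions, halving the group size adds roughly 6% to the packed weight data, which fits the "slightly larger" description.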

ternary-bonsai-27b-q2-g64

Ternary Bonsai 27B (PrismML), GGUF Q2_0 with group-64 packing (~7.59 GB), matching llama.cpp's native 64-value Q2_0 block layout, with the 4-bit vision tower (mmproj) included. Runs on LocalAI's `bonsai` backend. License: Apache 2.0.

minicpm5-1b-claude-opus-fable5-thinking

# MiniCPM5-1B-Claude-Opus-Fable5-Thinking

GGUF quantizations for local deployment: **MiniCPM5-1B-Claude-Opus-Fable5-Thinking-GGUF**

**MiniCPM5-1B-Claude-Opus-Fable5-Thinking** is a compact 1B **Thinking** language model built on openbmb/MiniCPM5-1B. It is further fine-tuned on **Fable 5** data to improve **coding** and **instruction-following** while keeping MiniCPM5's native Thinking chat template and tool-call format.

For llama.cpp / Ollama / LM Studio deployment, see the **GGUF repository**.

## Overview

## Capabilities

- **Coding**: code generation, debugging, and software-engineering-style tasks
- **Instruction following**: more reliable adherence to user prompts and structured constraints
- **Thinking mode**: chain-of-thought reasoning via the MiniCPM5 chat template
- **Tool calling**: inherits MiniCPM5's XML tool-call format
- **Long context**: up to **128K tokens** (131,072 tokens per `config.json`)

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-Thinking"
...
```

Links

https://huggingface.co/GnLOLot/MiniCPM5-1B-Claude-Opus-Fable5-Thinking-GGUF

qwopus3.6-35b-a3b-coder-mtp

# 🌟 Qwopus3.6-35B-A3B-v1

## 💡 Base Model Overview

**Qwen3.6-35B-A3B** is an advanced hybrid sparse MoE (Mixture-of-Experts) model developed by Alibaba Cloud. It features 35B total parameters with only 3B active parameters per token, ensuring high inference efficiency. Architecturally, it combines Gated DeltaNet linear attention with standard gated attention layers, routing tokens across **256 experts**. It natively supports a massive **262k context window** and is specifically designed for high-performance agentic coding, deep reasoning, and multimodal tasks.

## 🚀 Model Refinement & Logic Tuning (Qwopus3.6-35B-A3B-v1)

🪐 **Qwopus3.6-35B-A3B-v1** is a reasoning-enhanced MoE (Mixture of Experts) model fine-tuned on top of **Qwen3.6-35B-A3B**.

### 🛠 Training Strategy

The fine-tuning process for this model is structured into **three distinct stages of distributed SFT (Supervised Fine-Tuning)**, progressively scaling reasoning complexity and data diversity. This systematic approach ensures the model inherits the base MoE capabilities while sharpening its logic-handling depth. ...

Links

https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF

qwen-agentworld-35b-a3b

# Qwen-AgentWorld-35B-A3B

> [!Note]
> This repository contains the model weights and configuration files for **Qwen-AgentWorld-35B-A3B**, a native language world model trained for agentic environment simulation.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.

**Qwen-AgentWorld** is the first language world model to cover seven agent interaction domains within a single model. It simulates agentic environments via long chain-of-thought reasoning, predicting the next environment state given an agent's action and interaction history. Trained through a three-stage pipeline (CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity), Qwen-AgentWorld is a **native world model**: environment modeling is the training objective from the CPT stage onward, not a post-hoc add-on.

## Highlights

...

Links

https://huggingface.co/unsloth/Qwen-AgentWorld-35B-A3B-GGUF

laguna-xs-2.1

Laguna XS 2.1 is Poolside's 33B-parameter, 3B-active Mixture-of-Experts model for agentic coding and long-horizon work on local machines. It supports tool use, interleaved reasoning, and a native 262K-token context window. This default entry uses the official 20.3 GB Q4_K_M GGUF. License: OpenMDW 1.1.

laguna-s-2.1-q8

Laguna S 2.1 is Poolside's 118B-parameter, 8B-active Mixture-of-Experts model for agentic software engineering. It supports tool use and a native one-million-token context window; the official GGUF recommends 256K context for best output quality. This entry uses the 129 GB Q8_0 build, with routed experts quantized to Q8_0 and the signal path kept in BF16. License: OpenMDW 1.1.

laguna-s-2.1

Laguna S 2.1 is Poolside's 118B-parameter, 8B-active Mixture-of-Experts model for agentic software engineering. It supports tool use and a native one-million-token context window; the official GGUF recommends 256K context for best output quality. This default entry uses the current 96 GB Q4_K_M artifact, with imatrix-quantized routed experts and a Q8_0 signal path. License: OpenMDW 1.1.
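The three Laguna entries quote both a parameter count and a file size, which is enough to estimate the average number of bits stored per parameter. A rough sketch, assuming 1 GB = 10^9 bytes and that the whole file is model weights (metadata and tokenizer data ignored). The helper name is hypothetical.

```python
def avg_bits_per_param(file_gb: float, params_billion: float) -> float:
    """Estimate average bits per parameter.

    Assumptions: file size in GB of 1e9 bytes, parameter count in billions,
    and the entire file counted as weights.
    """
    return file_gb * 8 / params_billion


# (file size in GB, total parameters in billions), as quoted in the entries
entries = {
    "laguna-xs-2.1 (Q4_K_M)": (20.3, 33),
    "laguna-s-2.1 (Q4_K_M)": (96, 118),
    "laguna-s-2.1-q8 (Q8_0)": (129, 118),
}

for name, (gb, params) in entries.items():
    print(f"{name}: ~{avg_bits_per_param(gb, params):.2f} bits/param")
```

Under these assumptions the Q8_0 build comes out above 8 bits per parameter, which is consistent with part of the model being kept in BF16 rather than quantized.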

gemma-4-12b-agentic-fable5-composer2.5-v2-3.5x-tau2

License: Apache 2.0 | Authors: Google DeepMind

> [!Note]
> This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution.

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. ...

Links

https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF

gemma-4-12b-coder-fable5-composer2.5-v1

License: Apache 2.0 | Authors: Google DeepMind

> [!Note]
> This model card is for the Gemma 4 12B Unified model, which is part of the Gemma 4 family of open models. Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution.

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. ...

Links

https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

gemma-4-26b-a4b-it-qat

License: Apache 2.0 | Authors: Google DeepMind

> [!Note]
> This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model.
> Four versions of the QAT checkpoints are available:
> * **Unquantized QAT checkpoints** (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.
> * **GGUF** (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.
> * **Mobile-optimized** (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.
> * **Compressed Tensors** (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B ...

Links

https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF

gemma-4-12b-it-qat-q4_0

License: Apache 2.0 | Authors: Google DeepMind

> [!Note]
> This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model.
> Four versions of the QAT checkpoints are available:
> * **Unquantized QAT checkpoints** (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.
> * **GGUF** (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.
> * **Mobile-optimized** (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.
> * **Compressed Tensors** (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B ...

Links

https://huggingface.co/google/gemma-4-12B-it-qat-q4_0-gguf

step-3.7-flash

**[ModelPage]**: https://static.stepfun.com/blog/step-3.7-flash/

## 1. Introduction

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.

We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.

## 2. Capabilities & Performance

### Multimodal Perception and Verification

...

Links

https://huggingface.co/unsloth/Step-3.7-Flash-GGUF

kimi-k2.6

## 1. Model Introduction

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

### Key Features

- **Long-Horizon Coding**: K2.6 achieves significant improvements on complex, end-to-end coding tasks, generalizing robustly across programming languages (Rust, Go, Python) and domains spanning front-end, DevOps, and performance optimization.
- **Coding-Driven Design**: K2.6 is capable of transforming simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with deliberate aesthetic precision.
- **Elevated Agent Swarm**: Scaling horizontally to 300 sub-agents executing 4,000 coordinated steps, K2.6 can dynamically decompose tasks into parallel, domain-specialized subtasks, delivering end-to-end outputs from documents to websites to spreadsheets in a single autonomous run.
- **Proactive & Open Orchestration**: For autonomous tasks, K2.6 demonstra ...

Links

https://huggingface.co/unsloth/Kimi-K2.6-GGUF

nanbeige4.1-3b-q8

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and represents an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors.

Key features:

- Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
- Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge.
- Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

nanbeige4.1-3b-q4

Nanbeige4.1-3B is built upon Nanbeige4-3B-Base and represents an enhanced iteration of our previous reasoning model, Nanbeige4-3B-Thinking-2511, achieved through further post-training optimization with supervised fine-tuning (SFT) and reinforcement learning (RL). As a highly competitive open-source model at a small parameter scale, Nanbeige4.1-3B illustrates that compact models can simultaneously achieve robust reasoning, preference alignment, and effective agentic behaviors.

Key features:

- Strong Reasoning: Capable of solving complex, multi-step problems through sustained and coherent reasoning within a single forward pass, reliably producing correct answers on benchmarks like LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I.
- Robust Preference Alignment: Outperforms same-scale models (e.g., Qwen3-4B-2507, Nanbeige4-3B-2511) and larger models (e.g., Qwen3-30B-A3B, Qwen3-32B) on Arena-Hard-v2 and Multi-Challenge.
- Agentic Capability: First general small model to natively support deep-search tasks and sustain complex problem-solving with >500 rounds of tool invocations; excels in benchmarks like xBench-DeepSearch (75), Browse-Comp (39), and others.

vits-piper-it_IT-paola-sherpa

Italian (it_IT) single-speaker Piper VITS voice "paola" (medium quality, 22.05 kHz), served through the sherpa-onnx backend with native streaming TTS. Ships espeak-ng phonemization data, so it works for Italian out of the box.

vits-piper-it_IT-dii-high-sherpa

Italian (it_IT) single-speaker Piper VITS voice "dii" (high quality, 22.05 kHz), served through the sherpa-onnx backend with native streaming TTS. Ships espeak-ng phonemization data. Non-commercial use only (CC BY-NC-SA 4.0).
