LocalAI - Models

qwen3.6-27b-fable-fusion-711-uncensored-heretic-nm-dau-neo-max-mtp

Important: This is the first fine tune to exceed 700 on "arc-c" (the OpenAI, Claude and Gemini "zone of intelligence") in both 8 bit and 4 bit. This repo contains both "regular" and "MTP" Neo MAX Imatrix quants. Qwen3.6-27B-Fable-Fusion-711-Uncensored-Heretic-NM-DAU-NEO-MAX-MTP-GGUF: the strongest, smartest open source multi-stage model fine tune for consumer hardware ever, and BUILT on consumer hardware via Unsloth. The first model of this size/type to breach "700" ARC-C in both 8 bit and 4 bit; hence the "711" in the name. This model (both 4 bit and 8 bit) exceeds the base Qwen 3.6 27B in 6 out of 7 benchmarks, matches it on the 7th, AND exceeds Qwen3.6-35B-A3B on all 7 benchmarks. The 700 "intelligence club" is reserved for OpenAI, Claude and Gemini closed source models. This is the one they fear. This is a multi-stage fine tune, multi-fine tune, and multi-stage merge. A collaboration between myself (multiple fine tunes, including multi-stage), Nightmedia (merge/benching), TeichAI (Polaris Dataset), armand0e (Light fable 5 traces) and trohrbaugh (heretic'ing the model). ...

Links

https://huggingface.co/DavidAU/Qwen3.6-27B-Fable-Fusion-711-Uncensored-Heretic-NM-DAU-NEO-MAX-MTP-GGUF

Tags

hy3

## Model Introduction

**Hy3** is a 295B-parameter Mixture-of-Experts (MoE) model with 21B active parameters and 3.8B MTP layer parameters, developed by the Tencent Hy Team. Following the Hy3 Preview launch in late April, we gathered feedback from 50+ products and scaled up post-training with higher quality data. Today, we introduce Hy3, which outperforms similar-size models and rivals flagship open-source models with 2-5x parameters. It also shows significant gains in utility across various products and productivity tasks.

## Stronger Agent Capabilities ...

Links

https://huggingface.co/AngelSlim/Hy3-GGUF

Tags

qwopus3.6-35b-a3b-coder-mtp

# 🌟 Qwopus3.6-35B-A3B-v1

## 💡 Base Model Overview

**Qwen3.6-35B-A3B** is an advanced hybrid sparse MoE (Mixture-of-Experts) model developed by Alibaba Cloud. It features 35B total parameters with only 3B active parameters per token, ensuring high inference efficiency. Architecturally, it combines Gated DeltaNet linear attention with standard gated attention layers, routing tokens across **256 experts**. It natively supports a massive **262k context window** and is specifically designed for high-performance agentic coding, deep reasoning, and multimodal tasks.

## 🚀 Model Refinement & Logic Tuning (Qwopus3.6-35B-A3B-v1)

🪐 **Qwopus3.6-35B-A3B-v1** is a reasoning-enhanced MoE (Mixture of Experts) model fine-tuned on top of **Qwen3.6-35B-A3B**.

### 🛠 Training Strategy

The fine-tuning process for this model is structured into **three distinct stages of distributed SFT (Supervised Fine-Tuning)**, progressively scaling reasoning complexity and data diversity. This systematic approach ensures the model inherits the base MoE capabilities while sharpening its logic-handling depth. ...
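As a rough illustration of the "35B total, 3B active" figures quoted in the base model overview, the arithmetic below computes the share of weights touched per token. The function name `active_fraction` is hypothetical, and the numbers are simply the rounded values from the card, so treat the result as approximate.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a sparse MoE model's weights used for one token's forward pass.

    Both arguments are parameter counts in billions.
    """
    if total_params_b <= 0 or not 0 <= active_params_b <= total_params_b:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active_params_b / total_params_b


# Rounded figures from the model card: 35B total, 3B active per token.
frac = active_fraction(35.0, 3.0)
print(f"{frac:.1%} of parameters active per token")  # prints "8.6% of parameters active per token"
```

Per-token compute scales with the active subset, but the full set of expert weights still has to be stored, so the memory footprint follows the 35B figure rather than the 3B one.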

Links

https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF

Tags

ornith-1.0-9b-mtp

# Ornith-1.0-9B

Aloha! 🌺 Today, we are releasing Ornith-1.0, a self-improving family of open-source models for agentic coding. Highlights:

- **State-of-the-Art Coding Agents**: Available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE (post-trained on top of Gemma 4 and Qwen 3.5), achieving state-of-the-art performance among open-source models of comparable size on coding benchmarks such as Terminal-Bench 2.1, SWE-Bench, NL2Repo and OpenClaw.
- **Self-Improving Training Framework**: Ornith-1.0 employs RL to learn to generate not only solution rollouts, but also the scaffold that drives those rollouts. By jointly optimizing the scaffold and the resulting solution, the model discovers better search trajectories and generates higher-quality solutions.
- **Licence**: MIT licensed, globally accessible, and free from regional limitations.

## Ornith 1.0 9B

This model card documents **Ornith-1.0-9B**, the most lightweight member of the Ornith family, designed for efficient single-GPU deployment.

### Benchmarks

Ornith-1.0-9B Qwen3.5-9B Qwen3.5-35B Gemma4-12B Gemma4-31B Agentic Coding ...

Links

https://huggingface.co/protoLabsAI/Ornith-1.0-9B-MTP-GGUF

Tags

gemmable-4-12b-mtp

## Gemmable 4 12B

Gemmable 4 12B is a GGUF export of Gemma 4 12B fine-tuned on Fable-5 style reasoning and assistant traces.

## Highlights

- Base model: `google/gemma-4-12B`
- Format: GGUF
- Training style: Fable-5 style reasoning and assistant traces
- Distribution: fp16 GGUF plus matching assistant GGUFs for each quant
- Intended use: local inference, coding, reasoning, and assistant workflows

## How to use

### llama.cpp

Standard load:

```bash
llama-server -m "gemmable-4-12b-fp16.gguf"
```

Speculative / draft-MTP load:

```bash
llama-server -m "gemmable-4-12b-Q4_K_M.gguf" \
  --spec-draft-model "gemmable-4-12b-Q4_K_M-mtp.gguf" \
  --spec-type draft-mtp \
  --spec-draft-n-max 4
```

Use the matching fp16 or quantized main file with its `-mtp` companion.

### LM Studio

1. Search this repo, download target + mtp file.
2. Load target.
3. Load settings → Speculative Decoding → select mtp file.

(Requires a llama.cpp runtime with Gemma 4 MTP support from ggml-org/llama.cpp#23398. LocalAI's pinned llama.cpp backend already carries it, so this entry runs draft-mtp out of the box.)

## GGUF / local inference notes ...
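The pairing rule from the llama.cpp instructions in this entry (each main file is used with its `-mtp` companion) can be sketched as a small helper that assembles the draft-MTP command. The flags are copied from the command shown in this entry; the helper names `mtp_companion` and `draft_mtp_command` are hypothetical, and the assumption that every companion is named by appending `-mtp` before the `.gguf` suffix is inferred only from the Q4_K_M example.

```python
from pathlib import Path
from typing import List


def mtp_companion(main_file: str) -> str:
    """Derive the companion file name by inserting `-mtp` before the suffix.

    Assumption: all companions follow the naming seen in the Q4_K_M example.
    """
    path = Path(main_file)
    return f"{path.stem}-mtp{path.suffix}"


def draft_mtp_command(main_file: str, n_max: int = 4) -> List[str]:
    """Build the llama-server argument list for a draft-MTP load."""
    return [
        "llama-server", "-m", main_file,
        "--spec-draft-model", mtp_companion(main_file),
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(n_max),
    ]


cmd = draft_mtp_command("gemmable-4-12b-Q4_K_M.gguf")
print(" ".join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.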

Links

https://huggingface.co/Mia-AiLab/Gemmable-4-12B-MTP-GGUF

Tags

qwopus3.6-27b-coder-compat-mtp

🪐 Qwopus-3.6-27B-Coder Coder SFT Release Agentic Coding & Tool-Use Reasoning Model Fine-Tuned on Qwopus3.6-27B-v2 🧬 Trace Inversion & Negentropy 🧠 27B Dense Model ⚡ Agentic Coding 🛠️ Tool Calling & Agent 🏆 SWE-bench Verified: 67.0% (off-thinking) 💡 What is Qwopus-3.6-27B-Coder? 🪐 Qwopus-3.6-27B-Coder is a reasoning-enhanced agentic coding model built on top of Qwopus3.6-27B-v2. It inherits the powerful reasoning foundation of the v2 base — which achieved 87.43% MMLU-Pro and 75.25% SWE-bench Verified — and further specializes it for agentic code generation, structured tool calling, debugging, and instruction-following in developer workflows. The model is designed to excel at repository-level coding tasks, multi-turn tool orchestration, and complex logical reasoning under realistic agent environments. 🧩 Agentic Coding Optimized for repository-level coding, debugging, patch generation, and structured multi-step development workflows. 🛠️ Tool Calling Learns from real agent trajectories with tool definitions, tool calls, and environment feedback for robust multi-turn execution. ...

Links

https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-Compat-MTP-GGUF

Tags

qwythos-9b-claude-mythos-5-1m

# Qwythos-9B **Developed by Empero** **Qwythos-9B** is a full-parameter reasoning model built on top of a **deeply uncensored Qwen3.5-9B base** and post-trained on **over 500 million tokens** of high-quality Claude Mythos and Claude Fable traces, with chain-of-thought generated in-house by Empero AI's internal tool **rethink**. The result is a compact, fast, **dramatically more capable** 9B reasoning model. Headline capabilities: ...

Links

https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF

Tags

glm-5.2

# GLM-5.2

## Introduction

We're introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a **solid 1M-token context**.

GLM-5.2's new capabilities include:

- **Solid 1M Context:** A solid 1M-token context that stably sustains long-horizon work
- **Advanced Coding with Flexible Effort**: Stronger coding capabilities with multiple thinking effort levels to balance performance and latency
- **Improved Architecture**: We propose IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2's MTP layer for speculative decoding, increasing the acceptance length by up to 20%
- **Pure Open**: An MIT open-source license: no regional limits, technical access without borders

## Benchmark

## Serve GLM-5.2 Locally ...

Links

https://huggingface.co/unsloth/GLM-5.2-GGUF

Tags

qwen3.6-35b-a3b-nvfp4-mtp

# Qwen3.6-35B-A3B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-35B-A3B.

## Model Overview ...

Links

https://huggingface.co/michaelw9999/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF

Tags

qwopus3.6-27b-v2-mtp-nvfp4

🪐 Qwopus3.6-27B-v2-MTP MTP Release Multi-Token Prediction reasoning model fine-tuned from Qwen3.6-27B 🧬 Trace Inversion & Negentropy 🧠 27B Parameters ⚡ Speculative Decoding 🛠️ Coding / DevOps / Math 💡 What is Qwopus3.6-27B-v2-MTP? 🪐 Qwopus3.6-27B-v2-MTP is a speed-oriented reasoning release built on top of Qwen3.6-27B. It keeps the Qwopus line's focus on reconstructed reasoning traces, coding discipline, DevOps procedures, and mathematical derivations, while adding Multi-Token Prediction for faster generation. The goal is simple: preserve the depth and structure of a 27B reasoning model while making real interactive use noticeably faster. ⚡ MTP Decoding: Auxiliary future-token prediction improves throughput on long reasoning, code, math, and strict-format prompts. 🧩 Structured Reasoning: Inherits the Qwopus training recipe built around reconstructed step-by-step reasoning trajectories. 🧪 GB10 Tested: Validated on a 30-question local benchmark across Logic, Coding, DevOps, Math, and Edge tasks. 🚀 Practical Speed: Designed for workflows where strong answers matter, but waiting several extra minutes per task does not. ...

Links

https://huggingface.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF

Tags

qwopus3.6-27b-coder-mtp-nvfp4

🪐 Qwopus-3.6-27B-Coder Coder SFT Release Agentic Coding & Tool-Use Reasoning Model Fine-Tuned on Qwopus3.6-27B-v2 🧬 Trace Inversion & Negentropy 🧠 27B Dense Model ⚡ Agentic Coding 🛠️ Tool Calling & Agent 🏆 SWE-bench Verified: 67.0% (off-thinking) 💡 What is Qwopus-3.6-27B-Coder? 🪐 Qwopus-3.6-27B-Coder is a reasoning-enhanced agentic coding model built on top of Qwopus3.6-27B-v2. It inherits the powerful reasoning foundation of the v2 base — which achieved 87.43% MMLU-Pro (300ex) and 75.25% SWE-bench Verified — and further specializes it for agentic code generation, structured tool calling, debugging, and instruction-following in developer workflows. The model is designed to excel at repository-level coding tasks, multi-turn tool orchestration, and complex logical reasoning under realistic agent environments. 🧩 Agentic Coding Optimized for repository-level coding, debugging, patch generation, and structured multi-step development workflows. 🛠️ Tool Calling Learns from real agent trajectories with tool definitions, tool calls, and environment feedback for robust multi-turn execution. ...

Links

https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF

Tags

qwen3.6-27b-nvfp4-mtp

# Qwen3.6-27B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-27B.

## Model Overview ...

Links

https://huggingface.co/michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF

Tags

qwen3.6-27b-mtp-pi-tune

# Qwen3.6-27B

> [!Note]
> This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

## Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in

- **Agentic Coding:** the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-27B.

## Model Overview ...

Links

https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF

Tags

qwopus3.6-27b-coder-mtp

🪐 Qwopus3.6-27B-v2 SFT Release Reasoning-Enhanced Dense Language Model Fine-Tuned on Qwen3.6-27B 🧬 Trace Inversion & Negentropy 🧠 27B Parameters 🔥 3-Stage Curriculum SFT 🛠️ Vision & Tool-use Support 💡 What is Qwopus3.6-27B-v2? 🪐 Qwopus3.6-27B-v2 is a reasoning-enhanced dense language model built on top of Qwen3.6-27B. By leveraging a multi-stage curriculum learning pipeline and augmented with Trace Inversion datasets (claude-opus-4.6/4.7-traceInversion), it reverse-engineers the compressed "Reasoning Bubbles" of commercial LLMs into structured, step-by-step synthetic reasoning traces, successfully eliminating logical shortcuts and knowledge fractures. 🧩 Structured Reasoning Injects reconstructed deep CoT chains to eliminate logical shortcuts via Trace Inversion. 🪶 Style Consistency Enforces strict constraints on the format and convergence of <think> tags. 🔁 Distillation Alignment Ensures high-quality cross-source SFT data alignment to narrow the capacity gap. ⚡ RL Scalability Sets up a stable formatting pipeline optimized for downstream Reinforcement Learning (RL). ## 💡 1. Base Model, Training Library & Cooperation ...

Links

https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF

Tags

gemma-4-e2b-it-qat-mtp

Gemma 4 E2B IT QAT (Google DeepMind) paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-E2B-it` head predicts several tokens ahead which the target verifies in parallel, accelerating generation with no change to output quality. E2B is a MatFormer "effective 2B" elastic variant, well suited to lightweight and on-device deployments. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)
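The draft-then-verify loop described in this entry (a small head proposes several tokens, the target checks them in parallel, and output quality is unchanged) can be illustrated with a toy greedy version. This is a minimal sketch under stated assumptions, not llama.cpp's implementation: `target_next` and `draft_next` are made-up stand-ins for the two models, tokens are plain integers, and the draft here is an independent function rather than a head that reuses the target's internal state.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the target model's greedy next-token choice."""
    return (31 * sum(ctx) + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy drafter: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(next_token: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain one-token-per-step decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(next_token(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verify passes)."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1) The drafter proposes k tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target verifies the proposal. A real engine scores all k
        #    positions in one batched forward pass; this toy loops instead
        #    and counts the whole check as a single pass.
        verify_passes += 1
        ctx, accepted = list(out), []
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # all drafts accepted: one bonus token
        out.extend(accepted)
    return out[: len(prompt) + max_new], verify_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, max_new=20)
fast, passes = speculative_decode(target_next, draft_next, prompt, max_new=20, k=4)
print("identical output:", fast == baseline, "| verify passes:", passes)
```

Every emitted token is either a draft token the target agreed with or the target's own correction, which is why the result matches plain greedy decoding while needing fewer target passes than tokens generated.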

Links

Tags

gemma-4-e4b-it-qat-mtp

Gemma 4 E4B IT QAT (Google DeepMind) paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-E4B-it` head predicts several tokens ahead which the target verifies in parallel, accelerating generation with no change to output quality. E4B is a MatFormer "effective 4B" elastic variant, balancing quality and footprint for on-device and edge deployments. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)

Links

Tags

gemma-4-12b-it-qat-mtp

Gemma 4 12B IT QAT (Google DeepMind) paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-12B-it` head predicts several tokens ahead which the target verifies in parallel, accelerating generation with no change to output quality. As a dense model, Gemma 4 12B is among the sizes that benefit most from MTP, with the llama.cpp PR reporting well over 1.4x decode speedup. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)

Links

Tags

gemma-4-31b-it-qat-mtp

Gemma 4 31B IT QAT (Google DeepMind), the largest dense model in the family, paired with its Multi-Token Prediction (MTP) drafter head for speculative decoding on the llama.cpp backend. The Q4_K_XL target carries the full multimodal (text + image) model; the small `mtp-gemma-4-31B-it` head predicts several tokens ahead which the target verifies in parallel, accelerating generation with no change to output quality. Dense models like 31B are the sizes that benefit most from MTP. The drafter is not a standalone chat model: it only runs paired with the target, which is why both are bundled here. It uses the upstream `gemma4-assistant` architecture registered by llama.cpp PR #23398, so it loads on stock llama.cpp without any patch. License: Apache 2.0 | Authors: Google DeepMind (target/drafter checkpoints), Unsloth (GGUF conversion)

Links

Tags

qwopus3.5-9b-coder-mtp

# 🌟 Qwopus3.5-9B-v3.5

## 💡 Model Overview & v3.5 Design

Qwopus3.5-9B-v3.5 is a **data-scaled continuation** of the Qwopus3.5-9B-v3 model. The training data in v3.5 is expanded to cover a broader range of domains, including mathematics, programming, puzzle-solving, multilingual dialogue, instruction-following, multi-turn interactions, and STEM-related tasks.

Qwopus3.5-9B-v3.5 is a reasoning-enhanced model based on **Qwen3.5-9B**, designed for:

- 🧩 Structured reasoning
- 🔧 Tool-augmented workflows
- 🔁 Multi-step agentic tasks
- ⚡ Token-efficient inference

Compared with Qwopus3.5-9B-v3, **the 3.5 version does not introduce a new architecture, RL stage, or template redesign**. This version is trained with approximately **2× more SFT data**.

## 🎯 Motivation & Generalization Insight

The motivation behind v3.5 comes from a simple observation:

> This work is motivated by the hypothesis that scaling high-quality SFT data may further enhance the generalization ability of large language models.

In earlier Qwopus3.5 experiments, structured reasoning was observed to improve both **accuracy and efficiency**: ...

Links

https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-MTP-GGUF

Tags

qwopus3.6-27b-v2-mtp

🪐 Qwopus3.6-27B-v2-MTP MTP Release Multi-Token Prediction reasoning model fine-tuned from Qwen3.6-27B 🧬 Trace Inversion & Negentropy 🧠 27B Parameters ⚡ Speculative Decoding 🛠️ Coding / DevOps / Math 💡 What is Qwopus3.6-27B-v2-MTP? 🪐 Qwopus3.6-27B-v2-MTP is a speed-oriented reasoning release built on top of Qwen3.6-27B. It keeps the Qwopus line's focus on reconstructed reasoning traces, coding discipline, DevOps procedures, and mathematical derivations, while adding Multi-Token Prediction for faster generation. The goal is simple: preserve the depth and structure of a 27B reasoning model while making real interactive use noticeably faster. ⚡ MTP Decoding: Auxiliary future-token prediction improves throughput on long reasoning, code, math, and strict-format prompts. 🧩 Structured Reasoning: Inherits the Qwopus training recipe built around reconstructed step-by-step reasoning trajectories. 🧪 GB10 Tested: Validated on a 30-question local benchmark across Logic, Coding, DevOps, Math, and Edge tasks. 🚀 Practical Speed: Designed for workflows where strong answers matter, but waiting several extra minutes per task does not. ...

Links

https://huggingface.co/Jackrong/Qwopus3.6-27B-v2-MTP-GGUF

Tags

gemma-4-e2b-it:sglang-mtp

Google Gemma 4 E2B-IT served by SGLang with Multi-Token Prediction (MTP) speculative decoding. The companion drafter google/gemma-4-E2B-it-assistant lets the target accept several tokens per step. Flags are a 1:1 transcription of the SGLang cookbook's MTP command (NEXTN algorithm, num_steps=5, num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The E2B variant has 5B total / 2B effective parameters and targets the smaller end of consumer GPUs.
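The parameters listed in this entry can be collected into a launch command as a rough sketch. The values come from the description; the CLI flag spellings are my understanding of SGLang's server arguments and should be checked against `python -m sglang.launch_server --help`, the target repo id `google/gemma-4-E2B-it` is an assumption inferred from the drafter's name, and `sglang_mtp_command` is a hypothetical helper.

```python
import shlex
from typing import Dict, List, Union

# Values as listed in the entry; flag names are assumed SGLang spellings.
MTP_SETTINGS: Dict[str, Union[str, int, float]] = {
    "speculative-algorithm": "NEXTN",
    "speculative-num-steps": 5,
    "speculative-num-draft-tokens": 6,
    "speculative-eagle-topk": 1,
    "mem-fraction-static": 0.85,
}


def sglang_mtp_command(target: str, drafter: str,
                       settings: Dict[str, Union[str, int, float]] = MTP_SETTINGS) -> List[str]:
    """Assemble an SGLang server command with MTP speculative decoding."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", target,
        "--speculative-draft-model-path", drafter,
    ]
    for flag, value in settings.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


cmd = sglang_mtp_command(
    target="google/gemma-4-E2B-it",             # assumed target repo id
    drafter="google/gemma-4-E2B-it-assistant",  # drafter named in the entry
)
print(shlex.join(cmd))
```

Keeping the settings in one dictionary makes it easy to compare them against the cookbook command the entry says they were transcribed from.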

Links

Tags
