Repository: localai
License: apache-2.0

# Qwopus3.5-9B-v3.5

## Model Overview & v3.5 Design

Qwopus3.5-9B-v3.5 is a **data-scaled continuation** of the Qwopus3.5-9B-v3 model. The training data in v3.5 is expanded to cover a broader range of domains, including mathematics, programming, puzzle-solving, multilingual dialogue, instruction-following, multi-turn interactions, and STEM-related tasks.

Qwopus3.5-9B-v3.5 is a reasoning-enhanced model based on **Qwen3.5-9B**, designed for:

- Structured reasoning
- Tool-augmented workflows
- Multi-step agentic tasks
- Token-efficient inference

Compared with Qwopus3.5-9B-v3, **v3.5 does not introduce a new architecture, RL stage, or template redesign**. This version is trained with approximately **2× more SFT data**.

## Motivation & Generalization Insight

The motivation behind v3.5 comes from a simple observation:

> This work is motivated by the hypothesis that scaling high-quality SFT data may further enhance the generalization ability of large language models.

In earlier Qwopus3.5 experiments, structured reasoning was observed to improve both **accuracy and efficiency**: ...
Repository: localai
License: apache-2.0
# Qwopus3.6-27B-v2-MTP

MTP Release: Multi-Token Prediction reasoning model fine-tuned from Qwen3.6-27B

Trace Inversion & Negentropy | 27B Parameters | Speculative Decoding | Coding / DevOps / Math

## What is Qwopus3.6-27B-v2-MTP?

Qwopus3.6-27B-v2-MTP is a speed-oriented reasoning release built on top of Qwen3.6-27B. It keeps the Qwopus line's focus on reconstructed reasoning traces, coding discipline, DevOps procedures, and mathematical derivations, while adding Multi-Token Prediction for faster generation. The goal is simple: preserve the depth and structure of a 27B reasoning model while making real interactive use noticeably faster.

- **MTP Decoding**: Auxiliary future-token prediction improves throughput on long reasoning, code, math, and strict-format prompts.
- **Structured Reasoning**: Inherits the Qwopus training recipe built around reconstructed step-by-step reasoning trajectories.
- **GB10 Tested**: Validated on a 30-question local benchmark across Logic, Coding, DevOps, Math, and Edge tasks.
- **Practical Speed**: Designed for workflows where strong answers matter, but waiting several extra minutes per task does not.

...
Repository: localai
License: gemma

Google Gemma 4 E2B-IT served by SGLang with Multi-Token Prediction (MTP) speculative decoding. The companion drafter google/gemma-4-E2B-it-assistant lets the target accept several tokens per step. Flags are a 1:1 transcription of the SGLang cookbook's MTP command (NEXTN algorithm, num_steps=5, num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The E2B variant has 5B total / 2B effective parameters and targets the smaller end of consumer GPUs.
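
For orientation, the settings listed above can be assembled into a launch command along the lines of the sketch below. It only builds and prints the argument list; it does not start a server. The flag spellings follow SGLang's commonly documented `launch_server` options, the helper name `build_mtp_launch_args` is hypothetical, and the target path `google/gemma-4-E2B-it` is an assumption inferred from the drafter name, so check all of them against the SGLang cookbook before use.

```python
import shlex


def build_mtp_launch_args(target: str, drafter: str, mem_fraction: float = 0.85) -> list[str]:
    """Return an argv list for an SGLang server with MTP speculative decoding.

    Values mirror the description: NEXTN, num_steps=5, num_draft_tokens=6, eagle_topk=1.
    """
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", target,
        "--speculative-draft-model-path", drafter,
        "--speculative-algorithm", "NEXTN",
        "--speculative-num-steps", "5",
        "--speculative-num-draft-tokens", "6",
        "--speculative-eagle-topk", "1",
        "--mem-fraction-static", str(mem_fraction),
    ]


args = build_mtp_launch_args(
    target="google/gemma-4-E2B-it",  # assumed target repo name, not stated in the description
    drafter="google/gemma-4-E2B-it-assistant",
)
print(shlex.join(args))
```

Keeping the flags in one helper makes it easy to reuse the same settings for another target/drafter pair by changing only the two model paths.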
Repository: localai
License: gemma

Google Gemma 4 E4B-IT served by SGLang with Multi-Token Prediction (MTP) speculative decoding. The companion drafter google/gemma-4-E4B-it-assistant lets the target accept several tokens per step. Flags are a 1:1 transcription of the SGLang cookbook's MTP command (NEXTN algorithm, num_steps=5, num_draft_tokens=6, eagle_topk=1, mem_fraction_static=0.85). The E4B variant has 8B total / 4B effective parameters, making it the natural pick for consumer GPUs in the 16–24 GB range.
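
To illustrate what "accept several tokens per step" means, here is a toy sketch of greedy draft verification. It is not SGLang's implementation: real servers score all draft tokens in one batched forward pass and may use sampling-based acceptance, whereas this version calls a stand-in target one token at a time for readability. The names `verify_draft`, `toy_target`, and `REFERENCE` are hypothetical.

```python
def verify_draft(draft_tokens, target_next_token, context):
    """Accept the longest draft prefix the target agrees with, then add one target token."""
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(context + accepted)
        if tok != expected:
            # First disagreement: keep the target's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next_token(context + accepted))
    return accepted


# Hypothetical stand-in for the target model: it always continues a fixed sequence.
REFERENCE = [10, 11, 12, 13, 14, 15, 16, 17]


def toy_target(prefix):
    return REFERENCE[len(prefix)]


# The drafter guesses five tokens; the fourth guess (99) is wrong.
step = verify_draft([10, 11, 12, 99, 14], toy_target, context=[])
print(step)  # [10, 11, 12, 13]
```

In this toy run, one verification step yields four tokens instead of one, which is where the speed-up comes from when the drafter's guesses are usually right.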
Repository: localai
License: mit

Xiaomi MiMo-7B-RL served by SGLang with built-in Multi-Token Prediction (MTP) heads (no separate drafter needed) plus online fp8 weight quantization to fit on a 16 GB consumer GPU. The model card reports ~90% acceptance, and the setup was verified end-to-end at ~88 tok/s on an RTX 5070 Ti (16 GB). Note: mem_fraction_static is dropped to 0.7 (vs SGLang's 0.85 default) because the MTP draft worker's vocab embedding is loaded unquantised (~1.2 GiB) and OOMs the static reservation otherwise.
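
A rough back-of-envelope for the memory note above, under two simplifying assumptions that are not from the model card: the card is treated as exactly 16 GiB, and `mem_fraction_static` is read as the share of GPU memory set aside for the static pool, with everything else left as headroom. Real allocator behaviour is more involved, so treat the numbers as an intuition aid only. The helper `headroom_gib` is hypothetical.

```python
TOTAL_GIB = 16.0            # nominal card size (assumption: treated as GiB)
DRAFT_EMBEDDING_GIB = 1.2   # unquantised draft vocab embedding, per the note above


def headroom_gib(mem_fraction_static: float, total_gib: float = TOTAL_GIB) -> float:
    """Memory left outside the static reservation under the simplified model."""
    return total_gib * (1.0 - mem_fraction_static)


for fraction in (0.85, 0.7):
    free = headroom_gib(fraction)
    remaining = free - DRAFT_EMBEDDING_GIB
    print(
        f"mem_fraction_static={fraction}: {free:.1f} GiB outside the static pool, "
        f"{remaining:.1f} GiB left after the draft embedding"
    )
```

Under these assumptions, lowering the fraction from 0.85 to 0.7 doubles the memory outside the static pool, which is consistent with the extra embedding no longer triggering an out-of-memory error.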