Model Gallery

36 models from 1 repositories

Filter by type:

Filter by tags:

qwen3.5-27b-claude-4.6-opus-reasoning-distilled-heretic-i1

Repository: localaiLicense: apache-2.0

lfm2.5-audio-1.5b-tts
LFM2.5-Audio-1.5B in TTS mode. Four baked voices: us_male, us_female, uk_male, uk_female — pick the default at load time via `voice:` option, or override per-request via the OpenAI `/v1/audio/speech` `voice` field.

Repository: localaiLicense: LFM-Open-License-v1.0

allenai_olmo-3.1-32b-think
The **Olmo-3.1-32B-Think** model is a large language model (LLM) optimized for efficient inference using quantized versions. It is a quantized version of the original **allenai/Olmo-3.1-32B-Think** model, developed by **bartowski** using the **imatrix** quantization method. ### Key Features: - **Base Model**: `allenai/Olmo-3.1-32B-Think` (unquantized version). - **Quantized Versions**: Available in multiple formats (e.g., `Q6_K_L`, `Q4_1`, `bf16`) with varying precision (e.g., Q8_0, Q6_K_L, Q5_K_M). These are derived from the original model using the **imatrix calibration dataset**. - **Performance**: Optimized for low-memory usage and efficient inference on GPUs/CPUs. Recommended quantization types include `Q6_K_L` (near-perfect quality) or `Q4_K_M` (default, balanced performance). - **Downloads**: Available via Hugging Face CLI. Split into multiple files if needed for large models. - **License**: Apache-2.0. ### Recommended Quantization: - Use `Q6_K_L` for highest quality (near-perfect performance). - Use `Q4_K_M` for balanced performance and size. - Avoid lower-quality options (e.g., `Q3_K_S`) unless specific hardware constraints apply. This model is ideal for deploying on GPUs/CPUs with limited memory, leveraging efficient quantization for practical use cases.

Repository: localaiLicense: apache-2.0

qwen3-vl-8b-instruct
Qwen3-VL-8B-Instruct is the 8B parameter model of the Qwen3-VL series. Uses recommended default parameters according to Unsloth documentation for Qwen 3 VL.

Repository: localaiLicense: apache-2.0

qwen3-vl-8b-thinking
Qwen3-VL-8B-Thinking is the 8B parameter model of the Qwen3-VL series that is thinking. Uses recommended default parameters according to Unsloth documentation for Qwen 3 VL.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-nano
RF-DETR Nano object detection model, served via the native rfdetr.cpp backend (ggml + purego, no Python). Q8_0 quantization is the recommended default for CPU: same accuracy as F16/F32, ~20MB on disk, fastest CPU latency. Pure C++/ggml runtime; no Python dependencies. Drop-in for the /v1/detection endpoint.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-small
RF-DETR Small object detection model (DINOv2-small backbone, 512px input, 3 decoder layers), served via the native rfdetr.cpp backend (ggml + purego, no Python). A step up from Nano in accuracy while staying lightweight on CPU. F16 quantization is the recommended default: identical accuracy to F32 at roughly half the size. Drop-in for the /v1/detection endpoint.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-medium
RF-DETR Medium object detection model (DINOv2-small backbone, 576px input, 4 decoder layers), served via the native rfdetr.cpp backend. Balanced detection quality vs. CPU latency — recommended when Base is not accurate enough but Large is too slow. F16 quantization is the recommended default: identical accuracy to F32, half the size. Drop-in for the /v1/detection endpoint.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-large
RF-DETR Large object detection model (DINOv2-small backbone, 704px input, 4 decoder layers), served via the native rfdetr.cpp backend. Highest-accuracy detection variant — best for offline workflows and high-resolution inputs where CPU latency is secondary to recall. F16 quantization is the recommended default: identical accuracy to F32, half the size. Drop-in for the /v1/detection endpoint.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-seg-nano
RF-DETR Seg-Nano instance segmentation model (DINOv2-small backbone, 312px input, 4 decoder layers, 100 queries), served via the native rfdetr.cpp backend. Smallest segmentation variant — fastest CPU latency, ideal for edge deployment. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-seg-small
RF-DETR Seg-Small instance segmentation model (DINOv2-small backbone, 384px input, 4 decoder layers, 100 queries), served via the native rfdetr.cpp backend. Step up from Seg-Nano in mask quality while staying CPU-friendly. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-seg-medium
RF-DETR Seg-Medium instance segmentation model (DINOv2-small backbone, 432px input, 5 decoder layers, 200 queries), served via the native rfdetr.cpp backend. Balanced segmentation quality vs. CPU latency — recommended for everyday segmentation workloads. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-seg-large
RF-DETR Seg-Large instance segmentation model (DINOv2-small backbone, 504px input, 5 decoder layers, 200 queries), served via the native rfdetr.cpp backend. Higher-resolution input than Seg-Medium for sharper mask boundaries. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-seg-xlarge
RF-DETR Seg-XLarge instance segmentation model (DINOv2-small backbone, 624px input, 6 decoder layers, 300 queries), served via the native rfdetr.cpp backend. High-capacity segmentation variant with more queries and deeper decoder — best for dense scenes with many instances. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default.

Repository: localaiLicense: apache-2.0

rfdetr-cpp-seg-2xlarge
RF-DETR Seg-2XLarge instance segmentation model (DINOv2-small backbone, 768px input, 6 decoder layers, 300 queries), served via the native rfdetr.cpp backend. Highest-accuracy segmentation variant — best for offline workflows and high-resolution inputs where CPU latency is secondary to mask quality. Returns both bounding boxes and per-instance masks via the /v1/detection endpoint. F16 quantization is the recommended default: identical accuracy to F32, half the size.

Repository: localaiLicense: apache-2.0

qwen3-42b-a3b-stranger-thoughts-deep20x-abliterated-uncensored-i1
WARNING: NSFW. Vivid prose. INTENSE. Visceral Details. Violence. HORROR. GORE. Swearing. UNCENSORED... humor, romance, fun. Qwen3-42B-A3B-Stranger-Thoughts-Deep20x-Abliterated-Uncensored This repo contains the full precision source code, in "safe tensors" format to generate GGUFs, GPTQ, EXL2, AWQ, HQQ and other formats. The source code can also be used directly. ABOUT: Qwen's excellent "Qwen3-30B-A3B", abliterated by "huihui-ai" then combined Brainstorm 20x (tech notes at bottom of the page) in a MOE (128 experts) at 42B parameters (up from 30B). This pushes Qwen's abliterated/uncensored model to the absolute limit for creative use cases. Prose (all), reasoning, thinking ... all will be very different from reg "Qwen 3s". This model will generate horror, fiction, erotica, - you name it - in vivid, stark detail. It will NOT hold back. Likewise, regen(s) of the same prompt - even at the same settings - will create very different version(s) too. See FOUR examples below. Model retains full reasoning, and output generation of a Qwen3 MOE ; but has not been tested for "non-creative" use cases. Model is set with Qwen's default config: 40 k context 8 of 128 experts activated. Chatml OR Jinja Template (embedded) IMPORTANT: See usage guide / repo below to get the most out of this model, as settings are very specific. USAGE GUIDE: Please refer to this model card for Specific usage, suggested settings, changing ACTIVE EXPERTS, templates, settings and the like: How to maximize this model in "uncensored" form, with specific notes on "abliterated" models. Rep pen / temp settings specific to getting the model to perform strongly. https://huggingface.co/DavidAU/Qwen3-18B-A3B-Stranger-Thoughts-Abliterated-Uncensored-GGUF GGUF / QUANTS / SPECIAL SHOUTOUT: Special thanks to team Mradermacher for making the quants! https://huggingface.co/mradermacher/Qwen3-42B-A3B-Stranger-Thoughts-Deep20x-Abliterated-Uncensored-GGUF KNOWN ISSUES: Model may "mis-capitalize" word(s) - lowercase, where uppercase should be - from time to time. Model may add extra space from time to time before a word. Incorrect template and/or settings will result in a drop in performance / poor performance.

Repository: localaiLicense: apache-2.0

qwen3-22b-a3b-the-harley-quinn
WARNING: MADNESS - UN HINGED and... NSFW. Vivid prose. INTENSE. Visceral Details. Violence. HORROR. GORE. Swearing. UNCENSORED... humor, romance, fun. Qwen3-22B-A3B-The-Harley-Quinn This repo contains the full precision source code, in "safe tensors" format to generate GGUFs, GPTQ, EXL2, AWQ, HQQ and other formats. The source code can also be used directly. ABOUT: A stranger, yet radically different version of Kalmaze's "Qwen/Qwen3-16B-A3B" with the experts pruned to 64 (from 128, the Qwen 3 30B-A3B version) and then I added 19 layers expanding (Brainstorm 20x by DavidAU info at bottom of this page) the model to 22B total parameters. The goal: slightly alter the model, to address some odd creative thinking and output choices. Then... Harley Quinn showed up, and then it was a party! A wild, out of control (sometimes) but never boring party. Please note that the modifications affect the entire model operation; roughly I adjusted the model to think a little "deeper" and "ponder" a bit - but this is a very rough description. That being said, reasoning and output generation will be altered regardless of your use case(s). These modifications pushes Qwen's model to the absolute limit for creative use cases. Detail, vividiness, and creativity all get a boost. Prose (all) will also be very different from "default" Qwen3. Likewise, regen(s) of the same prompt - even at the same settings - will create very different version(s) too. The Brainstrom 20x has also lightly de-censored the model under some conditions. However, this model can be prone to bouts of madness. It will not always behave, and it will sometimes go -wildly- off script. See 4 examples below. Model retains full reasoning, and output generation of a Qwen3 MOE ; but has not been tested for "non-creative" use cases. Model is set with Qwen's default config: 40 k context 8 of 64 experts activated. Chatml OR Jinja Template (embedded) Four example generations below. IMPORTANT: See usage guide / repo below to get the most out of this model, as settings are very specific. If not set correctly, this model will not work the way it should. Critical settings: Chatml or Jinja Template (embedded, but updated version at repo below) Rep pen of 1.01 or 1.02 ; higher (1.04, 1.05) will result in "Harley Mode". Temp range of .6 to 1.2. ; higher you may need to prompt the model to "output" after thinking. Experts set at 8-10 ; higher will result in "odder" output BUT it might be better. That being said, "Harley Quinn" may make her presence known at any moment. USAGE GUIDE: Please refer to this model card for Specific usage, suggested settings, changing ACTIVE EXPERTS, templates, settings and the like: How to maximize this model in "uncensored" form, with specific notes on "abliterated" models. Rep pen / temp settings specific to getting the model to perform strongly. https://huggingface.co/DavidAU/Qwen3-18B-A3B-Stranger-Thoughts-Abliterated-Uncensored-GGUF GGUF / QUANTS / SPECIAL SHOUTOUT: Special thanks to team Mradermacher for making the quants! https://huggingface.co/mradermacher/Qwen3-22B-A3B-The-Harley-Quinn-GGUF KNOWN ISSUES: Model may "mis-capitalize" word(s) - lowercase, where uppercase should be - from time to time. Model may add extra space from time to time before a word. Incorrect template and/or settings will result in a drop in performance / poor performance. Can rant at the end / repeat. Most of the time it will stop on its own. Looking for the Abliterated / Uncensored version? https://huggingface.co/DavidAU/Qwen3-23B-A3B-The-Harley-Quinn-PUDDIN-Abliterated-Uncensored In some cases this "abliterated/uncensored" version may work better than this version. EXAMPLES Standard system prompt, rep pen 1.01-1.02, topk 100, topp .95, minp .05, rep pen range 64. Tested in LMStudio, quant Q4KS, GPU (CPU output will differ slightly). As this is the mid range quant, expected better results from higher quants and/or with more experts activated to be better. NOTE: Some formatting lost on copy/paste. WARNING: NSFW. Vivid prose. INTENSE. Visceral Details. Violence. HORROR. GORE. Swearing. UNCENSORED... humor, romance, fun.

Repository: localaiLicense: apache-2.0

planetoid_27b_v.2
This is a merge of pre-trained gemma3 language models Goal of this merge was to create good uncensored gemma 3 model good for assistant and roleplay, with uncensored vision. First, vision: i dont know is it normal, but it slightly hallucinate (maybe q3 is too low?), but lack any refusals and otherwise work fine. I used default gemma 3 27b mmproj. Second, text: it is slow on my hardware, slower than 24b mistral, speed close to 32b QWQ. Model is smart even on q3, responses are adequate in length and are interesting to read. Model is quite attentive to context, tested up to 8k - no problems or degradation spotted. (beware of your typos, it will copy yours mistakes) Creative capabilities are good too, model will create good plot for you, if you let it. Model follows instructions fine, it is really good in "adventure" type of cards. Russian is supported, is not too great, maybe on higher quants is better. Refusals was not encountered. However, i find this model not unbiased enough. It is close to neutrality, but i want it more "dark". Positivity highly depends on prompts. With good enough cards model can do wonders. Tested on Q3_K_L, t 1.04.

Repository: localaiLicense: gemma

mistral-small-3.2-46b-the-brilliant-raconteur-ii-instruct-2506
WARNING: MADNESS - UN HINGED and... NSFW. Vivid prose. INTENSE. Visceral Details. Violence. HORROR. GORE. Swearing. UNCENSORED... humor, romance, fun. Mistral-Small-3.2-46B-The-Brilliant-Raconteur-II-Instruct-2506 This repo contains the full precision source code, in "safe tensors" format to generate GGUFs, GPTQ, EXL2, AWQ, HQQ and other formats. The source code can also be used directly. ABOUT: A stronger, more creative Mistral (Mistral-Small-3.2-24B-Instruct-2506) extended to 79 layers, 46B parameters with Brainstorm 40x by DavidAU (details at very bottom of the page). This is version II, which has a jump in detail, and raw emotion relative to version 1. This model pushes Mistral's Instruct 2506 to the limit: Regens will be very different, even with same prompt / settings. Output generation will vary vastly on each generation. Reasoning will be changed, and often shorter. Prose, creativity, word choice, and general "flow" are improved. Several system prompts below help push this model even further. Model is partly de-censored / abliterated. Most Mistrals are more uncensored that most other models too. This model can also be used for coding too; even at low quants. Model can be used for all use cases too. As this is an instruct model, this model thrives on instructions - both in the system prompt and/or the prompt itself. One example below with 3 generations using Q4_K_S. Second example below with 2 generations using Q4_K_S. Quick Details: Model is 128k context, Jinja template (embedded) OR Chatml Template. Reasoning can be turned on/off (see system prompts below) and is OFF by default. Temp range .1 to 1 suggested, with 1-2 for enhanced creative. Above temp 2, is strong but can be very different. Rep pen range: 1 (off) or very light 1.01, 1.02 to 1.05. (model is sensitive to rep pen - this affects reasoning / generation length.) For creative/brainstorming use: suggest 2-5 generations due to variations caused by Brainstorm. Observations: Sometimes using Chatml (or Alpaca / others ) template (VS Jinja) will result in stronger creative generation. Model can be operated with NO system prompt; however a system prompt will enhance generation. Longer prompts, that more detailed, with more instructions will result in much stronger generations. For prose directives: You may need to add directions, because the model may follow your instructions too closely. IE: "use short sentences" vs "use short sentences sparsely". Reasoning (on) can lead to better creative generation, however sometimes generation with reasoning off is better. Rep pen of up to 1.05 may be needed on quants Q2k/q3ks for some prompts to address "low bit" issues. Detailed settings, system prompts, how to and examples below. NOTES: Image generation should also be possible with this model, just like the base model. Brainstorm was not applied to the image generation systems of the model... yet. This is Version II and subject to change / revision. This model is a slightly different version of: https://huggingface.co/DavidAU/Mistral-Small-3.2-46B-The-Brilliant-Raconteur-Instruct-2506

Repository: localaiLicense: apache-2.0

cognitivecomputations_dolphin-mistral-24b-venice-edition
Dolphin Mistral 24B Venice Edition is a collaborative project we undertook with Venice.ai with the goal of creating the most uncensored version of Mistral 24B for use within the Venice ecosystem. Dolphin Mistral 24B Venice Edition is now live on https://venice.ai/ as “Venice Uncensored,” the new default model for all Venice users. Dolphin aims to be a general purpose model, similar to the models behind ChatGPT, Claude, Gemini. But these models present problems for businesses seeking to include AI in their products. They maintain control of the system prompt, deprecating and changing things as they wish, often causing software to break. They maintain control of the model versions, sometimes changing things silently, or deprecating older models that your business relies on. They maintain control of the alignment, and in particular the alignment is one-size-fits all, not tailored to the application. They can see all your queries and they can potentially use that data in ways you wouldn't want. Dolphin, in contrast, is steerable and gives control to the system owner. You set the system prompt. You decide the alignment. You have control of your data. Dolphin does not impose its ethics or guidelines on you. You are the one who decides the guidelines. Dolphin belongs to YOU, it is your tool, an extension of your will. Just as you are personally responsible for what you do with a knife, gun, fire, car, or the internet, you are the creator and originator of any content you generate with Dolphin.

Repository: localaiLicense: apache-2.0

mistralai_magistral-small-2509-multimodal
Magistral Small 1.2 Building upon Mistral Small 3.2 (2506), with added reasoning capabilities, undergoing SFT from Magistral Medium traces and RL on top, it's a small, efficient reasoning model with 24B parameters. Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized. Learn more about Magistral in our blog post. The model was presented in the paper Magistral. Quantization from unsloth, using their recommended parameters as defaults and including mmproj for multimodality.

Repository: localaiLicense: apache-2.0

Page 1