Model Gallery

9 models from 1 repository

vllm-omni-wan2.2-t2v

Wan2.2-T2V-A14B via vLLM-Omni - Text-to-video generation model from Wan-AI. Generates high-quality videos from text prompts using a 14B parameter diffusion model.

Repository: localai | License: apache-2.0

longcat-video

LongCat-Video served by LocalAI's dedicated CUDA backend. Generates video from a text prompt or a start image. The SDPA attention path works without FlashAttention and is suitable for CUDA 13 ARM64 systems such as DGX Spark. This is a very large checkpoint (roughly 83 GB in Hugging Face storage) and requires Linux with an NVIDIA CUDA GPU plus substantial memory and disk.

Repository: localai | License: mit

wan-2.1-t2v-1.3b-ggml

Wan 2.1 T2V 1.3B - text-to-video diffusion model, GGUF-quantized for the stable-diffusion.cpp backend. Generates short (33-frame) 832x480 clips from a text prompt. This is the cheapest Wan variant to run and is suitable for CPU-offloaded inference with ~10 GB of usable RAM.

Repository: localai | License: apache-2.0

ltx-2.3-22b-dev-ggml

LTX-2.3 22B dev - DiT-based audio-video foundation model from Lightricks, GGUF-quantized for the stable-diffusion.cpp backend. Generates synchronized video and audio from a text prompt (T2V), a reference image (I2V), or first/last frame pairs (FLF2V). Uses gemma-3-12b-it as the text encoder and ships dedicated video and audio VAEs plus an embeddings_connectors safetensors file that bridges the LLM hidden states to the diffusion model. This entry uses the dynamic (UD) Q4_K_M quantization of the 22B model (~16 GB) paired with the UD-Q4_K_XL QAT Gemma encoder (~7.4 GB). Recommended generation settings: width=1280, height=720, video_frames=33, fps=24, sampler=euler, cfg_scale=6.0.
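To make the recommended settings concrete, here is a minimal Python sketch that records them and works out what they imply: the clip length in seconds and the approximate size of the two quantized weight files named in this entry. The class and function names (GenerationSettings, approx_weights_gb) are hypothetical and are not part of LocalAI or stable-diffusion.cpp. The size figure covers only the diffusion model and the Gemma encoder, because the entry gives no sizes for the VAEs or the embeddings_connectors file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Illustrative container for the values listed in the gallery entry."""
    width: int
    height: int
    video_frames: int
    fps: int
    sampler: str
    cfg_scale: float

    def duration_seconds(self) -> float:
        # Clip length is simply the frame count divided by the frame rate.
        return self.video_frames / self.fps

    def total_pixels(self) -> int:
        # Pixels generated across the whole clip (a rough proxy for workload).
        return self.width * self.height * self.video_frames


# Values copied from the "Recommended generation settings" in the entry.
LTX_RECOMMENDED = GenerationSettings(
    width=1280,
    height=720,
    video_frames=33,
    fps=24,
    sampler="euler",
    cfg_scale=6.0,
)

# Approximate file sizes quoted in the entry (GB).
DIFFUSION_MODEL_GB = 16.0   # UD Q4_K_M quantization of the 22B model
TEXT_ENCODER_GB = 7.4       # UD-Q4_K_XL QAT Gemma encoder


def approx_weights_gb() -> float:
    # Lower bound only: VAEs and the connectors file are not included.
    return DIFFUSION_MODEL_GB + TEXT_ENCODER_GB


if __name__ == "__main__":
    print(f"clip length: {LTX_RECOMMENDED.duration_seconds():.3f} s")
    print(f"pixels per clip: {LTX_RECOMMENDED.total_pixels():,}")
    print(f"weights (lower bound): {approx_weights_gb():.1f} GB")
```

With these values, 33 frames at 24 fps is a clip of about 1.4 seconds, and the two weight files alone come to roughly 23.4 GB of downloads before the VAEs and connectors are counted.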