LocalAI - Models

vllm-omni-wan2.2-i2v

Wan2.2-I2V-A14B via vLLM-Omni - Image-to-video generation model from Wan-AI. Generates high-quality videos from images using a 14B parameter diffusion model.

Links

https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers

Tags

longcat-video

LongCat-Video served by LocalAI's dedicated CUDA backend. Generates video from a text prompt or a start image. The SDPA attention path works without FlashAttention and is suitable for CUDA 13 ARM64 systems such as DGX Spark. This is a very large checkpoint (roughly 83 GB in Hugging Face storage) and requires Linux with an NVIDIA CUDA GPU plus substantial memory and disk.

Links

Tags

longcat-video-avatar-1.5

LongCat-Video-Avatar-1.5 served by LocalAI's dedicated CUDA backend. Turns speech plus a prompt into an avatar video, optionally conditioning on a portrait, and continues across multiple segments for longer audio. Avatar generation also loads tokenizer, text encoder, and VAE components from LongCat-Video. Plan for very large downloads and substantial NVIDIA GPU or unified memory; CPU and macOS execution are unsupported.

Links

Tags

ltx-2

**LTX-2** is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

**Key Features:**

- **Joint Audio-Video Generation**: Generates synchronized video and audio in a single model
- **Image-to-Video**: Converts static images into dynamic videos with matching audio
- **High Quality**: Produces realistic video with natural motion and synchronized audio
- **Open Weights**: Available under the LTX-2 Community License Agreement

**Model Details:**

- **Model Type**: Diffusion-based audio-video foundation model
- **Architecture**: DiT (Diffusion Transformer) based
- **Developed by**: Lightricks
- **Paper**: [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://arxiv.org/abs/2601.03233)

**Usage Tips:**

- Width and height must each be divisible by 32
- Frame count must be a multiple of 8 plus 1 (e.g., 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121)
- Recommended settings: width=768, height=512, num_frames=121, frame_rate=24.0
- For best results, use detailed prompts describing motion and scene dynamics

**Limitations:**

- This model is not intended or able to provide factual information
- Prompt following is heavily influenced by the prompting style
- When generating audio without speech, the audio may be of lower quality

**Citation:**

```bibtex
@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and others},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}
```
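The size constraints in the usage tips can be checked before a generation request is sent. The following is a minimal Python sketch, not part of LocalAI or the LTX-2 tooling: the function names are hypothetical, and the minimum of 9 frames is an assumption taken from the example list rather than a documented limit.

```python
def is_valid_ltx_size(width: int, height: int, num_frames: int) -> bool:
    """Check the constraints from the usage tips.

    Width and height must be positive multiples of 32, and the frame
    count must be a multiple of 8 plus 1. The lower bound of 9 frames
    is an assumption based on the listed examples.
    """
    return (
        width > 0 and width % 32 == 0
        and height > 0 and height % 32 == 0
        and num_frames >= 9 and num_frames % 8 == 1
    )


def snap_ltx_size(width: int, height: int, num_frames: int) -> tuple[int, int, int]:
    """Round each value to the nearest allowed one (hypothetical helper)."""
    def nearest_multiple(value: int, step: int) -> int:
        # Never go below one full step, so the result stays positive.
        return max(step, round(value / step) * step)

    # Frame counts have the form 8*k + 1, so round k and add 1 back.
    frames = max(9, round((num_frames - 1) / 8) * 8 + 1)
    return nearest_multiple(width, 32), nearest_multiple(height, 32), frames


print(is_valid_ltx_size(768, 512, 121))   # the recommended settings
print(snap_ltx_size(1000, 500, 120))      # arbitrary request, snapped
```

Snapping to the nearest allowed value is only one possible policy; a caller that must not exceed a requested size would round down instead.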

Links

https://huggingface.co/Lightricks/LTX-2

Tags

wan-2.1-i2v-14b-480p-ggml

Wan 2.1 I2V 14B 480P - image-to-video diffusion, GGUF Q4 quantization. Animates a reference image into a 33-frame 480p clip. Requires more RAM than the 1.3B T2V variant; CPU offload is enabled by default.

Links

https://huggingface.co/city96/Wan2.1-I2V-14B-480P-gguf

Tags

wan-2.1-flf2v-14b-720p-ggml

Wan 2.1 FLF2V 14B 720P - first-last-frame-to-video diffusion, GGUF Q4_K_M. Takes a start and an end reference image and interpolates a 33-frame clip between them. Unlike the plain I2V variant, this model also feeds the end frame through clip_vision, so it conditions semantically (not just in pixel space) on both endpoints. That makes it the right choice for seamless loops (start_image == end_image) and clean narrative cuts. Native 720p, but it also accepts 480p resolutions; shares the same VAE, umt5_xxl text encoder, and clip_vision_h as I2V 14B.

Links

https://huggingface.co/city96/Wan2.1-FLF2V-14B-720P-gguf

Tags

wan-2.1-i2v-14b-720p-ggml

Wan 2.1 I2V 14B 720P - image-to-video diffusion, GGUF Q4_K_M. Native 720p sibling of the 480p I2V model: animates a single reference image into a 33-frame clip at up to 1280x720. Trained purely as image-to-video (no first-last-frame interpolation path), so motion is freer and better suited to single-anchor animation than repurposing the FLF2V 720P variant for I2V. Shares the same VAE, umt5_xxl text encoder, and clip_vision_h as the I2V 14B 480P and FLF2V 14B 720P entries.

Links

https://huggingface.co/city96/Wan2.1-I2V-14B-720P-gguf

Tags

ltx-2.3

**LTX-2.3** is an improved DiT-based audio-video foundation model from Lightricks, building upon the LTX-2 architecture with enhanced capabilities for generating synchronized video and audio within a single model.

**Key Features:**

- **Joint Audio-Video Generation**: Generates synchronized video and audio in a single model
- **Image-to-Video**: Converts static images into dynamic videos with matching audio
- **Enhanced Quality**: Improved video quality and motion generation over LTX-2
- **Open Weights**: Available under the LTX-2 Community License Agreement

**Model Details:**

- **Model Type**: Diffusion-based audio-video foundation model
- **Architecture**: DiT (Diffusion Transformer) based
- **Developed by**: Lightricks
- **Parent Model**: LTX-2

**Usage Tips:**

- Width and height must each be divisible by 32
- Frame count must be a multiple of 8 plus 1 (e.g., 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121)
- Recommended settings: width=768, height=512, num_frames=121, frame_rate=24.0
- For best results, use detailed prompts describing motion and scene dynamics

**Limitations:**

- This model is not intended or able to provide factual information
- Prompt following is heavily influenced by the prompting style
- When generating audio without speech, the audio may be of lower quality

Links

https://huggingface.co/Lightricks/LTX-2.3

Tags

ltx-2.3-22b-dev-ggml

LTX-2.3 22B dev - DiT-based audio-video foundation model from Lightricks, GGUF-quantized for the stable-diffusion.cpp backend. Generates synchronized video and audio from a text prompt (T2V), a reference image (I2V), or a first/last frame pair (FLF2V). Uses gemma-3-12b-it as the text encoder and ships dedicated video and audio VAEs plus an embeddings_connectors safetensors file that bridges the LLM hidden states to the diffusion model. This entry uses the dynamic (UD) Q4_K_M quantization of the 22B model (~16 GB) paired with the UD-Q4_K_XL QAT Gemma encoder (~7.4 GB). Recommended generation settings: width=1280, height=720, video_frames=33, fps=24, sampler=euler, cfg_scale=6.0.
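As a quick sanity check on the recommended settings, clip length follows directly from the frame count and frame rate. The sketch below is plain Python arithmetic; the dictionary simply mirrors the parameter names in the description and is not a LocalAI or stable-diffusion.cpp request schema, and the helper name is hypothetical.

```python
def clip_duration_seconds(video_frames: int, fps: float) -> float:
    """Return the playback length of a clip in seconds."""
    if video_frames <= 0 or fps <= 0:
        raise ValueError("video_frames and fps must be positive")
    return video_frames / fps


# Values copied from the recommended settings in this entry.
recommended = {
    "width": 1280,
    "height": 720,
    "video_frames": 33,
    "fps": 24,
    "sampler": "euler",
    "cfg_scale": 6.0,
}

duration = clip_duration_seconds(recommended["video_frames"], recommended["fps"])
print(f"{duration:.3f} s")  # 33 / 24 = 1.375 s
```

At 24 fps, the 33-frame default therefore yields a clip of just under a second and a half.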

Links

Tags

ltx-2.3-22b-dev-ggml-q4_k_m

LTX-2.3 22B dev - non-dynamic Q4_K_M quantization (~14.3 GB). Same pipeline as ltx-2.3-22b-dev-ggml but with the plain Q4_K_M weights instead of the dynamic UD-Q4_K_M variant. Slightly smaller than the UD variant, with slightly lower quality.

Links

Tags

ltx-2.3-22b-dev-ggml-q8_0

LTX-2.3 22B dev - Q8_0 quantization (~22.8 GB). Highest-quality quantized dev variant on the cpp backend; needs noticeably more VRAM/RAM than the Q4 entries (about 22.8 GB of weights versus roughly 14 to 16 GB) but produces noticeably cleaner audio and motion. Paired with the QAT Gemma-3 12B encoder.

Links

Tags

ltx-2.3-22b-distilled-ggml

LTX-2.3 22B distilled - faster student of the dev model, GGUF-quantized for the stable-diffusion.cpp backend. Trades a small amount of quality for substantially fewer sampling steps, making it the right pick for iterative previews and CPU-offloaded inference. Same input modalities as the dev entry (T2V / I2V / FLF2V) and the same gemma-3-12b-it text encoder. This entry uses the dynamic (UD) Q4_K_M quantization of the 22B distilled model (~16.3 GB). Recommended generation settings: width=1280, height=720, video_frames=33, fps=24, sampler=euler, cfg_scale=6.0.

Links

Tags

ltx-2.3-22b-distilled-ggml-q4_k_m

LTX-2.3 22B distilled - non-dynamic Q4_K_M quantization (~14.3 GB). Same pipeline as ltx-2.3-22b-distilled-ggml but with the plain Q4_K_M weights instead of the dynamic UD-Q4_K_M variant.

Links

Tags

ltx-2.3-22b-distilled-ggml-q8_0

LTX-2.3 22B distilled - Q8_0 quantization (~22.8 GB). Highest-quality distilled variant on the cpp backend; useful when you want the distilled sampling cost but the cleanest possible output.
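The six LTX-2.3 22B GGUF entries differ mainly in weight size, so choosing one is largely a memory-budget question. The sketch below picks the largest variant whose weights fit a given budget, using the approximate sizes quoted in the descriptions. It is a rough illustration under stated assumptions: the helper is hypothetical, the ~7.4 GB encoder size is quoted only for the UD dev entry and is assumed here for every variant, and VAEs, activations, and runtime overhead are ignored.

```python
# Approximate weight sizes in GB, copied from the entry descriptions.
MODEL_SIZES_GB = {
    "ltx-2.3-22b-dev-ggml": 16.0,
    "ltx-2.3-22b-dev-ggml-q4_k_m": 14.3,
    "ltx-2.3-22b-dev-ggml-q8_0": 22.8,
    "ltx-2.3-22b-distilled-ggml": 16.3,
    "ltx-2.3-22b-distilled-ggml-q4_k_m": 14.3,
    "ltx-2.3-22b-distilled-ggml-q8_0": 22.8,
}

# Assumption: the same ~7.4 GB Gemma encoder accompanies every variant.
ENCODER_GB = 7.4


def largest_fitting_variant(budget_gb, family="dev", include_encoder=True):
    """Return the largest variant of `family` that fits, or None.

    Larger weights are used as a proxy for quality, which matches the
    ordering given in the descriptions (Q8_0 > UD Q4_K_M > plain Q4_K_M).
    """
    overhead = ENCODER_GB if include_encoder else 0.0
    candidates = [
        (size, name)
        for name, size in MODEL_SIZES_GB.items()
        if f"-{family}-" in name and size + overhead <= budget_gb
    ]
    return max(candidates)[1] if candidates else None


print(largest_fitting_variant(24.0))               # ltx-2.3-22b-dev-ggml
print(largest_fitting_variant(32.0, "distilled"))  # ltx-2.3-22b-distilled-ggml-q8_0
print(largest_fitting_variant(16.0))               # None
```

Real memory use will be higher than the weight totals used here, so the budget passed in should leave headroom.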

Links

Tags
