Model Gallery

5 models from 1 repository

ace-step-turbo
ACE-Step 1.5 Turbo is a music generation model that creates music from text descriptions, lyrics, or audio samples. It supports both simple text-to-music prompting and advanced generation controlled by metadata such as BPM, key/scale, and time signature.

Repository: localai
License: mit
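A minimal sketch of driving this model through a LocalAI server. The endpoint path (`/sound-generation`), payload fields, and default port below are assumptions, not a documented contract; verify them against your LocalAI version's API docs:

```python
# Sketch: text-to-music request against a local LocalAI instance.
# NOTE: endpoint path, payload fields, and response handling are assumptions.
import requests

BASE_URL = "http://localhost:8080"  # assumed default LocalAI address

payload = {
    "model": "ace-step-turbo",
    # Simple text-to-music prompt; advanced metadata (BPM, key/scale,
    # time signature) would go in whatever fields the backend documents.
    "input": "an upbeat lo-fi hip hop track with mellow piano chords",
}

resp = requests.post(f"{BASE_URL}/sound-generation", json=payload, timeout=600)
resp.raise_for_status()

with open("track.wav", "wb") as f:
    f.write(resp.content)  # assumes the server returns raw audio bytes
```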

ltx-2
**LTX-2** is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

**Key Features:**
- **Joint Audio-Video Generation**: Generates synchronized video and audio in a single model
- **Image-to-Video**: Converts static images into dynamic videos with matching audio
- **High Quality**: Produces realistic video with natural motion and synchronized audio
- **Open Weights**: Available under the LTX-2 Community License Agreement

**Model Details:**
- **Model Type**: Diffusion-based audio-video foundation model
- **Architecture**: DiT (Diffusion Transformer) based
- **Developed by**: Lightricks
- **Paper**: [LTX-2: Efficient Joint Audio-Visual Foundation Model](https://arxiv.org/abs/2601.03233)

**Usage Tips:**
- Width and height must be divisible by 32
- Frame count must be a multiple of 8 plus 1 (e.g., 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 105, 113, 121); a validation sketch follows this entry
- Recommended settings: width=768, height=512, num_frames=121, frame_rate=24.0
- For best results, use detailed prompts describing motion and scene dynamics

**Limitations:**
- This model is not intended or able to provide factual information
- Prompt following is heavily influenced by the prompting style
- When generating audio without speech, the audio may be of lower quality

**Citation:**
```bibtex
@article{hacohen2025ltx2,
  title={LTX-2: Efficient Joint Audio-Visual Foundation Model},
  author={HaCohen, Yoav and Brazowski, Benny and Chiprut, Nisan and others},
  journal={arXiv preprint arXiv:2601.03233},
  year={2025}
}
```

Repository: localai
License: ltx-2-community-license-agreement
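The size constraints above are easy to get wrong, so a small helper (the function name is ours, the rules come straight from the usage tips) can validate settings before a job is submitted:

```python
# Checks LTX-2's documented size constraints: width/height divisible
# by 32, frame count of the form 8*n + 1 (9, 17, ..., 121).
def validate_ltx2_params(width: int, height: int, num_frames: int) -> None:
    if width % 32 or height % 32:
        raise ValueError(
            f"width/height must be divisible by 32, got {width}x{height}"
        )
    if (num_frames - 1) % 8:
        raise ValueError(
            f"num_frames must be 8*n + 1 (e.g., 9, 17, ..., 121), got {num_frames}"
        )

# The recommended defaults from the model card pass the checks:
validate_ltx2_params(width=768, height=512, num_frames=121)
```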

gemma-3n-e2b-it
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices. They accept multimodal input (text, image, video, and audio) and generate text output, with open weights for both pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages. Gemma 3n models use selective parameter activation technology to reduce resource requirements, which allows them to operate at an effective size of 2B (E2B) or 4B (E4B) parameters, lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page. This entry is the instruction-tuned E2B variant.

Repository: localai
License: gemma
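LocalAI exposes an OpenAI-compatible chat API, so a text query against this model can look like the sketch below (the base URL assumes LocalAI's common default port; swap the model name to `gemma-3n-e4b-it` for the larger variant in the next entry):

```python
# Sketch: chat completion against a local LocalAI server.
from openai import OpenAI

# LocalAI does not require a real API key; the base URL is an assumption.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gemma-3n-e2b-it",
    messages=[
        {"role": "user", "content": "Summarize what selective parameter activation does."}
    ],
)
print(resp.choices[0].message.content)
```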

gemma-3n-e4b-it
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices. They accept multimodal input (text, image, video, and audio) and generate text output, with open weights for both pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages. Gemma 3n models use selective parameter activation technology to reduce resource requirements, which allows them to operate at an effective size of 2B (E2B) or 4B (E4B) parameters, lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page. This entry is the instruction-tuned E4B variant.

Repository: localai
License: gemma

ultravox-v0_5-llama-3_1-8b
Ultravox is a multimodal Speech LLM built around pretrained Llama3.1-8B-Instruct and whisper-large-v3-turbo backbones; see https://ultravox.ai for the GitHub repo and more information. It consumes both speech and text as input (e.g., a text system prompt plus a voice user message): the text prompt contains a special <|audio|> pseudo-token, which the model processor replaces with embeddings derived from the input audio, and the model then generates output text as usual from the merged embeddings. In a future revision of Ultravox, the developers plan to expand the token vocabulary to support generation of semantic and acoustic audio tokens, which could then be fed to a vocoder to produce voice output. No preference tuning has been applied to this revision of the model.

Repository: localai
License: llama3.1
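A sketch of local inference following the usage pattern published on the upstream Hugging Face model card (fixie-ai/ultravox-v0_5-llama-3_1-8b); the audio filename and system prompt are placeholders of ours, and the exact input keys should be verified against the card for your transformers version:

```python
# Sketch: speech-plus-text inference with Ultravox via transformers.
import transformers
import librosa

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    trust_remote_code=True,  # Ultravox ships custom processor/model code
)

# The Whisper-based audio front end expects 16 kHz mono audio.
audio, sr = librosa.load("question.wav", sr=16000)  # placeholder file

# The processor injects audio embeddings where <|audio|> appears in the prompt.
turns = [
    {"role": "system", "content": "You are a helpful voice assistant."},
]
out = pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64)
print(out)
```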