How to Use Ovis Image - Text-to-Image Quickstart and Tips

Dec 1, 2024

Ovis-Image is a 7B text-to-image model built on the Ovis 2.5 multimodal backbone. It is tuned for clear, accurate text inside images while staying light enough for a single high-end GPU.

Fast start with Diffusers (2 steps)

  1. Install the Diffusers build that includes Ovis-Image support:
pip install git+https://github.com/DoctorKey/diffusers.git@ovis-image
  2. Run the minimal Python snippet:
import torch
from diffusers import OvisImagePipeline

pipe = OvisImagePipeline.from_pretrained(
    "AIDC-AI/Ovis-Image-7B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = (
    'A creative 3D artistic render where the text "OVIS-IMAGE" '
    "is written in a bold, expressive handwritten brush style using thick, wet oil paint. "
    "The paint is a mix of vibrant rainbow colors swirling together like toothpaste. "
    "The background is a clean artist's canvas with soft shadows and glossy texture, 4k detail."
)

image = pipe(prompt, negative_prompt="", num_inference_steps=50, true_cfg_scale=5.0).images[0]
image.save("ovis_image.png")
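To compare a few sampling settings side by side, the same call can be wrapped in a small loop. This is a minimal sketch, not part of the official examples: the helper names sweep_settings and run_sweep, the chosen step and scale values, and the output file naming are all assumptions for illustration. It only passes the keyword arguments already used in the snippet above.

```python
from itertools import product


def sweep_settings(steps=(35, 50), cfg_scales=(4.0, 5.0)):
    """Return one dict of sampling settings per (steps, scale) combination."""
    return [
        {"num_inference_steps": s, "true_cfg_scale": c}
        for s, c in product(steps, cfg_scales)
    ]


def run_sweep(pipe, prompt, settings, prefix="ovis_image"):
    """Generate one image per setting and save it under a descriptive name.

    `pipe` is expected to be the already-loaded pipeline from the snippet above.
    """
    paths = []
    for setting in settings:
        image = pipe(prompt, negative_prompt="", **setting).images[0]
        path = (
            f"{prefix}_steps{setting['num_inference_steps']}"
            f"_cfg{setting['true_cfg_scale']}.png"
        )
        image.save(path)
        paths.append(path)
    return paths
```

With the pipeline loaded, run_sweep(pipe, prompt, sweep_settings()) would write four files, which makes it easier to judge how much text sharpness each setting buys.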

Official PyTorch script (repo workflow)

This follows the commands published by the authors.

git clone https://github.com/AIDC-AI/Ovis-Image.git
conda create -n ovis-image python=3.10 -y
conda activate ovis-image
cd Ovis-Image
pip install -r requirements.txt
pip install -e .

python ovis_image/test.py \
  --model_path AIDC-AI/Ovis-Image-7B/ovis_image.safetensors \
  --vae_path AIDC-AI/Ovis-Image-7B/ae.safetensors \
  --ovis_path AIDC-AI/Ovis-Image-7B/Ovis2.5-2B \
  --image_size 1024 \
  --denoising_steps 50 \
  --cfg_scale 5.0 \
  --prompt "A creative 3D artistic render where the text \"OVIS-IMAGE\" is written in a bold, expressive handwritten brush style using thick, wet oil paint. The background is a clean artist's canvas with soft shadows, 4k detail."
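Escaping quotes inside a long --prompt string gets error-prone in the shell. One way around it is to assemble the same command as an argument list in Python. The sketch below is an assumption-laden convenience, not something shipped in the repo: build_test_command is a hypothetical helper, and it only reuses the flags and file names shown in the command above.

```python
import shlex


def build_test_command(prompt, model_dir="AIDC-AI/Ovis-Image-7B",
                       image_size=1024, steps=50, cfg_scale=5.0):
    """Assemble the argv list for ovis_image/test.py from the flags shown above."""
    return [
        "python", "ovis_image/test.py",
        "--model_path", f"{model_dir}/ovis_image.safetensors",
        "--vae_path", f"{model_dir}/ae.safetensors",
        "--ovis_path", f"{model_dir}/Ovis2.5-2B",
        "--image_size", str(image_size),
        "--denoising_steps", str(steps),
        "--cfg_scale", str(cfg_scale),
        "--prompt", prompt,  # passed as one argument, so no manual escaping
    ]


cmd = build_test_command('A poster with the text "OVIS-IMAGE" in bold brush lettering.')
print(shlex.join(cmd))  # shell-safe string, handy for logging
# To run it from the repo root: subprocess.run(cmd, check=True)
```

Because the prompt travels as a single list element, embedded double quotes and apostrophes reach the script unchanged.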

Prompt and sampling tips (for crisp text)

  • Keep 45-50 denoising steps for clean edges; you can drop to around 35 for speed, but text sharpness may soften.
  • A true_cfg_scale around 5.0 balances prompt adherence and natural textures.
  • For text-heavy layouts (posters, banners, UI mockups), describe font weight, material, and background simplicity explicitly.
  • Stick to 1024-pixel outputs (the --image_size 1024 setting used in the script above) when you want the best legibility; wider aspect ratios also work but may need higher step counts.
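The advice on text-heavy layouts can be turned into a small prompt template. The sketch below is purely illustrative: build_text_prompt and its default wording are assumptions, not a format the model requires. It simply makes sure every prompt names the exact text, the font weight, the material, and a simple background.

```python
def build_text_prompt(text, font_weight="bold", material="matte ink",
                      background="plain white background", extra=""):
    """Compose a prompt that states the rendered text, its weight and material,
    and a simple background, following the tips above."""
    parts = [
        f'The text "{text}" rendered in {font_weight} lettering made of {material}',
        f"set against a {background}",
    ]
    if extra:
        parts.append(extra)  # optional style notes, e.g. lighting or detail level
    return ", ".join(parts) + "."


print(build_text_prompt(
    "GRAND OPENING",
    font_weight="heavy sans-serif",
    material="glossy red enamel",
    background="clean cream paper background",
))
```

Keeping the quoted text first and the background last is a stylistic choice here; adjust the template to whatever ordering works best in your own tests.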

Try it without setup

  • Hugging Face Space: AIDC-AI/Ovis-Image-7B lets you generate in the browser.
  • FAL playground: fal-ai/ovis-image provides hosted inference with a simple form input.

Hardware and model notes

  • The model runs in BF16 and was profiled on single high-end GPUs; for the fastest results, plan for a recent GPU with enough VRAM to hold the full pipeline in memory.
  • Ovis-Image prioritizes bilingual text rendering quality while keeping a compact 7B parameter budget, so it is a good fit for posters, UI comps, and any scene where words must remain legible.
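For a rough sense of what "plenty of VRAM" means, a back-of-the-envelope estimate of the weight footprint helps. The sketch below assumes the 7B figure quoted above and 2 bytes per parameter for BF16; weight_memory_gib is a hypothetical helper. It counts weights only, so any additional pipeline components, activations, and framework overhead will push real usage higher.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GiB (BF16 stores 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1024**3


print(f"{weight_memory_gib(7e9):.1f} GiB")  # 13.0 GiB
```

Treat the result as a lower bound when picking a GPU, not as a measured requirement.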