The Livepeer AI Gateway exposes nine batch pipelines and one LLM pipeline through HTTP POST endpoints. Each pipeline accepts a JSON request body keyed by
model_id and pipeline-specific fields, and returns a JSON response with the result. Real-time video AI (live-video-to-video) runs through the trickle protocol and is covered separately in the real-time AI overview.
For warm models, VRAM requirements, and architecture support per pipeline, see model support. For SDK wrappers, see AI SDKs.
Shared conventions
Base URL: Any Livepeer Gateway endpoint. The community Gateway athttps://dream-gateway.livepeer.cloud accepts unauthenticated requests for development.
Authentication: Bearer token when the Gateway requires it. The community Gateway does not require a token.
Request format: POST /<pipeline-endpoint> with Content-Type: application/json.
model_id field: Every pipeline accepts a model_id field specifying the Hugging Face model ID (or Ollama model ID for LLM). Omitting model_id uses the pipeline’s default warm model.
Error responses: 400 for malformed requests, 422 for validation errors (invalid model_id, missing required fields), 500 for inference failures. Error bodies include a detail field with the failure reason.
Cold model latency: If no Orchestrator has the requested model warm in GPU memory, the first request triggers a model load (30 seconds to 5 minutes depending on model size). Subsequent requests to the same model on the same Orchestrator are immediate.
Pipeline reference
text-to-image
text-to-image
Generate images from text prompts using diffusion models (SDXL, SD 1.5, Flux).
Response: JSON object with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Hugging Face model ID. Default: SG161222/RealVisXL_V4.0_Lightning |
prompt | string | Yes | Text prompt for generation |
negative_prompt | string | No | Terms to avoid in generation |
width | integer | No | Output width in pixels (default: 1024) |
height | integer | No | Output height in pixels (default: 1024) |
guidance_scale | number | No | Classifier-free guidance scale (default: 7.5) |
num_inference_steps | integer | No | Denoising steps (default depends on model; Lightning models use 4-8) |
seed | integer | No | Random seed for reproducibility |
num_images_per_prompt | integer | No | Number of images to generate (default: 1) |
safety_check | boolean | No | Run NSFW safety filter (default: true) |
images array. Each image is a { url, seed } object.image-to-image
image-to-image
Transform images using style transfer, enhancement, or img2img diffusion.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: timbrooks/instruct-pix2pix |
image | file | Yes | Input image (multipart form upload) |
prompt | string | Yes | Transformation instruction |
strength | number | No | How much to transform (0.0 = no change, 1.0 = full regeneration) |
guidance_scale | number | No | Guidance scale (default: 7.5) |
num_inference_steps | integer | No | Denoising steps |
seed | integer | No | Random seed |
safety_check | boolean | No | NSFW filter (default: true) |
images array, same format as text-to-image.image-to-image uses
multipart/form-data, not application/json. The image is uploaded as a file field.image-to-video
image-to-video
Animate a still image into a short video clip using Stable Video Diffusion.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: stabilityai/stable-video-diffusion-img2vid-xt |
image | file | Yes | Input image (multipart form upload) |
fps | integer | No | Output frames per second (default: 6) |
motion_bucket_id | integer | No | Motion intensity (0-255; default: 127) |
seed | integer | No | Random seed |
safety_check | boolean | No | NSFW filter (default: true) |
frames array containing frame URLs, or a video URL.SVD outputs 14-25 frames at 576x1024 resolution. Text prompts are not used; the image is the sole conditioning input.
image-to-text
image-to-text
Generate captions or descriptions for images using BLIP or vision-language models.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: Salesforce/blip-image-captioning-large |
image | file | Yes | Input image (multipart form upload) |
prompt | string | No | Optional prompt to guide caption content |
text field containing the generated caption.audio-to-text
audio-to-text
Transcribe audio to text with per-chunk timestamps using Whisper.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: openai/whisper-large-v3 |
audio | file | Yes | Audio file (mp4, webm, mp3, flac, wav, m4a). Max 50 MB. |
text (full transcript) and chunks array (per-segment timestamps and text).text-to-speech
text-to-speech
Generate natural speech from text using Parler-TTS.
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: parler-tts/parler-tts-large-v1 |
text | string | Yes | Text to synthesise. Max ~600 characters; chunk longer text. |
description | string | No | Voice characteristics (speaker identity, style, audio quality) |
audio object containing a URL to the generated audio file.Requires a pipeline-specific AI Runner container. Not all Orchestrators have this pipeline active.
upscale
upscale
Upscale low-resolution images using the SD x4-Upscaler (4x super-resolution).
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: stabilityai/stable-diffusion-x4-upscaler |
image | file | Yes | Input image (multipart form upload) |
prompt | string | No | Optional quality guidance prompt |
seed | integer | No | Random seed |
safety_check | boolean | No | NSFW filter (default: true) |
images array, same format as text-to-image.segment-anything-2
segment-anything-2
Promptable visual segmentation for images using SAM 2 (Meta AI).
Response: JSON with
| Field | Type | Required | Description |
|---|---|---|---|
model_id | string | No | Default: facebook/sam2-hiera-large |
image | file | Yes | Input image |
point_coords | array | No | Point prompts as [[x,y], ...] |
point_labels | array | No | Labels for points (1 = foreground, 0 = background) |
box | array | No | Bounding box prompt [x1, y1, x2, y2] |
masks, scores, and logits arrays.llm
llm
OpenAI-compatible chat completions using Ollama-based runner.
Response: OpenAI-compatible chat completion object with
| Field | Type | Required | Description |
|---|---|---|---|
model | string | Yes | Ollama-compatible model ID |
messages | array | Yes | OpenAI-format message array (role + content) |
max_tokens | integer | No | Maximum output tokens |
temperature | number | No | Sampling temperature (0.0-2.0) |
stream | boolean | No | Stream response tokens (SSE) |
choices[0].message.content.The LLM pipeline is in beta. The request format follows the OpenAI
/v1/chat/completions shape. Supported models include Meta-Llama-3.1-8B-Instruct (warm, 8 GB VRAM), Mistral-7B-Instruct-v0.3, Gemma-2-9b-it, and Qwen2.5-7B-Instruct.Operational notes
Multipart vs JSON. Pipelines that accept file uploads (image-to-image, image-to-video, image-to-text, audio-to-text, upscale, segment-anything-2) usemultipart/form-data. Pipelines that accept only text input (text-to-image, text-to-speech, LLM) use application/json.
Gateway selection. The community Gateway routes to whichever Orchestrator in the Active Set has the requested model warm. For production, operate a self-hosted Gateway with -maxPricePerUnit to control costs, or use a Gateway provider with an API key.
safety_check filter. Enabled by default on image-generating pipelines. Set to false to disable. The filter runs on the Orchestrator side; disabling it does not affect content moderation policies that the Gateway operator may enforce.
The AI quickstart walks through the first inference call end-to-end with error handling.