Grok Imagine 1.5 nails photo-to-reel with audio

Grok Imagine 1.5 just took #1 on the Image-to-Video Arena — and it's the fastest path from one still photo to a 720p reel with sound.

One photo. A short note on how the camera should move. Seventeen seconds later, you have a 720p reel with sound — and right now, no other image-to-video model does that better.

  • xAI's Grok Imagine Video 1.5 Preview just took #1 on the Image-to-Video Arena leaderboard — +52 Elo over its predecessor, ahead of Seedance 2.0, HappyHorse 1.0, and Google Veo.
  • A 10-second 720p clip renders in about 17 seconds, with synchronized audio baked in. Sora, Runway, and Kling 2.6 still don't generate audio natively.
  • It's the shortest path from one still — your selfie, a product shot, an identity-locked AI avatar render — to a publishable short-form reel.

For image-to-video, the leaderboard quietly changed this week. A new preview model from xAI now sits at the top, and the gap with the previous #1 — Seedance 2.0 — comes down to two things creators feel immediately: speed and sound.

The selfie you already shot, the AI avatar you spent a weekend locking, the product photo your e-commerce app spits out — Grok Imagine 1.5 takes any of those and animates them with motion and matched audio in seconds. You don't get a long film. You get a short reel that's good enough to post.

What's new with Grok Imagine 1.5

[Image: a mid-50s bald creator with a grey beard in a podcast booth, holding a phone that shows a still photo turning into a short reel]

Three things ship with this version that change the math for a short-form creator. First, image-to-video quality — the model now preserves face accuracy and character consistency through motion, the area where most image-to-video tools still wash the subject out. Second, native synchronized audio — dialogue, ambient sound, motion-matched effects, mood-fitting background music, all generated alongside the video instead of stitched on after. Third, generation speed — roughly 17 seconds for a 10-second 720p clip, versus several minutes for Sora.

There are honest constraints. Output caps at 720p / 24fps right now, with 1080p on the roadmap. Quality visibly degrades after 2–3 chained Extend-from-Frame jumps, so long sequences still need a real editor. And the consumer rollout across X Premium tiers is staged — the API at api.x.ai is the surface you have today.

Pricing is per-second and easy to model: $0.08/sec for 480p, $0.14/sec for 720p, plus $0.01 per input image. A 10-second 720p image-to-video clip costs $1.41. The model is available in us-east-1 and eu-west-1, capped at 60 requests per minute.
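
To model that pricing for your own clip lengths, here is a minimal sketch that uses only the rates quoted above. The function name `clip_cost` and its parameters are illustrative, not part of any xAI SDK, and it assumes billing is a flat per-second rate plus the per-image fee with no rounding or minimums.

```python
from decimal import Decimal

# Per-second rates and per-image fee as quoted in this article (USD).
RATE_PER_SECOND = {"480p": Decimal("0.08"), "720p": Decimal("0.14")}
INPUT_IMAGE_FEE = Decimal("0.01")


def clip_cost(seconds: int, resolution: str = "720p", input_images: int = 1) -> Decimal:
    """Estimated USD cost of one image-to-video clip (hypothetical helper)."""
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if seconds <= 0 or input_images < 0:
        raise ValueError("seconds must be positive and input_images non-negative")
    # Assumption: cost is linear in clip length, plus a flat fee per input image.
    return RATE_PER_SECOND[resolution] * seconds + INPUT_IMAGE_FEE * input_images


print(clip_cost(10, "720p"))  # 1.41
print(clip_cost(10, "480p"))  # 0.81
```

`Decimal` keeps the cents exact, which matters once you start summing dozens of clips.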

Receipts: The Decoder's release brief, JXP's benchmark write-up, and EvoLink's API spec review.

How to turn a photo into a reel

The following images were generated using Nano Banana 2:

  1. Pick a still that's already locked. Don't ask Grok Imagine to fix face drift on a photo that's already shaky — it won't. Start from an identity-consistent reference: an AI avatar render with the face you trained, a clean product shot, or a selfie with even key light. Centered subject, uncluttered background. Complex backgrounds with multiple moving elements drop consistency hard.

  2. Write the prompt in five layers. Weak results describe only the scene. Strong results cover Scene → Camera → Style and lighting → Motion → Audio. Three layers minimum; all five for the best output.

[Image: macro product shot of a matte-black smartwatch on wet polished glass with a ring of water, white seamless backdrop, bright studio lighting]

Image-to-video prompt (Grok Imagine 1.5):
A matte-black smartwatch on a wet glass surface, thin ring of water circling the base, screen waking up with a clean pulse. Macro close-up, slow upward tilt, studio lighting, subtle digital tone.

  3. Default to 480p for drafts. Generate the concept at $0.08/sec first. If the camera move lands and the motion feels right, regenerate at 720p. You'll save real money once you start iterating in volume.

[Image: a streetwear creator in a vintage band tee and bomber jacket outside a glowing 24-hour convenience store at night, neon reflections on wet asphalt, flicking open a silver lighter and looking into camera]

Image-to-video prompt (Grok Imagine 1.5):
A streetwear creator steps out of a glowing convenience store at night, looks into camera, flicks open a silver lighter without lighting it. Slow handheld push-in, neon reflections on wet pavement, lo-fi ambient audio.

  4. Use Extend from Frame, but cap it at two jumps. The feature picks up exactly where the previous clip ended: same lighting, same character pose, same motion direction. Past two extensions, the quality drift is visible. For anything longer than ~30 seconds, export the clips and cut them in a real editor.

  5. Sound check before you post. Native audio is the headline feature, but it can still pick the wrong ambient sound. Listen on phone speakers, not headphones. If the music doesn't fit, regenerate the same clip; the audio is reseeded each time.
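
The five-layer structure from the prompt-writing step is easy to enforce in a small helper. This is a sketch, not an official prompt format: `PromptLayers` and `build_prompt` are made-up names, and joining the layers as plain sentences is an assumption about what works, not something xAI documents.

```python
from dataclasses import dataclass
from typing import Optional

# Layer order from the article: Scene, Camera, Style and lighting, Motion, Audio.
LAYER_ORDER = ("scene", "camera", "style", "motion", "audio")
MIN_LAYERS = 3  # "Three layers minimum; all five for the best output."


@dataclass
class PromptLayers:
    scene: str
    camera: Optional[str] = None
    style: Optional[str] = None
    motion: Optional[str] = None
    audio: Optional[str] = None


def build_prompt(layers: PromptLayers) -> str:
    """Join the filled-in layers, in order, into one prompt string."""
    parts = []
    for name in LAYER_ORDER:
        text = getattr(layers, name)
        if text and text.strip():
            # Drop a trailing period so the join below does not double it.
            parts.append(text.strip().rstrip("."))
    if len(parts) < MIN_LAYERS:
        raise ValueError(f"only {len(parts)} layer(s) filled in; need at least {MIN_LAYERS}")
    return ". ".join(parts) + "."


smartwatch = PromptLayers(
    scene="A matte-black smartwatch on a wet glass surface, thin ring of water circling the base",
    camera="Macro close-up, slow upward tilt",
    style="Studio lighting",
    motion="Screen waking up with a clean pulse",
    audio="Subtle digital tone",
)
print(build_prompt(smartwatch))
```

The point of the helper is the guardrail: a prompt with only a scene and a camera move raises an error instead of quietly producing a weak clip.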

Grok Imagine 1.5 is the right tool for the short-form variants you actually post: fifteen-second reels, talking-head openers, product loops, vibey scene cuts. It's the wrong tool for a hero brand video where you'd notice 720p next to 1080p footage. For those, pair it with a higher-resolution model (Kling 2.6 Pro for cinematic work, Wan 2.6 for character consistency over longer sequences), and let Grok handle the speed- and cost-sensitive work everywhere else.

The bottom line for AI avatar creators

For anyone running an AI avatar feed, this drops the cost of one more posted reel below two dollars and the time to make it below five minutes. The right move now: use Grok Imagine 1.5 as your daily variant engine on a face you already own. If you haven't built that face yet, do that first — start with your AI avatar, then point Grok at every still it produces.

FAQs

Does Grok Imagine 1.5 keep my AI avatar's face consistent?

Better than 1.0, and noticeably better than most rivals on image-to-video specifically. Reference-to-video mode is what you want for character work — feed it an identity-locked still and the face stays through the clip. Multi-character scenes with precise choreography are still weak.

How much does a 10-second reel actually cost?

$1.41 at 720p ($1.40 in video + $0.01 input image) or $0.81 at 480p ($0.80 in video + $0.01 input image). Cheaper than every alternative at this quality tier.
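
Those two figures also let you check when the draft-at-480p habit actually pays off. The sketch below is a rough model under stated assumptions: every attempt is a 10-second clip, a draft-first workflow ends with exactly one extra 720p render of the winning draft, and that render comes out as good as the draft (which is not guaranteed). The name `iteration_cost` is illustrative.

```python
from decimal import Decimal

DRAFT_COST = Decimal("0.81")  # one 10-second clip at 480p, from the answer above
FINAL_COST = Decimal("1.41")  # one 10-second clip at 720p, from the answer above


def iteration_cost(attempts: int, draft_first: bool) -> Decimal:
    """Total USD spent to reach one finished 720p clip after `attempts` tries.

    draft_first=True:  every attempt is a 480p draft, then one 720p render.
    draft_first=False: every attempt is rendered at 720p; the last one is kept.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if draft_first:
        return DRAFT_COST * attempts + FINAL_COST
    return FINAL_COST * attempts


for attempts in (1, 2, 3, 5):
    print(attempts, iteration_cost(attempts, True), iteration_cost(attempts, False))
```

Under these assumptions, drafting first costs more if you land the shot in one or two tries, and only comes out cheaper from the third attempt onward (3.84 versus 4.23). That matches the advice to save drafts for when you are iterating in volume.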

Can I use it for commercial content?

Yes, via the xAI API today. Read the xAI terms before publishing — same as you would for Sora or Runway. The full consumer rollout across X Premium tiers is still in progress.

What image formats does it accept?

JPG, JPEG, PNG, WEBP, GIF, and AVIF.
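
If you batch-process stills, a quick pre-flight check against that list can save a wasted request. This is a minimal sketch with a hypothetical helper name, and it only looks at the file extension; it does not inspect the file contents, so a mislabeled file would still pass.

```python
from pathlib import Path

# Extensions for the formats listed above.
ACCEPTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".avif"}


def is_accepted_image(path: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    return Path(path).suffix.lower() in ACCEPTED_SUFFIXES


for name in ("avatar.PNG", "selfie.jpeg", "product.tiff"):
    print(name, is_accepted_image(name))
```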

Should I drop Seedance 2.0 or Kling 2.6 Pro for this?

No — keep them as fallbacks. Grok Imagine 1.5 wins on speed, cost, and image-to-video quality at 720p. Kling 2.6 Pro and Sora still hold the 1080p output edge, and Sora has stronger results on complex multi-subject text-to-video.
