The State of AI in Video and Image Generation

Runbo Li · Co-founder & CEO of Magic Hour · 11 min read

AI has revolutionized creative work in the past two years. More than 15 billion images and _ videos have been generated with AI worldwide, and most Fortune 100 companies have adopted AI tools.

Three years ago, this might have been considered sci-fi. Now it's changing how we create content. In this article, we'll look at where we are and what's coming next.

How We Got Here: GANs to Diffusion

The evolution of AI image and video generation has followed an S-curve, starting in 2014 and taking off in 2021.

2014: GANs Emerge

  • Generative Adversarial Networks pitted two models against each other (generator vs. discriminator)
  • Early results were small, blurry images with limited detail
  • First major breakthrough in machine-created visuals that showed potential
  • Researchers began exploring what neural networks could create from scratch
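
To make the generator-versus-discriminator idea concrete, here is a minimal sketch of the standard GAN losses, computed from the discriminator's outputs. The function names and the sample probabilities are illustrative assumptions, not taken from any particular library or model.

```python
import math

def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Binary cross-entropy: the discriminator wants d_real -> 1 and d_fake -> 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake: float) -> float:
    """Non-saturating generator loss: the generator wants d_fake -> 1."""
    return -math.log(d_fake)

# Hypothetical discriminator outputs (probability that an image is real).
early = {"d_real": 0.9, "d_fake": 0.1}  # discriminator easily spots fakes
later = {"d_real": 0.6, "d_fake": 0.5}  # generator has improved

gen_loss_early = generator_loss(early["d_fake"])
gen_loss_later = generator_loss(later["d_fake"])
disc_loss_early = discriminator_loss(**early)
disc_loss_later = discriminator_loss(**later)
```

As the generator fools the discriminator more often, its own loss falls while the discriminator's loss rises, which is the tug-of-war that drives training.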

2018: Photorealism Takes Off

  • NVIDIA's StyleGAN generated lifelike human faces with unprecedented detail
  • "This Person Does Not Exist" website went viral, showing faces indistinguishable from photos
  • AI art gained attention when "Edmond de Belamy" sold for $432,500 at Christie's
  • This period proved AI could create realistic art, not just experimental examples
  • The art world started debating machine creativity versus human artistry

2021: New Model Architectures

  • OpenAI announced DALL·E alongside CLIP, pairing image generation with a model that links vision and language
  • Diffusion models began outperforming GANs in quality and variety
  • Early hybrid systems like VQGAN+CLIP showed promising results
  • Researchers refined the approach of gradually denoising images to generate content
  • These technical breakthroughs laid groundwork for the coming explosion in AI creativity
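
The "gradually denoising" idea can be sketched in a few lines. This toy example assumes the common formulation where a clean signal is blended with Gaussian noise, and it uses the true noise as a stand-in for a trained network's prediction. The names `add_noise`, `predict_clean`, and `alpha_bar` are illustrative, and the three-value list is only a placeholder for an image.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def predict_clean(xt, eps_hat, alpha_bar):
    """Invert the forward step, given an estimate of the noise."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps_hat)]

rng = random.Random(0)
pixels = [0.2, 0.5, 0.9]  # a tiny stand-in "image"
noisy, true_noise = add_noise(pixels, alpha_bar=0.5, rng=rng)

# A trained network would estimate the noise; here the true noise acts as a
# perfect oracle purely to show the arithmetic.
recovered = predict_clean(noisy, true_noise, alpha_bar=0.5)
```

In a real model the noise estimate is imperfect, so generation repeats this kind of step many times, removing a little noise at each pass.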

2022: Text-to-Image Goes Mainstream

  • DALL·E 2 launched, generating 2+ million images daily for select users
  • Stability AI released Stable Diffusion as open-source, allowing anyone to run it locally
  • Community-driven innovation exploded with custom models and fine-tuning
  • Midjourney attracted artists with its distinctive style through Discord
  • The tools became accessible enough for non-technical users to create with prompts
  • Social media filled with AI-generated art as millions experimented with the technology

2023: Video Generation Begins & Mass Adoption

  • Image generators reached millions of users across platforms
  • Midjourney grew to ~15 million users creating nearly 1 billion images
  • Quality improved dramatically with Stable Diffusion 2, SDXL, and other models
  • First wave of text-to-video tools appeared from multiple companies
  • Runway introduced Gen-1 and Gen-2 for video stylization and generation
  • Meta's Make-A-Video and Google's Imagen Video, both first shown in late 2022, remained research prototypes
  • Short AI-generated videos (5-10 seconds) became possible but still had limitations
  • The debate intensified when AI-generated art won competitions against human creators

AI Image Generation: Where We Stand

High-Quality Images, Instantly

  • Modern text-to-image models produce photorealistic images at 1024×1024+ resolution
  • Diffusion models have become more efficient - sampling has been cut from hundreds of steps to as few as 1-4
  • Human faces, complex scenes, and lighting effects look remarkably real
  • Adversarial diffusion distillation techniques have dramatically accelerated generation
  • Models handle coherent compositions with multiple subjects and accurate perspective
  • Special effects like reflections, shadows, and depth of field appear natural
  • Text rendering has improved, though complex text still presents challenges

Tools for Every Creator

  • Digital artists, designers, filmmakers, and marketers use AI in their daily workflows
  • Concept artists can generate dozens of ideas in minutes, then refine the best ones
  • Game developers iterate characters and environments rapidly with prompt variations
  • Illustrators create custom visuals on demand for blogs, books, and advertisements
  • Adobe integrated Firefly into Creative Cloud - users created 1+ billion images in months
  • Photoshop features like background replacement and image extension use generative AI
  • The Stable Diffusion ecosystem accounts for an estimated ~80% of all AI-generated images
  • The open-source movement has empowered global creativity with accessible tools
  • Users fine-tune models on specific styles, share them, and collectively improve technology
  • Black Forest Labs' FLUX models rival Midjourney in quality, with some variants released as open weights

From Novelty to Necessity

  • AI image generators have moved beyond the "wow" phase into practical tools
  • Usage data shows stronger retention - people keep using these tools after initial trials
  • The State of AI Report 2024 showed improved spending and retention for generative AI apps
  • Top-performing AI products include image generation tools from Midjourney and OpenAI
  • New careers have emerged (prompt engineering, AI art design, model fine-tuning)
  • Traditional artists adapt - some embrace AI as a tool, others focus on human uniqueness
  • IP debates continue - Getty Images v. Stability AI and other lawsuits test legal boundaries
  • Companies explore opt-in datasets, attribution systems, and watermarking solutions
  • Art platforms and stock photo sites have established policies on AI-generated works
  • Some jurisdictions now require disclosure when publishing AI-created media

AI Video Generation: The New Frontier

Longer, Better Video Content

  • Early models (2022-23) produced only seconds of often glitchy, surreal footage
  • By late 2023, Stability AI released Stable Video Diffusion as an open model
  • This proved diffusion approaches could extend to the time dimension
  • 2024 saw major breakthroughs from research labs:
    • OpenAI's Sora generates minute-long videos with consistent 3D geometry
    • Google's Veo demonstrated improved temporal coherence and natural motion
    • Meta's MovieGen combines a 30B-parameter video model with a 13B audio model
    • MovieGen produces videos up to 16 seconds long (16 fps), and its audio model generates up to 45 seconds of accompanying sound
  • Research models showed a much stronger grasp of physics, lighting, and camera movement
  • Progress accelerated as techniques from image models transferred to video
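
The clip lengths above translate into a frame budget with simple arithmetic. A small sketch follows; the helper name `frame_count` is illustrative, and the minute-long comparison reuses 16 fps only as an assumption, since no frame rate is given for that case.

```python
def frame_count(duration_s: float, fps: float) -> int:
    """Number of frames a model must generate for a clip."""
    return round(duration_s * fps)

# Figures quoted above: a 16-second clip at 16 fps.
sixteen_second_clip = frame_count(16, 16)

# Assumption for comparison only: a one-minute clip at the same 16 fps.
minute_long_clip = frame_count(60, 16)
```

Every one of those frames has to stay consistent with its neighbors, which is part of why longer clips are so much harder than single images.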

What's Possible Today

  • Public AI video generators create reliable 5-15 second clips from text descriptions
  • Quality improves monthly - fewer artifacts and more natural motion than last year
  • Results still have minor issues but vastly outperform early attempts
  • Common capabilities include:
    • Generating short clips from text prompts (e.g., "lion running through neon jungle")
    • Style transfer tools transform real footage into new visual styles
    • Creating AI avatars that deliver specified scripts in multiple languages
  • Business adoption is booming - Synthesia usage grew 2.5× in one year
  • Half of Fortune 100 companies use AI video for training, marketing, and customer content
  • Creators incorporate AI elements into music videos and short films for fantasy sequences
  • Video editors generate B-roll and abstract visuals quickly for projects
  • Game designers preview character animations before committing to full development
  • Marketing teams create variants of videos for different language markets efficiently

Remaining Challenges

  • Consistency across frames - maintaining appearance throughout clips remains difficult
  • Objects sometimes change appearance slightly between frames
  • Advanced techniques enforce 3D geometry consistency, but minor flickers still occur
  • Compute costs - video generation requires significant processing power
  • Most services charge per second of generated video (e.g., 5 credits per second)
  • Cost considerations force users to be strategic with generation attempts
  • Chinese companies and open-source communities offer cheaper alternatives
  • Kuaishou released Kling with decent quality at lower cost
  • Researchers open-sourced CogVideoX, giving enthusiasts a free playground
  • Video generation models tend to need less GPU memory than the largest language models
  • This lower memory requirement has enabled more competition in the space
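
Per-second pricing makes it easy to budget before generating. Here is a rough sketch using the 5-credits-per-second example above; the function name, clip length, and number of attempts are hypothetical, and real services price differently.

```python
def estimate_credits(clip_seconds: int, attempts: int,
                     credits_per_second: int = 5) -> int:
    """Total credits spent if every attempt renders the full clip."""
    return clip_seconds * attempts * credits_per_second

# Hypothetical project: a 10-second clip that takes 4 tries to get right.
total_credits = estimate_credits(clip_seconds=10, attempts=4)
```

Because cost scales with both clip length and retries, testing a prompt on shorter clips first is one way to stay strategic.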

The Competitive Landscape

  • Startups like Runway, Pika Labs, and Luma have raised hundreds of millions
  • Venture capital sees video generation as the next frontier after images
  • OpenAI, Google, and Meta keep their most advanced models internal or in limited beta
  • Similar pattern to early AI image generation - mix of open research, startups, and big lab projects
  • Innovation comes from multiple sources - no single company dominates yet
  • Industry discusses safeguards like watermarking for transparency and trust
  • The balance between creative power and responsible use drives development
  • Public sentiment influences which features reach consumers first

Where We're Headed

Multimodal Creation

  • Lines between image, video, and audio generation are blurring rapidly
  • Meta's MovieGen demonstrated joint generation of visuals and sound
  • Future tools will generate entire scenes with visuals, music, and dialogue from a single prompt
  • One-stop creative engines could turn scripts into animated films with soundtracks
  • Image-to-video pipelines will become seamless and more controllable
  • Current technique: use AI-generated image as keyframe to condition video model
  • Community creators fine-tune image models for specific styles using LoRA adapters
  • These outputs then drive video generation, combining personal style with AI capabilities
  • "Multimodal studios" will unite text, image, video, and audio AI in collaborative interfaces
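
One reason LoRA adapters are popular for style fine-tuning is their size. Instead of updating a full weight matrix, LoRA trains two thin matrices whose product is added to the frozen weights. The sketch below counts parameters for a single layer; the layer width and rank are illustrative assumptions, not figures from any specific model.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a full weight matrix."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a low-rank pair: (d_out x rank) plus (rank x d_in)."""
    return rank * (d_in + d_out)

# Hypothetical layer: 1024 x 1024 weights with a rank-8 adapter.
full = full_params(1024, 1024)
adapter = lora_params(1024, 1024, rank=8)
savings_ratio = full / adapter
```

Small adapters like this are cheap to train and share, which helps explain why community style models spread so quickly.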

Real-Time Generation

  • Research shows diffusion models can run in a single step with appropriate training
  • Soon, AI tools will feel instantaneous, enabling truly interactive creation
  • On-the-fly editing will enable feedback loops - see changes immediately as you adjust prompts
  • Tweaking parameters and getting immediate visual feedback transforms the creative process
  • Ongoing optimizations in model efficiency bring this future closer each month
  • Generating HD video could be as quick as generating images was in 2023
  • This speed breakthrough will fundamentally change how creators interact with AI tools

New Industry Applications

  • Entertainment: AI-generated feature films (or significant portions) will emerge
  • We'll see the first films where AI handles effects, backgrounds, or entire sequences
  • Gaming: AI-generated game levels, characters, cut-scenes on demand
  • Procedural content generation will reach new heights with personalized game worlds
  • Education: Interactive AI avatars and training simulations for specialized skills
  • Virtual teachers and realistic role-play scenarios generated dynamically
  • Marketing: Personalized video ads tailored to different audience segments
  • Endless variations of visual assets and videos customized for target demographics
  • Custom media: News videos or entertainment with you as the main character
  • Apps could generate stories with users as protagonists, changing media consumption

Community-Driven Innovation

  • Open-source models will continue democratizing access to video generation
  • More models like CogVideoX will appear, following Stable Diffusion's pattern
  • Plugins, fine-tunings, and model checkpoints will expand creative possibilities
  • Platforms like Civitai (hundreds of millions of model downloads) show community demand
  • Users trade custom models and enhancements in a vibrant ecosystem
  • Competition between hobbyists, startups, and big labs ensures progress
  • Alternative tools will push boundaries beyond "official" products
  • This ecosystem prevents monopolization of the technology by a few corporations

Key Challenges Ahead

  1. Authenticity and Misinformation
    • Highly realistic AI videos increase deepfake concerns in politics and media
    • Potential for fake speeches or impersonations grows with improved quality
    • Companies are developing watermarking and cryptographic signature systems
    • These would allow verification of AI-generated content without affecting appearance
    • Some jurisdictions require disclosure of AI-generated content featuring real people
    • Expect more concrete policies from governments and industry bodies as these rules spread
    • The race between detection and generation technologies continues
  2. Intellectual Property
    • Artists question training data usage without permission or compensation
    • 2024 saw vocal concerns and legal action from content creators
    • Companies like Adobe now train on licensed/public domain content to avoid conflicts
    • New frameworks may track AI influences and compensate artists whose styles influenced output
    • Possible future: tracing AI-generated work back to influential training sources
    • The goal: ensuring fair relationships between AI tools and creative professionals
    • The industry must resolve these issues for sustainable growth
  3. Creative Jobs
    • Some routine design and editing tasks will be automated by AI tools
    • Certain production roles may see reduced demand as AI handles technical work
    • New skills (AI guidance, curation, prompt engineering) will grow in demand
    • Human creators remain essential for storytelling and emotional resonance
    • The "final mile" of content still benefits from human refinement and direction
    • Best results come from human-AI collaboration rather than replacement
    • Industry needs to smooth this transition through training and tool design
    • AI should augment human creativity rather than supplant it
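
The cryptographic signature idea mentioned under Authenticity and Misinformation can be illustrated with Python's standard library. This is a simplified sketch: it uses a shared-secret HMAC as a stand-in, whereas real provenance systems typically rely on public-key signatures and embedded metadata. The key, payload, and function names are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical shared secret

def sign_media(data: bytes) -> str:
    """Produce a tag that travels alongside the file, leaving pixels untouched."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()

def verify_media(data: bytes, tag: str) -> bool:
    """Check that the file still matches the tag issued at generation time."""
    return hmac.compare_digest(sign_media(data), tag)

video_bytes = b"\x00\x01fake-video-payload"  # placeholder for real file bytes
tag = sign_media(video_bytes)

is_authentic = verify_media(video_bytes, tag)
survives_edit = verify_media(video_bytes + b"edit", tag)
```

Any change to the file breaks verification, which is how a signature lets a viewer confirm where a clip came from without altering how it looks.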

What This Means for You

The stats tell part of the story: billions of AI images, millions of videos, and rapid growth across industries. The real shift is cultural - we're learning to see AI as a creative partner, not just a tool.

For professionals:

  • Designers: Experiment with style transfer and concept generation to multiply your output
  • Filmmakers: Use AI for effects, backgrounds, and previsualization to reduce production costs
  • Marketers: Create targeted visual content at scale for different customer segments
  • Educators: Build interactive simulations and personalized lessons for varied learning styles
  • Game developers: Generate prototypes quickly and focus human talent on refinement
  • Content creators: Explore hybrid workflows where AI handles technical aspects

For enthusiasts:

  • Try open-source tools (Stable Diffusion variants, CogVideoX) on consumer hardware
  • Join communities sharing model tweaks and techniques to expand your creative options
  • Explore multimodal creation (image → video → audio) for complete projects
  • Stay informed about watermarking and verification developments for responsible sharing
  • Experiment with fine-tuning models on specific styles you enjoy
  • Use AI to visualize ideas that would be difficult to create manually

The AI image and video landscape changes fast - today's cutting edge will be tomorrow's basic feature. By embracing these tools while addressing ethical challenges, we can unlock creativity for everyone while building a sustainable ecosystem for human and AI collaboration.

Runbo Li
Runbo Li is the Co-founder & CEO of Magic Hour. He is a Y Combinator W24 alum and was previously a Data Scientist at Meta where he worked on 0-1 consumer social products in New Product Experimentation. He is the creator behind @magichourai and loves building creation tools and making art.