Walkyrie-1.3B-v1.0 Spins Video Smarts Into Speedy Local Image Creation

A gigantic, smooth, matte white sculptural bust of a featureless mannequin covered with intricate interconnected nodes.

Walkyrie-1.3B-v1.0 is a new text-to-image model that turns written prompts into 1024×1024 pixel images. It was rebuilt from an existing video-generation model after its language-understanding component was trimmed down to run faster on less powerful hardware. The developer retrained the entire pipeline specifically for still-image output instead of video clips.

The project comes from independent creator kpsss34, who adapted the Wan2.1-T2V-1.3B architecture. By pruning the UMT5 text encoder to around 1 billion parameters and fine-tuning for image generation, they created a tool that prioritizes quick, local use. The model is shared as an early preview to gather feedback and build community interest.

Early preview for community testing

Key highlights
  • Text-to-image output at 1024×1024 resolution.
  • Pruned text encoder for faster processing.
  • Runs with as little as 6–8 GB VRAM.
  • CPU offload support for memory efficiency.
  • Built from the Wan2.1-T2V-1.3B video architecture.
  • Free for both research and commercial use.
  • An early preview, trained to about 20% of the planned training budget.

People with consumer-grade GPUs, small studios that rely on local AI tools, and anyone who wants to keep image generation private can benefit from this model. Because it fits on mid-range hardware when CPU offload is enabled, it lowers the barrier to running a diffusion model locally. The current release is tuned toward an anime aesthetic, with a turbo variant and a larger 13B version planned for the future.
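To see why a 6–8 GB card is plausible, here is a rough back-of-the-envelope sketch of weight memory alone. It assumes 16-bit weights (2 bytes per parameter) and uses only the approximate parameter counts quoted above; the component names are illustrative, and the VAE, activations, and framework overhead are deliberately ignored, so real usage will be higher.

```python
# Rough weight-memory estimate. Assumptions (not from the developer):
# 16-bit weights, no VAE, no activations, no framework overhead.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_gib(num_params: float, dtype: str = "bf16") -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Approximate parameter counts taken from the article; names are illustrative.
components = {
    "diffusion_model": 1.3e9,      # Wan2.1-T2V-1.3B backbone
    "pruned_text_encoder": 1.0e9,  # UMT5 pruned to around 1B
}


def resident_all(parts: dict, dtype: str = "bf16") -> float:
    """Everything kept on the GPU at once."""
    return sum(weight_gib(n, dtype) for n in parts.values())


def resident_offloaded(parts: dict, dtype: str = "bf16") -> float:
    """Idealized CPU offload: only the largest component is on the GPU at a time."""
    return max(weight_gib(n, dtype) for n in parts.values())


if __name__ == "__main__":
    print(f"all on GPU:   {resident_all(components):.2f} GiB")
    print(f"with offload: {resident_offloaded(components):.2f} GiB")
```

Under these assumptions the weights come to roughly 4.3 GiB when everything stays on the GPU and about 2.4 GiB with idealized offload, which leaves headroom on a 6–8 GB card for activations at 1024×1024. Treat the numbers as an order-of-magnitude check, not a measurement.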

Developer notes and known limits

The model is very much a work in progress: only about one-fifth of the intended training has been completed, so quality and stability are expected to improve. The developer points out that anatomy problems remain a challenge, a limitation that often shows up in smaller models. Future releases, including the turbo edition and the 13-billion-parameter version, depend on additional training resources and community support.

This model has only been trained to approximately 20% of the planned training budget.