Fashn-AI struts with new FASHN VTON v1.5 Model


Fashn-AI has released FASHN VTON v1.5, a virtual try-on model that generates photorealistic images without segmentation masks. The release operates directly in pixel space and uses a 972M-parameter architecture.

It aims to solve common problems like lost garment details and distorted body shapes often seen in latent diffusion methods. With an inference speed of roughly 5 seconds on H100 GPUs, the tool is designed for interactive, consumer-facing applications.

Model size: 1.94GB. GPU VRAM required: 8GB.
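As a rough sanity check, the 1.94GB figure lines up with the 972M parameter count if the weights are stored at 16 bits each. The storage precision is an assumption here, since the article does not state it, and the helper name is purely illustrative.

```python
# Back-of-the-envelope check of weight size from parameter count.
# ASSUMPTION: 2 bytes per parameter (e.g. bf16 or fp16); the article
# does not state the storage precision.

PARAMS = 972_000_000      # parameter count quoted in the article
BYTES_PER_PARAM = 2       # assumed 16-bit weights


def checkpoint_size_gb(params: int, bytes_per_param: int) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


size_gb = checkpoint_size_gb(PARAMS, BYTES_PER_PARAM)
print(f"{size_gb:.2f} GB")
```

The 8GB VRAM requirement is larger than the weights alone because inference also needs memory for activations and intermediate buffers.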

Core features of FASHN VTON

  • Direct pixel-space processing to prevent information loss.
  • Maskless inference that lets garments fit naturally without shape constraints.
  • Body identity preservation that keeps tattoos, cultural garments, and hair details intact.
  • Support for varied garment categories including tops, bottoms, and one-pieces.
  • Consumer hardware compatibility, running on roughly 8GB of VRAM.
  • Open-source availability under the Apache 2.0 license.

Architecture & training

The system uses an MMDiT architecture that processes person and garment images together through 8 double-stream blocks and 16 single-stream blocks. Because it works in pixel space, it avoids the distortion issues typically found in latent-space models and preserves colors, logos, and patterns.
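The block layout described above can be written down as a small configuration sketch. Only the block counts and parameter count come from the article. The class name, field names, and the ordering (double-stream blocks first, then single-stream) are assumptions for illustration and are not the project's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VtonBackboneConfig:
    """Hypothetical config holder; names are illustrative only."""
    double_stream_blocks: int = 8      # from the article
    single_stream_blocks: int = 16     # from the article
    params: int = 972_000_000          # from the article

    def block_schedule(self) -> List[str]:
        # ASSUMPTION: double-stream blocks run before single-stream blocks.
        return (["double"] * self.double_stream_blocks
                + ["single"] * self.single_stream_blocks)


cfg = VtonBackboneConfig()
schedule = cfg.block_schedule()
print(len(schedule), schedule[0], schedule[-1])
```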

Training ran in two phases: first on 18 million masked try-on pairs, then a refinement stage on 4 million synthetic triplets. Output resolution is currently capped at 576×864, and the architecture reduces memory use during training by allowing up to 75% of tokens to be dropped.
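To give a feel for what dropping 75% of tokens means at 576×864, here is a minimal sketch. The patch size of 16 pixels and the uniform random selection are assumptions made for illustration. The article gives only the resolution and the drop ratio, and the function names are hypothetical.

```python
import random

# From the article: output resolution and maximum drop ratio.
WIDTH, HEIGHT = 576, 864
DROP_RATIO = 0.75

# ASSUMPTION: square 16x16 pixel patches (not stated in the article).
PATCH = 16


def num_tokens(width: int, height: int, patch: int) -> int:
    """Number of non-overlapping patches covering the image."""
    return (width // patch) * (height // patch)


def keep_indices(n_tokens: int, drop_ratio: float, seed: int = 0) -> list:
    """Pick which token positions survive a uniform random drop."""
    rng = random.Random(seed)
    n_keep = n_tokens - int(n_tokens * drop_ratio)
    return sorted(rng.sample(range(n_tokens), n_keep))


total = num_tokens(WIDTH, HEIGHT, PATCH)
kept = keep_indices(total, DROP_RATIO)
print(total, len(kept))
```

Under these assumptions a training step would see a quarter of the image tokens. Because standard self-attention compares every token with every other, the number of token pairs would fall to roughly one sixteenth.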

Challenges addressed by the developers

The developers point out that existing methods face two fundamental challenges: warping and data scarcity. The project page notes:

'FASHN VTON v1.5 addresses both challenges with three key properties: 1) pixel-space generation that operates directly on RGB pixels without VAE encoding, preserving garment details, 2) maskless inference that generates try-on results without masking, preserving body identity and removing volume constraints, and 3) interactive performance with a 972M parameter architecture optimized for ~5 second inference on H100 GPUs.'

This approach lets the model handle voluminous garments, like wedding gowns or puffer jackets, which earlier mask-based methods struggled with because the result had to fit within the original mask boundaries.
