Last month, we explored the broad topic of generative AI in fashion, focusing on various applications of this technology in the industry.
Now, we will look at a specific application: using Virtual Try-On technology to automate photo shoots.
This article examines the pros and cons of using foundation models for Virtual Try-On. In the next article, we will discuss the advantages of specific Virtual Try-On models compared to foundation models.
But first, let’s remind ourselves what Virtual Try-On is.
What is Virtual Try-On?
Virtual Try-On uses AI to dress models in garments digitally. It creates on-model photos from product images without the need for a physical photo shoot. This simplifies and reduces the cost of fashion imagery, revolutionizing how we approach fashion photography.
In recent years, this technology has rapidly evolved. Understanding its evolution is key to anticipating the opportunities it presents.
The Evolution of Virtual Try-On Technologies
GANs (Generative Adversarial Networks)
GANs marked the beginning of Virtual Try-On technologies.
These models rely on two neural networks: the generator and the discriminator. The generator creates an image, and the discriminator predicts whether the image looks realistic.
While GANs showed promise, they often struggled with photorealism. They had difficulty capturing intricate details like garment folds and adapting to different body shapes and poses.
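To make the generator/discriminator interplay concrete, here is a minimal, purely illustrative PyTorch sketch of one adversarial training step. The network sizes, data, and hyperparameters are toy placeholders, not the architecture of any real try-on system.

```python
# Toy sketch of one GAN training step (illustrative only; real try-on GANs
# use far more elaborate, image-specific architectures).
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28  # placeholder sizes

# Generator: maps random noise to a flattened "image".
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)

# Discriminator: predicts whether an image looks real (1) or generated (0).
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_images = torch.rand(16, image_dim)   # stand-in for a real training batch
noise = torch.randn(16, latent_dim)
fake_images = generator(noise)

# Discriminator step: learn to tell real images from generated ones.
d_loss = bce(discriminator(real_images), torch.ones(16, 1)) + \
         bce(discriminator(fake_images.detach()), torch.zeros(16, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: learn to fool the discriminator.
g_loss = bce(discriminator(fake_images), torch.ones(16, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```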
The Rise of Diffusion Models
Diffusion models have emerged as a robust alternative to GANs for image synthesis. They work by refining random noise through iterative processes until a clear image forms. This process is similar to how an artist creates a painting, beginning with broad, foundational strokes and gradually refining finer details, pixel by pixel, until the final image is realized.
This process allows for high-quality, diverse, and detailed image generation. Diffusion models can produce higher-quality, more realistic images than GANs. This explains the popularity of models like DALL-E, Midjourney, and Stable Diffusion. These foundation models are excellent for creating a wide variety of high-quality, original, and creative images.
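To give an intuition for that iterative refinement, here is a heavily simplified, DDPM-style sampling loop in Python. The noise schedule and the `denoiser` stand-in are illustrative assumptions, not the exact procedure used by DALL-E, Midjourney, or Stable Diffusion.

```python
# Heavily simplified diffusion sampling: start from pure noise and repeatedly
# remove a little of the noise predicted by a trained network.
import torch

def sample(denoiser, steps=50, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                     # start from pure random noise
    betas = torch.linspace(1e-4, 0.02, steps)  # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t)       # network's guess of the noise in x
        # Simplified DDPM update: subtract the predicted noise component.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * predicted_noise) \
            / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # small noise re-injection
    return x  # a clearer image gradually emerges over the steps

# Runs with a dummy denoiser; a real model would be a trained U-Net.
image = sample(lambda x, t: torch.zeros_like(x))
```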
Let's now focus on how these foundation models apply to Virtual Try-On.
Foundation Models and Virtual Try-On
Foundation models enable the creation of stunning creative images from textual prompts. You can generate fashion photos using simple prompts like “fashion photography”, combined with specific styles such as “high-fashion”, “minimalist”, or “techwear”.
You can also create virtual models wearing different outfits and generate diverse, detailed garments.

An image generated with Midjourney
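As a concrete illustration of this kind of prompting, here is a minimal sketch using Stable Diffusion through the open-source diffusers library. The model id, prompt, and parameters are just one possible choice, and a GPU is assumed.

```python
# Minimal prompt-to-image sketch with the `diffusers` library (illustrative
# model id and settings; assumes a CUDA GPU is available).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "fashion photography, high-fashion, minimalist, full-body studio shot"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("fashion_shot.png")
```

Swapping “minimalist” for “techwear” or adding garment descriptions to the prompt changes the style, which is exactly the kind of creative flexibility these models excel at.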
At first glance, foundation models seem perfect for creating creative fashion photos. However, they have significant limitations in Virtual Try-On.
Limitations of Foundation Models
For Virtual Try-On, we need to use an existing photo of a garment to create new on-model pictures that accurately reflect the original. Every detail, including fabric texture, patterns, colors, and unique design elements, must be preserved.
For example:
If the original garment is a red plaid shirt with a button-down collar and a specific type of stitching, the generated on-model image should accurately reflect the same shade of red, the exact plaid pattern, and the buttons' placement.
If the garment has a floral print with intricate details, these details must be clearly visible and correctly aligned on the model, ensuring the print appears as it would in real life.
For a dress with a specific type of fabric, like silk or denim, the generated image should capture the fabric's characteristic sheen or texture, ensuring the final image feels realistic and true to the original.
This level of detail is crucial to accurately represent to the customer how the garment will look when worn.
Accuracy in Depicting Garments
Foundation models excel at producing beautiful, creative content. However, they struggle with creating precise images for fashion collections.
Detail Preservation: These models often fail to capture and preserve small details. While the overall image may look photorealistic, zooming in reveals a lack of detail: the fabric weave may appear coarse, seams exaggerated, and buttons poorly defined. This issue is more pronounced with complex patterns.
Complex textures: Results are often poor for garments with complex textures and volume, such as puffer jackets, which tend to lose their characteristic volume and surface texture.
Fidelity: Unintended elements may be added, misleading customers about the product’s actual appearance.
These difficulties arise from the very architecture of the models. Foundation models are designed to generate a wide range of images from textual input, not precise images from specific inputs like clothing or ready-to-wear items.
The architecture cannot capture every detail of a specific garment and then reproduce it faithfully. This limitation stems from two factors: the difficulty of extracting all the details from a product image, and the challenge of conditioning the output on that input image, known as model conditioning.
It's possible to create high-quality images using foundation models, but it’s not consistent across all types of garments or at scale. It often requires extensive prompting and manual retouching for each item.
These limitations extend beyond just the garments. When trying to generate an entire photo shoot digitally, you may want to create your own digital models. However, this is not so simple with foundation models.
Foundation Models: Generating High-Quality Photorealistic Models
There are two main issues when generating your own virtual models.
First, foundation models often struggle with photorealism in full-body shots, leading to unrealistic and distorted faces. Some applications can fix this in post-production, but it adds another step, reducing scalability.

When prompted for a full-body shot, foundation models often generate artifacts in the face.
Second, maintaining consistency across multiple generations of the same model is challenging. You might want to create different views (front, back) of the same garment on the same model, but achieving perfect consistency between images is difficult.
You can upload a photo of a specific model and use it to generate other images, but slight differences can occur. This inconsistency makes foundation models less effective as a substitute for traditional photo shoots.

A generated image and the result of using it as a prompt: the two are similar, but the face isn't exactly the same.
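For illustration, the sketch below shows a typical image-to-image workflow with the diffusers library, the kind of approach used to reuse a reference photo of a model. The model id, file names, and parameter values are assumptions made for this example; even with a low strength value, the output only stays close to the reference rather than matching it exactly, which is precisely the consistency problem described above.

```python
# Illustrative image-to-image sketch with `diffusers`: reuse a previously
# generated model photo as the starting point for a new view (model id,
# file names, and parameters are assumptions for this example).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

reference = load_image("reference_model.png")  # previously generated model photo
prompt = "the same model, back view, wearing a red plaid shirt"

# `strength` trades fidelity for freedom: lower values stay closer to the
# reference image but leave less room to change the pose or viewpoint.
result = pipe(prompt, image=reference, strength=0.6, guidance_scale=7.5).images[0]
result.save("back_view.png")
```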
Conclusion
As we’ve seen, foundation models are not yet perfectly suited to address the challenge of Virtual Try-On in fashion. However, another approach exists: developing specialized models tailored for Virtual Try-On.
Built on diffusion models and tailored specifically to the Virtual Try-On task, these specialized models will produce photorealistic on-model photos. They will accurately depict garment folds, shadows, and details, and adapt to body shapes and poses, all at scale.
They will revolutionize the fashion industry’s approach to product imagery, enhancing the shopping experience for consumers and setting new standards in e-commerce and fashion technology.
At Veeton, we are developing specialized models to automate fashion imagery at scale. Wanna Try?