Building Pinterest Canvas, a text-to-image foundation model


Eric Tzeng, ML Research Scientist, ATG | Raymond Shiau, ML Research Scientist, ATG

In this engineering note, we wanted to share some of our latest progress on Pinterest Canvas, a text-to-image foundation model for enhancing existing images and products on the platform. Building image foundation models has been a core part of Pinterest's ML strategy for the past decade, but these have been focused on representation learning tasks (e.g. our Unified Visual Embedding v2, v3, etc.). More recently, we have begun to explore the application of generative models, specifically those that can be conditioned on existing Pinterest images, to create new backgrounds for products.

Pinterest Canvas is built as a text-to-image model that can support arbitrary conditioning information in the form of product masks and conditioning images for stylistic guidance. In this post, we will discuss first the training of the base text-to-image model, then the fine-tuning process to generate photorealistic backgrounds conditioned on masks, and finally an in-context learning process for conditioning on image styles.

Because we are primarily interested in image generation as a means of visualizing existing products in new contexts, rather than producing entirely new content from scratch, we don't have a direct product use for the standard image generation model that takes a text caption and tries to generate an image based on that caption. However, this text-to-image task ends up being a useful way to teach a model about the visual world, so that it can then learn how to generate cohesive and compelling objects and scenes.

To that end, we train our own image generation foundation model, named Pinterest Canvas, to serve as the backbone model that can be fine-tuned for all downstream product applications. Pinterest Canvas is a latent diffusion model trained entirely in-house at Pinterest that adheres closely to standard latent diffusion model designs. For efficiency, the diffusion model itself operates in the latent space learned by a variational autoencoder (VAE). The final latent representation generated by the diffusion model is then decoded into image pixels by this VAE's decoder. Text captions are encoded using both CLIP-ViT/L and OpenCLIP-ViT/G, and are fed to a convolutional UNet via cross-attention in order to incorporate text conditioning information during the generation process.

During training, we sample random caption-image pairs from our dataset. We then encode each image into its latent representation using our VAE, embed each text caption using CLIP, and sample a random diffusion timestep for each pair. Noise is added to each image latent according to its sampled diffusion timestep, and the UNet is tasked with denoising the latent given the text embedding and timestep index.
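To make this concrete, here is a minimal sketch of a single denoising training step. The components (vae_encoder, text_encoder, unet, scheduler) and their interfaces are hypothetical stand-ins for illustration, not our production code:

```python
import torch
import torch.nn.functional as F

def training_step(vae_encoder, text_encoder, unet, scheduler, images, captions):
    # Encode images into the VAE latent space and captions into text embeddings.
    # Both encoders are frozen during diffusion training in this sketch.
    with torch.no_grad():
        latents = vae_encoder(images)        # (B, C, h, w) latent representations
        text_emb = text_encoder(captions)    # (B, seq_len, d) CLIP token embeddings

    # Sample a random diffusion timestep for each pair and add matching noise.
    batch_size = latents.shape[0]
    timesteps = torch.randint(
        0, scheduler.num_train_timesteps, (batch_size,), device=latents.device
    )
    noise = torch.randn_like(latents)
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    # The UNet predicts the added noise, conditioned on the text via cross-attention.
    noise_pred = unet(noisy_latents, timesteps, text_emb)
    return F.mse_loss(noise_pred, noise)
```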

The base Canvas model uses a standard VAE-based diffusion model to provide a text-to-image foundation.

We filter our training data aggressively, in an attempt to make sure images are of high quality, adhere to trust and safety standards, and have relevant associated text data. Text for each image is collected from a variety of sources to ensure diversity, including public Pin titles and descriptions, as well as generated alt text for SEO and accessibility use cases. Even after this stringent filtering, we are still left with over 1.5 billion high-quality text-image pairs, which ensures that after a long and carefully managed training schedule, Pinterest Canvas converges to generate high-quality images that capture an inspiring and engaging aesthetic.

There are many more improvements we could layer on top of this training protocol to further improve the performance of the base model. Notably, we've explored using reinforcement learning to encourage Canvas to generate more diverse and visually appealing images, which we've written about in an ECCV publication: Large-scale Reinforcement Learning for Diffusion Models. However, in this post we'd like to instead explore how we go beyond this base model to train image generation models that can perform specific visualization tasks.

Training Pinterest Canvas gives us a strong base model that understands what objects look like, what their names are, and how they are typically composed into scenes. However, as previously stated, our goal is training models that can visualize or reimagine real ideas or products in new contexts. We'll use the base model as a starting point, but modify the training procedure. Now, instead of training it to create images from scratch, we'll ask it to fill in missing parts of images, a task commonly known as inpainting.

Note that we're discussing just one potential specialization of Pinterest Canvas — in this case, one that performs inpainting. However, in practice we have ideas for lots of other tasks to help perform other kinds of visualizations!

In order to get our model to inpaint images properly, we'll need to provide some additional information as well. Instead of only passing a text caption and a partially noisy latent, we additionally pass:

  1. A target image with missing portions
  2. A binary mask, indicating whether pixels in the target image are valid or missing

Together, these inputs define the inpainting problem: the end goal is to generate an image that matches the provided target image, but with the missing portions filled in.
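One common way to wire these inputs into a latent diffusion UNet (used by open-source inpainting variants of Stable Diffusion) is to concatenate them with the noisy latent along the channel dimension. The sketch below follows that convention; the exact channel layout is an illustrative assumption rather than Canvas's actual scheme:

```python
import torch
import torch.nn.functional as F

def build_inpainting_input(noisy_latents, target_image_latents, pixel_mask):
    """Assemble the conditioned UNet input for one denoising step.

    noisy_latents:        (B, 4, h, w) latent currently being denoised
    target_image_latents: (B, 4, h, w) VAE latent of the target image with the
                                        missing regions zeroed out
    pixel_mask:           (B, 1, H, W) binary mask, 1 = missing / to generate,
                                        0 = keep from the target image
    """
    # Downsample the pixel-space binary mask to latent resolution.
    latent_mask = F.interpolate(
        pixel_mask, size=noisy_latents.shape[-2:], mode="nearest"
    )
    # Concatenate along the channel dimension; the UNet's first convolution is
    # widened accordingly (4 + 1 + 4 = 9 input channels in this sketch).
    return torch.cat([noisy_latents, latent_mask, target_image_latents], dim=1)
```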

The Canvas outpainting model provides an object mask that the diffusion model uses as conditioning, but does not modify.

This model is trained in two stages. In the first stage, we use the same dataset as we did for the base Pinterest Canvas, and we additionally generate random masks for the model to inpaint during training. This stage teaches the model to fill in missing image regions, but because the masks are not directly related to the image in any way, we find that after first-stage training, the model often extends or changes the shapes of objects.
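As a rough illustration of this first stage, here is a minimal sketch of random mask generation; the specific shape and size ranges are arbitrary choices for the example, not the ones we use in training:

```python
import torch

def random_rectangle_mask(height: int, width: int) -> torch.Tensor:
    """Return a (1, H, W) binary mask with a random rectangle marked as missing (1)."""
    mask = torch.zeros(1, height, width)
    # Sample a rectangle covering roughly 1/8 to 1/2 of each image dimension.
    mh = torch.randint(height // 8, height // 2, (1,)).item()
    mw = torch.randint(width // 8, width // 2, (1,)).item()
    top = torch.randint(0, height - mh, (1,)).item()
    left = torch.randint(0, width - mw, (1,)).item()
    mask[:, top:top + mh, left:left + mw] = 1.0
    return mask
```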

Thus, in the second stage, we focus specifically on product images, and use a segmentation model to generate product masks by separating the foreground and background. Existing text captions typically describe only the product while neglecting the background, which is essential to guide the background inpainting process, so we incorporate more complete and detailed captions from a visual LLM. In this stage, we train a LoRA on all UNet layers to enable rapid, parameter-efficient fine-tuning. Finally, we briefly fine-tune on a curated set of highly-engaged promoted product images, to steer the model toward aesthetics that resonate with Pinners.
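As an illustration of the LoRA idea, the sketch below shows a LoRA-adapted linear layer of the kind that can wrap the UNet's attention projections; the rank and scaling values are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)            # pretrained weights stay frozen
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)        # the update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))
```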

Separating the training into these two stages allows us to ease the model into the new inpainting task — the first stage keeps the same training data but introduces the additional mask input, and the second stage teaches the model to preserve object boundaries and focus only on generating background content. After convergence, we end up with a model that can take a product Pin and generate a background according to a text prompt:

In practice we also found that our VAE struggled with reconstructing fine details in images. Simply compositing the original image and generated image together as a post-processing step produced visible blending artifacts. We found that it helped to retrain our VAE to accept these additional conditioning inputs as well, so that during the decoding process it seamlessly blends the original and generated image content, while ensuring pixel-perfect reconstructions of products.

Like other diffusion models, Pinterest Canvas is capable of producing multiple variations, which often differ in quality. We leverage this to boost quality during inference by generating multiple backgrounds for a product and selecting the top k with a reward model trained on human judgments spanning defects, fidelity, and aesthetics.
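A minimal sketch of this best-of-n reranking is below; generate_background and reward_model are hypothetical stand-ins for the actual Canvas components:

```python
import torch

def best_of_n(product_image, prompt, generate_background, reward_model,
              n: int = 8, k: int = 2):
    """Generate n candidate backgrounds and keep the k highest-reward ones."""
    candidates = [generate_background(product_image, prompt) for _ in range(n)]
    # The reward model returns a scalar score per image; higher is better.
    scores = torch.tensor([float(reward_model(c)) for c in candidates])
    top_idx = scores.topk(k).indices.tolist()
    return [candidates[i] for i in top_idx]
```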

Although we're quite happy with the quality of results generated by our backdrop outpainting model, in practice it's still quite limiting to try to describe the desired background solely in words. Sometimes it's easier to simply show examples of the style you're after! To this end, we further augment our model with the ability to condition on other images, using their style to guide the generation process.

To enable this additional functionality, we build off of IP-Adapter, a method for training an adapter network that processes additional image prompts. Within the diffusion UNet, these additional image prompts are encoded into embeddings and then passed alongside the text embeddings to new image-specific cross-attention layers, thereby allowing the diffusion network to attend to both image and text prompts. We follow the IP-Adapter training setup and condition directly on the target image. In order to preserve backdrop generation capability, we found it was important to jointly fine-tune on the second-stage backdrop inpainting task, and so we reuse the same products-focused dataset.
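The core idea behind this decoupled cross-attention can be sketched as follows. In the actual IP-Adapter only new key/value projections are trained while the text attention stays frozen; the sketch below uses whole attention modules for brevity, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Attend to text tokens and image-prompt tokens separately, then sum."""

    def __init__(self, dim: int = 768, num_heads: int = 8, image_scale: float = 1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # New image-specific cross-attention; the only part trained by the adapter.
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_scale = image_scale

    def forward(self, hidden_states, text_tokens, image_tokens):
        text_out, _ = self.text_attn(hidden_states, text_tokens, text_tokens)
        image_out, _ = self.image_attn(hidden_states, image_tokens, image_tokens)
        return text_out + self.image_scale * image_out
```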

For personalization, we append stylistic context in the form of concatenated UVE and CLIP embeddings as additional conditioning information to guide the model to produce backgrounds in a particular visual style.
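A minimal sketch of assembling these style-conditioning tokens is below; the embedding dimensions, projection, and token count are illustrative assumptions rather than the production interface:

```python
import torch
import torch.nn as nn

class StyleConditioner(nn.Module):
    """Project concatenated UVE and CLIP image embeddings into style tokens."""

    def __init__(self, uve_dim: int = 256, clip_dim: int = 1024,
                 token_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.proj = nn.Linear(uve_dim + clip_dim, token_dim * num_tokens)
        self.num_tokens, self.token_dim = num_tokens, token_dim

    def forward(self, uve_emb: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        style = torch.cat([uve_emb, clip_emb], dim=-1)      # (B, uve_dim + clip_dim)
        tokens = self.proj(style).view(-1, self.num_tokens, self.token_dim)
        return tokens  # consumed by the image-specific cross-attention layers
```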

We're experimenting with different ways of collecting conditioning images, including using boards with strong styles as well as automatically mining style clusters, though simply conditioning on the ground-truth image itself was surprisingly effective as well. We also found that using our internally developed Unified Visual Embedding (UVE) to embed the conditioning images generally led to a much stronger effect on the resulting generations, as compared to only using other embeddings like CLIP. UVE is our core visual signal at Pinterest used for visual search and recommendations, and by providing it as a conditioning input to Pinterest Canvas, we're able to tap into that rich visual understanding to more strongly influence the resulting outputs. We're excited to start gathering customer input on these approaches through the recently announced Pinterest Ad Labs.

The next set of improvements to the Pinterest Canvas model fall into three categories:

  • The underlying diffusion backbone model is being upgraded to a more modern Transformer diffusion architecture (DiT). Our training results already indicate that this model is able to generate product backgrounds at a higher resolution and fidelity, particularly when we simultaneously upgrade to a more performant fine-tuned text encoder.
  • One active area of research for the team is rethinking the binary-masking approach to model conditioning. The binary pixel-masking constraint (also known as hard-masking) is important for fulfilling our promise to merchants that their products will never be altered or misrepresented, by not allowing the model to modify the pixels inside the product mask. However, there are some scenarios where this constraint prevents us from producing immersive and useful background visualizations for our users. For example, the model's background generation capability can't introduce dynamic lighting into the scene if the pixels or alpha channel can't be modified in any way, making it difficult to work with scenes involving multiple products or more complex backgrounds. Another area where a soft-masking approach would be useful is allowing the model to clean up errors from the segmentation model when it has high confidence that either too much or not enough of the border was clipped.
  • Since we found that using our Pinterest-optimized visual embeddings (UVE) for image conditioning led to much stronger results compared to the CLIP-like baselines, we will continue incorporating UVE modeling improvements into Pinterest Canvas. Following this insight, we're exploring using CLIP-like multimodal embeddings trained on Pinterest data and specifically tuning them to improve the text conditioning component of the model.

We're excited to share more about this work in a future post!