
SegGen: Supercharging Segmentation Models
with Text2Mask and Mask2Img Synthesis

¹HKUST  ²Adobe Research

SegGen not only significantly improves the performance of state-of-the-art segmentation models on standard benchmarks (COCO and ADE20K), but also boosts their generalization to challenging images from unseen domains. The three columns on the left are from the PASCAL dataset and the three on the right are synthesized by the generative model Kandinsky 2.

Abstract

We propose SegGen, a highly effective training data generation method for image segmentation that significantly pushes the performance limits of state-of-the-art segmentation models. SegGen designs and integrates two data generation strategies: MaskSyn and ImgSyn. (i) MaskSyn synthesizes new mask-image pairs via our proposed text-to-mask generation model and mask-to-image generation model, greatly improving the diversity of segmentation masks used for model supervision; (ii) ImgSyn synthesizes new images based on existing masks using the mask-to-image generation model, strongly improving image diversity for model inputs.

On the highly competitive ADE20K and COCO benchmarks, our data generation method markedly improves the performance of state-of-the-art segmentation models in semantic segmentation, panoptic segmentation, and instance segmentation. Notably, in terms of ADE20K mIoU, Mask2Former R50 is boosted from 47.2 to 49.9 (+2.7), and Mask2Former Swin-L from 56.1 to 57.4 (+1.3). These promising results strongly suggest the effectiveness of SegGen even when abundant human-annotated training data is available. Moreover, training with our synthetic data makes segmentation models more robust to unseen domains.

Workflow

Workflow of SegGen: We introduce two generative models, a text-to-mask (Text2Mask) generation model and a mask-to-image (Mask2Img) generation model, based on which we design two approaches for generating new segmentation training samples: MaskSyn and ImgSyn. (a) MaskSyn focuses on generating new segmentation masks. It first extracts a caption from the real image as a text prompt and uses it to generate new masks with the Text2Mask model. Then, the new masks and the text prompt are fed into the Mask2Img model to produce the corresponding new images. (b) ImgSyn focuses on the synthesis of new images. It directly inputs human-labeled masks and text prompts into the Mask2Img model to generate new images.

MaskSyn synthesizes new mask-image pairs. It first generates synthetic segmentation masks with the Text2Mask model, and then synthesizes new images with the Mask2Img model.
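To make the two-stage MaskSyn pipeline concrete, here is a minimal Python sketch; captioner, text2mask, and mask2img are hypothetical stand-ins for the paper's captioning, Text2Mask, and Mask2Img models, not a real API:

def masksyn(real_image, captioner, text2mask, mask2img, n_samples=4):
    """Synthesize new (mask, image) training pairs from one real image."""
    # Step 1: caption the real image to obtain a text prompt.
    prompt = captioner.caption(real_image)
    pairs = []
    for _ in range(n_samples):
        # Step 2: generate a new segmentation mask from the text prompt.
        mask = text2mask.generate(prompt)
        # Step 3: generate an image conditioned on the new mask and prompt.
        image = mask2img.generate(mask=mask, prompt=prompt)
        pairs.append((mask, image))
    return pairs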

ImgSyn synthesizes new images. It generates new images conditioned on human-annotated segmentation masks with the Mask2Img model.
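ImgSyn admits an even simpler sketch under the same hypothetical interfaces: only the image is synthesized, while the human-annotated mask is reused as supervision.

def imgsyn(human_mask, prompt, mask2img, n_samples=4):
    """Synthesize new images aligned with a human-annotated mask."""
    pairs = []
    for _ in range(n_samples):
        # Generate a new image conditioned on the existing mask and prompt.
        image = mask2img.generate(mask=human_mask, prompt=prompt)
        # The same human-annotated mask supervises every synthetic image.
        pairs.append((human_mask, image))
    return pairs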

Mask-image pairs generated by MaskSyn. Both the masks and the images are synthesized by SegGen. The synthetic masks are highly diverse.

Images generated by ImgSyn. The masks are human-annotated. The synthetic images are realistic and align well with the masks.


Better Alignment

In many cases, our synthetic images align better with the human-labeled masks than the real images do, owing to inaccuracies in human annotations. The four samples on the left are from ADE20K and the four on the right are from COCO.

Experiments

BibTeX

@article{ye2023seggen,
  title={SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis},
  author={Ye, Hanrong and Kuen, Jason and Liu, Qing and Lin, Zhe and Price, Brian and Xu, Dan},
  journal={arXiv preprint arXiv:2311.03355},
  year={2023}
}
The website template was adapted from HyperNeRF.