
These are some notes I compiled in English, as practice to lay the groundwork for future SCI paper submissions.

Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data

Abstract

Instance segmentation is data-hungry, and as model capacity increases, data scale becomes crucial for improving accuracy. Most instance segmentation datasets today require costly manual annotation, limiting their scale. Models trained on such data are prone to overfitting the training set, especially for rare categories. While recent works have explored exploiting generative models to create synthetic datasets for data augmentation, these approaches do not efficiently harness the full potential of generative models.

To address these issues, we introduce a more efficient strategy to construct generative datasets for data augmentation, termed DiverGen. Firstly, we provide an explanation of the role of generative data from the perspective of distribution discrepancy. We investigate the impact of different data on the distribution learned by the model. We argue that generative data can expand the data distribution that the model can learn, thus mitigating overfitting. Additionally, we find that the diversity of generative data is crucial for improving model performance, and we enhance it through various strategies, including category diversity, prompt diversity, and generative model diversity. With these strategies, we can scale the data to millions while maintaining the trend of model performance improvement. On the LVIS dataset, DiverGen significantly outperforms the strong model X-Paste, achieving +1.1 box AP and +1.1 mask AP across all categories, and +1.9 box AP and +2.5 mask AP for rare categories. Our code is available at https://github.com/aim-uofa/DiverGen.

1 Introduction

Instance segmentation [9, 4, 2] is one of the challenging tasks in computer vision, requiring the prediction of masks and categories for instances in an image, and serves as the foundation for numerous visual applications. As models' learning capabilities improve, the demand for training data increases. However, current instance segmentation datasets rely heavily on manual annotation, which is time-consuming and costly, so their scale cannot meet the training needs of models. Despite the recent emergence of the automatically annotated dataset SA-1B [12], it lacks category annotations and thus fails to meet the requirements of instance segmentation. Meanwhile, the ongoing development of generative models has greatly improved the controllability and realism of generated samples. For example, recent text2image diffusion models [22, 24] can generate high-quality images corresponding to input prompts. Therefore, current methods [34, 28, 27] use generative models for data augmentation, generating datasets to supplement the training of models on real datasets and improve model performance.

Although current methods have proposed various strategies to enable generative data to boost model performance, some limitations remain. 1) Existing methods have not fully exploited the potential of generative models. First, some methods [34] not only use generative data but also crawl images from the internet, which makes obtaining large-scale data significantly challenging. Moreover, the content of data crawled from the internet is uncontrollable and needs extra checking. Second, existing methods do not fully use the controllability of generative models: they often adopt manually designed templates to construct prompts, limiting the potential output of generative models. 2) Existing methods [28, 27] often explain the role of generative data from the perspective of class imbalance or data scarcity, without considering the discrepancy between real-world data and generative data. Moreover, these methods typically show improved model performance only in scenarios with a limited number of real samples, and the effectiveness of generative data on existing large-scale real datasets, like LVIS [8], is not thoroughly investigated.

In this paper, we first explore the role of generative data from the perspective of distribution discrepancy, addressing two main questions: 1) Why does generative data augmentation enhance model performance? 2) What types of generative data are beneficial for improving model performance? First, we find that there exist discrepancies between the distribution the model learns from the limited real training data and the distribution of real-world data. We visualize the data and find that, compared to real-world data, generative data can expand the data distribution that the model can learn. Furthermore, we find that the role of adding generative data is to alleviate the bias of the real training data, effectively mitigating overfitting to the training data. Second, we find that there are also discrepancies between the distribution of generative data and the real-world data distribution. If these discrepancies are not handled properly, the full potential of the generative model cannot be utilized. Through several experiments, we find that using diverse generative data enables models to better adapt to these discrepancies, improving model performance.

Based on the above analysis, we propose an efficient strategy for enhancing data diversity, namely, Generative Data Diversity Enhancement. We design various diversity enhancement strategies to increase data diversity from the perspectives of category diversity, prompt diversity, and generative model diversity. For category diversity, we observe that models trained with generative data covering all categories adapt better to distribution discrepancy than models trained with only partial categories. Therefore, we introduce not only categories from LVIS [8] but also extra categories from ImageNet-1K [23] to enhance category diversity in data generation, thereby reinforcing the model's adaptability to distribution discrepancy. For prompt diversity, we find that as the scale of the generative dataset increases, manually designed prompts cannot scale up to the corresponding level, limiting the diversity of output images from the generative model. Thus, we design a set of strategies that use large language models, such as ChatGPT, to generate prompts, requiring the large language models to output maximally diverse prompts under constraints. By combining manually designed prompts and ChatGPT-designed prompts, we effectively enrich prompt diversity and further improve generative data diversity. For generative model diversity, we find that data from different generative models also exhibit distribution discrepancies. Exposing models to data from different generative models during training can enhance adaptability to different distributions. Therefore, we employ Stable Diffusion [22] and DeepFloyd-IF [24] to generate images for all categories separately and mix the two types of data during training to increase data diversity.

At the same time, we optimize the data generation workflow, proposing a four-stage generative pipeline consisting of instance generation, instance annotation, instance filtration, and instance augmentation. In the instance generation stage, we employ our proposed Generative Data Diversity Enhancement to produce diverse raw data. In the instance annotation stage, we introduce an annotation strategy called SAM-background, which uses background points as input prompts for SAM [12] to obtain high-quality annotations of the raw data. In the instance filtration stage, we introduce a metric called CLIP inter-similarity: using the CLIP [21] image encoder, we extract embeddings from generative and real data and compute their similarity, where lower similarity indicates lower data quality. After filtration, we obtain the final generative dataset. In the instance augmentation stage, we use the instance paste strategy [34] to increase model learning efficiency on generative data.

Experiments demonstrate that our designed data diversity strategies can effectively improve model performance and maintain the trend of performance gains as the data scale increases to the million level, which enables large-scale generative data for data augmentation. On the LVIS dataset, DiverGen significantly outperforms the strong model X-Paste [34], achieving +1.1 box AP [8] and +1.1 mask AP across all categories, and +1.9 box AP and +2.5 mask AP for rare categories.

In summary, our main contributions are as follows:

  • We explain the role of generative data from the perspective of distribution discrepancy. We find that generative data can expand the data distribution that the model can learn, mitigating overfitting to the training set, and that the diversity of generative data is crucial for improving model performance.

  • We propose the Generative Data Diversity Enhancement strategy to increase data diversity from the aspects of category diversity, prompt diversity, and generative model diversity. By enhancing data diversity, we can scale the data to millions while maintaining the trend of model performance improvement.

  • We optimize the data generation pipeline. We propose an annotation strategy, SAM-background, to obtain higher-quality annotations. We also introduce a filtration metric, CLIP inter-similarity, to filter data and further improve the quality of the generative dataset.

2 Related Work

Instance segmentation. Instance segmentation is an important task in the field of computer vision and has been extensively studied. Unlike semantic segmentation, instance segmentation not only classifies pixels at the pixel level but also distinguishes different instances of the same category. Previous instance segmentation research has primarily focused on the design of model structures. Mask R-CNN [9] unifies the tasks of object detection and instance segmentation. Subsequently, Mask2Former [4] further unifies the tasks of semantic segmentation and instance segmentation by leveraging the structure of DETR [2].

Figure 1: Visualization of data distributions from different sources. Compared to real-world data (LVIS train and LVIS val), generative data (Stable Diffusion and IF) can expand the data distribution that the model can learn.

Orthogonal to these studies focusing on model architecture, our work primarily investigates how to better utilize generated data for this task. We focus on the challenging long-tail dataset LVIS [8] because it is only the long-tailed categories that face the issue of limited real data and require generative images for augmentation, making it more practically meaningful.

Generative data augmentation. The use of generative models to synthesize training data for perception tasks such as classification [6, 32], detection [34, 3], and segmentation [14, 28, 27] has received widespread attention from researchers. In the field of segmentation, early works [33, 13] utilize generative adversarial networks (GANs) to synthesize additional training samples. With the rise of diffusion models, there have been numerous efforts [34, 14, 28, 27, 30] to utilize text2image diffusion models, such as Stable Diffusion [22], to boost segmentation performance. Li et al. [14] combine the Stable Diffusion model with a novel grounding module and establish an automatic pipeline for constructing a segmentation dataset. DiffuMask [28] exploits the potential of cross-attention maps between text and images to synthesize accurate semantic labels. More recently, FreeMask [30] uses a mask-to-image generation model to generate images conditioned on the provided semantic masks. However, these works apply only to semantic segmentation. The work most relevant to ours is X-Paste [34], which promotes instance segmentation by copy-pasting generative images, together with a filtering strategy based on CLIP [21].

In summary, most methods only demonstrate significant advantages when training data is extremely limited, treating generative data as a means to compensate for data scarcity or class imbalance. In this work, we take a further step and analyze this problem from the perspective of data distribution. We propose a pipeline that enhances diversity at multiple levels to alleviate the impact of data distribution discrepancies, providing new insights and inspiration for further advances in this field.

3 Our Proposed DiverGen

3.1 Analysis of Data Distribution

Existing methods [34, 28, 29] often attribute the role of generative data to addressing class imbalance or data scarcity. In this paper, we provide an explanation for two main questions from the perspective of distribution discrepancy.

Why does generative data augmentation enhance model performance? We argue that there exist discrepancies between the distribution the model learns from the limited real training data and the distribution of real-world data. The role of adding generative data is to alleviate the bias of the real training data, effectively mitigating overfitting to the training data.

First, to intuitively understand the discrepancies between different data sources, we use the CLIP [21] image encoder to extract the embeddings of images from different data sources, and then use UMAP [18] to reduce dimensions for visualization. The resulting data distributions are shown in Figure 1. Real-world data (LVIS [8] train and LVIS val) cluster near the center, while generative data (Stable Diffusion [22] and IF [24]) are more dispersed, indicating that generative data can expand the data distribution that the model can learn.
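As a rough illustration of this visualization step, the sketch below embeds per-source CLIP features with UMAP and plots them by source. The random arrays stand in for real CLIP embeddings, and all names are ours, not the paper's.

```python
import numpy as np
import umap  # pip install umap-learn
import matplotlib.pyplot as plt

# Hypothetical per-source CLIP image features, each of shape (N, 512).
rng = np.random.default_rng(0)
sources = {
    "LVIS train": rng.normal(0.0, 1.0, (500, 512)),
    "LVIS val":   rng.normal(0.0, 1.0, (500, 512)),
    "SD":         rng.normal(0.5, 1.5, (500, 512)),
    "IF":         rng.normal(-0.5, 1.5, (500, 512)),
}

# Reduce all embeddings together so the sources share one 2-D space.
feats = np.concatenate(list(sources.values()))
xy = umap.UMAP(n_components=2, random_state=0).fit_transform(feats)

start = 0
for name, f in sources.items():
    pts = xy[start:start + len(f)]
    plt.scatter(pts[:, 0], pts[:, 1], s=2, label=name)
    start += len(f)
plt.legend()
plt.savefig("umap_sources.png")
```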

Then, to characterize the distribution learned by the model, we employ the free energy formulation used by Joseph et al. [10], which transforms the logits output by the classification head into an energy function:

$$F(\boldsymbol{q}; h) = -\tau \log \sum_{c=1}^{n} \exp\!\left(\frac{h_c(\boldsymbol{q})}{\tau}\right). \qquad (1)$$

Here, $\boldsymbol{q}$ is the instance feature, $h_c(\boldsymbol{q})$ is the $c$-th logit output by the classification head $h(\cdot)$, $n$ is the number of categories, and $\tau$ is the temperature parameter. We train one model using only the LVIS train set ($\theta_{train}$), and another model using LVIS train with generative data ($\theta_{gen}$). Both models are evaluated on the LVIS val set, and we use instances that are successfully matched by both models to obtain energy values. Additionally, we train another model using LVIS val ($\theta_{val}$), treating it as representative of the real-world data distribution. We then fit Gaussian distributions to the histograms of energy values to obtain the mean $\mu$ and standard deviation $\sigma$ for each model and compute the KL divergence [11] between them: $D_{KL}(p_{\theta_{train}} \| p_{\theta_{val}})$ is 0.063, and $D_{KL}(p_{\theta_{gen}} \| p_{\theta_{val}})$ is 0.019. The latter is lower, indicating that using generative data mitigates the bias of the limited real training data.
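A minimal sketch of this measurement, assuming classification logits are at hand: the energy follows Eq. (1), and the comparison uses the closed-form KL divergence between two fitted univariate Gaussians. The random logits and variable names are ours.

```python
import torch

def free_energy(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Eq. (1): F(q; h) = -tau * log sum_c exp(h_c(q) / tau)
    return -tau * torch.logsumexp(logits / tau, dim=-1)

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    # Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
    return (torch.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5)

# Hypothetical logits of matched instances from two models (n = 1203 categories).
e_gen = free_energy(torch.randn(1000, 1203))
e_val = free_energy(torch.randn(1000, 1203))
kl = gaussian_kl(e_gen.mean(), e_gen.std(), e_val.mean(), e_val.std())
```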

Moreover, we also analyze the role of generative data from a metric perspective. We randomly select up to five images per category to form a minitrain set and then run inference with $\theta_{train}$ and $\theta_{gen}$. We define a metric, termed the train-val gap (TVG), formulated as follows:

$$\mathrm{TVG}_w^k = \mathrm{AP}_w^{k,\,minitrain} - \mathrm{AP}_w^{k,\,val}. \qquad (2)$$

Here, $\mathrm{TVG}_w^k$ is the train-val gap of category subset $w$ on task $k$, and $\mathrm{AP}_w^{k,d}$ is the AP [8] of subset $w$ on task $k$ obtained on dataset $d$, where $w \in \{f, c, r\}$, with $f$, $c$, $r$ standing for frequent, common, and rare [8] respectively, and $k \in \{box, mask\}$, with $box$ and $mask$ referring to object detection and instance segmentation. The train-val gap measures the disparity in the model's performance between the training and validation sets: a larger gap indicates a higher degree of overfitting to the training set. The results, presented in Table 1, show that the metrics for the rare categories consistently surpass those of frequent and common ones. This observation suggests that the model tends to overfit more on the rare categories, which have fewer examples. With the augmentation of generative data, all TVG values of $\theta_{gen}$ are lower than those of $\theta_{train}$, showing that adding generative data can effectively alleviate overfitting to the training data.

Data Source | TVG_f^box | TVG_f^mask | TVG_c^box | TVG_c^mask | TVG_r^box | TVG_r^mask
LVIS | 13.16 | 10.71 | 21.80 | 16.80 | 39.59 | 31.68
LVIS + Gen | 9.64 | 8.38 | 15.64 | 12.69 | 29.39 | 22.49

Table 1: Results of train-val gap on different data sources. With the augmentation of generative data, all TVG values of LVIS + Gen are lower than those of LVIS, showing that adding generative data can effectively alleviate overfitting to the training data.
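The metric itself reduces to a per-subset subtraction; a tiny sketch with hypothetical AP values:

```python
def train_val_gap(ap_minitrain: float, ap_val: float) -> float:
    # Eq. (2): TVG_w^k = AP_w^{k, minitrain} - AP_w^{k, val}
    return ap_minitrain - ap_val

# Hypothetical rare-category box APs on minitrain and val (illustrative numbers).
tvg_r_box = train_val_gap(76.4, 36.8)  # large gap -> stronger overfitting
```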

What types of generative data are beneficial for improving model performance? We argue that there are also discrepancies between the distribution of the generative data and the real-world data distribution. If these discrepancies are not properly addressed, the full potential of the generative model cannot be attained.

We divide the generative data into 'frequent', 'common', and 'rare' [8] groups, and train three models, each using one group of data as the instance paste source. The inference results are shown in Table 2. We find that the metrics on the corresponding category subset are lowest when training with only that one group of data. We consider model performance to be primarily influenced by the quality and diversity of data. Given that the quality of generative data is relatively consistent, we contend that insufficient data diversity can mislead the distribution the model learns, whereas a diverse set of data gives the model a more comprehensive understanding. Therefore, we believe that using diverse generative data enables models to better adapt to these discrepancies, improving model performance.

# Gen Category | AP_f^box | AP_f^mask | AP_c^box | AP_c^mask | AP_r^box | AP_r^mask
none | 50.14 | 43.84 | 47.54 | 43.12 | 41.39 | 36.83
f | 50.81 | 44.24 | 47.96 | 43.51 | 41.51 | 37.92
c | 51.86 | 45.22 | 47.69 | 42.79 | 42.32 | 37.30
r | 51.46 | 44.90 | 48.24 | 43.51 | 32.67 | 29.04
all | 52.10 | 45.45 | 50.29 | 44.87 | 46.03 | 41.86

Table 2: Results of training with different category data subsets. The metrics on the corresponding category subset are lowest when training with only one group of data, showing that insufficient data diversity can mislead the distribution the model learns. Blue font marks the lowest value among models using generative data.

3.2 Generative Data Diversity Enhancement

Through the analysis above, we find that the diversity of generative data is crucial for improving model performance. Therefore, we design a series of strategies to enhance data diversity at three levels: category diversity, prompt diversity, and generative model diversity, which help the model to better adapt to the distribution discrepancy between generative data and real data.

Category diversity. The above experiments show that including data from only some categories results in lower performance than incorporating data from all categories. We believe that, akin to human learning, the model can learn features beneficial to the current category from other categories. Therefore, we increase data diversity by adding extra categories. First, we select extra categories beyond LVIS from the ImageNet-1K [23] categories based on WordNet [5] similarity. Then, the generative data from LVIS and the extra categories are mixed for training, requiring the model to learn to distinguish all categories. Finally, we truncate the parameters in the classification head corresponding to the extra categories during inference, ensuring that the inferred category range remains within LVIS.
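A possible realization of this selection step is sketched below. The paper does not give the exact algorithm, so the path-similarity criterion, the threshold, and the toy category lists are our assumptions.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def category_similarity(a: str, b: str) -> float:
    """Max WordNet path similarity over all noun senses of two category names."""
    pairs = [(sa, sb) for sa in wn.synsets(a, pos=wn.NOUN)
                      for sb in wn.synsets(b, pos=wn.NOUN)]
    return max((sa.path_similarity(sb) or 0.0) for sa, sb in pairs) if pairs else 0.0

lvis_categories = ["pretzel", "banana", "dog"]               # toy stand-ins
imagenet_categories = ["bagel", "orange", "wolf", "missile"]

# Keep ImageNet-1K classes close enough to some LVIS class (threshold assumed).
extra = [c for c in imagenet_categories
         if max(category_similarity(c, l) for l in lvis_categories) >= 0.3]
```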

Prompt diversity. The output images of a text2image generative model typically depend on the input prompts. Existing methods [34] usually generate prompts from manually designed templates, such as "a photo of a single {category_name}". When the data scale is small, designing prompts manually is convenient and fast. However, when generating data at a large scale, the number of manually designed prompts cannot scale correspondingly. Intuitively, it is essential to diversify the prompts to enhance data diversity. To generate a large number of prompts easily, we use a large language model, such as ChatGPT, to enhance prompt diversity. We place three requirements on the large language model: 1) each prompt should be as different as possible; 2) each prompt should ensure that there is only one object in the image; 3) prompts should describe different attributes of the category. For example, if the category is food, prompts should cover attributes like color, brand, size, freshness, packaging type, packaging color, etc. Limited by the inference cost of ChatGPT, we use the manually designed prompts as the base and only use ChatGPT to enhance prompt diversity for a subset of categories. Moreover, we also leverage the controllability of the generative model, adding the constraint "in a white background" after each prompt to make the background of output images simple and clear, which reduces the difficulty of mask annotation.
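A sketch of how such prompts might be assembled; the meta-prompt wording and the example LLM outputs are illustrative, not the paper's exact text.

```python
category = "pretzel"  # example LVIS category name

# Manually designed base template used by prior work.
manual_prompt = f"a photo of a single {category}"

# Meta-prompt for the LLM, encoding the three requirements from Sec. 3.2.
meta_prompt = (
    f"Generate 32 prompts for a text-to-image model, each describing a {category}. "
    "1) Make every prompt as different as possible. "
    "2) Each image must contain exactly one object. "
    "3) Cover different attributes: color, brand, size, freshness, packaging, etc."
)

# Hypothetical ChatGPT responses, for illustration only.
llm_prompts = [
    "a golden-brown soft pretzel sprinkled with coarse salt",
    "a miniature chocolate-dipped pretzel in clear plastic packaging",
]

# Append the background constraint so masks are easier to annotate later.
prompts = [p + ", in a white background" for p in [manual_prompt] + llm_prompts]
```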

Generative model diversity. The quality and style of output images vary across generative models, and the data distribution learned from a single generative model's data is limited. Therefore, we introduce multiple generative models to enhance data diversity, allowing the model to learn from wider data distributions. We select two commonly used generative models, Stable Diffusion [22] (SD) and DeepFloyd-IF [24] (IF). We use Stable Diffusion V1.5, generating images at a resolution of 512 × 512, and use images output from Stage II of IF at a resolution of 256 × 256. For each category in LVIS, we generate 1k images with each model. Examples from different generative models are shown in Figure 2.
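A minimal generation sketch with the diffusers library; the model id, sampler defaults, and output paths are assumptions, and DeepFloyd-IF would be driven analogously through its own multi-stage pipeline.

```python
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

os.makedirs("gen", exist_ok=True)
prompts = ["a golden-brown soft pretzel, in a white background"]  # from the prompt step
for i, prompt in enumerate(prompts):
    # SD v1.5 natively generates 512x512 images, matching the paper's setting.
    image = pipe(prompt, height=512, width=512).images[0]
    image.save(f"gen/pretzel_{i:05d}.png")
```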

Figure 2: Examples from various generative models. The samples generated by different generative models vary, even within the same category.

Figure 3: Overview of the DiverGen pipeline. In instance generation, we enhance data diversity at three levels: category diversity, prompt diversity, and generative model diversity. Next, we use SAM-background to obtain high-quality masks. Then, we use CLIP inter-similarity to filter out low-quality data. Finally, we use the instance paste strategy to increase model learning efficiency on generative data.

3.3 Generative Pipeline

The generative pipeline of DiverGen is built upon X-Paste [34]. It can be divided into four stages: instance generation, instance annotation, instance filtration, and instance augmentation. The overview of DiverGen is illustrated in Figure 3.

Instance generation. Instance generation is a crucial stage for enhancing data diversity. In this stage, we employ our proposed Generative Data Diversity Enhancement (GDDE), as described in Sec. 3.2. In category diversity enhancement, we utilize the category information from LVIS [8] categories and extra categories selected from ImageNet-1K [23]. In prompt diversity enhancement, we utilize manually designed prompts and ChatGPT-designed prompts. In generative model diversity enhancement, we employ two generative models, SD and IF.

Instance annotation. We employ SAM [12] as our annotation model. SAM is a class-agnostic promptable segmenter that outputs corresponding masks based on input prompts, such as points, boxes, etc. In instance generation, leveraging the controllability of the generative model, the generative images have two characteristics: 1) each image predominantly contains only one foreground object; 2) the background of the images is relatively simple. Therefore, we introduce a SAM-background (SAM-bg) annotation strategy. SAM-bg takes the four corner points of an image as input prompts for SAM to obtain the background mask, then inverts the background mask as the mask of the foreground object. Due to the conditional constraints during the instance generation stage, this strategy is simple but effective in producing high-quality masks.
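A sketch of SAM-bg with the segment-anything API. The checkpoint path is assumed to be a local file, and we read the paper's corner points as positive prompts placed on the background region.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed local weights
predictor = SamPredictor(sam)

def sam_bg_mask(image: np.ndarray) -> np.ndarray:
    """SAM-bg: prompt SAM with the four image corners to segment the background,
    then invert that mask to get the foreground object mask.
    `image` is an RGB uint8 array of shape (H, W, 3)."""
    h, w = image.shape[:2]
    corners = np.array([[0, 0], [w - 1, 0], [0, h - 1], [w - 1, h - 1]])
    labels = np.ones(4, dtype=np.int64)  # corners are points inside the background
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=corners, point_labels=labels, multimask_output=False
    )
    return ~masks[0]  # invert background mask -> foreground mask
```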

Instance filtration. In the instance filtration stage, X-Paste utilizes the CLIP score (similarity between images and text) as the metric for image filtering. However, we observe that the CLIP score is ineffective in filtering low-quality images. In contrast to the similarity between images and text, we think the similarity between images can better filter out low-quality images. Therefore, we propose a new metric called CLIP inter-similarity. We use the image encoder of CLIP [21] to extract image embeddings for objects in the training set and generative images, then calculate the similarity between them. If the similarity is too low, it indicates a significant disparity between the generative and real images, suggesting that it is probably a poor-quality image and needs to be filtered.
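A sketch of the filtering metric. The paper specifies only CLIP image embeddings and a similarity cutoff, so the model choice, the pooling to a mean real embedding, the file names, and the threshold below are our assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

real = clip_embed(["real_crop_0.png", "real_crop_1.png"])  # real instance crops
gen = clip_embed(["gen_0.png", "gen_1.png"])               # generated images

inter_sim = gen @ real.mean(dim=0)  # cosine similarity to the mean real embedding
keep = inter_sim > 0.5              # threshold assumed; low similarity -> filter out
```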

Instance augmentation. We use the augmentation strategy proposed by X-Paste [34], but use only generative data as the paste data source, without web-retrieved data or instances from the LVIS [8] training set.
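A simplified paste sketch with random placement only; X-Paste's full strategy also handles scale jitter, occlusion, and label bookkeeping, which we omit here.

```python
import random
import numpy as np
from PIL import Image

def paste_instance(canvas: Image.Image, instance: Image.Image,
                   mask: np.ndarray) -> Image.Image:
    """Paste one generated instance onto a real training image at a random spot,
    using its binary mask (same size as `instance`) as the alpha channel."""
    alpha = Image.fromarray(mask.astype(np.uint8) * 255)
    x = random.randint(0, max(0, canvas.width - instance.width))
    y = random.randint(0, max(0, canvas.height - instance.height))
    out = canvas.copy()
    out.paste(instance, (x, y), alpha)
    return out
```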

4 Experiments

4.1 Settings

Datasets. We choose LVIS [8] for our experiments. LVIS is a large-scale instance segmentation dataset containing 164k images with approximately two million high-quality annotations for instance segmentation and object detection. LVIS uses images from the COCO 2017 [15] dataset but redefines the train/val/test splits, with around 100k images in the training set and around 20k images in the validation set. The annotations cover 1,203 categories with a typical long-tailed distribution, so LVIS further divides the categories into frequent, common, and rare based on each category's frequency in the dataset. We use the official LVIS training and validation splits.

Evaluation metrics. The evaluation metrics are LVIS box average precision ($\mathrm{AP}^{box}$) and mask average precision ($\mathrm{AP}^{mask}$). We also report the average precision on rare categories ($\mathrm{AP}_r^{box}$ and $\mathrm{AP}_r^{mask}$). The maximum number of detections per image is 300.

Implementation details. We use CenterNet2 [35] as the baseline and Swin-L [16] as the backbone. During training, we initialize the parameters with the pre-trained Swin-L weights provided by Liu et al. [16]. The training image size is 896 and the batch size is 16. The maximum number of training iterations is 180,000, with an initial learning rate of 0.0001. We use the instance paste strategy provided by Zhao et al. [34].

4.2 Main Results

Data diversity is more important than quantity. To investigate the impact of different scales of generative data, we use generative data of varying scales as paste data sources. We construct three datasets using only DeepFloyd-IF [24] with manually designed prompts, all containing the original 1,203 LVIS categories but with per-category quantities of 0.25k, 0.5k, and 1k, resulting in total dataset scales of 300k, 600k, and 1,200k. As shown in Table 3, using generative data improves model performance compared to the baseline. However, as the dataset scale increases, model performance first improves and then declines: the model using 1,200k images performs worse than the one using 600k. Because the number of manually designed prompts is limited, the generative model produces similar data, as shown in Figure 4(a), so the model cannot benefit from more data. In contrast, when using our proposed Generative Data Diversity Enhancement (GDDE), the increased data diversity lets the model trained with 1,200k images surpass the best setting without GDDE (600k), an improvement of 1.21 box AP and 1.04 mask AP. Moreover, at the same data scale of 600k, GDDE improves box AP by 0.64 and mask AP by 0.55 compared to not using it. The results demonstrate that data diversity is more important than quantity. When the data scale is small, increasing the quantity of data can improve model performance, which we consider an indirect way of increasing data diversity. However, this simplistic approach of solely increasing quantity to increase diversity has an upper limit; once it is reached, explicit data diversity enhancement strategies become necessary to maintain the trend of model performance improvement.

# Gen Data | GDDE | AP^box | AP^mask | AP_r^box | AP_r^mask
0 | | 47.50 | 42.32 | 41.39 | 36.83
300k | | 49.65 | 44.01 | 45.68 | 41.11
600k | | 50.03 | 44.44 | 47.15 | 41.96
1200k | | 49.44 | 43.75 | 42.96 | 37.91
600k | ✓ | 50.67 | 44.99 | 48.52 | 43.63
1200k | ✓ | 51.24 | 45.48 | 50.07 | 45.85

Table 3: Results of different scales of generative data. At the same data scale, models using our proposed GDDE achieve higher performance than those without it, showing that data diversity is more important than quantity.

Figure 4: Examples of generative data using different prompts: (a) images from manually designed prompts; (b) images from ChatGPT-designed prompts. Prompts designed by ChatGPT significantly improve the diversity of generated images in terms of shapes, textures, etc.

Comparison with previous methods. We compare DiverGen with previous data-augmentation-related methods in Table 4. Compared to the baseline CenterNet2 [35], our method improves significantly, increasing box AP by +3.7 and mask AP by +3.2. On rare categories, our method surpasses the baseline by +8.7 box AP and +9.0 mask AP. Compared to the previous strong model X-Paste [34], we outperform it by +1.1 box AP and +1.1 mask AP across all categories, and by +1.9 box AP and +2.5 mask AP on rare categories. It is worth noting that X-Paste uses both generative data and web-retrieved data as paste data sources during training, while our method uses generative data exclusively. We achieve this by designing diversity enhancement strategies, further unlocking the potential of generative models.

Method | Backbone | AP^box | AP^mask | AP_r^box | AP_r^mask
Copy-Paste [7] | EfficientNet-B7 | 41.6 | 38.1 | - | 32.1
Tan et al. [26] | ResNeSt-269 | - | 41.5 | - | 30.0
Detic [36] | Swin-B | 46.9 | 41.7 | 45.9 | 41.7
CenterNet2 [35] | Swin-L | 47.5 | 42.3 | 41.4 | 36.8
X-Paste [34] | Swin-L | 50.1 | 44.4 | 48.2 | 43.3
DiverGen (Ours) | Swin-L | 51.2 (+1.1) | 45.5 (+1.1) | 50.1 (+1.9) | 45.8 (+2.5)

Table 4: Comparison with previous methods on the LVIS val set.

4.3 Ablation Studies

We analyze the effects of the proposed strategies in DiverGen through a series of ablation studies using the Swin-L [16] backbone.

Effect of category diversity. We select 50, 250, and 566 extra categories from ImageNet-1K [23] and generate 0.5k images for each category, which are added to the baseline. The baseline uses only the 1,203 categories of LVIS [8] to generate data. The results are shown in Table 5. Model performance first improves and then declines as the number of extra categories increases, peaking at 250 extra categories. This trend suggests that using extra categories to enhance category diversity can improve the model's generalization capabilities, but too many extra categories may mislead the model, leading to a decrease in performance.

# Extra Category | AP^box | AP^mask | AP_r^box | AP_r^mask
0 | 49.44 | 43.75 | 42.96 | 37.91
50 | 49.92 | 44.17 | 44.94 | 39.86
250 | 50.59 | 44.77 | 47.99 | 42.91
566 | 50.35 | 44.63 | 47.68 | 42.53

Table 5: Ablation of the number of extra categories during training. Using extra categories to enhance category diversity can improve the model's generalization capabilities, but too many extra categories may mislead the model, leading to a decrease in performance.

Effect of prompt diversity. We select a subset of categories and use ChatGPT to generate 32 and 128 prompts for each category, with each prompt being used to generate 8 and 2 images, respectively, ensuring that the image count for each category is 0.25k. The baseline uses only one prompt per category to generate 0.25k images. The regenerated images will replace the corresponding categories in the baseline to ensure that the final data scale is consistent. The results are presented in Table 6. With the increase in prompt diversity, there is a continuous improvement in model performance, indicating that prompt diversity is indeed beneficial for enhancing model performance.

# Prompt | AP^box | AP^mask | AP_r^box | AP_r^mask
1 | 49.65 | 44.01 | 45.68 | 41.11
32 | 50.03 | 44.39 | 45.83 | 41.32
128 | 50.27 | 44.50 | 46.49 | 41.25

Table 6: Ablation of the number of prompts used to generate data. With the increase in prompt diversity, there is a continuous improvement in model performance, indicating that prompt diversity is indeed beneficial for enhancing model performance.

Effect of generative model diversity. We choose two commonly used generative models, Stable Diffusion [22] (SD) and DeepFloyd-IF [24] (IF), and generate 1k images per category with each model, totaling 1,200k each. For the mixed dataset (SD + IF), we take 600k images from SD and 600k from IF (0.5k per category from each) to keep the total dataset scale consistent. The baseline uses no generative data (none). As shown in Table 7, using data generated by either SD or IF alone improves performance, and mixing the generative data of both leads to further significant gains. This demonstrates that increasing generative model diversity is beneficial for improving model performance.

Model | AP^box | AP^mask | AP_r^box | AP_r^mask
none | 47.50 | 42.32 | 41.39 | 36.83
SD [22] | 48.13 | 42.82 | 43.68 | 39.15
IF [24] | 49.44 | 43.75 | 42.96 | 37.91
SD + IF | 50.78 | 45.27 | 48.94 | 44.35

Table 7: Ablation of different generative models. Increasing generative model diversity is beneficial for improving model performance.

Effect of annotation strategy. X-Paste [34] uses four models (U2Net [20], SelfReformer [31], UFO [25], and CLIPseg [17]) to generate masks and selects the one with the highest CLIP score. We compare our proposed annotation strategy (SAM-bg) to that of X-Paste (max CLIP). In Table 8, SAM-bg outperforms the max CLIP strategy across all metrics, indicating that our proposed strategy produces better annotations and improves model performance. As shown in Figure 5, SAM-bg unlocks the potential capability of SAM, obtaining precise and refined masks.

Figure 5: Examples of object masks from different annotation strategies. SAM-bg obtains more complete and delicate masks.

Strategy | AP^box | AP^mask | AP_r^box | AP_r^mask
max CLIP [34] | 49.10 | 43.45 | 42.75 | 37.55
SAM-bg | 49.44 | 43.75 | 42.96 | 37.91

Table 8: Ablation of different annotation strategies. Our proposed SAM-bg produces better annotations, improving model performance.

Effect of CLIP inter-similarity. We compare our proposed CLIP inter-similarity to CLIP score [34]. The results are shown in Table 9. The performance of data filtered by CLIP inter-similarity is higher than that of CLIP score, demonstrating that CLIP inter-similarity can filter low-quality images more effectively.

Strategy | AP^box | AP^mask | AP_r^box | AP_r^mask
none | 49.44 | 43.75 | 42.96 | 37.91
CLIP score [34] | 49.84 | 44.27 | 44.83 | 40.82
CLIP inter-similarity | 50.07 | 44.44 | 45.53 | 41.16

Table 9: Ablation of different filtration strategies. Our proposed CLIP inter-similarity filters low-quality images more effectively.

5 Conclusions

In this paper, we explain the role of generative data augmentation from the perspective of data distribution discrepancies and find that generative data can expand the data distribution that the model can learn, mitigating overfitting to the training set. Furthermore, we find that the diversity of generative data is crucial for improving model performance. Therefore, we design an efficient data diversity enhancement strategy, Generative Data Diversity Enhancement, which increases data diversity from the aspects of category diversity, prompt diversity, and generative model diversity. Finally, we optimize the data generation pipeline by designing the annotation strategy SAM-background to obtain higher-quality annotations and introducing the metric CLIP inter-similarity to filter data, further improving the quality of the generative dataset. Through these strategies, our proposed method significantly outperforms existing strong models. We hope DiverGen can provide new insights and inspiration for future research on the effectiveness and efficiency of generative data augmentation.
