TransMix: Attend to Mix for Vision Transformers

Transformer-based architectures are widely used in computer vision. However, transformer-based networks are hard to optimize and can easily overfit when training data is insufficient. A common remedy is to use data augmentation and regularization techniques.

Image credit: Wikitude via Flickr, CC BY-SA 2.0

A recent paper on arXiv.org argues that this approach has its drawbacks, because not all pixels are created equal.

Instead of investigating how to better mix images at the input level, the researchers focus on how to narrow the gap between the input and the label space. The attention maps naturally generated in Vision Transformers turn out to be well suited for this job.
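
In a standard ViT, the class token attends to every image patch in each self-attention layer, so its row of the attention matrix already provides a per-patch importance map. The sketch below is only an illustration of how such a map could be pulled out of one attention block's query/key tensors; the function name, tensor layout, and renormalization step are assumptions for this example, not the paper's code.

```python
import torch

def cls_attention_map(q, k):
    """Class-token attention over image patches from one self-attention layer.

    q, k: (B, heads, 1 + N, head_dim) query/key tensors, class token first.
    Returns (B, N): how strongly the class token attends to each patch,
    averaged over heads and renormalised to sum to 1 per sample.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale       # (B, heads, 1+N, 1+N)
    attn = attn.softmax(dim=-1)
    cls_to_patch = attn[:, :, 0, 1:]               # class-token row, patches only
    cls_to_patch = cls_to_patch.mean(dim=1)        # average over heads
    return cls_to_patch / cls_to_patch.sum(dim=1, keepdim=True)

# Toy shapes: batch 2, 6 heads, 14x14 = 196 patches plus class token, head dim 64.
q = torch.randn(2, 6, 197, 64)
k = torch.randn(2, 6, 197, 64)
print(cls_attention_map(q, k).shape)               # torch.Size([2, 196])
```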

The method can be merged into the training pipeline with no extra parameters and minimal computational overhead. The approach is shown to bring consistent and notable improvements across a wide range of models and tasks, such as object detection and instance segmentation.

Mixup-based augmentation has been found to be effective for generalizing models during training, especially for Vision Transformers (ViTs) since they can easily overfit. However, previous mixup-based methods rest on the underlying prior that the ratio used to linearly interpolate the targets should match the ratio used in the input interpolation. This can lead to a strange phenomenon: because of the random process in augmentation, sometimes there is no valid object in the mixed image, yet there is still a response in the label space. To bridge this gap between the input and label spaces, we propose TransMix, which mixes labels based on the attention maps of Vision Transformers. The confidence of a label is larger if its corresponding input is weighted more heavily by the attention map. TransMix is embarrassingly simple and can be implemented in just a few lines of code without introducing any extra parameters or FLOPs to ViT-based models. Experimental results show that our method can consistently improve ViT-based models of various scales on ImageNet classification. After pre-training with TransMix on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection and instance segmentation. TransMix also proves more robust when evaluated on 4 different benchmarks. Code will be made publicly available at this https URL.
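
Concretely, the label-mixing step can be read as follows: on top of a CutMix-style mixed image, the target weight is taken from how much of the model's class-token attention falls on the pasted region. The PyTorch sketch below illustrates that idea under simplifying assumptions (a mask already flattened to the patch grid, attention already averaged over heads); the function name transmix_targets and the tensor layout are hypothetical and not taken from the paper's released code.

```python
import torch

def transmix_targets(attn, mask, y_a, y_b):
    """Re-weight CutMix labels with the model's attention map.

    attn : (B, N) class-token attention over the N image patches, summing
           to 1 per sample.
    mask : (B, N) binary mask on the same patch grid, 1 where the patch
           comes from image B (the pasted region), 0 otherwise.
    y_a, y_b : (B, num_classes) one-hot or soft targets of the two images.
    """
    # Fraction of the model's attention that falls on the pasted region.
    lam = (attn * mask).sum(dim=1, keepdim=True)        # (B, 1)
    # The image that receives more attention gets a larger label weight.
    return lam * y_b + (1.0 - lam) * y_a

# Toy usage: batch 4, 14x14 = 196 patches, 1000 classes.
B, N, C = 4, 196, 1000
attn = torch.softmax(torch.randn(B, N), dim=1)
mask = (torch.rand(B, N) > 0.5).float()                 # stand-in for a CutMix box mask
y_a = torch.nn.functional.one_hot(torch.randint(C, (B,)), C).float()
y_b = torch.nn.functional.one_hot(torch.randint(C, (B,)), C).float()
print(transmix_targets(attn, mask, y_a, y_b).shape)     # torch.Size([4, 1000])
```

In this reading, a mixed image whose pasted patch contains no salient object draws little attention, so its label weight shrinks accordingly, which is the mismatch between input and label space that the abstract describes.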

Research paper: Chen, J.-N., Sun, S., He, J., Torr, P., Yuille, A., and Bai, S., “TransMix: Attend to Mix for Vision Transformers”, 2021. Link: https://arxiv.org/abs/2111.09833