Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance in challenging tasks such as object detection and semantic segmentation. However, the image processing mechanism of VTs is different from that of conventional CNNs.
This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as feature extractors in object detection and semantic segmentation. Our extensive empirical results show that features generated by VTs are more robust to distribution shifts, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that VTs in dense prediction tasks produce more reliable and less texture-biased predictions.