Posted by Zizhao Zhang, Software Engineer, Google Research, Cloud AI Team

In visual understanding, the Vision Transformer (ViT) and its variants have received significant attention recently due to their superior performance on many core visual applications, such as image classification, object detection, and video understanding. The core idea of ViT is to utilize the power of self-attention layers to learn global relationships between small patches of images. However, the number of connections between patches increases quadratically with image size. Such a design has been observed to be data inefficient: although the original ViT can perform better than convolutional networks with hundreds of millions of images for pre-training, such a data requirement is not always practical, and it still underperforms compared to convolutional networks when given less data. Many researchers are exploring more suitable architectural re-designs that can learn visual representations effectively, such as by adding convolutional layers and building hierarchical structures with local self-attention.

The principle of hierarchical structure is one of the core ideas in vision models: bottom layers learn more local object structures in the high-dimensional pixel space, and top layers learn more abstracted, high-level knowledge in a low-dimensional feature space. Existing ViT-based methods focus on designing a variety of modifications inside self-attention layers to achieve such a hierarchy, but while these offer promising performance improvements, they often require substantial architectural re-designs. Moreover, these approaches lack an interpretable design, so it is difficult to explain the inner workings of trained models.

To address these challenges, in "Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding", we present a rethinking of existing hierarchical structure-driven designs and provide a novel, orthogonal approach that significantly simplifies them. The central idea of this work is to decouple the feature learning and feature abstraction (pooling) components: nested transformer layers encode visual knowledge of image patches separately, and then the processed information is aggregated. This process is repeated in a hierarchical manner, resulting in a pyramid network structure. The resulting architecture achieves competitive results on ImageNet and outperforms existing methods on data-efficient benchmarks. We have shown that such a design can meaningfully improve data efficiency with faster convergence and provide valuable interpretability benefits. Moreover, we introduce GradCAT, a new technique for interpreting the decision process of a trained model at inference time (a sketch of the idea appears at the end of this post).

The overall architecture is simple to implement by adding just a few lines of Python code to the source code of the original ViT. The original ViT architecture divides an input image into small patches, projects the pixels of each patch to a vector of predefined dimension, and then feeds the sequence of all vectors into multiple stacked, identical transformer layers. While every layer in ViT processes information from the whole image, with this new method stacked transformer layers process only a region (i.e., block) of the image containing a few spatially adjacent image patches.
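To make the ViT baseline concrete, below is a minimal PyTorch sketch of the front end described above: split the image into patches, project each patch to a vector, and run the full sequence through stacked transformer layers. The class name, sizes, and depths here are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ToyViT(nn.Module):
    """Minimal ViT front end: patchify, embed, and globally self-attend."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, heads=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size projects each patch's
        # pixels to a single vector of size `dim`.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        return self.encoder(x + self.pos)    # every layer attends over all patches

tokens = ToyViT()(torch.randn(2, 3, 224, 224))   # -> (2, 196, 192)
```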
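The nested modification can then be sketched as follows, under my reading of the description above: the same transformer layers are applied within each local block of patches rather than globally, and an aggregation (pooling) step merges neighboring positions before the next stage. The `blockify`/`deblockify` helpers and the plain max-pool aggregation are simplifications of mine; the released implementation may aggregate differently (e.g., with a convolution before pooling).

```python
import torch
import torch.nn as nn

def blockify(x, win):
    # (B, H, W, D) -> (B * num_blocks, win*win, D); each block holds
    # spatially adjacent patches, so attention stays local.
    B, H, W, D = x.shape
    x = x.reshape(B, H // win, win, W // win, win, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, D)

def deblockify(x, B, H, W, win):
    # Inverse of blockify: stitch blocks back into a (B, H, W, D) feature map.
    D = x.shape[-1]
    x = x.reshape(B, H // win, W // win, win, win, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)

class NestStage(nn.Module):
    """One stage: self-attention inside each block, then pooled aggregation."""
    def __init__(self, dim=192, depth=2, heads=3, win=4):
        super().__init__()
        self.win = win
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.pool = nn.MaxPool2d(2)          # aggregate 2x2 neighborhoods

    def forward(self, x):                    # x: (B, H, W, D), H and W divisible by win
        B, H, W, D = x.shape
        blocks = self.encoder(blockify(x, self.win))   # each block processed separately
        x = deblockify(blocks, B, H, W, self.win)
        x = self.pool(x.permute(0, 3, 1, 2))           # (B, D, H/2, W/2)
        return x.permute(0, 2, 3, 1)                   # (B, H/2, W/2, D)

x = torch.randn(2, 16, 16, 192)              # e.g., a 16x16 grid of patch embeddings
y = NestStage()(x)                           # -> (2, 8, 8, 192): one level up the pyramid
```

Stacking a few such stages yields the pyramid: each stage quarters the number of spatial positions while the attention cost stays bounded by the fixed block size.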
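Finally, a loose illustration of the GradCAT idea mentioned above: because each block is processed separately, the gradient of the predicted class's logit with respect to per-block features yields a per-block relevance score, and following the highest-scoring block at each level traces a decision path down the pyramid. The gradient-times-activation scoring and the feature-capture plumbing below are assumptions for illustration, not the paper's exact algorithm.

```python
import torch

def gradcat_trace(model, image):
    # Assumes `model` records per-stage block features during its forward pass
    # in `model.block_feats`: a list of (num_blocks, tokens, dim) tensors on
    # which .retain_grad() was called. This plumbing is hypothetical.
    logits = model(image.unsqueeze(0))        # (1, num_classes)
    pred = int(logits.argmax())
    logits[0, pred].backward()                # fills .grad on each stage's features
    path = []
    for feats in reversed(model.block_feats): # start at the top of the pyramid
        relevance = (feats.grad * feats).sum(dim=(1, 2))  # gradient x activation
        path.append(int(relevance.argmax()))  # most decision-relevant block
    return pred, path
```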