Hierarchical Synergy of Swin and ViT: A Dynamic Fusion Framework for Fine-Grained Classification of Chinese Herbal Medicine Images
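Abstract
... fall short in capturing detail and structure: CNNs struggle to model long-range dependencies, while ViT is limited on small-scale data classification because its global attention mechanism is computationally expensive and it needs large amounts of training data. This paper proposes a dual-branch fusion model that combines Swin Transformer and Vision Transformer (ViT), exploiting the complementary properties of local window attention and global self-attention. It also adopts an optimization strategy of freezing the shallow-layer ViT parameters, which reduces computational cost. The model targets fine-grained plant classification and offers an efficient model for Chinese herbal medicine identification.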
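The cost gap between global self-attention and local window attention that motivates the design can be made concrete with a small calculation. The sketch below uses the standard cost estimates given in the Swin Transformer paper (softmax cost omitted): 4hwC² + 2(hw)²C for global multi-head self-attention and 4hwC² + 2M²hwC for window attention with window size M. The function names and the example sizes (a 56×56 feature map, 96 channels, 7×7 windows) are illustrative assumptions, not values taken from this paper.

```python
def global_msa_cost(h: int, w: int, c: int) -> int:
    """Estimated multiply-adds for global self-attention on an h x w map with c channels."""
    n = h * w                      # number of tokens
    projections = 4 * n * c * c    # Q, K, V and output projections
    attention = 2 * n * n * c      # QK^T and attention-weighted sum; quadratic in n
    return projections + attention


def window_msa_cost(h: int, w: int, c: int, m: int) -> int:
    """Estimated multiply-adds for window attention with non-overlapping m x m windows."""
    n = h * w
    projections = 4 * n * c * c    # same projections as the global case
    attention = 2 * m * m * n * c  # each token attends to only m*m tokens; linear in n
    return projections + attention


# Illustrative sizes (assumed, not from the paper).
g = global_msa_cost(56, 56, 96)
l = window_msa_cost(56, 56, 96, 7)
print(f"global: {g:,}  window: {l:,}  ratio: {g / l:.1f}x")
```

With these assumed sizes the global variant costs roughly 14 times as much as the windowed one, and the gap widens as the feature map grows, because only the global attention term is quadratic in the number of tokens. When one window covers the whole map, the two estimates are equal.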
DOI: http://dx.doi.org/10.12361/2661-376X-07-05-171642

