OpenAlex Citation Counts

OpenAlex is an open-access bibliographic catalogue of scientific papers, authors, and institutions, named after the Library of Alexandria. Its citation coverage is excellent, and I hope you will find this listing of citing articles useful!

If you click an article title, you'll navigate to the article as listed in CrossRef. If you click an Open Access link, you'll navigate to the "best Open Access location". Clicking the citation count will open this listing for that article. Lastly, at the bottom of the page, you'll find basic pagination options.

Requested Article:

An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou, Yichong Xu, Zhe Gan, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Open Access | Times Cited: 218

Showing 1-25 of 218 citing articles:

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Yupan Huang, Tengchao Lv, Lei Cui, et al.
Proceedings of the 30th ACM International Conference on Multimedia (2022)
Open Access | Times Cited: 256

VLP: A Survey on Vision-language Pre-training
Feilong Chen, Duzhen Zhang, Minglun Han, et al.
Machine Intelligence Research (2023) Vol. 20, Iss. 1, pp. 38-56
Open Access | Times Cited: 128

Scaling Language-Image Pre-Training via Masking
Yanghao Li, Haoqi Fan, Ronghang Hu, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Open Access | Times Cited: 125

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush, Ryan Jiang, Max Bartolo, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Open Access | Times Cited: 115

Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Ding Jiang, Mang Ye
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Open Access | Times Cited: 105

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Open Access | Times Cited: 97

Generalized Decoding for Pixel, Image, and Language
Xueyan Zou, Zi-Yi Dou, Jianwei Yang, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), pp. 15116-15127
Open Access | Times Cited: 95

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li, Haiyang Xu, Junfeng Tian, et al.
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2022)
Open Access | Times Cited: 93

Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey
Xiao Wang, Guangyao Chen, Guangwu Qian, et al.
Machine Intelligence Research (2023) Vol. 20, Iss. 4, pp. 447-482
Open Access | Times Cited: 92

Injecting Semantic Concepts into End-to-End Image Captioning
Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), pp. 17988-17998
Open Access | Times Cited: 84

Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
Shruthi Bannur, Stephanie L. Hyland, Qianchu Liu, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), pp. 15016-15027
Open Access | Times Cited: 51

Pedestrian-specific Bipartite-aware Similarity Learning for Text-based Person Retrieval
Fei Shen, Xiangbo Shu, Xiaoyu Du, et al.
(2023), pp. 8922-8931
Closed Access | Times Cited: 43

PMC-CLIP: Contrastive Language-Image Pre-training Using Biomedical Documents
Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, et al.
Lecture notes in computer science (2023), pp. 525-536
Closed Access | Times Cited: 42

Multimodal Large Language Models: A Survey
Jiayang Wu, Wensheng Gan, Zefeng Chen, et al.
2021 IEEE International Conference on Big Data (Big Data) (2023)
Open Access | Times Cited: 42

FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions
Noam Rotstein, David Bensaïd, Shaked Brody, et al.
2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2024), pp. 5677-5688
Open Access | Times Cited: 15

Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training
Zhihong Chen, Yuhao Du, Jinpeng Hu, et al.
Lecture notes in computer science (2022), pp. 679-689
Closed Access | Times Cited: 63

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, et al.
Lecture notes in computer science (2022), pp. 521-539
Closed Access | Times Cited: 53

Image-text Retrieval: A Survey on Recent Research and Development
Min Cao, Shiping Li, Juntao Li, et al.
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (2022), pp. 5410-5417
Open Access | Times Cited: 47

VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng, Xizi Wang, Jie Lei, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Open Access | Times Cited: 38

GeoLayoutLM: Geometric Pre-training for Visual Information Extraction
Chuwei Luo, Changxu Cheng, Zheng Qi, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Open Access | Times Cited: 33

BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning
Xu Xiao, Chenfei Wu, Shachar Rosenman, et al.
Proceedings of the AAAI Conference on Artificial Intelligence (2023) Vol. 37, Iss. 9, pp. 10637-10647
Open Access | Times Cited: 30

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-Jui Fu, Linjie Li, Zhe Gan, et al.
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), pp. 22898-22909
Open Access | Times Cited: 30

Deep image captioning: A review of methods, trends and future challenges
Liming Xu, Quan Tang, Jiancheng Lv, et al.
Neurocomputing (2023) Vol. 546, pp. 126287-126287
Closed Access | Times Cited: 29

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick, Yale Song, Sayan Nag, et al.
2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2023), pp. 5262-5274
Open Access | Times Cited: 23

Dynamic Contrastive Distillation for Image-Text Retrieval
Jun Rao, Liang Ding, Shuhan Qi, et al.
IEEE Transactions on Multimedia (2023) Vol. 25, pp. 8383-8395
Open Access | Times Cited: 22

Page 1 - Next Page
