Dual modality prompt learning for visual question-grounded answering in robotic surgery

Visual Computing for Industry, Biomedicine, and Art

Table 1 Evaluations of different models on EndoVis-18 [25] and EndoVis-17 [26] datasets

Model	Visual feature		EndoVis-18			EndoVis-17		
	Detection	Inference speed	ACC	F-score	mIoU	ACC	F-score	mIoU
VisualBERT [29]	FRCNN [32]	55.28 ms	0.5973	0.3223	0.7340	0.4382	0.3743	0.6822
VisualBERT R [30]	FRCNN [32]	55.28 ms	0.6064	0.3226	0.7305	0.4267	0.3506	0.6947
MCAN [31]	FRCNN [32]	55.28 ms	0.6084	0.3428	0.7257	0.4258	0.3035	0.6832
VQA-DeiT [24]	FRCNN [32]	55.28 ms	0.6049	0.3238	0.7217	0.4492	0.3213	0.7134
MUTAN [33]	FRCNN [32]	55.28 ms	0.6049	0.3238	0.7217	0.4364	0.3206	0.6870
MFH [34]	FRCNN [32]	55.28 ms	0.6179	0.3158	0.7227	0.3729	0.2048	0.7183
BlockTucker [35]	FRCNN [32]	55.28 ms	0.6067	0.3414	0.7313	0.4364	0.3210	0.6825
CAT-ViL DeiT [15]	FRCNN [32]	55.28 ms	0.6192	0.3521	0.7482	0.4555	0.3676	0.7049
DMPL (Ours)	FRCNN [32]	55.28 ms	0.6461	0.4930	0.7620	0.4760	0.3800	0.7138
VisualBERT [29]	ResNet18 [36]	6.64 ms	0.6268	0.3329	0.7391	0.4005	0.3381	0.7073
VisualBERT R [30]	ResNet18 [36]	6.64 ms	0.6301	0.3390	0.7352	0.4190	0.3370	0.7137
MCAN [31]	ResNet18 [36]	6.64 ms	0.6285	0.3338	0.7526	0.4137	0.2932	0.7029
VQA-DeiT [24]	ResNet18 [36]	6.64 ms	0.6104	0.3156	0.7341	0.3797	0.2858	0.6909
MUTAN [33]	ResNet18 [36]	6.64 ms	0.6283	0.3395	0.7639	0.4242	0.3482	0.7218
MFH [34]	ResNet18 [36]	6.64 ms	0.6283	0.3254	0.7592	0.4103	0.3500	0.7216
BlockTucker [35]	ResNet18 [36]	6.64 ms	0.6201	0.3286	0.7653	0.4221	0.3515	0.7288
CAT-ViL DeiT [15]	ResNet18 [36]	6.64 ms	0.6452	0.3321	0.7705	0.4491	0.3622	0.7322
DMPL (Ours)	ResNet18 [36]	6.64 ms	0.6953	0.5137	0.7827	0.4957	0.3717	0.7436