From: Dual modality prompt learning for visual question-grounded answering in robotic surgery
Models | Visual feature: detection | Visual feature: inference speed | EndoVis-18 ACC | EndoVis-18 F-score | EndoVis-18 mIoU | EndoVis-17 ACC | EndoVis-17 F-score | EndoVis-17 mIoU |
---|---|---|---|---|---|---|---|---|
VisualBERT [29] | FRCNN [32] | 55.28 ms | 0.5973 | 0.3223 | 0.7340 | 0.4382 | 0.3743 | 0.6822 |
VisualBERT R [30] | FRCNN [32] | 55.28 ms | 0.6064 | 0.3226 | 0.7305 | 0.4267 | 0.3506 | 0.6947 |
MCAN [31] | FRCNN [32] | 55.28 ms | 0.6084 | 0.3428 | 0.7257 | 0.4258 | 0.3035 | 0.6832 |
VQA-DeiT [24] | FRCNN [32] | 55.28 ms | 0.6049 | 0.3238 | 0.7217 | 0.4492 | 0.3213 | 0.7134 |
MUTAN [33] | FRCNN [32] | 55.28 ms | 0.6049 | 0.3238 | 0.7217 | 0.4364 | 0.3206 | 0.6870 |
MFH [34] | FRCNN [32] | 55.28 ms | 0.6179 | 0.3158 | 0.7227 | 0.3729 | 0.2048 | 0.7183 |
BlockTucker [35] | FRCNN [32] | 55.28 ms | 0.6067 | 0.3414 | 0.7313 | 0.4364 | 0.3210 | 0.6825 |
CAT-ViL DeiT [15] | FRCNN [32] | 55.28 ms | 0.6192 | 0.3521 | 0.7482 | 0.4555 | 0.3676 | 0.7049 |
DMPL (Ours) | FRCNN [32] | 55.28 ms | 0.6461 | 0.4930 | 0.7620 | 0.4760 | 0.3800 | 0.7138 |
VisualBERT [29] | ResNet18 [36] | 6.64 ms | 0.6268 | 0.3329 | 0.7391 | 0.4005 | 0.3381 | 0.7073 |
VisualBERT R [30] | ResNet18 [36] | 6.64 ms | 0.6301 | 0.3390 | 0.7352 | 0.4190 | 0.3370 | 0.7137 |
MCAN [31] | ResNet18 [36] | 6.64 ms | 0.6285 | 0.3338 | 0.7526 | 0.4137 | 0.2932 | 0.7029 |
VQA-DeiT [24] | ResNet18 [36] | 6.64 ms | 0.6104 | 0.3156 | 0.7341 | 0.3797 | 0.2858 | 0.6909 |
MUTAN [33] | ResNet18 [36] | 6.64 ms | 0.6283 | 0.3395 | 0.7639 | 0.4242 | 0.3482 | 0.7218 |
MFH [34] | ResNet18 [36] | 6.64 ms | 0.6283 | 0.3254 | 0.7592 | 0.4103 | 0.3500 | 0.7216 |
BlockTucker [35] | ResNet18 [36] | 6.64 ms | 0.6201 | 0.3286 | 0.7653 | 0.4221 | 0.3515 | 0.7288 |
CAT-ViL DeiT [15] | ResNet18 [36] | 6.64 ms | 0.6452 | 0.3321 | 0.7705 | 0.4491 | 0.3622 | 0.7322 |
DMPL (Ours) | ResNet18 [36] | 6.64 ms | 0.6953 | 0.5137 | 0.7827 | 0.4957 | 0.3717 | 0.7436 |