Table 1 Evaluations of different models on EndoVis-18 [25] and EndoVis-17 [26] datasets

From: Dual modality prompt learning for visual question-grounded answering in robotic surgery

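| Models | Visual feature: detection | Visual feature: inference speed | EndoVis-18 ACC | EndoVis-18 F-score | EndoVis-18 mIoU | EndoVis-17 ACC | EndoVis-17 F-score | EndoVis-17 mIoU |
|---|---|---|---|---|---|---|---|---|
| VisualBERT [29] | FRCNN [32] | 55.28 ms | 0.5973 | 0.3223 | 0.7340 | 0.4382 | 0.3743 | 0.6822 |
| VisualBERT R [30] | FRCNN [32] | 55.28 ms | 0.6064 | 0.3226 | 0.7305 | 0.4267 | 0.3506 | 0.6947 |
| MCAN [31] | FRCNN [32] | 55.28 ms | 0.6084 | 0.3428 | 0.7257 | 0.4258 | 0.3035 | 0.6832 |
| VQA-DeiT [24] | FRCNN [32] | 55.28 ms | 0.6049 | 0.3238 | 0.7217 | 0.4492 | 0.3213 | 0.7134 |
| MUTAN [33] | FRCNN [32] | 55.28 ms | 0.6049 | 0.3238 | 0.7217 | 0.4364 | 0.3206 | 0.6870 |
| MFH [34] | FRCNN [32] | 55.28 ms | 0.6179 | 0.3158 | 0.7227 | 0.3729 | 0.2048 | 0.7183 |
| BlockTucker [35] | FRCNN [32] | 55.28 ms | 0.6067 | 0.3414 | 0.7313 | 0.4364 | 0.3210 | 0.6825 |
| CAT-ViL DeiT [15] | FRCNN [32] | 55.28 ms | 0.6192 | 0.3521 | 0.7482 | 0.4555 | 0.3676 | 0.7049 |
| DMPL (Ours) | FRCNN [32] | 55.28 ms | 0.6461 | 0.4930 | 0.7620 | 0.4760 | 0.3800 | 0.7138 |
| VisualBERT [29] | ResNet18 [36] | 6.64 ms | 0.6268 | 0.3329 | 0.7391 | 0.4005 | 0.3381 | 0.7073 |
| VisualBERT R [30] | ResNet18 [36] | 6.64 ms | 0.6301 | 0.3390 | 0.7352 | 0.4190 | 0.3370 | 0.7137 |
| MCAN [31] | ResNet18 [36] | 6.64 ms | 0.6285 | 0.3338 | 0.7526 | 0.4137 | 0.2932 | 0.7029 |
| VQA-DeiT [24] | ResNet18 [36] | 6.64 ms | 0.6104 | 0.3156 | 0.7341 | 0.3797 | 0.2858 | 0.6909 |
| MUTAN [33] | ResNet18 [36] | 6.64 ms | 0.6283 | 0.3395 | 0.7639 | 0.4242 | 0.3482 | 0.7218 |
| MFH [34] | ResNet18 [36] | 6.64 ms | 0.6283 | 0.3254 | 0.7592 | 0.4103 | 0.3500 | 0.7216 |
| BlockTucker [35] | ResNet18 [36] | 6.64 ms | 0.6201 | 0.3286 | 0.7653 | 0.4221 | 0.3515 | 0.7288 |
| CAT-ViL DeiT [15] | ResNet18 [36] | 6.64 ms | 0.6452 | 0.3321 | 0.7705 | 0.4491 | 0.3622 | 0.7322 |
| DMPL (Ours) | ResNet18 [36] | 6.64 ms | 0.6953 | 0.5137 | 0.7827 | 0.4957 | 0.3717 | 0.7436 |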
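
As a reading aid, the gap between DMPL and the strongest listed baseline, CAT-ViL DeiT [15], can be worked out directly from the final two rows of Table 1. The sketch below is illustrative only: the values are copied from those rows, and the variable and function names (`METRICS`, `CAT_VIL_DEIT`, `DMPL`, `absolute_gains`) are hypothetical and do not come from the paper.

```python
# Scores copied from the last two rows of Table 1.
# Column order: EndoVis-18 (ACC, F-score, mIoU), then EndoVis-17 (ACC, F-score, mIoU).
METRICS = [
    "EndoVis-18 ACC", "EndoVis-18 F-score", "EndoVis-18 mIoU",
    "EndoVis-17 ACC", "EndoVis-17 F-score", "EndoVis-17 mIoU",
]
CAT_VIL_DEIT = [0.6452, 0.3321, 0.7705, 0.4491, 0.3622, 0.7322]
DMPL = [0.6953, 0.5137, 0.7827, 0.4957, 0.3717, 0.7436]


def absolute_gains(ours, baseline):
    """Return per-metric differences (ours - baseline), rounded to 4 decimals."""
    return [round(o - b, 4) for o, b in zip(ours, baseline)]


gains = dict(zip(METRICS, absolute_gains(DMPL, CAT_VIL_DEIT)))

for name, gain in gains.items():
    print(f"{name}: {gain:+.4f}")
```

In this comparison the largest difference is in the EndoVis-18 F-score column (0.5137 against 0.3321), while the mIoU differences are on the order of 0.01.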