Dual modality prompt learning for visual question-grounded answering in robotic surgery

With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of the VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.


Introduction
Visual question answering (VQA) has emerged as a pivotal multimodal task in recent years, seamlessly integrating visual and language components. With the development of deep learning, VQA systems have the potential to serve as auxiliary tools with extensive applications in the healthcare domain, offering valuable assistance to physicians in diagnosis and decision-making [1][2][3]. Most existing deep-learning-based VQA models primarily generate and validate textual responses [4][5][6]. However, such a validation mechanism is too simple to ensure that the model correctly answers questions based on visual content. Consequently, research efforts have been dedicated to enhancing answer validation mechanisms for a more reliable VQA system [7][8][9][10].
Recently, some datasets in the field of VQA have incorporated a grounded answer validation mechanism [11][12][13]. This verification process involves the accurate identification of specific regions in the image corresponding to the textual answers, thereby enhancing both the accuracy (ACC) and interpretability of the answers. These developments have increased the feasibility of developing secure and reliable VQA models for medical applications. Subsequently, some works [14][15][16] introduced visual question-grounded answering (VQGA) specifically for robotic surgery, to establish a correspondence between the answers and the spatial location of objects within the surgical scene. Bai et al. [15] proposed the co-attention gated vision-language embedding (CAT-ViL) and surgical visual question localized-answering (Surgical-VQLA) [14] models to enhance the embedding of multimodal information. This enhancement facilitates the generation of textual and corresponding grounded answers, thereby enabling a better understanding of the surgical scene. Similarly, the CS-VQLA [16] model addresses continual model learning in complex surgical scenes through distillation, providing more accurate and localized answers.
Although these models effectively integrate question and image information to improve system ACC, their dependence on the visual and textual cues of the current surgical scenario hampers their ability to comprehensively address diverse instances within the same surgical context when answering relevant questions. Consequently, augmenting the VQGA model's understanding of global knowledge is imperative for refining the precise localization of local information. This refinement is crucial for meeting the stringent demands of real-time decision-making and ACC in surgical operations.
In this study, we propose a framework that combines prompt learning and pre-training models for VQGA in robotic surgery, enabling the model to attend to global information based on the prompted knowledge. To better utilize the knowledge in the prompts and supplement the visual and semantic information of each image, we designed a cross-modal prompt interaction mechanism that integrates the prompted knowledge into the encoding end of the CAT-ViL [15] model through cross-attention, focusing the model's attention on capturing fine-grained information for subsequent instance matching, grounding the answers in images and text. Specifically, two simple and lightweight prompt fusion modules are inserted into the encoding end of the base model. The visual complementary prompter (VCP) integrates visual box prompt features with the original image encoding, whereas the textual complementary prompter (TCP) fuses visual box prompt features with the original question encoding and label prompt features. This combination allows the model to absorb rich visual and semantic information from the prompted knowledge at the encoding end. Notably, an iterative fusion strategy is adopted to hierarchically superimpose and interact with different information, aiming to promote better alignment and fusion between the different modal and question prompt features. The contributions of this study are summarized as follows: 1) A VQGA framework based on dual-modality prompt learning is proposed to effectively explore the prompted knowledge in robotic surgery tasks for better grounded answer generation. 2) Two prompters are proposed to complement visual and textual information, whereby a multiple iterative fusion strategy is adopted to encode the knowledge in prompts and multimodal features, facilitating the aggregation of complementary information across multiple feature levels. 3) Using the proposed model, state-of-the-art performance was obtained on two challenging datasets, surpassing the achievements of previous studies by a considerable margin.
Related work

Explainable VQA systems have proven to be reliable in the medical field [9,11,12]. With the medical artificial intelligence field flourishing [7,17], VQA systems for robotic surgery have been extensively developed and applied. VQA systems can answer questions about specific visual elements in surgical videos or images [8,16]. For example, surgeons can ask the system questions about certain aspects of the surgical field, such as identifying specific tissues, organs, and surgical tools. However, a key problem with robotic-surgery-based VQA systems is their lack of interpretability. Although these systems can provide textual answers to questions, they cannot highlight the relevant regions of the image corresponding to the textual answers. Surgical scenarios often involve various instruments and actions that can confuse the questioner. To help questioners deal with this confusion, researchers [14,17] proposed the establishment of a VQGA system to effectively learn and understand surgical scenes.
The Surgical-VQLA [14] model combines a visual transformer with a gated visual-linguistic embedding system to accurately locate specific surgical areas during answer prediction. The CAT-ViL model proposed by Bai et al. [15] achieves answer grounding by emphasizing the effective integration of multimodal inputs. The latest CS-VQLA [16] model utilizes distillation to achieve continual learning, resulting in more accurate textual and grounded answers. These studies emphasize the importance of integrating visual and language data in robotic surgery to improve the ACC and efficiency of answers. However, to enhance their practical applicability, these models must achieve higher ACC levels and utilize the rich visual and semantic information inherent in the images and text more effectively.
In recent years, there has been rapid development in the multimodal pre-training of large-scale models. Researchers often use fine-tuning to leverage pretrained large models for downstream tasks. This method is often parameter-inefficient, usually requiring numerous task-specific copies and substantial storage for each version of the fully pretrained model. Recently, the emergence of prompt learning as a new paradigm has drastically improved the performance of various downstream natural language processing tasks [18,19], and has been effective in several computer vision tasks. For example, the Visual Prompt Tuning model proposed by Jia et al. [20] adds a set of learnable parameters to transformer encoders and has been demonstrated to be superior to full fine-tuning in 20 downstream recognition tasks. The AdaptFormer network introduced by Chen et al. [21] integrates lightweight modules into the vision transformer, achieving better results than fully fine-tuned models on action recognition benchmarks. The convolutional bypass model proposed by Jie and Deng [22] utilizes convolutional bypasses in pretrained vision transformers for prompt learning. The designs of these prompts often strategically utilize the prior knowledge and capabilities of the model to direct attention or provide explanations for the expected results. The effectiveness of prompt learning largely depends on the architecture of the underlying model and training on relevant datasets. Inspired by this, we introduce prompt learning into the VQGA system for further exploration, incorporating dual-modality prompt knowledge from both visual and textual sources. A novel prompt-learning framework specifically tailored for VQGA was developed to better utilize the potential knowledge contained in each modality's prompt. To the best of our knowledge, this is the first study to apply prompt learning to VQGA systems for robotic surgery.

Architectural overview
As shown in Fig. 1a, we propose a VQGA framework based on dual-modality prompts. Initially, the framework utilizes the pretrained CAT-ViL [15] model to extract prompts for visual and textual knowledge, referred to as box and label prompt features, respectively. CAT-ViL is a model trained for the VQLA task and can directly output prompt features under an appropriate configuration. The questions and images are input into the model, which outputs three localized visual features and three corresponding label features. These serve as the prompt features in our model. To better utilize the prompt knowledge from both modalities, we designed complementary visual and textual prompters. These are integrated with the existing pretrained model through a layered iterative fusion approach, using the prompt information to guide the interaction and fusion of multimodal information.
The overall architecture of DMPL for VQGA in robotic surgery is presented in Fig. 1. The proposed network components include the VCP and TCP. The pretrained CAT-ViL model [15] is leveraged to extract the visual box prompt and textual label prompt features. The VCP integrates visual prompt and visual features through a cross-attention mechanism, aiming to guide the localization of visual features using the features of the visual prompts. The TCP aligns question and visual features using the attentional feature fusion (AFF) module [23], subsequently aligning these with the label prompt features to synchronize the three types of information before guiding them through cross-attention. The goal is to generate the correct answer by jointly guiding textual information with the label prompt and visual features. Six fusion iterations are performed between the prompt features and the extracted original features, as indicated by the blue box in Fig. 1a, to ensure more effective guidance. Subsequently, the refined visual and textual data obtained through complex reasoning are merged using the stacked iterative attentional feature fusion (iAFF) module [23]. The combined data are then decoded along with the post-reasoning visual and textual prompt features. Finally, the model generates the final textual answers and grounded answers through a dedicated classification head and an object detection head.
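The layered iterative fusion described above can be sketched schematically. In the following minimal numpy sketch, `vcp` and `tcp` are placeholder fusion functions standing in for the actual VCP and TCP modules, and all shapes and update rules are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the encoded features (shapes are illustrative only).
visual = rng.standard_normal((25, 64))        # patch tokens from the image encoder
text = rng.standard_normal((12, 64))          # question tokens from the text encoder
box_prompt = rng.standard_normal((3, 64))     # three localized visual (box) prompt features
label_prompt = rng.standard_normal((3, 64))   # three corresponding label prompt features

def vcp(v, b):
    """Placeholder VCP step: blends visual features with box-prompt features."""
    return v + 0.1 * b.mean(axis=0), b + 0.1 * v.mean(axis=0)

def tcp(q, lp, v):
    """Placeholder TCP step: blends question features with label-prompt and visual cues."""
    return q + 0.1 * lp.mean(axis=0) + 0.1 * v.mean(axis=0), lp + 0.1 * q.mean(axis=0)

# Six fusion iterations between prompt features and original features, as in DMPL.
for _ in range(6):
    visual, box_prompt = vcp(visual, box_prompt)
    text, label_prompt = tcp(text, label_prompt, visual)

# Crude stand-in for the stacked iAFF merge of the refined visual and textual data.
fused = np.concatenate([visual.mean(axis=0), text.mean(axis=0)])
print(fused.shape)  # (128,)
```

The point of the sketch is only the control flow: both prompters update their features jointly with the original features across six iterations before a final fusion step feeds the prediction heads.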
To enhance the utilization of these dual-modality prompts, separate prompters were designed for the visual and textual supplementary prompts. These were integrated with the original pretrained model using a layered iterative fusion approach. This method employs prompts to direct the interaction and fusion of multimodal information, thereby enabling more precise generation of textual answers and localization of visually grounded answers. The proposed architecture also enhances the utilization of local visual regions. The TCP incorporates the features of visual regions and aligns them with both textual and text prompt features. This alignment guides the textual information through visual information, thereby mitigating the language bias in VQA models by generating textual answers through reasoning based on visual information. To strengthen the interaction between the different modalities, textual information features are also input into the guided attention module under the guidance of the TCP module. This further guides the localization of the visual region information, enabling the model to generate more accurate textual and grounded visual answers. Subsequently, the visual and textual data refined through layered reasoning are amalgamated using a sophisticated stacked attention mechanism. These combined data are then cohesively encoded alongside the visual and textual prompt features using a data-efficient image transformer (DeiT) [24] encoder. The model was trained using a hybrid loss function that combines cross-entropy loss, L1-norm, and generalized intersection-over-union loss, to ensure a comprehensive learning process. Ultimately, this framework culminates in the generation of the final textual and visually grounded answers, achieved through a dedicated classification head and an object detection head, respectively.

VCP
This section proposes a visual complementary method that enhances multimodal visual information for more accurate inference of the ground-truth answer through more effective utilization of visual prompt knowledge.
First, as shown in Fig. 1b, both the box prompt features obtained from the pretrained model and the visual features encoded by the image encoder are independently encoded using the self-attention mechanism to obtain the prompt features B_s and image encoding V_s, thereby enhancing the internal relationships within each feature. The encoded box prompt features are then downsampled to restrict the range of relevant information. Next, a spatial softmax is applied to the image features V_s, smoothing across all spatial dimensions, and channel-wise spatial attention is employed to generate the enhanced embedded image features V_m:

V_m = softmax(V_s) ⊙ V_s

Second, the downsampled prompt features are effectively integrated with the visual features V_m. Drawing inspiration from the segment anything model, two cross-attention fusion modules are used to amalgamate the enhanced visual information features with the prompt features. The feature V_c is obtained through cross-attention from the prompt (as a query) directed to the embedded image. In contrast, the prompt features B_c are obtained through cross-attention from the embedded image (as a query) directed to the prompts. These two cross-attention mechanisms facilitate the learning of the dependencies between prompt knowledge and visual features. The formulae are as follows:

V_c = softmax(B_d V_m^T / √d) V_m
B_c = softmax(V_m B_d^T / √d) B_d

where T represents the matrix transpose operation, d is the feature dimension, and B_d is obtained after downsampling B_s.
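The two cross-attention directions described above can be illustrated with a minimal single-head sketch; learned projection matrices are omitted and the shapes are illustrative assumptions only:

```python
import numpy as np

def cross_attention(query, context):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(1)
B_d = rng.standard_normal((3, 64))   # downsampled box-prompt features (3 prompts)
V_m = rng.standard_normal((25, 64))  # enhanced embedded image features (25 patches)

V_c = cross_attention(B_d, V_m)  # prompt as query, directed to the embedded image
B_c = cross_attention(V_m, B_d)  # embedded image as query, directed to the prompts
print(V_c.shape, B_c.shape)  # (3, 64) (25, 64)
```

Each direction produces a feature set whose length matches its query, so the two outputs carry prompt-conditioned visual information and image-conditioned prompt information, respectively.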
Third, to better guide the localization of the correct answer position with prompt and textual information, visual features M_i are obtained by passing B_c through a residual structure and a text-guided attention mechanism [15].
Finally, an iterative fusion strategy is adopted to repeat the process described above, integrating valuable insights by adjusting the prompt and visual information of the instances. Thus, a wealth of details relevant to both the visual and textual feature domains is retained throughout the dynamic process of information exchange. The visual features O_i and prompt features Y_i are then used as the new adjusted visual features and box prompt features, which are continuously input into the VCP for iterative updates. This yields the final visual features M_6 and prompt features Y_6. The formula is as follows:

(O_i, Y_i) = Ψ(O_{i−1}, Y_{i−1}), i = 1, …, 6

where Ψ represents the operation of the VCP presented above, and i represents the number of iterations.

TCP
To effectively extract information from the comparison of features and label prompts across different modalities, this study introduces the textual complementary prompter. As illustrated in Fig. 1c, the label prompt features are initially encoded alongside the text information features using a self-attention mechanism, resulting in features denoted as L_s (for label prompts) and Q_s (for text information). Subsequently, the label prompt features L_s undergo upsampling. Drawing inspiration from AFF [23], the positional information of the visual input is projected into two joint feature spaces alongside the label prompt features and text information using the AFF module, thereby aligning multimodal knowledge and enhancing the interaction of information. The specific formula is as follows:

C_1 = Γ_AFF(Γ_AFF(Q_s, β(L_s)), V_s)

where Γ_AFF represents the AFF operation; β represents the upsampling operation; and V_s represents the input image features.
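The AFF-style gated fusion used for alignment can be sketched as follows. The sigmoid channel gate below is a crude stand-in for the full AFF attention module [23], and the shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aff(x, y):
    """Simplified AFF: a sigmoid gate computed from the summed features
    weighs the two inputs channel by channel."""
    s = x + y
    gate = sigmoid(s.mean(axis=0, keepdims=True))  # crude channel attention
    return gate * x + (1.0 - gate) * y

rng = np.random.default_rng(2)
Q_s = rng.standard_normal((12, 64))    # encoded question (text) features
L_up = rng.standard_normal((12, 64))   # upsampled label-prompt features
V_proj = rng.standard_normal((12, 64)) # projected visual positional features

# Two nested fusions mirror the projection into two joint feature spaces.
aligned = aff(aff(Q_s, L_up), V_proj)
print(aligned.shape)  # (12, 64)
```

The gate lies in (0, 1), so the fused output is a convex channel-wise mixture of the two inputs rather than a plain sum, which is the core idea behind attentional feature fusion.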
To derive more meaningful multimodal knowledge prompts from the visually and textually aligned features, the learned multimodal prompts are incrementally integrated into the text feature space using a residual approach. This process utilizes two cross-attention fusion modules to merge the text and prompt features. The text features R_i are acquired by applying cross-attention with the prompts serving as the query over the text encoding. Conversely, the prompt features G_i are obtained by applying cross-attention with the text encoding serving as the query over the prompts. The formulae for this process are as follows:

R_i = softmax(C_i Q_s^T / √d) Q_s
G_i = softmax(Q_s C_i^T / √d) C_i

where T represents the matrix transpose operation; d is the feature dimension; R_i represents the visually guided text features; and G_i represents the multimodal-information-guided prompt features.
As shown in Fig. 1, to avoid excessive guidance from the visual information and to maintain balance during the six iterations, the visual feature information V_i is added only in the first iteration. In subsequent iterations, the prompt multimodal features C_i replace V_i as the input to the module. Throughout the iterations, the features C_i interact repeatedly with the features R_i and G_i through cross-attention, aiming to guide the textual information based on the prompt information. This results in the refined text prompt feature G_6 and text information feature R_6. The formula is as follows:

(R_i, G_i, C_i) = Φ(R_{i−1}, G_{i−1}, C_{i−1}), i = 2, …, 6

where i is an integer that represents the number of iterations, and Φ represents the text complementary prompt module; in the first iteration, the visual features take the place of C. To infer instance boundaries and answers relevant to the question from the text features R_i and the visual features M_i guided by prompt knowledge, we devised a cross-modal feature integration mechanism. This involves the integration of R_6 and M_6 through a fusion module composed of two iAFF modules [23], resulting in the generation of a fused embedding F_i. The fused feature F_i is input into the pretrained DeiT-base [24] module, and through residual connections, F_i is merged with Y_6 and G_6, further refining the relationships between the features within each domain. Finally, the classification output of DeiT is passed through a feedforward network to predict the instance bounding boxes and answers relevant to the question.
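The iteration scheme, in which the visual features enter only the first TCP iteration and the fused multimodal features replace them thereafter, can be sketched as below. The update rule inside `tcp_step` is a placeholder (illustrative assumption), not the actual module:

```python
import numpy as np

rng = np.random.default_rng(3)
R = rng.standard_normal((12, 64))   # text features
G = rng.standard_normal((3, 64))    # label-prompt features
V = rng.standard_normal((25, 64))   # visual features, used only in iteration 1

def tcp_step(r, g, c):
    """Placeholder TCP step: each feature takes a small step toward the
    context mean; the fused features become the next iteration's context."""
    ctx = c.mean(axis=0)
    c_next = np.vstack([r + 0.1 * ctx, g + 0.1 * ctx])  # stand-in for C_i
    return r + 0.1 * ctx, g + 0.1 * ctx, c_next

context = V  # visual guidance enters only the first iteration
for i in range(6):
    R, G, context = tcp_step(R, G, context)

print(R.shape, G.shape)  # (12, 64) (3, 64)
```

After the loop, `R` and `G` play the roles of the refined text feature R_6 and text prompt feature G_6 that feed the subsequent iAFF fusion and DeiT decoding.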

Datasets and evaluation metrics
Experiments were conducted on the EndoVis-2018 [25] and EndoVis-2017 [26] datasets. The EndoVis-2018 dataset comprises video sequences from 14 robotic surgeries [27], with the training set consisting of 1560 frames and 9014 question-answer pairs and the test set comprising 447 frames and 2769 question-answer pairs. The question-answer pairs cover 18 answer categories encompassing various single-word answers related to organs, surgical instruments, and interactions between instruments and organs. For questions involving the interaction between organs and instruments, the bounding box incorporates both the organ and the instrument. Each example video contains multiple question-answer pairs along with their corresponding bounding box annotations [14]. The EndoVis-2017 dataset includes video sequences from 10 robotic surgeries, with 97 frames containing 472 question-answer pairs. This dataset was utilized solely for external validation and not as part of the training set.
The PSI-AVA dataset, specifically designed for robot-assisted radical prostatectomy surgeries, significantly contributes to the field of surgical scene understanding. PSI-AVA-VQA is an innovative dataset featuring question-answer pairs derived from critical surgical instances across eight cases from the comprehensive PSI-AVA surgical scene collection. These pairs were carefully created from annotations related to surgical phases, steps, and locations within the PSI-AVA collection. With 10,291 question-answer pairs, the PSI-AVA-VQA dataset encompasses 35 distinct answer categories, including four locations, 11 surgical phases, and 20 distinct surgical steps. Annotations categorize the pairs into three groups: location, phase, and step, adhering to the original PSI-AVA dataset's fold-1 training/test division methodology.
The reasoning performance of the model was evaluated using the ACC of the textual answers and the precision of the grounded answers. For grounded answer evaluation, the similarity between each bounding box annotation and the ground truth was measured using the intersection over union (IoU), computing the mean IoU (mIoU) scores over all test examples. The textual answers were evaluated using two metrics: ACC and F-score.
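The IoU and mIoU metrics used for grounding evaluation can be computed as follows, with boxes given in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def miou(pred_boxes, gt_boxes):
    """Mean IoU over all test examples."""
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(pred_boxes)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.142857
```

Two unit squares overlapping in a 1×1 region share an intersection of 1 against a union of 7, hence the 1/7 above.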

Implementation details
The proposed model was trained under cross-entropy loss, L1-norm, and generalized IoU loss using the Adam [28] optimizer, with an initial learning rate of 1e-5 for all parameters. The proposed model was trained on the EndoVis-18 training set, with the performance evaluated on the EndoVis-18 validation set, using EndoVis-17 as an external validation dataset to test the model's generalization ability. The experiments were conducted using the PyTorch framework in Python on a server equipped with an NVIDIA Tesla A100 GPU.
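A sketch of the hybrid training objective: cross-entropy on the answer classes plus L1 and generalized-IoU terms on the predicted box. The weights `w_l1` and `w_giou` are illustrative assumptions (the paper does not specify them here), and the box format is (x1, y1, x2, y2):

```python
import numpy as np

def giou(a, b):
    """Generalized IoU for two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box penalizes non-overlapping predictions.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose

def hybrid_loss(logits, label, pred_box, gt_box, w_l1=5.0, w_giou=2.0):
    # Cross-entropy on the answer classes (log-sum-exp for stability).
    z = logits - logits.max()
    ce = -z[label] + np.log(np.exp(z).sum())
    # L1 + GIoU terms on the bounding box.
    l1 = np.abs(np.asarray(pred_box, float) - np.asarray(gt_box, float)).sum()
    return ce + w_l1 * l1 + w_giou * (1.0 - giou(pred_box, gt_box))

loss = hybrid_loss(np.array([2.0, 0.5, -1.0]), 0, (0, 0, 2, 2), (0, 0, 2, 2))
print(loss)
```

With identical predicted and ground-truth boxes, the L1 and GIoU terms vanish and the loss reduces to the cross-entropy term alone.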

Comparison results
The proposed DMPL model was compared with previous studies on the EndoVis-18 [25] and EndoVis-17 [26] datasets, and the results are reported in Table 1, based on the answering and bounding box metrics. The proposed model surpassed previous advanced methods in most scenarios. Specifically, regarding the ACC, F-score, and mIoU score, DMPL outperformed CAT-ViL DeiT [15] by 5.01%, 18.16%, and 1.22% on the EndoVis-18 dataset, and by 4.66%, 0.95%, and 1.14% on the EndoVis-17 dataset, respectively. The experimental results indicate that the proposed DMPL achieves superior VQA performance while maintaining clinical alignment between the answer and the related visual instances. This improvement in performance is mainly due to the incorporation of visual and textual supplementary prompts through the two proposed prompters, which assist the model in filtering valid multi-domain information.
To further evaluate the robustness of the model, quantitative experiments were conducted to assess the performance degradation of the model when confronted with corrupted images. Following ref. [37], we selected 15 corruption types prevalent in the real world for our experiments and set five severity levels for each type. As shown in Fig. 2, the performance of all the models is directly proportional to the image quality. However, compared with advanced models such as VisualBERT [29], VisualBERT ResMLP [30], and CAT-ViL DeiT [15], the proposed DMPL achieves the best performance across all levels of image corruption. This indicates that the proposed model can maintain robustness when dealing with previously unseen corrupted images, which can be attributed to the introduced prompt knowledge.
To further validate the generalizability of the proposed model, we trained and tested it on the PSI-AVA dataset. As shown in Table 2, compared with the latest models, the proposed model exhibits higher ACC and recall on this dataset but lower precision and F-scores. These results suggest that the model can identify most positive samples, indicating a certain level of generalizability. However, many negative samples are misclassified as positive. This issue may arise from the model optimization process not being sufficiently fine-grained, too leniently labeling samples as positive. Addressing this issue will be the focus of future work.

Ablation experiments were conducted by incorporating the VCP, the TCP, or a combination of both. The experiments were performed on the EndoVis-18 [25] and EndoVis-17 [26] datasets, maintaining settings consistent with the quantitative outcomes outlined in Table 3.
The results clearly illustrate that incorporating either the VCP or the TCP, individually or concurrently, significantly improves the predictive ACC of the model for both bounding boxes and answers. This surpasses the performance of advanced models across various benchmarks and underscores the essential role played by each prompter in enhancing the proficiency of the model. However, the simultaneous integration of both prompters resulted in a comparatively marginal increase in bounding box prediction ACC compared with independent integration. This observation stems from the sequential design of the proposed model, in which each prompter contributes complex semantic features in turn, so that the initial prompter's contribution is transformed into more rudimentary semantic features. This sequential integration may inadvertently affect the bounding box prediction ACC by introducing confusion between the complex semantic inputs from the secondary prompter.
To further demonstrate the robustness of the proposed model, the qualitative results of VisualBERT [29], VisualBERT ResMLP [30], CAT-ViL DeiT [15], and our model were visualized on the EndoVis-18 dataset [25] for 15 types of image corruption at level 2 of image degradation. As shown in Fig. 3, various types of image corruption interfere with the localization of bounding boxes in advanced models, thereby indirectly affecting the predicted answers. By contrast, the proposed model can successfully suppress the interference introduced by image corruption, correctly predicting the answers.

Qualitative analysis
In Fig. 4, four sets of sample image-question pairs are visualized along with the ground-truth and generated answers. The proposed model demonstrates a pronounced ability to pinpoint instance locations pertinent to the posed questions, markedly enhancing the quality of the generated responses. For instance, in Example 2, advanced models such as VisualBERT [29], VisualBERT ResMLP [30], and CAT-ViL DeiT [15] erroneously ground the bounding box to the image's bottom-left corner, which leads to an inaccurate prediction of "bottom-left" as the answer. Conversely, the proposed model accurately identifies the bounding box at the top-left position, providing the correct answer. An analogous outcome is observed in Example 4. These findings indicate that the proposed framework, through the integration of visual and textual prompt knowledge, effectively disregards irrelevant areas within images. This approach significantly minimizes distractions in answer prediction, thereby enhancing the precision and focus of the response mechanism.
In Example 1, advanced models excel in predicting accurate answers but struggle to precisely localize the relevant bounding boxes. Conversely, in Example 3, VisualBERT [29] adeptly identifies the positions of the instances related to the query but fails to deliver the correct answer. This highlights the ongoing challenge for advanced models to seamlessly integrate visual-text and location-answer alignments. By contrast, the proposed model consistently achieves these alignments across both examples. This proficiency is attributed to the effective alignment and interaction of multi-domain knowledge within the TCP. As the EndoVis-17 dataset does not provide training data and is only used for testing, the experimental results on this dataset reflect the generalizability of the model. Therefore, we focused on analyzing the model's performance on this dataset to acquire a deeper understanding of its strengths and weaknesses.
As shown in Fig. 5, Examples 1 and 2 demonstrate that the proposed model achieves more accurate visual answer localization, leading to correct textual answers, unlike the other models, which generate erroneous answers because of incorrect localization. In Example 3, the other models also locate the correct answer, but with an overly broad and imprecise range. This inclusion of excessively irrelevant information results in incorrect textual answers.
These three examples show that the proposed model significantly improves the precision of visual answer localization compared with previous methods. Example 4 shows that all models identified the correct visual information; however, the answers generated by the other models were still incorrect. This indicates that the proposed model, guided by prompt information, engages in a more thorough interaction and alignment of visual and textual information. These examples demonstrate that the proposed model outperforms previous models in terms of both localization and answer ACC.

Limitations
The proposed model leverages the advantages of dual-modal prompt learning to improve the ACC of answers. However, achieving precise semantic alignment between modalities at a fine-grained level still poses certain challenges. In other words, there may be instances in which a word in the question does not align accurately with the corresponding object in the image.
Figure 6 presents four examples of incorrect results. In Example 1, although the answer is correct, the localization is incorrect. In Example 2, the localization is correct but the answer is incorrect. These two examples indicate that the textual and visual information are not sufficiently aligned, leading to discrepancies between the visual and textual answers. Although the proposed approach mitigates language bias to some extent, it remains susceptible to biases inherent in the training data. This can influence the system's decision-making process, particularly in scenarios in which the textual and visual prompts suggest conflicting interpretations.
Example 3 demonstrates an area of localization that is not sufficiently specific, focusing on excessive incorrect visual information and resulting in an incorrect answer. This example also suggests that the model's ACC in localizing specific regions, particularly small targets, requires improvement; there is room for the model to enhance its focus on local information. Example 4 shows incorrect localization and answers, indicating that the ACC of the model requires further improvement. Additionally, the volume of training data in the dataset was a limiting factor; the model would need additional data to further improve its ACC.

Conclusions
In this study, a dual-modality prompt learning framework was designed for VQGA in robotic surgery. The proposed framework leverages the prompt knowledge generated by pretrained models to facilitate the joint encoding of cross-modal inputs, thereby improving the understanding of surgical scenes while localizing specific areas relevant to answering questions. The experimental evaluations conducted on the EndoVis-18 and EndoVis-17 datasets revealed that the proposed model effectively focuses on pertinent regions within images while capturing efficient multimodal alignments. Consequently, the extension of prompt learning into the realm of VQA not only proves beneficial, but also emerges as a promising avenue for future research.

Fig. 1
Fig. 1 Overall architecture of the proposed DMPL.a Dual modality prompt learning (DMPL) network; b VCP; c TCP

Fig. 3 Fig. 4
Fig. 3 Qualitative robustness experiments on the EndoVis-18 dataset. Experiments were conducted on 15 types of image corruption at level 2 of image degradation to visualize the answers predicted by the models and the associated bounding boxes. The 15 types of image corruption included Gaussian, shot, and impulse noise; defocus, glass, motion, and zoom blur; snow, frost, fog, brightness, contrast, elastic transform, pixelate, and JPEG compression

Fig. 5
Fig. 5 Examples of the correct results generated by the proposed model and other models on the EndoVis-17 [26] dataset. Text in red denotes wrong answers. Examples 1-4 refer to the four examples from left to right