PatientVLM Meets DocVLM:
Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis

1Indian Institute of Technology Jodhpur, 2All India Institute of Medical Sciences Delhi
*Equal Contribution

Abstract

Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advances, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue framework that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision–language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, confirming their usefulness for diagnosis. These DocVLM–PatientVLM interactions yield realistic, multi-turn dialogues paired with images and diagnoses, which are then used to fine-tune the DocVLM. This dialogue-based training substantially enhances diagnostic performance. For instance, with Qwen2.5-VL-7B as the base model, fine-tuning on symptoms generated by our framework achieves an F1 score of 81.0% on the DermaMNIST dataset, compared to just 56.5% with direct image-only fine-tuning.

Overview of the Pre-Consultation Dialogue Framework (PCDF)

Framework Diagram

In the Dialogue Simulation phase (left), a DocVLM and a PatientVLM engage in multi-turn conversation. At each turn \(t\), the DocVLM generates a follow-up question using the image, the current dialogue history, and an instruction prompt \(P_{doc}\). The PatientVLM responds using the image, the ground-truth diagnosis label (to simulate symptom expression), the question from the DocVLM, and an instruction prompt \(P_{pat}\). This interaction continues for \(T\) turns, yielding an image–dialogue–diagnosis triplet. In the Dialogue-conditioned DocVLM Finetuning phase (right), the DocVLM is instruction-finetuned (using prompt \(P_{docft}\)) on these synthetic triplets to perform dialogue-aware, accurate, and interpretable diagnosis.
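The simulation loop above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the authors' implementation: `doc_vlm` and `patient_vlm` are hypothetical stand-ins for real VLM inference calls (e.g. a Qwen2.5-VL chat endpoint), implemented here as toy stubs so the control flow runs end to end, and the prompt strings are placeholders for \(P_{doc}\) and \(P_{pat}\).

```python
# Hypothetical sketch of the PCDF dialogue-simulation loop.
# doc_vlm / patient_vlm are toy stubs standing in for real VLM calls.

P_DOC = "You are a doctor. Given the image and dialogue so far, ask ONE follow-up question."
P_PAT = "You are a patient with diagnosis '{dx}'. Answer the doctor's question realistically."

def doc_vlm(image, history, prompt):
    # Stub: a real DocVLM would condition on the image and dialogue history.
    return f"Question {len(history) // 2 + 1}: have you noticed any new symptoms?"

def patient_vlm(image, diagnosis, question, prompt):
    # Stub: a real PatientVLM would express symptoms consistent with `diagnosis`.
    return f"(answer consistent with {diagnosis})"

def simulate_dialogue(image, diagnosis, turns=3):
    """Run `turns` doctor-patient exchanges; return an image-dialogue-diagnosis triplet."""
    history = []
    for _ in range(turns):
        question = doc_vlm(image, history, P_DOC)
        answer = patient_vlm(image, diagnosis, question, P_PAT.format(dx=diagnosis))
        history += [("doctor", question), ("patient", answer)]
    return {"image": image, "dialogue": history, "diagnosis": diagnosis}

triplet = simulate_dialogue("lesion_001.png", "melanoma", turns=3)
print(len(triplet["dialogue"]))  # 6 messages: 3 questions + 3 answers
```

Each resulting triplet is exactly the unit the finetuning phase consumes: the image and dialogue become the input context, and the diagnosis becomes the supervision target.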

PCDF Generated Dialogues


VLM Performance Comparison

All values in % (Acc = accuracy; F1 = macro F1).

Model           Setting          DermaMNIST     PneumoniaMNIST  RetinaMNIST    PathMNIST
                                 Acc    F1      Acc    F1       Acc    F1      Acc    F1
InternVL3-2B    Image-only SFT   66.8   36.5    89.6   88.4     52.5   31.5    83.5   70.9
                +PCDF (Ours)     89.6   73.7    98.7   98.6     72.2   54.9    95.7   85.5
Qwen2.5-VL-7B   Image-only SFT   77.8   56.5    85.6   83.3     54.8   33.8    71.6   73.5
                +PCDF (Ours)     92.0   81.0    95.0   94.5     58.2   39.7    79.5   77.9
Gemma3-4B       Image-only SFT   87.2   78.3    96.0   95.7     64.8   47.7    89.5   86.0
                +PCDF (Ours)     92.8   81.9    99.0   99.0     76.0   67.7    92.1   90.2
MedGemma3-4B    Image-only SFT   89.0   81.5    99.2   99.1     79.2   71.2    93.2   90.9
                +PCDF (Ours)     94.4   86.4    99.4   99.3     82.2   81.3    97.5   96.9
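For readers reproducing the table, metrics of this kind (accuracy and macro-averaged F1, reported in %) can be computed from predicted versus ground-truth labels as sketched below. This is a generic sketch, not the authors' evaluation code, and the class names are illustrative.

```python
# Accuracy and macro-F1 (in %) from label lists, in plain Python.

def accuracy(y_true, y_pred):
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Per-class F1; defined as 0 when the class has no true positives.
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return 100.0 * sum(f1_scores) / len(f1_scores)

# Illustrative labels (not from the paper's data):
y_true = ["nevus", "melanoma", "nevus", "bcc"]
y_pred = ["nevus", "nevus", "nevus", "bcc"]
print(round(accuracy(y_true, y_pred), 1))  # 75.0
print(round(macro_f1(y_true, y_pred), 1))  # 60.0
```

Macro-averaging weights every class equally, which is why F1 lags accuracy on imbalanced datasets such as DermaMNIST in the table above.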

BibTeX

@inproceedings{lokesh2026patientvlm,
  title     = {PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis},
  author    = {Lokesh, K and Penamakuri, Abhirama Subramanyam and Agarwal, Uday and Challa, Apoorva and Gowda, Shreya K and Gupta, Somesh and Mishra, Anand},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
  publisher = {AAAI Press}
}