Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue framework that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision–language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, confirming their usefulness for diagnosis. These DocVLM–PatientVLM interactions yield realistic, multi-turn dialogues paired with images and diagnoses, which are then used to fine-tune the DocVLM. This dialogue-based training substantially enhances diagnostic performance. For instance, with Qwen2.5-VL-7B as the base model, fine-tuning on dialogues generated by our framework achieves an F1 score of 81.0% on the DermaMNIST dataset, compared to just 56.5% with direct image-only fine-tuning.
In the Dialogue Simulation phase (left), a DocVLM and a PatientVLM engage in a multi-turn conversation. At each turn \(t\), the DocVLM generates a follow-up question using the image, the current dialogue history, and an instruction prompt \(P_{\mathrm{doc}}\). The PatientVLM responds using the image, the ground-truth diagnosis label (to simulate symptom expression), the question from the DocVLM, and an instruction prompt \(P_{\mathrm{pat}}\). This interaction continues for \(T\) turns, yielding an image–dialogue–diagnosis triplet. In the Dialogue-conditioned DocVLM Finetuning phase (right), the DocVLM is instruction-finetuned (using prompt \(P_{\mathrm{docft}}\)) on these synthetic triplets to perform dialogue-aware, accurate, and interpretable diagnosis.
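The simulation loop above can be sketched as follows. This is a minimal illustration, not the paper's released code: the `doc_vlm` and `patient_vlm` callables, the prompt strings, and the `simulate_dialogue` interface are all assumptions standing in for actual VLM inference calls.

```python
# Hypothetical sketch of the Dialogue Simulation phase.
# doc_vlm(image, history, prompt) -> question        (assumed interface)
# patient_vlm(image, label, question, prompt) -> answer

P_DOC = ("You are a doctor. Given the image and the dialogue so far, "
         "ask one follow-up question about the patient's symptoms.")
P_PAT = ("You are a patient whose condition is '{label}'. "
         "Answer the doctor's question based on plausible symptoms.")

def simulate_dialogue(doc_vlm, patient_vlm, image, label, num_turns=3):
    """Run num_turns DocVLM/PatientVLM exchanges and return the
    image-dialogue-diagnosis triplet used later for fine-tuning."""
    history = []  # list of (question, answer) pairs
    for _ in range(num_turns):
        question = doc_vlm(image, history, P_DOC)
        answer = patient_vlm(image, label, question,
                             P_PAT.format(label=label))
        history.append((question, answer))
    return {"image": image, "dialogue": history, "diagnosis": label}
```

Each returned triplet would then be serialized with the fine-tuning prompt (written \(P_{\mathrm{docft}}\) above) into an instruction-tuning example for the DocVLM.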
| Model | Setting | DermaMNIST Acc. | DermaMNIST F1 | PneumoniaMNIST Acc. | PneumoniaMNIST F1 | RetinaMNIST Acc. | RetinaMNIST F1 | PathMNIST Acc. | PathMNIST F1 |
|---|---|---|---|---|---|---|---|---|---|
| InternVL3-2B | Image-only SFT | 66.8 | 36.5 | 89.6 | 88.4 | 52.5 | 31.5 | 83.5 | 70.9 |
| InternVL3-2B | +PCDF (Ours) | 89.6 | 73.7 | 98.7 | 98.6 | 72.2 | 54.9 | 95.7 | 85.5 |
| Qwen2.5-VL-7B | Image-only SFT | 77.8 | 56.5 | 85.6 | 83.3 | 54.8 | 33.8 | 71.6 | 73.5 |
| Qwen2.5-VL-7B | +PCDF (Ours) | 92.0 | 81.0 | 95.0 | 94.5 | 58.2 | 39.7 | 79.5 | 77.9 |
| Gemma3-4B | Image-only SFT | 87.2 | 78.3 | 96.0 | 95.7 | 64.8 | 47.7 | 89.5 | 86.0 |
| Gemma3-4B | +PCDF (Ours) | 92.8 | 81.9 | 99.0 | 99.0 | 76.0 | 67.7 | 92.1 | 90.2 |
| MedGemma3-4B | Image-only SFT | 89.0 | 81.5 | 99.2 | 99.1 | 79.2 | 71.2 | 93.2 | 90.9 |
| MedGemma3-4B | +PCDF (Ours) | 94.4 | 86.4 | 99.4 | 99.3 | 82.2 | 81.3 | 97.5 | 96.9 |
@inproceedings{lokesh2026patientvlm,
title = {PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis},
author = {Lokesh, K and Penamakuri, Abhirama Subramanyam and Agarwal, Uday and Challa, Apoorva and Gowda, Shreya K and Gupta, Somesh and Mishra, Anand},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2026},
publisher = {AAAI Press}
}