PatientVLM Meets DocVLM:
Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis

1Indian Institute of Technology Jodhpur, 2All India Institute of Medical Sciences Delhi
*Equal Contribution

Abstract

Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advances, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue framework that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision–language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, confirming their usefulness for diagnosis. These DocVLM–PatientVLM interactions yield realistic, multi-turn dialogues paired with images and diagnoses, which are then used to fine-tune the DocVLM. This dialogue-based training substantially enhances diagnostic performance. For instance, with Qwen2.5-VL-7B as the base model, fine-tuning on symptoms generated by our framework achieves an F1 score of 81.0% on the DermaMNIST dataset, compared to just 56.5% with direct image-only fine-tuning.

Overview of the Pre-Consultation Dialogue Framework (PCDF)

Framework Diagram

In the Dialogue Simulation phase (left), a DocVLM and a PatientVLM engage in multi-turn conversation. At each turn \(t\), the DocVLM generates a follow-up question using the image, the current dialogue history, and an instruction prompt \(P_{doc}\). The PatientVLM responds using the image, the ground-truth diagnosis label (to simulate symptom expression), the question from the DocVLM, and an instruction prompt \(P_{pat}\). This interaction continues for \(T\) turns, yielding an image–dialogue–diagnosis triplet. In the Dialogue-conditioned DocVLM Finetuning phase (right), the DocVLM is instruction-finetuned (using prompt \(P_{docft}\)) on these synthetic triplets to perform dialogue-aware, accurate, and interpretable diagnosis.
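The simulation loop above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the authors' implementation: `doc_vlm` and `patient_vlm` are hypothetical stand-ins for real VLM inference calls (e.g. a Qwen2.5-VL chat endpoint), implemented here as toy stubs so the control flow runs end to end, and the prompt strings are placeholders for \(P_{doc}\) and \(P_{pat}\).

```python
# Hypothetical sketch of the PCDF dialogue-simulation loop.
# doc_vlm / patient_vlm are toy stubs standing in for real VLM calls.

P_DOC = "You are a doctor. Given the image and dialogue so far, ask ONE follow-up question."
P_PAT = "You are a patient with diagnosis '{dx}'. Answer the doctor's question realistically."

def doc_vlm(image, history, prompt):
    # Stub: a real DocVLM would condition on the image and dialogue history.
    return f"Question {len(history) // 2 + 1}: have you noticed any new symptoms?"

def patient_vlm(image, diagnosis, question, prompt):
    # Stub: a real PatientVLM would express symptoms consistent with `diagnosis`.
    return f"(answer consistent with {diagnosis})"

def simulate_dialogue(image, diagnosis, turns=3):
    """Run `turns` doctor-patient exchanges; return an image-dialogue-diagnosis triplet."""
    history = []
    for _ in range(turns):
        question = doc_vlm(image, history, P_DOC)
        answer = patient_vlm(image, diagnosis, question, P_PAT.format(dx=diagnosis))
        history += [("doctor", question), ("patient", answer)]
    return {"image": image, "dialogue": history, "diagnosis": diagnosis}

triplet = simulate_dialogue("lesion_001.png", "melanoma", turns=3)
print(len(triplet["dialogue"]))  # 6 messages: 3 questions + 3 answers
```

Each resulting triplet is exactly the unit the finetuning phase consumes: the image and dialogue become the input context, and the diagnosis becomes the supervision target.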

PCDF Generated Dialogues


VLM Performance Comparison

All values in % (Acc = accuracy; F1 = macro F1).

Model           Setting          DermaMNIST     PneumoniaMNIST  RetinaMNIST    PathMNIST
                                 Acc    F1      Acc    F1       Acc    F1      Acc    F1
InternVL3-2B    Image-only SFT   66.8   36.5    89.6   88.4     52.5   31.5    83.5   70.9
                +PCDF (Ours)     89.6   73.7    98.7   98.6     72.2   54.9    95.7   85.5
Qwen2.5-VL-7B   Image-only SFT   77.8   56.5    85.6   83.3     54.8   33.8    71.6   73.5
                +PCDF (Ours)     92.0   81.0    95.0   94.5     58.2   39.7    79.5   77.9
Gemma3-4B       Image-only SFT   87.2   78.3    96.0   95.7     64.8   47.7    89.5   86.0
                +PCDF (Ours)     92.8   81.9    99.0   99.0     76.0   67.7    92.1   90.2
MedGemma3-4B    Image-only SFT   89.0   81.5    99.2   99.1     79.2   71.2    93.2   90.9
                +PCDF (Ours)     94.4   86.4    99.4   99.3     82.2   81.3    97.5   96.9
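For readers reproducing the table, metrics of this kind (accuracy and macro-averaged F1, reported in %) can be computed from predicted versus ground-truth labels as sketched below. This is a generic sketch, not the authors' evaluation code, and the class names are illustrative.

```python
# Accuracy and macro-F1 (in %) from label lists, in plain Python.

def accuracy(y_true, y_pred):
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Per-class F1; defined as 0 when the class has no true positives.
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return 100.0 * sum(f1_scores) / len(f1_scores)

# Illustrative labels (not from the paper's data):
y_true = ["nevus", "melanoma", "nevus", "bcc"]
y_pred = ["nevus", "nevus", "nevus", "bcc"]
print(round(accuracy(y_true, y_pred), 1))  # 75.0
print(round(macro_f1(y_true, y_pred), 1))  # 60.0
```

Macro-averaging weights every class equally, which is why F1 lags accuracy on imbalanced datasets such as DermaMNIST in the table above.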

BibTeX

@inproceedings{lokesh2026patientvlm,
  title     = {PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis},
  author    = {Lokesh, K and Penamakuri, Abhirama Subramanyam and Agarwal, Uday and Challa, Apoorva and Gowda, Shreya K and Gupta, Somesh and Mishra, Anand},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
  publisher = {AAAI Press}
}