To make large vision-language models practical to deploy, we use knowledge distillation to transfer contextual reasoning capabilities from a large teacher model to smaller, more efficient student models. We use Qwen2.5-VL-72B as the teacher and distill its knowledge into three independent Qwen2.5-VL-3B student models: one specialized for location reasoning, one for size reasoning, and one for co-occurrence reasoning.
Distillation reduces the size of each model by 24× (from 72B to 3B parameters) while maintaining most of the teacher's reasoning capabilities, making context classification feasible for practical deployment. Each student model learns not only the final decision (in-context or out-of-context) but also the reasoning process for its specific criterion, which enables interpretable explanations.
Dataset Split: We start from the 73,929 images on which all three LVLMs (Molmo-72B, Qwen2.5-VL-72B, and InternVL3.5-38B) reach a unanimous consensus. From these, we sample 24,000 images with a balanced 1:1 ratio of in-context to out-of-context examples and split them as follows:
- Training set: 20,000 images
- Validation set: 2,000 images
- Test set: 2,000 images
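The sampling and split described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the `(image_id, label)` record format, the `"in"`/`"out"` label encoding, the function name `balanced_split`, and the choice to keep every individual split balanced 1:1 are all assumptions made for the example.

```python
import random


def balanced_split(records, sizes=(20_000, 2_000, 2_000), seed=0):
    """Sample a 1:1 balanced subset and split it into train/val/test.

    records: list of (image_id, label) pairs, where label is "in" or "out"
             (hypothetical encoding of in-context / out-of-context).
    sizes:   (train, val, test) sizes; each must be even so that every
             split can hold equal numbers of both classes (an assumption).
    """
    if any(size % 2 for size in sizes):
        raise ValueError("each split size must be even to stay balanced")

    half_total = sum(sizes) // 2
    in_ctx = [r for r in records if r[1] == "in"]
    out_ctx = [r for r in records if r[1] == "out"]
    if min(len(in_ctx), len(out_ctx)) < half_total:
        raise ValueError("not enough examples in one class")

    # Draw the same number of examples from each class, without replacement.
    rng = random.Random(seed)
    in_sample = rng.sample(in_ctx, half_total)
    out_sample = rng.sample(out_ctx, half_total)

    # Carve consecutive, non-overlapping slices out of each class sample.
    splits, start = {}, 0
    for name, size in zip(("train", "val", "test"), sizes):
        half = size // 2
        part = in_sample[start:start + half] + out_sample[start:start + half]
        rng.shuffle(part)
        splits[name] = part
        start += half
    return splits
```

With the default `sizes`, the function draws 12,000 images per class and returns splits of 20,000, 2,000, and 2,000 images, matching the counts listed above. Fixing `seed` makes the split reproducible.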
We use context reasoning responses from Qwen2.5-VL-72B as the ground truth for training. The three student models are trained independently, with each focusing on one of the three context criteria:
- Location Student Model: Evaluates whether the spatial placement of the object is appropriate for the scene layout. Learns to identify violations such as objects appearing in physically impossible locations (e.g., a horse in a bathroom next to urinals).
- Size Student Model: Assesses whether the object's scale aligns with scene geometry and expected proportions. Detects anomalies like unreasonably sized objects (e.g., a tiny zebra on a beach next to a person).
- Co-occurrence Student Model: Determines whether the object's simultaneous presence with existing objects is contextually plausible. Identifies violations where objects that absolutely cannot appear together are present (e.g., a cow in the sky with the moon).
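One way to turn a single teacher response into three criterion-specific training examples is sketched below. The record layout (`image`, `object`, and a `criteria` dictionary holding `reasoning` and `decision` per criterion), the prompt wording, and the target format are hypothetical and chosen only for illustration; the actual prompts and response schema are not specified here.

```python
CRITERIA = ("location", "size", "co-occurrence")

# Hypothetical prompt wording; the real prompts may differ.
PROMPT_TEMPLATE = (
    "Consider the {criterion} of the {obj} in this image. "
    "Explain your reasoning, then answer in-context or out-of-context."
)


def build_student_examples(teacher_record):
    """Map one teacher record to one training example per criterion.

    Each student sees only the reasoning and decision for its own
    criterion, so the three students can be trained independently.
    """
    examples = {}
    for criterion in CRITERIA:
        response = teacher_record["criteria"][criterion]
        examples[criterion] = {
            "image": teacher_record["image"],
            "prompt": PROMPT_TEMPLATE.format(
                criterion=criterion, obj=teacher_record["object"]
            ),
            # The target contains the reasoning followed by the final decision,
            # so the student learns both the explanation and the label.
            "target": f"{response['reasoning']}\nDecision: {response['decision']}",
        }
    return examples
```

Placing the reasoning before the decision in the target is a design choice of this sketch: it lets a student generate its explanation first and then commit to a label, which mirrors the goal of interpretable, criterion-specific outputs.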