More examples of fine-grained context classification and prediction
This section provides additional examples of our proposed pipeline, covering fine-grained context classification and objects-from-context prediction.
We first give concrete examples that contrast the context reasoning of Qwen2.5-VL-3B before and after SFT across our three criteria: location, size, and co-occurrence. These examples show that SFT substantially improves both the accuracy of context classification and the plausibility of the model’s contextual reasoning, as discussed in Section 5.1.
Figure 1: Location Context Reasoning for the Sink (Before vs. After SFT). Example image from COCO.
Figure 2: Size Context Reasoning for the Dining Table (Before vs. After SFT). Example image from COinCO.
Figure 3: Co-occurrence Context Reasoning for the Toilet (Before vs. After SFT). Example image from COinCO.
Here, we provide additional instance-level examples for the Objects-from-Context prediction task. Our trained multilayer perceptron (MLP) receives an inpainted image and infers which object was present before the image was altered. Using the top predicted class, we reapply Stable Diffusion inpainting to reconstruct the original object within the mask region.
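The step from class scores to a reconstruction request can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the function names, the prompt template "a photo of a ...", and the bounding-box representation of the mask region are all hypothetical. The MLP is stood in for by a plain list of per-class logits, and the Stable Diffusion inpainting call itself is not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(
    logits: Sequence[float], class_names: Sequence[str]
) -> Tuple[str, float]:
    """Return the highest-scoring class name and its probability."""
    if len(logits) != len(class_names):
        raise ValueError("logits and class_names must have the same length")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


def build_inpainting_request(
    logits: Sequence[float],
    class_names: Sequence[str],
    mask_box: Tuple[int, int, int, int],
) -> Dict[str, object]:
    """Package the top predicted class as a text prompt for inpainting.

    The prompt template and the (x0, y0, x1, y1) mask box are assumptions
    made for illustration; an inpainting model would consume the prompt
    together with the inpainted image and a mask covering this region.
    """
    label, confidence = top_prediction(logits, class_names)
    return {
        "prompt": f"a photo of a {label}",  # hypothetical prompt template
        "mask_box": mask_box,
        "confidence": confidence,
    }
```

For example, with hypothetical logits `[0.1, 2.0, -1.0]` over the classes `["sink", "toilet", "dining table"]`, the sketch selects "toilet" and produces a prompt for that class, which would then be passed to the inpainting model along with the mask.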