
Context Models & Training Details

Model architectures, training procedures, and LVLM prompting strategies

COinCO · CVPR 2026

This page presents the full LVLM prompt referenced in Section 3.2, followed by additional training details for the fine-grained context classification models using knowledge distillation (Section 5.1) and the Objects-from-Context prediction model (Section 5.2).

LVLM Context Reasoning Prompt

This section provides the full prompt used to generate the context reasoning results in Section 3.2, illustrated in Figure 3 of the paper.

Training Details and Model Structure

Fine-Grained Context Classification

This section provides supplementary information for Section 5.1, detailing the knowledge distillation process and training specifications for our fine-grained context classification models. We distill a 72B-parameter teacher model into three efficient 3B-parameter student models, each specialized for one context criterion: location, size, or co-occurrence.

Training Details

To address the challenge of deploying large vision-language models for practical applications, we employ knowledge distillation to transfer contextual reasoning capabilities from a large teacher model to smaller, more efficient student models. We use Qwen2.5-VL-72B as our teacher model to distill knowledge into three independent Qwen2.5-VL-3B student models—one specialized for location reasoning, one for size reasoning, and one for co-occurrence reasoning.

This distillation process reduces the model size by 24× while maintaining most of the reasoning capabilities, making context classification feasible for practical deployment. Each student model learns not only the final decision (in-context or out-of-context) but also the reasoning process for its specific criterion, enabling interpretable explanations.
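The distillation setup above can be sketched as supervised fine-tuning on teacher outputs. The record format, field names, and prompt wording below are hypothetical illustrations, not the paper's actual implementation: each student is trained on (image, criterion-specific prompt) pairs whose target is the teacher's full reasoning and verdict.

```python
# Sketch: assembling distillation training examples (hypothetical format).
# Each student imitates both the verdict and the reasoning of the
# Qwen2.5-VL-72B teacher for its single criterion.

CRITERIA = ("location", "size", "co-occurrence")

def make_example(image_path, criterion, teacher_response):
    """Build one supervised fine-tuning record for a criterion-specific student."""
    assert criterion in CRITERIA
    prompt = (
        f"Judge whether the composited object is in-context or out-of-context "
        f"with respect to its {criterion}, and explain your reasoning."
    )
    return {
        "image": image_path,         # composited image
        "prompt": prompt,            # criterion-specific instruction
        "target": teacher_response,  # teacher reasoning + verdict (the label)
    }

example = make_example("img_001.jpg", "size",
                       "Out-of-context: the object is far smaller than expected.")
```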

Dataset Split: We use 73,929 images with unanimous consensus from all three LVLMs (Molmo-72B, Qwen2.5-VL-72B, and InternVL3.5-38B). From these, we sample 24,000 images maintaining a balanced 1:1 ratio of in-context to out-of-context examples:

  • Training set: 20,000 images
  • Validation set: 2,000 images
  • Test set: 2,000 images
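The split above can be reproduced with per-class sampling. This is a minimal sketch, not the paper's actual sampling code; the helper name, seed, and pool sizes are assumptions. Sampling each split separately per class guarantees the 1:1 in-context to out-of-context ratio within every split, not just overall.

```python
# Sketch: balanced 20,000/2,000/2,000 split with a 1:1 class ratio per split.
import random

def balanced_split(in_ctx_ids, out_ctx_ids,
                   n_train=10_000, n_val=1_000, n_test=1_000, seed=0):
    """Sample n_train/n_val/n_test items PER CLASS to keep each split balanced."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for ids in (in_ctx_ids, out_ctx_ids):
        pool = rng.sample(ids, n_train + n_val + n_test)
        splits["train"] += pool[:n_train]
        splits["val"]   += pool[n_train:n_train + n_val]
        splits["test"]  += pool[n_train + n_val:]
    return splits

# Hypothetical ID pools: in-context images < 30_000, out-of-context >= 30_000.
splits = balanced_split(list(range(30_000)), list(range(30_000, 60_000)))
```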

We use context reasoning responses from Qwen2.5-VL-72B as the ground truth for training. The three student models are trained independently, with each focusing on one of the three context criteria:

  1. Location Student Model: Evaluates whether the spatial placement of the object is appropriate for the scene layout. Learns to identify violations such as objects appearing in physically impossible locations (e.g., a horse in a bathroom next to urinals).
  2. Size Student Model: Assesses whether the object's scale aligns with scene geometry and expected proportions. Detects anomalies like unreasonably sized objects (e.g., a tiny zebra on a beach next to a person).
  3. Co-occurrence Student Model: Determines whether the object's simultaneous presence with existing objects is contextually plausible. Identifies violations where objects that absolutely cannot appear together are present (e.g., a cow in the sky with the moon).
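Since the three students run independently, their outputs can be collected into a single per-image report. How (or whether) the paper fuses the three verdicts is not specified; the any-violation rule and the function below are illustrative assumptions.

```python
# Illustrative sketch: gathering the three students' verdicts into one report.
# Flagging an image when ANY criterion is violated is an assumption, not
# necessarily the paper's aggregation rule.

def context_report(location, size, cooccurrence):
    """Each argument is a (verdict, reason) pair; verdict is 'in' or 'out'."""
    verdicts = {"location": location, "size": size, "co-occurrence": cooccurrence}
    violated = [c for c, (v, _) in verdicts.items() if v == "out"]
    return {
        "out_of_context": bool(violated),
        "violated_criteria": violated,
        "explanations": {c: reason for c, (_, reason) in verdicts.items()},
    }

report = context_report(
    ("in",  "Placement on the beach is plausible."),
    ("out", "The zebra is far smaller than the adjacent person."),
    ("in",  "Zebras and people can plausibly co-occur outdoors."),
)
```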

Model Architecture

Each of the three student models is based on the Qwen2.5-VL-3B architecture, a compact yet powerful vision-language model. Through knowledge distillation from the 72B parameter teacher model, these 3B student models achieve practical inference speeds while maintaining high-quality reasoning capabilities. The modular design allows criterion-specific refinements, demonstrating that complex contextual understanding can be effectively decomposed into focused reasoning tasks.

While the 72B teacher model is too slow for practical deployment, these efficient 3B student models make context reasoning feasible for real-world applications. Each model provides interpretable natural language explanations for its decisions, enabling transparent and understandable context classification.

Objects-from-Context Prediction Model

This section provides supplementary information for Section 5.2, detailing the training process and architectural specifications of our object recovery model. The following content includes comprehensive training parameters, model architecture, and resource utilization statistics.

Training Details

The training process involved optimizing an 80-class classification model with embeddings as inputs. Each input embedding consisted of two components: latent1 and latent2, both of size 4 × 32 × 32, extracted using the VAE module from the stable-diffusion-2-1 model. These embeddings were generated from corresponding image and mask pairs and served as the input to the model.
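The shapes above follow from the SD 2.1 VAE's 8× spatial downsampling and 4 latent channels, so a 4 × 32 × 32 latent implies 256 × 256 inputs. That input resolution, and concatenating the two latents along the channel axis, are assumptions consistent with the stated shapes rather than details from the paper.

```python
# Sketch of the input-shape bookkeeping for the classifier.
VAE_DOWNSAMPLE = 8    # SD 2.1 VAE reduces H and W by 8x
LATENT_CHANNELS = 4   # SD 2.1 VAE latent channels

def latent_shape(h, w):
    return (LATENT_CHANNELS, h // VAE_DOWNSAMPLE, w // VAE_DOWNSAMPLE)

latent1 = latent_shape(256, 256)  # image latent -> (4, 32, 32)  [assumed 256x256 input]
latent2 = latent_shape(256, 256)  # mask latent  -> (4, 32, 32)

# If the two latents are concatenated channel-wise (an assumption), the
# classifier sees an 8 x 32 x 32 input.
combined_channels = latent1[0] + latent2[0]
```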

We used a learning rate of 10⁻³ with the Adam optimizer. The batch size was set to 512 to ensure efficient utilization of GPU memory. A step learning rate scheduler was applied with a step size of 40 epochs and a decay factor of 0.1. The model was trained for 120 epochs on a single A6000 GPU with Cross-Entropy loss as the optimization objective.
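The resulting learning-rate schedule is piecewise constant. A minimal sketch of the stated hyperparameters (base rate 10⁻³, decayed by 0.1 every 40 epochs over 120 epochs):

```python
# Step learning-rate schedule as described: lr drops by 10x at epochs 40 and 80.
BASE_LR, GAMMA, STEP = 1e-3, 0.1, 40

def lr_at(epoch):
    """Learning rate in effect during the given epoch (0-indexed)."""
    return BASE_LR * GAMMA ** (epoch // STEP)

# Epochs 0-39 train at 1e-3, epochs 40-79 at 1e-4, epochs 80-119 at 1e-5.
schedule = [lr_at(e) for e in (0, 39, 40, 80, 119)]
```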

The classification task involved 80 classes, and the network architecture was designed to process the 4 × 32 × 32 embeddings efficiently.

Model Details

[Figure: Layer Summary.]

[Figure: Parameters and Memory Usage.]