# DIQA Challenge at VQualA 2025 - NJUST-KMG Team Solution

This repository contains the implementation of our solution for the Document Image Quality Assessment (DIQA) challenge at VQualA 2025. Our approach leverages fine-tuned multimodal vision-language models to assess the perceptual quality of enhanced document images.

## 🏆 Results

Our models and ensembles achieved the following scores on the DIQA-5000 dataset:

- **Qwen2.5-VL-7B**: 0.910
- **MiMo-VL-7B**: 0.913
- **Qwen2-VL-7B**: 0.870
- **Ensemble (Qwen2.5 + MiMo)**: 0.921
- **Ensemble (all three models)**: 0.924

## 📋 Table of Contents

- [Environment Setup](#environment-setup)
- [Dataset Preparation](#dataset-preparation)
- [Model Download](#model-download)
- [Training Data Generation](#training-data-generation)
- [Model Training](#model-training)
- [Inference and Submission](#inference-and-submission)
- [Model Ensemble](#model-ensemble)
- [Technical Details](#technical-details)

## 🔧 Environment Setup

### Prerequisites

- Python 3.8+
- CUDA 11.8+ (for GPU training)
- At least 24GB of GPU memory (recommended: NVIDIA RTX 4090 or A100)

### Installation

1. Clone the repository:

```bash
git clone https://github.com/your-username/vqual-a-2025-diqa-challenge.git
cd vqual-a-2025-diqa-challenge
```

2. Install the ms-swift framework:

```bash
cd ms-swift-main
pip install -e .
```

3. Install additional dependencies:

```bash
pip install modelscope
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

## 📊 Dataset Preparation

### Download the DIQA Dataset

1. Download the DIQA-5000 dataset from the official challenge website.
2. Extract and organize the dataset as follows:

```
datasets/
├── DIQA/
│   ├── train/
│   │   ├── train.csv
│   │   ├── res/        # Enhanced images
│   │   └── ori/        # Original images
│   ├── val/
│   │   ├── val.csv
│   │   ├── res/
│   │   └── ori/
│   └── test/
│       ├── test.csv
│       ├── res/        # Test images with TTA variants
│       ├── flip/       # Horizontally flipped images
│       ├── gamma/      # Gamma-corrected images
│       └── blur/       # Blurred images
└── train_data/         # Generated training data (JSON format)
```

### Dataset Statistics

- **Training set**: 3,465 images
- **Validation set**: 35 images
- **Test1 set**: 500 images
- **Test2 set**: 1,000 images
- **Image resolution**: variable (typically 4000×6000×3)

## 🤖 Model Download

Download the pre-trained models using ModelScope:

```bash
cd code
chmod +x down_models.sh
./down_models.sh
```

This will download:

- **Qwen2.5-VL-7B-Instruct**
- **MiMo-VL-7B-RL**
- **Qwen2-VL-7B-Instruct**

The models are saved to the `models/` directory.
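If you prefer to fetch the checkpoints from Python instead of the shell script, a minimal sketch using ModelScope's `snapshot_download` could look like this; the model IDs and `cache_dir` below are assumptions and may differ from what `down_models.sh` actually uses:

```python
# Sketch: download the three checkpoints via the ModelScope Python API.
# NOTE: the model IDs and cache_dir are assumptions; check down_models.sh
# for the identifiers the team actually uses.
from modelscope import snapshot_download

MODEL_IDS = [
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "XiaomiMiMo/MiMo-VL-7B-RL",
    "Qwen/Qwen2-VL-7B-Instruct",
]

for model_id in MODEL_IDS:
    # Download (or reuse a cached copy of) each model into models/.
    local_dir = snapshot_download(model_id, cache_dir="models/")
    print(f"{model_id} -> {local_dir}")
```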
## 📝 Training Data Generation

Generate training data in the format required by ms-swift:

```bash
cd code
python generate_training_data_diqa_train_round2.py
python gen_train_data_diqa_train_round2_addval0.4.py
python gen_train_data_cls_20.py
```

These scripts:

- Convert the CSV annotations to JSON format
- Create conversation-style training samples
- Apply a score-to-label mapping for better training
- Output: `datasets/train_data/diqa_train_round2.json`

### Training Data Format

Each training sample follows this structure (the prompt and response are written in Chinese in the actual data; they are translated here for readability):

```json
{
  "messages": [
    {
      "content": "This is an image processed by an enhancement algorithm. Please assess the document image quality...",
      "role": "user"
    },
    {
      "content": "The enhanced document image is scored as follows:\nOverall: 3.45\nSharpness: 3.67\nColor Fidelity: 3.23",
      "role": "assistant"
    }
  ],
  "images": ["/path/to/image.jpg"]
}
```

## 🚀 Model Training

### Training Configuration

Our training uses LoRA (Low-Rank Adaptation) fine-tuning with the following key parameters:

- **LoRA rank**: 16
- **LoRA alpha**: 32
- **Learning rate**: 1e-4
- **Epochs**: 5
- **Batch size**: 1 (with gradient accumulation of 16)
- **Precision**: bfloat16
- **Data augmentation**: enabled (brightness, grid distortion, cropping)

### Train Individual Models

#### 1. Train Qwen2.5-VL-7B-Instruct

```bash
cd ms-swift-main
chmod +x diqa_train_qwen2.5-vl.sh
./diqa_train_qwen2.5-vl.sh
```

#### 2. Train MiMo-VL-7B-RL

```bash
chmod +x diqa_train_mimo.sh
./diqa_train_mimo.sh
```

#### 3. Train Qwen2-VL-7B-Instruct

```bash
chmod +x diqa_train_qwen2-vl.sh
./diqa_train_qwen2-vl.sh
```

### Training Time

- **Approximate training time**: 6 hours per model on an NVIDIA RTX 4090
- **Memory usage**: ~22GB of GPU memory
- **Total training time**: ~18 hours for all three models

### Training Output

Models are saved to:

- `ms-swift-main/output/Qwen2.5-VL_diqa_train_round2_epoch5/`
- `ms-swift-main/output/MiMo-VL-7B-RL_diqa_round2_epoch5/`
- `ms-swift-main/output/Qwen2-VL_diqa_train_round2_epoch5/`

## 🔮 Inference and Submission

### Single Model Inference with TTA

Run inference with Test-Time Augmentation (TTA) for improved robustness:

```bash
cd ms-swift-main
python infer_submit_tta.py
```

Key features of our TTA approach:

- **Augmentation types**: original, horizontal flip, gamma correction, blur
- **Ensemble strategy**: average predictions across all TTA variants
- **Score rounding**: results rounded to 2 decimal places
- **Batch processing**: efficient batch inference for faster processing

### TTA Configuration

The TTA pipeline applies the following transformations (the sketch after this list shows how their predictions are averaged):

- **res**: original enhanced images
- **flip**: horizontally flipped images
- **gamma**: gamma-corrected images
- **blur**: Gaussian-blurred images
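As an illustration of the averaging strategy described above, the sketch below merges per-variant predictions into a single set of scores. The CSV file names and column names are assumptions, not the actual layout used by `infer_submit_tta.py`:

```python
# Sketch: average TTA predictions across the four variants (res, flip, gamma, blur).
# NOTE: file names and column names are hypothetical; infer_submit_tta.py may differ.
import pandas as pd

VARIANTS = ["res", "flip", "gamma", "blur"]
SCORE_COLS = ["overall", "sharpness", "color_fidelity"]

# One CSV of predictions per TTA variant, indexed by image name.
frames = [
    pd.read_csv(f"predictions_{variant}.csv").set_index("image_name")
    for variant in VARIANTS
]

# Average the three quality scores across variants and round to 2 decimal
# places, mirroring the score-rounding step described above.
ensemble = sum(df[SCORE_COLS] for df in frames) / len(frames)
ensemble = ensemble.round(2)

ensemble.reset_index().to_csv("submission.csv", index=False)
```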
### Inference Parameters

- **Batch size**: 4
- **Max tokens**: 32
- **Temperature**: 0 (deterministic)
- **Runtime**: ~0.75s per image on an RTX 4090

### Customizing Inference

To run inference with a specific model checkpoint, modify the parameters in `infer_submit_tta.py`:

```python
# Example configuration
adapter_path = '/path/to/your/checkpoint'
csv_name = 'your_results.csv'
submission_name = 'your_submission.zip'
```

## 🎯 Model Ensemble

Combine predictions from multiple models for improved performance:

```bash
cd code
python submit_combine.py
```

### Ensemble Strategy

Our final ensemble combines the three models with weighted averaging:

- **Qwen2.5-VL-7B**: 30% weight
- **MiMo-VL-7B**: 40% weight (best single model)
- **Qwen2-VL-7B**: 30% weight

### Ensemble Configuration

Modify the file paths and weights in `submit_combine.py`:

```python
# Model result files
csv_file_1 = '/path/to/qwen2.5_results.csv'
csv_file_2 = '/path/to/mimo_results.csv'
csv_file_3 = '/path/to/qwen2_results.csv'

# Ensemble weights (must sum to 1.0)
def blend_scores(col):
    return df1[col] * 0.3 + df2[col] * 0.4 + df3[col] * 0.3
```

### Ensemble Benefits

- **Improved robustness**: reduces individual model biases
- **Higher correlation**: better alignment with human judgments
- **Complementary strengths**: each model captures different aspects of quality

## 📊 Technical Details

### Model Architecture

- **Base models**: multimodal vision-language models
- **Fine-tuning method**: LoRA (Low-Rank Adaptation)
- **Vision encoder**: frozen during training
- **Text decoder**: LoRA-adapted layers

### Training Strategies

- **Data augmentation**: random brightness, grid distortion, cropping
- **Gradient accumulation**: 16 steps for an effective batch size of 16
- **Mixed precision**: bfloat16 for memory efficiency
- **Warmup ratio**: 0.05

### Evaluation Metrics

- **Primary metric**: Pearson correlation with human MOS scores
- **Secondary metrics**: Spearman correlation, RMSE
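For local validation, these metrics can be computed with SciPy and NumPy. The sketch below assumes hypothetical prediction and ground-truth CSV files with an `image_name` key and an `overall` score column; it is not the repository's `evaluate_predictions.py`:

```python
# Sketch: compute Pearson/Spearman correlation and RMSE between predicted
# and ground-truth overall scores. File and column names are hypothetical.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

pred = pd.read_csv("your_results.csv")   # model predictions
gt = pd.read_csv("val.csv")              # ground-truth MOS annotations

# Align the two tables on the image name.
merged = pred.merge(gt, on="image_name", suffixes=("_pred", "_gt"))

y_pred = merged["overall_pred"].to_numpy()
y_true = merged["overall_gt"].to_numpy()

plcc, _ = pearsonr(y_pred, y_true)    # primary metric
srcc, _ = spearmanr(y_pred, y_true)   # secondary metric
rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(f"PLCC={plcc:.4f}  SRCC={srcc:.4f}  RMSE={rmse:.4f}")
```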
### Hardware Requirements

- **Training**: NVIDIA RTX 4090 (24GB) or A100
- **Inference**: NVIDIA RTX 4090 or equivalent
- **Storage**: ~50GB for models and datasets

### Performance Optimization

- **LoRA parameters**: ~7M trainable parameters per model
- **Memory efficiency**: bfloat16 precision reduces memory usage by 50%
- **Batch processing**: TTA inference processes multiple augmentations simultaneously
- **Checkpoint selection**: best checkpoints selected based on validation performance

## 📁 File Structure

```
├── code/                                             # Utility scripts
│   ├── down_models.sh                                # Model download script
│   ├── generate_training_data_diqa_train_round2.py  # Training data generation
│   ├── submit_combine.py                             # Model ensemble script
│   ├── submit_single.py                              # Single model submission
│   └── evaluate_predictions.py                       # Evaluation utilities
├── ms-swift-main/                                    # ms-swift framework
│   ├── diqa_train_qwen2.5-vl.sh                      # Qwen2.5-VL training script
│   ├── diqa_train_mimo.sh                            # MiMo-VL training script
│   ├── diqa_train_qwen2-vl.sh                        # Qwen2-VL training script
│   ├── infer_submit_tta.py                           # TTA inference script
│   └── output/                                       # Training outputs
├── datasets/                                         # Dataset directory
│   ├── DIQA/                                         # Original DIQA dataset
│   └── train_data/                                   # Generated training data
└── models/                                           # Pre-trained models
```

## 🚀 Quick Start

For a complete end-to-end run:

```bash
# 1. Set up the environment
cd ms-swift-main && pip install -e .
pip install modelscope torch torchvision torchaudio

# 2. Download the models
cd ../code && ./down_models.sh

# 3. Generate training data
python generate_training_data_diqa_train_round2.py

# 4. Train the models (run in parallel on different GPUs)
cd ../ms-swift-main
./diqa_train_qwen2.5-vl.sh   # GPU 0
./diqa_train_mimo.sh         # GPU 1
./diqa_train_qwen2-vl.sh     # GPU 2

# 5. Run inference with TTA
python infer_submit_tta.py

# 6. Create the ensemble submission
cd ../code && python submit_combine.py
```

## 🔧 Troubleshooting

### Common Issues

1. **CUDA Out of Memory**
   - Reduce the batch size or gradient accumulation steps
   - Use a smaller MAX_PIXELS value
   - Enable gradient checkpointing

2. **Model Loading Errors**
   - Ensure the models are downloaded completely
   - Check file permissions
   - Verify ModelScope authentication

3. **Training Convergence Issues**
   - Adjust the learning rate (try 5e-5 or 2e-4)
   - Increase the warmup steps
   - Check data quality and format

4. **Inference Speed Optimization**
   - Increase the batch size if memory allows
   - Use mixed-precision inference
   - Consider model quantization

### Performance Tips

- **Multi-GPU Training**: modify CUDA_VISIBLE_DEVICES in the training scripts
- **Data Loading**: increase dataloader_num_workers for faster I/O
- **Memory Management**: monitor GPU memory usage with `nvidia-smi`
- **Checkpoint Management**: keep only the best checkpoints to save disk space

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **ms-swift framework**: [ModelScope Swift](https://github.com/modelscope/swift)
- **Qwen2-VL**: [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL)
- **Qwen2.5-VL**: [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
- **MiMo-VL**: [MiMo-VL](https://github.com/XiaomiMiMo/MiMo-VL)
- **VQualA 2025 Challenge**: for providing the DIQA-5000 dataset

## 📧 Contact

- **Team Leader**: Zhe Zhang (zhe.zhang@njust.edu.cn)
- **Affiliation**: Nanjing University of Science and Technology

---

**Note**: This implementation is based on our submission to the VQualA 2025 DIQA Challenge. For the latest updates and improvements, please check our repository regularly.