# DIQA Challenge at VQualA 2025 - NJUST-KMG Team Solution

This repository contains the implementation of our solution for the Document Image Quality Assessment (DIQA) challenge at VQualA 2025. Our approach leverages fine-tuned multimodal vision-language models to assess the perceptual quality of enhanced document images.

## 🏆 Results

Our models and ensembles achieved the following scores on the DIQA-5000 dataset:

- **Qwen2.5-VL-7B**: 0.910
- **MiMo-VL-7B**: 0.913
- **Qwen2-VL-7B**: 0.870
- **Ensemble (Qwen2.5 + MiMo)**: 0.921
- **Ensemble (all three models)**: 0.924

## 📋 Table of Contents

- [Environment Setup](#environment-setup)
- [Dataset Preparation](#dataset-preparation)
- [Model Download](#model-download)
- [Training Data Generation](#training-data-generation)
- [Model Training](#model-training)
- [Inference and Submission](#inference-and-submission)
- [Model Ensemble](#model-ensemble)
- [Technical Details](#technical-details)

## 🔧 Environment Setup

### Prerequisites

- Python 3.8+
- CUDA 11.8+ (for GPU training)
- At least 24GB of GPU memory (recommended: NVIDIA RTX 4090 or A100)

### Installation

1. Clone the repository:

```bash
git clone https://github.com/your-username/vqual-a-2025-diqa-challenge.git
cd vqual-a-2025-diqa-challenge
```

2. Install the ms-swift framework:

```bash
cd ms-swift-main
pip install -e .
```

3. Install additional dependencies:

```bash
pip install modelscope
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

## 📊 Dataset Preparation

### Download the DIQA Dataset

1. Download the DIQA-5000 dataset from the official challenge website.
2. Extract and organize the dataset as follows:

```
datasets/
├── DIQA/
│   ├── train/
│   │   ├── train.csv
│   │   ├── res/        # Enhanced images
│   │   └── ori/        # Original images
│   ├── val/
│   │   ├── val.csv
│   │   ├── res/
│   │   └── ori/
│   └── test/
│       ├── test.csv
│       ├── res/        # Test images with TTA variants
│       ├── flip/       # Horizontally flipped images
│       ├── gamma/      # Gamma-corrected images
│       └── blur/       # Blurred images
└── train_data/         # Generated training data (JSON format)
```

### Dataset Statistics

- **Training set**: 3,465 images
- **Validation set**: 35 images
- **Test1 set**: 500 images
- **Test2 set**: 1,000 images
- **Image resolution**: variable (typically 4000×6000×3)

## 🤖 Model Download

Download the pre-trained models using ModelScope:

```bash
cd code
chmod +x down_models.sh
./down_models.sh
```

This will download:

- **Qwen2.5-VL-7B-Instruct**
- **MiMo-VL-7B-RL**
- **Qwen2-VL-7B-Instruct**

The models are saved to the `models/` directory.
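If you prefer to fetch the checkpoints from Python instead of the shell script, a minimal sketch using ModelScope's `snapshot_download` could look like this; the model IDs and `cache_dir` below are assumptions and may differ from what `down_models.sh` actually uses:

```python
# Sketch: download the three checkpoints via the ModelScope Python API.
# NOTE: the model IDs and cache_dir are assumptions; check down_models.sh
# for the identifiers the team actually uses.
from modelscope import snapshot_download

MODEL_IDS = [
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "XiaomiMiMo/MiMo-VL-7B-RL",
    "Qwen/Qwen2-VL-7B-Instruct",
]

for model_id in MODEL_IDS:
    # Download (or reuse a cached copy of) each model into models/.
    local_dir = snapshot_download(model_id, cache_dir="models/")
    print(f"{model_id} -> {local_dir}")
```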
## 📝 Training Data Generation

Generate training data in the format required by ms-swift:

```bash
cd code
python generate_training_data_diqa_train_round2.py
python gen_train_data_diqa_train_round2_addval0.4.py
python gen_train_data_cls_20.py
```

These scripts:

- Convert the CSV annotations to JSON format
- Create conversation-style training samples
- Apply a score-to-label mapping for better training
- Output: `datasets/train_data/diqa_train_round2.json`

### Training Data Format

Each training sample follows this structure (the prompt and response are written in Chinese in the actual data; they are translated here for readability):

```json
{
  "messages": [
    {
      "content": "This is an image processed by an enhancement algorithm. Please assess the document image quality...",
      "role": "user"
    },
    {
      "content": "The enhanced document image is scored as follows:\nOverall: 3.45\nSharpness: 3.67\nColor Fidelity: 3.23",
      "role": "assistant"
    }
  ],
  "images": ["/path/to/image.jpg"]
}
```

## 🚀 Model Training

### Training Configuration

Our training uses LoRA (Low-Rank Adaptation) fine-tuning with the following key parameters:

- **LoRA rank**: 16
- **LoRA alpha**: 32
- **Learning rate**: 1e-4
- **Epochs**: 5
- **Batch size**: 1 (with gradient accumulation of 16)
- **Precision**: bfloat16
- **Data augmentation**: enabled (brightness, grid distortion, cropping)

### Train Individual Models

#### 1. Train Qwen2.5-VL-7B-Instruct

```bash
cd ms-swift-main
chmod +x diqa_train_qwen2.5-vl.sh
./diqa_train_qwen2.5-vl.sh
```

#### 2. Train MiMo-VL-7B-RL

```bash
chmod +x diqa_train_mimo.sh
./diqa_train_mimo.sh
```

#### 3. Train Qwen2-VL-7B-Instruct

```bash
chmod +x diqa_train_qwen2-vl.sh
./diqa_train_qwen2-vl.sh
```

### Training Time

- **Approximate training time**: 6 hours per model on an NVIDIA RTX 4090
- **Memory usage**: ~22GB of GPU memory
- **Total training time**: ~18 hours for all three models

### Training Output

Models are saved to:

- `ms-swift-main/output/Qwen2.5-VL_diqa_train_round2_epoch5/`
- `ms-swift-main/output/MiMo-VL-7B-RL_diqa_round2_epoch5/`
- `ms-swift-main/output/Qwen2-VL_diqa_train_round2_epoch5/`

## 🔮 Inference and Submission

### Single Model Inference with TTA

Run inference with Test-Time Augmentation (TTA) for improved robustness:

```bash
cd ms-swift-main
python infer_submit_tta.py
```

Key features of our TTA approach:

- **Augmentation types**: original, horizontal flip, gamma correction, blur
- **Ensemble strategy**: average predictions across all TTA variants
- **Score rounding**: results rounded to 2 decimal places
- **Batch processing**: efficient batch inference for faster processing

### TTA Configuration

The TTA pipeline applies the following transformations (the sketch after this list shows how their predictions are averaged):

- **res**: original enhanced images
- **flip**: horizontally flipped images
- **gamma**: gamma-corrected images
- **blur**: Gaussian-blurred images
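As an illustration of the averaging strategy described above, the sketch below merges per-variant predictions into a single set of scores. The CSV file names and column names are assumptions, not the actual layout used by `infer_submit_tta.py`:

```python
# Sketch: average TTA predictions across the four variants (res, flip, gamma, blur).
# NOTE: file names and column names are hypothetical; infer_submit_tta.py may differ.
import pandas as pd

VARIANTS = ["res", "flip", "gamma", "blur"]
SCORE_COLS = ["overall", "sharpness", "color_fidelity"]

# One CSV of predictions per TTA variant, indexed by image name.
frames = [
    pd.read_csv(f"predictions_{variant}.csv").set_index("image_name")
    for variant in VARIANTS
]

# Average the three quality scores across variants and round to 2 decimal
# places, mirroring the score-rounding step described above.
ensemble = sum(df[SCORE_COLS] for df in frames) / len(frames)
ensemble = ensemble.round(2)

ensemble.reset_index().to_csv("submission.csv", index=False)
```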
### Inference Parameters

- **Batch size**: 4
- **Max tokens**: 32
- **Temperature**: 0 (deterministic)
- **Runtime**: ~0.75s per image on an RTX 4090

### Customizing Inference

To run inference with a specific model checkpoint, modify the parameters in `infer_submit_tta.py`:

```python
# Example configuration
adapter_path = '/path/to/your/checkpoint'
csv_name = 'your_results.csv'
submission_name = 'your_submission.zip'
```

## 🎯 Model Ensemble

Combine predictions from multiple models for improved performance:

```bash
cd code
python submit_combine.py
```

### Ensemble Strategy

Our final ensemble combines the three models with weighted averaging:

- **Qwen2.5-VL-7B**: 30% weight
- **MiMo-VL-7B**: 40% weight (best single model)
- **Qwen2-VL-7B**: 30% weight

### Ensemble Configuration

Modify the file paths and weights in `submit_combine.py`:

```python
# Model result files
csv_file_1 = '/path/to/qwen2.5_results.csv'
csv_file_2 = '/path/to/mimo_results.csv'
csv_file_3 = '/path/to/qwen2_results.csv'

# Ensemble weights (must sum to 1.0)
def blend_scores(col):
    return df1[col] * 0.3 + df2[col] * 0.4 + df3[col] * 0.3
```

### Ensemble Benefits

- **Improved robustness**: reduces individual model biases
- **Higher correlation**: better alignment with human judgments
- **Complementary strengths**: each model captures different aspects of quality

## 📊 Technical Details

### Model Architecture

- **Base models**: multimodal vision-language models
- **Fine-tuning method**: LoRA (Low-Rank Adaptation)
- **Vision encoder**: frozen during training
- **Text decoder**: LoRA-adapted layers

### Training Strategies

- **Data augmentation**: random brightness, grid distortion, cropping
- **Gradient accumulation**: 16 steps for an effective batch size of 16
- **Mixed precision**: bfloat16 for memory efficiency
- **Warmup ratio**: 0.05

### Evaluation Metrics

- **Primary metric**: Pearson correlation with human MOS scores
- **Secondary metrics**: Spearman correlation, RMSE
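For local validation, these metrics can be computed with SciPy and NumPy. The sketch below assumes hypothetical prediction and ground-truth CSV files with an `image_name` key and an `overall` score column; it is not the repository's `evaluate_predictions.py`:

```python
# Sketch: compute Pearson/Spearman correlation and RMSE between predicted
# and ground-truth overall scores. File and column names are hypothetical.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

pred = pd.read_csv("your_results.csv")   # model predictions
gt = pd.read_csv("val.csv")              # ground-truth MOS annotations

# Align the two tables on the image name.
merged = pred.merge(gt, on="image_name", suffixes=("_pred", "_gt"))

y_pred = merged["overall_pred"].to_numpy()
y_true = merged["overall_gt"].to_numpy()

plcc, _ = pearsonr(y_pred, y_true)    # primary metric
srcc, _ = spearmanr(y_pred, y_true)   # secondary metric
rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(f"PLCC={plcc:.4f}  SRCC={srcc:.4f}  RMSE={rmse:.4f}")
```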
### Hardware Requirements

- **Training**: NVIDIA RTX 4090 (24GB) or A100
- **Inference**: NVIDIA RTX 4090 or equivalent
- **Storage**: ~50GB for models and datasets

### Performance Optimization

- **LoRA parameters**: ~7M trainable parameters per model
- **Memory efficiency**: bfloat16 precision reduces memory usage by 50%
- **Batch processing**: TTA inference processes multiple augmentations simultaneously
- **Checkpoint selection**: best checkpoints selected based on validation performance

## 📁 File Structure

```
├── code/                                             # Utility scripts
│   ├── down_models.sh                                # Model download script
│   ├── generate_training_data_diqa_train_round2.py  # Training data generation
│   ├── submit_combine.py                             # Model ensemble script
│   ├── submit_single.py                              # Single model submission
│   └── evaluate_predictions.py                       # Evaluation utilities
├── ms-swift-main/                                    # ms-swift framework
│   ├── diqa_train_qwen2.5-vl.sh                      # Qwen2.5-VL training script
│   ├── diqa_train_mimo.sh                            # MiMo-VL training script
│   ├── diqa_train_qwen2-vl.sh                        # Qwen2-VL training script
│   ├── infer_submit_tta.py                           # TTA inference script
│   └── output/                                       # Training outputs
├── datasets/                                         # Dataset directory
│   ├── DIQA/                                         # Original DIQA dataset
│   └── train_data/                                   # Generated training data
└── models/                                           # Pre-trained models
```

## 🚀 Quick Start

For a complete end-to-end run:

```bash
# 1. Set up the environment
cd ms-swift-main && pip install -e .
pip install modelscope torch torchvision torchaudio

# 2. Download the models
cd ../code && ./down_models.sh

# 3. Generate training data
python generate_training_data_diqa_train_round2.py

# 4. Train the models (run in parallel on different GPUs)
cd ../ms-swift-main
./diqa_train_qwen2.5-vl.sh   # GPU 0
./diqa_train_mimo.sh         # GPU 1
./diqa_train_qwen2-vl.sh     # GPU 2

# 5. Run inference with TTA
python infer_submit_tta.py

# 6. Create the ensemble submission
cd ../code && python submit_combine.py
```

## 🔧 Troubleshooting

### Common Issues

1. **CUDA Out of Memory**
   - Reduce the batch size or gradient accumulation steps
   - Use a smaller MAX_PIXELS value
   - Enable gradient checkpointing

2. **Model Loading Errors**
   - Ensure the models are downloaded completely
   - Check file permissions
   - Verify ModelScope authentication

3. **Training Convergence Issues**
   - Adjust the learning rate (try 5e-5 or 2e-4)
   - Increase the warmup steps
   - Check data quality and format

4. **Inference Speed Optimization**
   - Increase the batch size if memory allows
   - Use mixed-precision inference
   - Consider model quantization

### Performance Tips

- **Multi-GPU Training**: modify CUDA_VISIBLE_DEVICES in the training scripts
- **Data Loading**: increase dataloader_num_workers for faster I/O
- **Memory Management**: monitor GPU memory usage with `nvidia-smi`
- **Checkpoint Management**: keep only the best checkpoints to save disk space

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **ms-swift framework**: [ModelScope Swift](https://github.com/modelscope/swift)
- **Qwen2-VL**: [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL)
- **Qwen2.5-VL**: [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
- **MiMo-VL**: [MiMo-VL](https://github.com/XiaomiMiMo/MiMo-VL)
- **VQualA 2025 Challenge**: for providing the DIQA-5000 dataset

## 📧 Contact

- **Team Leader**: Zhe Zhang (zhe.zhang@njust.edu.cn)
- **Affiliation**: Nanjing University of Science and Technology

---

**Note**: This implementation is based on our submission to the VQualA 2025 DIQA Challenge. For the latest updates and improvements, please check our repository regularly.