# ai-learning

**Repository Path**: wzcool/ai-learning

## Basic Information

- **Project Name**: ai-learning
- **Description**: ai-learning
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-30
- **Last Updated**: 2025-07-30

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Used Car Price Prediction - Tianchi Competition

This project implements a machine learning pipeline that predicts used car prices with an ensemble of three regression models.

## Dataset Overview

- **Training Data**: 150,000 samples with 31 columns (30 features plus the `price` target)
- **Test Data**: 50,000 samples with 30 features
- **Target**: Car price (ranging from 11 to 99,999)

## Features

### Original Features

- `SaleID`: Unique identifier
- `name`: Car name/model identifier
- `regDate`: Registration date
- `model`: Car model code
- `brand`: Brand identifier
- `bodyType`: Body type (SUV, sedan, etc.)
- `fuelType`: Fuel type
- `gearbox`: Transmission type
- `power`: Engine power
- `kilometer`: Mileage
- `notRepairedDamage`: Damage repair status
- `regionCode`: Region code
- `seller`: Seller type
- `offerType`: Offer type
- `creatDate`: Creation date
- `v_0` to `v_14`: Anonymous features

### Engineered Features

- `car_age`: Age of the car (calculated from registration date)
- `power_per_km`: Power-to-mileage ratio

## Model Architecture

### Ensemble Approach

The solution uses an ensemble of three models:

1. **XGBoost Regressor**
   - n_estimators: 500
   - max_depth: 8
   - learning_rate: 0.05
2. **Random Forest Regressor**
   - n_estimators: 300
   - max_depth: 10
3. **Ridge Regression**
   - alpha: 10
   - Uses standardized features

### Final Prediction

The final prediction is the simple (unweighted) average of the three models' predictions.

## Pipeline Components

### 1. Data Loading

- Handles space-separated CSV format
- Loads both training and test datasets

### 2. Data Preprocessing

- **Missing Value Handling**:
  - Categorical: Fill with mode
  - Numerical: Fill with median
- **Categorical Encoding**: Label encoding for categorical variables
- **Feature Scaling**: StandardScaler for Ridge regression

### 3. Feature Engineering

- Car age calculation from registration date
- Power-to-kilometer ratio feature
- Handles unseen categories in test data

### 4. Model Training

- Individual model training and evaluation
- 5-fold cross-validation
- Performance metrics: MSE, MAE, R²

### 5. Prediction & Submission

- Ensemble prediction on test data
- Generates competition-format submission file

## Usage

```bash
# Run the complete pipeline
python used_car_price_prediction.py
```

This will:

1. Load and explore the data
2. Preprocess and engineer features
3. Train the ensemble model
4. Generate predictions
5. Save submission file as 'used_car_submission.csv'
6. Save trained model as 'used_car_model.pkl'
7. Generate feature importance plot

## Output Files

- `used_car_submission.csv`: Competition submission file
- `used_car_model.pkl`: Trained model for future use
- `feature_importance.png`: Feature importance visualization

## Model Performance

The pipeline reports:

- Individual model performance on the validation set
- Ensemble model performance
- 5-fold cross-validation scores
- Feature importance analysis

## Key Features

- **Robust Data Handling**: Handles missing values and unseen categories
- **Feature Engineering**: Creates derived features (`car_age`, `power_per_km`)
- **Ensemble Learning**: Averages three different algorithms
- **Cross-Validation**: Checks generalization with 5-fold CV
- **Modular Design**: Object-oriented approach for easy modification
- **Comprehensive Evaluation**: Multiple metrics and visualizations

## Requirements

```
pandas
numpy
scikit-learn
xgboost
matplotlib
seaborn
joblib
```

## Competition Format

The solution generates predictions in the required Tianchi competition format with columns:

- `SaleID`: Test sample identifier
- `price`: Predicted price
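
The engineered features, the simple-average ensemble, and the submission format described above can be sketched in a few lines. This is a minimal illustration, not the code in `used_car_price_prediction.py`: the helper names are hypothetical, the dates are assumed to be integers in `YYYYMMDD` form, `car_age` is assumed to be measured up to `creatDate`, zero mileage is assumed to be treated as missing, and the three prediction arrays are made-up stand-ins for the outputs of the XGBoost, Random Forest, and Ridge models.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add car_age and power_per_km (hypothetical helper, not from the repo)."""
    out = df.copy()
    # Assumption: regDate and creatDate are integers in YYYYMMDD form.
    reg = pd.to_datetime(out["regDate"].astype(str), format="%Y%m%d", errors="coerce")
    created = pd.to_datetime(out["creatDate"].astype(str), format="%Y%m%d", errors="coerce")
    # Age in years at listing time; unparseable dates become NaN.
    out["car_age"] = (created - reg).dt.days / 365.25
    # Assumption: zero mileage is treated as missing to avoid division by zero.
    km = out["kilometer"].replace(0, np.nan)
    out["power_per_km"] = out["power"] / km
    return out


def average_predictions(*preds) -> np.ndarray:
    """Simple (unweighted) average of equally shaped prediction arrays."""
    return np.mean(np.vstack(preds), axis=0)


def make_submission(sale_ids, prices) -> pd.DataFrame:
    """Build a two-column frame in the SaleID/price submission layout."""
    return pd.DataFrame({"SaleID": sale_ids, "price": prices})


# Toy test rows; the values are invented for illustration only.
test = pd.DataFrame({
    "SaleID": [0, 1],
    "regDate": [20100101, 20150601],
    "creatDate": [20160101, 20160601],
    "power": [100, 150],
    "kilometer": [10.0, 0.0],
})
features = add_engineered_features(test)

# Stand-in predictions from the three models (not real model output).
xgb_pred = np.array([3000.0, 9000.0])
rf_pred = np.array([3300.0, 8700.0])
ridge_pred = np.array([2700.0, 9300.0])

ensemble = average_predictions(xgb_pred, rf_pred, ridge_pred)
submission = make_submission(test["SaleID"], ensemble)
```

Calling `submission.to_csv("used_car_submission.csv", index=False)` would then write the file in the two-column layout listed above. Whether the competition expects a comma or another delimiter is not stated here, so check the official sample submission before uploading.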