# ai-learning

**Repository Path**: wzcool/ai-learning

## Basic Information

- **Project Name**: ai-learning
- **Description**: ai-learning
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-30
- **Last Updated**: 2025-07-30

## README

# Used Car Price Prediction - Tianchi Competition

This project implements a machine learning pipeline for predicting used car prices using ensemble methods.

## Dataset Overview

- **Training Data**: 150,000 samples with 31 columns (the 30 features listed below plus the `price` target)
- **Test Data**: 50,000 samples with 30 columns (no `price`)
- **Target**: Car price (ranging from 11 to 99,999)

## Features

### Original Features
- `SaleID`: Unique identifier
- `name`: Car name/model identifier
- `regDate`: Registration date
- `model`: Car model code
- `brand`: Brand identifier
- `bodyType`: Body type (SUV, sedan, etc.)
- `fuelType`: Fuel type
- `gearbox`: Transmission type
- `power`: Engine power
- `kilometer`: Mileage
- `notRepairedDamage`: Damage repair status
- `regionCode`: Region code
- `seller`: Seller type
- `offerType`: Offer type
- `creatDate`: Creation date
- `v_0` to `v_14`: Anonymous features

### Engineered Features
- `car_age`: Age of the car (calculated from registration date)
- `power_per_km`: Ratio of `power` to `kilometer`
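
The two engineered features can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes that `regDate` and `creatDate` are stored as `YYYYMMDD` integers and that age is measured in years up to `creatDate`. The helper name `add_engineered_features` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with car_age and power_per_km added."""
    out = df.copy()
    # Assumption: dates are YYYYMMDD integers; unparseable values become NaT.
    reg = pd.to_datetime(out["regDate"].astype(str), format="%Y%m%d", errors="coerce")
    created = pd.to_datetime(out["creatDate"].astype(str), format="%Y%m%d", errors="coerce")
    # Assumption: age is counted in years between registration and listing.
    out["car_age"] = (created - reg).dt.days / 365.25
    # Replace a zero mileage with NaN so the ratio never divides by zero.
    out["power_per_km"] = out["power"] / out["kilometer"].replace(0, np.nan)
    return out


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "regDate": [20100101, 20150615],
    "creatDate": [20160101, 20160615],
    "power": [100, 150],
    "kilometer": [10.0, 0.0],
})
features = add_engineered_features(sample)
```

Rows with an unparseable date or a zero `kilometer` end up with missing values, which the median fill in the preprocessing step could then handle.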

## Model Architecture

### Ensemble Approach
The solution uses an ensemble of three models:

1. **XGBoost Regressor**
   - n_estimators: 500
   - max_depth: 8
   - learning_rate: 0.05

2. **Random Forest Regressor**
   - n_estimators: 300
   - max_depth: 10

3. **Ridge Regression**
   - alpha: 10
   - Uses standardized features

### Final Prediction
The final prediction is the unweighted mean of the three models' predictions.
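
A minimal sketch of the averaging step, assuming each ensemble member exposes a scikit-learn style `predict` method. The helper `average_predictions` and the synthetic data are hypothetical. The XGBoost member is shown only as a comment so that the sketch runs with scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def average_predictions(models, X):
    """Unweighted mean of the predictions of already-fitted models."""
    return np.mean([m.predict(X) for m in models], axis=0)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = [
    RandomForestRegressor(n_estimators=300, max_depth=10, random_state=0),
    # Ridge sees standardized features, as described above.
    make_pipeline(StandardScaler(), Ridge(alpha=10)),
    # The third member would be
    # xgboost.XGBRegressor(n_estimators=500, max_depth=8, learning_rate=0.05);
    # it is omitted here to keep the sketch free of the xgboost dependency.
]
for model in models:
    model.fit(X, y)

ensemble_pred = average_predictions(models, X)
```

Wrapping Ridge in a pipeline with `StandardScaler` is one way to keep scaling attached to the only model that needs it, while the tree-based members receive the raw features.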

## Pipeline Components

### 1. Data Loading
- Reads the space-delimited data files
- Loads both training and test datasets
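
Reading a space-delimited file with pandas only needs the `sep` argument. The sketch below uses a small in-memory stand-in rather than the real data files, whose paths are not given here. The helper name `load_space_separated` and the sample values are hypothetical.

```python
import io

import pandas as pd


def load_space_separated(source) -> pd.DataFrame:
    """Read a table whose fields are separated by single spaces."""
    return pd.read_csv(source, sep=" ")


# Tiny in-memory stand-in for a data file, with made-up values.
raw = io.StringIO("SaleID name price\n0 736 1850\n1 2262 3600\n")
train_df = load_space_separated(raw)
```

The same function accepts a file path in place of the `StringIO` object.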

### 2. Data Preprocessing
- **Missing Value Handling**:
  - Categorical: Fill with mode
  - Numerical: Fill with median
- **Categorical Encoding**: Label encoding for categorical variables
- **Feature Scaling**: StandardScaler for Ridge regression
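
The imputation and scaling rules above might look like the following sketch. Which columns count as categorical or numerical is an assumption made for the example, and `fill_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def fill_missing(df, categorical_cols, numerical_cols):
    """Fill categorical columns with their mode and numerical columns with their median."""
    out = df.copy()
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in numerical_cols:
        out[col] = out[col].fillna(out[col].median())
    return out


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "fuelType": [0.0, 0.0, 1.0, np.nan],
    "power": [60.0, 100.0, np.nan, 140.0],
})
clean = fill_missing(sample, categorical_cols=["fuelType"], numerical_cols=["power"])

# Standardize to zero mean and unit variance before a linear model such as Ridge.
scaled = StandardScaler().fit_transform(clean[["power"]])
```

In a train/test setting, the fill values and the scaler would normally be computed on the training data and then reused on the test data, so that no test statistics leak into the model.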

### 3. Feature Engineering
- Car age calculation from registration date
- Power-to-kilometer ratio feature
- Handles unseen categories in test data
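
One possible way to label-encode while tolerating categories that appear only in the test data is to build the mapping from the training values and send everything else to a sentinel code. The source does not say which strategy the pipeline uses, so treat this as an assumption. The helper names, the sentinel `-1`, and the sample values are hypothetical.

```python
import pandas as pd


def fit_label_mapping(train_values: pd.Series) -> dict:
    """Map each distinct training value to an integer code, in sorted order."""
    return {value: code for code, value in enumerate(sorted(train_values.dropna().unique()))}


def encode_labels(values: pd.Series, mapping: dict, unseen_code: int = -1) -> pd.Series:
    """Apply the mapping; values absent from training get unseen_code."""
    return values.map(mapping).fillna(unseen_code).astype(int)


# Made-up category values, for illustration only.
train_body = pd.Series(["sedan", "suv", "sedan", "van"])
test_body = pd.Series(["suv", "coupe", "sedan"])

mapping = fit_label_mapping(train_body)
encoded_test = encode_labels(test_body, mapping)
```

Here `coupe` never occurs in training, so it receives the sentinel code instead of raising an error.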

### 4. Model Training
- Individual model training and evaluation
- 5-fold cross-validation
- Performance metrics: MSE, MAE, R²
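
A sketch of 5-fold cross-validation reporting the three metrics with scikit-learn's `cross_validate`. It uses Ridge on synthetic data as a stand-in, and the shuffling and random seed are assumptions rather than the pipeline's documented settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    Ridge(alpha=10),
    X,
    y,
    cv=cv,
    scoring=("neg_mean_squared_error", "neg_mean_absolute_error", "r2"),
)

# scikit-learn reports errors as negated scores, so flip the sign back.
mse = -scores["test_neg_mean_squared_error"]
mae = -scores["test_neg_mean_absolute_error"]
r2 = scores["test_r2"]
```

Each array holds one value per fold, so the mean and standard deviation across folds give a rough picture of how stable the model is.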

### 5. Prediction & Submission
- Ensemble prediction on test data
- Generates competition-format submission file

## Usage

```bash
# Run the complete pipeline
python used_car_price_prediction.py
```

This will:
1. Load and explore the data
2. Preprocess and engineer features
3. Train the ensemble model
4. Generate predictions
5. Save the submission file as `used_car_submission.csv`
6. Save the trained model as `used_car_model.pkl`
7. Generate a feature importance plot

## Output Files

- `used_car_submission.csv`: Competition submission file
- `used_car_model.pkl`: Trained model for future use
- `feature_importance.png`: Feature importance visualization
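
Saving and reloading a fitted model could be done with `joblib`, which appears in the requirements. Whether the pipeline itself uses `joblib` for `used_car_model.pkl` is an assumption. The sketch writes to a temporary directory and uses a tiny Ridge model as a stand-in for the trained ensemble.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Tiny stand-in model fitted on made-up data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 2.0, 4.0, 6.0])
model = Ridge(alpha=10).fit(X, y)

# A temporary directory keeps the sketch from touching real project files.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "used_car_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

restored_pred = restored.predict(X)
```

The restored object predicts exactly as the original did, provided it is loaded with compatible library versions.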

## Model Performance

The pipeline reports the following evaluation results:
- Individual model performance on validation set
- Ensemble model performance
- 5-fold cross-validation scores
- Feature importance analysis
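
Feature importance can be read from a fitted tree ensemble. The sketch below ranks features using the `feature_importances_` attribute of a random forest on synthetic data. Which model the pipeline takes its importances from is not stated, so this choice is an assumption, and the column names are used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mostly on car_age, slightly on power,
# and not at all on noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["car_age", "power", "noise"])
y = -5.0 * X["car_age"] + 1.0 * X["power"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0).fit(X, y)

# Importances sum to 1; sort so the most influential feature comes first.
importance = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
```

The sorted series can be passed to a bar plot to produce a chart like `feature_importance.png`.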

## Key Features

- **Robust Data Handling**: Handles missing values and unseen categories
- **Feature Engineering**: Creates meaningful derived features
- **Ensemble Learning**: Averages XGBoost, Random Forest, and Ridge predictions
- **Cross-Validation**: Uses 5-fold CV to estimate how well the models generalize
- **Modular Design**: Object-oriented approach for easy modification
- **Comprehensive Evaluation**: Multiple metrics and visualizations

## Requirements

```
pandas
numpy
scikit-learn
xgboost
matplotlib
seaborn
joblib
```

## Competition Format

The solution generates predictions in the required Tianchi competition format with columns:
- `SaleID`: Test sample identifier
- `price`: Predicted price
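
A submission with these two columns could be assembled as in the sketch below. The helper `build_submission`, the sample IDs and prices, and the comma delimiter are assumptions. The output goes to an in-memory buffer instead of `used_car_submission.csv` so the example has no side effects.

```python
import io

import numpy as np
import pandas as pd


def build_submission(sale_ids, predictions) -> pd.DataFrame:
    """Pair each test SaleID with its predicted price."""
    return pd.DataFrame({"SaleID": sale_ids, "price": predictions})


# Made-up IDs and predictions, for illustration only.
submission = build_submission([150000, 150001], np.array([1850.5, 3600.0]))

# index=False keeps the pandas row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing a file path to `to_csv` instead of the buffer writes the same content to disk.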