# MegaHan97K
**Repository Path**: dlml2/MegaHan97K
## Basic Information
- **Project Name**: MegaHan97K
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-07-03
- **Last Updated**: 2025-07-03
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

[](http://dlvc-lab.net/lianwen/)
[](https://www.sciencedirect.com/science/article/abs/pii/S0031320325004170)
[](https://arxiv.org/abs/2506.04807)
[](https://github.com/SCUT-DLVCLab/MegaHan97K)
* We introduce **MegaHan97K**, a mega-category, large-scale dataset that contains the largest 97,455 Chinese character categories.
* MegaHan97K includes Chinese characters of 97,455 categories, which significantly surpasses existing datasets with at least six times larger categories and holds the largest volume.
* MegaHan97K pioneers to support the latest Chinese GB18030-2022 standard, ensuring the most comprehensive coverage and compatibility with modern Chinese processing systems.
* MegaHan97K contains three distinct subsets: handwritten, historical, and synthetic. Each subset contains a greater number of character categories compared to existing datasets, resulting in remarkable scale and diversity advantages.
* MegaHan97K effectively mitigates long-tail distribution issues by providing a balanced and sufficient number of samples for each category, ensuring robust training and validation of CCR models.

## 🔥 Download
| **Setting** | **Dataset** | **status** |
|-------------------------|----------------|------------|
| **General CCR** | [Baiduyun:k4ch](https://pan.baidu.com/s/1LwIS-K812Q0LjBajpvQeVw?pwd=k4ch)/[OneDrive](https://1drv.ms/u/c/d3b0ec8fe3491f94/EYi4e5_dtLBMmFl9I669KjEBr2PqPWEd7VLxeIzHDlKhgg?e=YXrQEO) | Released |
| **Zero-Shot CCR** | [Baiduyun:bxde](https://pan.baidu.com/s/1tKhrIZK7zmpQq3NNCo5Edw?pwd=bxfe)/[OneDrive](https://1drv.ms/u/c/d3b0ec8fe3491f94/ETsFnx-i6sRJvrVrgnvO3h4BMugmO2TUObjD9ddz3xfEmw?e=IoUcXq) | Released |
## 🛠️ Usage
* Clone this repo:
```bash
git clone https://github.com/SCUT-DLVCLab/MegaHan97K.git
```
* Execute the following command to obtain example samples from the MegaHan97K dataset.
```python
python MegaHan_Dataloader.py
```
**Note:**
- The MegaHan97K dataset can only be used for non-commercial research purposes. For scholar or organization who wants to use the MegaHan97K dataset, please first fill in this [Application Form](./application-form/Application-Form-for-Using-MegaHan97K.docx) and sign the [Legal Commitment](./application-form/Legal-Commitment.docx) and email them to us. When submitting the application form to us, please list or attached 1-2 of your publications in the recent 6 years to indicate that you (or your team) do research in the related research fields of handwriting analysis and recognition, document image processing, and so on.
- We will give you the decompression password after your application has been received and approved.
- All users must follow all use conditions; otherwise, the authorization will be revoked.
* To access the entire dataset, please first download it, update the ```data_root``` in the python ```MegaHan_Dataloader.py``` script and then execute
```python
python MegaHan_Dataloader.py
```
## ☎️ Cotact
If you have any questions, feel free to contact [Yuyi Zhang](https://github.com/ZZXF11) at [yuyi.zhang11@foxmail.com](yuyi.zhang11@foxmail.com)
## 🌄 Gallery
* **Illustration of the handwritten-original data in MegaHan97K**

* **Illustration of the handwritten-augmented data in MegaHan97K**

* **Illustration of the M5HisDoc data in MegaHan97K**

* **Illustration of the Kangxi dictionary data in MegaHan97K**

* **Illustration of the handwritten-original data in MegaHan97K**

* **Illustration of the handwritten-augmented data in MegaHan97K**

* **Illustration of the synthetic data in MegaHan97K**

## 💙 Acknowledgement
- [M5HisDoc](https://github.com/HCIILAB/M5HisDoc)
- [FontDiffuser](https://github.com/yeungchenwa/FontDiffuser)
## License
MegaHan97K should be used and distributed under [Creative Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License](https://creativecommons.org/licenses/by-nc-nd/4.0/) for non-commercial research purposes.
## Copyright
- This repository can only be used for non-commercial research purposes.
- For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
- Copyright 2025, [Deep Learning and Vision Computing Lab (DLVC-Lab)](http://www.dlvc-lab.net), South China University of Technology.