# TextHarmony **Repository Path**: ByteDance/TextHarmony ## Basic Information - **Project Name**: TextHarmony - **Description**: The official code for NeurIPS 2024 paper: Harmonizing Visual Text Comprehension and Generation - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2024-10-09 - **Last Updated**: 2026-01-25 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Harmonizing Visual Text Comprehension and Generation ## Environment **step 1**: set up the environment ``` git clone https://github.com/bytedance/TextHarmony cd TextHarmony pip install -r requirements.txt # install `MultiScaleDeformableAttention` module cd TextHarmony/models/utils/ops python setup.py install ``` some of the packages like mmcv and flash_attn in requirements.txt may need to be installed manually. **step 2**: download pretraining weights ``` cd TextHarmony python TextHarmony/scripts/download_hf_models.py ``` **step 3**: download the model weight of [TextHarmony](https://huggingface.co/jingqun/textharmony) ``` # concatenate the model files cat pytorch_model.binaa pytorch_model.binab pytorch_model.binac > pytorch_model.bin ``` ## Inference **step1**: modify 'load_from', 'llm_model_path', 'encoder_model_path' and 'pretrained_model_name_or_path' in example_inference.yaml **step 2**: run the following command: ``` torchrun --nproc_per_node 1 --nnodes 1 --master_port 2333 inference.py --config_file=TextHarmony/TextHarmony/configs/release/example_inference.yaml ``` ## Evaluation ### image comprehension **step1**: modify 'data_root' and 'data_path' in 896-moe-eval.yaml. The structure of 'data_path' should be as follows: ``` [ { "image": image_path, "question": question, "answer": answer }, ] ``` **step 2**: run the following command ``` torchrun --nproc_per_node 1 --nnodes 1 --master_port 2333 evaluate.py --config_file=TextHarmony/TextHarmony/configs/release/896-moe-eval.yaml ``` ### image generation **step 1**: download [AnyText-Benchmark](https://github.com/tyxsspa/AnyText?tab=readme-ov-file) **step 2**: generate the target images ``` torchrun --nproc_per_node 1 --nnodes 1 --master_port 2333 inference.py --config_file=TextHarmony/TextHarmony/configs/release/896-moe-inference.yaml ``` **step 3**: calculate the results ``` python TextHarmony/image_eval/eval_dgocr.py ``` ## Training * **TODO** ## Acknowledgment We thank the great work of [MM-Interleaved](https://github.com/OpenGVLab/MM-Interleaved), [TextDiffuser](https://github.com/microsoft/unilm/tree/master/textdiffuser-2), [AnyText](https://github.com/tyxsspa/AnyText) and [LoRAMoE](https://github.com/Ablustrund/LoRAMoE) ## Citation ``` @article{zhao2024harmonizing, title={Harmonizing Visual Text Comprehension and Generation}, author={Zhao, Zhen and Tang, Jingqun and Wu, Binghong and Lin, Chunhui and Wei, Shu and Liu, Hao and Tan, Xin and Zhang, Zhizhong and Huang, Can and Xie, Yuan}, journal={arXiv preprint arXiv:2407.16364}, year={2024} } ```