# magic-doc
**Repository Path**: lzdn/magic-doc
## Basic Information
- **Project Name**: magic-doc
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: issue-template-patch-1
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-25
- **Last Updated**: 2025-04-25
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
👋 Join us on Discord and WeChat
[English](README.md) | [简体中文](README_zh-CN.md)
### Installation
Prerequisite: Python 3.10
Install dependencies
**Linux/macOS**
```bash
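sudo apt-get install libreoffice   # Debian/Ubuntu
sudo yum install libreoffice       # CentOS/RHEL
brew install --cask libreoffice    # macOS (Homebrew)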
```
**Windows**
```text
Install LibreOffice
Add "install_dir\LibreOffice\program" to the PATH environment variable
```
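To confirm that LibreOffice is reachable through PATH, a quick check like the one below can help. It is a minimal sketch: the helper name `find_libreoffice` is hypothetical, and it assumes the LibreOffice launcher is named `soffice` or `libreoffice`, which is the usual case but may differ on your system.

```python
import shutil

def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the full path of the first LibreOffice launcher found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

# Look up the launcher and report where it was found, if anywhere.
location = find_libreoffice()
print(location or "LibreOffice not found on PATH")
```

If this reports that nothing was found, open a new terminal so the updated PATH is picked up, then run it again.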
Install Magic-Doc
```bash
pip install "fairy-doc[cpu]"  # install the CPU version
# or
pip install "fairy-doc[gpu]"  # install the GPU version
```
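## Introduction
Magic-Doc is a lightweight, open-source tool that converts documents in multiple formats (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It can convert local files as well as files stored on AWS S3.
## Usage Example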
```python
# for local file
from magic_doc.docconv import DocConverter, S3Config
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)
```
```python
# for remote file located in aws s3
from magic_doc.docconv import DocConverter, S3Config
s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
```
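The same `convert` call can be applied to a whole folder. The sketch below is illustrative rather than part of Magic-Doc: `convert_directory` and `SUPPORTED_SUFFIXES` are hypothetical names, and the function only assumes that the converter passed in has a `convert(path, conv_timeout=...)` method returning `(markdown_content, time_cost)`, as in the examples above.

```python
from pathlib import Path

# Extensions taken from the formats listed in the introduction.
SUPPORTED_SUFFIXES = {".ppt", ".pptx", ".doc", ".docx", ".pdf"}

def convert_directory(converter, src_dir, out_dir, conv_timeout=300):
    """Convert every supported file in src_dir and write one .md file per input.

    `converter` is any object with a convert(path, conv_timeout=...) method
    that returns (markdown_content, time_cost).
    Returns a dict mapping each input file name to its time_cost.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    timings = {}
    for doc in sorted(src.iterdir()):
        # Skip subdirectories and files in formats that are not listed as supported.
        if not doc.is_file() or doc.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        markdown_content, time_cost = converter.convert(str(doc), conv_timeout=conv_timeout)
        # Keep the original extension in the output name so a.doc and a.docx do not collide.
        (out / (doc.name + ".md")).write_text(markdown_content, encoding="utf-8")
        timings[doc.name] = time_cost
    return timings
```

With Magic-Doc installed, this could be called as `convert_directory(DocConverter(s3_config=None), "docs", "markdown_out")`, where both directory names are placeholders.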
## Performance
Environment: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7
| File type | Conversion speed (pages/s) |
| ------------------ | -------- |
| PDF (digital) | 347 |
| PDF (ocr) | 2.7 |
| PPT | 20 |
| PPTX | 149 |
| DOC | 600 |
| DOCX | 1482 |
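
These figures give a rough way to size `conv_timeout` for large documents: divide the page count by the throughput. The snippet below is a back-of-the-envelope sketch; `PAGES_PER_SECOND` and `estimate_seconds` are hypothetical names, and the numbers apply only to the benchmark hardware listed above.

```python
# Throughput in pages per second, copied from the table above.
PAGES_PER_SECOND = {
    "pdf_digital": 347,
    "pdf_ocr": 2.7,
    "ppt": 20,
    "pptx": 149,
    "doc": 600,
    "docx": 1482,
}

def estimate_seconds(file_type, pages):
    """Estimate conversion time as pages divided by throughput."""
    return pages / PAGES_PER_SECOND[file_type]

print(f"{estimate_seconds('pdf_ocr', 1000):.0f} s")  # 1000 / 2.7, prints "370 s"
print(f"{estimate_seconds('pptx', 1000):.1f} s")     # 1000 / 149, prints "6.7 s"
```

By this estimate, a 1000-page scanned PDF would take longer than the `conv_timeout=300` used in the usage examples, assuming that value is in seconds, which the examples do not state.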
## Acknowledgements
- [Antiword](https://github.com/rsdoiel/antiword)
- [LibreOffice](https://www.libreoffice.org/)
- [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)
- [paddleocr](https://github.com/PaddlePaddle/PaddleOCR)
## 🖊️ Citation
```bibtex
@misc{2024magic-doc,
title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
author={Magic-Doc Contributors},
howpublished = {\url{https://github.com/InternLM/magic-doc}},
year={2024}
}
```
## License
This project is released under the [Apache 2.0 license](LICENSE).