# pdf_test

**Repository Path**: atp2p/pdf_test

## Basic Information

- **Project Name**: pdf_test
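- **Description**: Résumé recognition and analysis, with a PDF image recognition rate above 90%. Uses AI large-model algorithms to analyze and archive a résumé database.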
- **Primary Language**: Java
- **License**: AGPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-25
- **Last Updated**: 2025-05-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# pdf_test

## Introduction and environment setup
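Python version: recorded as `3.101`, presumably 3.10.1.

Export the packages installed in the current environment to requirements.txt:

python -m pip freeze > requirements.txt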

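### FAQ: `pip --version` does not display normally
The command fails with: AttributeError: module 'types' has no attribute 'UnionType'

`types.UnionType` was added in Python 3.10, so this error usually means pip is being run by an older interpreter than the installed packages expect.

Solution:
Installing pip from within VS kept failing. Reinstalling the Python version that matches the VS installation resolved it.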
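
To confirm which interpreter is actually in use before reinstalling anything, a small check like the one below may help. It is only a diagnostic sketch; the function name `supports_union_type` is made up for this example.

```python
import sys
import types


def supports_union_type() -> bool:
    """Return True if this interpreter provides types.UnionType (Python 3.10+)."""
    return hasattr(types, "UnionType")


if __name__ == "__main__":
    # Shows which Python runs the script; compare it with the one pip uses.
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    print("types.UnionType available:", supports_union_type())
```

If the last line prints `False`, the interpreter is older than 3.10 and the error above is expected. Running pip as `python -m pip` with the intended interpreter avoids mixing installations.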

### Install dependencies with pip through the Tsinghua mirror in China (the default index downloads too slowly)
python -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/

python -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple/

### Install a PDF reading and parsing library
pip install PyPDF2 -i https://pypi.tuna.tsinghua.edu.cn/simple/
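
As a rough illustration of how PyPDF2 could be used here, the sketch below reads the text layer of each page and applies a simple heuristic to decide whether a file looks scanned. The function names, the `min_chars` threshold, and the file name in the usage note are assumptions for this example, not part of the repository.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the text layer of each page; pages without text yield ''."""
    # Imported lazily so that needs_ocr() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def needs_ocr(page_texts: List[str], min_chars: int = 20) -> bool:
    """Heuristic: treat the PDF as scanned if every page has almost no text."""
    return all(len(text.strip()) < min_chars for text in page_texts)
```

For example, `needs_ocr(extract_page_texts("resume.pdf"))` returning `True` would suggest sending the file to an OCR engine instead of relying on the text layer.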

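### Using OCR
Scanned PDF files may contain images rather than selectable text, so optical character recognition (OCR) is needed to extract the text. A commonly used OCR library in Python is pytesseract, a wrapper around Google's Tesseract OCR engine.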
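
A minimal sketch of OCR on a single page image with pytesseract follows. It assumes Pillow is installed alongside pytesseract and that the `chi_sim` and `eng` language data are available to Tesseract; the helper names are hypothetical.

```python
def clean_ocr_text(raw: str) -> str:
    """Strip surrounding whitespace on each line and drop blank lines."""
    lines = [line.strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)


def ocr_image(image_path: str, lang: str = "chi_sim+eng") -> str:
    """Run Tesseract on one image and return the cleaned text."""
    # Lazy imports: Pillow, pytesseract and the Tesseract binary must be installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        raw = pytesseract.image_to_string(img, lang=lang)
    return clean_ocr_text(raw)
```

The `lang` value is an assumption suited to Chinese and English résumés; use whichever language packs are installed locally.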


#### ModuleNotFoundError: No module named 'pytesseract'
 pip install pytesseract -i https://pypi.tuna.tsinghua.edu.cn/simple/
 
 
#### tesseract is not installed or it's not in your PATH.
This error means that Tesseract is either not installed on your system or not on the system PATH. Tesseract is an OCR (Optical Character Recognition) engine used for text recognition.

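You can resolve this with the following steps:

##### 1. Install Tesseract

Windows: download the Windows installer from the Releases section of the official Tesseract GitHub page and follow its instructions.

Linux: most distributions provide Tesseract through the package manager. On Ubuntu, for example, run:

sudo apt-get install tesseract-ocr

##### 2. Add Tesseract to PATH

Make sure the directory containing the Tesseract executable is on the system PATH so that Python's pytesseract module can find it.

Windows: add the Tesseract installation directory to the system environment variables.

Linux and macOS: add the following line to your shell configuration file (for example .bashrc or .zshrc):

export PATH=/path/to/tesseract/bin:$PATH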
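
Whether the PATH change took effect can also be checked from Python. The sketch below looks up the executable with the standard library and, if needed, points pytesseract at an explicit location through its `tesseract_cmd` setting. The function names are illustrative.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def configure_pytesseract(explicit_path: Optional[str] = None) -> str:
    """Point pytesseract at a Tesseract binary, or raise with a clear message."""
    path = explicit_path or find_tesseract()
    if path is None:
        raise RuntimeError(
            "tesseract not found on PATH; install it or pass explicit_path"
        )
    import pytesseract  # lazy import so find_tesseract() works without it

    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Passing `explicit_path` is useful on Windows when the installer did not update PATH; the actual installation directory depends on the machine.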

### Reading PDFs with easyocr
 pip install easyocr -i https://pypi.tuna.tsinghua.edu.cn/simple/
 
 pip install paddlepaddle -i https://pypi.tuna.tsinghua.edu.cn/simple

 pip install torch torchvision torchaudio -i https://pypi.tuna.tsinghua.edu.cn/simple

For a CPU-only build on Linux, PyTorch's own wheel index can be used instead by passing --index-url https://download.pytorch.org/whl/cpu
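
A short sketch of reading one image with easyocr is shown below. By default `readtext` returns entries of the form (bounding box, text, confidence); the confidence filter, its 0.5 threshold, and the function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# One easyocr result: (bounding box, recognized text, confidence score).
OcrResult = Tuple[Sequence, str, float]


def confident_texts(results: Sequence[OcrResult], min_conf: float = 0.5) -> List[str]:
    """Keep only the text of results whose confidence is at least min_conf."""
    return [text for _bbox, text, conf in results if conf >= min_conf]


def read_image(image_path: str, min_conf: float = 0.5) -> List[str]:
    """Recognize simplified Chinese and English text in one image."""
    import easyocr  # lazy import; models are downloaded on first use

    reader = easyocr.Reader(["ch_sim", "en"], gpu=False)
    return confident_texts(reader.readtext(image_path), min_conf)
```

Creating the `Reader` is slow, so a real script would build it once and reuse it for every page.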
 
 
### OCR recognition rate comparison
 
  1. PaddleOCR: recognition rate above 90%. pdfocr2.py processes 01.png; see 02.txt for the parsed result.
  2. easyocr: pdfocr1.py processes 1.png.
  3. pytesseract: pdfocr.py processes 1.png.
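
The README does not say how the recognition rate was measured. One possible way to compare engines, sketched here as an assumption, is to score each engine's output against a hand-checked reference transcript with a character-level similarity ratio from the standard library.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Remove all whitespace so layout differences do not affect the score."""
    return "".join(text.split())


def recognition_rate(ocr_text: str, reference_text: str) -> float:
    """Similarity ratio in [0, 1] between OCR output and a reference transcript."""
    a, b = normalize(ocr_text), normalize(reference_text)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

This ratio is a similarity score rather than a formal character error rate, so it is best used for relative comparison between engines on the same image.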
  
### PDF résumé parsing flowchart
 
 ![Flowchart](al_resume.png)
  
  
### PandaAI: an AI-based conversational data analysis tool
https://mp.weixin.qq.com/s/eUKFBnIU0YQATrJ0KmxFLQ