# word_crawler

**Repository Path**: Martinkeep/word_crawler

## Basic Information

- **Project Name**: word_crawler
- **Description**: Crawls pronunciation and definition data for the words used in the Word Android project
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-29
- **Last Updated**: 2025-09-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

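# Word Crawler Project

A Python tool that crawls word information from online dictionaries such as iCIBA (爱词霸) and stores it in a database. It can crawl words in bulk and insert the results into a MySQL database.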

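## Features

- Crawls phonetic transcriptions, pronunciation, definitions, and related information for words from iCIBA
- Supports bulk import from a word list
- Inserts the data into a MySQL database
- Updates the word count of each category
- Provides logging and error handling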

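## Dependencies

Third-party packages (install with `pip install requests mysql-connector-python`):

- `requests`
- `mysql-connector-python`

Standard-library modules (no installation required):

- `logging`
- `urllib.parse`
- `json`
- `time`
- `random`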

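## Database Configuration

Configure the database connection parameters in the `main()` function: host address, port, username, password, and database name. Replace the values below with those of your own MySQL server.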

```python
db_config = {
    'host': '192.168.0.13',
    'port': 3306,
    'user': 'cfx2000',
    'password': 'cfx2000',
    'database': 'english',
    'charset': 'utf8mb4',
    'auth_plugin': 'mysql_native_password',
    'autocommit': True,
    'connect_timeout': 30,
    'use_unicode': True
}
```
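Hard-coding credentials in `main()` makes them easy to leak into version control. One possible alternative, shown here only as a sketch, is to build the same dictionary from environment variables. The `load_db_config` helper and the `WORD_DB_*` variable names are hypothetical and not part of the project.

```python
import os


def load_db_config(env=None):
    """Build a mysql-connector-python style config dict from environment variables.

    The WORD_DB_* names are illustrative assumptions, not project conventions.
    """
    env = os.environ if env is None else env
    return {
        'host': env.get('WORD_DB_HOST', '127.0.0.1'),
        'port': int(env.get('WORD_DB_PORT', '3306')),
        'user': env['WORD_DB_USER'],          # required: raises KeyError if unset
        'password': env['WORD_DB_PASSWORD'],  # required: raises KeyError if unset
        'database': env.get('WORD_DB_NAME', 'english'),
        'charset': 'utf8mb4',
        'connect_timeout': 30,
    }
```

The resulting dictionary could then be passed to `WordCrawler(...)` in place of the literal `db_config`.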

## Usage

### Bulk Import

Use the `batch_crawl_from_file` method to crawl every word listed in a file and insert the results into the database:

```python
crawler = WordCrawler(db_config)
success_count, failed_words = crawler.batch_crawl_from_file(
    file_path='cet4_core.txt',
    class_id='CET4_CORE',
    class_title='四级核心词汇',  # "CET-4 core vocabulary"
    course='1'
)
print(f"Batch import finished: {success_count} succeeded, {len(failed_words)} failed")
```
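The README does not describe the layout of the word list file. Assuming a plain UTF-8 text file with one word per line, a reader along the following lines would produce the list to crawl. `read_word_list` is a hypothetical helper, and skipping blank lines, `#` comments, and case-insensitive duplicates are assumptions rather than documented behavior.

```python
from typing import List


def read_word_list(file_path: str) -> List[str]:
    """Read words from a text file, assuming one word per line (hypothetical format)."""
    seen = set()
    words = []
    with open(file_path, encoding='utf-8') as fh:
        for line in fh:
            word = line.strip()
            # Skip blank lines and comment lines.
            if not word or word.startswith('#'):
                continue
            # Drop case-insensitive duplicates, keeping the first spelling seen.
            key = word.lower()
            if key not in seen:
                seen.add(key)
                words.append(word)
    return words
```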

### Insert a Single Word

Use the `insert_single_word` method to crawl one word and insert it into the database:

```python
crawler.insert_single_word(word='example', class_id='CET4_CORE', class_title='四级核心词汇', course='1')
```

### Test a Single Word

Use the `test_single_word` method to test crawling for one word:

```python
crawler.test_single_word(word='example')
```

## Classes and Methods

### `WordCrawler` Class

- `__init__(self, db_config: Dict)`: initializes the database configuration and the HTTP request session
- `get_connection(self)`: returns a database connection
- `crawl_from_iciba(self, word: str) -> Optional[Dict]`: crawls information for a word from iCIBA
- `_parse_meanings_fixed(self, parts: list) -> str`: parses the word's definitions
- `_extract_word_type_fixed(self, parts: list) -> str`: extracts the word's primary part of speech
- `crawl_word_info(self, word: str) -> Optional[Dict]`: crawls information for a word by combining multiple sources
- `batch_crawl_from_file(self, file_path: str, class_id: str, class_title: str, course: str)`: crawls words in bulk from a file
- `insert_single_word(self, word: str, class_id: str, class_title: str, course: str)`: inserts a single word into the database
- `update_category_word_count(self, class_id: str)`: updates the word count of a category
- `test_single_word(self, word: str)`: tests crawling for a single word
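The two parsing helpers above take a `parts` list and return a string. As a rough illustration of what such helpers might do, the sketch below assumes each entry in `parts` is a dictionary with a part-of-speech field `part` and a list of definitions `means`. That structure, the field names, and the function names `join_meanings` and `first_word_type` are assumptions made for this example, not the project's actual implementation.

```python
from typing import Dict, List


def join_meanings(parts: List[Dict]) -> str:
    """Format definitions as one line per part of speech, e.g. 'n. a; b'."""
    lines = []
    for entry in parts:
        pos = str(entry.get('part', '')).strip()
        means = [str(m).strip() for m in entry.get('means', []) if str(m).strip()]
        if not means:
            continue  # skip entries that carry no definitions
        lines.append(f"{pos} {'; '.join(means)}".strip())
    return '\n'.join(lines)


def first_word_type(parts: List[Dict]) -> str:
    """Return the first non-empty part of speech, or '' if there is none."""
    for entry in parts:
        pos = str(entry.get('part', '')).strip()
        if pos:
            return pos
    return ''
```

Treating the first listed part of speech as the "primary" one is a simplification; the real method may rank them differently.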

## Logging

The project uses the `logging` module to record detailed operation logs for debugging and monitoring.
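The README does not show how logging is configured. A minimal setup that writes to both the console and a file could look like the sketch below. The logger name `word_crawler`, the default file name, and the message format are assumptions chosen for illustration.

```python
import logging


def setup_logging(log_file: str = 'word_crawler.log',
                  level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes to the console and to log_file."""
    logger = logging.getLogger('word_crawler')
    logger.setLevel(level)
    # Attach handlers only once so repeated calls do not duplicate output.
    if not logger.handlers:
        fmt = logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s')
        handlers = (
            logging.StreamHandler(),
            logging.FileHandler(log_file, encoding='utf-8'),
        )
        for handler in handlers:
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger
```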

## Error Handling

When a crawl or a database operation fails, the project rolls back the transaction and logs the error.
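The commit-or-rollback pattern can be sketched as a small context manager over any DB-API 2.0 connection. The example uses the standard-library `sqlite3` module as a stand-in so that it runs without a MySQL server; the `transaction` helper is hypothetical and not taken from the project. Note that with `'autocommit': True`, as in the sample configuration, MySQL commits each statement immediately, so a rollback cannot undo statements that have already run; a pattern like this presumably requires autocommit to be disabled.

```python
import logging
import sqlite3
from contextlib import contextmanager

logger = logging.getLogger('word_crawler')


@contextmanager
def transaction(conn):
    """Commit on success; roll back, log, and re-raise on any exception."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        logger.exception('database operation failed; transaction rolled back')
        raise


# Demonstration with an in-memory SQLite database (stand-in for MySQL).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (word TEXT PRIMARY KEY)')
conn.commit()

with transaction(conn):
    conn.execute("INSERT INTO words VALUES ('example')")

try:
    with transaction(conn):
        conn.execute("INSERT INTO words VALUES ('sample')")
        conn.execute("INSERT INTO words VALUES ('example')")  # duplicate key fails
except sqlite3.IntegrityError:
    pass  # already logged and rolled back by transaction()

rows = [r[0] for r in conn.execute('SELECT word FROM words ORDER BY word')]
print(rows)  # ['example']
```

Because the second block fails on the duplicate key, the earlier insert of `'sample'` in the same transaction is discarded as well.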

## License

This project is released under the MIT License.