# spider-homework

**Repository Path**: kivvvi/spider-homework

## Basic Information

- **Project Name**: spider-homework
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 4
- **Created**: 2023-11-19
- **Last Updated**: 2024-01-10

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Scraping the Academic Affairs System Timetable with Playwright
## Running the Code

> Make sure your Python environment is up to date.

1. Install `pytest-playwright`
```bash
pip install pytest-playwright
playwright install
```

2. Install `python-dotenv`
```bash
pip install python-dotenv
```

3. Create a `.env` file (the script reads `.env` from the working directory)
```
student_id=YOUR_STUDENT_ID
password=YOUR_PASSWORD
```

4. Run `main.py`
```bash
python main.py
```

## Code Walkthrough
1. Initializing the Spyder class
```python
def __init__(self,p):
    env_vars = dotenv_values('.env')
    self.student_id = env_vars['student_id']
    self.password = env_vars['password']
    self.header = {}
    self.browser = p.chromium.launch(headless=False)
    self.context = self.browser.new_context()
    self.page = self.context.new_page()
```
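- When the Spyder class is initialized, it reads the student ID and password from the `.env` file, then launches a Chromium browser (with a visible window, since `headless=False`) and creates a browser context and a page.<br><br>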
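
If either key is missing from `.env`, the constructor shown above fails with a bare `KeyError`. A minimal sketch of an earlier, more readable check follows. `require_credentials` is a hypothetical helper, not part of the repository; it accepts any mapping, such as the one returned by `dotenv_values('.env')`.

```python
def require_credentials(env_vars):
    """Return (student_id, password) or raise a readable error.

    Hypothetical helper: `env_vars` is any mapping, for example the
    result of dotenv_values('.env').
    """
    # A key counts as missing when it is absent, None, or an empty string.
    missing = [key for key in ("student_id", "password") if not env_vars.get(key)]
    if missing:
        raise ValueError("missing or empty in .env: " + ", ".join(missing))
    return env_vars["student_id"], env_vars["password"]
```

Inside `__init__`, the two assignments could then become `self.student_id, self.password = require_credentials(env_vars)`.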

2. The update_header method
```python
def update_header(self, url, headers):
    url_pattern = re.compile(r'findPksjInfoByOne')
    if url_pattern.findall(url):
        self.header = headers
```
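- Updates `self.header` with the given request headers, but only when the request URL contains `findPksjInfoByOne`; all other requests are ignored.<br><br>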
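
The filtering behaviour can be tried without a browser. The sketch below uses a stand-in class, `HeaderRecorder` (a hypothetical name), with the same matching rule; the header values are made up for illustration. It also compiles the pattern once rather than on every request, which is a small variation on the method above.

```python
import re

# Compiled once at import time instead of on every request.
URL_PATTERN = re.compile(r"findPksjInfoByOne")


class HeaderRecorder:
    """Stand-in for the part of Spyder that stores request headers."""

    def __init__(self):
        self.header = {}

    def update_header(self, url, headers):
        # Keep the headers only when the URL contains the target endpoint name.
        if URL_PATTERN.search(url):
            self.header = headers


recorder = HeaderRecorder()
# Does not match the pattern, so it is ignored.
recorder.update_header("http://hdjw.hnu.edu.cn/Njw2017/index.html", {"x": "ignored"})
# Matches the pattern, so its headers are stored.
recorder.update_header(
    "http://hdjw.hnu.edu.cn/resService/jwxtpt/v1/jczy/userIndex/findPksjInfoByOne",
    {"x": "kept"},
)
print(recorder.header)  # {'x': 'kept'}
```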

3. The login method
```python
def login(self):
        url = 'http://cas.hnu.edu.cn/cas/login'

        self.page.goto(url)
        self.page.fill('#username', self.student_id)
        self.page.fill('#password', self.password)
        self.page.keyboard.press('Enter')

        # Wait until the page URL becomes https://pt.hnu.edu.cn/personal-center
        self.page.wait_for_url("https://pt.hnu.edu.cn/personal-center")

        # Open the academic affairs system at http://hdjw.hnu.edu.cn/caslogin
        self.page.goto("http://hdjw.hnu.edu.cn/caslogin")

        # Record the headers of http://hdjw.hnu.edu.cn/resService/jwxtpt/v1/jczy/userIndex/findPksjInfoByOne
        self.page.on("request", lambda request: self.update_header(request.url, request.headers))

        # Wait until the page URL becomes http://hdjw.hnu.edu.cn/Njw2017/index.html#/
        self.page.wait_for_url("http://hdjw.hnu.edu.cn/Njw2017/index.html#/")
        self.page.wait_for_load_state('domcontentloaded')
        self.page.wait_for_timeout(5000)
        # TODO: needs optimization
        # The waits above make sure the page has fully loaded so the header can be captured
        # print(self.header)
```
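- Opens the login page, fills in the student ID and password, and submits the form. After the redirect to the personal center, it opens the academic affairs system and listens for requests so that the headers of the `findPksjInfoByOne` request can be recorded.<br><br>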
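
One possible way to address the TODO about the fixed five-second wait is Playwright's `page.expect_request`, which blocks until a request matching a predicate is seen. The sketch below is an untested assumption about how that could look here: `capture_headers` and its `timeout_ms` default are hypothetical, and it presumes the `findPksjInfoByOne` request fires during or shortly after navigation to the entry URL.

```python
TARGET = "findPksjInfoByOne"


def capture_headers(page, entry_url, timeout_ms=30_000):
    """Navigate to entry_url and return the headers of the first request
    whose URL contains TARGET.

    `page` is expected to be a Playwright sync-API Page. The listener is
    active before navigation starts, so the request cannot be missed.
    """
    with page.expect_request(
        lambda request: TARGET in request.url, timeout=timeout_ms
    ) as request_info:
        page.goto(entry_url)
    # Leaving the `with` block waits for the matching request, up to the timeout.
    return request_info.value.headers
```

Within `login`, this could be called as `self.header = capture_headers(self.page, "http://hdjw.hnu.edu.cn/caslogin")`, in place of the `page.on` listener and the trailing waits.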

4. The get_data method
```python
def get_data(self):
        # Start scraping
        resp = self.page.request.post(
            url="http://hdjw.hnu.edu.cn/resService/jwxtpt/v1/xsd/xsdqxxkb_info/searchBjkbList?resourceCode=XSMH0703&apiCode=jw.xsd.xsdInfo.controller.XsdQxxkbController.searchBjkbList&sf_request_type=ajax",
            data={'jczy013id': '2023-2024-1', 'pkgl002id': 'W134b1130000WH', 'jczy006id': '', 'skdwid': '14', 'sknj': '', 'jczy004id': ''},
            headers=self.header
        )
        DATA = str(resp.body(),'utf-8')
        print(DATA)
        return DATA
```
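- Sends a POST request to the academic affairs system with a fixed set of parameters and the captured headers, then decodes the response body as UTF-8 and returns it as a string.<br><br>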
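
The long request URL is easier to read when the query string is built from its parts. The sketch below reproduces the same URL and payload as `get_data`; `build_request` is a hypothetical helper, and treating `jczy013id` as the academic term is an assumption based only on its value `2023-2024-1`.

```python
from urllib.parse import urlencode

BASE_URL = "http://hdjw.hnu.edu.cn/resService/jwxtpt/v1/xsd/xsdqxxkb_info/searchBjkbList"


def build_request(term="2023-2024-1"):
    """Return (url, payload) equivalent to the literals used in get_data."""
    query = {
        "resourceCode": "XSMH0703",
        "apiCode": "jw.xsd.xsdInfo.controller.XsdQxxkbController.searchBjkbList",
        "sf_request_type": "ajax",
    }
    payload = {
        "jczy013id": term,  # assumed to identify the academic term
        "pkgl002id": "W134b1130000WH",
        "jczy006id": "",
        "skdwid": "14",
        "sknj": "",
        "jczy004id": "",
    }
    return f"{BASE_URL}?{urlencode(query)}", payload


url, payload = build_request()
print(url)
```

The returned pair could be passed straight through as `self.page.request.post(url=url, data=payload, headers=self.header)`.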