# wespider
**Repository Path**: null_841/wespider
## Basic Information
- **Project Name**: wespider
- **Description**: Includes a WeChat spider, a Weibo spider, and a news spider
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 2
- **Forks**: 0
- **Created**: 2017-10-10
- **Last Updated**: 2020-12-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
#### Environment Setup
##### Stack
- OS: Ubuntu 16.04
- Databases: MySQL, Redis, MongoDB
- Middleware: Kafka
- Language: Python 3 (with pip3)
##### Installing base packages
```bash
sudo apt-get install libmysqlclient-dev
sudo apt-get install python3-dev
sudo apt-get install libxml2-dev libxslt-dev
```
##### Installing chromedriver
```bash
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.27/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
```
> References:
> - https://christopher.su/2015/selenium-chromedriver-ubuntu/
> - https://sites.google.com/a/chromium.org/chromedriver/getting-started
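With chromedriver symlinked onto the PATH, a quick stdlib-only sanity check from Python can confirm it is resolvable before wiring it into Selenium (the `--version` flag is standard for the chromedriver binary):

```python
import shutil
import subprocess

def find_chromedriver(name="chromedriver"):
    """Return the resolved path of the driver binary on PATH, or None."""
    return shutil.which(name)

if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver not found on PATH")
    else:
        # e.g. "ChromeDriver 2.27.440175 ..."
        out = subprocess.run([path, "--version"],
                             capture_output=True, text=True)
        print(out.stdout.strip())
```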
##### Installing Python dependencies
```bash
pip3 install -r requirements.txt
```
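To verify the install, a small stdlib-only helper can report which of the stack's imports are still missing; the module names below are assumptions based on the stack listed above, not the actual contents of `requirements.txt`:

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot currently be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Assumed top-level modules for this stack; adjust to requirements.txt.
STACK = ["django", "celery", "redis", "pymongo", "MySQLdb", "selenium"]

if __name__ == "__main__":
    print("missing:", missing_modules(STACK) or "none")
```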
#### Create the database after logging in to MySQL
```sql
CREATE DATABASE `python_spider` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */;
```
#### Initialize the database admin user (first run only)
```bash
python manage.py makemigrations
python manage.py migrate
python manage.py createsuperuser
```
Example output:
```bash
lism@lism-PC:~/projects/wespider$ python manage.py makemigrations
Migrations for 'WeChatModel':
WeChatModel/migrations/0001_initial.py
- Create model Keyword
- Create model LoginUser
- Create model WeChatData
- Create model WeChatUser
Migrations for 'WeiboModel':
WeiboModel/migrations/0001_initial.py
- Create model Keywords
- Create model LoginInFo
- Create model Save2kafkalog
- Create model Seeds
- Create model WbUser
- Create model WeiboData
Migrations for 'NewsModel':
NewsModel/migrations/0001_initial.py
- Create model News
- Create model Site
lism@lism-PC:~/projects/wespider$ python manage.py migrate
Operations to perform:
Apply all migrations: NewsModel, WeChatModel, WeiboModel, admin, auth, contenttypes, sessions
Running migrations:
Applying NewsModel.0001_initial... OK
Applying WeChatModel.0001_initial... OK
Applying WeiboModel.0001_initial... OK
Applying contenttypes.0001_initial... OK
Applying auth.0001_initial... OK
Applying admin.0001_initial... OK
Applying admin.0002_logentry_remove_auto_add... OK
Applying contenttypes.0002_remove_content_type_name... OK
Applying auth.0002_alter_permission_name_max_length... OK
Applying auth.0003_alter_user_email_max_length... OK
Applying auth.0004_alter_user_username_opts... OK
Applying auth.0005_alter_user_last_login_null... OK
Applying auth.0006_require_contenttypes_0002... OK
Applying auth.0007_alter_validators_add_error_messages... OK
Applying auth.0008_alter_user_username_max_length... OK
Applying sessions.0001_initial... OK
lism@lism-PC:~/projects/wespider$ python manage.py createsuperuser
Username (leave blank to use 'lism'): topcom
Email address: 545314690@qq.com
Password:
Password (again):
Superuser created successfully.
```
## Database scripts
#### Add a model
```bash
django-admin.py startapp WeChatModel
```
#### Make changes
```bash
python manage.py makemigrations WeChatModel
```
#### Commit changes to the database
```bash
python manage.py migrate WeChatModel
```
#### Start Django
```bash
bin/startup.sh
```
After the script starts, visit http://0.0.0.0:8000/admin and log in with the superuser created above.
#### Create some WeChat Official Account login accounts, then select one to log in with

## Startup scripts
#### Start Flower monitoring
```bash
bin/flower-start.sh
```
Visit Flower at http://0.0.0.0:5555/ to see the started workers.
#### Start the login worker
```bash
celery multi start wechat-login@118 -A spider.task.workers -Q login_queue -l info -c 1 -Ofair -n
```
Check Flower again: a wechat_login_queue worker now appears under monitoring.

#### Simulated login
The worker pulls the login task off the queue and executes it: it drives Chrome through chromedriver and automatically fills in the username and password; the Official Account administrator then has to scan the QR code with their phone.
Once the scan succeeds, login is complete and the cookie data is stored in Redis; before each API request, the cookies are read back and added to the request headers.
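The cookie round-trip described above can be sketched with a few small helpers. A plain dict stands in for the Redis client here, and the function names and key format are illustrative, not the project's actual API:

```python
import json

def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)

def save_cookies(store, key, cookies):
    """Persist cookies as JSON; in practice `store` would be a Redis client."""
    store[key] = json.dumps(cookies)

def load_cookie_header(store, key):
    """Rebuild the Cookie header from stored JSON, or None if nothing saved."""
    raw = store.get(key)
    return cookies_to_header(json.loads(raw)) if raw else None

# Example flow after a successful QR scan (driver.get_cookies() in Selenium):
store = {}  # stands in for the Redis connection
save_cookies(store, "wechat:cookies:topcom", [
    {"name": "slave_sid", "value": "abc123"},
    {"name": "token", "value": "42"},
])
headers = {"Cookie": load_cookie_header(store, "wechat:cookies:topcom")}
```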
#### Add keywords in the Django admin and click crawl

#### Start the workers that crawl keywords, users, and articles
```bash
celery multi start wechat-user@118 -A spider.task.workers -Q search_keyword_queue,user_list_crawl_queue,wechat_user_crawl_queue,wechat_url_crawl_queue -B -l info -c 1 -Ofair -n
```
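The `-Q` flag binds this worker to four queues at once. Below is a hedged sketch of the corresponding Celery route map: only the queue names come from the command above, while the task module paths are invented for illustration:

```python
# Hypothetical task-to-queue routing for the worker above.
# Queue names match the -Q list; the task module paths are illustrative.
CELERY_TASK_ROUTES = {
    "spider.task.search_keyword":    {"queue": "search_keyword_queue"},
    "spider.task.crawl_user_list":   {"queue": "user_list_crawl_queue"},
    "spider.task.crawl_wechat_user": {"queue": "wechat_user_crawl_queue"},
    "spider.task.crawl_wechat_url":  {"queue": "wechat_url_crawl_queue"},
}
```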

Some of the tasks have already completed successfully.
#### In the Django admin, part of the WeChat Official Account data has now been crawled

#### Next, crawl the articles under a given Official Account from the UI, or run a command to batch-crawl
### Other commands
#### Login + email sending
```bash
celery multi start wechat-login_email@118 -A spider.task.workers -Q login_queue,send_remaind_email_queue -l info -c 1 -Ofair -n
```
##### WeChat keywords + WeChat users; the `-B` flag makes the user crawler refresh every 24 hours
```bash
celery multi start wechat-user@118 -A spider.task.workers -Q search_keyword_queue,user_list_crawl_queue,wechat_user_crawl_queue,wechat_url_crawl_queue -B -l info -c 1 -Ofair -n
```
#### WeChat articles
```bash
celery multi start wechat-article@118 -A spider.task.workers -Q wechat_crawl_queue -l info -c 1 -Ofair -n
```
#### Kill all Celery workers
```bash
ps -ef|grep celery |grep -v grep |awk '{print $2}'|xargs kill -9
```