# wespider

**Repository Path**: null_841/wespider

## Basic Information

- **Project Name**: wespider
- **Description**: WeChat crawler, Weibo crawler, and news crawler
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2017-10-10
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

#### Environment setup

##### Stack

- System: Ubuntu 16.04
- Databases: MySQL, Redis, MongoDB
- Middleware: Kafka
- Language: Python 3, pip3

##### Installing the basic packages

```
sudo apt-get install libmysqlclient-dev
sudo apt-get install python3-dev
sudo apt-get install libxml2-dev libxslt-dev
```

##### Installing chromedriver

```
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.27/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
```

> References:
> https://christopher.su/2015/selenium-chromedriver-ubuntu/
> https://sites.google.com/a/chromium.org/chromedriver/getting-started

##### Installing the Python dependencies

```
pip install -r requirements.txt
```

#### Log in to MySQL and create the database

```bash
CREATE DATABASE `python_spider` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */;
```

#### Initialize the database admin user (first run only)

Add an admin user:

```bash
python manage.py makemigrations
python manage.py migrate
python manage.py createsuperuser
```

Sample output:

```bash
lism@lism-PC:~/projects/wespider$ python manage.py makemigrations
Migrations for 'WeChatModel':
  WeChatModel/migrations/0001_initial.py
    - Create model Keyword
    - Create model LoginUser
    - Create model WeChatData
    - Create model WeChatUser
Migrations for 'WeiboModel':
  WeiboModel/migrations/0001_initial.py
    - Create model Keywords
    - Create model LoginInFo
    - Create model Save2kafkalog
    - Create model Seeds
    - Create model WbUser
    - Create model WeiboData
Migrations for 'NewsModel':
  NewsModel/migrations/0001_initial.py
    - Create model News
    - Create model Site
lism@lism-PC:~/projects/wespider$ python manage.py migrate
Operations to perform:
  Apply all migrations: NewsModel, WeChatModel, WeiboModel, admin, auth, contenttypes, sessions
Running migrations:
  Applying NewsModel.0001_initial... OK
  Applying WeChatModel.0001_initial... OK
  Applying WeiboModel.0001_initial... OK
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying admin.0002_logentry_remove_auto_add... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying sessions.0001_initial... OK
lism@lism-PC:~/projects/wespider$ python manage.py createsuperuser
Username (leave blank to use 'lism'): topcom
Email address: 545314690@qq.com
Password:
Password (again):
Superuser created successfully.
```

## Database scripts

#### Add a model

```bash
django-admin.py startapp WeChatModel
```

#### Make changes

```bash
python manage.py makemigrations WeChatModel
```

#### Commit the changes to the database

```bash
python manage.py migrate WeChatModel
```

#### Start Django

```
bin/startup.sh
```

After running the script, visit http://0.0.0.0:8000/admin and log in with the user you just created.

#### Create some WeChat official-account login accounts, then pick one to log in with

![login](docs/images/login.png "login")

## Startup scripts

#### Start the Flower monitor

```bash
bin/flower-start.sh
```

Visit Flower at http://0.0.0.0:5555/ to see the running workers.

#### Start the login worker

```bash
celery multi start wechat-login@118 -A spider.task.workers -Q login_queue -l info -c 1 -Ofair -n
```

Check Flower again: a wechat_login_queue worker is now being monitored.

![flower](docs/images/flower.png "flower")

#### Simulated login

The worker now takes a login task off the queue and executes it: it drives Chrome through chromedriver, fills in the username and password automatically, and waits for the official-account administrator to scan the QR code with their phone.

Once the scan succeeds, the login is complete and the cookies are saved to Redis; before each API request, the cookies are fetched and added to the request headers.

#### Add keywords in Django and click "crawl"

![keyword](docs/images/keyword.png "keyword")

#### Start the workers that crawl keywords, users, and articles

```bash
celery multi start wechat-user@118 -A spider.task.workers -Q search_keyword_queue,user_list_crawl_queue,wechat_user_crawl_queue,wechat_url_crawl_queue -B -l info -c 1 -Ofair -n
```

![flower2](docs/images/flower2.png "flower2")

You can see that some of the tasks have already completed successfully.

#### Check Django: part of the official-account data has been crawled

![account](docs/images/account.png "account")
#### Next steps

You can then click in the UI to crawl the articles of a given official account, or run commands to crawl in batches.

### Other commands

#### Login + send email

```bash
celery multi start wechat-login_email@118 -A spider.task.workers -Q login_queue,send_remaind_email_queue -l info -c 1 -Ofair -n
```

##### WeChat keywords + WeChat users; the user crawler adds the -B flag to refresh every 24 hours

```bash
celery multi start wechat-user@118 -A spider.task.workers -Q search_keyword_queue,user_list_crawl_queue,wechat_user_crawl_queue,wechat_url_crawl_queue -B -l info -c 1 -Ofair -n
```

#### WeChat articles

```bash
celery multi start wechat-article@118 -A spider.task.workers -Q wechat_crawl_queue -l info -c 1 -Ofair -n
```

#### Kill all celery workers

```bash
ps -ef | grep celery | grep -v grep | awk '{print $2}' | xargs kill -9
```
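The `ps | grep | awk | xargs` pipeline above finds every celery process ID (column 2 of `ps -ef`), excludes the `grep` process itself, and force-kills the rest. A minimal Python sketch of the same filtering logic; the sample `ps` output lines are made up for illustration:

```python
# Fabricated sample of `ps -ef` output for illustration only.
SAMPLE_PS_LINES = [
    "lism  4021     1  0 10:02 ?      00:00:03 [celeryd: wechat-login@118:MainProcess]",
    "lism  4022  4021  0 10:02 ?      00:00:01 [celeryd: wechat-login@118:PoolWorker-1]",
    "lism  4100  3999  0 10:05 pts/0  00:00:00 grep --color=auto celery",
]

def celery_pids(ps_lines):
    """Return PIDs (column 2 of `ps -ef`) of celery processes,
    skipping the grep process itself, like `grep -v grep` does."""
    return [int(line.split()[1])
            for line in ps_lines
            if "celery" in line and "grep" not in line]

print(celery_pids(SAMPLE_PS_LINES))  # [4021, 4022]
```

Each returned PID is what the shell pipeline would pass to `kill -9`.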
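As a footnote to the simulated-login step described earlier: the README says cookies are saved to Redis on a successful QR-code scan and attached to the request headers before each API call. A minimal sketch of that round-trip, with a plain dict standing in for the Redis connection; the key pattern `wechat:cookies:<user>` and the cookie field names are illustrative assumptions, not the project's actual schema:

```python
import json

fake_redis = {}  # stands in for a real Redis connection (illustrative only)

def save_cookies(user, cookies):
    # After a successful QR-code scan, persist the session cookies.
    fake_redis[f"wechat:cookies:{user}"] = json.dumps(cookies)

def cookie_header(user):
    # Before each API request, load the cookies and build a Cookie header value.
    cookies = json.loads(fake_redis[f"wechat:cookies:{user}"])
    return "; ".join(f"{k}={v}" for k, v in cookies.items())

save_cookies("topcom", {"slave_user": "gh_demo", "slave_sid": "abc123"})
print(cookie_header("topcom"))  # slave_user=gh_demo; slave_sid=abc123
```

The resulting string would go into the `Cookie` header of each outgoing request.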