# ECommerceCrawlers

**Repository Path**: xwcoding/ECommerceCrawlers

## Basic Information

- **Project Name**: ECommerceCrawlers
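- **Description**: Hands-on crawlers for a variety of websites and e-commerce data. Includes: Taobao products, WeChat official accounts, Dianping, job-listing sites, Xianyu, Ali tasks, cnblogs (scrapy), Weibo, Baidu Tieba, Douban Movies, Baotu, Quanjing, Douban Music, a provincial drug administration, Sohu News, machine-learning text collection, fofa asset collection, Autohome, National Bureau of Statistics, Baidu keyword index counts, spider pan-directory, Toutiao, Douban movie reviews. WeChat crawler demo project: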
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: http://wechat.doonsec.com/
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2032
- **Created**: 2021-08-03
- **Last Updated**: 2021-08-03

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[![](https://img.shields.io/badge/language-Python35-green.svg)]() [![](https://img.shields.io/badge/Branch-master-green.svg?longCache=true)]() [![](https://img.shields.io/github/followers/DropsDevopsOrg.svg?label=Follow)]() ![GitHub contributors](https://img.shields.io/github/contributors/DropsDevopsOrg/ECommerceCrawlers.svg) [![](https://img.shields.io/github/forks/DropsDevopsOrg/ECommerceCrawlers.svg?label=Fork&style=social)]() [![](https://img.shields.io/github/stars/DropsDevopsOrg/ECommerceCrawlers.svg?style=social)]() [![](https://img.shields.io/github/watchers/DropsDevopsOrg/ECommerceCrawlers.svg?label=Watch&style=social)]()

## ECommerceCrawlers

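Crawlers 🐍 for product data from a variety of e-commerce sites, collected and organized as crawler practice. Every project was written by a team member. The hands-on projects are meant to work through the problems commonly encountered when writing crawlers.

Each project's readme walks through the analysis behind the crawl.

For pyers already proficient in crawling, this is a good set of examples that cuts down on re-collecting wheels. The projects are updated and maintained frequently so that they work as soon as they are downloaded, reducing the time spent on crawling.

For beginners, the ✍️ hands-on projects show how a crawler is built from scratch. To build up crawler knowledge, see the [project wiki](https://github.com/DropsDevopsOrg/ECommerceCrawlers/wiki/%E7%88%AC%E8%99%AB%E5%88%B0%E5%BA%95%E8%BF%9D%E6%B3%95%E5%90%97%3F). Crawling can look like a very complex task with a high technical barrier, but with the right approach it is quite achievable to crawl data from mainstream websites within a short time. It is best, though, to have a concrete goal from the very start.

Driven by a goal, your learning becomes more focused and efficient. All the prerequisite knowledge you think you need can be learned along the way to that goal 😁😁😁.

For more advanced crawling techniques, we recommend Master Wang Ping's [Yuanrenxue advanced crawler reverse-engineering course](https://j.youzan.com/zF-n-2); mention the referrer AJay13 to get the internal discount price.

Corrections to this project's shortcomings are welcome via ⭕️Issues or 🔔PR.

> Large files uploaded earlier ran through 3/4 of the commits, and we found that each clone reached 100M, which goes against our original intent. We cannot efficiently delete every such file (too lazy), so we will reinitialize the repository's commits. From now on we will not upload crawled data, and we will optimize the repository structure.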

## About

- Gitee repository: [AJay13/ECommerceCrawlers](https://gitee.com/AJay13/ECommerceCrawlers)
- GitHub repository: [DropsDevopsOrg/ECommerceCrawlers](https://github.com/DropsDevopsOrg/ECommerceCrawlers)
- Project demo platform: [http://wechat.doonsec.com](http://wechat.doonsec.com)

## Income

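Almost 80% of the projects are crawlers written for clients; before being added to the repository, each was approved by the client for open-sourcing.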

<details>
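<summary>Income table</summary>

| Project                           | Income |                                   Notes                                    |
| :-------------------------------- | -----: | :------------------------------------------------------------------------: |
| DianpingCrawler                   |    200 |                                                                            |
| TaobaoCrawler                     |   2000 |                                                                            |
| SohuNewCrawler                    |   2500 |                                                                            |
| WechatCrawler                     |   6000 |                                                                            |
| A provincial drug administration  |     80 |                                                                            |
| fofa                              |    700 |                                                                            |
| baidu                             |   1000 |                                                                            |
| Spider pan-directory              |   1000 |                                                                            |
| More…                             |      … | Some other programs were not approved by the client for open-sourcing     |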

</details>

## CrawlerDemo

- [x] [DianpingCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/DianpingCrawler): Dianping crawling
- [x] [East_money](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/East_money): crawling Eastmoney with scrapy
- [x] [📛TaobaoCrawler(new)](<https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/TaobaoCrawler(new)>): information crawling for Alibaba's own platforms (Taobao, Tmall, Xianyu, Cainiao Guoguo, Fliggy, etc.), cookie-free and in theory not caught by anti-crawler mechanisms (only Taobao is provided; the approach and the encryption scheme are the same for the others)
- [x] [📛SIPO Patent Examination](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/SIPO专利审查): automated client for SIPO patent examination
- [x] [📛QiChaCha](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/QiChaCha): QiChaCha nationwide industrial park and company information
- [x] [TaobaoCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/TaobaoCrawler): Taobao product crawling
- [x] [📛ZhaopinCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/ZhaopinCrawler): crawling of major job-listing sites
- [x] [ShicimingjuCrawleAndDisplay](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/ShicimingjuCrawleAndDisplay): crawling and display of the Shicimingju poetry website
- [x] [XianyuCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/XianyuCrawler): Xianyu product crawling
- [x] [SohuNewCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/SohuNewCrawler): news site crawling
- [x] [WechatCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/WechatCrawler): WeChat official account crawling
- [x] [cnblog](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/cnblog): crawling cnblogs with scrapy
- [x] [WeiboCrawler](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/WeiboCrawler): Weibo data crawling, cookie-free
- [x] [OtherCrawlers](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler): some interesting crawler examples
  - [x] [0x01 Baidu Tieba](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x01baidutieba)
  - [x] [0x02 Douban Movies](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x02doubanmovie)
  - [x] [0x03 Ali tasks](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x03alitask)
  - [x] [0x04 Baotu videos](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x04baotu)
  - [x] [0x05 Quanjing images](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x05quanjing)
  - [x] [0x06 Douban Music](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x06douban_music)
  - [x] [0x07 A provincial drug administration](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x07gdfda_pharmacy)
  - [x] [0x08 fofa](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x08fofa)
  - [ ] [0x09 Autohome](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler#0x09autohome)
  - [ ] [0x010 National Bureau of Statistics]()
  - [x] [0x10 baidu](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x10baidu)
  - [x] [0x11 Spider pan-directory](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x11zzc)
  - [x] [0x12 Toutiao](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x12toutiao)
  - [x] [0x13 Douban movie review analysis](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x13douban_yingping)
  - [x] [0x14 Ctrip review crawling](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x14ctrip_crawler)
  - [x] [0x15 Xiaomi app store crawling](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x15xiaomiappshop)
  - [x] [0x16 Coolapk app information collection](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x16kuanappshop)
  - [ ] [0x17 Zhihu information collection](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x17zhihu)
  - [x] [0x18 Bing image collection](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x18bing_img)
  - [x] [0x19 Anjuke information collection](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x19anjuke)
  - [x] [0x20 Tujia homestay information collection](https://github.com/DropsDevopsOrg/ECommerceCrawlers/tree/master/OthertCrawler/0x20tujiaminsu)

## Contribution👏

| <a  href="https://gitee.com/joseph31"><img class="avatar" src="https://avatars3.githubusercontent.com/u/47005658?s=460&v=4" width="48" height="48" alt="@joseph31"></a> | <a  href="https://github.com/Joynice"><img class="avatar" src="https://avatars0.githubusercontent.com/u/22851022?s=96&amp;v=4" width="48" height="48" alt="@Joynice"></a> | <a href="https://github.com/liangweiyang"><img class="avatar" src="https://avatars0.githubusercontent.com/u/37971213?s=96&amp;v=4" width="48" height="48" alt="@liangweiyang"></a> | <a href="https://github.com/Hatcat123"><img class="avatar" src="https://avatars0.githubusercontent.com/u/28727970?s=96&amp;v=4" width="48" height="48" alt="@Hatcat123"></a> | <a href="https://github.com/jihu9"><img class="avatar" src="https://avatars0.githubusercontent.com/u/17663102?s=96&amp;v=4" width="48" height="48" alt="@jihu9"></a> | <a href="https://github.com/ctycode"><img class="avatar" src="https://avatars3.githubusercontent.com/u/56985178?s=96&amp;v=4" width="48" height="48" alt="@ctycode"></a> |<a href="https://github.com/sparkyuyuanyuan"><img class="avatar" src="https://avatars3.githubusercontent.com/u/50583631?s=96&amp;v=4" width="48" height="48" alt="@sparkyuyuanyuan"></a> |
| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------: |:------------------------------:|
|    [joseph31](https://gitee.com/joseph31)                                                                  |        [Joynice](https://github.com/Joynice)  |    [liangweiyang](https://github.com/liangweiyang)    |         [Hatcat123](https://github.com/Hatcat123)                                                                   |                                                                  [jihu9](https://github.com/jihu9)                                                                   |                                                                  [ctycode](https://github.com/ctycode)                                                                   |                                                                  [sparkyuyuanyuan](https://github.com/sparkyuyuanyuan)                                                                   |


> wait for you

## What Will You Learn?

Useful technologies used in this project:

- Data analysis
  - [x] chrome Devtools
  - [x] Fiddler
  - [x] Firefox
  - [ ] appium
  - [x] anyproxy
  - [x] mitmproxy
- Data collection
  - [x] [urllib]()
  - [x] [requests](https://2.python-requests.org//zh_CN/latest/user/quickstart.html)
  - [x] scrapy
  - [x] selenium
  - [ ] pyppeteer
- Data parsing
  - [x] re
  - [x] beautifulsoup
  - [x] xpath
  - [x] pyquery
  - [x] css
- Data storage
  - [x] txt text files
  - [x] csv
  - [x] excel
  - [x] mysql
  - [x] redis
  - [x] mongodb
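- Anti-crawling checks
  - [x] bypassing Taobao detection with mitmproxy
  - [x] JS data decryption
  - [x] building a matching fingerprint library from JS data
  - [x] text obfuscation
  - [ ] interleaved dirty data
- Efficient crawling
  - [x] single thread
  - [x] multithreading
  - [x] multiprocessing
  - [x] asynchronous coroutines
  - [x] producer-consumer multithreading
  - [x] distributed crawler system

> _Links point to official documentation or recommended examples_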

## What's Spider 🕷?

**[ECommerceCrawlerswiki](https://github.com/DropsDevopsOrg/ECommerceCrawlers/wiki)**

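### 🙋0x01 Introduction to Crawlers

**Crawler**

A crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

**[Is crawling actually illegal?](https://github.com/DropsDevopsOrg/ECommerceCrawlers/wiki/%E7%88%AC%E8%99%AB%E5%88%B0%E5%BA%95%E8%BF%9D%E6%B3%95%E5%90%97%3F)**

**What crawlers are used for**

- Market analysis: e-commerce analysis, business district analysis, primary and secondary market analysis, etc.
- Market monitoring: monitoring e-commerce, news, housing listings, etc.
- Opportunity discovery: tender and bidding intelligence, customer data mining, corporate customer discovery, etc.

**Web page basics**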

- url
- html
- css
- js

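**Robots protocol**

Nothing works without rules, and the Robots protocol is the rulebook for crawlers: it tells crawlers and search engines which pages may be fetched and which may not.
It is usually a text file named robots.txt placed in the root directory of the website.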
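
As a minimal sketch of how a crawler might honor these rules, the standard library's `urllib.robotparser` can evaluate a robots.txt file before a page is requested. The robots.txt content, the `my-crawler` user agent, and the `example.com` URLs below are made up for illustration and do not come from any project in this repository.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the root of the target site instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network I/O."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Check each URL against the rules before deciding whether to fetch it.
allowed = parser.can_fetch("my-crawler", "https://example.com/products/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```

For a live site, `RobotFileParser` also offers `set_url()` and `read()` to download the file itself.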

### 🙋0x02 Crawling Process

**Fetching data**

**Fetching data by simulation**

### 🙋0x03 Parsing Data

**re**
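
A minimal sketch of regex-based parsing, assuming a small, regular HTML fragment. The markup, the field names, and the `parse_items` helper are hypothetical and only illustrate the idea; for irregular or deeply nested pages, the parsers listed below are usually a safer choice than regular expressions.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
HTML = """
<ul>
  <li><a href="/item/1">Red mug</a> <span class="price">12.50</span></li>
  <li><a href="/item/2">Blue mug</a> <span class="price">9.90</span></li>
</ul>
"""

# Named groups keep the extraction readable: link target, title, and price.
ITEM_RE = re.compile(
    r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span class="price">(?P<price>[\d.]+)</span>'
)


def parse_items(html):
    """Return one dict per item found in the HTML text."""
    return [
        {
            "url": m.group("url"),
            "title": m.group("title").strip(),
            "price": float(m.group("price")),
        }
        for m in ITEM_RE.finditer(html)
    ]


items = parse_items(HTML)
print(items)
```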

**beautifulsoup**

**xpath**

**pyquery**

**css**

### 🙋0x04 Storing Data

Small-scale data storage (text files)

- txt text files
- csv
- excel
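
For the small-scale case, a minimal sketch of saving crawled records as CSV with the standard library. The `write_rows` helper and the sample records are assumptions made for illustration; an in-memory buffer stands in for a real file so the example runs without touching disk.

```python
import csv
import io


def write_rows(rows, fileobj, fieldnames):
    """Write a list of dicts to an open text file object as CSV with a header."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical crawled records.
rows = [
    {"title": "Red mug", "price": 12.5},
    {"title": "Blue mug", "price": 9.9},
]

buffer = io.StringIO()
write_rows(rows, buffer, ["title", "price"])
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with `open(path, "w", newline="", encoding="utf-8")` so the `csv` module controls line endings itself.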

Large-scale data storage (databases)

- mysql
- redis
- mongodb

### 🙋0x05 Anti-Crawling Measures

Anti-crawling

Counter anti-crawling

### 🙋0x06 Efficient Crawling

Multithreading

Multiprocessing

Asynchronous coroutines

scrapy framework
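
As a rough illustration of the multithreaded, producer-consumer style mentioned in this README, the sketch below uses `queue.Queue` and `threading` from the standard library. The `fetch` function is a placeholder that fabricates a page body instead of making an HTTP request, and the URLs are hypothetical; it is not the implementation used by any project in this repository.

```python
import queue
import threading


def fetch(url):
    """Placeholder for a real HTTP request; returns a fake page body."""
    return "<html>{}</html>".format(url)


def worker(tasks, results):
    """Consumer: take URLs off the task queue until a None sentinel arrives."""
    while True:
        url = tasks.get()
        if url is None:
            break
        results.put((url, fetch(url)))


def crawl(urls, num_workers=3):
    """Producer: enqueue URLs, run worker threads, and collect the results."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so every thread exits
    for thread in threads:
        thread.join()

    # All workers have finished, so the results queue can be drained safely.
    pages = {}
    while not results.empty():
        url, body = results.get()
        pages[url] = body
    return pages


# Hypothetical URLs used only to exercise the sketch.
pages = crawl(["https://example.com/a", "https://example.com/b"])
print(sorted(pages))
```

Threads suit this pattern because crawling is mostly I/O-bound; for CPU-heavy parsing, the multiprocessing approach listed above may be a better fit.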

### 🙋0x07 Visualization

flask Web

django Web

tkinter

echarts

electron

## Padding

…………

## Awesome-Example😍:

- [CriseLYJ/awesome-python-login-model](https://github.com/CriseLYJ/awesome-python-login-model)

- [lb2281075105/Python-Spider](https://github.com/lb2281075105/Python-Spider)

- [SpiderCrackDemo](https://github.com/wkunzhi/SpiderCrackDemo)

Add this expert on WeChat and reply '爬虫' (crawler) to be invited to the crawler discussion WeChat group.

![输入图片说明](https://images.gitee.com/uploads/images/2019/1021/210109_aa0ac4f0_5355334.jpeg "@(]01MR)D76(1N2KFX`R(YG.jpg")