# crawler4zb

**Repository Path**: tufeiping/crawler4zb

## Basic Information

- **Project Name**: crawler4zb
- **Description**: Crawler for tender and bidding websites
- **Primary Language**: NodeJS
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 12
- **Forks**: 5
- **Created**: 2016-10-10
- **Last Updated**: 2025-04-01

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

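# crawler4zb (tender website crawler)

### This system was built to help our company's sales staff monitor tender and bidding websites. I could not find a suitable tool on GitHub, so I wrote a simple one myself.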

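* The system uses a MySQL database. Before first use, create the `crawler4zb` database and then its tables; the creation script is in db/script.sql.
* The database connection is configured in db/index.js.

* Development environment: node v6.5.0 [v6.9.0], Visual Studio Code v1.7.1
* Tested runtime environments: `Linux Ubuntu 16.04.1 LTS` and `Windows 8`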
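
As a rough illustration of the connection settings mentioned above, the sketch below builds an options object from environment variables. It is an assumption, not the contents of the real db/index.js: the helper name `buildDbConfig` and the `DB_*` variable names are hypothetical, and the option names (host, port, user, password, database) are those accepted by a driver such as the `mysql` npm package.

```javascript
"use strict";

// Hypothetical helper: the real db/index.js may be organised differently.
// The returned keys match the connection options of the `mysql` npm package.
function buildDbConfig(env) {
    return {
        host: env.DB_HOST || "localhost",
        port: Number(env.DB_PORT) || 3306,     // 3306 is MySQL's default port
        user: env.DB_USER || "root",
        password: env.DB_PASSWORD || "",
        database: env.DB_NAME || "crawler4zb"  // database created in the setup step
    };
}

var dbConfig = buildDbConfig(process.env);
console.log("database:", dbConfig.database);
```

Reading the values from the environment keeps credentials out of the repository; hard-coding them in the file works just as well for a local setup.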



--------------

> npm install 
> node index.js --help
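> node index.js -w # start the web interface
> node index.js -n # fetch new tender notices
> node index.js -s # email the tender notices to the registered mailing list
> node index.js -a # fetch tender notices and email them to the list
> node index.js -i # rebuild the entire index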

--------------

## Web browsing and basic search

-------------

> node index.js -w  # start as a web service

* http://localhost:3000  # listens on port 3000 by default
* http://localhost:3000/admin # administration console

-------------

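## Deployment

* Start `node index.js -w`; the site can then be reached from a browser.
* Schedule `node index.js -a` with Task Scheduler (Windows) or crontab (Linux) so that it runs at the interval you choose.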


## Code structure

<pre>
    
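    +-- db database support module (MySQL)
    |
    +-- mail mail-sending module; its config.js was removed for privacy reasons, so write your own following the notes in index.js
    |
    +-- modules holds all spider modules; every spider inherits from the base module and implements one or two of its methods
    |
    +-- schdl scheduling module, the core of the whole system
    |
    +-- spider crawler wrapper around request that makes it simpler to use
    |
    +-- web web browsing and search module
    |
    +-- index full-text search support
    |
    index.js entry point; supports several startup options
    package.json package definition; all of the system's dependencies are declared here
    test.js test script; while developing a spider, run node test.js module-name to check that the spider works
    solrtest.js Solr full-text search test script
    managed-schema Solr configuration file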

</pre>

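## Writing and debugging spiders

* A spider must live in the modules directory and must inherit from the Base class. In most cases you only need to implement two methods, getPageUrl and getItems: getPageUrl returns the URL for a given page number, and getItems parses the page content jQuery-style and returns an array of item objects, each with three properties: title, url, and source (see the example at the end of this document). The Base class also offers some advanced hooks that can be overridden (getItemWithPage), but most spiders do not need them.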

* To test a spider, run `node test.js module-name`, where module-name is the spider's file name (without the .js extension).
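
When debugging a spider it can help to check that every object returned by getItems really carries the title, url, and source properties described above. The following is a minimal sketch of such a check; `validateItems` is a hypothetical helper and is not part of the repository or of test.js.

```javascript
"use strict";

// Hypothetical checker, not part of the repository: returns a list of
// human-readable problems found in the array produced by a spider's getItems().
function validateItems(items) {
    var problems = [];
    if (!Array.isArray(items)) {
        return ["getItems must return an array"];
    }
    items.forEach(function (item, index) {
        ["title", "url", "source"].forEach(function (key) {
            var value = item ? item[key] : undefined;
            if (typeof value !== "string" || value.trim() === "") {
                problems.push("item " + index + ": missing " + key);
            }
        });
    });
    return problems;
}

// Example: the second item has no url, so exactly one problem is reported.
var sample = [
    { title: "Notice A", url: "http://www.xxcg.gov.cn/a.html", source: "xx" },
    { title: "Notice B", source: "xx" }
];
console.log(validateItems(sample)); // [ 'item 1: missing url' ]
```

An empty result means every item has the three expected string properties.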


<img src="http://save.tufeiping.cn/PIC_2016-11-14_09-20-20.png"></img>


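## Full-text search support

To give a better search experience, full-text search was added after the system had been running for a month. The search service is built on Solr, so first download a recent Solr release (6.6.2; because 6.3 was used previously, the 7.x releases are avoided for now to keep the version gap small):
http://mirrors.hust.edu.cn/apache/lucene/solr/

* Unpack the archive and run `bin/solr start` (on Windows, open cmd in the bin directory of the unpacked files and enter the command there).
* Open http://localhost:8983/solr in a browser to check that Solr started correctly.
* Create a core from the command line: `bin/solr create -c zb_core`
* Copy the managed-schema file into $solr_home/server/solr/zb_core/conf, then run `bin/solr stop` and start Solr again so that the configuration takes effect (on Windows: `solr stop -all`).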


> node index.js -i # rebuild the index

Solr listens on port 8983 by default. You can choose a different port; if you do, update the corresponding properties of the solrConfig object in index/index.js.
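
To make the port setting concrete, here is a small sketch of how a configuration object could be turned into a query URL for the `zb_core` core. The property names `host`, `port`, and `core` and the helper `buildSelectUrl` are assumptions for illustration; the real solrConfig object in index/index.js may use different names. The `/solr/<core>/select` path with the `q` and `wt` parameters is Solr's standard query endpoint.

```javascript
"use strict";

var querystring = require("querystring");

// Hypothetical shape: the property names of the real solrConfig object
// in index/index.js may differ.
var solrConfig = {
    host: "localhost",
    port: 8983,      // Solr's default port; change it if Solr runs on another one
    core: "zb_core"  // the core created with `bin/solr create -c zb_core`
};

// Build the URL of the core's standard /select handler for a keyword query.
function buildSelectUrl(config, keyword) {
    var params = querystring.stringify({ q: keyword, wt: "json" });
    return "http://" + config.host + ":" + config.port +
        "/solr/" + config.core + "/select?" + params;
}

console.log(buildSelectUrl(solrConfig, "server"));
// http://localhost:8983/solr/zb_core/select?q=server&wt=json
```

Opening the printed URL in a browser is a quick way to confirm that the configured port and core match the running Solr instance.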

**Note: because the runtime environment must support Solr 6.3.0, you need Java 8 (JDK 1.8) installed in addition to Node.js.**

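## Writing a spider

If a spider crawls a fixed set of pages (the system handles de-duplication automatically) and the pages are fetched with GET, you can use the simplest spider template, shown below: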

<pre>
"use strict"

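/**
 * Spider for an example procurement website
 *  
 */

var Base = require("./base"); // parent class to inherit from
var xx_spider = new Base();
xx_spider.name = "某采购网"; // spider name; written to the database as the data source
xx_spider.page = 1; // number of pages to crawl; defaults to 5
xx_spider.encode = "utf-8"; // encoding of the crawled pages; defaults to utf-8

// Overriding the two methods below is all that is needed
xx_spider.getPageUrl = function (pageIndex) { // return the URL of the page for the given page number
    return "http://www.xxcg.gov.cn/pg=" + pageIndex; 
};

xx_spider.getItems = function ($) { // extract the tender items from the page; $ is a jQuery-like object (cheerio) and can be used jQuery-style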
    var items = [];
    $("ul li a").each(function (index, ele) {
        var e = $(ele);
        var item = {};
        item.title = e.text();
        item.url = "http://www.xxcg.gov.cn/" + e.attr("href");
        item.source = xx_spider.name;
        items.push(item);
    });
    return items;
};

exports = module.exports = xx_spider;
</pre>
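
To show how the two overridden methods fit together, the sketch below walks the configured number of pages, calls getPageUrl and getItems for each, and keeps only items whose URL has not been seen before. It is a simplified, synchronous illustration under stated assumptions, not the project's schdl module: `collectNewItems` and `loadPage` are hypothetical names, `loadPage` stands in for the download-and-parse step, and de-duplicating by URL is an assumption about how the system removes duplicates.

```javascript
"use strict";

// Hypothetical driver, not the project's scheduler. `loadPage(url, encode)` is
// an assumed helper that fetches a URL with GET and returns a cheerio-style `$`;
// `seenUrls` is a Set of URLs that have already been stored.
function collectNewItems(spider, loadPage, seenUrls) {
    var fresh = [];
    for (var pageIndex = 1; pageIndex <= spider.page; pageIndex++) {
        var $ = loadPage(spider.getPageUrl(pageIndex), spider.encode);
        spider.getItems($).forEach(function (item) {
            if (!seenUrls.has(item.url)) { // skip items whose URL was already seen
                seenUrls.add(item.url);
                fresh.push(item);
            }
        });
    }
    return fresh;
}

// Stub spider and loader so the sketch runs without network access.
var stubSpider = {
    page: 2,
    encode: "utf-8",
    getPageUrl: function (pageIndex) { return "http://www.xxcg.gov.cn/pg=" + pageIndex; },
    getItems: function (page) { return page.items; } // the stub "page" already holds its items
};
function stubLoadPage(url) {
    return { items: [
        { title: "Shared notice", url: "http://www.xxcg.gov.cn/shared.html", source: "stub" },
        { title: "Notice from " + url, url: url + "/item.html", source: "stub" }
    ] };
}

var found = collectNewItems(stubSpider, stubLoadPage, new Set());
console.log(found.length); // 3
```

The shared notice appears on both stub pages but is collected only once, which is why three items are returned rather than four.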

## Automatic email delivery

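> Use a database client to insert the sending account's details into the mailconfig table; for security reasons the system provides no editing interface for it
> insert into mailconfig(host, secureConnection, fromaddr, username, userpass, subject, port) values ('mail.xxxx.com', 1, 'Tender Info <xx@xxxx.com>', 'xx@xxxx.com', 'password', '[Important] New tender updates', 25)
> host: mail server; secureConnection: [0,1] whether to use an SSL connection; fromaddr: sender's email address; username: login name; userpass: login password; subject: email subject; port: SMTP port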
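
The columns listed above map naturally onto the options of an SMTP client. The sketch below shows one possible mapping, assuming a nodemailer-style transport (options `host`, `port`, `secure`, and `auth`); whether the project's mail module uses nodemailer, and how it builds its options, is not stated here, and `toMailSettings` is a hypothetical helper.

```javascript
"use strict";

// Hypothetical mapping from a mailconfig row to nodemailer-style settings;
// the project's mail module may build its options differently.
function toMailSettings(row) {
    return {
        transport: {
            host: row.host,
            port: row.port,
            secure: row.secureConnection === 1, // 1 = SSL connection, 0 = plain
            auth: { user: row.username, pass: row.userpass }
        },
        message: { from: row.fromaddr, subject: row.subject }
    };
}

// Row mirroring the sample insert statement above.
var row = {
    host: "mail.xxxx.com",
    secureConnection: 1,
    fromaddr: "Tender Info <xx@xxxx.com>",
    username: "xx@xxxx.com",
    userpass: "password",
    subject: "[Important] New tender updates",
    port: 25
};
console.log(toMailSettings(row).transport.secure); // true
```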

> Developed by Sunny <tufeiping@gmail.com>, 2016-10-10. Have fun!!