# CrawlerJS **Repository Path**: linzaida/CrawlerJS ## Basic Information - **Project Name**: CrawlerJS - **Description**: crawler / 爬虫 Includes sofang, anjuke, 58.com and baidu api examples - **Primary Language**: Unknown - **License**: ISC - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 16 - **Created**: 2015-12-30 - **Last Updated**: 2024-10-13 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README [![CrawlerJS](assets/crawlerjs.jpg)](https://github.com/CrawlerJS/CrawlerJS) --- [![Coverage Status](https://img.shields.io/coveralls/CrawlerJS/CrawlerJS.svg)](https://coveralls.io/r/CrawlerJS/CrawlerJS) [![Build Status](https://secure.travis-ci.org/CrawlerJS/CrawlerJS.png)](https://travis-ci.org/CrawlerJS/CrawlerJS) [![Dependencies Status](https://david-dm.org/CrawlerJS/CrawlerJS.png)](https://david-dm.org/CrawlerJS/CrawlerJS) [![DevDependencies Status](https://david-dm.org/CrawlerJS/CrawlerJS/dev-status.png)](https://david-dm.org/CrawlerJS/CrawlerJS) Open source module for crawling web pages for information and data. This module is extremely extensible and pluggable, feel free to play with it. [Visit crawlerjs.org for more info][0]. ### Examples In this example we will be searching StackOverflow for the most recently questions about Node.js, other examples can be [found here](examples/). ```javascript var Crawler = require('crawler-js'); var manifest = { target: { // Here we need some infos to acess the website url: 'https://stackoverflow.com/questions/tagged/node.js', headers: ['User-Agent', 'CrawlerJS'] }, extractors: [{ // The name for this extractor name: 'questions', // Add a root key, so you can iterate over every element // found with this xPath expression root: '//div[@class="question-summary"]', fields: { // Those xPath expressions will be based // on the root question: '//a[1]/text()', link: '//a[1]/@href' } }, { name: 'count', fields: { // We don't have a root element, so this expression // will have all the document as the root and will // not iterate count: '//*[@id="sidebar"]/div[1]/div[1]/text()', } }] }; // Create a new job for your manifest var job = new Crawler(manifest); // Listen for data extracted job.on('data', function(data) { console.log('Data extracted for %s:', data.name); console.log(data.data); }); // If something bad happens.. job.on('error', function(err) { throw err; }); // This way we know that the job is done job.on('end', function() { console.log('The job is done'); }); // Start our job job.start(); ``` ### Testing Testing is easy, you need to have `grunt-cli` installed globally, clone this repository, `npm install` inside the folder and run `grunt test`. ### Todos * Add inline comments inside [the code](lib/). * Propper documentation. * API reference. * More tests. ### License Copyright (c) 2014, Rodrigo Matheus Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies. THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. [0]: http://crawlerjs.org