# dirbot **Repository Path**: enemy_705/dirbot ## Basic Information - **Project Name**: dirbot - **Description**: Scrapy project to scrape public web directories (educational) - **Primary Language**: Python - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2016-10-19 - **Last Updated**: 2020-12-18 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README ====== dirbot ====== This is a Scrapy project to scrape websites from public web directories. This project is only meant for educational purposes. Items ===== The items scraped by this project are websites, and the item is defined in the class:: dirbot.items.Website See the source code for more details. Spiders ======= This project contains one spider called ``dmoz`` that you can see by running:: scrapy list Spider: dmoz ------------ The ``dmoz`` spider scrapes the Open Directory Project (dmoz.org), and it's based on the dmoz spider described in the `Scrapy tutorial`_ This spider doesn't crawl the entire dmoz.org site but only a few pages by default (defined in the ``start_urls`` attribute). These pages are: * http://www.dmoz.org/Computers/Programming/Languages/Python/Books/ * http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/ So, if you run the spider regularly (with ``scrapy crawl dmoz``) it will scrape only those two pages. .. _Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html Pipelines ========= This project uses a pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:: dirbot.pipelines.FilterWordsPipeline