# nutch2.2-eclipse-win64

**Repository Path**: brade/nutch2-2-eclipse-win64

## Basic Information

- **Project Name**: nutch2.2-eclipse-win64
- **Description**: Importing nutch2.2 into eclipse on win64
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2017-05-08
- **Last Updated**: 2024-05-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# This project is a runnable win64 + nutch2.2 + mysql + eclipse setup, with the jar conflicts resolved

1、
|-- Download nutch2.2 from the official site and extract it locally, then edit ivy/ivy.xml;
|-- Change the version in the dependency entry: 0.3 --> 0.2.1
|-- Uncomment the commented-out dependency entry

2、
|-- Edit gora.properties and add the MySQL database settings:

    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://192.168.142.128:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=root
    gora.sqlstore.jdbc.password=123456

|-- Edit nutch-site.xml and add the following configuration:

```
<property>
  <name>http.agent.name</name>
  <value>nutch-brade</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>nutch-brade,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.</description>
</property>
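<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
```

3、In a cmd window, change to the extraction directory and run ant eclipse (the edits in step 2 can also be made after running ant; setting up the ant/ivy environment is not covered here and can be looked up on Baidu)

4、The ant build takes quite a long time; if it fails, try it a few more times. On success, the .project and .classpath files are generated in the directory

5、Import the project into eclipse. The directory layout looks like this:

![nutch directory after importing into eclipse](https://git.oschina.net/uploads/images/2017/0508/144439_226508e4_133059.png "nutch2.2 directory")

6、Create the file urls/seed.txt under the nutch directory and write the URLs to crawl into it, exactly one URL per line

7、Edit conf/regex-urlfilter.txt, which holds the crawl rules (regular expressions)

8、Configure the run arguments in eclipse:
|-- program arguments
-topN 3 -depth 5
|-- VM arguments
-Xms64m -Xmx512m -Dhadoop.log.dir=logs -Dhadoop.log.file=Hadoop.log
|-- When finished, it looks like this:

![configuring the eclipse run/debug arguments](https://git.oschina.net/uploads/images/2017/0508/144850_78c5a65b_133059.png "run argument configuration")

9、Modify the checkReturnValue method of the class org.apache.hadoop.fs.FileUtil in hadoop-core-1.0.1.jar and comment out the check (without this change an exception is thrown: Failed to set permissions of path:\tmp\hadoop-test\mapred\staging\test2083949620\.staging to 0700)

10、Edit gora-sql-mapping.xml and change the length of the content/text fields to 10240 or some other length; this is a column length issue under utf8 and can be looked up on Baidu

11、Run/debug the project. When it finishes, check the database tables; the data looks like this:

![crawled data](https://git.oschina.net/uploads/images/2017/0508/151551_e5927595_133059.png "crawled data")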
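
As a rough illustration of the rules edited in step 7: each non-comment line of conf/regex-urlfilter.txt starts with `+` (accept) or `-` (reject) followed by a regular expression, and the first rule that matches a URL decides whether it is crawled; a URL that matches no rule is ignored. The Java sketch below imitates that first-match behaviour outside of nutch, so a rule set can be tried against the URLs in urls/seed.txt before a crawl. The class `UrlFilterSketch`, its methods, and the example.com rules are hypothetical and exist only for illustration; this is not nutch's own filter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    /** One rule: '+' means accept, '-' means reject, followed by a regex. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses rule text in the "+regex" / "-regex" style, skipping blanks and # comments. */
    static List<Rule> parse(String rulesText) {
        List<Rule> rules = new ArrayList<>();
        for (String line : rulesText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
        return rules;
    }

    /** First matching rule wins; a URL that matches no rule is rejected. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules, not taken from the project's regex-urlfilter.txt.
        String rulesText = String.join("\n",
                "# skip a few binary suffixes",
                "-\\.(gif|jpg|png|zip)$",
                "# accept pages of one site",
                "+^https?://([a-z0-9]*\\.)*example\\.com/");
        List<Rule> rules = parse(rulesText);

        System.out.println(accepts(rules, "http://www.example.com/index.html")); // true
        System.out.println(accepts(rules, "http://www.example.com/logo.png"));   // false
        System.out.println(accepts(rules, "http://other.org/"));                 // false
    }
}
```

Because the first match wins, the order of the lines matters: a broad `+` rule placed above a `-` rule would accept URLs that the `-` rule was meant to exclude.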
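
The change described in step 9 can be pictured with a small stand-alone sketch. It does not use Hadoop at all: `CheckReturnValueSketch` and both of its methods are hypothetical stand-ins, and the `short` permission parameter is an assumption that replaces Hadoop's own permission type. The first method behaves like a check that fails with the message quoted in step 9; the second shows the same method with the check commented out, which is the kind of edit step 9 asks for inside hadoop-core-1.0.1.jar.

```java
import java.io.File;
import java.io.IOException;

public class CheckReturnValueSketch {

    /** Stand-in for the unmodified check: a failed permission change raises an IOException. */
    static void checkReturnValueOriginal(boolean rv, File p, short permission) throws IOException {
        if (!rv) {
            throw new IOException("Failed to set permissions of path: " + p
                    + " to " + String.format("%04o", permission));
        }
    }

    /** Stand-in for the patched check: the validation is commented out, so nothing is thrown. */
    static void checkReturnValuePatched(boolean rv, File p, short permission) throws IOException {
        // if (!rv) {
        //     throw new IOException("Failed to set permissions of path: " + p
        //             + " to " + String.format("%04o", permission));
        // }
    }

    public static void main(String[] args) throws IOException {
        File staging = new File("tmp", ".staging"); // hypothetical path
        short permission = (short) 0700;            // octal literal, owner-only access

        try {
            checkReturnValueOriginal(false, staging, permission);
        } catch (IOException e) {
            System.out.println("original check: " + e.getMessage());
        }

        checkReturnValuePatched(false, staging, permission);
        System.out.println("patched check: no exception");
    }
}
```

Disabling the check this way hides every permission failure, not only the Windows one, so it is best kept to a local development copy of the jar.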