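# nutch2.2-eclipse-win64

**Repository Path**: brade/nutch2-2-eclipse-win64

## Basic Information

- **Project Name**: nutch2.2-eclipse-win64
- **Description**: Importing Nutch 2.2 into Eclipse on 64-bit Windows
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2017-05-08
- **Last Updated**: 2024-05-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

This project is a runnable win64 + Nutch 2.2 + MySQL + Eclipse setup in which the JAR conflicts have been resolved.

1. Download Nutch 2.2 from the official site, extract it locally, and edit `ivy/ivy.xml`:
   - Change the dependency revision from `0.3` to `0.2.1`.
   - Uncomment the commented-out dependency entry.
2. Configure the data store and the crawler:
   - Edit `gora.properties` and add the MySQL database settings:
     - `gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver`
     - `gora.sqlstore.jdbc.url=jdbc:mysql://192.168.142.128:3306/nutch?createDatabaseIfNotExist=true`
     - `gora.sqlstore.jdbc.user=root`
     - `gora.sqlstore.jdbc.password=123456`
   - Edit `nutch-site.xml` and add the following properties:

   ```
   <property>
     <name>http.agent.name</name>
     <value>nutch-brade</value>
     <description>HTTP 'User-Agent' request header. MUST NOT be empty</description>
   </property>
   <property>
     <name>http.robots.agents</name>
     <value>nutch-brade,*</value>
     <description>The agent strings we'll look for in robots.txt files,
     comma-separated, in decreasing order of precedence. You should put the
     value of http.agent.name as the first agent name, and keep the default *
     at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
   </property>
   <property>
     <name>plugin.folders</name>
     <value>./src/plugin</value>
     <description>Directories where nutch plugins are located. Each element
     may be a relative or absolute path. If absolute, it is used as is. If
     relative, it is searched for on the classpath.</description>
   </property>
   <property>
     <name>parser.character.encoding.default</name>
     <value>utf-8</value>
     <description>The character encoding to fall back to when no other
     information is available</description>
   </property>
   <property>
     <name>storage.data.store.class</name>
     <value>org.apache.gora.sql.store.SqlStore</value>
     <description>The Gora DataStore class for storing and retrieving data.</description>
   </property>
   <property>
     <name>http.content.limit</name>
     <value>-1</value>
     <description>The length limit for downloaded content using the http
     protocol, in bytes. If this value is nonnegative (>=0), content longer
     than it will be truncated; otherwise, no truncation at all. Do not
     confuse this setting with the file.content.limit setting.</description>
   </property>
   <property>
     <name>generate.batch.id</name>
     <value>*</value>
   </property>
   ```

3. Open a cmd window, change to the extracted directory, and run `ant eclipse`. (Step 2 can also be done after running Ant. Setting up the Ant/Ivy environment is not covered here; search online for instructions.)
4. The Ant build takes a long time and may need several attempts. On success it generates the `.project` and `.classpath` files in the directory.
5. Import the project into Eclipse. The directory layout looks like this:

   ![Directory after importing Nutch into Eclipse](https://git.oschina.net/uploads/images/2017/0508/144439_226508e4_133059.png "Nutch 2.2 directory")

6. Create the file `urls/seed.txt` under the nutch directory and list the URLs to crawl, one URL per line.
7. Edit `conf/regex-urlfilter.txt`, which holds the crawl rules (regular expressions).
8. Set the run arguments in the Eclipse run/debug configuration:
   - Program arguments: `-topN 3 -depth 5`
   - VM arguments: `-Xms64m -Xmx512m -Dhadoop.log.dir=logs -Dhadoop.log.file=Hadoop.log`
   - The finished configuration looks like this:

   ![Eclipse run/debug arguments](https://git.oschina.net/uploads/images/2017/0508/144850_78c5a65b_133059.png "Run argument configuration")

9. Modify the `checkReturnValue` method of the `org.apache.hadoop.fs.FileUtil` class in `hadoop-core-1.0.1.jar` so that the check is commented out. Without this change the run fails with the exception `Failed to set permissions of path:\tmp\hadoop-test\mapred\staging\test2083949620\.staging to 0700`.
10. Edit `gora-sql-mapping.xml` and set the length of the `content` and `text` fields to 10240 or another suitable value. This works around a column-length problem under utf8; search online for details.
11. Run or debug the project. When it finishes, check the database table; the data looks like this:

    ![Crawled data](https://git.oschina.net/uploads/images/2017/0508/151551_e5927595_133059.png "Crawled data")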
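
To try out rules for `conf/regex-urlfilter.txt` before starting a crawl, a small standalone check can help. The sketch below is a simplified approximation and not Nutch's own code. It assumes each rule line starts with `+` (accept) or `-` (reject) followed by a regular expression, that the first rule found anywhere in the URL decides, and that a URL matching no rule is skipped. The class name `UrlFilterSketch`, the sample rules, and the `example.com` URLs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    /** One rule line: '+' accepts, '-' rejects, followed by a regular expression. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses rule lines, ignoring blank lines and '#' comments. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    /** Assumption: the first rule whose regex is found in the URL wins; no match means skip. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rules: drop common image suffixes, then stay inside one site.
        List<Rule> rules = parse(Arrays.asList(
                "# skip image files",
                "-\\.(gif|jpg|png)$",
                "# accept anything on example.com",
                "+^https?://([a-z0-9]*\\.)*example\\.com/"));

        // Hypothetical seed URLs, one per line as in urls/seed.txt.
        List<String> seeds = Arrays.asList(
                "http://www.example.com/index.html",
                "http://www.example.com/logo.png",
                "http://other.org/");

        for (String url : seeds) {
            System.out.println(url + " -> " + (accepts(rules, url) ? "fetch" : "skip"));
        }
    }
}
```

Because the rules are checked in order, the reject rule for image suffixes has to come before the broader accept rule; swapping them would let `logo.png` through.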