# nutch2.2-eclipse-win64
**Repository Path**: brade/nutch2-2-eclipse-win64
## Basic Information
- **Project Name**: nutch2.2-eclipse-win64
- **Description**: Importing nutch2.2 into eclipse on win64
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 1
- **Created**: 2017-05-08
- **Last Updated**: 2024-05-29
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# This project is a runnable win64 + nutch2.2 + mysql + eclipse setup with the jar conflicts resolved
1.
|--Download nutch2.2 from the official site, extract it locally, and edit ivy/ivy.xml;
|--Change the version: 0.3 --> 0.2.1
|--Uncomment the corresponding entry
2.
|--Edit gora.properties and add the mysql database settings:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.142.128:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456
|--Edit nutch-site.xml and add the following properties:
```
<property>
  <name>http.agent.name</name>
  <value>nutch-brade</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty</description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>nutch-brade,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content
  longer than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.</description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
```
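
Before running the crawl it can help to confirm that the four gora.properties keys from step 2 are present and non-empty. The following is a minimal sketch using only `java.util.Properties`; the class name `GoraPropsCheck` and its `missingKeys` helper are hypothetical (not part of Nutch or Gora), and the code only parses the text, it does not open a database connection.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; not part of Nutch or Gora.
class GoraPropsCheck {
    // The four keys added to gora.properties in step 2.
    static final String[] REQUIRED = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    // Returns the required keys that are missing or blank in the given properties text.
    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same values as listed in step 2.
        String sample = String.join("\n",
            "gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver",
            "gora.sqlstore.jdbc.url=jdbc:mysql://192.168.142.128:3306/nutch?createDatabaseIfNotExist=true",
            "gora.sqlstore.jdbc.user=root",
            "gora.sqlstore.jdbc.password=123456");
        System.out.println("Missing keys: " + missingKeys(sample));
    }
}
```

To check a real file, read conf/gora.properties into a string and pass it to `missingKeys`.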
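3. In a cmd window, change to the extracted directory and run ant eclipse (the edits in step 2 can also be made after running ant; setting up the ant/ivy environment is not covered here, search online for instructions)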
4. The ant build takes a long time; retry a few times if it fails. On success, the .project and .classpath files are generated in the directory
5.
Import the project into eclipse. [Screenshot: resulting project directory layout]

6. Create the file urls/seed.txt under the nutch directory and write the URLs to crawl into it, one URL per line
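
The seed file from step 6 can also be generated from code. This is a small sketch under the assumption that the nutch directory is passed as the first argument (defaulting to the current directory); the class name `SeedFileWriter` and the example URL are placeholders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration.
class SeedFileWriter {
    // Creates <nutchHome>/urls/seed.txt with one URL per line and returns its path.
    static Path writeSeeds(Path nutchHome, List<String> urls) {
        try {
            Path dir = Files.createDirectories(nutchHome.resolve("urls"));
            Path seed = dir.resolve("seed.txt");
            // Files.write puts each list element on its own line.
            Files.write(seed, urls, StandardCharsets.UTF_8);
            return seed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path nutchHome = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder URL; replace with the sites you want to crawl.
        Path seed = writeSeeds(nutchHome, Arrays.asList("http://example.com/"));
        System.out.println("Wrote " + seed.toAbsolutePath());
    }
}
```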
7. Edit conf/regex-urlfilter.txt; this file configures the crawl rules (regular expressions)
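
Each rule in regex-urlfilter.txt is a `+` (accept) or `-` (reject) followed by a regular expression; the first rule that matches a URL decides, and a URL that matches no rule is ignored. The sketch below imitates that evaluation with `java.util.regex` so rules can be tried out quickly. It is a simplified stand-in, not the Nutch plugin code; the class name `RegexUrlFilterSketch` and the two sample rules are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified imitation of first-match-wins +/- regex rules; not the Nutch implementation.
class RegexUrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses rule lines; blank lines and lines starting with '#' are skipped.
    RegexUrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample rules for illustration only.
        RegexUrlFilterSketch filter = new RegexUrlFilterSketch(Arrays.asList(
            "# skip images and static assets",
            "-\\.(gif|jpg|png|css|js)$",
            "+^https?://([a-z0-9-]+\\.)*example\\.com/"));
        System.out.println(filter.accepts("http://www.example.com/index.html"));
        System.out.println(filter.accepts("http://www.example.com/logo.png"));
    }
}
```

Because the first match wins, the order of lines matters: put reject rules for file types before the broad accept rule for your site.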
8. Set the following in the eclipse run configuration:
|--program arguments
-topN 3 -depth 5
|--VM arguments
-Xms64m -Xmx512m -Dhadoop.log.dir=logs -Dhadoop.log.file=Hadoop.log
|--[Screenshot: completed run configuration]

9. Modify the checkReturnValue method of the org.apache.hadoop.fs.FileUtil class in hadoop-core-1.0.1.jar and comment out the validation (without this change an exception is thrown: Failed to set permissions of path:\tmp\hadoop-test\mapred\staging\test2083949620\.staging to 0700)
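
The change in step 9 amounts to turning the permission check into a no-op. The sketch below mirrors the idea with stand-in types so that it compiles on its own. The real method lives in org.apache.hadoop.fs.FileUtil, and its exact signature in hadoop-core-1.0.1 should be checked against the Hadoop source before patching; the class name `FileUtilPatchSketch`, the method names, and the `short` permission parameter are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;

// Stand-in for illustration only; the real code is in org.apache.hadoop.fs.FileUtil.
class FileUtilPatchSketch {
    // Behaviour before the change, reconstructed from the exception message in step 9:
    // a failed permission call raises an IOException.
    static void checkReturnValueOriginal(boolean rv, File p, short permission) throws IOException {
        if (!rv) {
            throw new IOException("Failed to set permissions of path: " + p
                + " to " + String.format("%04o", permission));
        }
    }

    // Behaviour after the change: the validation is commented out, so a failed
    // permission call on Windows no longer aborts the job.
    static void checkReturnValuePatched(boolean rv, File p, short permission) throws IOException {
        // if (!rv) {
        //     throw new IOException("Failed to set permissions of path: " + p
        //         + " to " + String.format("%04o", permission));
        // }
    }

    public static void main(String[] args) throws IOException {
        File staging = new File("\\tmp\\hadoop-test\\mapred\\staging");
        checkReturnValuePatched(false, staging, (short) 0700);
        System.out.println("Patched check ignores the failed permission call");
    }
}
```

Note that this silences the check for every caller, which is acceptable for a local development run inside eclipse but not something to carry into a real cluster.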
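10. Edit gora-sql-mapping.xml and change the length of the content/text fields to 10240 or some other length; this is a column length issue under utf8, search online for details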
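
One way to reason about the utf8 length issue in step 10 is simple arithmetic, assuming the limit involved is MySQL's 65,535-byte maximum row size and that its utf8 character set reserves up to 3 bytes per character. Both assumptions should be checked against your MySQL version and table definition. The class `Utf8ColumnBudget` is hypothetical, and the calculation ignores length prefixes and the other columns in the table.

```java
// Hypothetical back-of-the-envelope check; ignores length prefixes and other columns.
class Utf8ColumnBudget {
    // Assumed MySQL limits, see the note above.
    static final int MYSQL_MAX_ROW_BYTES = 65535;
    static final int UTF8_MAX_BYTES_PER_CHAR = 3;

    // Worst-case storage for a character column of the given length.
    static long worstCaseBytes(int lengthInChars) {
        return (long) lengthInChars * UTF8_MAX_BYTES_PER_CHAR;
    }

    // True if the worst-case sizes of all given columns fit into one row.
    static boolean fitsInRow(int... columnLengthsInChars) {
        long total = 0;
        for (int length : columnLengthsInChars) {
            total += worstCaseBytes(length);
        }
        return total <= MYSQL_MAX_ROW_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("10240 chars need up to " + worstCaseBytes(10240) + " bytes");
        System.out.println("two 10240-char columns fit: " + fitsInRow(10240, 10240));
    }
}
```

Under these assumptions a length of 10240 leaves room for both the content and text columns, while much larger lengths do not.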
11. Run/debug; once it finishes, check the database tables. [Screenshot: resulting table data]
