# docker-hadoop-spark

**Repository Path**: c-quan/docker-hadoop-spark

## Basic Information

- **Project Name**: docker-hadoop-spark
- **Description**: Build Hadoop, Spark, Hive and Presto services with Docker; pick the ones you need
- **Primary Language**: Docker
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2021-10-04
- **Last Updated**: 2021-10-04

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

[![Gitter chat](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/big-data-europe/Lobby)

# Docker multi-container environment with Hadoop, Spark and Hive

This is it: a Docker multi-container environment with Hadoop (HDFS), Spark and Hive, but without the large memory requirements of a Cloudera sandbox. (On my Windows 10 laptop with WSL2 it seems to consume a mere 3 GB.)

The only thing lacking is that the Hive server doesn't start automatically. To be added when I understand how to do that in docker-compose.

## Quick Start

To deploy the HDFS-Spark-Hive cluster, run:
```
docker-compose up
```

`docker-compose` creates a Docker network that can be found by running `docker network list`, e.g. `docker-hadoop-spark-hive_default`.

Run `docker network inspect` on the network (e.g. `docker-hadoop-spark-hive_default`) to find the IP address the Hadoop interfaces are published on. Access these interfaces with the following URLs:

* Namenode: http://<dockerhadoop_IP_address>:9870/dfshealth.html#tab-overview
* History server: http://<dockerhadoop_IP_address>:8188/applicationhistory
* Datanode: http://<dockerhadoop_IP_address>:9864/
* Nodemanager: http://<dockerhadoop_IP_address>:8042/node
* Resource manager: http://<dockerhadoop_IP_address>:8088/
* Spark master: http://<dockerhadoop_IP_address>:8080/
* Spark worker: http://<dockerhadoop_IP_address>:8081/
* Hive: http://<dockerhadoop_IP_address>:10000

## Quick Start HDFS

Copy breweries.csv to the namenode.
```
docker cp breweries.csv namenode:breweries.csv
```

Go to the bash shell on the namenode container.
```
docker exec -it namenode bash
```

Create the HDFS directory /data/openbeer/breweries.
```
hdfs dfs -mkdir /data
hdfs dfs -mkdir /data/openbeer
hdfs dfs -mkdir /data/openbeer/breweries
```

Copy breweries.csv to HDFS:
```
hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
```

## Quick Start Spark (PySpark)

Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.

Go to the command line of the Spark master and start PySpark.
```
docker exec -it spark-master bash
/spark/bin/pyspark --master spark://spark-master:7077
```
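Inside the shell a SparkSession is already available as `spark`, which the snippets below use. If you would rather run the same thing as a standalone job instead of interactively, a minimal sketch could look like this (hostnames as defined in this docker-compose setup; the app name and file name are just examples):

```
# Sketch of a standalone PySpark job; assumes the spark-master and namenode
# hostnames resolve inside the docker-compose network.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("breweries-example")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Read the CSV uploaded to HDFS in the previous section and print a sample.
df = spark.read.csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")
df.show()

spark.stop()
```

Such a script could then be submitted from the spark-master container, e.g. with `/spark/bin/spark-submit --master spark://spark-master:7077 breweries_example.py`.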
Load breweries.csv from HDFS.
```
brewfile = spark.read.csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")
brewfile.show()
+----+--------------------+-------------+-----+---+
| _c0|                 _c1|          _c2|  _c3|_c4|
+----+--------------------+-------------+-----+---+
|null|                name|         city|state| id|
|   0|  NorthGate Brewing |  Minneapolis|   MN|  0|
|   1|Against the Grain...|   Louisville|   KY|  1|
|   2|Jack's Abby Craft...|   Framingham|   MA|  2|
|   3|Mike Hess Brewing...|    San Diego|   CA|  3|
|   4|Fort Point Beer C...|San Francisco|   CA|  4|
|   5|COAST Brewing Com...|   Charleston|   SC|  5|
|   6|Great Divide Brew...|       Denver|   CO|  6|
|   7|    Tapistry Brewing|     Bridgman|   MI|  7|
|   8|    Big Lake Brewing|      Holland|   MI|  8|
|   9|The Mitten Brewin...| Grand Rapids|   MI|  9|
|  10|      Brewery Vivant| Grand Rapids|   MI| 10|
|  11|    Petoskey Brewing|     Petoskey|   MI| 11|
|  12|  Blackrocks Brewery|    Marquette|   MI| 12|
|  13|Perrin Brewing Co...|Comstock Park|   MI| 13|
|  14|Witch's Hat Brewi...|   South Lyon|   MI| 14|
|  15|Founders Brewing ...| Grand Rapids|   MI| 15|
|  16|   Flat 12 Bierwerks| Indianapolis|   IN| 16|
|  17|Tin Man Brewing C...|   Evansville|   IN| 17|
|  18|Black Acre Brewin...| Indianapolis|   IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
```

## Quick Start Spark (Scala)

Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.

Go to the command line of the Spark master and start spark-shell.
```
docker exec -it spark-master bash
spark/bin/spark-shell --master spark://spark-master:7077
```

Load breweries.csv from HDFS.
```
val df = spark.read.csv("hdfs://namenode:8020/data/openbeer/breweries/breweries.csv")
df.show()
+----+--------------------+-------------+-----+---+
| _c0|                 _c1|          _c2|  _c3|_c4|
+----+--------------------+-------------+-----+---+
|null|                name|         city|state| id|
|   0|  NorthGate Brewing |  Minneapolis|   MN|  0|
|   1|Against the Grain...|   Louisville|   KY|  1|
|   2|Jack's Abby Craft...|   Framingham|   MA|  2|
|   3|Mike Hess Brewing...|    San Diego|   CA|  3|
|   4|Fort Point Beer C...|San Francisco|   CA|  4|
|   5|COAST Brewing Com...|   Charleston|   SC|  5|
|   6|Great Divide Brew...|       Denver|   CO|  6|
|   7|    Tapistry Brewing|     Bridgman|   MI|  7|
|   8|    Big Lake Brewing|      Holland|   MI|  8|
|   9|The Mitten Brewin...| Grand Rapids|   MI|  9|
|  10|      Brewery Vivant| Grand Rapids|   MI| 10|
|  11|    Petoskey Brewing|     Petoskey|   MI| 11|
|  12|  Blackrocks Brewery|    Marquette|   MI| 12|
|  13|Perrin Brewing Co...|Comstock Park|   MI| 13|
|  14|Witch's Hat Brewi...|   South Lyon|   MI| 14|
|  15|Founders Brewing ...| Grand Rapids|   MI| 15|
|  16|   Flat 12 Bierwerks| Indianapolis|   IN| 16|
|  17|Tin Man Brewing C...|   Evansville|   IN| 17|
|  18|Black Acre Brewin...| Indianapolis|   IN| 18|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
```

How cool is that? Your own Spark cluster to play with.

## Quick Start Hive

Go to the command line of the Hive server and start hiveserver2.
```
docker exec -it hive-server bash
hiveserver2
```

Maybe a little check that something is listening on port 10000 now:
```
netstat -anp | grep 10000
tcp        0      0 0.0.0.0:10000           0.0.0.0:*               LISTEN      446/java
```

Okay. Beeline is the command line interface for Hive. Let's connect to hiveserver2 now.
```
beeline
!connect jdbc:hive2://127.0.0.1:10000 scott tiger
```

Didn't expect to encounter scott/tiger again after my Oracle days. But there you have it. Definitely not a good idea to keep that user on production.
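Beeline isn't the only way in. If you would rather talk to hiveserver2 from Python, a rough sketch with the `pyhive` package could look like this (assumptions: port 10000 is reachable from wherever the script runs, and `pyhive` plus its thrift/sasl dependencies are installed; neither ships with this setup):

```
# Rough sketch: query hiveserver2 programmatically instead of via beeline.
# Assumes port 10000 is reachable and the `pyhive` package is installed.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="scott")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())  # e.g. [('default',)]
cursor.close()
conn.close()
```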
Not a lot of databases here yet.
```
show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (0.335 seconds)
```

Let's change that.
```
create database openbeer;
use openbeer;
```

And let's create a table.
```
CREATE EXTERNAL TABLE IF NOT EXISTS breweries(
    NUM INT,
    NAME CHAR(100),
    CITY CHAR(100),
    STATE CHAR(100),
    ID INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/data/openbeer/breweries';
```

And have a little select statement going.
```
select name from breweries limit 10;
+----------------------------------------------------+
| name                                               |
+----------------------------------------------------+
| name                                               |
| NorthGate Brewing                                  |
| Against the Grain Brewery                          |
| Jack's Abby Craft Lagers                           |
| Mike Hess Brewing Company                          |
| Fort Point Beer Company                            |
| COAST Brewing Company                              |
| Great Divide Brewing Company                       |
| Tapistry Brewing                                   |
| Big Lake Brewing                                   |
+----------------------------------------------------+
10 rows selected (0.113 seconds)
```

There you go: your private Hive server to play with.

## Configure Environment Variables

The configuration parameters can be specified in the hadoop.env file or as environment variables for specific services (e.g. namenode, datanode, etc.):
```
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
```

CORE_CONF corresponds to core-site.xml. fs_defaultFS=hdfs://namenode:8020 will be transformed into:
```
<property><name>fs.defaultFS</name><value>hdfs://namenode:8020</value></property>
```

To define a dash inside a configuration parameter, use a triple underscore, such as YARN_CONF_yarn_log___aggregation___enable=true (yarn-site.xml):
```
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
```

The available configurations are:

* /etc/hadoop/core-site.xml CORE_CONF
* /etc/hadoop/hdfs-site.xml HDFS_CONF
* /etc/hadoop/yarn-site.xml YARN_CONF
* /etc/hadoop/httpfs-site.xml HTTPFS_CONF
* /etc/hadoop/kms-site.xml KMS_CONF
* /etc/hadoop/mapred-site.xml MAPRED_CONF

If you need to extend some other configuration file, refer to the base/entrypoint.sh bash script.
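As a quick illustration of that naming convention, here is a small Python sketch of the substitution; the real transformation is done in bash by base/entrypoint.sh, so this is only an approximation for readability:

```
# Illustration of the env-var naming convention used by hadoop.env:
# PREFIX_CONF_some_property___name  ->  some.property-name
# (the real transformation is performed by base/entrypoint.sh in bash)
def to_hadoop_property(env_key: str) -> str:
    # Drop the "CORE_CONF_" / "YARN_CONF_" style prefix (first two tokens).
    _, _, name = env_key.split("_", 2)
    # Triple underscore becomes a dash, remaining underscores become dots.
    return name.replace("___", "-").replace("_", ".")

print(to_hadoop_property("CORE_CONF_fs_defaultFS"))
# fs.defaultFS
print(to_hadoop_property("YARN_CONF_yarn_log___aggregation___enable"))
# yarn.log-aggregation-enable
```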