# Docker Hadoop Workbench

A Hadoop cluster based on Docker, including Hive and Spark.

## Introduction

This repository uses [Docker Compose](https://docs.docker.com/compose/) to initialize a Hadoop cluster including the following:

- Hadoop
- Hive
- Spark

Please note that this project is built on top of [Big Data Europe](https://github.com/big-data-europe)'s work. Please check their [Docker Hub](https://hub.docker.com/u/bde2020/) for the latest images.

This project is based on the following Docker versions:

```
Client:
 Version:           20.10.2

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6

docker-compose version 1.29.1, build c34c88b2
```

## Quick Start

To start the cluster, simply run:

```
./start_demo.sh
```

Alternatively, you can use `v2`, which is built on top of my own `spark-master` and `spark-history-server`:

```
./start_demo_v2.sh
```

You can stop the cluster using `./stop_demo.sh` or `./stop_demo_v2.sh`. You can also modify `DOCKER_COMPOSE_FILE` in `start_demo.sh` and `stop_demo.sh` to use other YAML files.

## Interfaces

- Namenode: http://localhost:9870/dfshealth.html#tab-overview
- Datanode: http://localhost:9864/
- ResourceManager: http://localhost:8088/cluster
- NodeManager: http://localhost:8042/node
- HistoryServer: http://localhost:8188/applicationhistory
- HiveServer2: http://localhost:10002/
- Spark Master: http://localhost:8080/
- Spark Worker: http://localhost:8081/
- Spark Job WebUI: http://localhost:4040/ (only works while a Spark job is running on `spark-master`)
- Presto WebUI: http://localhost:8090/
- Spark History Server: http://localhost:18080/

## Connections

Use `hdfs dfs` to connect to `hdfs://localhost:9000/` (please make sure you have [Hadoop](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html) installed first):

```
hdfs dfs -ls hdfs://localhost:9000/
```

Use Beeline to connect to HiveServer2 (please make sure you have [Hive](https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation) installed first):

```
beeline -u jdbc:hive2://localhost:10000/default -n hive -p hive
```

Use `spark-shell` to connect to the Hive Metastore via the Thrift protocol (please make sure you have [Spark](https://spark.apache.org/downloads.html) installed first):

```
$ spark-shell

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.11)

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport.appName("thrift-test").getOrCreate
spark.sql("show databases").show

// Exiting paste mode, now interpreting.

+---------+
|namespace|
+---------+
|  default|
+---------+

import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1223467f
```

Use the [Presto CLI](https://prestodb.io/docs/current/installation/cli.html) to connect to Presto and query Hive data:

```bash
wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.255/presto-cli-0.255-executable.jar
mv presto-cli-0.255-executable.jar presto
chmod +x presto
./presto --server localhost:8090 --catalog hive --schema default
```
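For a quick, non-interactive check, the same CLI can run a single statement and exit. A minimal sketch, assuming the cluster started by `start_demo.sh` is up and using the Presto CLI's `--execute` option:

```bash
# List the tables exposed through the Hive catalog, then exit
./presto --server localhost:8090 --catalog hive --schema default --execute 'SHOW TABLES;'
```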
## Run MapReduce Job `WordCount`

This part is based on [Big Data Europe's Hadoop Docker](https://github.com/big-data-europe/docker-hadoop) project.

First run `hadoop-base` as a helper container:

```bash
docker run -d --network hadoop --env-file hadoop.env --name hadoop-base bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 tail -f /dev/null
```

Then run the following:

```bash
docker exec -it hadoop-base hdfs dfs -mkdir -p /input/
docker exec -it hadoop-base hdfs dfs -copyFromLocal -f /opt/hadoop-3.2.1/README.txt /input/
docker exec -it hadoop-base mkdir jars
docker cp jars/WordCount.jar hadoop-base:jars/WordCount.jar
docker exec -it hadoop-base /bin/bash
hadoop jar jars/WordCount.jar WordCount /input /output
```

You should be able to see the job in http://localhost:8088/cluster/apps and http://localhost:8188/applicationhistory (when finished). After the job is finished, check the result:

```bash
hdfs dfs -cat /output/*
```

Then type `exit` to exit the container.

## Run Hive Job

Make sure `hadoop-base` is running. Then prepare the data:

```bash
docker exec -it hadoop-base hdfs dfs -mkdir -p /test/
docker exec -it hadoop-base mkdir test
docker cp data hadoop-base:test/data
docker exec -it hadoop-base /bin/bash
hdfs dfs -put test/data/* /test/
hdfs dfs -ls /test
exit
```

Then create the table:

```bash
docker cp scripts/hive-beers.q hive-server:hive-beers.q
docker exec -it hive-server /bin/bash
cd /
hive -f hive-beers.q
exit
```

Then play with the data using Beeline:

```
beeline -u jdbc:hive2://localhost:10000/test -n hive -p hive
0: jdbc:hive2://localhost:10000/test> select count(*) from beers;
```

You should be able to see the job in http://localhost:8088/cluster/apps and http://localhost:8188/applicationhistory (when finished).

## Run Spark Shell

Make sure you have prepared the data and created the table in the previous step.

```
docker exec -it spark-master spark/bin/spark-shell

scala> spark.sql("show databases").show
+---------+
|namespace|
+---------+
|  default|
|     test|
+---------+

scala> val df = spark.sql("select * from test.beers")
df: org.apache.spark.sql.DataFrame = [id: int, brewery_id: int ... 11 more fields]

scala> df.count
res0: Long = 7822
```

You should be able to see the Spark Shell session in http://localhost:8080/ and your job in http://localhost:4040/jobs/.

If you encounter the following warning when running `spark-shell`:

```
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
```

please check the logs of `spark-master` using `docker logs -f spark-master`. If the following message appears, restart your `spark-worker` using `docker-compose restart spark-worker`:

```
WARN Master: Got heartbeat from unregistered worker worker-20210622022950-xxx.xx.xx.xx-xxxxx. This worker was never registered, so ignoring the heartbeat.
```

Similarly, to run `spark-sql`, use `docker exec -it spark-master spark/bin/spark-sql`.
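Like the Hive and Presto CLIs, `spark-sql` can also take a query on the command line instead of opening an interactive shell. A minimal sketch using its `-e` option, assuming the `test.beers` table created in the Hive job above:

```bash
# Run a single SQL statement against the Hive table created earlier, then exit
docker exec -it spark-master spark/bin/spark-sql -e "SELECT count(*) FROM test.beers"
```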
## Run Spark Submit

```bash
docker exec -it spark-master /spark/bin/spark-submit --class org.apache.spark.examples.SparkPi /spark/examples/jars/spark-examples_2.12-3.1.1.jar 100
```

You should be able to see Spark Pi in http://localhost:8080/ and your job in http://localhost:4040/jobs/.

## Configuration Files

Some configuration file locations are listed below. The non-empty configuration files are also copied to the `conf` folder for future reference.

- `namenode`:
  - `/etc/hadoop/core-site.xml` CORE_CONF
  - `/etc/hadoop/hdfs-site.xml` HDFS_CONF
  - `/etc/hadoop/yarn-site.xml` YARN_CONF
  - `/etc/hadoop/httpfs-site.xml` HTTPFS_CONF
  - `/etc/hadoop/kms-site.xml` KMS_CONF
  - `/etc/hadoop/mapred-site.xml` MAPRED_CONF
- `hive-server`:
  - `/opt/hive/hive-site.xml` HIVE_CONF
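If you ever want to refresh the `conf` folder yourself, copying the files out of the running containers is one way to do it. A rough sketch, assuming the container names and paths listed above (only two files shown; repeat the pattern for the rest):

```bash
# Copy configuration files out of the running containers for local reference
mkdir -p conf
docker cp namenode:/etc/hadoop/core-site.xml conf/core-site.xml
docker cp hive-server:/opt/hive/hive-site.xml conf/hive-site.xml
```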