# hive_learning

**Repository Path**: sun81911/hive_learning

## Basic Information

- **Project Name**: hive_learning
- **Description**: hive知识点学习
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2
- **Created**: 2020-06-12
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

##  Hive 概述

### Hive 基本概念

Hive：由 Facebook 开源用于解决海量「结构化日志」的数据统计

Hive 是基于 Hadoop 的一个数据仓库工具，可以将结构化的数据文件映射为一张表，并提供「类SQL」查询功能

本质是：**将 HQL 转化成 MapReduce 程序**

![](imgs/hql转mr流程.png)

1. Hive 处理的数据存储在 HDFS
2. Hive 分析数据底层的默认实现是 MapReduce
3. 执行程序运行在 Yarn 上

### Hive 优缺点

#### 优点

1. 操作接口采用类 SQL 语法，提供快速开发的能力
2. 避免了去写 MapReduce，减少开发人员的学习成本
3. Hive 的执行延迟比较高，因此 Hive 常用于「数据分析」，对实时性要求不高的场合
4. Hive 优势在于处理大数据，对于处理小数据没有优势，因为 Hive 的执行延迟比较高
5. Hive 支持用户自定义函数，用户可以根据自己的需求来实现自己的函数

#### 缺点

1. Hive 的 HQL 表达能力有限，迭代式算法无法表达，数据挖掘方面不擅长
2. Hive 的效率比较低，Hive 自动生成的 MapReduce 作业，通常情况下不够智能化，Hive 调优比较困难，粒度较粗

### Hive 架构原理

![](imgs/hive架构.png)

1. **Client**

   用户接口，CLI（Hive Shell）、JDBC/ODBC(java 访问 hive)、WEBUI（浏览器访问 Hive） 

2. **Metastore**

   元数据包括表名、表所属的数据库（默认是 default）、表的拥有者、列/分区字段、表的类型（是否是外部表）、表的数据所在目录等

   默认存储在自带的 `derby` 数据库中，推荐使用 MySQL 存储 Metastore

3. **Hadoop**

   使用 HDFS 进行存储，使用 MapReduce 进行计算

4. **Driver**

   1. **解析器（SQL Parser）**将 SQL 字符串转换成「抽象语法树AST」，这一步一般都用第三方工具库完成，比如 antlr，对 AST 进行语法分析，比如表是否存在、字段是否存在、SQL 语义是否有误
   2. **编译器（Physical Plan）**将 AST 编译生成逻辑执行计划
   3. **优化器（Query Optimizer）**对逻辑执行计划进行优化
   4. **执行器（Execution）**把逻辑执行计划转换成可以运行的物理计划。对于 Hive 来说，就是 MapReduce/Spark

![](imgs/hive运行机制.png)

Hive 通过给用户提供的一系列交互接口，接收到用户的指令(SQL)，使用自己的 Driver，结合元数据(MetaStore)，将这些指令翻译成 MapReduce，提交到 Hadoop 中执行，最后将执行返回的结果输出到用户交互接口

### Hive与数据库比较

由于 Hive 采用了类似 SQL 的查询语言 HQL(Hive Query Language)，因此很容易将 Hive 理解为数据库。其实从结构上来看，Hive 和数据库除了拥有类似的查询语言，再无类似之处，数据库可以用在 Online 的应用中，但是 Hive 是为「数据仓库」而设计的

#### 查询语言

由于 SQL 被广泛的应用在数据仓库中，因此，专门针对 Hive 的特性设计了类 SQL 的查询语言 HQL

#### 数据存储位置

Hive 是建立在 Hadoop 之上的，所有 Hive 的数据都是存储在 HDFS 中的

而数据库则可以将数据保存在块设备或者本地文件系统中

#### 数据更新

由于 Hive 是针对数据仓库应用设计的，而数据仓库的内容是「读多写少」的。因此，Hive中不建议对数据的改写，所有的数据都是在加载的时候确定好的

而数据库中的数据通常是需要经常进行修改的 ，因此可以使用`INSERT INTO … VALUES`添加数据，使用`UPDATE … SET`修改数据

#### 索引

Hive 在加载数据的过程中不会对数据进行任何处理，甚至不会对数据进行扫描，因此也没有对数据中的某些 Key 建立索引。Hive 要访问数据中满足条件的特定值时，需要暴力扫描整个数据，因此访问延迟较高。由于 MapReduce 的引入， Hive 可以并行访问数据，因此即使没有索引，对于大数据量的访问，Hive 仍然可以体现出优势。由于数据的访问延迟较高，决定了 Hive 不适合在线数据查询

数据库中，通常会针对一个或者几个列建立索引，因此对于少量的特定条件的数据的访问，数据库可以有很高的效率、较低的延迟

#### 执行

Hive 中大多数查询的执行是通过 Hadoop 提供的 MapReduce 来实现的

而数据库通常有自己的执行引擎

#### 执行延迟

Hive 在查询数据的时候，由于没有索引，需要扫描整个表，因此延迟较高。另外一个导致 Hive 执行延迟高的因素是 MapReduce 框架，由于 MapReduce 本身具有较高的延迟，因此在利用 MapReduce 执行 Hive 查询时，也会有较高的延迟

相对来说，数据库的执行延迟较低，这个低是有条件的，即数据规模较小的时候，当数据规模大到超过数据库的处理能力的时候，Hive 的并行计算显然能体现出优势

#### 可扩展性

由于 Hive 是建立在 Hadoop 之上的，因此 Hive 的可扩展性是和 Hadoop 的可扩展性是一致的

而数据库由于 ACID 语义的严格限制，扩展行非常有限

#### 数据规模

由于 Hive 建立在集群上并可以利用 MapReduce 进行并行计算，因此可以支持很大规模的数据

对应的，数据库适合支持的数据规模较小

## Hive 安装

### Hive 安装部署

#### Hive 安装及配置

1. 把 `apache-hive-1.2.1-bin.tar.gz` 上传到 `/opt/software` 目录下

2. 解压 `apache-hive-1.2.1-bin.tar.gz` 到 `/opt/module/` 目录下面

   ```shell
   tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/module/
   ```

3. 将`apache-hive-1.2.1-bin`更名为`hive-1.2.1`

   ```shell
   mv apache-hive-1.2.1-bin hive-1.2.1
   ```

4. 修改`/opt/module/hive-1.2.1/conf `目录下的` hive-env.sh.template` 名称为 `hive-env.sh`

   ```shell
   cp hive-env.sh.template hive-env.sh
   ```

5. 配置 `hive-env.sh` 文件

   ```shell
   # Set HADOOP_HOME to point to a specific hadoop install directory
   export HADOOP_HOME=/opt/module/hadoop-2.7.2
   # Hive Configuration Directory can be controlled by:
   export HIVE_CONF_DIR=/opt/module/hive-1.2.1/conf
   ```

#### Hadoop 集群配置

启动 HDFS 和 Yarn

```shell
start-dfs.sh
start-yarn.sh
```

#### Hive 基本操作

1. 启动 hive

   ```shell
   bin/hive
   ```

2. 查看数据库

   ```sql
   hive> show databases;
   ```

3. 打开默认数据库

   ```sql
   hive> use default;
   ```

4. 显示 default 数据库中的表

   ```sql
   hive> show tables;
   ```

5. 创建一张表

   ```sql
   hive> create table student(id int, name string);
   ```

6. 查看表的结构

   ```sql
   hive> desc student;
   ```

7. 查询表中数据

   ```sql
   hive> select * from student;
   ```

8. 向表中插入数据

   ```sql
   hive> insert into student values(1, 'roger');
   ```

9. 退出 hive

   ```sql
   hive> quit;
   ```

### 将本地文件导入 Hive

#### 数据准备

`/opt/module/hive-1.2.1/input/students.txt`

```
1       hello
2       hadoop
3       hive
```

**注意**：以 `tab` 键间隔

#### 具体操作

1. 加载 `/opt/module/hive-1.2.1/input/students.txt` 文件到 `student` 数据库表中

   ```sql
   hive> load data local inpath '/opt/module/hive-1.2.1/input/students.txt' into table student;
   ```

2. 此时查看数据

   ```sql
   hive> select * from student;
   ```

   由于分割符不同，导致数据导入成`NULL`

   ```
   1       roger
   2       hello
   NULL    NULL
   NULL    NULL
   NULL    NULL
   ```

3. 删除已创建的 student 表

   ```sql
   hive> drop table student;
   ```

4. 创建 student 表, 并声明文件分隔符`\t`

   ```sql
   hive> create table student(id int, name string) row format delimited fields terminated by '\t';
   ```

5. 此时再加载 `/opt/module/hive-1.2.1/input/students.txt` 文件到 `student` 数据库表中

   ```sql
   hive> load data local inpath '/opt/module/apache-hive-1.2.1-bin/input/students.txt' into table student;
   ```

   ```
   1       hello
   2       hadoop
   3       hive
   ```

   **补充**：若想使用HDFS文件系统路径，可以直接使用`inpath`不需要`local`，在HDFS文件路径中导入数据为「移动」行为，即文件数据移动到hive表的对应目录下，如`/user/hive/warehouse/student`中，而本地文件系统导入数据则是「复制」行为

#### 问题产生

再打开一个客户端窗口启动 hive，会产生` java.sql.SQLException` 异常

原因是 MetaStore 默认存储在自带的 derby 数据库中，推荐使用 MySQL 存储 MetaStore

### MySQL 安装

#### 添加MySQL Yum存储库

首先将MySQL Yum存储库添加到系统的存储库列表中，可以通过安装MySQL提供的RPM来执行

1. 转到MySQL开发人员区域的[下载MySQL Yum储存库页面](https://dev.mysql.com/downloads/repo/yum/)

2. 选择并下载对应平台的发布包

3. 使用以下命令安装下载的发布包，将平台和版本特定的包名替换为下载的RPM包名

   基于`el7`的系统:

   ```shell
   yum localinstall mysql57-community-release-el7-{version-number}.noarch.rpm
   ```

4. 但由于官网提供下载的rpm为`mysql80-community-release-el7-3.noarch.rpm`版本，故使用

   ```shell
   yum localinstall mysql80-community-release-el7-3.noarch.rpm
   ```

   补充：官网默认提供该最新版本，而在该版本的MySQL Yum存储库中也包含了「MySQL 5.7」在内的多个版本系列的的最新MySQL版本

5. 可以通过以下命令检查MySQL Yum存储库是否已成功添加

   ```shell
   yum repolist enabled | grep "mysql.*-community.*"
   ```

   ```
   mysql-connectors-community/x86_64       MySQL Connectors Community           153
   mysql-tools-community/x86_64            MySQL Tools Community                110
   mysql80-community/x86_64                MySQL 8.0 Community Server           177
   ```

#### 选择发布系列

在MySQL Yum存储库中，MySQL社区服务器的不同版本系列驻留在不同的子存储库中

最新的GA系列（目前是`MySQL 8.0`）的子存储库在默认情况下是启用的，所有其他系列（例如`MySQL 5.7`系列）的子存储库在默认情况下是禁用的

使用此命令查看MySQL Yum存储库中的所有子存储库，并查看启用或禁用了哪些子存储库

```shell
yum repolist all | grep mysql
```

```
mysql-cluster-7.5-community/x86_64 MySQL Cluster 7.5 Community      禁用
mysql-cluster-7.5-community-source MySQL Cluster 7.5 Community - So 禁用
mysql-cluster-7.6-community/x86_64 MySQL Cluster 7.6 Community      禁用
mysql-cluster-7.6-community-source MySQL Cluster 7.6 Community - So 禁用
mysql-cluster-8.0-community/x86_64 MySQL Cluster 8.0 Community      禁用
mysql-cluster-8.0-community-source MySQL Cluster 8.0 Community - So 禁用
mysql-connectors-community/x86_64  MySQL Connectors Community       启用:    153
mysql-connectors-community-source  MySQL Connectors Community - Sou 禁用
mysql-tools-community/x86_64       MySQL Tools Community            启用:    110
mysql-tools-community-source       MySQL Tools Community - Source   禁用
mysql-tools-preview/x86_64         MySQL Tools Preview              禁用
mysql-tools-preview-source         MySQL Tools Preview - Source     禁用
mysql55-community/x86_64           MySQL 5.5 Community Server       禁用
mysql55-community-source           MySQL 5.5 Community Server - Sou 禁用
mysql56-community/x86_64           MySQL 5.6 Community Server       禁用
mysql56-community-source           MySQL 5.6 Community Server - Sou 禁用
mysql57-community/x86_64           MySQL 5.7 Community Server       禁用
mysql57-community-source           MySQL 5.7 Community Server - Sou 禁用
mysql80-community/x86_64           MySQL 8.0 Community Server       启用:    177
mysql80-community-source           MySQL 8.0 Community Server - Sou 禁用
```

要安装最新GA系列的最新版本，不需要配置。要安装除最新GA系列之外的特定系列的最新版本，**请在运行安装命令之前，禁用最新GA系列的子存储库并启用特定系列的子存储库**

可以通过手动编辑`/etc/yum.repos.d/mysql-community.repo`来选择发布系列

```shell
vim /etc/yum.repos.d/mysql-community.repo
```

找到要配置的子存储库条目，并编辑`enabled`选项

指定`enabled=0`以「禁用」子存储库，或指定`enabled=1`以「启用」子存储库

例如，要安装MySQL 5.7，请确保对默认（MySQL 8.0）的子存储库条目`enabled=0`，对 5.7 系列的条目`enabled=1`

```
# Enable to use MySQL 5.7
[mysql57-community]
name=MySQL 5.7 Community Server
baseurl=http://repo.mysql.com/yum/mysql-5.7-community/el/7/$basearch/
# 启用
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql
```

```
[mysql80-community]
name=MySQL 8.0 Community Server
baseurl=http://repo.mysql.com/yum/mysql-8.0-community/el/7/$basearch/
# 禁用
enabled=0
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql
```

应该在任何时候「只为一个」发布系列启用子存储库，若启用「多个」发布系列的子存储库时，Yum将使用「最新」的系列

补充：由于MySQL的服务器在海外，使用官方下载链接时可能会很慢，甚至无法下载安装完成，在此可以选择使用[清华大学的MySQL镜像](https://mirror.tuna.tsinghua.edu.cn/help/mysql/)使用`https://mirrors.tuna.tsinghua.edu.cn/mysql/yum/mysql-5.7-community-el7-$basearch/`替换原服务器的`http://repo.mysql.com/yum/mysql-5.7-community/el/7/$basearch/`，即修改为

```
# Enable to use MySQL 5.7
[mysql57-community]
name=MySQL 5.7 Community Server
baseurl=https://mirrors.tuna.tsinghua.edu.cn/mysql/yum/mysql-5.7-community-el7-$basearch/
# 启用
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-mysql
```

通过运行以下命令并检查其输出，验证是否启用和禁用了正确的子存储库

```shell
yum repolist enabled | grep mysql
```

```
mysql-connectors-community/x86_64       MySQL Connectors Community           153
mysql-tools-community/x86_64            MySQL Tools Community                110
mysql57-community/x86_64                MySQL 5.7 Community Server           424
```

#### 安装 MySQL

通过以下命令安装MySQL

```shell
yum install mysql-community-server
```

这个安装包MySQL服务器(`mysql-community-server`)也包括了运行服务器所需的组件，包括：

1. 客户端包 `mysql-community-client`
2. 常见的错误消息和客户端/服务端的字符集 `mysql-community-common`
3. 共享客户端库 `mysql-community-libs`

#### 启动 MySQL

使用以下命令启动MySQL服务器

```shell
service mysqld start
```
若出现错误，建议「完全卸载」MySQL重装

```shell
rpm -qa | grep mysql
```

```shell
rpm -e --nodeps mysql-xxx
```

```shell
find / -name mysql
```

```shell
rm -rf xxx/mysql
```

可以使用以下命令检查MySQL服务器的状态

```shell
service mysqld status
```

设置`root`用户的临时密码并将其存储在错误日志文件中，要显示它使用以下命令

```shell
grep 'temporary password' /var/log/mysqld.log
```

使用生成的临时密码登录并为`root`用户帐户设置自定义密码

```shell
mysql -uroot -p
```

```sql
ALTER USER 'root'@'localhost' IDENTIFIED BY 'MyNewPass4!';
```

**注意**：`validate_password`是默认安装的，`validate_password`实现的默认密码策略要求密码至少包含

1. 1个大写字母
2. 1个小写字母
3. 1个数字
4. 1个特殊字符
5. 总密码长度至少为8个字符

#### 配置主机

配置只要是 `root` 用户及其密码，在任何主机上都能登录 MySQL 数据库

```mysql
grant all privileges on *.* to 'root'@'%' identified by 'Roger123;';
```

刷新权限

```mysql
flush privileges;
```

### Hive 元数据配置到 MySql

#### 驱动拷贝

解压 `mysql-connector-java-5.1.48.tar.gz ` 驱动包

```shell
tar -zxvf mysql-connector-java-5.1.48.tar.gz 
```

拷贝`mysql-connector-java-5.1.48`中的 `mysql-connector-java-5.1.48-bin.jar` 到 `/opt/module/hive-1.2.1/lib/`

```shell
cp mysql-connector-java-5.1.48-bin.jar /opt/module/hive-1.2.1/lib/
```

#### 配置 MetaStore 到 MySql

在`/opt/module/hive-1.2.1/conf` 目录下创建一个 `hive-site.xml`

```shell
cp hive-default.xml.template hive-site.xml
```

拷贝数据到 `hive-site.xml` 文件中

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- 连接URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop198:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <!-- 数据库驱动 -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <!-- 用户名 -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <!-- 密码 -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>Roger123;</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>
```

### HiveJDBC 访问

#### 启动 hiveserver2 服务

```shell
bin/hiveserver2
```

#### 启动 beeline

```shell
bin/beeline
```

#### 连接 hiveserver2

```shell
beeline> !connect jdbc:hive2://hadoop198:10000
Connecting to jdbc:hive2://hadoop198:10000
Enter username for jdbc:hive2://hadoop198:10000: root
Enter password for jdbc:hive2://hadoop198:10000:
```

### Hive 常用交互命令

```
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
```

1. `-e`不进入 hive 的交互窗口执行 sql 语句

   ```shell
   hive -e "select * from aa"
   ```

2. `-f`执行脚本中 sql 语句

   创建编辑 `aa.hql` 文件

   ```shell
   vim aa.hql
   ```

   ```sql
   select * from aa;
   ```

   ```shell
   hive -f aa.hql
   ```

   也可以将执行文件中的 sql 语句并将结果写入文件中

   ```shell
   hive -f aa.hql > /opt/module/hive-1.2.1/output/aa_result.txt
   ```

### Hive 其他命令操作

1. 在 Hive CLI 命令窗口中查看 HDFS 文件系统

   ```shell
   hive> dfs -ls /;
   ```

2. 在 Hive CLI 命令窗口中查看本地文件系统

   ```shell
   hive> ! ls /opt/module;
   ```

3. 查看在 Hive 中输入的所有历史命令

   进入到当前用户的根目录，查看`.hivehistory` 文件

   ```shell
   cat .hivehistory 
   ```

### Hive 常见属性配置

#### Hive 数据仓库位置配置

`default` 数据仓库的最原始位置是在 hdfs 上的：`/user/hive/warehouse` 路径下

在仓库目录下，没有对默认的数据库 `default` 创建文件夹。如果某张表属于 default 数据库，直接在数据仓库目录下创建一个文件夹

到 `hive-site.xml` 文件中修改 default 数据仓库原始位置

```xml
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>
```

#### 查询后信息显示配置

在 `hive-site.xml` 文件中添加如下配置信息，就可以实现显示**当前数据库**以及**查询表的头信息**配置

```xml
<!-- 查询表的头信息 -->
<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
</property>
<!-- 显示当前数据库 -->
<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>
```

```sql
hive (default)> select * from aa;
```

```
aa.id
1
2
3
4
5
6
```

此时即显示了当前数据库`default`以及查询表头信息`aa.id`

#### Hive 运行日志信息配置

Hive 的 log 默认存放在`/tmp/root/hive.log` 目录下，`root`为当前用户名

修改 hive 的 log 存放日志到 `/opt/module/hive-1.2.1/logs`

1. 修改 `conf/hive-log4j.properties.template` 文件名称为 `hive-log4j.properties`

   ```shell
   cp hive-log4j.properties.template hive-log4j.properties
   ```

2. 在 `hive-log4j.properties` 文件中修改 log 存放位置

   ```properties
   hive.log.dir=/opt/module/hive-1.2.1/logs
   ```

#### 参数配置方式

查看当前所有的配置信息

```sql
hive> set;
```

参数的配置三种方式

1. **配置文件方式**

   默认配置文件：`hive-default.xml`

   用户自定义配置文件：`hive-site.xml`

   **注意**：用户自定义配置会覆盖默认配置。另外，Hive 也会读入 Hadoop 的配置，因为 Hive 是作为 Hadoop 的客户端启动的，Hive 的配置会覆盖 Hadoop 的配置。配置文件的设定对本机启动的所有 Hive 进程都有效

2. **命令行参数方式**

   启动 Hive 时，可以在命令行添加`-hiveconf param=value` 来设定参数，例如：

   查看`mapred.reduce.tasks`参数值

   ```sql
   hive (default)> set mapred.reduce.tasks;
   ```

   ```
   mapred.reduce.tasks=-1
   ```

   启动时设定该参数值

   ```shell
   hive -hiveconf mapred.reduce.tasks=10
   ```

   此时再查看参数值

   ```sql
   hive (default)> set mapred.reduce.tasks;
   ```

   ```
   mapred.reduce.tasks=10
   ```

   **注意**：该方法仅对本次 Hive 启动有效

3. 参数声明方式

   可以在 HQL 中使用 SET 关键字设定参数，例如：

   ```sql
   hive (default)> set mapred.reduce.tasks=100;
   ```

   查看参数设置

   ```sql
   hive (default)> set mapred.reduce.tasks;
   ```
   
```
   mapred.reduce.tasks=100
   ```
   
   **注意**：该方法仅对本次 Hive 启动有效

上述三种设定方式的优先级「依次递增」，即`配置文件 < 命令行参数 < 参数声明`

注意某些「系统级」的参数，例如 log4j 相关的设定，必须用前两种方式设定，因为那些参数的读取在会话建立以前已经完成了

##  Hive 数据类型

### 基本数据类型

| Hive 数据类型 | Java 数据类型 | 长度                      | 例子                                 |
| ------------- | ------------- | ------------------------- | ------------------------------------ |
| TINYINT       | byte          | 1byte 有符号整数          | 20                                   |
| SMALINT       | short         | 2byte 有符号整数          | 20                                   |
| INT           | int           | 4byte 有符号整数          | 20                                   |
| BIGINT        | long          | 8byte 有符号整数          | 20                                   |
| BOOLEAN       | boolean       | 布尔类型，true 或者 false | TRUE FALSE                           |
| FLOAT         | float         | 单精度浮点数              | 3.14159                              |
| DOUBLE        | double        | 双精度浮点数              | 3.14159                              |
| STRING        | string        | 字符系列                  | 'now is the time' "for all good men" |
| TIMESTAMP     |               | 时间类型                  |                                      |
| BINARY        |               | 字节数组                  |                                      |

对于 Hive 的 `string` 类型相当于数据库的 `varchar `类型，该类型是一个**可变的字符串**，不过它不能声明其中最多能存储多少个字符，理论上它可以存储 2GB 的字符数

### 集合数据类型

Hive 有三种复杂数据类型 `ARRAY`、`MAP` 和 `STRUCT`

ARRAY 和 MAP 与 Java 中的Array 和 Map 类似，而 STRUCT 与 C 语言中的 Struct 类似，它封装了一个命名字段集合，复杂数据类型允许任意层次的嵌套

#### 案例演示

假设某表有如下一行，用 JSON 格式来表示其数据结构

```json
{
    "name":"songsong",
    "friends":[
        "bingbing",
        "lili"
    ],
    "children":{
        "xiao song":18,
        "xiaoxiao song":19
    },
    "address":{
        "street":"hui long guan",
        "city":"beijing"
    }
}
```

基于上述数据结构，在 Hive 里创建对应的表，并导入数据

创建本地测试文件 `test.txt`

```
songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing
```

MAP、STRUCT 和 ARRAY 里的「元素间关系」都可以用同一个字符表示，这里用`_`

「不同字段」间使用`,`分隔

MAP 中的 key 与 value 用`:`分隔

Hive 上创建测试表 `test`

```sql
hive (default)> create table test(
name string,
friends array<string>,
children map<string, int>,
address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';
```

字段解释

| 语句                                            | 作用                                |
| ----------------------------------------------- | ----------------------------------- |
| `row format delimited fields terminated by ','` | 列分隔符                            |
| `collection items terminated by '_'`            | MAP、STRUCT 和 ARRAY 元素间的分隔符 |
| `map keys terminated by ':'`                    | MAP 中的 key 与 value 的分隔符      |
| `lines terminated by '\n'`                      | 行分隔符                            |

导入文本数据到测试表

```sql
hive (default)> load data local inpath "/opt/module/hive-1.2.1/input/test.txt" into table test;
```

访问三种集合列里的数据，分别是 ARRAY，MAP，STRUCT 的访问方式

```sql
hive (default)> select friends[1], children['xiao song'], address.city from test where name="songsong";
```

```
_c0     _c1     city
lili    18      beijing
```

### 类型转化

Hive 的原子数据类型是可以进行隐式转换的，类似于 Java 的类型转换，例如某表达式使用 INT 类型，TINYINT 会自动转换为 INT 类型，但是 Hive 不会进行反向转化，例如，某表达式使用 TINYINT 类型，INT 不会自动转换为 TINYINT 类型，它会返回错误，除非使用 `CAST` 操作（**强转**）

**隐式类型转换规则如下**

1. 任何整数类型都可以隐式地转换为一个范围更广的类型，如 TINYINT 可以转换 成 INT，INT 可以转换成 BIGINT
2. 所有整数类型、FLOAT 和 STRING 类型（注意此时的string型的内容必须符合double类型的规则，如"11.11"则可以顺利转换，"aa.11"则无法顺利转换）都可以隐式地转换成 DOUBLE
3. TINYINT、SMALLINT、INT 都可以转换为 FLOAT
4. BOOLEAN 类型不可以转换为任何其它的类型

**可以使用 `CAST` 操作显示进行数据类型转换**

例如使用 `CAST('1' AS INT)` 将把字符串`'1'` 转换成整数 `1`

```sql
hive (default)> select cast('1' as int);
```

```
_c0
1
```

如果强制类型转换失败，如执行 `CAST('X' AS INT)`，表达式返回空值 `NULL`

```sql
hive (default)> select cast('x' as int);
```

```
_c0
NULL
```

## DDL 数据定义

### 创建数据库

创建一个数据库，数据库在 HDFS 上的默认存储路径是`/user/hive/warehouse/*.db`

```sql
hive (default)> create database hive;
```

此时的数据库路径为`/user/hive/warehouse/hive.db`

创建一个数据库，指定数据库在 HDFS 上存放的位置

```sql
hive (hive)> create database hive2 location '/hive2.db';
```

此时HDFS的根路径下就「创建」数据库hive2的存放位置

```
/hive2.db
```

也可以指定一个「已存在的路径」作为数据库存放位置，这样不会额外再创建新目录，而是直接使用该路径作为存放位置

### 查询数据库

#### 显示数据库

显示数据库

```sql
hive (hive)> show databases;
```

过滤显示查询的数据库

```sql
hive (hive)> show databases like 'hive*';
```

#### 查看数据库详情

显示数据库信息

```sql
hive (hive)> desc database hive;
```

```
db_name comment location        owner_name      owner_type      parameters
hive            hdfs://hadoop198:9000/user/hive/warehouse/hive.db       root    USER
```

显示数据库详细信息

```sql
hive (hive)> desc database extended hive;
```

```
db_name comment location        owner_name      owner_type      parameters
hive            hdfs://hadoop198:9000/user/hive/warehouse/hive.db       root    USER
```

#### 切换当前数据库

```sql
hive (hive)> use hive2;
```

### 修改数据库

用户可以使用 `ALTER DATABASE` 命令为某个数据库的 `DBPROPERTIES` 设置「键值对」属性值，来描述这个数据库的属性信息

数据库的其他元数据信息都是不可更改的，包括**数据库名和数据库所在的目录位置**

```sql
hive (hive)> alter database hive set dbproperties('createtime'='20170830');
```

```sql
hive (hive)> desc database extended hive;
```

```
db_name comment location        owner_name      owner_type      parameters
hive            hdfs://hadoop198:9000/user/hive/warehouse/hive.db       root    USER    {createtime=20170830}
```

### 删除数据库

删除空数据库

```sql
hive (hive)> drop database hive2;
```

如果删除的数据库不存在，最好采用 `if exists` 判断数据库是否存在

```sql
hive (hive)> drop database if exists hive2;
```

如果数据库不为空，可以采用 `cascade` 命令，强制删除

```sql
hive (hive)> drop database hive cascade;
```

### 创建表

建表语法

```
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
```

字段解释说明

1. `CREATE TABLE` 创建一个指定名字的表，如果相同名字的表已经存在，则抛出异常

   可以用 `IF NOT EXISTS` 选项来忽略这个异常

2. `EXTERNAL` 关键字可以创建一个「外部表」，在建表的同时指定一个指向实际数据的路径（`LOCATION`）

   Hive 创建「内部表」时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变

   在删除表的时候，「内部表」的**元数据和数据会被一起删除**，而「外部表」**只删除元数据不删除数据**

3. `COMMENT` 为表和列添加注释

4. `PARTITIONED BY` 创建分区表

5. `CLUSTERED BY` 创建分桶表

6. `SORTED BY` 

7. `ROW FORMAT`

   在建表的时候可以自定义 `SerDe` 或者使用自带的 `SerDe`

   如果没有指定 `ROW FORMAT` 或者 `ROW FORMAT DELIMITED`，将会使用自带的 SerDe。在建表的时候还需要为表指定列，用户在指定表的列的同时也会指定自定义的 SerDe，Hive 通过 SerDe确定表的具体的列的数据

   `SerDe` 是 `Serialize/Deserilize` 的简称，目的是用于**序列化和反序列化**

8. `STORED AS` 指定存储文件类型

   常用的存储文件类型：`SEQUENCEFILE`（二进制序列文件）、`TEXTFILE`（文本）、`RCFILE`（列式存储格式文件）

   如果文件数据是**纯文本**，可以使用 `STORED AS TEXTFILE`，如果**数据需要压缩**，使用 `STORED AS SEQUENCEFILE`

9. `LOCATION` 指定表在 HDFS 上的存储位置

10. `LIKE `允许用户复制现有的表结构，但是不复制数据

#### 管理表

##### 概述

「默认」创建的表都是所谓的管理表，有时也被称为「内部表」。因为这种表 Hive 会（或多或少地）控制着数据的生命周期

Hive 默认情况下会将这些表的数据存储在由配置项`hive.metastore.warehouse.dir`(例如，`/user/hive/warehouse`)所定义的目录的子目录下，**当删除一个管理表时，Hive 也会删除这个表中数据**

管理表不适合和其他工具共享数据

##### 实际操作

普通创建表

```sql
hive (default)> create table if not exists student2(
id int,
name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student2';
```

根据「查询结果」创建表，查询的结果会添加到新创建的表中

```sql
hive (default)> create table if not exists student3 as select id, name from student;
```

根据已经存在的表结构创建表

```sql
hive (default)> create table if not exists student4 like student;
```

查询表的类型

```sql
hive (default)> desc formatted student2;
```

#### 外部表

##### 概述

因为表是外部表，所以 Hive 并非认为其完全拥有这份数据。**删除该表并不会删除掉这份数据，不过描述表的元数据信息会被删除掉**

##### 管理表和外部表的使用场景

每天将收集到的网站日志定期流入 HDFS 文本文件。在外部表（原始日志表）的基础上做大量的统计分析，用到的中间表、结果表使用内部表存储，数据通过 `SELECT`和`INSERT`进入内部表

##### 实际操作

分别创建部门和员工外部表，并向表中导入数据

**原始数据**

`dept.txt`

```
10	ACCOUNTING	1700
20	RESEARCH	1800
30	SALES	1900
40	OPERATIONS	1700
```

`emp.txt`

```
7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20
7839	KING	PRESIDENT		1981-11-17	5000.00		10
7844	TURNER	SALESMAN	7698	1981-9-8	1500.00	0.00	30
7876	ADAMS	CLERK	7788	1987-5-23	1100.00		20
7900	JAMES	CLERK	7698	1981-12-3	950.00		30
7902	FORD	ANALYST	7566	1981-12-3	3000.00		20
7934	MILLER	CLERK	7782	1982-1-23	1300.00		10
```

**建表语句**

创建部门表

```sql
hive (default)> create external table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
```

创建员工表

```sql
hive (default)> create external table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string, 
sal double, 
comm double,
deptno int
)
row format delimited fields terminated by '\t';
```

**查看创建的表**

```sql
hive (default)> show tables;
```

**向外部表中导入数据**

导入数据

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/dept.txt' into table dept;
```

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/emp.txt' into table emp;
```

查询结果

```sql
hive (default)> select * from dept;
```

```sql
hive (default)> select * from emp;
```

**查看表格式化数据**

```sql
hive (default)> desc formatted dept;
```

#### 管理表与外部表互相转换

查询表的类型

```sql
hive (default)> desc formatted dept;
```

```
Table Type:             EXTERNAL_TABLE
```

修改外部表为内部表，外部表 -> 内部表

```sql
hive (default)> alter table dept set tblproperties('EXTERNAL'='FALSE');
```

```
Table Type:             MANAGED_TABLE 
```

修改内部表为外部表，内部表 -> 外部表

```sql
hive (default)> alter table dept set tblproperties('EXTERNAL'='TRUE');
```

```
Table Type:             EXTERNAL_TABLE 
```

**注意**：`('EXTERNAL'='TRUE')`和`('EXTERNAL'='FALSE')`为固定写法，**区分大小写**

### 分区表

#### 概述

分区表实际上就是对应一个 HDFS 文件系统上的独立的文件夹，该文件夹下是该分区所有的数据文件。Hive 中的分区就是分目录，把一个大的数据集根据业务需要分割成小的数据集。在查询时通过 WHERE 子句中的表达式选择查询所需要的指定的分区，这样的查询效率会提高很多

#### 基本操作

引入分区表（需要根据日期对日志进行管理）

**创建分区表**

```sql
hive (default)> create table dept_partition(
deptno int, 
dname string, 
loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';
```

**加载数据到分区表中**

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/dept.txt' into table dept_partition partition(month='201709');
```

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/dept.txt' into table dept_partition partition(month='201708');
```

**查询分区表中数据**

单分区查询

```sql
hive (default)> select * from dept_partition where month='201709';
```

多分区联合查询

```sql
hive (default)> select * from dept_partition where month='201709'
union
select * from dept_partition where month='201708';
```

**增加分区**

增加单个分区

```sql
hive (default)> alter table dept_partition add partition(month='201706');
```

增加多个分区，用空格` `隔开

```sql
hive (default)> alter table dept_partition add partition(month='201705') partition(month='201704');
```

**删除分区**

删除单个分区

```sql
hive (default)> alter table dept_partition drop partition (month='201704');
```

删除多个分区，用逗号隔开`,`

```sql
hive (default)> alter table dept_partition drop partition (month='201705'), partition (month='201706');
```

**查看分区表中分区**

```sql
hive (default)> show partitions dept_partition;
```

**查看分区表结构**

```sql
hive (default)> desc formatted dept_partition;
```

#### 事项补充

**创建二级分区表**

```sql
hive (default)> create table dept_partition2(
deptno int, 
dname string, 
loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';
```

**正常的加载数据**

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/dept.txt' into table dept_partition2 partition(month='201709', day='13');
```

**查询分区数据**

```sql
hive (default)> select * from dept_partition2 where month='201709' and day='13';
```

**把数据直接上传到分区目录上，让分区表和数据产生关联的三种方式**

1. **方式一：上传数据后修复**

   上传数据

   ```sql
   hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
   ```

   ```shell
   hdfs dfs -put dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12
   ```

   创建表

   ```sql
   hive (default)> create table dept_partition2(
   deptno int, 
   dname string, 
   loc string
   )
   partitioned by (month string, day string)
   row format delimited fields terminated by '\t';
   ```

   执行修复命令

   ```sql
   hive (default)> msck repair table dept_partition2;
   ```

2. **方式二：上传数据后添加分区**

   上传数据

   ```sql
   hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
   ```

   ```shell
   hdfs dfs -put dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12
   ```

   创建表

   ```sql
   hive (default)> create table dept_partition2(
   deptno int, 
   dname string, 
   loc string
   )
   partitioned by (month string, day string)
   row format delimited fields terminated by '\t';
   ```

   执行添加分区

   ```sql
   hive (default)> alter table dept_partition2 add partition(month='201709', day='12');
   ```

3. **方式三：创建文件夹后 `load` 数据到分区**

   创建目录

   ```sql
   hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
   ```

   创建表

   ```sql
   hive (default)> create table dept_partition2(
   deptno int, 
   dname string, 
   loc string
   )
   partitioned by (month string, day string)
   row format delimited fields terminated by '\t';
   ```
   
   上传数据
   
   ```sql
   hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/dept.txt' into table dept_partition2 partition(month='201709',day='12');
   ```

### 修改表

#### 重命名表

**语法**

```sql
ALTER TABLE table_name RENAME TO new_table_name;
```

**示例**

```sql
hive (default)> alter table dept_partition2 rename to dept_partition3;
```

#### 增加/修改/替换列信息

**语法**

更新列

```sql
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name];
```

增加和替换列

```sql
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
```

**注意**：`ADD` 是代表新增字段，字段位置在所有列「后面」(分区列前)，`REPLACE `则是表示**替换表中所有字段**

**实例操作**

查询表结构

```sql
hive (default)> desc dept_partition;
```

添加列

```sql
hive (default)> alter table dept_partition add columns(deptdesc string);
```

更新列

```sql
hive (default)> alter table dept_partition change column deptdesc desc int;
```

替换列

```sql
hive (default)> alter table dept_partition replace columns(deptno string, dname string, loc string);
```

#### 删除表

```sql
hive (default)> drop table dept_partition;
```

## DML 数据操作

### 数据导入

#### 向表中装载数据 Load

**语法**

```sql
load data [local] inpath 'student.txt' [overwrite] into table student [partition (partcol1=val1,…)];
```

1. `load data`：表示加载数据
2. `local`：表示从「本地」加载数据到 Hive 表，否则从「HDFS」加载数据到 Hive 表
3. `inpath`：表示加载数据的路径
4. `overwrite`：表示「覆盖」表中已有数据，否则表示「追加」
5. `into table`：表示加载到哪张表
6. `student`：表示具体的表
7. `partition`：表示上传到指定分区

**实例操作**

创建一张表

```sql
hive (default)> create table student(id string, 
name string
) 
row format delimited fields terminated by '\t';
```

加载本地的文件

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/students.txt' into table student;
```

加载HDFS文件系统的文件

```sql
hive (default)> load data inpath 'input/students.txt' into table student;
```

加载数据「覆盖」表中已有的数据

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/students.txt' overwrite into table student;
```

#### 向表中插入数据 Insert

创建一张表

```sql
hive (default)> create table student(
id int, 
name string) 
partitioned by (month string) 
row format delimited fields terminated by '\t';
```

基本插入数据

```sql
hive (default)> insert into table student partition(month='201709') values(1, 'wangwu');
```

覆盖插入数据，根据单张表查询结果

```sql
hive (default)> insert overwrite table student parition(month='201709') select id, name from student where month='201710';
```

多插入模式，根据多张表查询结果

```sql
hive (default)> from student
insert overwrite table student partition(month='201707')
select id, name where month='201709'
insert overwrite table student partition(month='201706')
select id, name where month='201709';
```

#### 创建表并加载数据 As Select

根据查询结果创建表，查询的结果会添加到新创建的表中

```sql
hive (default)> create table if not exists student3 as select id, name from student;
```

#### 创建表并指定加载数据路径 Location

创建目录

```shell
hdfs dfs -mkdir -p /user/hive/warehouse/student5;
```

上传数据到 hdfs 上 

```sql
hive (default)> dfs -put student.txt /user/hive/warehouse/student5;
```

创建表，并指定在 hdfs 上的位置

```sql
hive (default)> create table if not exists student5(
id int, 
name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student5';
```

####  Import 数据到指定 Hive 表

```sql
hive (default)> import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';
```

**注意**：先用 `export` 导出后，再将数据导入

### 数据导出

#### Insert 导出

将查询的结果导出到本地

```sql
hive (default)> insert overwrite local directory '/opt/module/hive-1.2.1/output/student' select * from student;
```

将查询的结果「格式化」导出到本地

```sql
hive (default)> insert overwrite local directory '/opt/module/hive-1.2.1/output/student2'
row format delimited fields terminated by '\t'
select * from student;
```

将查询的结果导出到 HDFS 上

```sql
hive (default)> insert overwrite directory 'student2'
row format delimited fields terminated by '\t'
select * from student;
```

#### Hadoop 命令导出

```shell
hdfs dfs -get /user/hive/warehouse/student/month=201709/000000_0 /opt/module/hive-1.2.1/output/student3.txt;
```

#### Hive Shell 命令导出

基本语法

```
hive -f|-e 执行语句或者脚本 > file
```

示例

```shell
hive -e 'select * from default.student;' > /opt/module/hive-1.2.1/output/student4.txt;
```

#### Export 导出到 HDFS

```sql
hive (default)> export table student to '/user/hive/warehouse/export/student';
```

### 清除表中数据

```sql
hive (default)> truncate table student;
```

**注意**：`Truncate` 只能删除「管理表」，不能删除「外部表」中数据

## 查询

### 基本查询

#### 全表/特定列查询

```sql
hive (default)> select * from emp;
```

```sql
hive (default)> select empno, ename from emp;
```

#### 列别名

查询名称和部门

```sql
hive (default)> select ename name, deptno dn from emp;
```

#### 算术运算符

| 运算符 | 描述              |
| ------ | ----------------- |
| A + B  | A 和 B 相加       |
| A - B  | A 减去 B          |
| A * B  | A 和 B 相乘       |
| A / B  | A 除以 B          |
| A % B  | A 对 B 取余       |
| A & B  | A 和 B 按位取与   |
| A \| B | A 和 B 按位取或   |
| A ^ B  | A 和 B 按位取异或 |
| ~A     | A 按位取反        |

查询出所有员工的薪水后加 1 显示

```sql
hive (default)> select sal+1 from emp;
```

#### 常用函数

| 函数  | 描述   |
| ----- | ------ |
| count | 总行数 |
| max   | 最大值 |
| min   | 最小值 |
| sum   | 总和   |
| avg   | 平均值 |

**求总行数**

```sql
hive (default)> select count(*) cnt from emp;
```

**求工资的最大值**

```sql
hive (default)> select max(sal) max_sal from emp;
```

### Limit 语句

典型的查询会返回多行数据，LIMIT 子句用于限制返回的行数

```sql
hive (default)> select * from emp limit 5;
```

### Where 语句

使用 WHERE 子句，将不满足条件的行过滤掉

WHERE 子句紧随 FROM 子句

#### 比较运算符

| 操作符                  | 数据类型     | 描述                                                         |
| ----------------------- | ------------ | ------------------------------------------------------------ |
| A = B                   | 基本数据类型 | 如果 A 等于 B 则返回 TRUE，反之返回 FALSE                    |
| A <=> B                 | 所有数据类型 | 安全等于，可比较 NULL                                        |
| A <> B, A != B          | 基本数据类型 | 如果 A 不等于 B，则返回 TRUE，反之返回 FALSE                 |
| A < B                   | 基本数据类型 | 如果 A 小于 B，则返回 TRUE，反之返回 FALSE                   |
| A <= B                  | 基本数据类型 | 如果 A 小于等于 B，则返回 TRUE，反之返回 FALSE               |
| A > B                   | 基本数据类型 | 如果 A 大于 B，则返回 TRUE，反之返回 FALSE                   |
| A >= B                  | 基本数据类型 | 如果 A 大于等于 B，则返回 TRUE，反之返回 FALSE               |
| A [NOT] BETWEEN B AND C | 基本数据类型 | 若 A ∈ [B, C] 则返回 TRUE，反之返回 FALSE，使用 NOT 则反效果 |
| A IS NULL               | 所有数据类型 | 如果 A 等于 NULL，则返回 TRUE，反之返回 FALSE                |
| A IS NOT NULL           | 所有数据类型 | 如果 A 不等于 NULL，则返回 TRUE，反之返回 FALSE              |
| A IN (B, C)             | 所有数据类型 | 如果 A 存在于列表中，则返回 TRUE，反之返回 FALSE             |
| A [NOT] LIKE B          | STRING 类型  | B 是一个 SQL 的简单正则表达式，如果 A 与其匹配的话，则返回 TRUE；反之返回 FALSE |
| A RLIKE B, A REGEXP B   | STRING 类型  | B 是一个正则表达式，如果 A 与其匹配，则返回 TRUE；反之返回 FALSE |

查询出薪水等于 5000 的所有员工

```sql
hive (default)> select * from emp where sal = 5000;
```

#### Like 和 RLike

使用 LIKE 运算选择类似的值，选择条件可以包含字符或数字

1. `%` 代表任意多个字符
2. `_` 代表1个字符

`RLIKE` 子句是 Hive 中这个功能的一个扩展，其可以通过 Java 的正则表达式这个更强大的语言来指定匹配条件

**查找薪水中含有 2 的员工信息**

```sql
hive (default)> select * from emp where sal like '%2%';
```

```sql
hive (default)> select * from emp where sal rlike '[2]';
```

#### 逻辑运算符

| 操作符 | 描述   |
| ------ | ------ |
| AND    | 逻辑并 |
| OR     | 逻辑或 |
| NOT    | 逻辑否 |

查询除了 20 部门和 30 部门以外的员工信息

```sql
hive (default)> select * from emp where deptno not in (30, 20);
```

### 分组

#### Group By 语句

GROUP BY 语句通常会和**聚合函数**一起使用，按照一个或者多个列队结果进行分组，然后对每个组执行聚合操作

**求每个部门的平均工资**

```sql
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno;
```

#### Having 语句

**having 与 where 不同点**

1. where 针对表中的列发挥作用，查询数据；having 针对查询结果中的列发挥作用，筛选数据
2. where 后面不能写聚合函数，而 having 后面可以使用聚合函数
3. having 只用于 group by 分组统计语句

**求每个部门的平均薪水大于 2000 的部门**

```sql
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;
```

### Join 语句

#### 等值 Join

Hive 支持通常的 SQL JOIN 语句，但是**只支持等值连接，不支持非等值连接**

根据员工表和部门表中的部门编号相等，查询员工编号、员工名称和部门名称

```sql
hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e 
join dept d on e.deptno = d.deptno;
```

#### 表别名

合并员工表和部门表

```sql
hive (default)> select e.empno, e.ename, d.deptno from emp e 
join dept d on e.deptno = d.deptno;
```

#### 内连接

只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来

```sql
hive (default)> select e.empno, e.ename, d.deptno from emp e 
join dept d on e.deptno = d.deptno;
```

#### 左外连接

JOIN 操作符「左边」表中符合 WHERE 子句的所有记录将会被返回

```sql
hive (default)> select e.empno, e.ename, d.deptno from emp e 
left join dept d on e.deptno = d.deptno;
```

#### 右外连接

JOIN 操作符「右边」表中符合 WHERE 子句的所有记录将会被返回

```sql
hive (default)> select e.empno, e.ename, d.deptno from emp e 
right join dept d on e.deptno = d.deptno;
```

#### 满外连接

将会返回所有表中符合 WHERE 语句条件的所有记录。如果任一表的指定字段没有符合条件的值的话，那么就使用 NULL 值替代

```sql
hive (default)> select e.empno, e.ename, d.deptno from emp e 
full join dept d on e.deptno = d.deptno;
```

#### 多表连接

连接 `n` 个表，至少需要 `n-1` 个连接条件

**创建位置表**

```sql
hive (default)> create table if not exists location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';
```

**导入数据**

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/location.txt' into table location;
```

**多表连接查询**

```sql
hive (default)> select e.ename, d.deptno, l.loc_name from emp e 
join dept d on d.deptno = e.deptno
join location l on d.loc = l.loc;
```

大多数情况下，Hive 会对每对 JOIN 连接对象启动一个 MapReduce 任务

本例中会首先启动一个 MapReduce Job 对`表e` 和`表d` 进行连接操作

然后会再启动一个 MapReduce Job 将第一个 MapReduce Job 的`输出结果集`和`表l` 进行连接操作

Hive 总是按照**从左到右**的顺序执行的

#### 笛卡尔积

```sql
hive (default)> select empno, dname from emp, dept;
```

#### 连接谓词不支持 or

### 排序

#### 全局排序 Order By

全局排序，只使用一个 Reducer

**使用 ORDER BY 子句排序**

1. ASC（ascend）: 升序，默认
2. DESC（descend）: 降序

ORDER BY 子句在 SELECT 语句的结尾

**示例操作**

查询员工信息按工资升序排列

```sql
hive (default)> select * from emp order by sal;
```

查询员工信息按工资降序排列

```sql
hive (default)> select * from emp order by sal desc;
```

#### 按照别名排序

按照员工薪水的 2 倍排序

```sql
hive (default)> select ename, sal*2 twosal from emp order by twosal;
```

#### 多个列排序

按照部门和工资升序排序

```sql
hive (default)> select ename, deptno, sal from emp order by deptno, sal;
```

#### 内部排序 Sort By

每个 Reducer 内部进行排序，对全局结果集来说不是排序

首先设置 reduce 个数

```sql
hive (default)> set mapreduce.job.reduces=3;
```

查看设置 reduce 个数

```sql
hive (default)> set mapreduce.job.reduces;
```

根据部门编号降序查看员工信息

```sql
hive (default)> select * from emp sort by empno desc;
```

将查询结果导入到文件中，按照部门编号降序排序

```sql
hive (default)> insert overwrite local directory '/opt/module/hive-1.2.1/output/sortby-result'
row format delimited fields terminated by '\t'
select * from emp sort by deptno desc;
```

#### 分区排序 Distribute By

类似 MR 中 partition 进行分区，结合 sort by 使用

Hive 要求 DISTRIBUTE BY 语句要写在 SORT BY 语句「之前」

对于 distribute by 进行测试，一定要**分配多个 reduce 进行处理**，否则无法看到 distribute by 的效果

**示例操作**

设置 reduce 个数

```sql
hive (default)> set mapreduce.job.reduces=3;
```

先按照部门编号分区，再按照员工编号降序排序

```sql
hive (default)> insert overwrite local directory '/opt/module/hive-1.2.1/output/distribute-result'
row format delimited fields terminated by '\t'
select * from emp distribute by deptno sort by empno desc;
```

####  Cluster By

当 distribute by 和 sorts by 字段相同时，可以使用 cluster by 方式

cluster by 除了具有 distribute by 的功能外还兼具 sort by 的功能。但是排序只能是「升序」排序，**不能指定排序规则为 ASC 或者 DESC**

```sql
hive (default)> select * from emp cluster by deptno;
```

等价于

```sql
hive (default)> select * from emp distribute by deptno sort by deptno;
```

按照部门编号分区，不一定就是固定死的数值，可以是 20 号和 30 号部门分到一个分区里面去

### 分桶及抽样查询

#### 分桶表数据存储

分区针对的是数据的「存储路径」，分桶针对的是「数据文件」

分区提供一个隔离数据和优化查询的便利方式。不过，并非所有的数据集都可形成合理的分区，特别是要确定合适的划分大小这个疑虑

分桶是将数据集分解成更容易管理的若干部分的另一个技术

**创建分桶表**

```sql
hive (default)> create table loc_buck(loc int, loc_name string)
clustered by(loc) into 3 buckets
row format delimited fields terminated by '\t';
```

**查看表结构**

```sql
hive (default)> desc formatted loc_buck;
```

```
Num Buckets:            3
```

**属性设置**

```sql
hive (default)> set hive.enforce.bucketing=true;
```

```sql
hive (default)> set mapreduce.job.reduces=-1;
```

**导入数据到分桶表中**

```sql
hive (default)> insert into table loc_buck
select loc, loc_name from location;
```

#### 分桶抽样查询

对于非常大的数据集，有时需要使用的是一个具有代表性的查询结果而不是全部结果。Hive 可以通过对表进行「抽样」来满足这个需求

抽样查询表 loc_buck 中的数据

```sql
hive (default)> select * from loc_buck tablesample(bucket 1 out of 3 on loc);
```

`tablesample` 是抽样语句，语法为 `TABLESAMPLE(BUCKET x OUT OF y)`

`y` 必须是数据表总桶数的**倍数或者因子**，hive 根据 y 的大小，决定抽样的比例

例如，数据表总共分了 `4` 桶

当 `y = 2` 时，抽取 `4 / 2 = 2` 个桶的数据

当 `y = 8` 时，抽取 `4 / 8 =  1/2` 个桶的数据

`x` 表示从哪个桶开始抽取，如果需要取多个桶，以后的桶号为当前桶号加上 `y`

例如，数据表总桶数为 `4`，`tablesample(bucket 1 out of 2)`

表示总共抽取 `4 / 2 = 2` 个桶的数据，抽取第 `1(x)` 个和第 `3(x + y)` 个桶的数据

**注意**：x 的值必须小于等于 y 的值

### 其他常用查询函数

#### 空字段赋值

**函数说明**

NVL 给值为 NULL 的数据赋值

语法为 `NVL(string1, replace_with)`

功能为如果 string1 为 NULL，则 NVL 函数返回 replace_with 的值，否则返回 string1 的值，如果两个参数都为 NULL ，则返回 NULL

**如果员工的 comm 为 NULL，则用 -1 代替**

```sql
hive (default)> select nvl(comm,-1) from emp;
```

#### 时间类

`date_format` 格式化时间

```sql
hive (default)> select date_format('2019-6-29','yyyy-MM-dd');
```

`date_add` 时间跟天数相加

```sql
hive (default)> select date_add('2019-06-29', 5);
```

```sql
hive (default)> select date_add('2019-06-29', -5);
```

`datediff` 两个时间相减

```sql
hive (default)> select datediff('2019-06-29','2019-06-24');
```

字符替换，构造hive能识别的时间格式(`-`)

```sql
hive (default)> select regexp_replace('2019/07/04','/','-');
```

#### CASE WHEN

**数据准备**

```
悟空	A	男
大海	A	男
宋宋	B	男
凤姐	A	女
婷姐	B	女
婷婷	B	女
```

求出不同部门男女各多少人

**创建 hive 表**

```sql
hive (default)> create table emp_sex(
name string, 
dept_id string, 
sex string) 
row format delimited fields terminated by "\t";
```

**加载导入数据**

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/emp_sex.txt' into table emp_sex;
```

**按需求查询数据**

```sql
hive (default)> select 
dept_id,
sum(case sex when '男' then 1 else 0 end) male_count,
sum(case sex when '女' then 1 else 0 end) female_count
from emp_sex group by dept_id;
```

使用if函数实现

```sql
hive (default)> select
dept_id,
sum(if(sex = '男', 1, 0)) male_count,
sum(if(sex = '女', 1, 0)) female_count
from emp_sex group by dept_id;
```

#### 转行

**相关函数说明**

1. `CONCAT(string A/col, string B/col, ...)`

   返回输入字符串连接后的结果，支持任意个输入字符串

2. `CONCAT_WS(separator, str1, str2, ...)`

   其是一个特殊形式的 `CONCAT()`

   第一个参数剩余参数间的分隔符，分隔符可以是与剩余参数一样的字符串

   如果分隔符是 NULL，返回值也将为 NULL

   这个函数会跳过分隔符参数后的任何 NULL 和空字符串，分隔符将被加到被连接的字符串之间

3. `COLLECT_SET(col)`

   函数只接受基本数据类型，它的主要作用是将某字段的值进行「去重」汇总，产生 array 类型字段

**数据准备**

```
孙悟空	白羊座	A
大海	射手座	A
宋宋	白羊座	B
猪八戒	白羊座	A
凤姐	射手座	A
```

**创建 hive 表**

```sql
hive (default)> create table person_info(
name string, 
constellation string, 
blood_type string) 
row format delimited fields terminated by "\t";
```

**导入数据**

```sql
hive (default)> load data local inpath "/opt/module/hive-1.2.1/input/person_info.txt" into table person_info;
```

**按需求查询数据**

```sql
hive (default)> select t1.base, concat_ws('|', collect_set(t1.name)) name
from (
select name, concat(constellation, ",", blood_type) base
from person_info) t1
group by t1.base;
```

```
t1.base name
射手座,A        大海|凤姐
白羊座,A        孙悟空|猪八戒
白羊座,B        宋宋
```

#### 转列

**相关函数说明**

1. `EXPLODE(col)`

   将 hive 一列中复杂的 array 或者 map 结构拆分成多行

2. `LATERAL VIEW`

   语法为 `LATERAL VIEW udtf(expression) tableAlias AS columnAlias`

   用于和 split, explode 等 UDTF 一起使用，它能够将一行数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合

**数据准备**

```
《疑犯追踪》	悬疑,动作,科幻,剧情
《Lie to me》	悬疑,警匪,动作,心理,剧情
《战狼 2》	战争,动作,灾难
```

**创建 hive 表**

```sql
hive (default)> create table movie_info(
movie string,
category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";
```

**导入数据**

```sql
hive (default)> load data local inpath "/opt/module/hive-1.2.1/input/movie.txt" into table movie_info;
```

**按需求查询数据**

```sql
hive (default)> select movie, category_name from movie_info
lateral view explode(category) table_tmp as category_name;
```

```
movie   category_name
《疑犯追踪》    悬疑
《疑犯追踪》    动作
《疑犯追踪》    科幻
《疑犯追踪》    剧情
《Lie to me》   悬疑
《Lie to me》   警匪
《Lie to me》   动作
《Lie to me》   心理
《Lie to me》   剧情
《战狼 2》      战争
《战狼 2》      动作
《战狼 2》      灾难
```

#### 窗口函数

**相关函数说明**

1. `OVER()`

   指定分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的变化而变化

2. `CURRENT ROW`

   当前行

3. `n PRECEDING`

   往前 n 行数据

4. `n FOLLOWING`

   往后 n 行数据

5. `UNBOUNDED`

   起点，`UNBOUNDED PRECEDING` 表示从前面的起点，`UNBOUNDED FOLLOWING` 表示到后面的终点

6. `LAG(col, n)`

   往前第 n 行数据

7. `LEAD(col, n)`

   往后第 n 行数据

8. `NTILE(n)`

   把有序分区中的行分发到指定数据的组中，各个组有编号，编号从 1 开始，对于每一行，NTILE 返回此行所属的组的编号

   注意：n 必须为 int 类型

9. `RANK`

   1. `RANK()` 排序相同时会重复，总数不会变
   2. `DENSE_RANK()` 排序相同时会重复，总数会减少
   3. `ROW_NUMBER()` 会根据顺序计算

**数据准备**

```
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
```

**创建 hive 表**

```sql
hive (default)> create table business(
name string,
orderdate string,
cost int
) 
row format delimited fields terminated by ',';
```

**导入数据**

```sql
hive (default)> load data local inpath "/opt/module/hive-1.2.1/input/business.txt" into table business;
```

**按需求查询数据**

查询在 2017 年 4 月份购买过的顾客

```sql
hive (default)> select name
from business where substring(orderdate,1,7) = '2017-04' group by name;
```

```
name
jack
mart
```

查询在 2017 年 4 月份购买过的顾客及其购买次数

```sql
hive (default)> select name, count(*)
from business where substring(orderdate,1,7) = '2017-04' group by name;
```

```
name    _c1
jack    1
mart    4
```

查询在 2017 年 4 月份购买过的顾客及总人数

```sql
hive (default)> select name, count(*) over ()
from business where substring(orderdate,1,7) = '2017-04' group by name;
```

```
name    count_window_0
mart    2
jack    2
```

查询顾客的购买明细及购买总额

```sql
hive (default)> select name, 
orderdate, 
cost, 
sum(cost) over() 
from business;
```

```
name    orderdate       cost    sum_window_0
mart    2017-04-13      94      661
neil    2017-06-12      80      661
mart    2017-04-11      75      661
neil    2017-05-10      12      661
mart    2017-04-09      68      661
mart    2017-04-08      62      661
jack    2017-01-08      55      661
tony    2017-01-07      50      661
jack    2017-04-06      42      661
jack    2017-01-05      46      661
tony    2017-01-04      29      661
jack    2017-02-03      23      661
tony    2017-01-02      15      661
jack    2017-01-01      10      661
```

查询顾客的购买明细及月购买总额

```sql
hive (default)> select name, 
orderdate, 
cost, 
sum(cost) over(partition by month(orderdate)) 
from business;
```

```
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      205
jack    2017-01-08      55      205
tony    2017-01-07      50      205
jack    2017-01-05      46      205
tony    2017-01-04      29      205
tony    2017-01-02      15      205
jack    2017-02-03      23      23
mart    2017-04-13      94      341
jack    2017-04-06      42      341
mart    2017-04-11      75      341
mart    2017-04-09      68      341
mart    2017-04-08      62      341
neil    2017-05-10      12      12
neil    2017-06-12      80      80
```

查询顾客的购买明细以及将 cost 按照日期进行累加

```sql
hive (default)> select name, 
orderdate, 
cost, 
sum(cost) over(partition by name order by orderdate) 
from business;
```

```
name    orderdate       cost    sum_window_0
jack    2017-01-01      10      10
jack    2017-01-05      46      56
jack    2017-01-08      55      111
jack    2017-02-03      23      134
jack    2017-04-06      42      176
mart    2017-04-08      62      62
mart    2017-04-09      68      130
mart    2017-04-11      75      205
mart    2017-04-13      94      299
neil    2017-05-10      12      12
neil    2017-06-12      80      92
tony    2017-01-02      15      15
tony    2017-01-04      29      44
tony    2017-01-07      50      94
```

查询所有订单信息并按时间顺序划分成5份分区

```sql
hive (default)> select name,
orderdate,
cost, 
ntile(5) over(order by orderdate) sorted
from business;
```

```
name    orderdate       cost    sorted
jack    2017-01-01      10      1
tony    2017-01-02      15      1
tony    2017-01-04      29      1
jack    2017-01-05      46      2
tony    2017-01-07      50      2
jack    2017-01-08      55      2
jack    2017-02-03      23      3
jack    2017-04-06      42      3
mart    2017-04-08      62      3
mart    2017-04-09      68      4
mart    2017-04-11      75      4
mart    2017-04-13      94      4
neil    2017-05-10      12      5
neil    2017-06-12      80      5
```

查询前 20% 时间的订单信息

```sql
hive (default)> select * from (
select name,
orderdate,
cost, 
ntile(5) over(order by orderdate) sorted
from business
) t
where sorted = 1;
```

```
t.name  t.orderdate     t.cost  t.sorted
jack    2017-01-01      10      1
tony    2017-01-02      15      1
tony    2017-01-04      29      1
```

#### RANK

**函数说明**

1. `RANK()` 排序相同时会重复，总数不会变
2. `DENSE_RANK()` 排序相同时会重复，总数会减少
3. `ROW_NUMBER()` 会根据顺序计算

**数据准备**

```
孙悟空,语文,87
孙悟空,数学,95
孙悟空,英语,68
大海,语文,94
大海,数学,56
大海,英语,84
宋宋,语文,64
宋宋,数学,86
宋宋,英语,84
婷婷,语文,65
婷婷,数学,85
婷婷,英语,78
```

**创建 hive 表**

```sql
hive (default)> create table score(
name string,
subject string, 
score int
)
row format delimited fields terminated by ",";
```

**导入数据**

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/score.txt' into table score;
```

**按需求查询数据**

```sql
hive (default)> select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
```

```
name    subject score   rp      drp     rmp
孙悟空  数学    95      1       1       1
宋宋    数学    86      2       2       2
婷婷    数学    85      3       3       3
大海    数学    56      4       4       4
宋宋    英语    84      1       1       1
大海    英语    84      1       1       2
婷婷    英语    78      3       2       3
孙悟空  英语    68      4       3       4
大海    语文    94      1       1       1
孙悟空  语文    87      2       2       2
婷婷    语文    65      3       3       3
宋宋    语文    64      4       4       4
```

#### 案例练习

##### 题目一

如下的用户访问数据

| userId | visitDate | visitCount |
| ------ | --------- | ---------- |
| u01    | 2017/1/21 | 5          |
| u02    | 2017/1/23 | 6          |
| u03    | 2017/1/22 | 8          |
| u04    | 2017/1/20 | 3          |
| u01    | 2017/1/23 | 6          |
| u01    | 2017/2/21 | 8          |
| u02    | 2017/1/23 | 6          |
| u01    | 2017/2/22 | 4          |

要求使用SQL统计出每个用户的累积访问次数，如下表所示

| 用户id | 月份    | 小计 | 累积 |
| ------ | ------- | ---- | ---- |
| u01    | 2017-01 | 11   | 11   |
| u01    | 2017-02 | 12   | 23   |
| u02    | 2017-01 | 12   | 12   |
| u03    | 2017-01 | 8    | 8    |
| u04    | 2017-01 | 3    | 3    |

**数据准备**

```
u01	2017/1/21	5
u02	2017/1/23	6
u03	2017/1/22	8
u04	2017/1/20	3
u01	2017/1/23	6
u01	2017/2/21	8
u02	2017/1/23	6
u01	2017/2/22	4
```

**创建hive表**

```sql
hive (default)> create table action(
userId string,
visitDate string,
visitCount int
) 
row format delimited fields terminated by "\t";
```

**加载数据**

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/action.txt' into table action;
```

**按需求查询数据**

```sql
hive (default)> select userId, date_format(regexp_replace(visitDate,'/','-'), 'yyyy-MM') ym, visitCount from action;
```

```
userid  ym      visitcount
u01     2017-01 5
u02     2017-01 6
u03     2017-01 8
u04     2017-01 3
u01     2017-01 6
u01     2017-02 8
u02     2017-01 6
u01     2017-02 4
```

```sql
hive (default)> select userid, ym, sum(visitcount) sum_visitcount
from (
         select userId, date_format(regexp_replace(visitDate, '/', '-'), 'yyyy-MM') ym, visitCount from action
     ) t1
group by userid, ym;
```

```
userid  ym      sum_visitcount
u01     2017-01 11
u01     2017-02 12
u02     2017-01 12
u03     2017-01 8
u04     2017-01 3
```

```sql
hive (default)> select userid,
       ym,
       sum_visitcount,
       sum(sum_visitcount) over (partition by userid order by ym) sum_2
from (
         select userid, ym, sum(visitcount) sum_visitcount
         from (
                  select userId, date_format(regexp_replace(visitDate, '/', '-'), 'yyyy-MM') ym, visitCount from action
              ) t1
         group by userid, ym
     ) t2;
```

```
userid  ym      sum_visitcount  sum_2
u01     2017-01 11      11
u01     2017-02 12      23
u02     2017-01 12      12
u03     2017-01 8       8
u04     2017-01 3       3
```

##### 题目二

有50W个店铺，每个顾客访问任何一个店铺的任何一个商品时都会产生一条访问日志

访问日志存储的表名为visit，访客的用户id为user_id，被访问的店铺名称为shop

**数据准备**

```
u1	a
u2	b
u1	b
u1	a
u3	c
u4	b
u1	a
u2	c
u5	b
u4	b
u6	c
u2	c
u1	b
u2	a
u2	a
u3	a
u5	a
u5	a
u5	a
```

**创建hive表**

```sql
hive (default)> create table visit(user_id string,
shop string
) 
row format delimited fields terminated by '\t';
```

**加载数据**

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/visit.txt' into table visit;
```

**按需求查询数据**

求每个店铺的访客数

**方法一**

```sql
hive (default)> select shop, count(distinct user_id) visitor_count from visit group by shop;
```

```
shop    visitor_count
a       4
b       4
c       3
```

**方法二**

```sql
hive (default)> select shop, user_id from visit group by shop, user_id;
```

```
shop    user_id
a       u1
a       u2
a       u3
a       u5
b       u1
b       u2
b       u4
b       u5
c       u2
c       u3
c       u6
```

```sql
hive (default)> select shop, count(*) visitor_count
from (
         select shop, user_id
         from visit
         group by shop, user_id
     ) t1
group by shop;
```

```
shop    visitor_count
a       4
b       4
c       3
```

每个店铺访问次数top3的访客信息，输出店铺名称、访客id、访问次数

```sql
hive (default)> select shop, user_id, count(*) count_visit from visit group by shop, user_id;
```

```
shop    user_id count_visit
a       u1      3
a       u2      2
a       u3      1
a       u5      3
b       u1      2
b       u2      1
b       u4      2
b       u5      1
c       u2      2
c       u3      1
c       u6      1
```

```sql
hive (default)> select shop,
       user_id,
       count_visit,
       row_number() over (partition by shop order by count_visit desc) rk
from (
         select shop, user_id, count(*) count_visit from visit group by shop, user_id
     ) t1;
```

```
shop    user_id count_visit     rk
a       u5      3       1
a       u1      3       2
a       u2      2       3
a       u3      1       4
b       u4      2       1
b       u1      2       2
b       u5      1       3
b       u2      1       4
c       u2      2       1
c       u6      1       2
c       u3      1       3
```

```sql
hive (default)> select shop, user_id, count_visit
from (
         select shop,
                user_id,
                count_visit,
                row_number() over (partition by shop order by count_visit desc) rk
         from (
                  select shop, user_id, count(*) count_visit from visit group by shop, user_id
              ) t1
     ) t2
where rk <= 3;
```

```
shop    user_id count_visit
a       u5      3
a       u1      3
a       u2      2
b       u4      2
b       u1      2
b       u5      1
c       u2      2
c       u6      1
c       u3      1
```

### 蚂蚁森林案例

#### 背景说明

**表名**：`user_low_carbon`

记录了用户每天的蚂蚁森林低碳生活领取的记录流水

**字段说明**

1. `user_id`：用户
2. `data_dt`：日期
3. `low_carbon`：减少碳排放（g）

**数据准备**

```
u_001	2017/1/1	10
u_001	2017/1/2	150
u_001	2017/1/2	110
u_001	2017/1/2	10
u_001	2017/1/4	50
......
```

**表名**：`plant_carbon`

蚂蚁森林植物换购表，用于记录申领环保植物所需要减少的碳排放量

**字段说明**

1. `plant_id`：植物编号
2. `plant_name`：植物名
3. `low_carbon`：换购植物所需要的碳

**数据准备**

```
p001	梭梭树	17
p002	沙柳	19
p003	樟子树	146
p004	胡杨	215
```

**创建hive表**

```sql
hive (default)> create table user_low_carbon(user_id String,
data_dt String,
low_carbon int
) 
row format delimited fields terminated by '\t';
```

```sql
hive (default)> create table plant_carbon(plant_id string,
plant_name String,
low_carbon int
) 
row format delimited fields terminated by '\t';
```

**导入数据**

```sql
hive (default)> load data local inpath "/opt/module/hive-1.2.1/input/user_low_carbon.txt" into table user_low_carbon;
```

```sql
hive (default)> load data local inpath "/opt/module/hive-1.2.1/input/plant_carbon.txt" into table plant_carbon;
```

**设置本地模式**

```sql
hive (default)> set hive.exec.mode.local.auto=true;
```

#### 查询需求

##### 需求一

**蚂蚁森林植物申领统计**

假设2017年1月1日开始记录低碳数据（`user_low_carbon`），假设2017年10月1日之前满足申领条件的用户都申领了一颗`p004-胡杨`，剩余的能量全部用来领取`p002-沙柳`

统计在10月1日累计申领`p002-沙柳`排名前10的用户信息以及他比后一名多领了几颗沙柳

步骤一：统计每个用户截止到2017-10-01日期总低碳量

```sql
hive (default)> select user_id, sum(low_carbon) sum_carbon
from user_low_carbon
where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
group by user_id;
```

```
user_id sum_carbon
u_001   475
u_002   659
u_003   620
u_004   640
u_005   1100
u_006   830
u_007   1470
u_008   1240
u_009   930
u_010   1080
u_011   960
u_012   250
u_013   1430
u_014   1060
u_015   290
```

步骤二：取出胡杨和沙柳的能量

```sql
hive (default)> select low_carbon from plant_carbon where plant_id='p004';
```

```sql
hive (default)> select low_carbon from plant_carbon where plant_id='p002';
```

步骤三：计算每个人申领沙柳的棵数

```sql
hive (default)> select user_id,
       floor((t1.sum_carbon - t2.low_carbon) / t3.low_carbon) count_p002
from (
         select user_id, sum(low_carbon) sum_carbon
         from user_low_carbon
         where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
         group by user_id
     ) t1,
     (
         select low_carbon
         from plant_carbon
         where plant_id = 'p004'
     ) t2,
     (
         select low_carbon
         from plant_carbon
         where plant_id = 'p002'
     ) t3;
```

```
user_id count_p002
u_001   13
u_002   23
u_003   21
u_004   22
u_005   46
u_006   32
u_007   66
u_008   53
u_009   37
u_010   45
u_011   39
u_012   1
u_013   63
u_014   44
u_015   3
```

步骤四：按照申领沙柳棵数排序，并将下一行数据中的count_p002放置当前行

```sql
hive (default)> select user_id,
       count_p002,
       lead(count_p002, 1) over (order by count_p002 desc) lead_1_p002
from (
         select user_id,
                floor((t1.sum_carbon - t2.low_carbon) / t3.low_carbon) count_p002
         from (
                  select user_id, sum(low_carbon) sum_carbon
                  from user_low_carbon
                  where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
                  group by user_id
              ) t1,
              (
                  select low_carbon
                  from plant_carbon
                  where plant_id = 'p004'
              ) t2,
              (
                  select low_carbon
                  from plant_carbon
                  where plant_id = 'p002'
              ) t3
     ) t4
limit 10;
```

```
user_id count_p002      lead_1_p002
u_007   66      63
u_013   63      53
u_008   53      46
u_005   46      45
u_010   45      44
u_014   44      39
u_011   39      37
u_009   37      32
u_006   32      23
u_002   23      22
```

步骤五：求相差的沙柳棵数

```sql
hive (default)> select user_id,
       count_p002,
       (count_p002 - lead_1_p002) diff_count
from (
         select user_id,
                count_p002,
                lead(count_p002, 1) over (order by count_p002 desc) lead_1_p002
         from (
                  select user_id,
                         floor((t1.sum_carbon - t2.low_carbon) / t3.low_carbon) count_p002
                  from (
                           select user_id, sum(low_carbon) sum_carbon
                           from user_low_carbon
                           where date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') < '2017-10-01'
                           group by user_id
                       ) t1,
                       (
                           select low_carbon
                           from plant_carbon
                           where plant_id = 'p004'
                       ) t2,
                       (
                           select low_carbon
                           from plant_carbon
                           where plant_id = 'p002'
                       ) t3
              ) t4
         limit 10
     ) t5
order by count_p002 desc;
```

```
user_id count_p002      diff_count
u_007   66      3
u_013   63      10
u_008   53      7
u_005   46      1
u_010   45      1
u_014   44      5
u_011   39      2
u_009   37      5
u_006   32      9
u_002   23      1
```

##### 需求二

**蚂蚁森林低碳用户排名分析**

查询user_low_carbon表中每日流水记录

条件为用户在2017年，连续三天（或以上）的天数里，每天减少碳排放（low_carbon）都超过100g的用户低碳流水

需要查询返回满足以上条件的user_low_carbon表中的记录流水

**解法一**

步骤一：过滤出2017年且单日低碳量超过100g

```sql
hive (default)> select user_id,
       date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
       sum(low_carbon)                                              sum_low_carbon
from user_low_carbon
where substring(data_dt, 1, 4) = '2017'
group by user_id, data_dt
having sum_low_carbon >= 100;
```

```
user_id data_dt     sum_low_carbon
u_001   2017-01-02      270
u_001   2017-01-06      135
u_002   2017-01-02      220
u_002   2017-01-03      110
u_002   2017-01-04      150
u_002   2017-01-05      101
u_003   2017-01-02      160
u_003   2017-01-03      160
u_003   2017-01-05      120
u_003   2017-01-07      120
u_004   2017-01-01      110
u_004   2017-01-03      120
u_004   2017-01-06      120
u_004   2017-01-07      130
......
```

步骤二：将前两行数据以及后两行数据的日期放置当前行

```sql
hive (default)> select user_id,
       data_dt,
       lag(data_dt, 2, '1970-01-01') over (partition by user_id order by data_dt)   lag2,
       lag(data_dt, 1, '1970-01-01') over (partition by user_id order by data_dt)   lag1,
       lead(data_dt, 1, '9999-01-01') over (partition by user_id order by data_dt)  lead1,
       lead(data_dt, 2, '9999-01-01') over (partition by user_id order by data_dt) lead2
from (
         select user_id,
                date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
                sum(low_carbon)                                              sum_low_carbon
         from user_low_carbon
         where substring(data_dt, 1, 4) = '2017'
         group by user_id, data_dt
         having sum_low_carbon >= 100
     ) t1;
```

```
user_id data_dt lag2    lag1    lead1   lead2
u_001   2017-01-02      1970-01-01      1970-01-01      2017-01-06      9999-01-01
u_001   2017-01-06      1970-01-01      2017-01-02      9999-01-01      9999-01-01
u_002   2017-01-02      1970-01-01      1970-01-01      2017-01-03      2017-01-04
u_002   2017-01-03      1970-01-01      2017-01-02      2017-01-04      2017-01-05
u_002   2017-01-04      2017-01-02      2017-01-03      2017-01-05      9999-01-01
u_002   2017-01-05      2017-01-03      2017-01-04      9999-01-01      9999-01-01
u_003   2017-01-02      1970-01-01      1970-01-01      2017-01-03      2017-01-05
u_003   2017-01-03      1970-01-01      2017-01-02      2017-01-05      2017-01-07
u_003   2017-01-05      2017-01-02      2017-01-03      2017-01-07      9999-01-01
......
```

步骤三：计算当前日期跟前后两行时间的差值

```sql
hive (default)> select user_id,
       data_dt,
       datediff(data_dt, lag2)  lag2_diff,
       datediff(data_dt, lag1)  lag1_diff,
       datediff(data_dt, lead1) lead1_diff,
       datediff(data_dt, lead2) lead2_diff
from (
         select user_id,
                data_dt,
                lag(data_dt, 2, '1970-01-01') over (partition by user_id order by data_dt)  lag2,
                lag(data_dt, 1, '1970-01-01') over (partition by user_id order by data_dt)  lag1,
                lead(data_dt, 1, '9999-01-01') over (partition by user_id order by data_dt) lead1,
                lead(data_dt, 2, '9999-01-01') over (partition by user_id order by data_dt) lead2
         from (
                  select user_id,
                         date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
                         sum(low_carbon)                                              sum_low_carbon
                  from user_low_carbon
                  where substring(data_dt, 1, 4) = '2017'
                  group by user_id, data_dt
                  having sum_low_carbon >= 100
              ) t1
     ) t2;
```

```
user_id data_dt lag2_diff       lag1_diff       lead1_diff      lead2_diff
u_001   2017-01-02      17168   17168   -4      -2915364
u_001   2017-01-06      17172   4       -2915360        -2915360
u_002   2017-01-02      17168   17168   -1      -2
u_002   2017-01-03      17169   1       -1      -2
u_002   2017-01-04      2       1       -1      -2915362
u_002   2017-01-05      2       1       -2915361        -2915361
u_003   2017-01-02      17168   17168   -1      -3
u_003   2017-01-03      17169   1       -2      -4
u_003   2017-01-05      3       2       -2      -2915361
......
```

步骤四：过滤出最少连续3天超过100g的用户

```sql
hive (default)> select user_id,
       data_dt
from (
         select user_id,
                data_dt,
                datediff(data_dt, lag2)  lag2_diff,
                datediff(data_dt, lag1)  lag1_diff,
                datediff(data_dt, lead1) lead1_diff,
                datediff(data_dt, lead2) lead2_diff
         from (
                  select user_id,
                         data_dt,
                         lag(data_dt, 2, '1970-01-01') over (partition by user_id order by data_dt)  lag2,
                         lag(data_dt, 1, '1970-01-01') over (partition by user_id order by data_dt)  lag1,
                         lead(data_dt, 1, '9999-01-01') over (partition by user_id order by data_dt) lead1,
                         lead(data_dt, 2, '9999-01-01') over (partition by user_id order by data_dt) lead2
                  from (
                           select user_id,
                                  date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
                                  sum(low_carbon)                                              sum_low_carbon
                           from user_low_carbon
                           where substring(data_dt, 1, 4) = '2017'
                           group by user_id, data_dt
                           having sum_low_carbon >= 100
                       ) t1
              ) t2
     ) t3
where (lead1_diff = -1 and lead2_diff = -2)
   or (lag1_diff = 1 and lead1_diff = -1)
   or (lag2_diff = -2 and lag1_diff = -1);
```

```
user_id data_dt
u_002   2017-01-02
u_002   2017-01-03
u_002   2017-01-04
u_005   2017-01-02
u_005   2017-01-03
u_008   2017-01-04
u_008   2017-01-05
u_008   2017-01-06
u_009   2017-01-02
u_009   2017-01-03
u_010   2017-01-04
u_010   2017-01-05
u_010   2017-01-06
u_011   2017-01-01
u_011   2017-01-02
u_011   2017-01-03
u_011   2017-01-04
u_011   2017-01-05
u_011   2017-01-06
u_013   2017-01-02
u_013   2017-01-03
u_013   2017-01-04
u_014   2017-01-05
u_014   2017-01-06
```

步骤五：关联原表

```sql
hive (default)> select u.user_id,
       u.data_dt,
       u.low_carbon
from (
         select user_id,
                data_dt
         from (
                  select user_id,
                         data_dt,
                         datediff(data_dt, lag2)  lag2_diff,
                         datediff(data_dt, lag1)  lag1_diff,
                         datediff(data_dt, lead1) lead1_diff,
                         datediff(data_dt, lead2) lead2_diff
                  from (
                           select user_id,
                                  data_dt,
                                  lag(data_dt, 2, '1970-01-01') over (partition by user_id order by data_dt)  lag2,
                                  lag(data_dt, 1, '1970-01-01') over (partition by user_id order by data_dt)  lag1,
                                  lead(data_dt, 1, '9999-01-01') over (partition by user_id order by data_dt) lead1,
                                  lead(data_dt, 2, '9999-01-01') over (partition by user_id order by data_dt) lead2
                           from (
                                    select user_id,
                                           date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
                                           sum(low_carbon)                                              sum_low_carbon
                                    from user_low_carbon
                                    where substring(data_dt, 1, 4) = '2017'
                                    group by user_id, data_dt
                                    having sum_low_carbon >= 100
                                ) t1
                       ) t2
              ) t3
         where (lead1_diff = -1 and lead2_diff = -2)
            or (lag1_diff = 1 and lead1_diff = -1)
            or (lag2_diff = -2 and lag1_diff = -1)
     ) t4
         join user_low_carbon u on t4.user_id = u.user_id and t4.data_dt = date_format(regexp_replace(u.data_dt, '/', '-'), 'yyyy-MM-dd');
```

```
u.user_id       u.data_dt       u.low_carbon
u_002   2017/1/2        150
u_002   2017/1/2        70
u_002   2017/1/3        30
u_002   2017/1/3        80
u_002   2017/1/4        150
u_005   2017/1/2        50
u_005   2017/1/2        80
u_005   2017/1/3        180
u_008   2017/1/4        260
u_008   2017/1/5        360
u_008   2017/1/6        160
u_009   2017/1/2        70
u_009   2017/1/2        70
u_009   2017/1/3        170
u_010   2017/1/4        90
u_010   2017/1/4        80
u_010   2017/1/5        90
u_010   2017/1/5        90
u_010   2017/1/6        190
u_011   2017/1/1        110
u_011   2017/1/2        100
u_011   2017/1/2        100
u_011   2017/1/3        120
u_011   2017/1/4        100
u_011   2017/1/5        100
u_011   2017/1/6        100
u_013   2017/1/2        150
u_013   2017/1/2        50
u_013   2017/1/3        150
u_013   2017/1/4        550
u_014   2017/1/5        250
u_014   2017/1/6        120
```

**解法二**

步骤一：过滤出2017年且单日低碳量超过100g

```sql
hive (default)> select user_id,
       date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
       sum(low_carbon)                                              sum_low_carbon
from user_low_carbon
where substring(data_dt, 1, 4) = '2017'
group by user_id, data_dt
having sum_low_carbon >= 100;
```

```
user_id data_dt sum_low_carbon
u_001   2017-01-02      270
u_001   2017-01-06      135
u_002   2017-01-02      220
u_002   2017-01-03      110
u_002   2017-01-04      150
u_002   2017-01-05      101
u_003   2017-01-02      160
u_003   2017-01-03      160
u_003   2017-01-05      120
......
```

步骤二：按照日期进行排序，并给每一条数据一个标记

```sql
hive (default)> select user_id,
       data_dt,
       rank() over (partition by user_id order by data_dt) rk
from (
         select user_id,
                date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
                sum(low_carbon)                                              sum_low_carbon
         from user_low_carbon
         where substring(data_dt, 1, 4) = '2017'
         group by user_id, data_dt
         having sum_low_carbon >= 100
     ) t1;
```

```
user_id data_dt rk
u_001   2017-01-02      1
u_001   2017-01-06      2
u_002   2017-01-02      1
u_002   2017-01-03      2
u_002   2017-01-04      3
u_002   2017-01-05      4
u_003   2017-01-02      1
u_003   2017-01-03      2
u_003   2017-01-05      3
u_003   2017-01-07      4
u_004   2017-01-01      1
u_004   2017-01-03      2
......
```

步骤三：将日期减去当前的rank值

```sql
hive (default)> select user_id,
       data_dt,
       date_sub(data_dt, rk) sub_rk
from (
         select user_id,
                data_dt,
                rank() over (partition by user_id order by data_dt) rk
         from (
                  select user_id,
                         date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
                         sum(low_carbon)                                              sum_low_carbon
                  from user_low_carbon
                  where substring(data_dt, 1, 4) = '2017'
                  group by user_id, data_dt
                  having sum_low_carbon >= 100
              ) t1
     ) t2;
```

```
user_id data_dt sub_rk
u_001   2017-01-02      2017-01-01
u_001   2017-01-06      2017-01-04
u_002   2017-01-02      2017-01-01
u_002   2017-01-03      2017-01-01
u_002   2017-01-04      2017-01-01
u_002   2017-01-05      2017-01-01
u_003   2017-01-02      2017-01-01
u_003   2017-01-03      2017-01-01
u_003   2017-01-05      2017-01-02
u_003   2017-01-07      2017-01-03
......
```

步骤四：过滤出连续3天超过100g的用户

```sql
hive (default)> select user_id
from (
         select user_id,
                data_dt,
                date_sub(data_dt, rk) sub_rk
         from (
                  select user_id,
                         data_dt,
                         rank() over (partition by user_id order by data_dt) rk
                  from (
                           select user_id,
                                  date_format(regexp_replace(data_dt, '/', '-'), 'yyyy-MM-dd') data_dt,
                                  sum(low_carbon)                                              sum_low_carbon
                           from user_low_carbon
                           where substring(data_dt, 1, 4) = '2017'
                           group by user_id, data_dt
                           having sum_low_carbon >= 100
                       ) t1
              ) t2
     ) t3
group by user_id, sub_rk
having count(*) >= 3;
```

```
user_id
u_002
u_005
u_008
u_009
u_010
u_011
u_013
u_014
```

## 函数

### 系统内置函数

查看系统自带的函数

```sql
hive (default)> show functions;
```

显示自带的函数的用法

```sql
hive (default)> desc function upper;
```

详细显示自带的函数的用法

```sql
hive (default)> desc function extended upper;
```

### 自定义函数

Hive 自带了一些函数，比如：max/min 等，但是数量有限，自己可以通过自定义 UDF 来方便的扩展

当 Hive 提供的内置函数无法满足业务处理需要时，此时就可以考虑使用用户自定义函数（UDF：user-defined function）

根据用户自定义函数类别分为以下三种

1. UDF（User-Defined-Function） 一进一出
2. UDAF（User-Defined Aggregation Function） 聚集函数，多进一出，如 count/max/min
3. UDTF（User-Defined Table-Generating Functions） 一进多出，如 lateral view explore()

#### 自定义 UDF 函数

**导入依赖**

```xml
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>
```

**程序编写**

```java
// 1 继承 UDF
public class MyUDF extends UDF {
    // 2 实现 evaluate 函数，支持重载
    public String evaluate(String line) {
        return line.toLowerCase();
    }
}
```

**打成 jar 包上传至服务器**

```shell
scp my_lower.jar root@hadoop198:/opt/module/hive-1.2.1/lib
```

**将 jar 包添加到 hive 的 classpath**

```sql
hive (default)> add jar /opt/module/hive-1.2.1/lib/my_lower.jar;
```

**创建临时函数与 Java 类关联**

```sql
hive (default)> create temporary function mylower as "com.roger.MyUDF";
```

**在 hive 中使用自定义的函数**

```sql
hive (default)> select ename, mylower(ename) lowername from emp;
```

```
ename   lowername
SMITH   smith
ALLEN   allen
WARD    ward
JONES   jones
MARTIN  martin
BLAKE   blake
CLARK   clark
SCOTT   scott
KING    king
TURNER  turner
ADAMS   adams
JAMES   james
FORD    ford
MILLER  miller
```

**注意**：UDF 必须要有**返回类型**，可以返回 `null`，但是返回类型不能为 `void`

#### 自定义 UDTF 函数

自定义一个 UDTF 实现将一个任意分割符的字符串切割成独立的单词

**程序编写**

```java
public class MyUDTF extends GenericUDTF {
    List<String> dataList = new ArrayList<String>();
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // 定义数据输出的列名
        List<String> sfn = new ArrayList<String>();
        sfn.add("word");
        // 定义数据输出的类型
        List<ObjectInspector> sfoi = new ArrayList<ObjectInspector>();
        sfoi.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        // List<String> structFieldNames, List<ObjectInspector> structFieldObjectInspectors
        return ObjectInspectorFactory.getStandardStructObjectInspector(sfn, sfoi);
    }
    public void process(Object[] args) throws HiveException {
        // 1 获取数据
        String line = args[0].toString();
        // 2 获取分隔符
        String splitKey = args[1].toString();
        // 3 切分数据
        String[] words = line.split(splitKey);
        // 4 遍历
        for (String word : words) {
            // 5 将数据放入集合中，匹配写出类型
            dataList.clear();
            dataList.add(word);
            // 6 循环写出
            forward(dataList);
        }
    }
    public void close() throws HiveException {
    }
}
```

**打成 jar 包上传到服务器**

```shell
scp my_udtf.jar root@hadoop198:/opt/module/hive-1.2.1/lib
```

**将 jar 包添加到 hive 的 classpath 下**

```sql
hive (default)> add jar /opt/module/hive-1.2.1/lib/my_udtf.jar;
```

**创建临时函数与 Java 类关联**

```sql
hive (default)> create temporary function myudtf as "com.roger.MyUDTF";
```

**在 hive 中使用自定义的函数**

```sql
hive (default)> select myudtf('hello,wrold,roger', ',');
```

```
word
hello
wrold
roger
```

## 压缩和存储

### 压缩

#### Map 输出阶段压缩

开启 map 输出阶段压缩可以减少 job 中 MapTask 和 ReduceTask 间数据传输量

1. 开启 hive 中间传输数据压缩功能

   ```sql
   hive (default)> set hive.exec.compress.intermediate=true;
   ```

2. 开启 mapreduce 中 map 输出压缩功能

   ```sql
   hive (default)> set mapreduce.map.output.compress=true;
   ```

3. 设置 mapreduce 中 map 输出数据的压缩方式

   ```sql
   hive (default)>set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
   ```

4. 执行查询语句

   ```sql
   hive (default)> select count(ename) name from emp;
   ```

#### Reduce 输出阶段压缩

当 Hive 将输出写入到表中时，输出内容同样可以进行压缩

属性 `hive.exec.compress.output` 控制着这个功能，默认值 `false`，这样默认的输出就是非压缩的纯文本文件了，可以通过在查询语句或执行脚本中设置这个值为 `true`，来开启输出结果压缩功能

1. 开启 hive 最终输出数据压缩功能

   ```sql
   hive (default)> set hive.exec.compress.output=true;
   ```

2. 开启 mapreduce 最终输出数据压缩

   ```sql
   hive (default)> set mapreduce.output.fileoutputformat.compress=true;
   ```

3. 设置 mapreduce 最终数据输出压缩方式

   ```sql
   hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
   ```

4. 设置 mapreduce 最终数据输出压缩为块压缩

   ```sql
   hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
   ```

5. 测试一下输出结果是否是压缩文件

   ```sql
   hive (default)> insert overwrite local directory '/opt/module/hive-1.2.1/output/distribute-result' select * from emp distribute by deptno sort by empno desc;
   ```

### 存储

#### 文件存储格式

Hive 支持的存储数的格式主要有：TEXTFILE、SEQUENCEFILE、ORC、PARQUET

##### 列式存储和行式存储

![](imgs/行式和列式存储.png)

**行存储的特点**

查询满足条件的一整行数据的时候，列存储则需要去每个聚集的字段找到对应的每个列的值，行存储只需要找到其中一个值，其余的值都在相邻地方，所以此时行存储查询的速度更快

**列存储的特点**

因为每个字段的数据聚集存储，在查询只需要少数几个字段的时候，能大大减少读取的数据量；每个字段的数据类型一定是相同的，列式存储可以针对性的设计更好的压缩算法

`TEXTFILE` 和 `SEQUENCEFILE` 的存储格式都是基于「行式存储」，`ORC` 和 `PARQUET` 是基于「列式存储」

##### TextFile 格式

默认格式，数据不做压缩，磁盘开销大，数据解析开销大。可结合 Gzip、Bzip2 使用，但使用 Gzip 这种方式，hive 不会对数据进行切分，从而无法对数据进行并行操作

##### Orc 格式

Orc (Optimized Row Columnar)是 Hive 0.11 版里引入的新的存储格式

每个 Orc 文件由 1 个或多个 `Stripe` 组成，每个 Stripe `250MB` 大小，这个 Stripe 实际相当于 RowGroup 概念，不过大小由 `4MB -> 250MB`，这样能提升顺序读的吞吐率

每个 Stripe 里有三部分组成，分别是 `Index Data`，`Row Data`，`Stripe Footer`

![](imgs/orc格式.png)

1. Index Data：一个轻量级的索引，默认是每隔 1W 行做一个索引。这里做的索引只是记录某行的各字段在 Row Data 中的偏移
2. Row Data：存的是具体的数据，先取部分行，然后对这些行按列进行存储。对每个列进行了编码，分成多个 Stream 来存储
3. Stripe Footer：存的是各个 Stream 的类型，长度等信息

每个文件有一个 `File Footer`，这里面存的是每个 Stripe 的行数，每个 Column 的数据类型信息等

每个文件的尾部是一个`PostScript`，这里面记录了整个文件的压缩类型以及FileFooter的长度信息等

在读取文件时，会 seek 到文件尾部读 PostScript，从里面解析到 File Footer长度，再读 FileFooter，从里面解析到各个 Stripe 信息，再读各个 Stripe，即「从后往前」读

##### Parquet 格式

Parquet 文件是以「二进制」方式存储的，所以是不可以直接读取的，文件中包括该文件的数据和元数据，因此 Parquet 格式文件是「自解析」的

通常情况下，在存储 Parquet 数据的时候会按照 Block 大小设置Row group的大小，由于一般情况下每一个 Mapper 任务处理数据的最小单位是一个 Block，这样可以把每一个Row group由一个 Mapper 任务处理，增大任务执行并行度

![](imgs/parquet格式.png)

一个File中可以存储多个Row group，文件的首位都是该文件的 Magic Code，用于校验它是否是一个 Parquet 文件

Footer length 记录了文件元数据的大小，通过该值和文件长度可以计算出元数据的偏移量，文件的元数据中包括每一个行组的元数据信息和该文件存储数据的 Schema 信息。除了文件中每一个行组的元数据，每一页的开始都会存储该页的元数据，

Parquet 中有三种类型的页：数据页、字典页和索引页

1. 数据页用于存储当前行组中该列的值
2. 字典页存储该列值的编码字典，每一个列块中最多包含一个字典页
3. 索引页用来存储当前行组下该列的索引

#### 文件存储格式对比

**TextFile 格式**

创建表，存储数据格式为 TEXTFILE

```sql
hive (default)> create table log_text (track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
```

向表中加载数据

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/log.data' into table log_text;
```

查看表中数据大小

```shell
hdfs dfs -du -h /user/hive/warehouse/log_text;
```

```
18.1 M  /user/hive/warehouse/log_text/log.data
```

**Orc 格式**

创建表，存储数据格式为 ORC

```sql
hive (default)> create table log_orc(track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
```

向表中加载数据

```sql
hive (default)> insert into table log_orc select * from log_text;
```

查看表中数据大小

```shell
hdfs dfs -du -h /user/hive/warehouse/log_orc/;
```

```
2.8 M  /user/hive/warehouse/log_orc/000000_0
```

**Parquet 格式**

创建表，存储数据格式为 parquet

```sql
hive (default)> create table log_parquet(track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
```

向表中加载数据

```sql
hive (default)> insert into table log_parquet select * from log_text;
```

查看表中数据大小

```shell
hdfs dfs -du -h /user/hive/warehouse/log_parquet/;
```

```
13.1 M  /user/hive/warehouse/log_parquet/000000_0
```

#### 文件查询速度对比

**TextFile 格式**

```sql
hive (default)> select count(*) from log_text;
```

```
_c0
100000
Time taken: 35.963 seconds, Fetched: 1 row(s)
```

**Orc 格式**

```sql
hive (default)> select count(*) from log_orc;
```

```
_c0
100000
Time taken: 35.161 seconds, Fetched: 1 row(s)
```

**Parquet 格式**

```sql
hive (default)> select count(*) from log_parquet;
```

```
_c0
100000
Time taken: 45.958 seconds, Fetched: 1 row(s)
```

### 存储和压缩结合

**修改 Hadoop 集群具有 Snappy 压缩方式**

1. 查看 hadoop checknative 命令使用

   ```
   hadoop checknative [-a|-h] check native hadoop and compression libraries availability
   ```

2. 查看 hadoop 支持的压缩方式

   ```shell
   hadoop checknative
   ```

3. 将编译好的支持 Snappy 压缩的 hadoop-2.7.2.tar.gz 包导入到 hadoop198 的 /opt/software 中

4. 解压 hadoop-2.7.2.tar.gz 到当前路径

   ```shell
   tar -zxvf hadoop-2.7.2.tar.gz
   ```

5. 进入到 /opt/software/hadoop-2.7.2/lib/native 路径可以看到支持 Snappy 压缩的动态链接库

6. 拷贝 /opt/software/hadoop-2.7.2/lib/native 里面的所有内容到开发集群的 /opt/module/hadoop-2.7.2/lib/native 路径上

   ```shell
   cp native/* /opt/module/hadoop-2.7.2/lib/native/
   ```

7. 分发集群

   ```shell
   xsync native/
   ```

8. 再次查看 hadoop 支持的压缩类型

   ```shell
   hadoop checknative
   ```

9. 重新启动 hadoop 集群和 hive

**测试存储和压缩**

**创建一个非压缩的的 ORC 存储方式**

1. 建表

   ```sql
   hive (default)> create table log_orc_none(track_time string,
   url string,
   session_id string,
   referer string,
   ip string,
   end_user_id string,
   city_id string
   )
   row format delimited fields terminated by '\t'
   stored as orc tblproperties ("orc.compress"="NONE");
   ```

2. 插入数据

   ```sql
   hive (default)> insert into table log_orc_none select * from log_text;
   ```

3. 查看插入后数据

   ```shell
   hdfs dfs -du -h /user/hive/warehouse/log_orc_none/;
   ```

   ```
   7.7 M /user/hive/warehouse/log_orc_none/000000_0
   ```

**创建一个 SNAPPY 压缩的 ORC 存储方式**

1. 建表

   ```sql
   hive (default)> create table log_orc_snappy(track_time string,
   url string,
   session_id string,
   referer string,
   ip string,
   end_user_id string,
   city_id string )
   row format delimited fields terminated by '\t'
   stored as orc tblproperties ("orc.compress"="SNAPPY");
   ```

2. 插入数据

   ```sql
   hive (default)> insert into table log_orc_snappy select * from log_text;
   ```

3. 查看插入后数据

   ```shell
   hdfs dfs -du -h /user/hive/warehouse/log_orc_snappy/;
   ```

   ```
   3.8 M /user/hive/warehouse/log_orc_snappy/000000_0
   ```

ORC 存储方式导入数据后的大小为2.8 M比 Snappy 压缩的还小，是因为 ORC 存储文件默认采用 ZLIB 压缩，比 Snappy 压缩的小

## 优化

###  Fetch 抓取

Fetch 抓取是指 Hive 中对某些情况的查询可以**不必使用 MapReduce 计算**

如`SELECT * FROM employees;` 这种情况下，Hive 可以简单地读取 employee 对应的存储目录下的文件，然后输出查询结果到控制台

在 `hive-default.xml.template` 文件中 `hive.fetch.task.conversion` 默认是 `more`，在全局查找、字段查找、limit 查找等都不走 MapReduce

```xml
<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
        Expects one of [none, minimal, more].
        Some select queries can be converted to single FETCH task minimizing latency.
        Currently the query should be single sourced not having any subquery and should not haveany aggregations or distincts (which incurs RS), lateral views and joins.
        0. none : disable hive.fetch.task.conversion
        1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
        2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>
```

把 `hive.fetch.task.conversion` 设置成 `none`，然后执行查询语句都会执行 MapReduce 程序

### 本地模式

大多数的 Hadoop Job 是需要 Hadoop 提供的完整的可扩展性来处理大数据集的，不过有时 Hive 的输入数据量是非常小的，在这种情况下，为查询触发执行任务消耗的时间可能会比实际 job 的执行时间要多的多

对于多数这种情况，Hive 可以通过「本地模式」在单台机器上处理所有的任务，对于小数据集执行时间可以明显被缩短

可以通过设置 `hive.exec.mode.local.auto` 的值为`true`，来让 Hive 在适当的时候自动启动这个优化

开启本地 mr

```sql
hive (default)> set hive.exec.mode.local.auto=true;
```

设置 local mr 的最大输入数据量，当输入数据量小于这个值时采用 local mr 的方式，默认为 `134217728`，即 128M

```sql
hive (default)> set hive.exec.mode.local.auto.inputbytes.max=50000000;
```

设置 local mr 的最大输入文件个数，当输入文件个数小于这个值时采用 local mr 的方式，默认为 `4`

```sql
hive (default)> set hive.exec.mode.local.auto.input.files.max=10;
```

### 表的优化

#### 小表大表 Join

将 key 相对分散，并且数据量小的表放在 join 的左边，这样可以有效减少内存溢出错误发生的几率；再进一步，可以使用 map join 让小的维度表（1000 条以下的记录条数）先进内存，在 map 端完成 reduce

**实例测试**

测试大表 JOIN 小表和小表 JOIN 大表的效率

建大表和小表

```sql
hive (default)> create table bigtable(id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
)
row format delimited fields terminated by '\t';
```

```sql
hive (default)> create table smalltable(id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
row format delimited fields terminated by '\t';
```

分别向大表和小表中导入数据

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/bigtable' into table bigtable;
```

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/smalltable' into table smalltable;
```

关闭 map join 功能，其默认是打开的

```sql
hive (default)> set hive.auto.convert.join = false;
```

创建 join 后表

```sql
hive (default)> create table jointable(id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
row format delimited fields terminated by '\t';
```

执行小表 JOIN 大表语句

```sql
hive (default)> insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable b on b.id = s.id;
```

执行大表 JOIN 小表语句

```sql
hive (default)> insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s on s.id = b.id;
```

#### 大表 Join 大表

**空 KEY 过滤**

有时 join 超时是因为某些 key 对应的数据太多，而相同 key 对应的数据都会发送到相同的 reducer 上，从而导致内存不够。此时应该仔细分析这些异常的 key，很多情况下，这些 key 对应的数据是异常数据，需要在 SQL 语句中进行过滤，例如 key 对应的字段为空

配置历史服务器 `mapred-site.xml`

```xml
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop198:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop198:19888</value>
</property>
```

启动历史服务器

```shell
mr-jobhistory-daemon.sh start historyserver
```

查看 jobhistory

http://hadoop198:19888/jobhistory

创建原始数据表、空 id 表

```sql
hive (default)> create table ori(id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
row format delimited fields terminated by '\t';
```

```sql
hive (default)> create table nullidtable(id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
row format delimited fields terminated by '\t';
```

分别加载原始数据和空 id 数据到对应表中

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/SogouQ1.txt' into table ori;
```

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/nullid' into table nullidtable;
```

创建合并后数据表

```sql
hive (default)> create table jointable(id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
row format delimited fields terminated by '\t';
```

测试不过滤空 id

```sql
hive (default)> insert overwrite table jointable 
select n.* from nullidtable n 
left join ori o on n.id = o.id;
```

测试过滤空 id

```sql
hive (default)> insert overwrite table jointable 
select n.* from (select * from nullidtable where id is not null) n 
left join ori o on n.id = o.id;
```

**空 KEY 转换**

有时虽然某个 key 为空对应的数据很多，但是相应的数据不是异常数据，必须要包含在join 的结果中，此时可以表 a 中 key 为空的字段赋一个随机的值，使得数据随机均匀地分不到不同的 reducer 上

**不随机分布空 null 值**

设置 5 个 reduce 个数

```sql
hive (default)> set mapreduce.job.reduces=5;
```

JOIN 两张表

```sql
hive (default)> insert overwrite table jointable 
select n.* from nullidtable n 
left join ori b on n.id = b.id;
```

此时会出现数据倾斜，某些 reducer 的资源消耗远大于其他 reducer

**随机分布空 null 值**

设置 5 个 reduce 个数

```sql
hive (default)> set mapreduce.job.reduces=5;
```

JOIN 两张表

```sql
hive (default)> insert overwrite table jointable
select n.* from nullidtable n 
full join ori o on case when n.id is null then concat('hive', rand()) else n.id end = o.id;
```

此时消除了数据倾斜，负载均衡 reducer 的资源消耗

#### Map Join

如果不指定 MapJoin 或者不符合 MapJoin 的条件，那么 Hive 解析器会将 Join 操作转换成 Common Join，在 Reduce 阶段完成 join，容易发生数据倾斜。可以用 Map Join 把小表全部加载到内存在 map 端进行 join，避免 reducer 处理

**开启 MapJoin 参数设置**

设置自动选择 MapJoin

```sql
hive (default)> set hive.auto.convert.join=true;
```

大表小表的阈值设置，默认 25000000 字节以下认为是小表

```sql
hive (default)> set hive.mapjoin.smalltable.filesize=25000000;
```

**MapJoin 工作机制**

![](imgs/mapjoin工作机制.png)

1. Task A，它是一个Local Task（在客户端本地执行的Task），负责扫描小表b的数据，将其转换成一个HashTable的数据结构，并写入本地的文件中 ，之后将该文件加载到DistributeCache中
2. Task B，该任务是一个没有Reduce的MR，启动MapTasks扫描大表a，在Map阶段，根据a的每一条记录去和DistributeCache中b表对应的HashTable关联，并直接输出结果
3. 由于MapJoin没有Reduce，所以由Map直接输出结果文件，有多少个Map Task，就有多少个结果文件

开启 Map Join 功能，默认为 true

```sql
hive (default)> set hive.auto.convert.join = true;
```

执行小表 JOIN 大表语句

```sql
hive (default)> insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url 
from smalltable s
join bigtable b on s.id = b.id;
```

执行大表 JOIN 小表语句

```sql
hive (default)> insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url 
from bigtable b
join smalltable s on s.id = b.id;
```

#### Group By

默认情况下，Map 阶段同一 Key 数据分发给一个 reduce，当一个 key 数据过大时就倾斜了

并不是所有的聚合操作都需要在 Reduce 端完成，很多聚合操作都可以先在 Map 端进行部分聚合，最后在 Reduce 端得出最终结果

**开启 Map 端聚合参数设置**

是否在 Map 端进行聚合，默认为 True

```sql
hive (default)> set hive.map.aggr=true;
```

在 Map 端进行聚合操作的条目数目

```sql
hive (default)> set hive.groupby.mapaggr.checkinterval=100000;
```

有数据倾斜的时候进行负载均衡，默认是 false

```sql
hive (default)> set hive.groupby.skewindata=true;
```

当选项设定为 true，生成的查询计划会有两个 MR Job

第一个 MR Job 中，Map 的输出结果会「随机」分布到 Reduce 中，每个 Reduce 做部分聚合操作，并输出结果，这样处理的结果是相同的 Group By Key 有可能被分发到不同的 Reduce 中，从而达到负载均衡的目的

第二个 MR Job 再根据预处理的数据结果按照 Group By Key 分布到 Reduce 中，这个过程可以保证相同的 Group By Key 被分布到同一个 Reduce 中，最后完成最终的聚合操作

#### Distinct 去重统计

数据量小的时候无所谓，数据量大的情况下，由于 COUNT DISTINCT 操作需要用一个Reduce Task 来完成，这一个 Reduce 需要处理的数据量太大，就会导致整个 Job 很难完成

一般 COUNT DISTINCT 使用先 GROUP BY 再 COUNT 的方式替换

创建一张大表

```sql
hive (default)> create table bigtable(id bigint, 
time bigint, 
uid string, 
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
row format delimited fields terminated by '\t';
```

加载数据

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/bigtable' into tablebigtable;
```

**采用 distinct 执行去重 id 查询**

```sql
hive (default)> select count(distinct id) from bigtable;
```

设置 5 个 reduce 个数

```sql
hive (default)> set mapreduce.job.reduces=5;
```

**采用 GROUP by 去重 id**

```sql
hive (default)> select count(id) from (
select id from bigtable group by id
) a;
```

#### 笛卡尔积

尽量避免笛卡尔积，join 的时候不加 on 条件，或者无效的 on 条件，Hive 只能使用 1 个 reducer 来完成笛卡尔积

#### 行列过滤

**列处理**：在 SELECT 中，只拿需要的列，如果有，尽量使用分区过滤，少用 SELECT *

**行处理**：在分区剪裁中，当使用外关联时，如果将副表的过滤条件写在 Where 后面，那么就会先全表关联，之后再过滤

先关联两张表，再用 where 条件过滤

```sql
hive (default)> select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;
```

通过子查询先过滤后，再关联表

```sql
hive (default)> select b.id from bigtable b
join (
select id from ori where id <= 10
) o on b.id = o.id;
```

#### 动态分区调整

关系型数据库中，对分区表 Insert 数据时候，数据库自动会根据分区字段的值，将数据插入到相应的分区中，Hive 中也提供了类似的机制，即动态分区(Dynamic Partition)，只不过使用 Hive 的动态分区，需要进行相应的配置

**开启动态分区参数设置**

开启动态分区功能，默认为 true 开启

```sql
hive (default)> set hive.exec.dynamic.partition=true;
```

设置为非严格模式，动态分区的模式，默认为 `strict`，表示必须指定至少一个分区为静态分区；`nonstrict` 模式表示允许所有的分区字段都可以使用动态分区

```sql
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;
```

在所有执行 MR 的节点上，最大一共可以创建多少个动态分区

```sql
hive (default)> set hive.exec.max.dynamic.partitions=1000;
```

在每个执行 MR 的节点上，最大可以创建多少个动态分区

该参数需要根据实际的数据来设定，比如源数据中包含了一年的数据，即 day 字段有 365 个值，那么该参数就需要设置成大于 365，如果使用默认值 100，则会报错

```sql
hive (default)> set hive.exec.max.dynamic.partitions.pernode=100;
```

整个 MR Job 中，最大可以创建多少个 HDFS 文件

```sql
hive (default)> set hive.exec.max.created.files=100000;
```

**实例操作**

将 ori 中的数据按照时间(如 20111230000008)，插入到目标表 ori_partitioned_target 的相应分区中

创建分区表

```sql
hive (default)> create table ori_partitioned(id bigint, 
time bigint, 
uid string, 
keyword string,
url_rank int, 
click_num int, 
click_url string
) 
partitioned by (p_time bigint) 
row format delimited fields terminated by '\t';
```

加载数据到分区表中

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/ds1' into table ori_partitioned partition(p_time='20111230000010');
```

```sql
hive (default)> load data local inpath '/opt/module/hive-1.2.1/input/ds2' into table ori_partitioned partition(p_time='20111230000011');
```

创建目标分区表

```sql
hive (default)> create table ori_partitioned_target(id bigint, 
time bigint, 
uid string,
keyword string, 
url_rank int, 
click_num int, 
click_url string
) 
partitioned by (p_time STRING) 
row format delimited fields terminated by '\t';
```

设置动态分区

```
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
set hive.exec.max.created.files = 100000;
set hive.error.on.empty.partition = false;
```

插入数据实现动态分区

```sql
hive (default)> insert overwrite table ori_partitioned_target 
partition (p_time) 
select id, time, uid, keyword, url_rank, click_num, click_url, p_time 
from ori_partitioned;
```

**注意**：分区字段必须写在「最后面」

查看目标分区表的分区情况

```sql
hive (default)> show partitions ori_partitioned_target;
```

### MR 优化

#### 合理设置 Map 数

通常情况下，作业会通过 input 的目录产生一个或者多个 map 任务

主要的决定因素有：input 的文件总个数、input 的文件大小和集群设置的文件块大小

如果一个任务有很多小文件（远远小于块大小 128m），则每个小文件也会被当做一个块，用一个 map 任务来完成，而一个 map 任务启动和初始化的时间远远大于逻辑处理的时间，就会造成很大的资源浪费。而且同时可执行的 map 数是受限的

若有一个 127m 的文件，正常会用一个 map 去完成，但这个文件只有一个或者两个小字段，却有几千万的记录，如果 map 处理的逻辑比较复杂，用一个 map任务去做会比较耗时

#### 小文件进行合并

在 map 执行前合并小文件，减少 map 数

CombineHiveInputFormat 具有对小文件进行合并的功能，为系统默认的格式

HiveInputFormat 没有对小文件合并功能

```sql
hive (default)> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```

#### 复杂文件增加 Map

当 input 的文件都很大，任务逻辑复杂，map 执行非常慢的时候，可以考虑增加 Map数，来使得每个 map 处理的数据量减少，从而提高任务的执行效率

```java
computeSliteSize(Math.max(minSize, Math.min(maxSize,blocksize)));
```

调整 maxSize 最大值，让 maxSize 最大值低于 blocksize 就可以增加 map 的个数

执行查询

```sql
hive (default)> select count(*) from emp;
```

设置最大切片值为 100 个字节

```sql
hive (default)> set mapreduce.input.fileinputformat.split.maxsize=100;
```

执行查询

```sql
hive (default)> select count(*) from emp;
```

#### 合理设置 Reduce 数

**方法一**

设置每个 job 的 Reduce 个数

```sql
hive (default)> set mapreduce.job.reduces=15;
```

**方法二**

每个 Reduce 处理的数据量默认是 256000000字节

```sql
hive (default)> set hive.exec.reducers.bytes.per.reducer=256000000;
```

每个任务最大的 reduce 数，默认为 1009

```sql
hive (default)> set hive.exec.reducers.max=1009;
```

计算 reducer 数的公式

```
N = min(hive.exec.reducers.max, 总输入数据量 / hive.exec.reducers.bytes.per.reducer)
```

过多的启动和初始化 reduce 也会消耗时间和资源

另外，有多少个 reduce，就会有多少个输出文件，如果生成了很多个小文件，那么如果这些小文件作为下一个任务的输入，则也会出现小文件过多的问题

在设置 reduce 个数的时候也需要考虑这两个原则

1. 处理大数据量利用合适的 reduce 数
2. 使单个 reduce 任务处理数据量大小要合适

### 并行执行

Hive 会将一个查询转化成一个或者多个阶段，这样的阶段可以是 MapReduce 阶段、抽样阶段、合并阶段、limit 阶段，或者 Hive 执行过程中可能需要的其他阶段

默认情况下，Hive 一次只会执行一个阶段。不过，某个特定的 job 可能包含众多的阶段，而这些阶段可能并非完全互相依赖的，也就是说有些阶段是可以并行执行的，这样可能使得整个 job 的执行时间缩短。如果有更多的阶段可以并行执行，那么 job 可能就越快完成

通过设置参数 `hive.exec.parallel` 值为 `true`，就可以开启并发执行。不过，在共享集群中，需要注意如果 job 中并行阶段增多，那么集群利用率就会增加

打开任务并行执行

```sql
hive (default)> set hive.exec.parallel=true;
```

同一个 sql 允许最大并行度，默认为 8

```sql
hive (default)> set hive.exec.parallel.thread.number=16;
```

### 严格模式

Hive 提供了一个严格模式，可以防止用户执行那些可能意想不到的不好的影响的查询

```xml
<property>
    <name>hive.mapred.mode</name>
    <value>strict</value>
    <description>
        The mode in which the Hive operations are being performed. 
        In strict mode, some risky queries are not allowed to run. 
        They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
</property>
```

通过设置属性 `hive.mapred.mode` 值为默认是非严格模式 `nonstrict `。开启严格模式需要修改 `hive.mapred.mode` 值为 `strict`，开启严格模式可以禁止 3 种类型的查询

1. 对于分区表，除非 where 语句中含有分区字段过滤条件来限制范围，否则不允许执行，就是用户不允许扫描所有分区

   进行这个限制的原因是，通常分区表都拥有非常大的数据集，而且数据增加迅速。没有进行分区限制的查询可能会消耗令人不可接受的巨大资源来处理这个表

2. 对于使用了 order by 语句的查询，要求必须使用 limit 语句

   因为 order by 为了执行排序过程会将所有的结果数据分发到同一个 Reducer 中进行处理，强制要求用户增加这个LIMIT 语句可以防止 Reducer 额外执行很长一段时间

3. 限制笛卡尔积的查询

### JVM 重用

JVM 重用是 Hadoop 调优参数的内容，其对 Hive 的性能具有非常大的影响，特别是对于很难避免小文件的场景或 task 特别多的场景，这类场景大多数执行时间都很短

Hadoop 的默认配置通常是使用派生 JVM 来执行 map 和 Reduce 任务的。这时 JVM 的启动过程可能会造成相当大的开销，尤其是执行的 job 包含有成百上千 task任务的情况。JVM重用可以使得 JVM 实例在同一个 job 中重新使用 N 次。N 的值可以在 Hadoop 的 `mapred-site.xml` 文件中进行配置

```xml
<property>
    <name>mapreduce.job.jvm.numtasks</name>
    <value>10</value>
    <description>
        How many tasks to run per jvm. If set to -1, there is no limit. 
    </description>
</property>
```

这个功能的缺点是，开启 JVM 重用将一直占用使用到的 task 插槽，以便进行重用，直到任务完成后才能释放。如果某个“不平衡的”job 中有某几个 reduce task 执行的时间要比其他 Reduce task 消耗的时间多的多的话，那么保留的插槽就会一直空闲着却无法被其他的 job使用，直到所有的 task 都结束了才会释放

### 推测执行

在分布式集群环境下，因为程序 Bug（包括 Hadoop 本身的 bug），负载不均衡或者资源分布不均等原因，会造成同一个作业的多个任务之间运行速度不一致，有些任务的运行速度可能明显慢于其他任务（比如一个作业的某个任务进度只有 50%，而其他所有任务已经运行完毕），则这些任务会拖慢作业的整体执行进度

为了避免这种情况发生，Hadoop 采用了推测执行（Speculative Execution）机制，它根据一定的法则推测出“拖后腿”的任务，并为这样的任务启动一个备份任务，让该任务与原始任务同时处理同一份数据，并最终选用最先成功运行完成任务的计算结果作为最终结果

设置开启推测执行参数，Hadoop 的 `mapred-site.xml` 文件中进行配置

```xml
<property>
    <name>mapreduce.map.speculative</name>
    <value>true</value>
    <description>
        If true, then multiple instances of some map tasks may be executed in parallel.
    </description>
</property>
<property>
    <name>mapreduce.reduce.speculative</name>
    <value>true</value>
    <description>
        If true, then multiple instances of some reduce tasks may be executed in parallel.
    </description>
</property>
```

不过 hive 本身也提供了配置项来控制 reduce-side 的推测执行

```xml
<property>
    <name>hive.mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
    <description>
        Whether speculative execution for reducers should be turned on. 
    </description>
</property>
```

如果用户对于运行时的偏差非常敏感的话，那么可以将这些功能关闭掉。如果用户因为输入数据量很大而需要执行长时间的 map 或者 Reduce task 的话，那么启动推测执行造成的浪费是非常巨大大

### 执行计划

**基本语法**

```
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
```

查看执行计划

```sql
hive (default)> explain select * from emp;
```

查看详细执行计划

```sql
hive (default)> explain extended select * from emp;
```

## 谷粒影音

### 需求描述

统计硅谷影音视频网站的常规指标以及各种 TopN 指标

1. 统计视频观看数 Top10
2. 统计视频类别热度 Top10
3. 统计视频观看数 Top20 所属类别以及类别包含的 Top20 的视频个数
4. 统计视频观看数 Top50 所关联视频的所属类别排名
5. 统计每个类别中的视频观看数 Top10
7. 统计上传视频最多的用户 Top10 以及他们上传的观看次数在前 20 视频

### 项目

#### 数据结构

**视频表**

| 字段      | 备注        | 详细描述                  |
| --------- | ----------- | ------------------------- |
| videoId   | 视频唯一 id | 11 位字符串               |
| uploader  | 视频上传者  | 上传视频的用户名 String   |
| age       | 视频年龄    | 视频在平台上的整数天      |
| category  | 视频类别    | 上传视频指定的视频分类    |
| length    | 视频长度    | 整形数字标识的视频长度    |
| views     | 观看次数    | 视频被浏览的次数          |
| rate      | 视频评分    | 满分 5 分                 |
| ratings   | 流量        | 视频的流量，整型数字      |
| comments  | 评论数      | 一个视频的整数评论数      |
| relatedId | 相关视频 id | 相关视频的 id，最多 20 个 |

**用户表**

| 字段     | 备注         | 字段类型 |
| -------- | ------------ | -------- |
| uploader | 上传者用户名 | string   |
| videos   | 上传视频数   | int      |
| friends  | 朋友数量     | int      |

#### ETL 原始数据

通过观察原始数据形式，可以发现视频可以有多个所属分类，每个所属分类用`&`符号分割，且分割的两边有空格字符，同时相关视频也是可以有多个元素，多个相关视频又用`\t` 进行分割

为了分析数据时方便对存在多个子素的数据进行操作，首先进行数据重组清洗操作，将所有的类别用`&`分割，同时去掉两边空格，多个相关视频 id 也使用`&`进行分割

**ETLUtil.etlString**

```java
/**
 * 1 过滤长度不够的，小于9个字段的
 * 2 去掉类别字段中的空格
 * 3 修改相关视频字段的分隔符，由"\t" -> "&"
 * @param line 原始数据
 * @return 过滤后的数据
 */
public static String etlString(String line) {
    StringBuilder sb = new StringBuilder();
    // 1 切割
    String[] fields = line.split("\t");
    // 2 对字段长度进行过滤
    if (fields.length < 9) {
        return null;
    }
    // 3 去掉类别字段中的空格
    fields[3] = fields[3].replaceAll(" ", "");
    // 4 修改相关视频字段的分隔符，由"\t" -> "&"
    for (int i = 0; i < fields.length; i++) {
        // 将之前的字段数据加入到sb中
        if (i < 9) {
            sb.append(fields[i]);
            // \t 分割
            if (i != fields.length - 1)
                sb.append("\t");
        } else {
            sb.append(fields[i]);
            if (i != fields.length - 1)
                sb.append("&");
        }
    }
    // 5 返回结果
    return sb.toString();
}
```

**ETLMapper**

```java
public class ETLMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    Text v = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1 获取数据
        String line = value.toString();
        // 2 过滤数据
        String etlStr = ETLUtil.etlString(line);
        // 判断是否为空
        if (etlStr == null) {
            return;
        }
        // 3 写出数据
        v.set(etlStr);
        context.write(NullWritable.get(), v);
    }
}
```

**ETLDriver**

```java
public class ETLDriver implements Tool {
    private Configuration configuration;
    public static void main(String[] args) {
        // 1 构建配置信息
        Configuration conf = new Configuration();
        // 2 运行run
        try {
            int run = ToolRunner.run(conf, new ETLDriver(), args);
            System.out.println(run);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public int run(String[] args) throws Exception {
        // 1 获取job对象
        Job job = Job.getInstance(configuration);
        // 2 设置jar包路径
        job.setJarByClass(ETLDriver.class);
        // 3 关联mapper
        job.setMapperClass(ETLMapper.class);
        // 设置reduce个数为0
        job.setNumReduceTasks(0);
        // 4 设置mapper输出类型
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        // 5 设置最终输出类型
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        // 6 设置输入输出路径
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7 提交任务
        boolean result = job.waitForCompletion(true);
        return result ? 0 : 1;
    }
    public void setConf(Configuration configuration) {
        this.configuration = configuration;
    }
    public Configuration getConf() {
        return configuration;
    }
}
```

**执行 ETL**

```shell
yarn jar guli_video.jar com.roger.gulivideo.etl.ETLDriver gulivideo/input/video/2008/0222 gulivideo/output/video_output
```

### 准备工作

#### 创建表

**原始表**：`gulivideo_ori`，`gulivideo_user_ori`

**gulivideo_ori**

```sql
hive (default)> create table gulivideo_ori(videoId string, 
uploader string, 
age int, 
category array<string>, 
length int, 
views int, 
rate float, 
ratings int, 
comments int,
relatedId array<string>
)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
```

**gulivideo_user_ori**

```
hive (default)> create table gulivideo_user_ori(uploader string,
videos int,sql
friends int
)
row format delimited fields terminated by "\t" 
stored as textfile;
```

**Orc表**：`gulivideo_orc`，`gulivideo_user_orc`

把原始数据插入到 orc 表中

**gulivideo_orc**

```sql
hive (default)> create table gulivideo_orc(videoId string, 
uploader string, 
age int, 
category array<string>,
length int, 
views int, 
rate float, 
ratings int, 
comments int,
relatedId array<string>
)
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;
```

**gulivideo_user_orc**

```sql
hive (default)> create table gulivideo_user_orc(uploader string,
videos int,
friends int
)
row format delimited fields terminated by "\t" 
stored as orc;
```

#### 导入 ETL 后的数据

**gulivideo_ori**

```sql
hive (default)> load data inpath "/user/root/gulivideo/output/video_output" into table gulivideo_ori;
```

**gulivideo_user_ori**

```sql
hive (default)> load data inpath "/user/root/gulivideo/input/user/2008/0903" into table gulivideo_user_ori;
```

#### 向 ORC 表插入数据

**gulivideo_orc**

```sql
hive (default)> insert into table gulivideo_orc select * from gulivideo_ori;
```

**gulivideo_user_orc**

```sql
hive (default)> insert into table gulivideo_user_orc select * from gulivideo_user_ori;
```

### 业务分析

#### 统计视频观看数 Top10

使用 order by 按照 views 字段做一个全局排序，同时设置只显示前 10 条

```sql
hive (default)> select videoId, 
views
from gulivideo_orc 
order by views desc 
limit 10;
```

```
videoid views
dMH0bHeiRNg     42513417
0XxI-hvPRRA     20282464
1dmVU08zVpA     16087899
RB-wUgnyGv0     15712924
QjA5faZF1A8     15256922
-_CSo1gOd48     13199833
49IDp76kjPw     11970018
tYnn51C3X_w     11823701
pv5zWaTEVkI     11672017
D2kJZOfq7zk     11184051
```

#### 统计视频类别热度 Top10

**思路**

统计每个类别有多少个视频，显示出包含视频最多的前 10 个类别

因为当前表结构为一个视频对应一个或多个类别，所以如果要 group by 类别，需要先将类别进行展开，然后再进行 count

按照类别 group by 聚合，然后 count 组内的 videoId 个数

最后按照热度排序，显示前 10 条

**实现**

将类别字段的数据进行展开

```sql
hive (default)> select videoId,
category_name 
from gulivideo_orc 
lateral view explode(category) t_category as category_name;
```

```
videoid category_name
1xbSFrHzFQ0     Film
1xbSFrHzFQ0     Animation
4VP4qSjDNQs     Film
4VP4qSjDNQs     Animation
RJgGeYiJrj0     Entertainment
......
```

统计各类别视频数量进行排序，取前10名

```sql
hive (default)> select category_name, count(*) category_count
from (
         select videoId, category_name
         from gulivideo_orc 
         lateral view explode(category) t_category as category_name
     ) t1
group by category_name
order by category_count desc
limit 10;
```

```
category_name   category_count
Music   179049
Entertainment   127674
Comedy  87818
Animation       73293
Film    73293
Sports  67329
Gadgets 59817
Games   59817
Blogs   48890
People  48890
```

#### 统计视频观看数最高的 20 个视频的所属类别以及类别包含Top20视频的个数

**思路**

先找到观看数最高的 20 个视频所属条目的信息，降序排列

把这 20 条信息中的 category 展开

最后查询视频分类名称和该分类下有多少个 Top20 的视频

**实现**

统计视频观看数最高的 20 个视频

```sql
hive (default)> select videoId, 
views,
category
from gulivideo_orc 
order by views desc 
limit 20;
```

```
videoid views   category
dMH0bHeiRNg     42513417        ["Comedy"]
0XxI-hvPRRA     20282464        ["Comedy"]
1dmVU08zVpA     16087899        ["Entertainment"]
RB-wUgnyGv0     15712924        ["Entertainment"]
QjA5faZF1A8     15256922        ["Music"]
-_CSo1gOd48     13199833        ["People","Blogs"]
49IDp76kjPw     11970018        ["Comedy"]
tYnn51C3X_w     11823701        ["Music"]
pv5zWaTEVkI     11672017        ["Music"]
D2kJZOfq7zk     11184051        ["People","Blogs"]
vr3x_RRJdd4     10786529        ["Entertainment"]
lsO6D1rwrKc     10334975        ["Entertainment"]
5P6UU6m3cqk     10107491        ["Comedy"]
8bbTtPL1jRs     9579911 ["Music"]
_BuRwH59oAo     9566609 ["Comedy"]
aRNzWyD7C9o     8825788 ["UNA"]
UMf40daefsI     7533070 ["Music"]
ixsZy2425eY     7456875 ["Entertainment"]
MNxwAU_xAMk     7066676 ["Comedy"]
RUCZJVJ_M8o     6952767 ["Entertainment"]
```

展开这些视频类别字段数据

```sql
hive (default)> select videoid, category_name
from (
         select videoId,
                views,
                category
         from gulivideo_orc
         order by views desc
         limit 20
     ) t1 
lateral view explode(category) tmp as category_name;
```

```
videoid category_name
dMH0bHeiRNg     Comedy
0XxI-hvPRRA     Comedy
1dmVU08zVpA     Entertainment
RB-wUgnyGv0     Entertainment
QjA5faZF1A8     Music
-_CSo1gOd48     People
-_CSo1gOd48     Blogs
49IDp76kjPw     Comedy
tYnn51C3X_w     Music
pv5zWaTEVkI     Music
D2kJZOfq7zk     People
D2kJZOfq7zk     Blogs
vr3x_RRJdd4     Entertainment
lsO6D1rwrKc     Entertainment
5P6UU6m3cqk     Comedy
8bbTtPL1jRs     Music
_BuRwH59oAo     Comedy
aRNzWyD7C9o     UNA
UMf40daefsI     Music
ixsZy2425eY     Entertainment
MNxwAU_xAMk     Comedy
RUCZJVJ_M8o     Entertainment
```

统计这些类别包含Top20视频的个数

```sql
hive (default)> select category_name, count(*) video_count
from (
         select videoid, category_name
         from (
                  select videoId,
                         views,
                         category
                  from gulivideo_orc
                  order by views desc
                  limit 20
              ) t1 
              lateral view explode(category) tmp as category_name
     ) t2
group by category_name
order by video_count desc;
```

```
category_name   video_count
Entertainment   6
Comedy  6
Music   5
People  2
Blogs   2
UNA     1
```

#### 统计视频观看数 Top50 所关联视频的所属类别排名

**补充**：若此时出现「内存溢出」错误

```
Error: Java heap space
```

可以在 `yarn-site.xml` 中添加配置

```xml
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
</property>
```

查询出观看数最多的前 50 个视频

```sql
hive (default)> select videoId, 
views,
relatedId
from gulivideo_orc 
order by views desc 
limit 50;
```

```
videoid views   relatedid
dMH0bHeiRNg     42513417        ["OxBtqwlTMJQ","1hX1LxXwdl8","NvVbuVGtGSE","Ft6fC6RI4Ms","plv1e3MvxFw","1VL-ShAEjmg","y8k5QbVz3SE","weRfgj_349Q","_MFpPziLP9o","0M-xqfP1ibo","n4Pr_iCxxGU","UrWnNAMec98","QoREX_TLtZo","I-cm3GF-jX0","doIQXfJvydY","6hD3gGg9jMk","Hfbzju1FluI","vVN_pLl5ngg","3PnoFu027hc","7nrpwEDvusY"]
0XxI-hvPRRA     20282464        ["ut5fFyTkKv4","cYmeG712dD0","aDiNeF5dqnA","lNFFR1uwPGo","5Iyw4y6QR14","N1NO0iLbEt0","YtmGrR0tR7E","GZltV9lWQL4","qUDLSsSrrRA","wpQ1llsQ7qo","u9w2z-xtmqY","txVJgU3n72g","M6KcfOAckmw","orkbRVgRys0","HSuSo9hG_RI","3H3kKJLQgPs","46EsU9PmPyk","nn4XzrI1LLk","VTpKh6jFS7M","xH4b9ydgaHk"]
1dmVU08zVpA     16087899        ["x0dzQeq6o5Q","BVvxtb0euBY","Tc4iq0IaPgE","caYdEBT36z0","Wch5akcVofs","FgN4E9-U82s","a0ffAHbxsLY","BaR9j3-radY","jbNCtXtAwUo","XJBfdkDlubU","c6JRE4ZBcuA","nRGZJ8GMg3g","BfR7iz2UqZY","cVHrwiP2vro","CowiFyYfcH4","uYxKs7xXopc","dzYaq2yOCb8","9o_D-M91Hhc","0O04jXoZmgY","XphZDHtt3D0"]
RB-wUgnyGv0     15712924        ["RB-wUgnyGv0","Bdgffnf8Hfw","YSLFsov31dA","KDmGXlOJPbQ","Hr-48XYy9Ns","6E1s0LDL-uM","0j3iXi0V3hk","uEXlbUV45pw","KvMsc6OdKWc","9kGIbR7dqyQ","pEu1muGrREA","DolERIvMbzM","gPtR2eSeDIw","3EpF4fRoT4U","Dl2roCEKffM","QERUjf8fbII","9oviIyGYolo","dblCjXdP7bo","IMPGIaXCnaA","TdGgKd4ZyuY"]
QjA5faZF1A8     15256922        ["O9mEKMz2Pvo","Ddn4MGaS3N4","CBgf2ZxIDZk","r2BOApUvFpw","dVUgd8ot6BE","OUi9-jqq_i0","AbndgwfG22k","K3fvB4QO1qo","6rIJJp6aMlA","9wItsn3r_kc","cueXmJDbbvU","Ua3hZXfNZOE","Z2Rl5BsnfdY","pZ9jrBg4Lwc","dt1fB62cGbo","idb2dUtTpuU","j01x2lAFRwk","LmcjAGJOPR0","kFhQM7R4yjM","rNNcMDZn2Qk"]
......
```

对relatedid进行展开并去重

```sql
hive (default)> select related_id
from (
         select videoId,
                views,
                relatedId
         from gulivideo_orc
         order by views desc
         limit 50
     ) t1 
lateral view explode(relatedid) tmp as related_id
group by related_id;
```

```
related_id
OxBtqwlTMJQ
1hX1LxXwdl8
NvVbuVGtGSE
Ft6fC6RI4Ms
plv1e3MvxFw
1VL-ShAEjmg
y8k5QbVz3SE
weRfgj_349Q
_MFpPziLP9o
......
```

取出related_id对应的类别

```sql
hive (default)> select o.category
from (
         select related_id
         from (
                  select videoId,
                         views,
                         relatedId
                  from gulivideo_orc
                  order by views desc
                  limit 50
              ) t1 
         lateral view explode(relatedid) tmp as related_id
         group by related_id
     ) t2
join gulivideo_orc o on t2.related_id = o.videoId;
```

```
o.category
["Comedy"]
["Music"]
["Comedy"]
["Entertainment"]
["Music"]
["Music"]
["Entertainment"]
["Entertainment"]
["Comedy"]
["Howto","DIY"]
["Comedy"]
......
```

对相关视频的category类别进行展开

```sql
hive (default)> select explode(category) category_name
from (
         select o.category
         from (
                  select related_id
                  from (
                           select videoId,
                                  views,
                                  relatedId
                           from gulivideo_orc
                           order by views desc
                           limit 50
                       ) t1 lateral view explode(relatedid) tmp as related_id
                  group by related_id
              ) t2
                  join gulivideo_orc o on t2.related_id = o.videoId
     ) t3;
```

```
category_name
Comedy
Music
Comedy
Entertainment
Music
Music
Entertainment
Entertainment
......
```

对这些类别进行分组统计按照记录数倒序排序

```sql
hive (default)> select category_name, count(*) count_category
from (
         select explode(category) category_name
         from (
                  select o.category
                  from (
                           select related_id
                           from (
                                    select videoId,
                                           views,
                                           relatedId
                                    from gulivideo_orc
                                    order by views desc
                                    limit 50
                                ) t1 lateral view explode(relatedid) tmp as related_id
                           group by related_id
                       ) t2
                           join gulivideo_orc o on t2.related_id = o.videoId
              ) t3
     ) t4
group by category_name
order by count_category desc;
```

```
category_name   count_category
Comedy  232
Entertainment   216
Music   195
Blogs   51
People  51
Film    47
Animation       47
News    22
Politics        22
Games   20
Gadgets 20
Sports  19
Howto   14
DIY     14
UNA     13
Places  12
Travel  12
Animals 11
Pets    11
Autos   4
Vehicles        4
```

#### 统计每个类别中的视频观看数 Top10

**思路**

创建一张表用于存放 category 展开的数据，以便复用

向 category 展开的表中插入数据

统计对应类别中的视频观看数

**实现**

创建表类别表

```sql
hive (default)> create table gulivideo_category(videoId string, 
uploader string, 
age int, 
categoryId string, 
length int, 
views int, 
rate float, 
ratings int, 
comments int, 
relatedId array<string>
)
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;
```

向类别表中插入数据

```sql
hive (default)> insert into table gulivideo_category 
select videoId,
uploader,
age,
categoryId,
length,
views,
rate,
ratings,
comments,
relatedId 
from gulivideo_orc 
lateral view explode(category) tmp_category as categoryId;
```

将类别表中的数据使用rank函数打标签，按照类别进行分组，按照观看数降序排序

```sql
hive (default)> select videoId,
views,
categoryId,
row_number() over (partition by categoryId order by views desc) rk
from gulivideo_category;
```

统计不同类别的 Top10

```sql
hive (default)> select videoid, views, categoryid
from (
         select videoId,
                views,
                categoryId,
                row_number() over (partition by categoryId order by views desc) rk
         from gulivideo_category
     ) t1
where rk <= 10;
```

```
videoid views   categoryid
2GWPOPSXGYI     3660009 Animals
xmsV9R8FsDA     3164582 Animals
12PsUW-8ge4     3133523 Animals
OeNggIGSKH8     2457750 Animals
WofFb_eOxxA     2075728 Animals
AgEmZ39EtFk     1999469 Animals
a-gW3RbJd8U     1836870 Animals
8CL2hetqpfg     1646808 Animals
QmroaYVD_so     1645984 Animals
Sg9x5mUjbH8     1527238 Animals
sdUUx5FdySs     5840839 Animation
6B26asyGKDo     5147533 Animation
H20dhY01Xjk     3772116 Animation
55YYaJIrmzo     3356163 Animation
JzqumbhfxRo     3230774 Animation
eAhfZUZiwSE     3114215 Animation
h7svw0m-wO0     2866490 Animation
tAq3hWBlalU     2830024 Animation
AJzU3NjDikY     2569611 Animation
ElrldD02if0     2337238 Animation
......
```

#### 统计上传视频最多的用户 Top10 以及他们上传的观看次数在前 20 的视频

找到上传视频最多的 10 个用户的用户信息

```sql
hive (default)> select uploader, videos
from gulivideo_user_orc 
order by videos desc
limit 10;
```

```
uploader        videos
expertvillage   86228
TourFactory     49078
myHotelVideo    33506
AlexanderRodchenko      24315
VHTStudios      20230
ephemeral8      19498
HSN     15371
rattanakorn     12637
Ruchaneewan     10059
futifu  9668
```

通过 uploader 字段与 gulivideo_orc 表进行 join，得到的信息按照 views 观看次数进行排序，取前20

```sql
hive (default)> select v.videoId, v.views, v.uploader
from gulivideo_orc v
         join (
    select uploader, videos
    from gulivideo_user_orc
    order by videos desc
    limit 10
) u on v.uploader = u.uploader
order by v.views desc
limit 20;
```

```
v.videoid       v.views v.uploader
-IxHBW0YpZw     39059   expertvillage
BU-fT5XI_8I     29975   expertvillage
ADOcaBYbMl0     26270   expertvillage
yAqsULIDJFE     25511   expertvillage
vcm-t0TJXNg     25366   expertvillage
0KYGFawp14c     24659   expertvillage
j4DpuPvMLF4     22593   expertvillage
Msu4lZb2oeQ     18822   expertvillage
ZHZVj44rpjE     16304   expertvillage
foATQY3wovI     13576   expertvillage
-UnQ8rcBOQs     13450   expertvillage
crtNd46CDks     11639   expertvillage
D1leA0JKHhE     11553   expertvillage
NJu2oG1Wm98     11452   expertvillage
CapbXdyv4j4     10915   expertvillage
epr5erraEp4     10817   expertvillage
IyQoDgaLM7U     10597   expertvillage
tbZibBnusLQ     10402   expertvillage
_GnCHodc7mk     9422    expertvillage
hvEYlSlRitU     7123    expertvillage
```