Big Data Technologies: Hive
Food for thought: what hurts most is usually not failing, but "I could have."
1. Getting Started with Hive
1.1 What Is Hive
Hive was open-sourced by Facebook to run statistics over massive volumes of structured log data.
Hive is a data-warehouse tool built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capabilities.
In essence, Hive translates HQL into MapReduce programs (a small illustration follows the list below).

- The data Hive processes is stored in HDFS
- Hive's analysis is implemented underneath by MapReduce
- The resulting jobs run on YARN
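As a minimal sketch of that translation (the table and column names here are illustrative, not from this chapter's examples), the aggregation below is compiled by Hive into a MapReduce job: mappers emit (city, 1) pairs and reducers sum the counts per city.

-- Hive compiles this GROUP BY into one MapReduce job:
-- the map phase emits (city, 1), the reduce phase sums per city
select city, count(*) as cnt
from user_visits            -- hypothetical table
group by city;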
1.2 Pros and Cons of Hive
1.2.1 Pros
- The interface uses SQL-like syntax, enabling rapid development (simple, easy to pick up).
- It avoids writing MapReduce by hand, lowering the learning cost for developers.
- Hive's execution latency is relatively high, so it is typically used for data analysis in scenarios that do not require real-time results.
- Hive's strength is processing big data; it offers no advantage on small data, again because of its high execution latency.
- Hive supports user-defined functions, so users can implement functions tailored to their own needs.
1.2.2 Cons
- HQL's expressive power is limited:
(1) iterative algorithms cannot be expressed; (2) it is not well suited to data mining.
- Hive's efficiency is relatively low:
(1) the MapReduce jobs Hive generates automatically are usually not very smart; (2) tuning Hive is difficult and coarse-grained.
1.3 Hive Architecture

- Client interfaces
CLI (hive shell), JDBC/ODBC (Java access to Hive), Web UI (browser access to Hive)
- Metadata: Metastore
1. Metadata includes table names, the database each table belongs to (default by default), the table owner, column/partition fields, the table type (e.g., whether it is an external table), and the directory where the table's data lives.
2. By default it is stored in the bundled Derby database; MySQL is recommended for storing the Metastore.
- Hadoop
HDFS is used for storage, MapReduce for computation.
- Driver
1. Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR, then analyzes the AST: does the table exist, do the columns exist, is the SQL semantically sound.
2. Compiler: compiles the AST into a logical execution plan.
3. Optimizer (Query Optimizer): optimizes the logical execution plan.
4. Executor (Execution): converts the logical plan into a runnable physical plan; for Hive, that means MR/Spark jobs.

1.4 Hive vs. Databases
Because Hive uses the SQL-like query language HQL (Hive Query Language), it is easy to mistake Hive for a database. Structurally, however, apart from the similar query language, Hive and databases have little in common. This section contrasts the two from several angles. Databases can serve online applications, whereas Hive is designed for data warehousing; keeping that distinction in mind helps in understanding Hive's characteristics from an application standpoint.
1.4.1 Query Language
Since SQL is widely used in data warehousing, the SQL-like language HQL was designed specifically around Hive's characteristics. Developers who know SQL can use Hive with little friction.
1.4.2 Data Storage Location
Hive is built on Hadoop, so all Hive data is stored in HDFS. Databases, by contrast, can keep data on block devices or the local file system.
1.4.3 Data Updates
Hive is designed for data-warehouse workloads, where data is read often and written rarely. Rewriting data is therefore discouraged in Hive; all data is fixed at load time. Data in a database usually needs frequent modification, so rows are added with INSERT INTO ... VALUES and modified with UPDATE ... SET.
1.4.4 Indexes
Hive does not process, or even scan, the data while loading it, so it builds no indexes on any keys in the data. To find rows matching a condition, Hive must brute-force scan the whole dataset, which makes access latency high. Thanks to MapReduce, however, Hive can access data in parallel, so even without indexes it still shows an advantage on large datasets. A database typically builds indexes on one or a few columns, giving it high efficiency and low latency for small, targeted lookups. This high access latency is what makes Hive unsuitable for online queries.
1.4.5 Execution
Most Hive queries execute via Hadoop's MapReduce, whereas databases usually have their own execution engines.
1.4.6 Execution Latency
When Hive queries data it has no indexes and must scan the whole table, so latency is high. The other factor is the MapReduce framework itself: MapReduce has substantial inherent latency, so Hive queries executed on it inherit it. Databases, by comparison, have low execution latency. That "low" is conditional on small data, though: once the data outgrows what the database can handle, Hive's parallel computation clearly shows its advantage.
1.4.7 Scalability
Because Hive is built on Hadoop, its scalability matches Hadoop's (the world's largest Hadoop cluster, at Yahoo!, was around 4,000 nodes in 2009). Databases, constrained by strict ACID semantics, scale far less; even Oracle, among the most advanced parallel databases, has a theoretical scaling limit of only about 100 nodes.
1.4.8 Data Scale
Since Hive runs on a cluster and can compute in parallel with MapReduce, it supports very large datasets; correspondingly, databases support comparatively small ones.
2. Hive Installation
2.1 Links
- Hive website
http://hive.apache.org/
- Documentation
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
- Downloads
http://archive.apache.org/dist/hive/
- GitHub
https://github.com/apache/hive
2.2 Installation and Deployment
2.2.1 Deployment
See the Hive deployment notes in this chapter.
2.2.2 Basic Hive Operations
1. Start Hive
[root@master root]# hive
2. List databases
hive> show databases;
3. Switch to the default database
hive> use default;
4. List the tables in the default database
hive> show tables;
5. Create a table
hive> create table student(id int, name string);
6. List tables again
hive> show tables;
7. Inspect the table's schema
hive> desc student;
8. Insert a row
hive> insert into student values(1000,"ss");
9. Query the table
hive> select * from student;
10. Exit Hive
hive> quit;
2.3 Hive JDBC
Besides its own client, Hive also supports third-party connections; below we connect via JDBC.
2.3.1 Start the hiveserver2 Service
[root@master root]# hiveserver2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Keep it in the foreground so errors are easy to spot.
2.3.2 Start beeline
[root@master ~]# beeline
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Beeline version 1.2.2 by Apache Hive
beeline> show databases;
No current connection
2.3.3 Connect to hiveserver2
beeline> !connect jdbc:hive2://master:10000
Connecting to jdbc:hive2://master:10000
Enter username for jdbc:hive2://master:10000: root
Enter password for jdbc:hive2://master:10000:
Connected to: Apache Hive (version 1.2.2)
Driver: Hive JDBC (version 1.2.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://master:10000>
Any user with permission to operate on the Hive data will do; I use root here.
0: jdbc:hive2://master:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| cmz            |
| default        |
| hivedemo       |
| loocha         |
| test           |
+----------------+--+
5 rows selected (1.519 seconds)
0: jdbc:hive2://master:10000> use loocha;
No rows affected (0.085 seconds)
0: jdbc:hive2://master:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
| stu       |
| t1        |
| test      |
+-----------+--+
3 rows selected (0.062 seconds)
0: jdbc:hive2://master:10000> select * from stu;
+-----------+------------+------------+--+
| stu.name  | stu.xueke  | stu.score  |
+-----------+------------+------------+--+
| zhangsan  | 10         | 90         |
| lisi      | 80         | 66         |
| wangwu    | 66         | 55         |
+-----------+------------+------------+--+
3 rows selected (1.047 seconds)
2.4 Common Hive Invocation Modes
[root@master ~]# hive -help
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
2.4.1 The -e Mode
Like mysql -e, the -e option executes SQL statements without entering Hive's interactive shell.
hive -e 'show databases';
Execution:
[root@master ~]# hive -e 'show databases';
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
OK
database_name
cmz
default
hivedemo
loocha
test
Time taken: 2.069 seconds, Fetched: 5 row(s)
2.4.2 The -f Mode
The -f option executes the SQL statements in a script file.
echo 'show databases'> caimengzhi.hql
hive -f caimengzhi.hql
Execution:
[root@master ~]# echo 'show databases'> caimengzhi.hql
[root@master ~]# hive -f caimengzhi.hql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
OK
database_name
cmz
default
hivedemo
loocha
test
Time taken: 1.074 seconds, Fetched: 5 row(s)
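A common follow-on (an assumption about your workflow, not shown in the log above) is to capture the script's result in a file with ordinary shell redirection:

hive -f caimengzhi.hql > result.txt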
2.5 Other Commands
- Exiting
hive(default)>exit;
hive(default)>quit;
In recent Hive versions the two are identical; in older versions they differed:
- exit: implicitly commits data first, then exits;
- quit: exits without committing;
- HDFS operations
How to browse the HDFS file system from inside the Hive CLI:
hive (default)> dfs -ls /user;
Found 2 items
drwxr-xr-x   - root supergroup          0 2019-06-24 19:03 /user/hive
drwxr-xr-x   - root supergroup          0 2019-07-04 17:10 /user/root
- Running local operating-system commands
hive (default)> !ifconfig;
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.186.10  netmask 255.255.255.0  broadcast 192.168.186.255
        inet6 fe80::9d58:5651:daa8:880a  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:c6:79:90  txqueuelen 1000  (Ethernet)
        RX packets 649585  bytes 135398513 (129.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1720289  bytes 3534604662 (3.2 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 152309  bytes 39464571 (37.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 152309  bytes 39464571 (37.6 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
hive (default)> !date;
Thu Aug 1 18:41:01 CST 2019
Note the trailing ;
- Viewing all commands ever entered in Hive
1. Go to the current user's home directory
2. Inspect the .hivehistory file
Steps:
[root@master ~]# cd
[root@master ~]# cat .hivehistory|head
create table bucket_table1 (id int) clustered by(id) into 8 buckets;
create table bucket_table2 (id int);
load data local inpath 'testbucket1.txt' into table bucket_table2;
insert into table bucket_table1 select * from table bucket_table2;
insert into table bucket_table1 select * from bucket_table2;
select * from bucket_table1 tablesample(bucket 3 out of 4 on id);
create table docs(line string);
load data inpath '/cmz/*.txt' overwrite into table docs;
create table word_count as select word,count(1) as count from
2.6 Common Hive Configuration
2.6.1 Warehouse Location
- The default warehouse lives on HDFS under /user/hive/warehouse.
- No directory is created under the warehouse for the default database itself: a table in the default database gets its directory created directly under the warehouse path.
- To change the default warehouse location, copy the following setting from hive-default.xml.template into hive-site.xml:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
Grant execute/write permission to users in the same group:
hdfs dfs -chmod g+w /user/hive/warehouse
2.6.2 Displaying Query Context
- Adding the following to hive-site.xml makes the CLI show the current database and print column headers in query results:
<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
- Restart Hive and compare before and after.
Before, the output looks like this:
hive > show databases;
OK
database_name
cmz
default
hivedemo
loocha
test
Time taken: 0.136 seconds, Fetched: 5 row(s)
After restarting with the new configuration:
hive (default)> show databases;
OK
database_name
cmz
default
hivedemo
loocha
test
Time taken: 0.136 seconds, Fetched: 5 row(s)
hive (default)> use test;
hive (test)> select * from student1;
OK
student1.name student1.chinese student1.math student1.english
zhangsan NULL 90 60
lisi 80 66 77
wangwu 66 55 80
Time taken: 0.067 seconds, Fetched: 3 row(s)
The prompt now shows the current database name (default, then test above), and query output includes column headers.
2.6.3 Hive Log Configuration
- By default Hive writes its log to /tmp/<username>/hive.log (here /tmp/root/hive.log).
- To move the log to /usr/local/src/logs/hive/logs:
Rename hive-log4j.properties.template (in Hive's conf directory) to hive-log4j.properties:
[root@master conf]# pwd
/usr/local/hive/conf
[root@master conf]# mv hive-log4j.properties.template hive-log4j.properties
- Then set the log location in hive-log4j.properties:
hive.log.dir=/usr/local/src/logs/hive/logs
2.6.4 Setting Parameters
2.6.4.1 Viewing All Current Settings
hive> set;
2.6.4.2 Three Ways to Set Parameters
- Configuration files
Default configuration file: hive-default.xml
User-defined configuration file: hive-site.xml
Note: user-defined settings override the defaults. Hive also reads Hadoop's configuration, since Hive starts as a Hadoop client, and Hive's settings override Hadoop's. Settings made in configuration files apply to every Hive process started on the machine.
2.6.4.3 Command-Line Parameters
- When starting Hive, parameters can be set on the command line with -hiveconf param=value:
hive -hiveconf hive.cli.print.current.db=true;
Full run:
[root@master conf]# hive -hiveconf hive.cli.print.current.db=true
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive (default)>
[root@master ~]# hive -hiveconf hive.cli.print.current.db=false;
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.5/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive>
Compare the two prompts: hive (default)> versus hive>.
Note: this only applies to the current Hive session.
- Parameter declaration
Parameters can also be set with the SET keyword inside HQL:
hive> set hive.cli.print.current.db=true;
hive (default)>
Note: this, too, only applies to the current Hive session.
To check a parameter's current value:
hive (default)> set hive.cli.print.current.db;
hive.cli.print.current.db=true
To see everything:
hive (default)> set;
(output omitted)
The three methods above take effect in increasing order of precedence: configuration file < command-line parameter < SET declaration. Certain system-level parameters, such as the log4j settings, must be set by the first two methods, because they are read before the session is established.
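A quick way to convince yourself of the precedence (a sketch; hive.cli.print.header is just an example property):

-- inside a session, a SET declaration wins for the rest of that session,
-- regardless of what hive-site.xml or -hiveconf said at startup
set hive.cli.print.header=true;
set hive.cli.print.header;   -- prints: hive.cli.print.header=true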
3. Hive Data Types
3.1 Primitive Types
Hive type | Java type | Size | Example |
---|---|---|---|
TINYINT | byte | 1-byte signed integer | 20 |
SMALLINT | short | 2-byte signed integer | 20 |
INT | int | 4-byte signed integer | 20 |
BIGINT | long | 8-byte signed integer | 20 |
BOOLEAN | boolean | boolean, true or false | TRUE FALSE |
FLOAT | float | single-precision floating point | 3.14159 |
DOUBLE | double | double-precision floating point | 3.14159 |
STRING | string | character sequence; a character set may be specified; single or double quotes may be used | 'now is the time' "for all good men" |
TIMESTAMP | | timestamp | |
BINARY | | byte array | |
Hive's STRING type corresponds to a database VARCHAR: a variable-length string, except that no maximum length can be declared; in theory it can hold up to 2 GB of characters.
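As a small sketch, a table using several of these primitive types (the table and column names are illustrative):

create table t_types(
  id       int,
  score    double,
  name     string,
  active   boolean,
  created  timestamp
);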
3.2 Collection Types
Type | Description | Syntax |
---|---|---|
STRUCT | Like a C struct; elements are accessed with dot notation. For example, if a column's type is STRUCT{first STRING, last STRING}, its first element is referenced as column.first | struct() |
MAP | A set of key-value pairs, accessed with array notation. For example, if a column's type is MAP with pairs 'first'->'John' and 'last'->'Doe', then column['last'] retrieves the last element | map() |
ARRAY | An ordered collection of elements of the same type, indexed from zero. For example, with the array value ['John', 'Doe'], the second element is referenced as column[1] | array() |
Hive has three complex types: ARRAY, MAP and STRUCT. ARRAY and MAP are analogous to Java's Array and Map, while STRUCT is analogous to a C struct: it encapsulates a named set of fields. Complex types may be nested to arbitrary depth.
Worked example:
- Suppose the table contains one row whose structure, expressed as JSON, looks like this when accessed from Hive:
{
  "name": "summer",
  "friends": ["leco", "ICPC"],        // list: ARRAY
  "children": {                        // key-value pairs: MAP
    "loocha": 20,
    "realcloud": 30
  },
  "address": {                         // struct: STRUCT
    "street": "南京新模范马路",
    "city": "南京"
  }
}
- Based on this structure, we create the corresponding Hive table and load data into it.
Create a local test file test.txt:
summer,leco_ICPC,loocha:20_realcloud:30,南京新模范马路_南京
Note: the element separators inside MAP, STRUCT and ARRAY can all be the same character; here we use "_".
- Create the test table summer in Hive
create table summer(
  name string,
  friends array<string>,
  children map<string, int>,
  address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';
Clause notes:
row format delimited fields terminated by ','  -- column separator
collection items terminated by '_'             -- element separator for MAP, STRUCT and ARRAY
map keys terminated by ':'                     -- separator between a MAP key and its value
lines terminated by '\n';                      -- row separator
MAP, STRUCT and ARRAY share a single collection separator; if the raw data uses different ones, it must be cleaned beforehand.
Detailed session:
hive (default)> create table summer(
              >   name string,
              >   friends array<string>,
              >   children map<string, int>,
              >   address struct<street:string, city:string>
              > )
              > row format delimited fields terminated by ','
              > collection items terminated by '_'
              > map keys terminated by ':'
              > lines terminated by '\n';
OK
Time taken: 0.348 seconds
hive (default)> show tables like 'summer';
OK
tab_name
summer
Time taken: 0.05 seconds, Fetched: 1 row(s)
hive (default)> desc summer;
OK
col_name data_type comment
name string
friends array<string>
children map<string,int>
address struct<street:string,city:string>
Time taken: 0.267 seconds, Fetched: 4 row(s)
- Load the text data into the test table
[root@master test]# ls /root/hive/sgg/test/test.txt
/root/hive/sgg/test/test.txt
[root@master test]# cat /root/hive/sgg/test/test.txt
summer,leco_ICPC,loocha:20_realcloud:30,南京新模范马路_南京
hive (default)> load data local inpath '/root/hive/sgg/test/test.txt' into table summer;
Loading data to table default.summer
Table default.summer stats: [numFiles=1, totalSize=69]
OK
Time taken: 0.756 seconds
- Query the three collection columns; the following shows the ARRAY, MAP and STRUCT access syntax in turn
hive (default)> select * from summer;
OK
summer.name summer.friends summer.children summer.address
summer ["leco","ICPC"] {"loocha":20,"realcloud":30} {"street":"南京新模范马路","city":"南京"}
Time taken: 0.057 seconds, Fetched: 1 row(s)
hive (default)> select friends[1],children['loocha'],address.street from summer where name='summer';
OK
_c0 _c1 street
ICPC 20 南京新模范马路
Time taken: 0.053 seconds, Fetched: 1 row(s)
The same from beeline:
0: jdbc:hive2://master:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| cmz            |
| default        |
| hivedemo       |
| loocha         |
| test           |
+----------------+--+
5 rows selected (0.071 seconds)
0: jdbc:hive2://master:10000> use default;
No rows affected (0.05 seconds)
0: jdbc:hive2://master:10000> show tables;
+---------------+--+
| tab_name      |
+---------------+--+
| cmz_external  |
| student       |
| student2      |
| summer        |
| u_info        |
| visters_wai   |
+---------------+--+
6 rows selected (0.059 seconds)
0: jdbc:hive2://master:10000> select * from summer;
+--------------+------------------+-------------------------------+-----------------------------------+--+
| summer.name  | summer.friends   | summer.children               | summer.address                    |
+--------------+------------------+-------------------------------+-----------------------------------+--+
| summer       | ["leco","ICPC"]  | {"loocha":20,"realcloud":30}  | {"street":"南京新模范马路","city":"南京"}  |
+--------------+------------------+-------------------------------+-----------------------------------+--+
1 row selected (0.205 seconds)
0: jdbc:hive2://master:10000> select friends[1],children['loocha'],address.street from summer where name='summer';
+-------+------+----------+--+
| _c0   | _c1  | street   |
+-------+------+----------+--+
| ICPC  | 20   | 南京新模范马路  |
+-------+------+----------+--+
1 row selected (0.437 seconds)
3.3 Type Conversion
Hive's primitive types can be implicitly converted, much like Java's type promotion: if an expression expects INT, a TINYINT is automatically promoted to INT. Hive will not convert in the narrowing direction: if an expression expects TINYINT, an INT is not converted down, and an error results unless CAST is used.
3.3.1 Implicit Conversion Rules
- Any integer type can be implicitly converted to a wider type: TINYINT to INT, INT to BIGINT, and so on.
- All integer types, FLOAT, and STRING can be implicitly converted to DOUBLE.
- TINYINT, SMALLINT, and INT can all be converted to FLOAT.
- BOOLEAN cannot be converted to any other type.
3.3.2 Explicit Conversion with CAST
For example, CAST('1' AS INT) converts the string '1' to the integer 1. If the cast fails, as in CAST('X' AS INT), the expression returns NULL.
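A few one-liners illustrating both behaviors (a sketch; results follow Hive's standard semantics):

select 1 + 2.0;            -- INT + DOUBLE: the INT widens, result is 3.0
select cast('1' as int);   -- 1
select cast('X' as int);   -- NULL, not an error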
4. DDL Operations
4.1 Creating Databases
- Create a database; its default storage path on HDFS is /user/hive/warehouse/*.db.
hive (default)> create database db_hive;
- To avoid an error when the database already exists, add an if not exists clause (the standard form):
hive (default)> create database db_hive;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database db_hive already exists
hive (default)> create database if not exists db_hive;
- Create a database at a specific HDFS location:
hive (default)> create database cmz_hive2 location '/cmz_hive2.db';
OK
Time taken: 0.092 seconds
hive (default)> dfs -ls /cmz_hive2.db;
For databases we create, the system appends .db to the name.
If no location is specified, the database is stored under the default /user/hive/warehouse/
[the Hive data path set in the configuration file]:
hive (default)> dfs -ls /user/hive/warehouse;
Found 11 items
drwxr-xr-x - root supergroup 0 2019-07-30 14:06 /user/hive/warehouse/cmz.db
drwxr-xr-x - root supergroup 0 2019-06-24 20:23 /user/hive/warehouse/cmz_external
drwxr-xr-x - root supergroup 0 2019-08-01 19:40 /user/hive/warehouse/cmz_hive3.db
drwxr-xr-x - root supergroup 0 2019-07-28 23:39 /user/hive/warehouse/hivedemo.db
drwxr-xr-x - root supergroup 0 2019-07-30 20:37 /user/hive/warehouse/loocha.db
drwxr-xr-x - root supergroup 0 2019-08-01 13:47 /user/hive/warehouse/student
drwxr-xr-x - root supergroup 0 2019-08-01 14:13 /user/hive/warehouse/student2
drwxr-xr-x - root supergroup 0 2019-08-01 19:31 /user/hive/warehouse/summer
drwxr-xr-x - root supergroup 0 2019-07-30 16:52 /user/hive/warehouse/test.db
drwxr-xr-x - root supergroup 0 2019-07-03 16:33 /user/hive/warehouse/u_info
drwxr-xr-x - root supergroup 0 2019-07-28 22:42 /user/hive/warehouse/visters_wai
4.2 Querying Databases
4.2.1 Listing Databases
- List all databases
hive (default)> show databases;
OK
database_name
cmz
cmz_hive2
cmz_hive3
default
hivedemo
loocha
test
Time taken: 0.013 seconds, Fetched: 7 row(s)
- Filter the listing with a pattern
hive (default)> show databases like 'cmz*';
OK
database_name
cmz
cmz_hive2
cmz_hive3
Time taken: 0.013 seconds, Fetched: 3 row(s)
Much like MySQL, wildcard patterns are supported.
4.2.2 Viewing Details
- Show basic information
hive (cmz)> use cmz;
OK
Time taken: 0.025 seconds
hive (cmz)> show tables;
OK
tab_name
dept
emp
student2
Time taken: 0.031 seconds, Fetched: 3 row(s)
hive (cmz)> desc emp;
OK
col_name data_type comment
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int
Time taken: 0.078 seconds, Fetched: 8 row(s)
- Show extended details with extended
hive (cmz)> desc extended emp;
OK
col_name data_type comment
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int

Detailed Table Information Table(tableName:emp, dbName:cmz, owner:root, createTime:1564466299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:empno, type:int, comment:null), FieldSchema(name:ename, type:string, comment:null), FieldSchema(name:job, type:string, comment:null), FieldSchema(name:mgr, type:int, comment:null), FieldSchema(name:hiredate, type:string, comment:null), FieldSchema(name:sal, type:double, comment:null), FieldSchema(name:comm, type:double, comment:null), FieldSchema(name:deptno, type:int, comment:null)], location:hdfs://master:9000/user/hive/warehouse/cmz.db/emp, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim= , serialization.format=
Time taken: 0.08 seconds, Fetched: 10 row(s)
extended also reveals the object's detailed, custom properties.
4.2.3 Switching Databases
hive (cmz)> show databases;
OK
database_name
cmz
cmz_hive2
cmz_hive3
default
hivedemo
loocha
test
Time taken: 0.011 seconds, Fetched: 7 row(s)
hive (cmz)> use loocha;
OK
Time taken: 0.035 seconds
use <database>; switches to that database, just as in MySQL.
4.3 Altering Databases
With ALTER DATABASE you can set key-value pairs in a database's DBPROPERTIES to describe it. All other database metadata is immutable, including the database name and its directory location.
hive (loocha)> alter database loocha set dbproperties('createtime'='20190802');
OK
Time taken: 0.035 seconds
hive (loocha)> !date;
Thu Aug 1 19:51:14 CST 2019
hive (loocha)> desc database extended loocha;
OK
db_name comment location owner_name owner_type parameters
loocha hdfs://master:9000/user/hive/warehouse/loocha.db root USER {createtime=20190802}
Time taken: 0.021 seconds, Fetched: 1 row(s)
The createtime property has been added.
4.4 Dropping Databases
- Dropping an empty database
hive (loocha)> create database realcloud;
OK
Time taken: 0.045 seconds
hive (loocha)> drop database realcloud;
OK
Time taken: 0.2 seconds
The database is empty: it contains no tables or data.
- If the database to drop may not exist, it is best to guard with if exists:
create database icpc;
use icpc;
create table cmz(id int);
insert into table cmz values(1);
select * from cmz;
drop database icpc;
Full run:
hive (loocha)> create database icpc;
OK
Time taken: 0.024 seconds
hive (loocha)> use icpc;
OK
Time taken: 0.02 seconds
hive (icpc)> create table cmz(id int);
OK
Time taken: 0.051 seconds
hive (icpc)> insert into table cmz values(1);
Query ID = root_20190801195511_2864592f-ce26-4e81-93b5-92ee44b7ac98
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1564639643032_0002, Tracking URL = http://master:8088/proxy/application_1564639643032_0002/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564639643032_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-08-01 19:55:19,006 Stage-1 map = 0%, reduce = 0%
2019-08-01 19:55:34,716 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.57 sec
MapReduce Total cumulative CPU time: 7 seconds 570 msec
Ended Job = job_1564639643032_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/icpc.db/cmz/.hive-staging_hive_2019-08-01_19-55-11_294_7002636935651063480-1/-ext-10000
Loading data to table icpc.cmz
Table icpc.cmz stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 7.57 sec HDFS Read: 3333 HDFS Write: 66 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 570 msec
OK
_col0
Time taken: 24.725 seconds
hive (icpc)> select * from cmz;
OK
cmz.id
1
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive (icpc)> drop database icpc;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database icpc is not empty. One or more tables exist.)
- If the database is not empty, force removal with cascade:
hive (cmz)> drop database icpc;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database icpc is not empty. One or more tables exist.)
hive (cmz)> drop database icpc cascade;
OK
Time taken: 0.594 seconds
4.5 Creating Tables
Syntax:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
Clause notes:
1. CREATE TABLE creates a table with the given name. If a table of that name already exists, an exception is thrown; the IF NOT EXISTS option suppresses it.
2. EXTERNAL creates an external table; alongside the table definition you point LOCATION at the actual data path. When creating an internal (managed) table, Hive moves the data into the warehouse path; for an external table it only records where the data lives, without moving anything. On drop, a managed table's metadata and data are both deleted; dropping an external table removes only the metadata, not the data.
3. COMMENT: adds comments to the table and its columns.
4. PARTITIONED BY creates a partitioned table.
5. CLUSTERED BY creates a bucketed table.
6. SORTED BY is rarely used.
7. ROW FORMAT DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
Users can define their own SerDe or use a built-in one; if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. When creating a table you specify its columns; with a custom SerDe, Hive uses the SerDe to determine the table's actual column data. SerDe is short for Serialize/Deserialize, used for serialization and deserialization.
8. STORED AS specifies the storage file format. Common formats: SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), RCFILE (columnar). For plain-text data use STORED AS TEXTFILE; for data that needs compression, use STORED AS SEQUENCEFILE.
9. LOCATION: specifies the table's storage location on HDFS.
10. LIKE copies an existing table's structure without copying its data.
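Putting several of these clauses together (a hypothetical table; every name and path here is illustrative):

create external table if not exists page_view(
  view_time  bigint  comment 'epoch millis of the view',
  user_id    bigint,
  page_url   string
)
comment 'raw page view log'
partitioned by (dt string)
row format delimited fields terminated by ','
stored as textfile
location '/data/page_view';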
4.5.1 Managed Tables
4.5.1.1 Definition
Tables created by default are so-called managed tables, sometimes called internal tables, because for these tables Hive controls (more or less) the lifecycle of the data. Hive stores their data under a subdirectory of the directory set by hive.metastore.warehouse.dir (for example /user/hive/warehouse). When we drop a managed table, Hive deletes the table's data as well. Managed tables are not well suited to sharing data with other tools.
4.5.1.2 Examples
- Creating a table the ordinary way
cat>students.txt<<EOF
1,春天
2,夏天
3,秋天
4,冬天
EOF
hadoop fs -mkdir /user/hive/warehouse/student
hadoop fs -put students.txt /user/hive/warehouse/student/
hadoop fs -text /user/hive/warehouse/student/students.txt

create table if not exists student(
  id int,
  name string
) row format delimited fields terminated by ','
stored as textfile
location '/user/hive/warehouse/student/';
Detailed session:
[root@master test]# cat>students.txt<<EOF
> 1,春天
> 2,夏天
> 3,秋天
> 4,冬天
> EOF
[root@master test]# hadoop fs -mkdir /user/hive/warehouse/student
[root@master test]# hadoop fs -put students.txt /user/hive/warehouse/student/
[root@master test]# hadoop fs -text /user/hive/warehouse/student
text: `/user/hive/warehouse/student': Is a directory
[root@master test]# hadoop fs -text /user/hive/warehouse/student/students.txt
1,春天
2,夏天
3,秋天
4,冬天
hive (cmz)> show tables;
OK
tab_name
dept
emp
Time taken: 0.02 seconds, Fetched: 2 row(s)
hive (cmz)> create table if not exists student(
          >   id int,
          >   name string
          > ) row format delimited fields terminated by ','
          > stored as textfile
          > location '/user/hive/warehouse/student/';
OK
Time taken: 0.059 seconds
hive (cmz)> show tables;
OK
tab_name
dept
emp
student
Time taken: 0.046 seconds, Fetched: 3 row(s)
hive (cmz)> select * from student;
OK
student.id student.name
1 春天
2 夏天
3 秋天
4 冬天
Time taken: 0.051 seconds, Fetched: 4 row(s)
- Creating a table from a query result (the query result populates the new table)
create table if not exists student2 as select id, name from student;
Full run:
hive (cmz)> create table if not exists student2 as select id, name from student;
Query ID = root_20190801201833_eabee9e0-5ad5-4a79-a2a4-d6cc906e941f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1564639643032_0003, Tracking URL = http://master:8088/proxy/application_1564639643032_0003/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564639643032_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-08-01 20:18:43,107 Stage-1 map = 0%, reduce = 0%
2019-08-01 20:18:59,156 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.71 sec
MapReduce Total cumulative CPU time: 10 seconds 710 msec
Ended Job = job_1564639643032_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/cmz.db/.hive-staging_hive_2019-08-01_20-18-33_987_886327310745491803-1/-ext-10001
Moving data to: hdfs://master:9000/user/hive/warehouse/cmz.db/student2
Table cmz.student2 stats: [numFiles=1, numRows=4, totalSize=36, rawDataSize=32]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 10.71 sec HDFS Read: 2879 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 710 msec
OK
id name
Time taken: 27.412 seconds
hive (cmz)> select * from student2;
OK
student2.id student2.name
1 春天
2 夏天
3 秋天
4 冬天
Time taken: 0.059 seconds, Fetched: 4 row(s)
- Creating a table from an existing table's structure
hive (cmz)> create table if not exists student3 like student;
OK
Time taken: 0.095 seconds
hive (cmz)> select * from student3;
OK
student3.id student3.name
Time taken: 0.045 seconds
hive (cmz)> desc student;
OK
col_name data_type comment
id int
name string
Time taken: 0.048 seconds, Fetched: 2 row(s)
Just like MySQL: only the structure is copied, not the data.
- Checking a table's type
hive (cmz)> desc formatted student2;
OK
col_name data_type comment
# col_name data_type comment
id int
name string

# Detailed Table Information
Database: cmz
Owner: root
CreateTime: Thu Aug 01 20:19:01 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/cmz.db/student2
Table Type: MANAGED_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	true
	numFiles	1
	numRows	4
	rawDataSize	32
	totalSize	36
	transient_lastDdlTime	1564661941

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	serialization.format	1
Time taken: 0.067 seconds, Fetched: 32 row(s)
hive (cmz)>
Table Type: MANAGED_TABLE marks a managed table, also called an internal table.
4.5.2 External Tables
4.5.2.1 Definition
Because the table is external, Hive does not consider itself the sole owner of the data. Dropping the table does not delete the data; only the metadata describing the table is removed.
4.5.2.2 When to Use Managed vs. External Tables
Collected website logs flow into HDFS text files daily. Heavy statistical analysis is done on top of an external table (the raw log table); the intermediate and result tables produced along the way are stored as managed tables, populated via SELECT + INSERT.
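A sketch of that pattern (all names and paths here are illustrative, not taken from the logs in this chapter):

-- raw logs stay where the collector wrote them
create external table if not exists raw_log(line string)
location '/flume/logs/raw_log';

-- derived results live in a managed table, filled via SELECT + INSERT
create table if not exists log_line_count(dt string, cnt bigint);

insert into table log_line_count
select '2019-08-01', count(*) from raw_log;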
4.5.2.3 Example
Create department and employee external tables and load data into them.
- Raw data
cat>dept.txt<<EOF
10,ACCOUNTING,1700
20,RESEARCH,1800
30,SALES,1900
40,OPERATIONS,1700
EOF
cat>emp.txt<<EOF
7369,SMITH,CLERK,7902,1980-12-17,800.00,20
7499,ALLEN,SALESMAN,7698,1981-2-20,1600.00,300.00,30
7521,WARD,SALESMAN,7698,1981-2-22,1250.00,500.00,30
7566,JONES,MANAGER,7839,1981-4-2,2975.00,20
7654,MARTIN,SALESMAN,7698,1981-9-28,1250.00,1400.00,30
7698,BLAKE,MANAGER,7839,1981-5-1,2850.00,30
7782,CLARK,MANAGER,7839,1981-6-9,2450.00,10
7788,SCOTT,ANALYST,7566,1987-4-19,3000.00,20
7839,KING,PRESIDENT,1981-11-17,5000.00,10
7844,TURNER,SALESMAN,7698,1981-9-8,1500.00,0.00,30
7876,ADAMS,CLERK,7788,1987-5-23,1100.00,20
7900,JAMES,CLERK,7698,1981-12-3,950.00,30
7902,FORD,ANALYST,7566,1981-12-3,3000.00,20
7934,MILLER,CLERK,7782,1982-1-23,1300.00,10
EOF
- HQL
-- department table
create external table if not exists dept(
  deptno int,
  dname string,
  loc int
)
row format delimited fields terminated by ',';
-- employee table
create external table if not exists emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
)
row format delimited fields terminated by ',';
load data local inpath '/root/hive/sgg/test/emp.txt' into table emp;
load data local inpath '/root/hive/sgg/test/dept.txt' into table dept;
详细操作过程
[root@master test]# cat>dept.txt<<EOF > 10,ACCOUNTING,1700 > 20,RESEARCH,1800 > 30,SALES,1900 > 40,OPERATIONS,1700 > EOF 7566,JONES,MANAGER,7839,1981-4-2,2975.00,20 7654,MARTIN,SALESMAN,7698,1981-9-28,1250.00,1400.00,30 7698,BLAKE,MANAGER,7839,1981-5-1,2850.00,30 7782,CLARK,MANAGER,7839,1981-6-9,2450.00,10 7788,SCOTT,ANALYST,7566,1987-4-19,3000.00,20 7839,KING,PRESIDENT,1981-11-17,5000.00,10 7844,TURNER,SALESMAN,7698,1981-9-8,1500.00,0.00,30 7876,ADAMS,CLERK,7788,1987-5-23,1100.00,20 7900,JAMES,CLERK,7698,1981-12-3,950.00,30 7902,FORD,ANALYST,7566,1981-12-3,3000.00,20 7934,MILLER,CLERK,7782,1982-1-23,1300.00,10 EOF[root@master test]# cat>emp.txt<<EOF > 7369,SMITH,CLERK,7902,1980-12-17,800.00,20 > 7499,ALLEN,SALESMAN,7698,1981-2-20,1600.00,300.00,30 > 7521,WARD,SALESMAN,7698,1981-2-22,1250.00,500.00,30 > 7566,JONES,MANAGER,7839,1981-4-2,2975.00,20 > 7654,MARTIN,SALESMAN,7698,1981-9-28,1250.00,1400.00,30 > 7698,BLAKE,MANAGER,7839,1981-5-1,2850.00,30 > 7782,CLARK,MANAGER,7839,1981-6-9,2450.00,10 > 7788,SCOTT,ANALYST,7566,1987-4-19,3000.00,20 > 7839,KING,PRESIDENT,1981-11-17,5000.00,10 > 7844,TURNER,SALESMAN,7698,1981-9-8,1500.00,0.00,30 > 7876,ADAMS,CLERK,7788,1987-5-23,1100.00,20 > 7900,JAMES,CLERK,7698,1981-12-3,950.00,30 > 7902,FORD,ANALYST,7566,1981-12-3,3000.00,20 > 7934,MILLER,CLERK,7782,1982-1-23,1300.00,10 > EOF [root@master test]# ls dept.txt emp.txt students.txt test.txt hive (cmz)> create database icpc; OK Time taken: 0.048 seconds hive (cmz)> use icpc; hive (icpc)> -- 创建部门表 > create external table if not exists dept( > deptno int, > dname string, > loc int > ) > row format delimited fields terminated by ','; OK Time taken: 0.058 seconds hive (icpc)> > -- 创建员工表 > create external table if not exists emp( > empno int, > ename string, > job string, > mgr int, > hiredate string, > sal double, > comm double, > deptno int > ) > row format delimited fields terminated by ','; OK Time taken: 0.048 seconds hive (icpc)> show tables; OK tab_name dept emp Time taken: 0.02 seconds, Fetched: 2 row(s) hive (icpc)> desc dept; OK col_name data_type comment deptno int dname string loc int Time taken: 0.05 seconds, Fetched: 3 row(s) hive (icpc)> desc emp; OK col_name data_type comment empno int ename string job string mgr int hiredate string sal double comm double deptno int Time taken: 0.06 seconds, Fetched: 8 row(s)
导入数据
hive (icpc)> load data local inpath '/root/hive/sgg/test/emp.txt' into table emp; Loading data to table icpc.emp Table icpc.emp stats: [numFiles=1, totalSize=646] OK Time taken: 0.178 seconds hive (icpc)> load data local inpath '/root/hive/sgg/test/dept.txt' into table dept; Loading data to table icpc.dept Table icpc.dept stats: [numFiles=1, totalSize=69] OK Time taken: 0.261 seconds
- Check the results
hive (icpc)> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 20.0 NULL
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 20.0 NULL
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 30.0 NULL
7782 CLARK MANAGER 7839 1981-6-9 2450.0 10.0 NULL
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 20.0 NULL
7839 KING PRESIDENT NULL 5000.00 10.0 NULL NULL
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 20.0 NULL
7900 JAMES CLERK 7698 1981-12-3 950.0 30.0 NULL
7902 FORD ANALYST 7566 1981-12-3 3000.0 20.0 NULL
7934 MILLER CLERK 7782 1982-1-23 1300.0 10.0 NULL
Time taken: 0.045 seconds, Fetched: 14 row(s)
hive (icpc)> select * from dept;
OK
dept.deptno dept.dname dept.loc
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
Time taken: 0.111 seconds, Fetched: 4 row(s)
- Check the table metadata
hive (icpc)> desc formatted dept;
OK
col_name data_type comment
# col_name data_type comment
deptno int
dname string
loc int

# Detailed Table Information
Database: icpc
Owner: root
CreateTime: Thu Aug 01 20:32:38 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/icpc.db/dept
Table Type: EXTERNAL_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	true
	EXTERNAL	TRUE
	numFiles	1
	totalSize	69
	transient_lastDdlTime	1564662921

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	field.delim	,
	serialization.format	,
Time taken: 0.072 seconds, Fetched: 33 row(s)
hive (icpc)> desc formatted emp;
OK
col_name data_type comment
# col_name data_type comment
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int

# Detailed Table Information
Database: icpc
Owner: root
CreateTime: Thu Aug 01 20:32:39 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/icpc.db/emp
Table Type: EXTERNAL_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	true
	EXTERNAL	TRUE
	numFiles	1
	totalSize	646
	transient_lastDdlTime	1564662873

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	field.delim	,
	serialization.format	,
Time taken: 0.055 seconds, Fetched: 38 row(s)
4.5.3 Converting Between Managed and External Tables
4.5.3.1 Check the Table's Current Type
hive (icpc)> desc formatted emp;
Table Type: MANAGED_TABLE
4.5.3.2 Convert the Managed Table emp to External
hive (icpc)> alter table emp set tblproperties ('EXTERNAL'='TRUE');
OK
set tblproperties ('EXTERNAL'='TRUE'); is a fixed idiom and is case-sensitive.
hive (icpc)> desc formatted emp;
Table Type: EXTERNAL_TABLE
4.5.3.3 Convert the External Table Back to Managed
hive (icpc)> alter table emp set tblproperties ('EXTERNAL'='FALSE');
hive (icpc)> desc formatted emp;
Table Type: MANAGED_TABLE
Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed spellings and case-sensitive!
4.6 Partitioned Tables
A partitioned table corresponds to a separate directory on the HDFS file system per partition, containing all of that partition's data files. A Hive partition is simply a subdirectory: a large dataset is split into smaller ones along business lines. Selecting only the required partitions via expressions in the WHERE clause makes such queries much more efficient.
4.6.1 Why Partition
Suppose logs must be managed by date:
/user/hive/warehouse/log_partition/20190801/20190801.log
/user/hive/warehouse/log_partition/20190802/20190802.log
/user/hive/warehouse/log_partition/20190803/20190803.log
4.6.2 Creating a Partitioned Table
cat>dept_partition.txt<<EOF
1,张三,南京
2,李四,北京
3,王五,上海
4,赵六,宿迁
EOF

create table dept_partition(
  id int,
  name string,
  adress string
)
partitioned by (month string)
row format delimited fields terminated by ',';

load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190801');
load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190802');
load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190803');
Detailed session:
[root@master test]# cat>dept_partition.txt<<EOF
> 1,张三,南京
> 2,李四,北京
> 3,王五,上海
> 4,赵六,宿迁
> EOF
[root@master test]# pwd
/root/hive/sgg/test
hive (icpc)> use icpc;
OK
Time taken: 0.02 seconds
hive (icpc)> show tables;
OK
tab_name
dept
emp
Time taken: 0.027 seconds, Fetched: 2 row(s)
hive (icpc)> create table dept_partition(
           >   id int,
           >   name string,
           >   adress string
           > )
           > partitioned by (month string)
           > row format delimited fields terminated by ',';
OK
Time taken: 1.103 seconds
hive (icpc)> show tables;
OK
tab_name
dept
dept_partition
emp
Time taken: 0.027 seconds, Fetched: 3 row(s)
Load the data:
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190801');
Loading data to table icpc.dept_partition partition (month=20190801)
Partition icpc.dept_partition{month=20190801} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 4.839 seconds
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190802');
Loading data to table icpc.dept_partition partition (month=20190802)
Partition icpc.dept_partition{month=20190802} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 3.861 seconds
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190803');
Loading data to table icpc.dept_partition partition (month=20190803)
Partition icpc.dept_partition{month=20190803} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 1.613 seconds
Check the directories on HDFS:
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition/;
Found 3 items
drwxr-xr-x - root supergroup 0 2019-08-01 22:42 /user/hive/warehouse/icpc.db/dept_partition/month=20190801
drwxr-xr-x - root supergroup 0 2019-08-01 22:43 /user/hive/warehouse/icpc.db/dept_partition/month=20190802
drwxr-xr-x - root supergroup 0 2019-08-01 22:43 /user/hive/warehouse/icpc.db/dept_partition/month=20190803
4.6.3 Querying Partitioned Data
4.6.3.1 Single-Partition Queries
hive (icpc)> select * from dept_partition where month='20190801';
OK
dept_partition.id dept_partition.name dept_partition.adress dept_partition.month
1 张三 南京 20190801
2 李四 北京 20190801
3 王五 上海 20190801
4 赵六 宿迁 20190801
Time taken: 0.078 seconds, Fetched: 4 row(s)
hive (icpc)> select * from dept_partition where month='20190802';
OK
dept_partition.id dept_partition.name dept_partition.adress dept_partition.month
1 张三 南京 20190802
2 李四 北京 20190802
3 王五 上海 20190802
4 赵六 宿迁 20190802
Time taken: 0.227 seconds, Fetched: 4 row(s)
hive (icpc)> select * from dept_partition where month='20190803';
OK
dept_partition.id dept_partition.name dept_partition.adress dept_partition.month
1 张三 南京 20190803
2 李四 北京 20190803
3 王五 上海 20190803
4 赵六 宿迁 20190803
Time taken: 0.078 seconds, Fetched: 4 row(s)
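To convince yourself that such a query only reads the matching directory (partition pruning), you can inspect the query plan; a quick check (EXPLAIN is standard HiveQL, and with EXTENDED the plan should reference only the month=20190801 path):

explain extended
select * from dept_partition where month='20190801';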
4.6.3.2 Multi-Partition Union Queries
select * from dept_partition where month='20190801'
union
select * from dept_partition where month='20190802'
union
select * from dept_partition where month='20190803';
Full run:
hive (icpc)> select * from dept_partition where month='20190801'
           > union
           > select * from dept_partition where month='20190802'
           > union
           > select * from dept_partition where month='20190803';
Query ID = root_20190801224901_c3e3490f-9e8a-4c3c-87aa-f12926f6e21e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564669682555_0001, Tracking URL = http://master:8088/proxy/application_1564669682555_0001/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564669682555_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2019-08-01 22:49:10,740 Stage-1 map = 0%, reduce = 0%
2019-08-01 22:49:19,302 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.08 sec
2019-08-01 22:49:29,129 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.44 sec
MapReduce Total cumulative CPU time: 9 seconds 440 msec
Ended Job = job_1564669682555_0001
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564669682555_0002, Tracking URL = http://master:8088/proxy/application_1564669682555_0002/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564669682555_0002
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2019-08-01 22:49:43,545 Stage-2 map = 0%, reduce = 0%
2019-08-01 22:49:52,247 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 5.0 sec
2019-08-01 22:50:03,219 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 9.11 sec
MapReduce Total cumulative CPU time: 9 seconds 110 msec
Ended Job = job_1564669682555_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 9.44 sec HDFS Read: 15383 HDFS Write: 424 SUCCESS
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 9.11 sec HDFS Read: 15079 HDFS Write: 444 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 550 msec
OK
_u3.id _u3.name _u3.adress _u3.month
1 张三 南京 20190801
1 张三 南京 20190802
1 张三 南京 20190803
2 李四 北京 20190801
2 李四 北京 20190802
2 李四 北京 20190803
3 王五 上海 20190801
3 王五 上海 20190802
3 王五 上海 20190803
4 赵六 宿迁 20190801
4 赵六 宿迁 20190802
4 赵六 宿迁 20190803
Time taken: 62.651 seconds, Fetched: 12 row(s)
4.6.3.3 Adding Partitions
- Adding a single partition
hive (icpc)> alter table dept_partition add partition(month='20190804');
OK
Time taken: 0.2 seconds
- Adding several partitions at once
hive (icpc)> alter table dept_partition add partition(month='20190805',month='20190806');
OK
Time taken: 0.122 seconds
Note: the correct syntax for adding multiple partitions separates the specs with spaces, not a comma inside one spec: alter table dept_partition add partition(month='20190805') partition(month='20190806');
4.6.3.4 Dropping Partitions
- Dropping a single partition
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
month=20190805
Drop the partition:
hive (icpc)> alter table dept_partition drop partition (month='20190805');
Dropped the partition month=20190805
OK
Time taken: 0.247 seconds
hive (icpc)> show partitions dept_partion;
FAILED: SemanticException [Error 10001]: Table not found dept_partion
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
Time taken: 0.065 seconds, Fetched: 4 row(s)
- Dropping several partitions
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
Time taken: 0.086 seconds, Fetched: 4 row(s)
Drop multiple partitions:
hive (icpc)> alter table dept_partition drop partition(month='20190804',month='20190803');
OK
Time taken: 0.129 seconds
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
Time taken: 0.078 seconds, Fetched: 4 row(s)
Note: the correct syntax separates multiple drop specs with commas between partition clauses: alter table dept_partition drop partition(month='20190804'), partition(month='20190803'); with that form, the final show partitions should list only month=20190801 and month=20190802.
4.6.3.5 Listing Partitions
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
month=20190805
Time taken: 0.055 seconds, Fetched: 5 row(s)
4.6.3.6 Viewing the Partitioned Table's Structure
hive (icpc)> desc formatted dept_partition;
OK
col_name data_type comment
# col_name data_type comment
id int
name string
adress string

# Partition Information
# col_name data_type comment
month string

# Detailed Table Information
Database: icpc
Owner: root
CreateTime: Thu Aug 01 22:42:16 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/icpc.db/dept_partition
Table Type: MANAGED_TABLE
Table Parameters:
	transient_lastDdlTime	1564670536

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	field.delim	,
	serialization.format	,
Time taken: 0.089 seconds, Fetched: 34 row(s)
4.6.4 Notes on Partitions
- Creating a two-level partitioned table
use icpc;
create table dept_partition2(
  id int,
  name string,
  loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by ',';
Detailed session:
hive (icpc)> use icpc;
OK
Time taken: 0.034 seconds
hive (icpc)> create table dept_partition2(
           >   id int,
           >   name string,
           >   loc string
           > )
           > partitioned by (month string,day string)
           > row format delimited fields terminated by ',';
OK
Time taken: 0.115 seconds
- Loading data the normal way
load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition2 partition(month='201908', day='13');
Detailed session:
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition2 partition(month='201908', day='13');
Loading data to table icpc.dept_partition2 partition (month=201908, day=13)
Partition icpc.dept_partition2{month=201908, day=13} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 0.583 seconds
hive (icpc)> show partitions dept_partition2;
OK
partition
month=201908/day=13
Time taken: 0.066 seconds, Fetched: 1 row(s)
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition2;
Found 1 items
drwxr-xr-x - root supergroup 0 2019-08-01 23:35 /user/hive/warehouse/icpc.db/dept_partition2/month=201908
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition2/month=201908/;
Found 1 items
drwxr-xr-x - root supergroup 0 2019-08-01 23:35 /user/hive/warehouse/icpc.db/dept_partition2/month=201908/day=13
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition2/month=201908/day=13;
Found 1 items
-rwxr-xr-x 3 root supergroup 64 2019-08-01 23:35 /user/hive/warehouse/icpc.db/dept_partition2/month=201908/day=13/dept_partition.txt
Query the partition:
hive (icpc)> select * from dept_partition2 where month='201908' and day='13';
OK
dept_partition2.id dept_partition2.name dept_partition2.loc dept_partition2.month dept_partition2.day
1 张三 南京 201908 13
2 李四 北京 201908 13
3 王五 上海 201908 13
4 赵六 宿迁 201908 13
Time taken: 0.22 seconds, Fetched: 4 row(s)
A multi-level partition is declared with partitioned by (month string, day string): whichever column comes first is the outer level. Here the day partitions nest inside the month partitions; on HDFS, the day directories live inside the month directory.
- Three ways to associate data uploaded directly into a partition directory with the partitioned table (see the sketches after this list):
- Method 1: upload the data, then repair the table
  1. Upload the data
  2. Query it (the freshly uploaded data is not visible yet)
  3. Run the repair command
  4. Query again
- Method 2: upload the data, then add the partition
  1. Upload the data
  2. Add the partition
  3. Query the data
- Method 3: create the directory, then load the data into the partition
  1. Create the directory
  2. Load the data
  3. Query the data
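Sketches of the three methods, run from the Hive CLI and reusing the dept_partition2 table from above (the month/day values are illustrative; the local file path matches the earlier examples):

-- Method 1: upload, then repair the metastore
-- (msck repair scans the table's directory for partition dirs the metastore does not know about)
dfs -mkdir -p /user/hive/warehouse/icpc.db/dept_partition2/month=201909/day=11;
dfs -put /root/hive/sgg/test/dept_partition.txt /user/hive/warehouse/icpc.db/dept_partition2/month=201909/day=11;
msck repair table dept_partition2;

-- Method 2: upload, then register the partition explicitly
dfs -mkdir -p /user/hive/warehouse/icpc.db/dept_partition2/month=201909/day=12;
dfs -put /root/hive/sgg/test/dept_partition.txt /user/hive/warehouse/icpc.db/dept_partition2/month=201909/day=12;
alter table dept_partition2 add partition(month='201909', day='12');

-- Method 3: let load data create the partition and move the file in one step
load data local inpath '/root/hive/sgg/test/dept_partition.txt'
into table dept_partition2 partition(month='201909', day='13');

-- in all three cases the data is then visible to partition queries:
select * from dept_partition2 where month='201909' and day='12';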