Hive in Practice

Written on 2019-07-28 23:50

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

1. Working with Hive Internal (Managed) Tables

  Initialize the environment

Hive SQL
show databases;
create database if not exists hivedemo;
show databases;

Detailed steps

hive> show databases;
OK
default
Time taken: 0.262 seconds, Fetched: 1 row(s)
hive> create database if not exists hivedemo;
OK
Time taken: 0.133 seconds
hive> show databases;
OK
default
hivedemo
Time taken: 0.011 seconds, Fetched: 2 row(s)

  Create an internal table with the commands below.

Quick commands
cat>visiters_data.txt<<EOF
BUCKLEY SUMMER  10/12/2010 14:48    10/12/2010 14:45    WH  
CLOONEY GEORGE  10/12/2010 14:47    10/12/2010 14:45    WH  
PRENDERGAST JOHN    10/12/2010 14:48    10/12/2010 14:45    WH  
LANIER  JAZMIN      10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD JOANNE  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD KATHERINE   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD PHILIP  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MOHAN   EDWARD  10/13/2010 12:37    10/13/2010 13:00    WH  BILL SIGNING/
EOF


use hivedemo; 
create table visters
(
    last_name string, 
    first_name string, 
    arrival_time string, 
    scheduled_time string, 
    meeting_location string,
    info_comment string
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

# Load data from the local filesystem into Hive
LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters;

# Load data from HDFS into Hive
LOAD DATA INPATH '/visiters_data.txt' OVERWRITE INTO TABLE visters;
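# (assumes the file was already uploaded to HDFS, e.g. with: hadoop fs -put visiters_data.txt /)
# note: loading from an HDFS path moves the file into the table's warehouse directory,
# whereas LOAD DATA LOCAL copies it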

select * from visters limit 5;

hadoop fs -text /user/hive/warehouse/hivedemo.db/visters/*|head

Note: the fields in visiters_data.txt must be separated by tabs. Check and fix this yourself; in the snippet above the tabs came out as spaces.
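
If you are not sure whether the separators in your copy of the file are real tabs, a quick check and a rough conversion might look like this (both assume GNU coreutils/sed, and the conversion assumes the separators were pasted as runs of two or more spaces, so adjust it to your data):

# show tabs as ^I
cat -A visiters_data.txt | head -3
# rough fix: turn runs of two or more spaces into a single tab
sed -i 's/ \{2,\}/\t/g' visiters_data.txt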

Detailed steps

hive> use hivedemo; 
OK
Time taken: 0.589 seconds
hive> create table visters
    > (
    >     last_name string, 
    >     first_name string, 
    >     arrival_time string, 
    >     scheduled_time string, 
    >     meeting_location string,
    >     info_comment string
    > ) 
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.356 seconds
hive> show tables;
OK
visters
Time taken: 0.087 seconds, Fetched: 1 row(s)

After the internal table is created, HDFS has a directory named <database>.db under /user/hive/warehouse/, with one subdirectory per table:
[root@master ~]# hadoop fs -ls /user/hive/warehouse/hivedemo.db
Found 1 items
drwxr-xr-x   - root supergroup          0 2019-07-28 22:03 /user/hive/warehouse/hivedemo.db/visters

Load the data from the local filesystem and inspect it
[root@master innertable]# head visiters_data.txt 
BUCKLEY SUMMER  10/12/2010 14:48    10/12/2010 14:45    WH  
CLOONEY GEORGE  10/12/2010 14:47    10/12/2010 14:45    WH  
PRENDERGAST JOHN    10/12/2010 14:48    10/12/2010 14:45    WH  
LANIER  JAZMIN      10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD JOANNE  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD KATHERINE   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD PHILIP  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MOHAN   EDWARD  10/13/2010 12:37    10/13/2010 13:00    WH  BILL SIGNING/

hive> LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters;
Loading data to table hivedemo.visters
Table hivedemo.visters stats: [numFiles=1, numRows=0, totalSize=989239, rawDataSize=0]
OK
Time taken: 0.622 seconds

View in Hive
hive> select * from visters limit 5;
OK
BUCKLEY SUMMER  10/12/2010 14:48    10/12/2010 14:45    WH  
CLOONEY GEORGE  10/12/2010 14:47    10/12/2010 14:45    WH  
PRENDERGAST JOHN    10/12/2010 14:48    10/12/2010 14:45    WH  
LANIER  JAZMIN      10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
Time taken: 0.31 seconds, Fetched: 5 row(s)

[root@master innertable]# hadoop fs -ls  /user/hive/warehouse/hivedemo.db/visters
Found 1 items
-rwxr-xr-x   3 root supergroup     989239 2019-08-10 11:14 /user/hive/warehouse/hivedemo.db/visters/visiters_data.txt

View on HDFS
[root@master innertable]#  hadoop fs -text /user/hive/warehouse/hivedemo.db/visters/*|head
BUCKLEY SUMMER  10/12/2010 14:48    10/12/2010 14:45    WH  
CLOONEY GEORGE  10/12/2010 14:47    10/12/2010 14:45    WH  
PRENDERGAST JOHN    10/12/2010 14:48    10/12/2010 14:45    WH  
LANIER  JAZMIN      10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD JOANNE  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD KATHERINE   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD PHILIP  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MOHAN   EDWARD  10/13/2010 12:37    10/13/2010 13:00    WH  BILL SIGNING/
text: Unable to write to output stream.

Drop the table
hive> drop table visters;
OK
Time taken: 0.129 seconds
hive> show tables;
OK
Time taken: 0.021 seconds

Check the HDFS files: they have been deleted as well
[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/

After an internal table is dropped, the corresponding data on HDFS is deleted along with it.

2. Working with Hive External Tables

  To create an external table, just add the EXTERNAL keyword. The biggest difference between external and internal tables: dropping an internal table also deletes its data on HDFS, whereas dropping an external table leaves the data on HDFS intact.
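
An external table is also commonly pointed at data that already sits somewhere on HDFS by adding a LOCATION clause. A minimal sketch; the table name visters_ext and the path /data/visiters are hypothetical and not part of this walkthrough:

use hivedemo;
create EXTERNAL table visters_ext
(
    last_name string, 
    first_name string, 
    arrival_time string, 
    scheduled_time string, 
    meeting_location string,
    info_comment string
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/visiters';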

Quick commands
use hivedemo; 
create EXTERNAL table visters_wai
(
    last_name string, 
    first_name string, 
    arrival_time string, 
    scheduled_time string, 
    meeting_location string,
    info_comment string
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

# Load data from the local filesystem into Hive
LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters_wai;

# Load data from HDFS into Hive
LOAD DATA INPATH '/visiters_data.txt' OVERWRITE INTO TABLE visters_wai;

# Check which kind of table this is
desc formatted visters_wai;

select * from visters_wai limit 5;

hadoop fs -text /user/hive/warehouse/hivedemo.db/visters_wai/*|head

Detailed steps

hive> create EXTERNAL table visters_wai
    > (
    >     last_name string, 
    >     first_name string, 
    >     arrival_time string, 
    >     scheduled_time string, 
    >     meeting_location string,
    >     info_comment string
    > ) 
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.066 seconds
hive> show tables;
OK
visters_wai
Time taken: 0.017 seconds, Fetched: 1 row(s)

hive>  desc formatted visters_wai;
OK
# col_name              data_type               comment             

last_name               string                                      
first_name              string                                      
arrival_time            string                                      
scheduled_time          string                                      
meeting_location        string                                      
info_comment            string                                      

# Detailed Table Information         
Database:               hivedemo                 
Owner:                  root                     
CreateTime:             Sun Jul 28 22:42:15 CST 2019     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://master:9000/user/hive/warehouse/hivedemo.db/visters_wai   
Table Type:             EXTERNAL_TABLE           
Table Parameters:        
    EXTERNAL                TRUE                
    transient_lastDdlTime   1564324935          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    field.delim             \t                  
    serialization.format    \t                  
Time taken: 0.089 seconds, Fetched: 33 row(s)


Load the data
hive> LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters_wai;
Loading data to table hivedemo.visters_wai
Table hivedemo.visters_wai stats: [numFiles=1, numRows=0, totalSize=989239, rawDataSize=0]
OK
Time taken: 0.232 seconds

View in Hive
hive> select * from visters_wai limit 5;
OK
BUCKLEY SUMMER  10/12/2010 14:48    10/12/2010 14:45    WH  
CLOONEY GEORGE  10/12/2010 14:47    10/12/2010 14:45    WH  
PRENDERGAST JOHN    10/12/2010 14:48    10/12/2010 14:45    WH  
LANIER  JAZMIN      10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
Time taken: 0.051 seconds, Fetched: 5 row(s)

View on HDFS
[root@master innertable]# hadoop fs -text /user/hive/warehouse/hivedemo.db/visters_wai/visiters_data.txt|head
BUCKLEY SUMMER  10/12/2010 14:48    10/12/2010 14:45    WH  
CLOONEY GEORGE  10/12/2010 14:47    10/12/2010 14:45    WH  
PRENDERGAST JOHN    10/12/2010 14:48    10/12/2010 14:45    WH  
LANIER  JAZMIN      10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD ELIZABETH   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD JOANNE  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD KATHERINE   10/13/2010 12:34    10/13/2010 13:00    WH  BILL SIGNING/
MAYNARD PHILIP  10/13/2010 12:35    10/13/2010 13:00    WH  BILL SIGNING/
MOHAN   EDWARD  10/13/2010 12:37    10/13/2010 13:00    WH  BILL SIGNING/
text: Unable to write to output stream.


Drop the external table
hive> drop table visters_wai;
OK
Time taken: 0.109 seconds
hive> show tables;
OK
Time taken: 0.022 seconds

After dropping the external table in Hive, check the data on HDFS again:
[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/visters_wai
Found 1 items
-rwxr-xr-x   3 root supergroup     989239 2019-07-28 22:43 /user/hive/warehouse/hivedemo.db/visters_wai/visiters_data.txt
The data on HDFS is still there; only the Hive metadata stored in MySQL was deleted. Simply recreate the table and the data can be queried from Hive again. In general, dropping external tables is not recommended, but it depends on your needs.

hive> create EXTERNAL table visters_wai
    > (
    >     last_name string, 
    >     first_name string, 
    >     arrival_time string, 
    >     scheduled_time string, 
    >     meeting_location string,
    >     info_comment string
    > ) 
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.075 seconds
hive> select * from visters_wai limit 3;
OK
BUCKLEY SUMMER  10/12/2010 14:48    10/12/2010 14:45    WH  
CLOONEY GEORGE  10/12/2010 14:47    10/12/2010 14:45    WH  
PRENDERGAST JOHN    10/12/2010 14:48    10/12/2010 14:45    WH  
Time taken: 0.078 seconds, Fetched: 3 row(s)
hive> 

Summary

1. An internal table's data lives in Hive's warehouse; dropping the table deletes its data files as well.
2. For an external table, the actual data lives in files on HDFS; dropping the table removes only the Hive metadata stored in MySQL, while the real data on HDFS remains.
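
As an aside, a table's type can also be switched after creation by setting the EXTERNAL table property. A hedged sketch; behavior can differ between Hive versions, so verify with desc formatted afterwards:

alter table visters set tblproperties('EXTERNAL'='TRUE');   -- managed -> external
alter table visters set tblproperties('EXTERNAL'='FALSE');  -- external -> managed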

3. Working with Hive Partitioned Tables

  When creating a table, you can define one or more partitions for it. You then name a specific partition when loading data, and you can restrict queries to a specific partition to improve efficiency. A partition can be thought of as a special column of the table. The keyword is partitioned.

Quick commands
create table invites
(
    id int,
    name string
) partitioned by (ds string)
row format delimited 
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
stored as textfile;

LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2013');
LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2014');
select * from invites where ds='2013';
select * from invites where ds='2014';

Detailed steps

hive> create table invites
    > (
    >     id int,
    >     name string
    > ) partitioned by (ds string)
    > row format delimited 
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > stored as textfile;
OK
Time taken: 0.056 seconds

hive> LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2013');
Loading data to table hivedemo.invites partition (ds=2013)
Partition hivedemo.invites{ds=2013} stats: [numFiles=1, numRows=0, totalSize=26, rawDataSize=0]
OK
Time taken: 0.403 seconds
hive> LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2014');
Loading data to table hivedemo.invites partition (ds=2014)
Partition hivedemo.invites{ds=2014} stats: [numFiles=1, numRows=0, totalSize=26, rawDataSize=0]
OK
Time taken: 0.314 seconds


hive> select * from invites where ds='2013';
OK
4   maliu   2013
5   houqi   2013
6   zhaoba  2013
Time taken: 0.261 seconds, Fetched: 3 row(s)
hive> select * from invites where ds='2014';
OK
4   maliu   2014
5   houqi   2014
6   zhaoba  2014
Time taken: 0.068 seconds, Fetched: 3 row(s)

On Hadoop, each partition of a partitioned table is a directory; a partitioned table effectively splits the table's file into several labeled smaller files to make queries easier.
[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/invites
Found 2 items
drwxr-xr-x   - root supergroup          0 2019-07-28 22:58 /user/hive/warehouse/hivedemo.db/invites/ds=2013
drwxr-xr-x   - root supergroup          0 2019-07-28 23:00 /user/hive/warehouse/hivedemo.db/invites/ds=2014
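
The partitions can also be listed from inside Hive; this should show one line per partition (here ds=2013 and ds=2014):

show partitions invites;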

4. Working with Hive Bucket Tables

  A partitioned table splits a large table file into several smaller files to help queries, but if the data is unevenly distributed, query efficiency still suffers. A bucket table hashes the data and takes the result modulo a fixed number, so that rows are spread evenly across the table's data files. In short, when bucketing, the value of the specified column is hashed, the hash is taken modulo the number of buckets, and the row is stored in the file for the resulting bucket. Physically, each bucket is one file inside the table (or partition) directory, and the number of buckets (output files) produced by a job equals the number of reduce tasks. Bucket tables are meant for sampling queries rather than everyday data storage; create and use them only when sampling queries are needed.

The bucket table walkthrough goes as follows:

  • Environment setting, so Hive enforces bucketing: set hive.enforce.bucketing=true
  • Create the bucket table: create table bucket_table1 (id int) clustered by(id) into 8 buckets creates a bucket table with a single column [id]; the clustered by clause names the bucketing column, here id, split into 8 buckets: each id is hashed and the row is placed into one of the 8 buckets.
  • Create the intermediate staging table bucket_table2 and load data into it
Quick commands
set hive.enforce.bucketing=true;
create table bucket_table1 (id int) clustered by(id) into 8 buckets;
create table bucket_table2 (id int);
load data local inpath 'testbucket1.txt' into table bucket_table2;

Detailed steps

hive> set hive.enforce.bucketing=true;
hive> create table bucket_table1 (id int) clustered by(id) into 8 buckets;
OK
Time taken: 0.097 seconds
hive> create table bucket_table2 (id int);
OK
Time taken: 0.055 seconds
hive> load data local inpath 'testbucket1.txt' into table bucket_table2;
Loading data to table hivedemo.bucket_table2
Table hivedemo.bucket_table2 stats: [numFiles=1, totalSize=119]
OK
Time taken: 0.201 seconds
  • Insert data into the bucket table with insert into table bucket_table1 select * from bucket_table2
hive> insert into table bucket_table1 select * from bucket_table2;
Query ID = root_20190728231752_1c61ebbe-eb14-4468-9a68-65d9278a5e2e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564321568958_0001, Tracking URL = http://master:8088/proxy/application_1564321568958_0001/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job  -kill job_1564321568958_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8
2019-07-28 23:18:03,142 Stage-1 map = 0%,  reduce = 0%
2019-07-28 23:18:09,507 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.01 sec
2019-07-28 23:18:23,394 Stage-1 map = 100%,  reduce = 8%, Cumulative CPU 5.96 sec
2019-07-28 23:18:26,875 Stage-1 map = 100%,  reduce = 13%, Cumulative CPU 8.13 sec
2019-07-28 23:18:30,337 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 13.24 sec
2019-07-28 23:18:32,551 Stage-1 map = 100%,  reduce = 38%, Cumulative CPU 20.12 sec
2019-07-28 23:18:33,605 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 25.98 sec
2019-07-28 23:18:38,863 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 43.12 sec
MapReduce Total cumulative CPU time: 43 seconds 120 msec
Ended Job = job_1564321568958_0001
Loading data to table hivedemo.bucket_table1
Table hivedemo.bucket_table1 stats: [numFiles=8, numRows=32, totalSize=87, rawDataSize=55]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 8   Cumulative CPU: 43.12 sec   HDFS Read: 25584 HDFS Write: 711 SUCCESS
Total MapReduce CPU Time Spent: 43 seconds 120 msec
OK
Time taken: 48.749 seconds

Physically, each bucket is one file in the table directory, and the number of buckets (output files) a job produces equals the number of reduce tasks. With the HDFS shell you can see the 8 files:

[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/bucket_table1
Found 8 items
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000000_0
-rwxr-xr-x   3 root supergroup         10 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000001_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000002_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000003_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000004_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000005_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000006_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000007_0
  • Sampling query in Hive: select * from bucket_table1 tablesample(bucket 3 out of 4 on id). The syntax is:

select * from table_name tablesample(bucket x out of y on field)
1. x is the bucket to start sampling from; y is the stride between sampled buckets
2. y must be a multiple or a factor of the number of buckets. For example, with 8 buckets,
   y=8 samples one bucket's worth of data, while y=4 samples 8/4 = 2 buckets' worth
3. The sampling statement above therefore reads the data of the third and the seventh buckets
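
In general, tablesample(bucket x out of y on col) reads buckets x, x+y, x+2y, and so on. For example, a sketch (not run here) that samples every other bucket starting from the first, and so should read 4 of the 8 buckets (files 000000_0, 000002_0, 000004_0 and 000006_0):

select * from bucket_table1 tablesample(bucket 1 out of 2 on id);
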
hive> select * from bucket_table1 tablesample(bucket 3 out of 4 on id);
OK
2
10
26
18
22
6
14
30
Time taken: 0.118 seconds, Fetched: 8 row(s)
Use the HDFS shell to look at the data in the third and the seventh buckets:
[root@master innertable]# hadoop fs -text /user/hive/warehouse/hivedemo.db/bucket_table1/000002_0
2
10
26
18
[root@master innertable]# hadoop fs -text /user/hive/warehouse/hivedemo.db/bucket_table1/000006_0
22
6
14
30
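
As a cross-check, the bucket index for each id can be computed directly in Hive. A small sketch, assuming the default hashing for int columns (hash(id) is simply id, so the bucket is id mod 8); the ids 2, 10, 26 and 18 should all map to bucket 2 (file 000002_0), and 22, 6, 14 and 30 to bucket 6 (file 000006_0):

select id, pmod(hash(id), 8) as bucket_idx from bucket_table1;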

In summary, every Hive table or partition can be further organized into buckets, which give a finer-grained division of the data. Hive buckets on a chosen column: it hashes the column's value, takes the hash modulo the number of buckets, and that determines which bucket the row is placed in.

5. A Hive Application Example

  The classic Hadoop word-count example, WordCount.

Quick commands
cat >>cmz1.txt<<EOF
hadoop is great 
spark is good
EOF

cat >>cmz2.txt<<EOF
hadoop is good
spark is great
hive is good
EOF

hadoop fs -mkdir /cmz
hadoop fs -put cmz1.txt cmz2.txt /cmz
hadoop fs -ls /cmz

Detailed steps

[root@master innertable]# cat cmz1.txt 
hadoop is great 
spark is good
[root@master innertable]# cat cmz2.txt 
hadoop is good
spark is great
hive is good
[root@master innertable]# hadoop fs -mkdir /cmz
[root@master innertable]# hadoop fs -put cmz1.txt cmz2.txt /cmz
[root@master innertable]# hadoop fs -ls /cmz
Found 2 items
-rw-r--r--   3 root supergroup         31 2019-07-28 23:35 /cmz/cmz1.txt
-rw-r--r--   3 root supergroup         43 2019-07-28 23:35 /cmz/cmz2.txt

The WordCount algorithm written in HiveQL looks like this:

Quick commands
create table docs(line string);
load data inpath '/cmz/*.txt' overwrite into table docs;
create table word_count as  
    select word,count(1) as count from 
        (select explode (split(line," "))as word from docs) w
            group by word
            order by word;

select * from word_count;

Detailed steps

hive> create table word_count as  
    >     select word,count(1) as count from 
    >         (select explode (split(line," "))as word from docs) w
    >             group by word
    >             order by word;
Query ID = root_20190728233923_6645ea24-c03c-4dc8-bd9f-852b5d175df1
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564321568958_0002, Tracking URL = http://master:8088/proxy/application_1564321568958_0002/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job  -kill job_1564321568958_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-07-28 23:39:28,602 Stage-1 map = 0%,  reduce = 0%
2019-07-28 23:39:32,765 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
2019-07-28 23:39:37,962 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1564321568958_0002
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564321568958_0003, Tracking URL = http://master:8088/proxy/application_1564321568958_0003/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job  -kill job_1564321568958_0003
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2019-07-28 23:39:47,781 Stage-2 map = 0%,  reduce = 0%
2019-07-28 23:39:51,929 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2019-07-28 23:39:57,111 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1564321568958_0003
Moving data to: hdfs://master:9000/user/hive/warehouse/hivedemo.db/word_count
Table hivedemo.word_count stats: [numFiles=1, numRows=7, totalSize=47, rawDataSize=40]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.85 sec   HDFS Read: 7107 HDFS Write: 255 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 2.78 sec   HDFS Read: 4528 HDFS Write: 122 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
Time taken: 34.594 seconds

hive> select * from word_count;
OK
    1
good    3
great   2
hadoop  2
hive    1
is  5
spark   2
Time taken: 0.047 seconds, Fetched: 7 row(s)
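
The empty token with count 1 in the first row most likely comes from the trailing space at the end of "hadoop is great " in cmz1.txt: split(line," ") then produces an empty string after that space. The intermediate tokens can be inspected before aggregation, for example:

select explode(split(line," ")) as word from docs;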

The main differences between the MapReduce implementation of the WordCount algorithm and the Hive implementation are:

  • The Hive version of WordCount requires far less code: the MapReduce WordCount class is 63 lines of Java, while the Hive version needs only 7 lines of HiveQL

  • The MapReduce implementation must be compiled into a JAR file before the algorithm can run; Hive needs no such step. The HiveQL statements are ultimately translated into MapReduce jobs for execution, but the Hive framework does this automatically, so the user does not need to know the implementation details