Hive in Action
Written on 2019-07-28 23:50
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
1. Hive Internal (Managed) Table Operations
Initialize the environment
Hive SQL
show databases;
create database if not exists hivedemo;
show databases;
Detailed steps
hive> show databases;
OK
default
Time taken: 0.262 seconds, Fetched: 1 row(s)
hive> create database if not exists hivedemo;
OK
Time taken: 0.133 seconds
hive> show databases;
OK
default
hivedemo
Time taken: 0.011 seconds, Fetched: 2 row(s)
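To double-check where the new database lives on HDFS, you can describe it. This is a quick sketch; the exact columns printed depend on the Hive version:

-- Shows the database name, its HDFS location
-- (e.g. hdfs://master:9000/user/hive/warehouse/hivedemo.db) and the owner.
describe database hivedemo;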
Create an internal table with the following commands.
Quick commands
cat>visiters_data.txt<<EOF
BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD JOANNE 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD KATHERINE 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD PHILIP 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MOHAN EDWARD 10/13/2010 12:37 10/13/2010 13:00 WH BILL SIGNING/
EOF

use hivedemo;
create table visters
(
    last_name string,
    first_name string,
    arrival_time string,
    scheduled_time string,
    meeting_location string,
    info_comment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

# Load data into Hive from the local filesystem
LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters;
# Load data into Hive from HDFS
LOAD DATA INPATH '/visiters_data.txt' OVERWRITE INTO TABLE visters;

select * from visters limit 5;
hadoop fs -text /user/hive/warehouse/hivedemo.db/visters/*|head
Note: the fields in visiters_data.txt must be tab-separated; adjust the file accordingly, since the sample shown above uses spaces.
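Before loading, it is worth verifying the delimiter. A small sketch using cat -A, which prints tab characters as ^I:

# Tabs are shown as ^I; plain spaces stay as spaces
cat -A visiters_data.txt | head -3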
Detailed steps
hive> use hivedemo;
OK
Time taken: 0.589 seconds
hive> create table visters
    > (
    >   last_name string,
    >   first_name string,
    >   arrival_time string,
    >   scheduled_time string,
    >   meeting_location string,
    >   info_comment string
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.356 seconds
hive> show tables;
OK
visters
Time taken: 0.087 seconds, Fetched: 1 row(s)

After the internal table is created, a directory named <database>.db appears under /user/hive/warehouse/ on HDFS.

[root@master ~]# hadoop fs -ls /user/hive/warehouse/hivedemo.db
Found 1 items
drwxr-xr-x   - root supergroup          0 2019-07-28 22:03 /user/hive/warehouse/hivedemo.db/visters

Load data from the local filesystem and inspect it:

[root@master innertable]# head visiters_data.txt
BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD JOANNE 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD KATHERINE 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD PHILIP 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MOHAN EDWARD 10/13/2010 12:37 10/13/2010 13:00 WH BILL SIGNING/

hive> LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters;
Loading data to table hivedemo.visters
Table hivedemo.visters stats: [numFiles=1, numRows=0, totalSize=989239, rawDataSize=0]
OK
Time taken: 0.622 seconds

Query it in Hive:

hive> select * from visters limit 5;
OK
BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
Time taken: 0.31 seconds, Fetched: 5 row(s)

[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/visters
Found 1 items
-rwxr-xr-x   3 root supergroup     989239 2019-08-10 11:14 /user/hive/warehouse/hivedemo.db/visters/visiters_data.txt

Inspect the file on HDFS:

[root@master innertable]# hadoop fs -text /user/hive/warehouse/hivedemo.db/visters/*|head
BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD JOANNE 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD KATHERINE 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD PHILIP 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MOHAN EDWARD 10/13/2010 12:37 10/13/2010 13:00 WH BILL SIGNING/
text: Unable to write to output stream.

Drop the table:

hive> drop table visters;
OK
Time taken: 0.129 seconds
hive> show tables;
OK
Time taken: 0.021 seconds

Check HDFS again; the data file has been deleted along with the table:

[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/
After an internal table is dropped, the corresponding data on HDFS is deleted as well.
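If you are unsure whether a table is internal before dropping it, desc formatted (used again in the external-table section below) prints a Table Type line: MANAGED_TABLE for an internal table and EXTERNAL_TABLE for an external one.

-- Check the "Table Type:" line before dropping
desc formatted visters;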
2. Hive External Table Operations
To create an external table, just add the EXTERNAL keyword. The biggest difference between internal and external tables is what happens on drop: dropping an internal table deletes its data on HDFS as well, while dropping an external table leaves the HDFS data in place.
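An external table is also frequently created directly on top of an existing HDFS directory with a LOCATION clause, so that Hive never moves the files at all. A minimal sketch; the directory /data/visiters is hypothetical and not part of the original walkthrough:

create EXTERNAL table visters_ext
(
    last_name string,
    first_name string,
    arrival_time string,
    scheduled_time string,
    meeting_location string,
    info_comment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/visiters';  -- hypothetical HDFS directory that already holds the data files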
Quick commands
use hivedemo;
create EXTERNAL table visters_wai
(
    last_name string,
    first_name string,
    arrival_time string,
    scheduled_time string,
    meeting_location string,
    info_comment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

# Load data into Hive from the local filesystem
LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters_wai;
# Load data into Hive from HDFS
LOAD DATA INPATH '/visiters_data.txt' OVERWRITE INTO TABLE visters_wai;

# Check which kind of table it is
desc formatted visters_wai;

select * from visters_wai limit 5;
hadoop fs -text /user/hive/warehouse/hivedemo.db/visters_wai/*|head
Detailed steps
hive> create EXTERNAL table visters_wai
    > (
    >   last_name string,
    >   first_name string,
    >   arrival_time string,
    >   scheduled_time string,
    >   meeting_location string,
    >   info_comment string
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.066 seconds
hive> show tables;
OK
visters_wai
Time taken: 0.017 seconds, Fetched: 1 row(s)
hive> desc formatted visters_wai;
OK
# col_name              data_type               comment
last_name               string
first_name              string
arrival_time            string
scheduled_time          string
meeting_location        string
info_comment            string

# Detailed Table Information
Database:               hivedemo
Owner:                  root
CreateTime:             Sun Jul 28 22:42:15 CST 2019
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://master:9000/user/hive/warehouse/hivedemo.db/visters_wai
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        transient_lastDdlTime   1564324935

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.089 seconds, Fetched: 33 row(s)

Load the data:

hive> LOAD DATA LOCAL INPATH 'visiters_data.txt' OVERWRITE INTO TABLE visters_wai;
Loading data to table hivedemo.visters_wai
Table hivedemo.visters_wai stats: [numFiles=1, numRows=0, totalSize=989239, rawDataSize=0]
OK
Time taken: 0.232 seconds

Query it in Hive:

hive> select * from visters_wai limit 5;
OK
BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
Time taken: 0.051 seconds, Fetched: 5 row(s)

Inspect the file on HDFS:

[root@master innertable]# hadoop fs -text /user/hive/warehouse/hivedemo.db/visters_wai/visiters_data.txt|head
BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
LANIER JAZMIN 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD ELIZABETH 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD GREGORY 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD JOANNE 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD KATHERINE 10/13/2010 12:34 10/13/2010 13:00 WH BILL SIGNING/
MAYNARD PHILIP 10/13/2010 12:35 10/13/2010 13:00 WH BILL SIGNING/
MOHAN EDWARD 10/13/2010 12:37 10/13/2010 13:00 WH BILL SIGNING/
text: Unable to write to output stream.
Drop the external table:

hive> drop table visters_wai;
OK
Time taken: 0.109 seconds
hive> show tables;
OK
Time taken: 0.022 seconds

After dropping the external table in Hive, check the data on HDFS:

[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/visters_wai
Found 1 items
-rwxr-xr-x   3 root supergroup     989239 2019-07-28 22:43 /user/hive/warehouse/hivedemo.db/visters_wai/visiters_data.txt

The HDFS data is still there; only the Hive metadata stored in MySQL was deleted. Simply re-create the table and the data is queryable from Hive again. In general, dropping is not recommended; it depends on your needs.

hive> create EXTERNAL table visters_wai
    > (
    >   last_name string,
    >   first_name string,
    >   arrival_time string,
    >   scheduled_time string,
    >   meeting_location string,
    >   info_comment string
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.075 seconds
hive> select * from visters_wai limit 3;
OK
BUCKLEY SUMMER 10/12/2010 14:48 10/12/2010 14:45 WH
CLOONEY GEORGE 10/12/2010 14:47 10/12/2010 14:45 WH
PRENDERGAST JOHN 10/12/2010 14:48 10/12/2010 14:45 WH
Time taken: 0.078 seconds, Fetched: 3 row(s)
hive>
Summary
1. An internal (managed) table's data lives inside the Hive warehouse; dropping the table also deletes its data files.
2. An external table's real data lives in HDFS files; dropping the table only removes the Hive metadata stored in MySQL, while the real data in HDFS remains.
3. Hive Partitioned Table Operations
When creating a table you can define one or more partitions for it; data is then loaded into a specific partition, and queries that name a partition only scan that partition, which improves efficiency. A partition can be thought of as a special column of the table. The keyword is PARTITIONED BY.
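The contents of data2.txt are not shown in the original; judging from the query results further down, it presumably holds three tab-separated rows, which could be created roughly like this:

# Reconstructed sample file (an assumption based on the query output below);
# the two columns are separated by a tab
cat >data2.txt<<EOF
4	maliu
5	houqi
6	zhaoba
EOF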
Quick commands
create table invites
(
    id int,
    name string
) partitioned by (ds string)
row format delimited
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
stored as textfile;

LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2013');
LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2014');

select * from invites where ds='2013';
select * from invites where ds='2014';
Detailed steps
hive> create table invites
    > (
    >   id int,
    >   name string
    > ) partitioned by (ds string)
    > row format delimited
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > stored as textfile;
OK
Time taken: 0.056 seconds
hive> LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2013');
Loading data to table hivedemo.invites partition (ds=2013)
Partition hivedemo.invites{ds=2013} stats: [numFiles=1, numRows=0, totalSize=26, rawDataSize=0]
OK
Time taken: 0.403 seconds
hive> LOAD DATA LOCAL INPATH 'data2.txt' OVERWRITE INTO TABLE invites partition (ds='2014');
Loading data to table hivedemo.invites partition (ds=2014)
Partition hivedemo.invites{ds=2014} stats: [numFiles=1, numRows=0, totalSize=26, rawDataSize=0]
OK
Time taken: 0.314 seconds
hive> select * from invites where ds='2013';
OK
4       maliu   2013
5       houqi   2013
6       zhaoba  2013
Time taken: 0.261 seconds, Fetched: 3 row(s)
hive> select * from invites where ds='2014';
OK
4       maliu   2014
5       houqi   2014
6       zhaoba  2014
Time taken: 0.068 seconds, Fetched: 3 row(s)

In Hadoop, each partition of a partitioned table is stored as its own directory; a partitioned table effectively splits the table's files into several labeled sub-files to make queries easier.

[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/invites
Found 2 items
drwxr-xr-x   - root supergroup          0 2019-07-28 22:58 /user/hive/warehouse/hivedemo.db/invites/ds=2013
drwxr-xr-x   - root supergroup          0 2019-07-28 23:00 /user/hive/warehouse/hivedemo.db/invites/ds=2014
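Partitions can also be listed and managed after loading; a short sketch of the standard DDL statements:

-- List every partition of the table
show partitions invites;
-- Add an empty partition without loading any data
alter table invites add partition (ds='2015');
-- Drop a partition (for an internal table this also deletes that partition's data files)
alter table invites drop partition (ds='2015');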
4. Hive Bucket Table Operations
A partitioned table splits a large table file into smaller ones to help queries, but if the data is skewed, query efficiency still suffers. A bucket table hashes the data so that it is spread evenly across the table's data files. In short, when bucketing, the value of the chosen column is hashed, the hash is taken modulo the number of buckets, and the remainder decides which bucket (file) the row is stored in. Physically, each bucket is one file inside the table (or partition) directory, and the number of buckets (output files) produced by a job equals the number of reduce tasks. Bucket tables are designed for sampling queries rather than day-to-day storage; create and use one only when you need to sample data.
The bucket-table demo proceeds as follows:
- Configure the environment so that Hive enforces bucketing:
set hive.enforce.bucketing=true
- Create the bucket table:
create table bucket_table1 (id int) clustered by(id) into 8 buckets
This creates a bucket table with a single column, id. The CLUSTERED BY clause specifies which column to bucket on: here rows are bucketed by id into 8 buckets, i.e. id is hashed and the row is placed into one of the 8 buckets.
- Create an intermediate staging table, bucket_table2, and load data into it (a sketch of the assumed data file follows).
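The file testbucket1.txt is not shown in the original; the bucket contents printed later (2, 10, 26, 18 in one file and 22, 6, 14, 30 in another) suggest it simply holds small integers, one per line. A hedged reconstruction, assuming the values 1 through 32:

# Assumed reconstruction of testbucket1.txt: the integers 1..32, one per line
seq 1 32 > testbucket1.txt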
Quick commands
set hive.enforce.bucketing=true;
create table bucket_table1 (id int) clustered by(id) into 8 buckets;
create table bucket_table2 (id int);
load data local inpath 'testbucket1.txt' into table bucket_table2;
Detailed steps
hive> set hive.enforce.bucketing=true;
hive> create table bucket_table1 (id int) clustered by(id) into 8 buckets;
OK
Time taken: 0.097 seconds
hive> create table bucket_table2 (id int);
OK
Time taken: 0.055 seconds
hive> load data local inpath 'testbucket1.txt' into table bucket_table2;
Loading data to table hivedemo.bucket_table2
Table hivedemo.bucket_table2 stats: [numFiles=1, totalSize=119]
OK
Time taken: 0.201 seconds
- Insert data into the bucket table using:
insert into table bucket_table1 select * from bucket_table2
hive> insert into table bucket_table1 select * from bucket_table2;
Query ID = root_20190728231752_1c61ebbe-eb14-4468-9a68-65d9278a5e2e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 8
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564321568958_0001, Tracking URL = http://master:8088/proxy/application_1564321568958_0001/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564321568958_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 8
2019-07-28 23:18:03,142 Stage-1 map = 0%,  reduce = 0%
2019-07-28 23:18:09,507 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.01 sec
2019-07-28 23:18:23,394 Stage-1 map = 100%,  reduce = 8%, Cumulative CPU 5.96 sec
2019-07-28 23:18:26,875 Stage-1 map = 100%,  reduce = 13%, Cumulative CPU 8.13 sec
2019-07-28 23:18:30,337 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 13.24 sec
2019-07-28 23:18:32,551 Stage-1 map = 100%,  reduce = 38%, Cumulative CPU 20.12 sec
2019-07-28 23:18:33,605 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 25.98 sec
2019-07-28 23:18:38,863 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 43.12 sec
MapReduce Total cumulative CPU time: 43 seconds 120 msec
Ended Job = job_1564321568958_0001
Loading data to table hivedemo.bucket_table1
Table hivedemo.bucket_table1 stats: [numFiles=8, numRows=32, totalSize=87, rawDataSize=55]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 8   Cumulative CPU: 43.12 sec   HDFS Read: 25584 HDFS Write: 711 SUCCESS
Total MapReduce CPU Time Spent: 43 seconds 120 msec
OK
Time taken: 48.749 seconds
Physically, each bucket is a single file in the table directory, and the number of buckets (output files) produced by a job equals the number of reduce tasks. The HDFS shell shows the 8 files:
[root@master innertable]# hadoop fs -ls /user/hive/warehouse/hivedemo.db/bucket_table1
Found 8 items
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000000_0
-rwxr-xr-x   3 root supergroup         10 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000001_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000002_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000003_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000004_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000005_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000006_0
-rwxr-xr-x   3 root supergroup         11 2019-07-28 23:18 /user/hive/warehouse/hivedemo.db/bucket_table1/000007_0
- Sampling queries in Hive:
select * from bucket_table1 tablesample(bucket 3 out of 4 on id)
The syntax is:
select * from table_name tablesample(bucket x out of y on field)
1. x is the bucket to start sampling from, and y is the interval between sampled buckets.
2. y must be a factor or a multiple of the number of buckets. For example, with 8 buckets, y=8 samples 8/8 = 1 bucket, while y=4 samples 8/4 = 2 buckets.
3. The sampling statement above therefore returns the data of the 3rd and the 7th buckets.
hive> select * from bucket_table1 tablesample(bucket 3 out of 4 on id);
OK
2
10
26
18
22
6
14
30
Time taken: 0.118 seconds, Fetched: 8 row(s)
[root@master innertable]# hadoop fs -text /user/hive/warehouse/hivedemo.db/bucket_table1/000002_0
2
10
26
18
[root@master innertable]# hadoop fs -text /user/hive/warehouse/hivedemo.db/bucket_table1/000006_0
22
6
14
30
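The bucket assignment can be reproduced by hand: Hive hashes the bucketing column and takes the value modulo the number of buckets, and for an int column the hash is simply the value itself. A small sketch with the built-in hash() and pmod() functions, which shows why 2, 10, 18 and 26 all land in bucket file 000002_0:

-- bucket index = pmod(hash(id), 8); for ints, hash(id) = id
select id, pmod(hash(id), 8) as bucket
from bucket_table2
order by bucket, id;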
In summary, every table (or partition) in Hive can be further organized into buckets, which give a finer-grained division of the data. Hive buckets on a chosen column: the column value is hashed and taken modulo the number of buckets, and the remainder decides which bucket the row goes into.
5. A Hive Application Example
The classic Hadoop word-frequency example, WordCount.
Quick commands
cat >>cmz1.txt<<EOF
hadoop is great
spark is good
EOF
cat >>cmz2.txt<<EOF
hadoop is good
spark is great
hive is good
EOF
hadoop fs -mkdir /cmz
hadoop fs -put cmz1.txt cmz2.txt /cmz
hadoop fs -ls /cmz
Detailed steps
[root@master innertable]# cat cmz1.txt
hadoop is great
spark is good
[root@master innertable]# cat cmz2.txt
hadoop is good
spark is great
hive is good
[root@master innertable]# hadoop fs -mkdir /cmz
[root@master innertable]# hadoop fs -put cmz1.txt cmz2.txt /cmz
[root@master innertable]# hadoop fs -ls /cmz
Found 2 items
-rw-r--r--   3 root supergroup         31 2019-07-28 23:35 /cmz/cmz1.txt
-rw-r--r--   3 root supergroup         43 2019-07-28 23:35 /cmz/cmz2.txt
The WordCount algorithm implemented in HiveQL looks like this:
Quick commands
create table docs(line string);
load data inpath '/cmz/*.txt' overwrite into table docs;
create table word_count as
select word, count(1) as count from
  (select explode(split(line, " ")) as word from docs) w
group by word
order by word;
select * from word_count;
Detailed steps
hive> create table word_count as
    > select word,count(1) as count from
    > (select explode (split(line," "))as word from docs) w
    > group by word
    > order by word;
Query ID = root_20190728233923_6645ea24-c03c-4dc8-bd9f-852b5d175df1
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564321568958_0002, Tracking URL = http://master:8088/proxy/application_1564321568958_0002/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564321568958_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-07-28 23:39:28,602 Stage-1 map = 0%,  reduce = 0%
2019-07-28 23:39:32,765 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.63 sec
2019-07-28 23:39:37,962 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1564321568958_0002
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564321568958_0003, Tracking URL = http://master:8088/proxy/application_1564321568958_0003/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564321568958_0003
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2019-07-28 23:39:47,781 Stage-2 map = 0%,  reduce = 0%
2019-07-28 23:39:51,929 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.87 sec
2019-07-28 23:39:57,111 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1564321568958_0003
Moving data to: hdfs://master:9000/user/hive/warehouse/hivedemo.db/word_count
Table hivedemo.word_count stats: [numFiles=1, numRows=7, totalSize=47, rawDataSize=40]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.85 sec   HDFS Read: 7107 HDFS Write: 255 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 2.78 sec   HDFS Read: 4528 HDFS Write: 122 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
Time taken: 34.594 seconds
hive> select * from word_count;
OK
        1
good    3
great   2
hadoop  2
hive    1
is      5
spark   2
Time taken: 0.047 seconds, Fetched: 7 row(s)
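The first row of the result is an empty token counted once, most likely caused by an extra space or blank line in the input files. If that is unwanted, the empty strings can be filtered out in the subquery; a sketch (word_count2 is just a new table name for this variant):

create table word_count2 as
select word, count(1) as count from
  (select explode(split(line, " ")) as word from docs) w
where word <> ''
group by word
order by count desc;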
The main differences between the MapReduce implementation of the WordCount algorithm and the Hive implementation are:
- The Hive version of WordCount needs far less code: the MapReduce WordCount class is written in 63 lines of Java, whereas the Hive version takes only 7 lines of HiveQL.
- The MapReduce implementation has to be compiled and packaged into a JAR file before it can run; with Hive that is unnecessary, since the HiveQL statements are ultimately translated into MapReduce jobs by the Hive framework automatically and the user does not need to know the implementation details.