Hive DDL
Food for thought: the biggest difference between successful people and everyone else is that they actually get up and do the work.
1. DDL Operations
- DDL (Data Definition Language): statements used to define the database structure or schema. DDL is one of the four major parts of SQL: it defines the database's three-level structure (external schema, conceptual schema, internal schema, and the mappings between them), as well as data integrity, security controls, and other constraints. DDL does not require a commit.
  CREATE, ALTER, DROP, TRUNCATE, COMMENT, RENAME
- DML (Data Manipulation Language): statements used for managing data within schema objects. Provided by the DBMS so that users and programmers can operate on the data in the database. DML comes in interactive and embedded forms and, by language level, procedural and non-procedural forms. DML requires a commit.
  SELECT, INSERT, UPDATE, DELETE, MERGE, CALL, EXPLAIN PLAN, LOCK TABLE
- DCL (Data Control Language): authorization, role control, and so on.
  GRANT (grant privileges), REVOKE (revoke privileges)
- TCL (Transaction Control Language):
  SAVEPOINT (set a savepoint), ROLLBACK (roll back), SET TRANSACTION
SQL has four main parts:
1. Data definition (SQL DDL): creating and dropping SQL schemas, base tables, views, and indexes.
2. Data manipulation (SQL DML): split into queries and updates, where updates are further split into insert, delete, and update operations.
3. Data control: authorization on base tables and views, description of integrity rules, transaction control, and so on.
4. Embedded SQL rules: the rules for embedding SQL statements in host-language programs.
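To make the categories concrete, here is a minimal HiveQL sketch with one statement from each category. The table name demo and user name mike are made up; the GRANT only works when Hive authorization is enabled, and Hive itself only supports UPDATE/DELETE on ACID tables:

-- DDL: define structure (takes effect immediately, no commit)
create table demo(id int);
-- DML: manipulate the data inside schema objects
insert into table demo values(1);
select * from demo;
-- DCL: control access (requires authorization to be enabled)
grant select on table demo to user mike;
-- DDL again: remove the structure
drop table demo;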
1.1 Creating a Database
- Create a database. The default storage path for databases on HDFS is /user/hive/warehouse/*.db.
hive (default)> create database db_hive;
- To avoid an error when the database to create already exists, add an if not exists clause (the standard idiom).
hive (default)> create database db_hive;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database db_hive already exists
hive (default)> create database if not exists db_hive;
- Create a database at a specified HDFS location
hive (default)> create database cmz_hive2 location '/cmz_hive2.db';
OK
Time taken: 0.092 seconds
hive (default)> dfs -ls /cmz_hive2.db;
When we create a database without specifying a location, Hive appends .db to the database name and stores it under the default path /user/hive/warehouse/ (the Hive data location specified in the configuration file).
hive (default)> dfs -ls /user/hive/warehouse;
Found 11 items
drwxr-xr-x - root supergroup 0 2019-07-30 14:06 /user/hive/warehouse/cmz.db
drwxr-xr-x - root supergroup 0 2019-06-24 20:23 /user/hive/warehouse/cmz_external
drwxr-xr-x - root supergroup 0 2019-08-01 19:40 /user/hive/warehouse/cmz_hive3.db
drwxr-xr-x - root supergroup 0 2019-07-28 23:39 /user/hive/warehouse/hivedemo.db
drwxr-xr-x - root supergroup 0 2019-07-30 20:37 /user/hive/warehouse/loocha.db
drwxr-xr-x - root supergroup 0 2019-08-01 13:47 /user/hive/warehouse/student
drwxr-xr-x - root supergroup 0 2019-08-01 14:13 /user/hive/warehouse/student2
drwxr-xr-x - root supergroup 0 2019-08-01 19:31 /user/hive/warehouse/summer
drwxr-xr-x - root supergroup 0 2019-07-30 16:52 /user/hive/warehouse/test.db
drwxr-xr-x - root supergroup 0 2019-07-03 16:33 /user/hive/warehouse/u_info
drwxr-xr-x - root supergroup 0 2019-07-28 22:42 /user/hive/warehouse/visters_wai
1.2 Querying Databases
1.2.1 Listing Databases
- List the databases
hive (default)> show databases;
OK
database_name
cmz
cmz_hive2
cmz_hive3
default
hivedemo
loocha
test
Time taken: 0.013 seconds, Fetched: 7 row(s)
- Filter the listed databases
hive (default)> show databases like 'cmz*';
OK
database_name
cmz
cmz_hive2
cmz_hive3
Time taken: 0.013 seconds, Fetched: 3 row(s)
As in MySQL, the like clause filters the list; Hive accepts simple wildcard patterns (* matches any sequence of characters).
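Besides *, Hive's pattern syntax for show ... like also allows alternatives separated by |. A small sketch, assuming the databases above exist:

show databases like 'cmz*|test*';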
1.2.2 Viewing Details
- Show basic information
hive (cmz)> use cmz;
OK
Time taken: 0.025 seconds
hive (cmz)> show tables;
OK
tab_name
dept
emp
student2
Time taken: 0.031 seconds, Fetched: 3 row(s)
hive (cmz)> desc emp;
OK
col_name data_type comment
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int
Time taken: 0.078 seconds, Fetched: 8 row(s)
- Show detailed information with extended
hive (cmz)> desc extended emp;
OK
col_name data_type comment
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int

Detailed Table Information Table(tableName:emp, dbName:cmz, owner:root, createTime:1564466299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:empno, type:int, comment:null), FieldSchema(name:ename, type:string, comment:null), FieldSchema(name:job, type:string, comment:null), FieldSchema(name:mgr, type:int, comment:null), FieldSchema(name:hiredate, type:string, comment:null), FieldSchema(name:sal, type:double, comment:null), FieldSchema(name:comm, type:double, comment:null), FieldSchema(name:deptno, type:int, comment:null)], location:hdfs://master:9000/user/hive/warehouse/cmz.db/emp, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim= , serialization.format=
Time taken: 0.08 seconds, Fetched: 10 row(s)
With extended you can additionally see the object's detailed properties.
1.2.3 Switching Databases
hive (cmz)> show databases;
OK
database_name
cmz
cmz_hive2
cmz_hive3
default
hivedemo
loocha
test
Time taken: 0.011 seconds, Fetched: 7 row(s)
hive (cmz)> use loocha;
OK
Time taken: 0.035 seconds
use newdatabasename; switches to the database newdatabasename, just as in MySQL.
1.3 修改数据库¶
用户可以使用ALTER DATABASE命令为某个数据库的DBPROPERTIES设置键-值对属性值,来描述这个数据库的属性信息。 数据库的其他元数据信息都是不可更改的,包括数据库名和数据库所在的目录位置 。
hive (loocha)> alter database loocha set dbproperties('createtime'='20190802'); OK Time taken: 0.035 seconds hive (loocha)> !date; Thu Aug 1 19:51:14 CST 2019 hive (loocha)> desc database extended loocha; OK db_name comment location owner_name owner_type parameters loocha hdfs://master:9000/user/hive/warehouse/loocha.db root USER {createtime=20190802} Time taken: 0.021 seconds, Fetched: 1 row(s)
新添加了createtime属性。
1.4 Dropping a Database
- Drop an empty database
hive (loocha)> create database realcloud;
OK
Time taken: 0.045 seconds
hive (loocha)> drop database realcloud;
OK
Time taken: 0.2 seconds
The database is empty: it contains no tables or data.
- If the database you are dropping may not exist, it is best to guard the drop with if exists (a minimal sketch follows).
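A sketch of the if exists idiom (db_maybe is a hypothetical database name); the statement succeeds whether or not the database exists:

drop database if exists db_maybe;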
The following sequence creates a non-empty database; dropping it will fail, which motivates the cascade keyword below:
create database icpc;
use icpc;
create table cmz(id int);
insert into table cmz values(1);
select * from cmz;
drop database icpc;
Full session:
hive (loocha)> create database icpc;
OK
Time taken: 0.024 seconds
hive (loocha)> use icpc;
OK
Time taken: 0.02 seconds
hive (icpc)> create table cmz(id int);
OK
Time taken: 0.051 seconds
hive (icpc)> insert into table cmz values(1);
Query ID = root_20190801195511_2864592f-ce26-4e81-93b5-92ee44b7ac98
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1564639643032_0002, Tracking URL = http://master:8088/proxy/application_1564639643032_0002/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564639643032_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-08-01 19:55:19,006 Stage-1 map = 0%, reduce = 0%
2019-08-01 19:55:34,716 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.57 sec
MapReduce Total cumulative CPU time: 7 seconds 570 msec
Ended Job = job_1564639643032_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/icpc.db/cmz/.hive-staging_hive_2019-08-01_19-55-11_294_7002636935651063480-1/-ext-10000
Loading data to table icpc.cmz
Table icpc.cmz stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 7.57 sec HDFS Read: 3333 HDFS Write: 66 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 570 msec
OK
_col0
Time taken: 24.725 seconds
hive (icpc)> select * from cmz;
OK
cmz.id
1
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive (icpc)> drop database icpc;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database icpc is not empty. One or more tables exist.)
- If the database is not empty, add the cascade keyword to force the drop
hive (cmz)> drop database icpc;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database icpc is not empty. One or more tables exist.)
hive (cmz)> drop database icpc cascade;
OK
Time taken: 0.594 seconds
1.5 Creating Tables
Syntax:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
Clause by clause:
1. CREATE TABLE creates a table with the given name. If a table with the same name already exists, an exception is thrown; use IF NOT EXISTS to ignore it.
2. EXTERNAL creates an external table, pointing at the actual data via LOCATION. When Hive creates a managed (internal) table, it moves the data into the warehouse path; for an external table it only records the data's path and does not move anything. Dropping a managed table deletes both the metadata and the data; dropping an external table deletes only the metadata.
3. COMMENT adds comments to the table and its columns.
4. PARTITIONED BY creates a partitioned table.
5. CLUSTERED BY creates a bucketed table.
6. SORTED BY is rarely used.
7. ROW FORMAT DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]. You can define your own SerDe or use the built-in one; if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. When creating a table you specify its columns, and with a custom SerDe, Hive uses the SerDe to determine the table's actual column data. SerDe is short for Serializer/Deserializer; it handles serialization and deserialization.
8. STORED AS specifies the file format. Common formats: SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), RCFILE (columnar storage). Use STORED AS TEXTFILE if the data is plain text; use STORED AS SEQUENCEFILE if the data needs compression.
9. LOCATION specifies the table's storage location on HDFS.
10. LIKE copies an existing table's structure without copying its data.
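Putting several of these clauses together, a sketch of a complete definition (the table web_log, its columns, and its location are made up for illustration):

create external table if not exists web_log(
  ip string comment 'client IP',
  url string comment 'requested URL'
)
comment 'raw access log'
partitioned by (dt string)
row format delimited fields terminated by ','
stored as textfile
location '/data/web_log';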
1.5.1 Managed Tables
1.5.1.1 Definition
The tables created by default are so-called managed tables, sometimes called internal tables, because Hive controls (more or less) the lifecycle of their data. By default, Hive stores the data of these tables in subdirectories of the directory defined by the configuration property hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When we drop a managed table, Hive deletes the table's data as well. Managed tables are therefore not well suited for sharing data with other tools.
1.5.1.2 Examples
- Creating a table the ordinary way
cat>students.txt<<EOF
1,春天
2,夏天
3,秋天
4,冬天
EOF
hadoop fs -mkdir /user/hive/warehouse/student
hadoop fs -put students.txt /user/hive/warehouse/student/
hadoop fs -text /user/hive/warehouse/student/students.txt

create table if not exists student(
  id int,
  name string
) row format delimited fields terminated by ','
stored as textfile
location '/user/hive/warehouse/student/';
Full session:
[root@master test]# cat>students.txt<<EOF
> 1,春天
> 2,夏天
> 3,秋天
> 4,冬天
> EOF
[root@master test]# hadoop fs -mkdir /user/hive/warehouse/student
[root@master test]# hadoop fs -put students.txt /user/hive/warehouse/student/
[root@master test]# hadoop fs -text /user/hive/warehouse/student
text: `/user/hive/warehouse/student': Is a directory
[root@master test]# hadoop fs -text /user/hive/warehouse/student/students.txt
1,春天
2,夏天
3,秋天
4,冬天
hive (cmz)> show tables;
OK
tab_name
dept
emp
Time taken: 0.02 seconds, Fetched: 2 row(s)
hive (cmz)> create table if not exists student(
> id int,
> name string
> ) row format delimited fields terminated by ','
> stored as textfile
> location '/user/hive/warehouse/student/';
OK
Time taken: 0.059 seconds
hive (cmz)> show tables;
OK
tab_name
dept
emp
student
Time taken: 0.046 seconds, Fetched: 3 row(s)
hive (cmz)> select * from student;
OK
student.id student.name
1 春天
2 夏天
3 秋天
4 冬天
Time taken: 0.051 seconds, Fetched: 4 row(s)
- Create a table from a query result (the query's results are loaded into the new table)
create table if not exists student2 as select id, name from student;
Full session:
hive (cmz)> create table if not exists student2 as select id, name from student;
Query ID = root_20190801201833_eabee9e0-5ad5-4a79-a2a4-d6cc906e941f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1564639643032_0003, Tracking URL = http://master:8088/proxy/application_1564639643032_0003/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564639643032_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-08-01 20:18:43,107 Stage-1 map = 0%, reduce = 0%
2019-08-01 20:18:59,156 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.71 sec
MapReduce Total cumulative CPU time: 10 seconds 710 msec
Ended Job = job_1564639643032_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://master:9000/user/hive/warehouse/cmz.db/.hive-staging_hive_2019-08-01_20-18-33_987_886327310745491803-1/-ext-10001
Moving data to: hdfs://master:9000/user/hive/warehouse/cmz.db/student2
Table cmz.student2 stats: [numFiles=1, numRows=4, totalSize=36, rawDataSize=32]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 10.71 sec HDFS Read: 2879 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 710 msec
OK
id name
Time taken: 27.412 seconds
hive (cmz)> select * from student2;
OK
student2.id student2.name
1 春天
2 夏天
3 秋天
4 冬天
Time taken: 0.059 seconds, Fetched: 4 row(s)
- Create a table from an existing table's structure
hive (cmz)> create table if not exists student3 like student;
OK
Time taken: 0.095 seconds
hive (cmz)> select * from student3;
OK
student3.id student3.name
Time taken: 0.045 seconds
hive (cmz)> desc student;
OK
col_name data_type comment
id int
name string
Time taken: 0.048 seconds, Fetched: 2 row(s)
As in MySQL, this copies only the table structure, not the data.
- Check the table's type
hive (cmz)> desc formatted student2;
OK
col_name data_type comment
# col_name data_type comment

id int
name string

# Detailed Table Information
Database: cmz
Owner: root
CreateTime: Thu Aug 01 20:19:01 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/cmz.db/student2
Table Type: MANAGED_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE true
	numFiles 1
	numRows 4
	rawDataSize 32
	totalSize 36
	transient_lastDdlTime 1564661941

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	serialization.format 1
Time taken: 0.067 seconds, Fetched: 32 row(s)
hive (cmz)>
Table Type: MANAGED_TABLE indicates a managed table, also called an internal table.
1.5.2 External Tables
1.5.2.1 Definition
Because the table is external, Hive does not assume it fully owns the data. Dropping the table does not delete the data; only the metadata describing the table is removed.
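A quick way to see this, as a sketch (t_ext is a hypothetical external table whose data lives under /data/t_ext):

-- dropping an external table removes only the metadata
drop table t_ext;
-- the data files are still on HDFS afterwards
dfs -ls /data/t_ext;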
1.5.2.2 When to Use Managed vs. External Tables
Website logs collected every day flow into HDFS as text files. The heavy statistical analysis is done on top of an external table (the raw log table), while the intermediate and result tables are stored as managed tables, populated via SELECT + INSERT. A sketch of this pattern follows.
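A sketch of that pattern with made-up names (raw_log is the external table over the collected log files, pv_daily is a managed result table):

-- external table over the raw logs; dropping it would not delete the files
create external table if not exists raw_log(
  dt string,
  url string
) row format delimited fields terminated by ','
location '/logs/raw';

-- managed result table, populated from the external table via SELECT + INSERT
create table if not exists pv_daily(
  dt string,
  pv bigint
);

insert overwrite table pv_daily
select dt, count(*) from raw_log group by dt;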
1.5.2.3 Example
Create external department and employee tables and load data into them.
- Source data
cat>dept.txt<<EOF
10,ACCOUNTING,1700
20,RESEARCH,1800
30,SALES,1900
40,OPERATIONS,1700
EOF

cat>emp.txt<<EOF
7369,SMITH,CLERK,7902,1980-12-17,800.00,20
7499,ALLEN,SALESMAN,7698,1981-2-20,1600.00,300.00,30
7521,WARD,SALESMAN,7698,1981-2-22,1250.00,500.00,30
7566,JONES,MANAGER,7839,1981-4-2,2975.00,20
7654,MARTIN,SALESMAN,7698,1981-9-28,1250.00,1400.00,30
7698,BLAKE,MANAGER,7839,1981-5-1,2850.00,30
7782,CLARK,MANAGER,7839,1981-6-9,2450.00,10
7788,SCOTT,ANALYST,7566,1987-4-19,3000.00,20
7839,KING,PRESIDENT,1981-11-17,5000.00,10
7844,TURNER,SALESMAN,7698,1981-9-8,1500.00,0.00,30
7876,ADAMS,CLERK,7788,1987-5-23,1100.00,20
7900,JAMES,CLERK,7698,1981-12-3,950.00,30
7902,FORD,ANALYST,7566,1981-12-3,3000.00,20
7934,MILLER,CLERK,7782,1982-1-23,1300.00,10
EOF
- HQL
-- create the department table
create external table if not exists dept(
  deptno int,
  dname string,
  loc int
)
row format delimited fields terminated by ',';

-- create the employee table
create external table if not exists emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
)
row format delimited fields terminated by ',';

load data local inpath '/root/hive/sgg/test/emp.txt' into table emp;
load data local inpath '/root/hive/sgg/test/dept.txt' into table dept;
Full session:
[root@master test]# cat>dept.txt<<EOF
> 10,ACCOUNTING,1700
> 20,RESEARCH,1800
> 30,SALES,1900
> 40,OPERATIONS,1700
> EOF
[root@master test]# cat>emp.txt<<EOF
> 7369,SMITH,CLERK,7902,1980-12-17,800.00,20
> 7499,ALLEN,SALESMAN,7698,1981-2-20,1600.00,300.00,30
> 7521,WARD,SALESMAN,7698,1981-2-22,1250.00,500.00,30
> 7566,JONES,MANAGER,7839,1981-4-2,2975.00,20
> 7654,MARTIN,SALESMAN,7698,1981-9-28,1250.00,1400.00,30
> 7698,BLAKE,MANAGER,7839,1981-5-1,2850.00,30
> 7782,CLARK,MANAGER,7839,1981-6-9,2450.00,10
> 7788,SCOTT,ANALYST,7566,1987-4-19,3000.00,20
> 7839,KING,PRESIDENT,1981-11-17,5000.00,10
> 7844,TURNER,SALESMAN,7698,1981-9-8,1500.00,0.00,30
> 7876,ADAMS,CLERK,7788,1987-5-23,1100.00,20
> 7900,JAMES,CLERK,7698,1981-12-3,950.00,30
> 7902,FORD,ANALYST,7566,1981-12-3,3000.00,20
> 7934,MILLER,CLERK,7782,1982-1-23,1300.00,10
> EOF
[root@master test]# ls
dept.txt emp.txt students.txt test.txt
hive (cmz)> create database icpc;
OK
Time taken: 0.048 seconds
hive (cmz)> use icpc;
hive (icpc)> -- create the department table
> create external table if not exists dept(
> deptno int,
> dname string,
> loc int
> )
> row format delimited fields terminated by ',';
OK
Time taken: 0.058 seconds
hive (icpc)> -- create the employee table
> create external table if not exists emp(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double,
> deptno int
> )
> row format delimited fields terminated by ',';
OK
Time taken: 0.048 seconds
hive (icpc)> show tables;
OK
tab_name
dept
emp
Time taken: 0.02 seconds, Fetched: 2 row(s)
hive (icpc)> desc dept;
OK
col_name data_type comment
deptno int
dname string
loc int
Time taken: 0.05 seconds, Fetched: 3 row(s)
hive (icpc)> desc emp;
OK
col_name data_type comment
empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int
Time taken: 0.06 seconds, Fetched: 8 row(s)
Load the data:
hive (icpc)> load data local inpath '/root/hive/sgg/test/emp.txt' into table emp;
Loading data to table icpc.emp
Table icpc.emp stats: [numFiles=1, totalSize=646]
OK
Time taken: 0.178 seconds
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept.txt' into table dept;
Loading data to table icpc.dept
Table icpc.dept stats: [numFiles=1, totalSize=69]
OK
Time taken: 0.261 seconds
- Check the results
hive (icpc)> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 20.0 NULL
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 20.0 NULL
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 30.0 NULL
7782 CLARK MANAGER 7839 1981-6-9 2450.0 10.0 NULL
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 20.0 NULL
7839 KING PRESIDENT NULL 5000.00 10.0 NULL NULL
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 20.0 NULL
7900 JAMES CLERK 7698 1981-12-3 950.0 30.0 NULL
7902 FORD ANALYST 7566 1981-12-3 3000.0 20.0 NULL
7934 MILLER CLERK 7782 1982-1-23 1300.0 10.0 NULL
Time taken: 0.045 seconds, Fetched: 14 row(s)
hive (icpc)> select * from dept;
OK
dept.deptno dept.dname dept.loc
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
Time taken: 0.111 seconds, Fetched: 4 row(s)
Note that rows in emp.txt that omit the comm field (such as SMITH's) contain one fewer column, so the remaining values shift left and deptno parses as NULL.
- Check the table structure
hive (icpc)> desc formatted dept;
OK
col_name data_type comment
# col_name data_type comment

deptno int
dname string
loc int

# Detailed Table Information
Database: icpc
Owner: root
CreateTime: Thu Aug 01 20:32:38 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/icpc.db/dept
Table Type: EXTERNAL_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE true
	EXTERNAL TRUE
	numFiles 1
	totalSize 69
	transient_lastDdlTime 1564662921

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	field.delim ,
	serialization.format ,
Time taken: 0.072 seconds, Fetched: 33 row(s)
hive (icpc)> desc formatted emp;;
OK
col_name data_type comment
# col_name data_type comment

empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int

# Detailed Table Information
Database: icpc
Owner: root
CreateTime: Thu Aug 01 20:32:39 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/icpc.db/emp
Table Type: EXTERNAL_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE true
	EXTERNAL TRUE
	numFiles 1
	totalSize 646
	transient_lastDdlTime 1564662873

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	field.delim ,
	serialization.format ,
Time taken: 0.055 seconds, Fetched: 38 row(s)
1.5.3 Converting Between Managed and External Tables
1.5.3.1 Check the Table Type
hive (icpc)> desc formatted emp;
Table Type: MANAGED_TABLE
1.5.3.2 Convert the Managed Table emp to an External Table
hive (icpc)> alter table emp set tblproperties ('EXTERNAL'='TRUE');
OK
set tblproperties ('EXTERNAL'='TRUE') is a fixed idiom and is case-sensitive.
hive (icpc)> desc formatted emp;
Table Type: EXTERNAL_TABLE
1.5.3.3 Convert the External Table Back to a Managed Table
hive (icpc)> alter table emp set tblproperties ('EXTERNAL'='FALSE');
hive (icpc)> desc formatted emp;
Table Type: MANAGED_TABLE
Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed spellings and are case-sensitive!
1.6 Partitioned Tables
A partitioned table corresponds to a dedicated directory on the HDFS file system, and that directory holds all of the partition's data files. A partition in Hive is simply a subdirectory: a large dataset is split into smaller ones according to business needs. Selecting only the required partitions through expressions in the WHERE clause makes queries much more efficient.
1.6.1 Why Partitioned Tables
Suppose logs need to be managed by date:
/user/hive/warehouse/log_partition/20190801/20190801.log
/user/hive/warehouse/log_partition/20190802/20190802.log
/user/hive/warehouse/log_partition/20190803/20190803.log
1.6.2 Creating a Partitioned Table
cat>dept_partition.txt<<EOF
1,张三,南京
2,李四,北京
3,王五,上海
4,赵六,宿迁
EOF

create table dept_partition(
  id int,
  name string,
  adress string
)
partitioned by (month string)
row format delimited fields terminated by ',';

load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190801');
load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190802');
load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190803');
Full session:
[root@master test]# cat>dept_partition.txt<<EOF
> 1,张三,南京
> 2,李四,北京
> 3,王五,上海
> 4,赵六,宿迁
> EOF
[root@master test]# pwd
/root/hive/sgg/test
hive (icpc)> use icpc;
OK
Time taken: 0.02 seconds
hive (icpc)> show tables;
OK
tab_name
dept
emp
Time taken: 0.027 seconds, Fetched: 2 row(s)
hive (icpc)> create table dept_partition(
> id int,
> name string,
> adress string
> )
> partitioned by (month string)
> row format delimited fields terminated by ',';
OK
Time taken: 1.103 seconds
hive (icpc)> show tables;
OK
tab_name
dept
dept_partition
emp
Time taken: 0.027 seconds, Fetched: 3 row(s)
Load the data:
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190801');
Loading data to table icpc.dept_partition partition (month=20190801)
Partition icpc.dept_partition{month=20190801} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 4.839 seconds
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190802');
Loading data to table icpc.dept_partition partition (month=20190802)
Partition icpc.dept_partition{month=20190802} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 3.861 seconds
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190803');
Loading data to table icpc.dept_partition partition (month=20190803)
Partition icpc.dept_partition{month=20190803} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 1.613 seconds
Check the data on HDFS:
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition/;
Found 3 items
drwxr-xr-x - root supergroup 0 2019-08-01 22:42 /user/hive/warehouse/icpc.db/dept_partition/month=20190801
drwxr-xr-x - root supergroup 0 2019-08-01 22:43 /user/hive/warehouse/icpc.db/dept_partition/month=20190802
drwxr-xr-x - root supergroup 0 2019-08-01 22:43 /user/hive/warehouse/icpc.db/dept_partition/month=20190803
1.6.3 Querying a Partitioned Table
1.6.3.1 Single-Partition Query
hive (icpc)> select * from dept_partition where month='20190801';
OK
dept_partition.id dept_partition.name dept_partition.adress dept_partition.month
1 张三 南京 20190801
2 李四 北京 20190801
3 王五 上海 20190801
4 赵六 宿迁 20190801
Time taken: 0.078 seconds, Fetched: 4 row(s)
hive (icpc)> select * from dept_partition where month='20190802';
OK
dept_partition.id dept_partition.name dept_partition.adress dept_partition.month
1 张三 南京 20190802
2 李四 北京 20190802
3 王五 上海 20190802
4 赵六 宿迁 20190802
Time taken: 0.227 seconds, Fetched: 4 row(s)
hive (icpc)> select * from dept_partition where month='20190803';
OK
dept_partition.id dept_partition.name dept_partition.adress dept_partition.month
1 张三 南京 20190803
2 李四 北京 20190803
3 王五 上海 20190803
4 赵六 宿迁 20190803
Time taken: 0.078 seconds, Fetched: 4 row(s)
1.6.3.2 Multi-Partition Union Query
select * from dept_partition where month='20190801'
union
select * from dept_partition where month='20190802'
union
select * from dept_partition where month='20190803';
Full session:
hive (icpc)> select * from dept_partition where month='20190801'
> union
> select * from dept_partition where month='20190802'
> union
> select * from dept_partition where month='20190803';
Query ID = root_20190801224901_c3e3490f-9e8a-4c3c-87aa-f12926f6e21e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564669682555_0001, Tracking URL = http://master:8088/proxy/application_1564669682555_0001/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564669682555_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2019-08-01 22:49:10,740 Stage-1 map = 0%, reduce = 0%
2019-08-01 22:49:19,302 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.08 sec
2019-08-01 22:49:29,129 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.44 sec
MapReduce Total cumulative CPU time: 9 seconds 440 msec
Ended Job = job_1564669682555_0001
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564669682555_0002, Tracking URL = http://master:8088/proxy/application_1564669682555_0002/
Kill Command = /usr/local/hadoop-2.6.5/bin/hadoop job -kill job_1564669682555_0002
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2019-08-01 22:49:43,545 Stage-2 map = 0%, reduce = 0%
2019-08-01 22:49:52,247 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 5.0 sec
2019-08-01 22:50:03,219 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 9.11 sec
MapReduce Total cumulative CPU time: 9 seconds 110 msec
Ended Job = job_1564669682555_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 9.44 sec HDFS Read: 15383 HDFS Write: 424 SUCCESS
Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 9.11 sec HDFS Read: 15079 HDFS Write: 444 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 550 msec
OK
_u3.id _u3.name _u3.adress _u3.month
1 张三 南京 20190801
1 张三 南京 20190802
1 张三 南京 20190803
2 李四 北京 20190801
2 李四 北京 20190802
2 李四 北京 20190803
3 王五 上海 20190801
3 王五 上海 20190802
3 王五 上海 20190803
4 赵六 宿迁 20190801
4 赵六 宿迁 20190802
4 赵六 宿迁 20190803
Time taken: 62.651 seconds, Fetched: 12 row(s)
1.6.3.3 Adding Partitions
- Add a single partition
hive (icpc)> alter table dept_partition add partition(month='20190804') ;
OK
Time taken: 0.2 seconds
- Add multiple partitions
hive (icpc)> alter table dept_partition add partition(month='20190805',month='20190806');
OK
Time taken: 0.122 seconds
Note: although this statement returns OK, repeating the month key inside a single partition spec does not create two partitions; the show partitions output further down lists month=20190805 but no month=20190806.
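The syntax Hive defines for adding several partitions in one statement separates the complete partition specs with spaces, not commas. A corrected sketch:

alter table dept_partition add partition(month='20190805') partition(month='20190806');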
1.6.3.4 Dropping Partitions
- Drop a single partition
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
month=20190805
Drop the partition:
hive (icpc)> alter table dept_partition drop partition (month='20190805');
Dropped the partition month=20190805
OK
Time taken: 0.247 seconds
hive (icpc)> show partitions dept_partion;
FAILED: SemanticException [Error 10001]: Table not found dept_partion
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
Time taken: 0.065 seconds, Fetched: 4 row(s)
- Drop multiple partitions
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
Time taken: 0.086 seconds, Fetched: 4 row(s)
Drop multiple partitions:
hive (icpc)> alter table dept_partition drop partition(month='20190804',month='20190803');
OK
Time taken: 0.129 seconds
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
Time taken: 0.078 seconds, Fetched: 4 row(s)
Notice that show partitions still lists month=20190803 and month=20190804 afterwards: repeating the key inside a single partition spec does not match any partition, so nothing was actually dropped.
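When dropping several partitions in one statement, Hive separates the complete partition specs with commas. A corrected sketch:

alter table dept_partition drop partition(month='20190804'), partition(month='20190803');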
1.6.3.5 Listing Partitions
hive (icpc)> show partitions dept_partition;
OK
partition
month=20190801
month=20190802
month=20190803
month=20190804
month=20190805
Time taken: 0.055 seconds, Fetched: 5 row(s)
1.6.3.6 Viewing a Partitioned Table's Structure
hive (icpc)> desc formatted dept_partition;
OK
col_name data_type comment
# col_name data_type comment

id int
name string
adress string

# Partition Information
# col_name data_type comment

month string

# Detailed Table Information
Database: icpc
Owner: root
CreateTime: Thu Aug 01 22:42:16 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://master:9000/user/hive/warehouse/icpc.db/dept_partition
Table Type: MANAGED_TABLE
Table Parameters:
	transient_lastDdlTime 1564670536

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
	field.delim ,
	serialization.format ,
Time taken: 0.089 seconds, Fetched: 34 row(s)
1.6.4 Partitioning Notes
- Create a two-level partitioned table
use icpc;
create table dept_partition2(
  id int,
  name string,
  loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by ',';
Full session:
hive (icpc)> use icpc;
OK
Time taken: 0.034 seconds
hive (icpc)> create table dept_partition2(
> id int,
> name string,
> loc string
> )
> partitioned by (month string,day string)
> row format delimited fields terminated by ',';
OK
Time taken: 0.115 seconds
- Load data the normal way
load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition2 partition(month='201908', day='13');
Full session:
hive (icpc)> load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition2 partition(month='201908', day='13');
Loading data to table icpc.dept_partition2 partition (month=201908, day=13)
Partition icpc.dept_partition2{month=201908, day=13} stats: [numFiles=1, numRows=0, totalSize=64, rawDataSize=0]
OK
Time taken: 0.583 seconds
hive (icpc)> show partitions dept_partition2;
OK
partition
month=201908/day=13
Time taken: 0.066 seconds, Fetched: 1 row(s)
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition2;
Found 1 items
drwxr-xr-x - root supergroup 0 2019-08-01 23:35 /user/hive/warehouse/icpc.db/dept_partition2/month=201908
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition2/month=201908/;
Found 1 items
drwxr-xr-x - root supergroup 0 2019-08-01 23:35 /user/hive/warehouse/icpc.db/dept_partition2/month=201908/day=13
hive (icpc)> dfs -ls /user/hive/warehouse/icpc.db/dept_partition2/month=201908/day=13;
Found 1 items
-rwxr-xr-x 3 root supergroup 64 2019-08-01 23:35 /user/hive/warehouse/icpc.db/dept_partition2/month=201908/day=13/dept_partition.txt
Query the partition:
hive (icpc)> select * from dept_partition2 where month='201908' and day='13';
OK
dept_partition2.id dept_partition2.name dept_partition2.loc dept_partition2.month dept_partition2.day
1 张三 南京 201908 13
2 李四 北京 201908 13
3 王五 上海 201908 13
4 赵六 宿迁 201908 13
Time taken: 0.22 seconds, Fetched: 4 row(s)
To create multi-level partitions, list the keys in partitioned by (month string, day string): each key partitions the level below the previous one, so here every month partition contains day partitions, and on HDFS every month directory contains day directories.
- Three ways to upload data directly into a partition directory on HDFS and have the partitioned table pick it up (sketches of all three follow this list):

Method 1: upload the data, then repair
- Upload the data
- Query the data (the freshly uploaded data is not visible yet)
- Run the repair command
- Query the data again

Method 2: upload the data, then add the partition
- Upload the data
- Add the partition
- Query the data

Method 3: create the directory, then load the data into the partition
- Create the directory
- Load the data
- Query the data
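Sketches of the three methods, run from the hive CLI against the dept_partition table above (the month values are illustrative, and msck repair table is the repair command meant in method 1):

-- Method 1: upload the data, then repair the metastore
dfs -mkdir -p /user/hive/warehouse/icpc.db/dept_partition/month=20190806;
dfs -put /root/hive/sgg/test/dept_partition.txt /user/hive/warehouse/icpc.db/dept_partition/month=20190806;
msck repair table dept_partition;

-- Method 2: upload the data, then add the partition explicitly
dfs -mkdir -p /user/hive/warehouse/icpc.db/dept_partition/month=20190807;
dfs -put /root/hive/sgg/test/dept_partition.txt /user/hive/warehouse/icpc.db/dept_partition/month=20190807;
alter table dept_partition add partition(month='20190807');

-- Method 3: create the directory, then load data into the partition
dfs -mkdir -p /user/hive/warehouse/icpc.db/dept_partition/month=20190808;
load data local inpath '/root/hive/sgg/test/dept_partition.txt' into table dept_partition partition(month='20190808');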
1.7 Altering Tables
1.7.1 Renaming a Table
ALTER TABLE table_name RENAME TO new_table_name
Rename the table:
1: jdbc:hive2://master:10000> show tables;
+---------------+--+
|   tab_name    |
+---------------+--+
| cmz_external  |
| dept          |
| emp           |
| student       |
| student2      |
| summer        |
| u_info        |
| visters_wai   |
+---------------+--+
8 rows selected (0.056 seconds)
1: jdbc:hive2://master:10000> alter table summer rename to caimengzhi;
No rows affected (0.308 seconds)
1: jdbc:hive2://master:10000> show tables;
+---------------+--+
|   tab_name    |
+---------------+--+
| caimengzhi    |
| cmz_external  |
| dept          |
| emp           |
| student       |
| student2      |
| u_info        |
| visters_wai   |
+---------------+--+
8 rows selected (0.06 seconds)
1.7.2 Adding/Updating/Replacing Columns
1.7.2.1 Syntax
- Update a column
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
- Add or replace columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
ADD adds new columns after the existing columns (but before any partition columns); REPLACE replaces all of the table's columns.
1.7.2.2 Example
- Check the table structure
desc caimengzhi;
     |__________ table name
- Add a column
alter table caimengzhi add columns(job string);
            |                      |   |______ new column's type
            |                      |__________ new column's name
            |_________________________________ table name
- Check the table structure
desc caimengzhi;
- Update a column
alter table caimengzhi change column job myjob string;
                                     |   |     |_____ new column's type
                                     |   |___________ new column's name
                                     |_______________ old column's name
The column keyword after change can be omitted.
- Check the table structure
desc caimengzhi;
- Replace columns
alter table caimengzhi replace columns(myjob string, localtion string);
Note: replacing columns drops all of the previous columns and substitutes the new list; use with caution.
- Check the table structure
desc caimengzhi;
Full session:
1: jdbc:hive2://master:10000> desc caimengzhi;
+-----------+------------------------------------+----------+--+
| col_name  |             data_type              | comment  |
+-----------+------------------------------------+----------+--+
| name      | string                             |          |
| friends   | array<string>                      |          |
| children  | map<string,int>                    |          |
| address   | struct<street:string,city:string>  |          |
+-----------+------------------------------------+----------+--+
4 rows selected (0.387 seconds)
# add a new column
1: jdbc:hive2://master:10000> alter table caimengzhi add columns(job string);
No rows affected (0.252 seconds)
1: jdbc:hive2://master:10000> desc caimengzhi;
+-----------+------------------------------------+----------+--+
| col_name  |             data_type              | comment  |
+-----------+------------------------------------+----------+--+
| name      | string                             |          |
| friends   | array<string>                      |          |
| children  | map<string,int>                    |          |
| address   | struct<street:string,city:string>  |          |
| job       | string                             |          |
+-----------+------------------------------------+----------+--+
5 rows selected (0.108 seconds)
# update the column [job string becomes myjob string]
1: jdbc:hive2://master:10000> alter table caimengzhi change column job myjob string;
No rows affected (0.241 seconds)
1: jdbc:hive2://master:10000> desc caimengzhi;
+-----------+------------------------------------+----------+--+
| col_name  |             data_type              | comment  |
+-----------+------------------------------------+----------+--+
| name      | string                             |          |
| friends   | array<string>                      |          |
| children  | map<string,int>                    |          |
| address   | struct<street:string,city:string>  |          |
| myjob     | string                             |          |
+-----------+------------------------------------+----------+--+
5 rows selected (0.123 seconds)
# replace the columns
1: jdbc:hive2://master:10000> desc caimengzhi;
+-----------+------------------------------------+----------+--+
| col_name  |             data_type              | comment  |
+-----------+------------------------------------+----------+--+
| name      | string                             |          |
| friends   | array<string>                      |          |
| children  | map<string,int>                    |          |
| address   | struct<street:string,city:string>  |          |
| myjob     | string                             |          |
+-----------+------------------------------------+----------+--+
5 rows selected (0.197 seconds)
1: jdbc:hive2://master:10000> alter table caimengzhi replace columns(myjob string, localtion string);
No rows affected (0.209 seconds)
1: jdbc:hive2://master:10000> desc caimengzhi;
+------------+------------+----------+--+
|  col_name  | data_type  | comment  |
+------------+------------+----------+--+
| myjob      | string     |          |
| localtion  | string     |          |
+------------+------------+----------+--+
1.7.3 Dropping a Table
drop table caimengzhi;
           |________ name of the table to drop
The drop in action:
1: jdbc:hive2://master:10000> drop table caimengzhi;
No rows affected (0.338 seconds)
1: jdbc:hive2://master:10000> show tables like 'caimengzhi';
+-----------+--+
| tab_name  |
+-----------+--+
+-----------+--+
No rows selected (0.032 seconds)