In a previous post we walked through setting up the Hive environment:
Hive 3.1.1 Environment Setup Guide
https://www.cndba.cn/dave/article/3335
In this post we look at the basic operations on Hive objects.
1 Creating a Database
hive> create database cndba;
OK
Time taken: 1.103 seconds
hive>
Creating a new database generates a cndba.db directory under /user/hive/warehouse/ in HDFS. If you do not create a database, or do not switch to one with use <database_name>, Hive uses the default database, which can also be selected explicitly:
hive> use default;
Tables in the default database are created directly under /user/hive/warehouse/:
[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /user
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:55 /user/cndba
drwxr-xr-x - cndba supergroup 0 2019-01-23 23:55 /user/cndba/output
-rw-r--r-- 2 cndba supergroup 0 2019-01-23 23:55 /user/cndba/output/_SUCCESS
-rw-r--r-- 2 cndba supergroup 34795 2019-01-23 23:55 /user/cndba/output/part-r-00000
drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive
drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive/warehouse
drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive/warehouse/cndba.db
[dave@www.cndba.cn ~]$
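A database can also be created with a comment and an explicit HDFS location instead of the default warehouse path. A minimal sketch, assuming a hypothetical mydb database and /user/hive/mydb path:
-- mydb and its location are illustrative examples, not part of the session above
create database if not exists mydb
comment 'demo database'
location '/user/hive/mydb';
describe database mydb;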
2 Creating a Table
Syntax:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
Example:
hive> create table employees(id int,name string,salary string) row format delimited fields terminated by '/,';
2019-03-09 01:10:08,488 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] DataNucleus.Persistence: Request to load fields "comment,name,type" of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored
2019-03-09 01:10:08,488 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] DataNucleus.Persistence: Request to load fields "comment,name,type" of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored
OK
Time taken: 0.071 seconds
hive>
The statement above creates an employees table; the table and column definitions are recorded in the MySQL metastore tables (such as TBLS). At the same time, an employees directory is created in HDFS under the warehouse directory of the current database (here the default database, /user/hive/warehouse/). The data of every managed table (i.e. every table that is not an external table) is stored under this directory.
[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /user/hive
drwxr-xr-x - cndba supergroup 0 2019-03-09 01:10 /user/hive/warehouse
drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive/warehouse/cndba.db
drwxr-xr-x - cndba supergroup 0 2019-03-08 21:55 /user/hive/warehouse/dave
drwxr-xr-x - cndba supergroup 0 2019-03-09 01:10 /user/hive/warehouse/employees
drwxr-xr-x - cndba supergroup 0 2019-03-07 23:53 /user/hive/warehouse/t_cndba
drwxr-xr-x - cndba supergroup 0 2019-03-08 21:27 /user/hive/warehouse/tpm_cndba
[dave@www.cndba.cn ~]$
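The optional clauses from the syntax above can be combined in one statement. A sketch of a managed table that declares a comment, a partition column, a field delimiter, and an explicit storage format (the emp_demo table and its columns are made up for illustration):
-- emp_demo is a hypothetical demo table
create table if not exists emp_demo(
  id int comment 'employee id',
  name string,
  salary string)
comment 'demo employee table'
partitioned by (dept string)
row format delimited fields terminated by ','
stored as textfile;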
3 Creating a Temporary Table
A table holding intermediate results can be created directly from a query (CTAS):
hive> create table tpm_cndba as select * from t_cndba;
Query ID = cndba_20190308212650_1110d65a-bdbc-43d2-8386-38e12391eb95
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
2019-03-08 21:26:54,059 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
2019-03-08 21:26:54,475 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
Starting Job = job_1551186730130_0001, Tracking URL = http://hadoopmaster:8088/proxy/application_1551186730130_0001/
Kill Command = /home/cndba/hadoop/bin/mapred job -kill job_1551186730130_0001
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2019-03-08 21:27:07,893 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1551186730130_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/.hive-staging_hive_2019-03-08_21-26-50_554_9030156807573657821-1/-ext-10002
Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/tpm_cndba
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 19.886 seconds
hive> show tables;
OK
t_cndba
tpm_cndba
Time taken: 0.021 seconds, Fetched: 2 row(s)
hive>
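Note that tpm_cndba above is an ordinary table created with CTAS. Hive 0.14 and later also offer true session-scoped temporary tables that are dropped automatically when the session ends; a minimal sketch (tmp_cndba is a hypothetical name):
-- a real TEMPORARY table; it disappears when the session closes
create temporary table tmp_cndba as select * from t_cndba;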
4 Loading Data with LOAD
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
With LOCAL, the data is loaded from the local file system (the file is copied into HDFS).
Without LOCAL, the data is loaded from HDFS (note: the file is moved, not copied, and its name is left unchanged).
OVERWRITE replaces the data already in the table (or in the specified partition); without OVERWRITE the new data is simply appended, with no deduplication.
Example:
Load data from the local file system and append it to the table:
hive> load data local inpath '/home/cndba/dave.txt' into table employees;
Loading data to table default.employees
OK
Time taken: 0.265 seconds
hive>
hive> select * from employees;
OK
100 Steven 24000.00
101 Neena 17000.00
102 Lex 17000.00
103 Alexander 9000.00
104 Bruce 6000.00
105 David 4800.00
106 Valli 4800.00
107 Diana 4200.00
…
Load data from HDFS and append the records:
hive> load data inpath '/dave/dave.txt' into table employees;
Loading data to table default.employees
OK
Time taken: 0.207 seconds
Load data from HDFS and overwrite the table:
hive> load data inpath '/dave/dave.txt' overwrite into table employees;
Loading data to table default.employees
OK
Time taken: 0.258 seconds
hive>
Note: the field delimiter of the text data is declared when the table is created. To load a file that uses a custom delimiter, specify that delimiter in the CREATE TABLE statement first, and then LOAD DATA into the table, as sketched below.
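For example, to load a tab-delimited file, declare the tab as the field delimiter when creating the table and then load the file. A sketch (the emp_tab table and the file path are hypothetical):
-- emp_tab and /home/cndba/emp_tab.txt are illustrative only
create table emp_tab(id int, name string, salary string)
row format delimited fields terminated by '\t';
load data local inpath '/home/cndba/emp_tab.txt' into table emp_tab;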
5 Inserting Data
5.1 INSERT Syntax
The syntax is as follows:
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement
In an INSERT, the FROM clause can follow the SELECT clause or precede the INSERT clause; the two statements below are equivalent.
hive> insert overwrite table cn select * from employees;
hive> from employees insert overwrite table cn select * where id > 1;
Query ID = cndba_20190309020852_5bc199f6-0431-43c0-adad-4e91b8b4ae1a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2019-03-09 02:08:53,370 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
2019-03-09 02:08:53,403 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
Starting Job = job_1551186730130_0005, Tracking URL = http://hadoopmaster:8088/proxy/application_1551186730130_0005/
Kill Command = /home/cndba/hadoop/bin/mapred job -kill job_1551186730130_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-03-09 02:09:02,268 Stage-1 map = 0%, reduce = 0%
2019-03-09 02:09:10,589 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec
2019-03-09 02:09:19,875 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.99 sec
MapReduce Total cumulative CPU time: 7 seconds 990 msec
Ended Job = job_1551186730130_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/cn/.hive-staging_hive_2019-03-09_02-08-52_882_2438057570131793940-1/-ext-10000
Loading data to table default.cn
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.99 sec HDFS Read: 17771 HDFS Write: 3611 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 990 msec
OK
Time taken: 28.319 seconds
hive>
Traditionally Hive did not support inserting rows one at a time with INSERT, nor UPDATE: data was loaded into a prepared table with LOAD (or INSERT ... SELECT), and once imported it could not be modified; you either dropped the whole table or built a new table and loaded new data. (Recent Hive releases, including 3.x, do support INSERT ... VALUES, and UPDATE/DELETE on ACID transactional tables, but bulk loading remains the usual pattern.)
If you want to insert a single row through a query, one workaround is: assuming some table B has at least one row, a row can be inserted into table A(int, string) like this:
from B insert into table A select 1, 'abc' limit 1;
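For completeness: Hive 0.14 and later (including the Hive 3.1 used in this post) also accept a direct single-row insert with INSERT ... VALUES, without going through another table. A sketch, assuming the cn table has the same (id, name, salary) columns as employees:
-- assumes cn(id int, name string, salary string)
insert into table cn values (200, 'Dave', '10000.00');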
5.2 Writing Query Results to a Directory
The syntax is as follows:
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
Export to a local directory:
hive> insert overwrite local directory '/tmp/dave' select * from employees;
Note that the export target here is the directory /tmp/dave; the data file is generated inside that directory:
[hadoop@hadoopMaster tmp]$ ll /tmp/dave
total 8
-rw-r--r--. 1 hadoop hadoop 8056 Mar 9 21:30 000000_0
[hadoop@hadoopMaster tmp]$ ll /tmp/dave/000000_0
-rw-r--r--. 1 hadoop hadoop 8056 Mar 9 21:30 /tmp/dave/000000_0
[hadoop@hadoopMaster tmp]$
Export to HDFS:
hive> insert overwrite directory '/dave' select * from employees;
[hadoop@Slave1 hadoop]$ hdfs dfs -ls /dave
Found 2 items
-rw-r--r-- 2 hadoop supergroup 8056 2019-03-09 21:31 /dave/000000_0
-rw-r--r-- 2 hadoop supergroup 0 2019-03-03 01:11 /dave/_SUCCESS
[hadoop@Slave1 hadoop]$
A single source can be inserted into several target tables or directories at once; such a multi-target insert is written as one statement:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2013-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
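Since Hive 0.11, a directory export can also declare its own row format, so the output uses a readable delimiter instead of the default ^A separator. A sketch (the /tmp/dave_csv path is just an example):
-- export employees as comma-separated text to a hypothetical local directory
insert overwrite local directory '/tmp/dave_csv'
row format delimited fields terminated by ','
select * from employees;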
6 Showing Tables
List table names:
hive> show tables;
OK
employees
page_view
Time taken: 0.112 seconds, Fetched: 2 row(s)
hive>
List table names with partial matching; note that the pattern here is written like a regular expression, with a . before the * wildcard:
hive> show tables 'cn.*';
OK
cndba
Time taken: 0.117 seconds, Fetched: 1 row(s)
hive>
Show all partitions of a table (this returns an error if the table is not partitioned):
SHOW PARTITIONS page_view;
Describe a table's structure:
hive> desc cndba;
OK
id int
name string
salary string
Time taken: 0.138 seconds, Fetched: 3 row(s)
hive>
View a limited number of rows with the LIMIT keyword, as in Greenplum:
hive> select * from employees limit 3;
OK
100 Steven 24000.00
101 Neena 17000.00
102 Lex 17000.00
Time taken: 0.561 seconds, Fetched: 3 row(s)
hive>
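Beyond the plain desc, more detail about a table (owner, location, table type, SerDe, statistics) can be shown with the commands below:
-- detailed metadata for the employees table
describe formatted employees;
show create table employees;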
7 External Tables
When Hive creates a managed (internal) table, it moves the data into the path the warehouse points to; when it creates an external table, it only records the path where the data lives and does not move anything, so the data source can sit anywhere. When an external table is dropped, only the metadata is deleted; the data stored in HDFS is left in place.
[hadoop@hadoopMaster hadoop]$ hdfs dfs -mkdir -p /ext/ahdba
[hadoop@hadoopMaster ~]$ hdfs dfs -put ~/ahdba.txt /ext/ahdba
[hadoop@hadoopMaster hadoop]$ hdfs dfs -ls -R /ext
drwxr-xr-x - hadoop supergroup 0 2019-03-09 21:52 /ext/ahdba
-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 21:52 /ext/ahdba/ahdba.txt
hive> create external table ahdba(id int,name string,salary string) row format delimited fields terminated by '/,' stored as textfile location '/ext/ahdba';
OK
Time taken: 0.266 seconds
hive>
hive> select * from ahdba limit 3;
OK
100 Steven 24000.00
101 Neena 17000.00
102 Lex 17000.00
Time taken: 0.243 seconds, Fetched: 3 row(s)
hive>
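Because an external table only records a path, several external tables can even be defined over the same HDFS directory, each just pointing at the existing files. A sketch (ahdba2 is a made-up name; the statement mirrors the one above):
-- ahdba2 is hypothetical; it reads the same files under /ext/ahdba
create external table ahdba2(id int,name string,salary string)
row format delimited fields terminated by '/,'
stored as textfile location '/ext/ahdba';
select count(*) from ahdba2;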
8 Partitioned Tables
When the data set is large, a partitioned table lets queries quickly filter down to the rows belonging to a given value of the partition column. Suppose the table anqing has two partitions, part1 and part2: in HDFS this just means two extra subdirectories under the anqing directory, one named after each partition. A Hive partition is therefore nothing more than an extra level of directories; the benefits are easier per-partition statistics, better efficiency, and a smaller data set to scan.
Example:
hive> create table anqing(id int,name string,salary string) partitioned by (part_flag string) row format delimited fields terminated by '/,';
Load data into the partitions:
hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='1');
Loading data to table default.anqing partition (part_flag=1)
OK
Time taken: 1.003 seconds
hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='2');
Loading data to table default.anqing partition (part_flag=2)
OK
Time taken: 0.752 seconds
hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='3');
Loading data to table default.anqing partition (part_flag=3)
OK
Time taken: 1.013 seconds
hive>
The resulting directory structure:
[hadoop@Slave1 hadoop]$ hdfs dfs -ls -R /user/hive/warehouse/anqing
drwxr-xr-x - hadoop supergroup 0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=1
-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=1/dave.txt
drwxr-xr-x - hadoop supergroup 0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=2
-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=2/dave.txt
drwxr-xr-x - hadoop supergroup 0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=3
-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=3/dave.txt
[hadoop@Slave1 hadoop]$
hive> select count(1) from anqing where part_flag='1';
Query ID = hadoop_20190309220709_376b5416-1e10-4633-9ebf-aa5ed30eadd0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2019-03-09 22:07:10,274 INFO [11ef99e0-fe20-400e-87b6-1006d834fa31 main] client.RMProxy: Connecting to ResourceManager at hadoopMaster/192.168.56.100:8032
2019-03-09 22:07:10,301 INFO [11ef99e0-fe20-400e-87b6-1006d834fa31 main] client.RMProxy: Connecting to ResourceManager at hadoopMaster/192.168.56.100:8032
Starting Job = job_1552137405153_0007, Tracking URL = http://hadoopMaster:8088/proxy/application_1552137405153_0007/
Kill Command = /home/hadoop/hadoop/bin/mapred job -kill job_1552137405153_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-03-09 22:07:17,411 Stage-1 map = 0%, reduce = 0%
2019-03-09 22:07:24,641 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.68 sec
2019-03-09 22:07:29,805 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.54 sec
MapReduce Total cumulative CPU time: 3 seconds 540 msec
Ended Job = job_1552137405153_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.54 sec HDFS Read: 14697 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 540 msec
OK
106
Time taken: 22.651 seconds, Fetched: 1 row(s)
hive>
Related commands:
SHOW TABLES; # list all tables
SHOW TABLES 'TMP'; # pattern matching is supported
SHOW PARTITIONS TMP_TABLE; # list the partitions of a table
DESCRIBE TMP_TABLE; # show the table structure
hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
Time taken: 0.171 seconds, Fetched: 3 row(s)
hive> desc anqing;
OK
id int
name string
salary string
part_flag string
# Partition Information
# col_name data_type comment
part_flag string
Time taken: 0.175 seconds, Fetched: 8 row(s)
hive>
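Besides loading files into a fixed partition, rows can be routed to partitions automatically with a dynamic partition insert, where the partition value comes from the last column of the SELECT. A minimal sketch (the staging table anqing_src is hypothetical):
-- anqing_src is a hypothetical staging table with the same columns plus part_flag
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table anqing partition(part_flag)
select id, name, salary, part_flag from anqing_src;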
9 Bucketed Tables
Bucketing in Hive corresponds to partitioning in MapReduce (rows are routed to buckets the way a partitioner routes records to reducers), whereas a Hive partition merely splits the data into different directories.
9.1 Creating a Bucketed Table
hive> create table emp_buck(id int,name string,salary string) clustered by(id) sorted by(id DESC) into 4 buckets row format delimited fields terminated by ',';
OK
Time taken: 0.192 seconds
hive>
Meaning: rows are bucketed by the id column into 4 buckets, and within each bucket they are sorted by id. Creating a bucketed table does not reorganize any data by itself; it only requires that the data inserted into it is already split into buckets.
9.2 Inserting Data into a Bucketed Table
LOAD is generally not used to put data into a bucketed table, because loaded data is not automatically split into one file per bucket according to the bucketing rule. Instead, data is inserted with an INSERT ... SELECT.
Set the relevant variables: enable bucketing enforcement and set the number of reducers to the number of buckets:
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
hive> insert overwrite table emp_buck select * from ahdba cluster by(id);
Or:
insert overwrite table emp_buck select * from ahdba distribute by(id) sort by(id asc);
Here distribute by(id) sort by(id asc) is equivalent to cluster by(id); cluster by(id) = bucketing + sorting.
The rows are first distributed, then sorted locally within each reducer. The difference is that distribute by is more flexible: you can distribute by one column and sort by another.
The sub-query produces 4 output files, which serve as the input of the outer insert.
9.3 How Bucketing Works and What It Is For
How it works:
Hive organizes buckets on a chosen column: it hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record lands in (the same idea as MapReduce's getPartition method).
What it is for:
(1) The biggest benefit is more efficient joins, provided both tables are bucketed and their bucket counts are equal or multiples of one another.
For example: select a.id,a.name,b.addr from a join b on a.id = b.id;
If tables a and b are both bucketed, with id as the bucketing column, then for a JOIN on that shared column only the buckets holding the same column values need to be joined against each other, which can greatly reduce the amount of data involved in the JOIN.
(2) Sampling is more efficient. When working with large data sets, being able to trial-run a query on a small slice of the data while developing and refining queries is very convenient.
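For example, bucket sampling on the emp_buck table created above can read a single bucket instead of scanning the whole table; a sketch:
-- read only bucket 1 of the 4 buckets, bucketed on id
select * from emp_buck tablesample(bucket 1 out of 4 on id);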
10 Creating Views
CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...
Note: a view is read-only; LOAD/INSERT/ALTER cannot be used on it.
hive> create view v_ahdba as select * from ahdba;
OK
Time taken: 0.43 seconds
hive> select * from v_ahdba limit 3;
OK
100 Steven 24000.00
101 Neena 17000.00
102 Lex 17000.00
Time taken: 0.21 seconds, Fetched: 3 row(s)
hive>
11 Alter Table
11.1 Adding Partitions
ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
: PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
Example:
hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
Time taken: 0.175 seconds, Fetched: 3 row(s)
hive>
hive> ALTER TABLE anqing ADD
> PARTITION (part_flag='4') location '/user/hive/warehouse/anqing/part_flag=4'
> PARTITION (part_flag='5') location '/user/hive/warehouse/anqing/part_flag=5' ;
OK
Time taken: 0.422 seconds
hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
part_flag=4
part_flag=5
Time taken: 0.317 seconds, Fetched: 5 row(s)
hive>
Dropping partitions:
ALTER TABLE table_name DROP partition_spec, partition_spec,...
Example:
hive> ALTER TABLE anqing DROP PARTITION (part_flag='4');
Dropped the partition part_flag=4
OK
Time taken: 0.451 seconds
hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
part_flag=5
Time taken: 0.144 seconds, Fetched: 4 row(s)
hive>
11.2 Renaming a Table
hive> show tables;
OK
ahdba
anqing
cndba
emp_buck
employees
page_view
v_ahdba
Time taken: 0.046 seconds, Fetched: 7 row(s)
hive> alter table ahdba rename to huaining;
OK
Time taken: 0.187 seconds
hive> show tables;
OK
anqing
cndba
emp_buck
employees
huaining
page_view
v_ahdba
Time taken: 0.055 seconds, Fetched: 7 row(s)
hive>
This command renames a table. The partition names are not changed. In older Hive versions only the metadata was updated, so the old table name was never really "released": data written under the old path still showed up in the renamed table. Current Hive versions generally also move a managed table's HDFS directory to match the new name.
11.3 Changing Columns/Properties
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
This command lets you change a column's name, data type, comment, position, or any combination of these.
11.4 Adding/Replacing Columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
ADD appends new columns after all existing columns (but before the partition columns); REPLACE replaces all of the table's columns.
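As an illustration against the hypothetical emp_demo table sketched in section 2, the statements below rename the salary column (keeping its type) and then append a new column:
-- emp_demo is the hypothetical demo table from section 2
alter table emp_demo change column salary sal string comment 'monthly salary';
alter table emp_demo add columns (addr string comment 'employee address');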
12 Ways to Save the Results of a SELECT
1. Save the query results into a new Hive table:
create table t_tmp
as
select * from t_p;
2. Save the query results into an existing Hive table:
insert into table t_tmp
select * from t_p;
3. Save the query results to a directory (either local or in HDFS):
insert overwrite local directory '/home/hadoop/test'
select * from t_p;
To HDFS:
insert overwrite directory '/aaa/test'
select * from t_p;
13 Querying and Deleting Data
13.1 Querying
13.1.1 Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
Notes:
CLUSTER BY column: the rows are distributed to reducers by this column; remember to set the number of reducers appropriately.
order by performs a global sort over the whole input, so it runs with a single reducer, which can take a long time when the input is large.
sort by is not a global sort; it sorts the data before it enters each reducer. With sort by and mapred.reduce.tasks > 1, each reducer's output is sorted, but the overall output is not globally ordered.
distribute by(column) sends rows to reducers according to the specified column, using a hash of its value.
cluster by(column) distributes like distribute by and additionally sorts by that column.
If the distribution column and the sort column are the same, then cluster by = distribute by + sort by; a sketch follows.
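A minimal illustration using the employees table: the two queries below are equivalent.
-- four reducers, so the output is split into four sorted files
set mapreduce.job.reduces=4;
select * from employees cluster by id;
select * from employees distribute by id sort by id;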
Querying an internal (managed) table:
select * from inner_table;
select count(*) from inner_table;
Dropping an internal table deletes both the table's metadata and its data; dropping an external table deletes only the metadata and keeps the data.
Syntax:
DROP TABLE tbl_name
13.1.2 GROUP BY
groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
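A short GROUP BY example against the partitioned anqing table from section 8, counting rows per partition value:
-- row count per part_flag value
select part_flag, count(*) from anqing group by part_flag;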
13.1.3 Order/Sort By
ORDER BY syntax:
colOrder: ( ASC | DESC )
orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy
SORT BY syntax. The sort order depends on the column type: a numeric column sorts in numeric order, a string column in lexicographic order.
colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy
13.2 Limit/Top/REGEX
LIMIT restricts the number of records returned. The rows returned are chosen arbitrarily. The following query returns 5 arbitrary records from table t1:
SELECT * FROM t1 LIMIT 5
The following queries return the 5 sales representatives with the largest sales:
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
A SELECT statement can use a regular expression to choose columns. The following query returns all columns except ds and hr (recent Hive versions also require set hive.support.quoted.identifiers=none; for this):
SELECT `(ds|hr)?+.+` FROM sales
14 Joins in Hive
14.1 Syntax:
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
Hive supports only equality joins (equi-joins), outer joins, and left/right joins. Hive does not support non-equality join conditions, because they are very hard to express as map/reduce jobs. Hive does, however, support joining more than two tables.
14.2 Only equality joins are supported.
For example:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
14.3 More than two tables can be joined.
For example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
If every table in the join uses the same join key, the join is compiled into a single map/reduce job, for example:
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c
ON (c.key = b.key1)
is compiled into a single map/reduce job, because only b.key1 is used as the join key. By contrast:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
is compiled into two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.
14.4 What each map/reduce job does during a join.
The reducer caches the records of every table in the join sequence except the last one, then streams the last table's records through to produce the result. This implementation helps reduce memory usage on the reduce side; in practice you should put the largest table last (otherwise a lot of memory is wasted on caching). For example:
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
all the tables use the same join key (one map/reduce job). The reduce side caches the records of tables a and b, and computes a join result each time it reads a record from table c. Similarly:
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
This one uses two map/reduce jobs: the first caches table a and streams table b; the second caches the result of the first job and streams table c.
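If the largest table cannot be written last, the table to stream can instead be chosen explicitly with the STREAMTABLE hint; a sketch based on the same example:
-- stream table a instead of the last table in the join
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)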
15 User-Defined Functions in Hive
When Hive's built-in functions cannot cover what your processing needs, you can turn to user-defined functions (UDF: user-defined function).
15.1 UDF
Basic commands:
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
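A custom UDF is implemented in Java, packaged into a jar, and registered in the session before it can be called. A sketch of the registration step only (the jar path, function name, and class name are all hypothetical):
-- /home/cndba/udf/my_udf.jar and com.cndba.hive.MyUpperUDF are made-up examples
add jar /home/cndba/udf/my_udf.jar;
create temporary function my_upper as 'com.cndba.hive.MyUpperUDF';
select my_upper(name) from employees limit 3;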
15.2 UDTF
A UDTF is a table-generating function: it turns a single input row into multiple output rows (explode is a built-in example). Using these functions has a few restrictions:
1. No other columns are allowed in the SELECT, e.g.:
SELECT pageid, explode(adid_list) AS myCol…
2. They cannot be nested, e.g.:
SELECT explode(explode(adid_list)) AS myCol… # not supported
3. GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported, e.g.:
SELECT explode(adid_list) AS myCol … GROUP BY myCol
15.3 EXPLODE
Scenario: transposing data into rows, for example:
create table test2(mycol array<int>);
insert OVERWRITE table test2 select * from (select array(1,2,3) from a union all select array(7,8,9) from d)c;
hive> select * from test2;
OK
[1,2,3]
[7,8,9]
hive> SELECT explode(myCol) AS myNewCol FROM test2;
OK
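The restriction that no other columns may appear next to a UDTF is usually worked around with LATERAL VIEW, which joins each exploded row back to its source row. A sketch against a hypothetical pageAds(pageid string, adid_list array<int>) table:
-- pageAds is a hypothetical table; adTable and adid are the lateral view alias and output column
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;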