
Basic Operations on Hive Objects

2019-03-09 22:54 · Reposted · Hive
Author: dave

In a previous post we walked through setting up the Hive environment:
Hive 3.1.1 Environment Setup Guide
https://www.cndba.cn/dave/article/3335

In this post we look at basic operations on Hive objects.


1 Creating a Database

hive> create database cndba;
OK
Time taken: 1.103 seconds
hive>

Creating a new database generates a cndba.db folder under /user/hive/warehouse/ in HDFS. If you do not create a new database and do not run hive> use <database name>, the system's default database is used. You can also select it explicitly:

hive> use default;

Tables are then created under the default /user/hive/warehouse/ directory:

[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /user
drwxr-xr-x   - cndba supergroup          0 2019-01-23 23:55 /user/cndba
drwxr-xr-x   - cndba supergroup          0 2019-01-23 23:55 /user/cndba/output
-rw-r--r--   2 cndba supergroup          0 2019-01-23 23:55 /user/cndba/output/_SUCCESS
-rw-r--r--   2 cndba supergroup      34795 2019-01-23 23:55 /user/cndba/output/part-r-00000
drwxr-xr-x   - cndba supergroup          0 2019-03-07 23:48 /user/hive
drwxr-xr-x   - cndba supergroup          0 2019-03-07 23:48 /user/hive/warehouse
drwxr-xr-x   - cndba supergroup          0 2019-03-07 23:48 /user/hive/warehouse/cndba.db
[dave@www.cndba.cn ~]$

2 Creating a Table

Syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path]

Example:

hive> create table employees(id int,name string,salary string) row format delimited fields terminated by ',';
2019-03-09 01:10:08,488 INFO  [89328b4b-0873-45ca-b2a5-fb2986043911 main] DataNucleus.Persistence: Request to load fields "comment,name,type" of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored
2019-03-09 01:10:08,488 INFO  [89328b4b-0873-45ca-b2a5-fb2986043911 main] DataNucleus.Persistence: Request to load fields "comment,name,type" of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored
OK
Time taken: 0.071 seconds
hive>

The statement above creates a table named employees; the corresponding table and column information is added to the metadata tables (e.g. TBLS) in MySQL. At the same time, an employees folder is created in HDFS under the current database's directory (here directly under /user/hive/warehouse/, since we are using the default database). All table data (except for external tables) is stored in this directory.

[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /user/hive
drwxr-xr-x   - cndba supergroup          0 2019-03-09 01:10 /user/hive/warehouse
drwxr-xr-x   - cndba supergroup          0 2019-03-07 23:48 /user/hive/warehouse/cndba.db
drwxr-xr-x   - cndba supergroup          0 2019-03-08 21:55 /user/hive/warehouse/dave
drwxr-xr-x   - cndba supergroup          0 2019-03-09 01:10 /user/hive/warehouse/employees
drwxr-xr-x   - cndba supergroup          0 2019-03-07 23:53 /user/hive/warehouse/t_cndba
drwxr-xr-x   - cndba supergroup          0 2019-03-08 21:27 /user/hive/warehouse/tpm_cndba
[dave@www.cndba.cn ~]$

3 Creating a Temporary Table

A temporary table can be used to hold intermediate results:

hive> create table tpm_cndba as select * from t_cndba;
Query ID = cndba_20190308212650_1110d65a-bdbc-43d2-8386-38e12391eb95
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
2019-03-08 21:26:54,059 INFO  [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
2019-03-08 21:26:54,475 INFO  [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
Starting Job = job_1551186730130_0001, Tracking URL = http://hadoopmaster:8088/proxy/application_1551186730130_0001/
Kill Command = /home/cndba/hadoop/bin/mapred job  -kill job_1551186730130_0001
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2019-03-08 21:27:07,893 Stage-1 map = 0%,  reduce = 0%
Ended Job = job_1551186730130_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/.hive-staging_hive_2019-03-08_21-26-50_554_9030156807573657821-1/-ext-10002
Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/tpm_cndba
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 19.886 seconds
hive> show tables;
OK
t_cndba
tpm_cndba
Time taken: 0.021 seconds, Fetched: 2 row(s)
hive>

4 Loading Data with LOAD

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

With LOCAL, the file is loaded from the local file system (the file is copied into HDFS).
Without LOCAL, the file is loaded from HDFS (note: the file is moved, not copied, and its name is not changed).
OVERWRITE controls whether the existing data in the table (or in the specified partition) is overwritten. Without OVERWRITE, the data is simply appended, with no deduplication!

Example:
Load data from the local file system into the table, appending to the existing data:

hive> load data local inpath '/home/cndba/dave.txt' into table employees;
Loading data to table default.employees
OK
Time taken: 0.265 seconds
hive>
hive> select * from employees;
OK
100    Steven    24000.00
101    Neena    17000.00
102    Lex    17000.00
103    Alexander    9000.00
104    Bruce    6000.00
105    David    4800.00
106    Valli    4800.00
107    Diana    4200.00
…

Load data from HDFS into the table, appending records:

hive> load data inpath '/dave/dave.txt' into table employees;
Loading data to table default.employees
OK
Time taken: 0.207 seconds

Load data from HDFS into the table, overwriting the existing data:

hive> load data inpath '/dave/dave.txt'  overwrite  into table employees;
Loading data to table default.employees
OK
Time taken: 0.258 seconds
hive>

One thing to note here: the field delimiter of the text data is specified when the table is created. To load a file that uses a custom delimiter into a table, declare that delimiter in the CREATE TABLE statement and then LOAD DATA into the table, as in the sketch below.
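
For instance, a minimal sketch for a pipe-delimited file (the table name emp_pipe, the file /home/cndba/emp_pipe.txt, and the '|' delimiter are illustrative assumptions, not part of the original example):

hive> create table emp_pipe(id int,name string,salary string) row format delimited fields terminated by '|';
hive> load data local inpath '/home/cndba/emp_pipe.txt' into table emp_pipe;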

5 Inserting Data

5.1 INSERT Syntax

The syntax is as follows:

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement 

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...

Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement

In an INSERT, the FROM clause can be placed either after the SELECT clause or before the INSERT clause; the following two statements are equivalent.

hive> insert overwrite table cn select * from employees;
hive> from employees insert overwrite table cn select * where id > 1;
Query ID = cndba_20190309020852_5bc199f6-0431-43c0-adad-4e91b8b4ae1a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2019-03-09 02:08:53,370 INFO  [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
2019-03-09 02:08:53,403 INFO  [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032
Starting Job = job_1551186730130_0005, Tracking URL = http://hadoopmaster:8088/proxy/application_1551186730130_0005/
Kill Command = /home/cndba/hadoop/bin/mapred job  -kill job_1551186730130_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-03-09 02:09:02,268 Stage-1 map = 0%,  reduce = 0%
2019-03-09 02:09:10,589 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.85 sec
2019-03-09 02:09:19,875 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.99 sec
MapReduce Total cumulative CPU time: 7 seconds 990 msec
Ended Job = job_1551186730130_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/cn/.hive-staging_hive_2019-03-09_02-08-52_882_2438057570131793940-1/-ext-10000
Loading data to table default.cn
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.99 sec   HDFS Read: 17771 HDFS Write: 3611 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 990 msec
OK
Time taken: 28.319 seconds
hive>

For ordinary (non-ACID) tables, Hive is not meant for inserting rows one at a time with INSERT statements, nor for UPDATE operations. Data is loaded into an already-created table with LOAD, and once imported it is not modified in place: you either drop the whole table, or build a new table and load new data into it.

If you do want to insert a single row, there is a workaround: assuming some table B has at least one row, and we want to insert one row into table A(int, string), we can use:

from B insert into table A select 1, 'abc' limit 1;

5.2 WRITE Syntax

The syntax is as follows:

Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

Export query results to a local directory:

hive> insert overwrite local directory '/tmp/dave' select * from employees;

Note that the output goes into the /tmp/dave directory, under which a data file is generated:

[hadoop@hadoopMaster tmp]$ ll /tmp/dave
total 8
-rw-r--r--. 1 hadoop hadoop 8056 Mar  9 21:30 000000_0
[hadoop@hadoopMaster tmp]$ ll /tmp/dave/000000_0 
-rw-r--r--. 1 hadoop hadoop 8056 Mar  9 21:30 /tmp/dave/000000_0
[hadoop@hadoopMaster tmp]$

Export query results to HDFS:

hive> insert overwrite directory '/dave' select * from employees;
[hadoop@Slave1 hadoop]$ hdfs dfs -ls /dave
Found 2 items
-rw-r--r--   2 hadoop supergroup       8056 2019-03-09 21:31 /dave/000000_0
-rw-r--r--   2 hadoop supergroup          0 2019-03-03 01:11 /dave/_SUCCESS
[hadoop@Slave1 hadoop]$

A single source can be inserted into multiple target tables or target files at once; such a multi-target insert can be written as a single statement:


FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2013-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

6 Showing Tables

Show table names:

hive> show tables;
OK
employees
page_view
Time taken: 0.112 seconds, Fetched: 2 row(s)
hive>

Show table names with partial matching; note that the pattern is a regular expression, so a '.' comes before the '*':

hive> show tables 'cn.*';
OK
cndba
Time taken: 0.117 seconds, Fetched: 1 row(s)
hive>

Show all partitions of a table (an error is raised if there are none):

SHOW PARTITIONS page_view;

Show a table's structure:

hive> desc cndba;
OK
id                      int                                         
name                    string                                      
salary                  string                                      
Time taken: 0.138 seconds, Fetched: 3 row(s)
hive>

View a limited number of rows with the LIMIT keyword, just as in Greenplum:

hive> select * from employees limit 3;
OK
100    Steven    24000.00
101    Neena    17000.00
102    Lex    17000.00
Time taken: 0.561 seconds, Fetched: 3 row(s)
hive>

7 External Tables

When Hive creates an internal (managed) table, it moves the data into the path the warehouse points to. When it creates an external table, it only records the path where the data lives and does not move the data at all; the data source can be anywhere. When an external table is dropped, only the metadata is deleted; the data stored in HDFS is left untouched.

[hadoop@hadoopMaster hadoop]$ hdfs dfs -mkdir -p /ext/ahdba
[hadoop@hadoopMaster ~]$ hdfs dfs -put ~/ahdba.txt /ext/ahdba
[hadoop@hadoopMaster hadoop]$ hdfs dfs -ls -R /ext
drwxr-xr-x   - hadoop supergroup          0 2019-03-09 21:52 /ext/ahdba
-rw-r--r--   2 hadoop supergroup       2120 2019-03-09 21:52 /ext/ahdba/ahdba.txt

hive> create external table ahdba(id int,name string,salary string) row format delimited fields terminated by ',' stored as textfile location '/ext/ahdba';
OK
Time taken: 0.266 seconds
hive>
hive> select * from ahdba limit 3;
OK
100    Steven    24000.00
101    Neena    17000.00
102    Lex    17000.00
Time taken: 0.243 seconds, Fetched: 3 row(s)
hive>

8 Partitioned Tables

If the files are large, a partitioned table lets you quickly filter data by the partition column. Suppose the table anqing has two partitions, part1 and part2; in practice this just means two sub-folders named part1 and part2 are created under the anqing folder in HDFS. A partition in Hive is simply an extra directory level. The benefits: easier statistics, higher efficiency, and a smaller data set to scan.

Example:

hive> create table anqing(id int,name string,salary string) partitioned by (part_flag string) row format delimited fields terminated by ',';

Load data into the partitions:

hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='1');
Loading data to table default.anqing partition (part_flag=1)
OK
Time taken: 1.003 seconds
hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='2');
Loading data to table default.anqing partition (part_flag=2)
OK
Time taken: 0.752 seconds
hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='3');
Loading data to table default.anqing partition (part_flag=3)
OK
Time taken: 1.013 seconds
hive>

The resulting directory structure:

[hadoop@Slave1 hadoop]$ hdfs dfs -ls -R /user/hive/warehouse/anqing
drwxr-xr-x   - hadoop supergroup          0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=1
-rw-r--r--   2 hadoop supergroup       2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=1/dave.txt
drwxr-xr-x   - hadoop supergroup          0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=2
-rw-r--r--   2 hadoop supergroup       2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=2/dave.txt
drwxr-xr-x   - hadoop supergroup          0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=3
-rw-r--r--   2 hadoop supergroup       2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=3/dave.txt
[hadoop@Slave1 hadoop]$

hive> select count(1) from anqing where part_flag='1';
Query ID = hadoop_20190309220709_376b5416-1e10-4633-9ebf-aa5ed30eadd0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
2019-03-09 22:07:10,274 INFO  [11ef99e0-fe20-400e-87b6-1006d834fa31 main] client.RMProxy: Connecting to ResourceManager at hadoopMaster/192.168.56.100:8032
2019-03-09 22:07:10,301 INFO  [11ef99e0-fe20-400e-87b6-1006d834fa31 main] client.RMProxy: Connecting to ResourceManager at hadoopMaster/192.168.56.100:8032
Starting Job = job_1552137405153_0007, Tracking URL = http://hadoopMaster:8088/proxy/application_1552137405153_0007/
Kill Command = /home/hadoop/hadoop/bin/mapred job  -kill job_1552137405153_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-03-09 22:07:17,411 Stage-1 map = 0%,  reduce = 0%
2019-03-09 22:07:24,641 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.68 sec
2019-03-09 22:07:29,805 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.54 sec
MapReduce Total cumulative CPU time: 3 seconds 540 msec
Ended Job = job_1552137405153_0007
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.54 sec   HDFS Read: 14697 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 540 msec
OK
106
Time taken: 22.651 seconds, Fetched: 1 row(s)
hive>

Related commands:

SHOW TABLES; # show all tables
SHOW TABLES 'TMP'; # pattern matching is supported
SHOW PARTITIONS TMP_TABLE; # show the table's partitions
DESCRIBE TMP_TABLE; # show the table structure

hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
Time taken: 0.171 seconds, Fetched: 3 row(s)
hive> desc anqing;
OK
id                      int                                         
name                    string                                      
salary                  string                                      
part_flag               string                                      

# Partition Information          
# col_name                data_type               comment             
part_flag               string                                      
Time taken: 0.175 seconds, Fetched: 8 row(s)
hive>

9 Bucketed Tables

Bucketing in Hive corresponds to partitioning in MapReduce, while a Hive partition merely places data into different folders.

9.1 Creating a Bucketed Table

hive> create table emp_buck(id int,name string,salary string) clustered by(id) sorted by(id DESC) into 4 buckets row format delimited fields terminated by ',';
OK
Time taken: 0.192 seconds
hive>
Meaning: bucket rows by the id column, keep each bucket locally sorted by id, and use 4 buckets. Creating the buckets does not process any data; it only assumes that the data inserted later has been bucketed accordingly.

9.2 Inserting Data into a Bucketed Table

LOAD is generally not used to populate a bucketed table, because loaded data is not automatically split into the four bucket files according to the bucketing rule. Instead, a SELECT-based INSERT is used to write into a bucketed table.

Set the variables: enable bucketing, and set the number of reducers to the number of buckets:

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;

hive> insert overwrite table emp_buck select * from ahdba cluster by(id);
or:
insert overwrite table emp_buck select * from ahdba distribute by(id) sort by(id asc);

Here, distribute by(id) sort by(id asc) is equivalent to cluster by(id); cluster by(id) = bucketing + sorting.
Rows are first distributed to reducers and then locally sorted. The difference is that distribute by is more flexible: you can distribute by one column and sort by another, as in the sketch below.
The subquery's output is written as 4 files, which become the input of the outer INSERT.
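
A hedged sketch of that flexibility, reusing the emp_buck and ahdba tables from the examples above: distribute rows to reducers by id, but sort each reducer's output by a different column (salary):

set mapreduce.job.reduces=4;
insert overwrite table emp_buck select * from ahdba distribute by (id) sort by (salary desc);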

9.3 Principles and Uses of Bucketed Tables

How it works:

Hive organizes buckets around a specific column. It hashes the column's value and takes the remainder of dividing by the number of buckets to decide which bucket the record is stored in (the same principle as MapReduce's getPartition method).
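
As an illustrative sketch of this rule (assuming the emp_buck table from 9.1 with 4 buckets on id; hash() and pmod() are Hive built-in functions):

-- bucket index = pmod(hash(bucketing column), number of buckets)
select id, pmod(hash(id), 4) as bucket_idx from emp_buck limit 5;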

What it is used for:

(1) The biggest benefit is improving the efficiency of join operations,
provided both tables are bucketed and their bucket counts are equal or one is a multiple of the other.
For example: select a.id,a.name,b.addr from a join b on a.id = b.id;
If tables a and b are already bucketed, and the bucketing column is id,
then for a JOIN on a column the two tables share, with both tables bucketed on it, it is enough to join only the buckets that hold the same column values, which greatly reduces the amount of data joined (see the sketch after this list).
(2) Sampling is more efficient. When working with large data sets, being able to try out a query on a small subset of the data is a great convenience while developing and refining queries.
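
A hedged sketch of both uses (tables a and b are the bucketed tables from the join example above, emp_buck is the 4-bucket table from 9.1, and hive.optimize.bucketmapjoin is the standard Hive setting for bucketed map joins):

-- bucketed join: let Hive join matching buckets instead of whole tables
set hive.optimize.bucketmapjoin = true;
select a.id, a.name, b.addr from a join b on a.id = b.id;

-- sampling: read only bucket 1 out of emp_buck's 4 buckets
select * from emp_buck tablesample(bucket 1 out of 4 on id);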

10 Creating Views

CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...
Note: views are read-only; LOAD/INSERT/ALTER cannot be used against them.

hive> create view v_ahdba as select * from ahdba;
OK
Time taken: 0.43 seconds
hive> select * from v_ahdba limit 3;
OK
100    Steven    24000.00
101    Neena    17000.00
102    Lex    17000.00
Time taken: 0.21 seconds, Fetched: 3 row(s)
hive>

11 Alter Table

11.1 Adding Partitions

ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
partition_spec:
  : PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

Example:
hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
Time taken: 0.175 seconds, Fetched: 3 row(s)
hive>

hive> ALTER TABLE anqing ADD 
    > PARTITION (part_flag='4') location '/user/hive/warehouse/anqing/part_flag=4' 
    > PARTITION (part_flag='5') location '/user/hive/warehouse/anqing/part_flag=5' ;
OK
Time taken: 0.422 seconds
hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
part_flag=4
part_flag=5
Time taken: 0.317 seconds, Fetched: 5 row(s)
hive>

Dropping Partitions

ALTER TABLE table_name DROP partition_spec, partition_spec,...
 
Example:
hive> ALTER TABLE anqing DROP PARTITION (part_flag='4');
Dropped the partition part_flag=4
OK
Time taken: 0.451 seconds
hive> show partitions anqing;
OK
part_flag=1
part_flag=2
part_flag=3
part_flag=5
Time taken: 0.144 seconds, Fetched: 4 row(s)
hive>

11.2 Renaming a Table

hive> show tables;
OK
ahdba
anqing
cndba
emp_buck
employees
page_view
v_ahdba
Time taken: 0.046 seconds, Fetched: 7 row(s)
hive> alter table ahdba rename to huaining;
OK
Time taken: 0.187 seconds
hive> show tables;
OK
anqing
cndba
emp_buck
employees
huaining
page_view
v_ahdba
Time taken: 0.055 seconds, Fetched: 7 row(s)
hive>

This command lets you rename a table. The location of the data and the partition names are not changed; in other words, the old name is not "released", and changes made through the old table's data location will affect the new table's data. (Note that for a managed table, newer Hive versions also move the HDFS directory during a rename; the table here is external, so its data stays at /ext/ahdba.)
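
To verify, you can check the table's location after the rename (a hedged sketch; describe formatted is a standard Hive command):

hive> describe formatted huaining;
-- the Location field still points to the original /ext/ahdba directory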


11.3 Changing Columns/Properties

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
This command lets you change a column's name, data type, comment, position, or any combination of these.
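
A hedged example against the employees table from section 2 (the new column name base_salary and the comment are illustrative assumptions; only the metadata changes, the data files are not rewritten):

ALTER TABLE employees CHANGE COLUMN salary base_salary string COMMENT 'monthly salary';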

11.4 Adding/Replacing Columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], …)
ADD adds a new column, which is placed after all existing columns (but before the partition columns); REPLACE replaces all of the table's columns.
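
For example, a small sketch (the dept column is an illustrative assumption):

ALTER TABLE employees ADD COLUMNS (dept string COMMENT 'department name');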

12 Ways to Save SELECT Query Results

1. Save the query result into a new Hive table:

create table t_tmp
as
select * from t_p;

2. Save the query result into an existing Hive table:

insert into table t_tmp
select * from t_p;

3. Save the query result into a specified directory (either local or HDFS):

insert overwrite local directory '/home/hadoop/test'
select * from t_p;

Write to HDFS:

insert overwrite directory '/aaa/test'
select * from t_p;

13 Querying/Deleting Data

13.1 Querying

13.1.1 Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list]
[   CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

Notes:

CLUSTER BY column: rows are distributed by this column; note that the number of reducers (reduce_num) must be set appropriately.
ORDER BY performs a global sort over the input, so there is only one reducer, which leads to long run times when the input is large.
SORT BY is not a global sort; it sorts the data before it enters each reducer. So with sort by and mapred.reduce.tasks > 1, only each reducer's output is sorted, not the overall result.
DISTRIBUTE BY(column) sends rows to different reducers according to the given column, using hash distribution.
CLUSTER BY(column) does everything DISTRIBUTE BY does and additionally sorts by that column.

When the bucketing column and the sort column are the same, cluster by = distribute by + sort by, as in the sketch below.
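
A minimal sketch of this equivalence, reusing the emp_buck table from section 9:

select * from emp_buck distribute by (id) sort by (id);
-- is equivalent to
select * from emp_buck cluster by (id);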

Querying an internal (managed) table:

select * from inner_table;
select count(*) from inner_table;

Dropping an internal (managed) table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata and keeps the data.
Syntax:

DROP TABLE tbl_name

13.1.2 GROUP BY

groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

13.1.3 Order/Sort By

ORDER BY syntax:
colOrder: ( ASC | DESC )
orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy

SORT BY syntax: the sort order depends on the column type. If the column is numeric, it is sorted in numeric order; if it is a string, it is sorted in lexicographic order.
colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy

13.2 Limit/Top/REGEX

LIMIT restricts the number of records a query returns, and the returned rows are chosen arbitrarily. The following query returns 5 arbitrary records from table t1:

SELECT * FROM t1 LIMIT 5

The following query returns the 5 sales representatives with the largest sales amounts.

SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5

A SELECT statement can use a regular expression to select columns. The following query returns all columns except ds and hr:

SELECT (ds|hr)?+.+ FROM sales

14 Joins in Hive

14.1 Syntax

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression

Hive supports only equality joins, outer joins, and left/right joins. Hive does not support non-equality join conditions, because they are very hard to translate into map/reduce jobs. Hive does support joins of more than 2 tables.

14.2 Only equality joins are supported

For example:

SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

14.3 More than 2 tables can be joined

For example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)


If the join keys of all the tables in the join are the same, the join is converted into a single map/reduce job, for example:

SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c
ON (c.key = b.key1)
  is converted into a single map/reduce job, because the join uses only b.key1 as the join key.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
whereas this join is converted into 2 map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.

14.4 The logic of each map/reduce job during a join

The reducer caches the records of every table in the join sequence except the last one, and then streams the last table to serialize the result to the file system. This implementation helps reduce memory usage on the reduce side. In practice, the largest table should be written last (otherwise a lot of memory is wasted on caching). For example:

SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

All tables use the same join key (computed with 1 map/reduce job). The reduce side caches the records of tables a and b, and then computes the join result once for every record fetched from table c. Similarly:

SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

This uses 2 map/reduce jobs: the first caches table a and streams table b; the second caches the result of the first job and streams table c.

15 User-Defined Functions in Hive

When Hive's built-in functions cannot meet your processing needs, you can consider user-defined functions (UDF: user-defined function).
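
For example, a hedged sketch of registering and calling a UDF packaged in a jar (the jar path, the Java class name, and the function name my_upper are illustrative assumptions):

hive> add jar /home/cndba/udf/my_udfs.jar;
hive> create temporary function my_upper as 'com.cndba.hive.MyUpperUDF';
hive> select my_upper(name) from employees limit 3;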

15.1 UDF

  Basic commands:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;

15.2 UDTF

UDTFs are built-in table-generating functions. There are some restrictions on using these UDTF functions:

1. The SELECT cannot contain other columns (see the LATERAL VIEW sketch after this list), e.g.:
SELECT pageid, explode(adid_list) AS myCol…
2. They cannot be nested, e.g.:
SELECT explode(explode(adid_list)) AS myCol… # not supported
3. GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported, e.g.:
SELECT explode(adid_list) AS myCol … GROUP BY myCol
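
The usual way around the first restriction is LATERAL VIEW. A hedged sketch, assuming a pageAds table with columns pageid and adid_list (an array column), the names used in the restriction examples above:

SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;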

15.3 EXPLODE 

Scenario: transposing data, for example:

create table test2(mycol array<int>);
insert OVERWRITE table test2 select * from (select array(1,2,3) from a union all select array(7,8,9)  from d)c;
hive> select * from test2;
OK
[1,2,3]
[7,8,9]
hive> SELECT explode(myCol) AS myNewCol FROM test2;
OK