Hive uses the MapReduce engine by default. This engine is relatively inefficient; switching to the Spark engine improves query performance considerably. The official CDH manual includes configuration instructions for Hive on Spark at the following link:
https://www.cloudera.com/documentation/enterprise/5-16-x/topics/admin_hos_oview.html
1 Add the Spark Service
Add the Spark service to the CDH cluster. Note that two Spark entries are offered here; the difference between them is explained clearly later on. The screenshots below show the steps:
2 Modify the Hive Configuration
In the Hive configuration, enable Spark On YARN, as shown below:
Also in the configuration, change Hive's default execution engine from MapReduce to Spark:
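Besides the cluster-wide setting in Cloudera Manager, the engine can also be switched for a single Hive session; a minimal sketch using the standard Hive property (useful for comparing the two engines before changing the default):

```sql
-- Use Spark as the execution engine for the current session only
SET hive.execution.engine=spark;

-- Revert to MapReduce for this session if needed
SET hive.execution.engine=mr;

-- Display the current value
SET hive.execution.engine;
```

A session-level `SET` overrides the default configured in Cloudera Manager only until the session ends.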
3 Add a Spark Gateway to HiveServer2
In the Spark configuration, add the Spark Gateway role to the host running HiveServer2. This choice is actually offered when the Spark service is first installed, so if it was configured then, skip this step:
4 Restart All Stale Services
Return to the CDH home page and restart all stale services:
5 Verification
Run a query in Hive; the output should include Starting Spark Job:
[dave@www.cndba.cn ~]# hive
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/jars/hive-common-1.1.0-cdh5.16.1.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> select * from default.cndba;
OK
1 http://www.cndba.cn
2 中国DBA社区
3 Oracle
4 Hadoop
Time taken: 3.42 seconds, Fetched: 4 row(s)
hive> select * from default.cndba;
OK
1 http://www.cndba.cn
2 中国DBA社区
3 Oracle
4 Hadoop
Time taken: 0.177 seconds, Fetched: 4 row(s)
hive> select count(1) from default.cndba;
Query ID = root_20281105001414_486f5d2d-cbae-41e9-a643-51a2ddb55d12
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Spark Job = 2ba7b7f9-722c-4202-afef-18f654639d61
Running with YARN Application = application_1856966558053_0001
Kill Command = /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop/bin/yarn application -kill application_1856966558053_0001
Query Hive on Spark job[0] stages:
0
1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2028-11-05 00:15:26,194 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1
2028-11-05 00:15:29,230 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1
2028-11-05 00:15:31,267 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2028-11-05 00:15:32,283 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Status: Finished successfully in 21.16 seconds
OK
4
Time taken: 44.595 seconds, Fetched: 1 row(s)
hive>
The job information can also be viewed in the web UI of the YARN ResourceManager:
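The same information is also available from the command line via the standard YARN CLI; a sketch (the application ID below is the one reported in the Hive output above):

```shell
# List YARN applications; Hive on Spark jobs show up
# with application type SPARK
yarn application -list

# Inspect the status of a specific application
yarn application -status application_1856966558053_0001
```

These commands must be run on a host with the YARN client configuration deployed (any gateway node of the cluster).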
Copyright notice: this is an original post by the author and may not be reproduced without the author's permission.