TiSpark 混合部署
其他部署方案参考: https://pingcap.com/docs-cn/tispark/tispark-user-guide/
因为tidb-ansible 部署安装TiDB集群 inventoy.ini并未指定spark配置,所以TiDB集群环境并未安装Spark集群,当前采用
[spark_master]
[spark_slaves]
1、已有 Spark 集群的部署方式
如果在已有 Spark 集群上运行 TiSpark,无需重启集群。可以使用 Spark 的 --jars 参数将 TiSpark 作为依赖引入:
spark-shell --jars $TISPARK_FOLDER/tispark-core-${version}-SNAPSHOT-jar-with-dependencies.jar
2、没有 Spark 集群的部署方式
如果没有使用中的 Spark 集群,推荐使用 Saprk Standalone 方式部署。这里简单介绍下 Standalone 部署方式。如果遇到问题,可以去官网寻求帮助;也欢迎在我们的 GitHub 上提 issue。
2.1、下载安装包并安装
你可以在这里下载 Apache Spark https://www.apache.org/dyn/closer.lua/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz。 注意spark版本说明
对于 Standalone 模式且无需 Hadoop 支持,则选择 Spark 2.3.x 且带有 Hadoop 依赖的 Pre-build with Apache Hadoop 2.x 任意版本。如有需要配合使用的 Hadoop 集群,则选择对应的 Hadoop 版本号。你也可以选择从源代码自行构建以配合官方 Hadoop 2.x 之前的版本。
tidb用户
$ mkdir -p ${deploy_dir}/spark && cd ${deploy_dir}/spark
$ export $SPARKPATH=${deploy_dir}/spark/spark-2.3.3-bin-hadoop2.7
$ tar -xvf spark-2.3.3-bin-hadoop2.7.tgz
如果你已经有了 Spark 二进制文件,并且当前 PATH 为 SPARKPATH,需将 TiSpark jar 包拷贝到 `${SPARKPATH}/jars` 目录下。
$ cd ${tidb-ansible}/resource/bin
$ cp tispark-SNAPSHOT-jar-with-dependencies.jar ${deploy_dir}/spark/spark-2.3.3-bin-hadoop2.7/jars/
3、启动 Master
在选中的 Spark Master 节点执行如下命令:
$ cd $SPARKPATH
$ ./sbin/start-master.sh
在这步完成以后,屏幕上会打印出一个 log 文件。检查 log 文件确认 Spark-Master 是否启动成功。你可以打开 http://spark-master-hostname:8080 查看集群信息(如果你没有改动 Spark-Master 默认 Port Numebr)。在启动 Spark-Slave 的时候,也可以通过这个面板来确认 Slave 是否已经加入集群。
4、启动 Slave
类似地,可以用如下命令启动 Spark-Slave 节点:指定spark-master-hostname 主机IP
$ cd $SPARKPATH
$ ./sbin/start-slave.sh spark://spark-master-hostname:7077
命令返回以后,即可通过刚才的面板查看这个 Slave 是否已经正确地加入了 Spark 集群。在所有 Slave 节点重复刚才的命令。确认所有的 Slave 都可以正确连接 Master,这样你就拥有了一个 Standalone 模式的 Spark 集群。
5、样例查询
5.1、部署信息
Spark 部署在 TiDB 实例部署目录下 spark 目录中
TiSpark jar 包默认部署在 Spark 部署目录 jars 文件夹下:
spark/spark-2.3.3-bin-hadoop2.7/jars/tispark-SNAPSHOT-jar-with-dependencies.jar
TiSpark sample data 及导入脚本默认部署在 TiDB-Ansible 目录下:
tidb-ansible/resources/bin/tispark-sample-data
5.2、环境准备
在 TiDB 实例上安装 JDK
在 Oracle JDK 官方下载页面 下载 JDK 1.8 当前最新版,本示例中下载的版本为 jdk-8u141-linux-x64.tar.gz。 https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html
解压并根据您的 JDK 部署目录设置环境变量, 编辑 ~/.bashrc 文件,比如:
export JAVA_HOME=/home/pingcap/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
验证 JDK 有效性:
$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
6、导入样例数据
假设 TiDB 集群已启动,其中一台 TiDB 实例服务 IP 为 192.168.0.2,端口为 4000,用户名为 root, 密码为空。
cd tidb-ansible/resources/bin/tispark-sample-data
修改 sample_data.sh 中 TiDB 登录信息,比如:
mysql -h 192.168.0.2 -P 4000 -u root < dss.ddl
执行脚本
./sample_data.sh
执行脚本的机器上需要安装 MySQL client,CentOS 用户可通过
yum -y install mysql来安装。
登录 TiDB 并验证数据包含 TPCH_001 库及以下表:
$ mysql -uroot -P4000 -h192.168.0.2
MySQL [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| INFORMATION_SCHEMA |
| PERFORMANCE_SCHEMA |
| TPCH_001 |
| mysql |
| test |
+--------------------+
5 rows in set (0.00 sec)
MySQL [(none)]> use TPCH_001
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MySQL [TPCH_001]> show tables;
+--------------------+
| Tables_in_TPCH_001 |
+--------------------+
| CUSTOMER |
| LINEITEM |
| NATION |
| ORDERS |
| PART |
| PARTSUPP |
| REGION |
| SUPPLIER |
+--------------------+
8 rows in set (0.00 sec)
7、样例使用
假设你已经按照上述步骤成功启动了 TiSpark 集群,下面简单介绍如何使用 Spark SQL 来做 OLAP 分析。这里我们用名为 tpch 数据库中的 lineitem 表作为范例。
假设你的 PD 节点位于 172.16.30.87,端口为 2479,在$SPARKPATH/conf/spark-defaults.conf加入
$ cd ${deploy_dir}/spark/spark-2.3.3-bin-hadoop2.7/conf
$ cp spark-defaults.conf.template spark-defaults.conf
$ vi spark-defaults.conf
spark.tispark.pd.addresses 172.16.30.87:2479,172.16.30.88:2479,172.16.30.89:2479
spark.sql.extensions org.apache.spark.sql.TiExtensions
编辑spark环境变量
$ cd ${deploy_dir}/spark/spark-2.3.3-bin-hadoop2.7/conf
$ cp spark-env.sh.template spark-env.sh
$ mv spark-env.sh.template spark-env.sh.template.bak
$ vi spark-env.sh 加入如下几行:
export SPARK_MASTER_IP=172.16.30.86
export SPARK_LOCAL_IP=172.16.30.86
export JAVA_HOME=/usr/java/jre1.8.0_181-amd64
进入 spark 部署目录启动 spark-shell:
$ cd ${deploy_dir}/spark/spark-2.3.3-bin-hadoop2.7/
[tidb@ip-172-16-30-86 spark-2.3.3-bin-hadoop2.7]$ bin/spark-shell
2019-02-27 17:00:56 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://172.16.30.86:4040
Spark context available as 'sc' (master = spark://172.16.30.86:7077, app id = app-20190227170101-0006).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_/ // _ // _ `/ __/ '_/
/___/ .__//_,_/_/ /_//_/ version 2.3.3
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
然后像使用原生 Spark 一样查询 TiDB 表:
scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
| default|
| marvin|
| dminc|
| dmmysql|
| test|
| tpch_001|
| tidb|
| dm_meta|
| dmtidb|
| bikeshare|
| mysql|
| bike|
+------------+
scala> spark.sql("use tpch_001").show()
2019-02-27 17:04:59 WARN ObjectStore:568 - Failed to get database tpch_001, returning NoSuchObjectException
++
||
++
++
scala> spark.sql("select count(*) from lineitem").show
2019-02-27 17:05:02 WARN ObjectStore:568 - Failed to get database tpch_001, returning NoSuchObjectException
2019-02-27 17:05:02 WARN ObjectStore:568 - Failed to get database tpch_001, returning NoSuchObjectException
+--------+
|count(1)|
+--------+
| 60175|
+--------+
复杂查询
scala> spark.sql("""select
| l_returnflag,
| l_linestatus,
| sum(l_quantity) as sum_qty,
| sum(l_extendedprice) as sum_base_price,
| sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
| sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
| avg(l_quantity) as avg_qty,
| avg(l_extendedprice) as avg_price,
| avg(l_discount) as avg_disc,
| count(*) as count_order
| from
| lineitem
| where
| l_shipdate <= date '1998-12-01' - interval '90' day
| group by
| l_returnflag,
| l_linestatus
| order by
| l_returnflag,
| l_linestatus""".stripMargin).show
2019-02-27 17:10:50 WARN ObjectStore:568 - Failed to get database tpch_001, returning NoSuchObjectException
2019-02-27 17:10:50 WARN ObjectStore:568 - Failed to get database tpch_001, returning NoSuchObjectException
+------------+------------+---------+--------------+--------------+-----------------+---------+------------+--------+-----------+
|l_returnflag|l_linestatus| sum_qty|sum_base_price|sum_disc_price| sum_charge| avg_qty| avg_price|avg_disc|count_order|
+------------+------------+---------+--------------+--------------+-----------------+---------+------------+--------+-----------+
| A| F|380456.00| 532348211.65|505822441.4861| 526165934.000839|25.575155|35785.709307|0.050081| 14876|
| N| F| 8971.00| 12384801.37| 11798257.2080| 12282485.056933|25.778736|35588.509684|0.047759| 348|
| N| O|742802.00| 1041502841.45|989737518.6346|1029418531.523350|25.454988|35691.129209|0.049931| 29181|
| R| F|381449.00| 534594445.35|507996454.4067| 528524219.358903|25.597168|35874.006533|0.049828| 14902|
+------------+------------+---------+--------------+--------------+-----------------+---------+------------+--------+-----------+
TiDB复杂查询
mysql> select
-> l_returnflag,
-> l_linestatus,
-> sum(l_quantity) as sum_qty,
-> sum(l_extendedprice) as sum_base_price,
-> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
-> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
-> avg(l_quantity) as avg_qty,
-> avg(l_extendedprice) as avg_price,
-> avg(l_discount) as avg_disc,
-> count(*) as count_order
-> from
-> lineitem
-> where
-> l_shipdate <= date '1998-12-01' - interval '90' day
-> group by
-> l_returnflag,
-> l_linestatus
-> order by
-> l_returnflag,
-> l_linestatus
-> ;
+--------------+--------------+-----------+----------------+----------------+-------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+-----------+----------------+----------------+-------------------+-----------+--------------+----------+-------------+
| A | F | 380456.00 | 532348211.65 | 505822441.4861 | 526165934.000839 | 25.575155 | 35785.709307 | 0.050081 | 14876 |
| N | F | 8971.00 | 12384801.37 | 11798257.2080 | 12282485.056933 | 25.778736 | 35588.509684 | 0.047759 | 348 |
| N | O | 742802.00 | 1041502841.45 | 989737518.6346 | 1029418531.523350 | 25.454988 | 35691.129209 | 0.049931 | 29181 |
| R | F | 381449.00 | 534594445.35 | 507996454.4067 | 528524219.358903 | 25.597168 | 35874.006533 | 0.049828 | 14902 |
+--------------+--------------+-----------+----------------+----------------+-------------------+-----------+--------------+----------+-------------+
4 rows in set (0.28 sec)
版权声明:本文为博主原创文章,未经博主允许不得转载。
- 上一篇:DM 合库合表问题记录实践
- 下一篇:Tidb percolator事务模型简概



