
Configuring Spark SQL to access Hive data

Published: 2015-07-15   Author: daizj   Source: reposted

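Spark SQL can access Hive data through the Thrift server. The default Spark build does not support Hive: Hive brings in a large number of dependencies, so the default package leaves out both Hive and the Thrift server. To access Hive you therefore have to download the source and build Spark yourself with Hive and the Thrift server included. The detailed steps are as follows: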

 

1. Download the source code

2. Download Maven and configure it

This setup is straightforward, so it is skipped here.

 

3. Build the package with Maven:

Build command:

mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.0 -Phive -Phive-thriftserver -DskipTests clean package

 

Set hadoop.version in the command above to whichever version you need.
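The build command above can be wrapped so that the Hadoop version becomes a parameter. This is a minimal sketch and not part of the original write-up: `build_spark` and the `MVN` override variable are hypothetical names introduced here, while the Maven flags are exactly those of the command above.

```shell
# build_spark [HADOOP_VERSION]
# Runs the Maven build from step 3 for one Hadoop version.
# MVN is a hypothetical override for the Maven executable (defaults to "mvn").
build_spark() {
  local hadoop_version="${1:-2.3.0-cdh5.0.0}"
  "${MVN:-mvn}" -Pyarn "-Dhadoop.version=${hadoop_version}" \
    -Phive -Phive-thriftserver -DskipTests clean package
}

# Usage, from the Spark source root:
#   build_spark 2.3.0-cdh5.0.0
```

Depending on the Hadoop line, Spark's own build documentation also pairs the version with a matching -Phadoop-x.y profile, so it is worth checking that page for the version you target.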

 

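Note: when building on Windows, the build needs a lot of memory, so you have to raise the memory settings Maven runs with.

Go to the MAVEN_HOME/bin/ directory, open mvn.bat, and find the line below, which is commented out: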

 

@REM set MAVEN_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000

 

 

Then add the following on the next line and save the file:

set MAVEN_OPTS= -Xms1024m -Xmx1024m -XX:MaxPermSize=512m 
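On Linux or macOS there is no mvn.bat to edit; the same settings can be exported in the shell session before running Maven, since Maven reads the MAVEN_OPTS environment variable. A small sketch using the same values as the line above (they are this article's values, not a tuned recommendation):

```shell
# Same JVM settings as the mvn.bat line above, for a POSIX shell session.
export MAVEN_OPTS="-Xms1024m -Xmx1024m -XX:MaxPermSize=512m"

# Show what Maven will pick up.
echo "MAVEN_OPTS=$MAVEN_OPTS"
```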

 

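The build may fail partway through, typically because the network dropped while Maven was downloading dependencies; rerunning it a few times should get it through.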
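Rerunning by hand can be replaced with a small retry loop. The sketch below is an illustration only; `retry` is a hypothetical helper name, and it simply reruns whatever command it is given until that command succeeds or the attempt limit is reached.

```shell
# retry MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD until it succeeds, giving up after MAX_ATTEMPTS failures.
retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    echo "command failed, retrying (attempt $attempt of $max_attempts)" >&2
  done
}

# Usage, from the Spark source root:
#   retry 5 mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.0 -Phive -Phive-thriftserver -DskipTests clean package
```

A retry only helps with transient failures such as interrupted downloads; a genuine compile error will fail the same way on every attempt.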

 

 

4. Replace the spark-assembly*.jar file in the SPARK_HOME/lib directory with SPARK_SRC_HOME\assembly\target\scala-2.10\spark-assembly-1.4.0-hadoop2.3.0-cdh5.0.0.jar

SPARK_SRC_HOME: the Spark source directory

SPARK_HOME: the Spark installation directory
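The replacement in step 4 can be scripted so that the old jar is kept as a backup. This is a hedged sketch: `replace_assembly` and the `lib-backup` directory are hypothetical names chosen here. The old jar is moved out of lib rather than renamed in place, on the assumption that the launch scripts expect exactly one spark-assembly jar in that directory.

```shell
# replace_assembly NEW_JAR SPARK_HOME
# Moves any existing spark-assembly*.jar out of SPARK_HOME/lib into
# SPARK_HOME/lib-backup (hypothetical location), then copies NEW_JAR in.
replace_assembly() {
  local new_jar="$1"
  local lib_dir="$2/lib"
  local backup_dir="$2/lib-backup"
  local old

  if [ ! -f "$new_jar" ]; then
    echo "no such jar: $new_jar" >&2
    return 1
  fi

  mkdir -p "$backup_dir" || return 1
  for old in "$lib_dir"/spark-assembly*.jar; do
    # The glob stays literal when nothing matches, so check before moving.
    if [ -e "$old" ]; then
      mv "$old" "$backup_dir/" || return 1
    fi
  done

  cp "$new_jar" "$lib_dir/"
}

# Usage:
#   replace_assembly "$SPARK_SRC_HOME/assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.3.0-cdh5.0.0.jar" "$SPARK_HOME"
```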

 

5. cp HIVE_HOME/conf/hive-site.xml SPARK_HOME/conf/

This copies hive-site.xml from the Hive installation directory into Spark's conf directory.
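The copy in step 5 can be wrapped with a check that the source file exists. The sketch below is illustrative; `install_hive_site` is a hypothetical helper name. After copying, it also reports whether hive.metastore.uris appears in the file, which is informational only and does not change the configuration.

```shell
# install_hive_site HIVE_HOME SPARK_HOME
# Copies hive-site.xml into Spark's conf directory and reports whether
# a remote metastore URI property is present in it.
install_hive_site() {
  local src="$1/conf/hive-site.xml"
  local dest="$2/conf/hive-site.xml"

  if [ ! -f "$src" ]; then
    echo "missing $src" >&2
    return 1
  fi

  cp "$src" "$dest" || return 1

  # Informational only: nothing is modified here.
  if grep -q '<name>hive.metastore.uris</name>' "$dest"; then
    echo "hive.metastore.uris is set in $dest"
  else
    echo "hive.metastore.uris is not set in $dest"
  fi
}

# Usage:
#   install_hive_site "$HIVE_HOME" "$SPARK_HOME"
```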

 

6. Go to SPARK_HOME/bin/ and run ./spark-sql --master spark://masterIp:7077

 

 

7. Query data from a Hive table

 

 

         > select * from test2;

15/07/16 14:07:13 INFO ParseDriver: Parsing command: select * from test2

15/07/16 14:07:13 INFO ParseDriver: Parse Completed

15/07/16 14:07:20 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.209.140:51413 in memory (size: 1671.0 B, free: 267.3 MB)

15/07/16 14:07:20 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.209.141:35827 in memory (size: 1671.0 B, free: 267.3 MB)

15/07/16 14:07:21 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

15/07/16 14:07:21 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

15/07/16 14:07:22 INFO MemoryStore: ensureFreeSpace(377216) called with curMem=0, maxMem=280248975

15/07/16 14:07:22 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 368.4 KB, free 266.9 MB)

15/07/16 14:07:22 INFO MemoryStore: ensureFreeSpace(32203) called with curMem=377216, maxMem=280248975

15/07/16 14:07:22 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 31.4 KB, free 266.9 MB)

15/07/16 14:07:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.209.141:35827 (size: 31.4 KB, free: 267.2 MB)

15/07/16 14:07:22 INFO SparkContext: Created broadcast 1 from processCmd at CliDriver.java:423

15/07/16 14:07:23 INFO FileInputFormat: Total input paths to process : 1

15/07/16 14:07:23 INFO SparkContext: Starting job: processCmd at CliDriver.java:423

15/07/16 14:07:23 INFO DAGScheduler: Got job 1 (processCmd at CliDriver.java:423) with 2 output partitions (allowLocal=false)

15/07/16 14:07:23 INFO DAGScheduler: Final stage: ResultStage 1(processCmd at CliDriver.java:423)

15/07/16 14:07:23 INFO DAGScheduler: Parents of final stage: List()

15/07/16 14:07:23 INFO DAGScheduler: Missing parents: List()

15/07/16 14:07:23 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at processCmd at CliDriver.java:423), which has no missing parents

15/07/16 14:07:23 INFO MemoryStore: ensureFreeSpace(7752) called with curMem=409419, maxMem=280248975

15/07/16 14:07:23 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 7.6 KB, free 266.9 MB)

15/07/16 14:07:23 INFO MemoryStore: ensureFreeSpace(4197) called with curMem=417171, maxMem=280248975

15/07/16 14:07:23 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.1 KB, free 266.9 MB)

15/07/16 14:07:23 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.209.141:35827 (size: 4.1 KB, free: 267.2 MB)

15/07/16 14:07:23 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874

15/07/16 14:07:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at processCmd at CliDriver.java:423)

15/07/16 14:07:23 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks

15/07/16 14:07:24 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 192.168.209.139, ANY, 1435 bytes)

15/07/16 14:07:24 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, 192.168.209.140, ANY, 1435 bytes)

15/07/16 14:07:24 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.209.140:51413 (size: 4.1 KB, free: 267.3 MB)

15/07/16 14:07:24 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.209.140:51413 (size: 31.4 KB, free: 267.2 MB)

15/07/16 14:07:25 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.209.139:44770 (size: 4.1 KB, free: 267.3 MB)

15/07/16 14:07:30 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.209.139:44770 (size: 31.4 KB, free: 267.2 MB)

15/07/16 14:07:34 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 10767 ms on 192.168.209.140 (1/2)

15/07/16 14:07:39 INFO DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:423) finished in 15.284 s

15/07/16 14:07:39 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@9aa2a1b

15/07/16 14:07:39 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:423, took 15.414215 s

15/07/16 14:07:39 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 15287 ms on 192.168.209.139 (2/2)

15/07/16 14:07:39 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 

15/07/16 14:07:39 INFO StatsReportListener: task runtime:(count: 2, mean: 13027.000000, stdev: 2260.000000, max: 15287.000000, min: 10767.000000)

15/07/16 14:07:39 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%       95%     100%

15/07/16 14:07:39 INFO StatsReportListener:     10.8 s  10.8 s  10.8 s  10.8 s  15.3 s  15.3 s  15.3 s    15.3 s  15.3 s

15/07/16 14:07:39 INFO StatsReportListener: task result size:(count: 2, mean: 15663.000000, stdev: 15.000000, max: 15678.000000, min: 15648.000000)

15/07/16 14:07:39 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%       95%     100%

15/07/16 14:07:39 INFO StatsReportListener:     15.3 KB 15.3 KB 15.3 KB 15.3 KB 15.3 KB 15.3 KB 15.3 KB   15.3 KB 15.3 KB

15/07/16 14:07:39 INFO StatsReportListener: executor (non-fetch) time pct: (count: 2, mean: 72.195332, stdev: 18.424154, max: 90.619485, min: 53.771178)

15/07/16 14:07:39 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%       95%     100%

15/07/16 14:07:39 INFO StatsReportListener:     54 %    54 %    54 %    54 %    91 %    91 %    91 %      91 %    91 %

15/07/16 14:07:39 INFO StatsReportListener: other time pct: (count: 2, mean: 27.804668, stdev: 18.424154, max: 46.228822, min: 9.380515)

15/07/16 14:07:39 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%       95%     100%

15/07/16 14:07:39 INFO StatsReportListener:      9 %     9 %     9 %     9 %    46 %    46 %    46 %      46 %    46 %

111

222

333

444

555

666

777

888

999111

222

333

444

555

666

777

888

999111

............

 

Time taken: 26.523 seconds, Fetched 889 row(s)

15/07/16 14:07:39 INFO CliDriver: Time taken: 26.523 seconds, Fetched 889 row(s)

spark-sql>    
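The same query can also be run without the interactive prompt by passing it with the CLI's -e option. The sketch below is an illustration under stated assumptions: `run_hive_query`, `SPARK_SQL`, and `MASTER_URL` are hypothetical names introduced here, the defaults mirror the command from step 6, and it assumes log output goes to stderr (as with Spark's default logging setup) so that only result rows reach stdout.

```shell
# run_hive_query SQL
# Runs one statement through the spark-sql CLI and exits.
# SPARK_SQL and MASTER_URL are hypothetical overrides for this sketch.
# Log lines (assumed to be on stderr) are appended to spark-sql.log.
run_hive_query() {
  "${SPARK_SQL:-./spark-sql}" \
    --master "${MASTER_URL:-spark://masterIp:7077}" \
    -e "$1" 2>>spark-sql.log
}

# Usage, from SPARK_HOME/bin:
#   run_hive_query 'select * from test2 limit 10' > rows.txt
```

Keeping the log in a file rather than discarding it means a failed query can still be diagnosed afterwards.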

 

I ran this on virtual machines on my own laptop, with only two worker nodes, so it is somewhat slow, but that does not affect functionality.

 

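Some people say the Thrift server has to be started first. In my tests that is not necessary: ./spark-sql works without starting it by hand. (The spark-sql CLI talks to the Hive metastore directly rather than going through the Thrift JDBC server, so there is nothing to start for it.)

hive-site.xml does not need any changes either.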

 

 
