
[Spark] Spark Part 10: Spark SQL, Part 1

Posted: 2015-01-03   Author: bit1129   Source: reposted

Spark's "one stack to rule them all" design shows up clearly in Spark SQL. A traditional Hadoop-based solution needs a separately installed tool such as Pig or Hive to handle SQL-like ad hoc queries.

 

This post uses the Spark Shell, Spark's interactive command-line terminal, to take a quick look at the SQL-like data querying that Spark provides.

 

Uploading the data to HDFS

First, upload the test data to HDFS. The test data used here is the people.txt file that ships with the Spark distribution, located at spark-1.2.0-bin-hadoop2.4\examples\src\main\resources\people.txt. Its contents are:

 

Michael, 29
Andy, 30
Justin, 19

 

Use the following command to upload people.txt to HDFS (people.txt has already been copied to the current directory):

 

 

hdfs dfs -put people.txt /user/hadoop

 

Working in the Spark Shell

 

1. Create a SQLContext object

 

scala> val cxt = new org.apache.spark.sql.SQLContext(sc)
cxt: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@ab552b0

 

2. Import the implicit conversions, which are used to turn an RDD into a SchemaRDD

 

scala> import cxt._
import cxt._
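
Why the import matters can be shown in plain Scala, without Spark. The sketch below uses made-up stand-in types (FakeRdd, FakeSchemaRdd, FakeContext and toSchemaRdd are all hypothetical names) to mimic the pattern: a context object carries an implicit conversion, and a wildcard import of its members lets the compiler apply that conversion when a method is missing on the original type. In Spark 1.2 the real conversion is, to my understanding, SQLContext's createSchemaRDD, which applies to RDDs of case classes.

```scala
import scala.language.implicitConversions

// Hypothetical stand-ins for RDD and SchemaRDD, for illustration only.
case class FakeRdd(rows: Seq[String])
case class FakeSchemaRdd(rows: Seq[String]) {
  def registerAsTable(name: String): String =
    s"registered $name with ${rows.size} rows"
}

// Hypothetical stand-in for a context object that carries the implicit conversion.
class FakeContext {
  implicit def toSchemaRdd(rdd: FakeRdd): FakeSchemaRdd = FakeSchemaRdd(rdd.rows)
}

val fakeCxt = new FakeContext
import fakeCxt._ // brings toSchemaRdd into implicit scope

// FakeRdd has no registerAsTable method, so the compiler inserts toSchemaRdd(...) here.
val message = FakeRdd(Seq("Michael, 29", "Andy, 30")).registerAsTable("People")
println(message)
```

Without the import, the last call would not compile, which mirrors what happens in the Spark Shell if step 2 is skipped before calling registerAsTable on a plain RDD.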

 

3. Define the case class Person (its fields serve as the table schema)

 

scala> case class Person(name: String, age: Int)
defined class Person

 

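4. Read the data from HDFS and map each line to a Person

The relative path people.txt resolves against the current user's HDFS home directory, which matches the upload target /user/hadoop when the shell runs as the user hadoop.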

 

scala> val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0),p(1).trim.toInt))
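
The per-line parsing in this step can be tried without a cluster. The sketch below applies the same split, trim and toInt logic to an in-memory Seq that stands in for the file contents. The names lines, parsePerson and parsedPeople are illustrative and not part of the original session.

```scala
// Same shape as the Person case class defined in step 3.
case class Person(name: String, age: Int)

// Hypothetical helper: parse one "name, age" line the way the step 4 pipeline does.
def parsePerson(line: String): Person = {
  val fields = line.split(",")
  // trim removes the space after the comma before the conversion to Int
  Person(fields(0), fields(1).trim.toInt)
}

// In-memory stand-in for the contents of people.txt.
val lines = Seq("Michael, 29", "Andy, 30", "Justin, 19")
val parsedPeople = lines.map(parsePerson)

parsedPeople.foreach(println)
```

The trim call is not optional here: each age field still carries the space that follows the comma, and toInt throws a NumberFormatException on a string with leading whitespace.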

 

5. Inspect the lineage of the people RDD

 

scala> people.toDebugString
15/01/03 06:25:17 INFO mapred.FileInputFormat: Total input paths to process : 1
res0: String = 
(1) MappedRDD[3] at map at <console>:19 []
 |  MappedRDD[2] at map at <console>:19 []
 |  people.txt MappedRDD[1] at textFile at <console>:19 []
 |  people.txt HadoopRDD[0] at textFile at <console>:19 []

 

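6. Register the people RDD as a temporary table named People (registerAsTable still works in Spark 1.2 but is deprecated in favor of registerTempTable)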

 

scala> people.registerAsTable("People")

 

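Checking the lineage of people again at this point gives the same result as in step 5: registering the table does not change the RDD's lineage.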

 

scala> people.toDebugString
res2: String = 
(1) MappedRDD[3] at map at <console>:19 []
 |  MappedRDD[2] at map at <console>:19 []
 |  people.txt MappedRDD[1] at textFile at <console>:19 []
 |  people.txt HadoopRDD[0] at textFile at <console>:19 []

 

7. Query the People table and inspect the query plan and the physical plan

 

scala> val teenagers = cxt.sql("select name from People where age < 20 and age > 10");
teenagers: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#0]
 Filter ((age#1 < 20) && (age#1 > 10))
  PhysicalRDD [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36

scala> teenagers.toDebugString
res3: String = 
(1) SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#0]
 Filter ((age#1 < 20) && (age#1 > 10))
  PhysicalRDD [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36 []
 |  MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:43 []
 |  MapPartitionsRDD[7] at mapPartitions at basicOperators.scala:58 []
 |  MapPartitionsRDD[4] at mapPartitions at ExistingRDD.scala:36 []
 |  MappedRDD[3] at map at <console>:19 []
 |  MappedRDD[2] at map at <console>:19 []
 |  people.txt MappedRDD[1] at textFile at <console>:19 []
 |  people.txt HadoopRDD[0] at textFile at <console>:19 []

 

8. Submit the query job and print the results

 

scala> teenagers.map(t => "Name:" + t(0)).collect().foreach(println)

// Result
Name:Justin
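
To make the semantics of the query in step 7 concrete, here is a rough plain-Scala equivalent on an in-memory collection: the WHERE clause becomes a filter and the SELECT list becomes a map. This only illustrates what the query computes, not how Spark SQL executes it. The names samplePeople and teenagerNames are made up for this sketch.

```scala
case class Person(name: String, age: Int)

// In-memory stand-in for the People table.
val samplePeople = Seq(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19))

// where age < 20 and age > 10  -> filter
// select name                  -> map
val teenagerNames = samplePeople
  .filter(p => p.age < 20 && p.age > 10)
  .map(_.name)

// Same formatting as the map in step 8.
teenagerNames.map(name => "Name:" + name).foreach(println)
```

Only Justin, aged 19, falls strictly between 10 and 20, so a single row survives the filter.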

 

 

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started

 

 
