[Spark Java API] Action (4): sortBy, takeOrdered, takeSample

sortBy


Official documentation:

Return this RDD sorted by the given key function.

Function prototype:

def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T]

**
sortBy sorts the elements of the RDD by the key that the given function f computes for each element.
**

Source code analysis:

def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values
}
/**
 * Creates tuples of the elements in this RDD by applying `f`.
 */
def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
  val cleanedF = sc.clean(f)
  map(x => (cleanedF(x), x))
}

**
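As the source shows, sortBy is implemented on top of sortByKey. It takes three parameters. The first is a function f that maps each element of type T to a sort key of type K; the key type does not have to match the element type. keyBy applies f through a map, turning each element into a (key, element) tuple, and after sortByKey the keys are dropped again with values. The second parameter, ascending, controls the sort direction; in the Scala API it is optional and defaults to true (ascending order). The third parameter, numPartitions, sets the number of partitions of the sorted RDD; in the Scala API it is also optional and defaults to this.partitions.length, the partition count before sorting. The Java prototype above has no default arguments, so all three values must be passed explicitly.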
**
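The keyBy, sortByKey, values chain described above can be sketched on a plain Java list. This is only an illustration of the three steps, not Spark code: the class `SortBySketch` and its `sortBy` helper are hypothetical names, and the sketch sorts in memory instead of across partitions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical local stand-in for RDD.sortBy; it works on plain lists, not on Spark.
final class SortBySketch {

    static <T, K extends Comparable<K>> List<T> sortBy(
            List<T> elements, Function<T, K> f, boolean ascending) {
        // keyBy: pair each element with the key computed by f.
        List<Map.Entry<K, T>> keyed = new ArrayList<>();
        for (T element : elements) {
            keyed.add(new SimpleEntry<>(f.apply(element), element));
        }
        // sortByKey: order the pairs by key only.
        Comparator<Map.Entry<K, T>> byKey = Map.Entry.comparingByKey();
        keyed.sort(ascending ? byKey : byKey.reversed());
        // values: drop the keys and keep the original elements.
        return keyed.stream().map(Map.Entry::getValue).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "fig", "banana");
        // The key type (Integer) differs from the element type (String).
        System.out.println(sortBy(words, String::length, true));  // [fig, pear, banana]
        System.out.println(sortBy(words, String::length, false)); // [banana, pear, fig]
    }
}
```

Note that the key type here is Integer while the elements are Strings, which is the point of the separate type parameter K.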

Example:

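// Requires java.util.*, org.apache.spark.api.java.JavaRDD and org.apache.spark.api.java.function.Function
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
final Random random = new Random(100);
// Transform the RDD so that each element has two parts: "<value>_<random number>"
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
  @Override
  public String call(Integer v1) throws Exception {
    return v1.toString() + "_" + random.nextInt(100);
  }
});
System.out.println(javaRDD1.collect());
// Sort by the second part of each element, in descending order.
// The key is a String here, so keys are compared lexicographically, not numerically.
JavaRDD<String> resultRDD = javaRDD1.sortBy(new Function<String, Object>() {
  @Override
  public Object call(String v1) throws Exception {
    return v1.split("_")[1];
  }
}, false, 3);
System.out.println("result--------------" + resultRDD.collect());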
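One detail of this example is easy to miss: the key returned by `v1.split("_")[1]` is a String, so the ordering is lexicographic. The following plain-Java sketch (no Spark involved; `SortKeySketch` and the sample values are made up for illustration) contrasts a String key with an Integer key on the same kind of data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class; it only illustrates how the key type affects ordering.
final class SortKeySketch {

    // Same key extraction as in the example: the part after "_".
    static String suffix(String s) {
        return s.split("_")[1];
    }

    static List<String> sortDescending(List<String> input, boolean numeric) {
        List<String> copy = new ArrayList<>(input);
        Comparator<String> order;
        if (numeric) {
            // Parse the suffix so that 9 < 15 < 50.
            order = Comparator.comparing(s -> Integer.valueOf(suffix(s)));
        } else {
            // Compare the raw strings, so "15" < "50" < "9".
            order = Comparator.comparing(SortKeySketch::suffix);
        }
        copy.sort(order.reversed());
        return copy;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("5_15", "1_50", "4_9");
        System.out.println(sortDescending(sample, false)); // [4_9, 1_50, 5_15]
        System.out.println(sortDescending(sample, true));  // [1_50, 5_15, 4_9]
    }
}
```

If numeric order is what you want in the Spark example, returning `Integer.valueOf(v1.split("_")[1])` from the key function should give it, since the key only has to be comparable.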

takeOrdered


Official documentation:

Returns the first k (smallest) elements from this RDD using the 
natural ordering for T while maintain the order.

Function prototype:

def takeOrdered(num: Int): JList[T]
def takeOrdered(num: Int, comp: Comparator[T]): JList[T]

**
takeOrdered returns the first num elements of the RDD, ordered either by the natural (ascending) ordering or by the given comparator.
**

Source code analysis:

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
      Iterator.single(queue)
    }
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    }
  }
}

**
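As the source shows, mapPartitions first selects candidates within each partition: every partition keeps at most num of its smallest elements in a BoundedPriorityQueue, so each element of mapRDDs is a priority queue rather than a single value. reduce then merges these queues pairwise, again keeping at most num elements, and the merged queue is converted to an array and sorted to produce the final, globally ordered result.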
**
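The per-partition queue plus merge described above can be sketched in plain Java. This is a simplified stand-in under stated assumptions, not Spark's implementation: partitions are modelled as a list of lists, `java.util.PriorityQueue` with manual eviction plays the role of BoundedPriorityQueue, and the names `TakeOrderedSketch`, `topOfPartition` and `offerBounded` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the takeOrdered idea on in-memory "partitions".
final class TakeOrderedSketch {

    // Add an item, then evict the largest (by ord) if the queue grew past num.
    static <T> void offerBounded(PriorityQueue<T> queue, T item, int num) {
        queue.offer(item);
        if (queue.size() > num) {
            queue.poll(); // the head is the largest because the queue uses ord.reversed()
        }
    }

    // Local step: keep at most num smallest elements of a single partition.
    static <T> PriorityQueue<T> topOfPartition(List<T> partition, int num, Comparator<T> ord) {
        PriorityQueue<T> queue = new PriorityQueue<>(ord.reversed());
        for (T item : partition) {
            offerBounded(queue, item, num);
        }
        return queue;
    }

    static <T> List<T> takeOrdered(List<List<T>> partitions, int num, Comparator<T> ord) {
        if (num == 0 || partitions.isEmpty()) {
            return new ArrayList<>();
        }
        // Reduce step: merge the per-partition queues, still bounded by num.
        PriorityQueue<T> merged = null;
        for (List<T> partition : partitions) {
            PriorityQueue<T> queue = topOfPartition(partition, num, ord);
            if (merged == null) {
                merged = queue;
            } else {
                for (T item : queue) {
                    offerBounded(merged, item, num);
                }
            }
        }
        // Final step: turn the queue into a list and sort it by ord.
        List<T> result = new ArrayList<>(merged);
        result.sort(ord);
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(5, 1), Arrays.asList(0, 4), Arrays.asList(4, 2, 2));
        Comparator<Integer> natural = Comparator.naturalOrder();
        Comparator<Integer> descending = Comparator.reverseOrder();
        System.out.println(takeOrdered(partitions, 2, natural));    // [0, 1]
        System.out.println(takeOrdered(partitions, 2, descending)); // [5, 4]
    }
}
```

Because every queue holds at most num elements, the amount of data merged at the end is bounded by num times the number of partitions, which is why this approach suits small values of num.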

Example:

// Note: the comparator must be serializable (requires java.io.Serializable and java.util.Comparator)
public static class TakeOrderedComparator implements Serializable, Comparator<Integer> {
    @Override
    public int compare(Integer o1, Integer o2) {
      // Reverse of the natural ordering, i.e. descending order
      return o2.compareTo(o1);
    }
}
List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeOrdered-----1-------------" + javaRDD.takeOrdered(2));
List<Integer> list = javaRDD.takeOrdered(2, new TakeOrderedComparator());
System.out.println("takeOrdered----2--------------" + list);
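The note about serialization matters because the comparator is used inside tasks, so it has to survive Java serialization. A small plain-Java check (the class `SerializableComparatorSketch` and its helper names are hypothetical) shows which kinds of comparators pass: an ordinary lambda does not, while a lambda cast to `Comparator<Integer> & Serializable` and the JDK's `Comparator.reverseOrder()` do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Comparator;

// Hypothetical helper class for checking whether a comparator can be serialized.
final class SerializableComparatorSketch {

    // Returns true if Java serialization accepts the object.
    static boolean isJavaSerializable(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException.
            return false;
        }
    }

    // A plain lambda: its target type is not Serializable.
    static Comparator<Integer> plainDescending() {
        return (a, b) -> b.compareTo(a);
    }

    // The intersection cast makes the lambda serializable.
    static Comparator<Integer> serializableDescending() {
        return (Comparator<Integer> & Serializable) (a, b) -> b.compareTo(a);
    }

    public static void main(String[] args) {
        System.out.println(isJavaSerializable(plainDescending()));                   // false
        System.out.println(isJavaSerializable(serializableDescending()));            // true
        System.out.println(isJavaSerializable(Comparator.<Integer>reverseOrder()));  // true
    }
}
```

Either of the two serializable variants could presumably replace the hand-written TakeOrderedComparator class when a simple descending order is all that is needed.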

takeSample


Official documentation:

Return a fixed-size sampled subset of this RDD in an array

Function prototype:

def takeSample(withReplacement: Boolean, num: Int): JList[T]
def takeSample(withReplacement: Boolean, num: Int, seed: Long): JList[T] 

**
takeSample returns a collection of num elements randomly sampled from the dataset (an array in the Scala API, a List in the Java API).
**

Source code analysis:

def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = {
  val numStDev = 10.0
  if (num < 0) {
    throw new IllegalArgumentException("Negative number of elements requested")
  } else if (num == 0) {
    return new Array[T](0)
  }
  val initialCount = this.count()
  if (initialCount == 0) {
    return new Array[T](0)
  }
  val maxSampleSize = Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt
  if (num > maxSampleSize) {
    throw new IllegalArgumentException("Cannot support a sample size > Int.MaxValue - " +
      s"$numStDev * math.sqrt(Int.MaxValue)")
  }
  val rand = new Random(seed)
  if (!withReplacement && num >= initialCount) {
    return Utils.randomizeInPlace(this.collect(), rand)
  }
  val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
    withReplacement)
  var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  // If the first sample didn't turn out large enough, keep trying to take samples;
  // this shouldn't happen often because we use a big multiplier for the initial size
  var numIters = 0
  while (samples.length < num) {
    logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    numIters += 1
  }
  Utils.randomizeInPlace(samples, rand).take(num)
}

**
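As the source shows, takeSample is similar to sample and takes three parameters. The first, withReplacement, selects the sampling mode: true samples with replacement, false without. The second, num, is the exact number of elements to return; this is the main difference from sample, which takes a fraction and returns an RDD of approximate size. The third, seed, is the seed for the random number generator. Internally, takeSample first counts the RDD, then computes fraction, a sampling ratio padded so that a single pass is very likely to yield at least num elements. It calls sample with that fraction and collects the result, resampling if fewer than num elements came back. Finally it shuffles the collected samples and calls take to return exactly num of them. Note that when sampling without replacement and num is greater than or equal to the number of elements in the RDD, the function returns all elements of the RDD in random order, so the result can contain fewer than num elements.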
**
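The control flow described above (early exits, oversample, resample if too few, shuffle, trim) can be sketched on an in-memory list. This is a simplified illustration under stated assumptions: it covers only sampling without replacement, the names `TakeSampleSketch` and `bernoulliSample` are hypothetical, and the 1.5 oversampling factor is an arbitrary stand-in for the fraction that Spark computes with SamplingUtils.computeFractionForSampleSize.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical in-memory sketch of the takeSample control flow (without replacement only).
final class TakeSampleSketch {

    // One sampling pass: keep each element independently with probability fraction.
    static <T> List<T> bernoulliSample(List<T> data, double fraction, Random rand) {
        List<T> kept = new ArrayList<>();
        for (T item : data) {
            if (rand.nextDouble() < fraction) {
                kept.add(item);
            }
        }
        return kept;
    }

    static <T> List<T> takeSample(List<T> data, int num, long seed) {
        if (num < 0) {
            throw new IllegalArgumentException("Negative number of elements requested");
        }
        if (num == 0 || data.isEmpty()) {
            return new ArrayList<>();
        }
        Random rand = new Random(seed);
        if (num >= data.size()) {
            // Asking for at least everything: return all elements in random order.
            List<T> all = new ArrayList<>(data);
            Collections.shuffle(all, rand);
            return all;
        }
        // Assumption: a fixed 1.5x padding instead of Spark's computed fraction.
        double fraction = Math.min(1.0, 1.5 * num / data.size());
        List<T> samples = bernoulliSample(data, fraction, rand);
        while (samples.size() < num) {
            // The pass came back too small, so sample again.
            samples = bernoulliSample(data, fraction, rand);
        }
        // Shuffle, then trim to exactly num elements.
        Collections.shuffle(samples, rand);
        return new ArrayList<>(samples.subList(0, num));
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
        System.out.println(takeSample(data, 3, 100L).size());  // 3
        System.out.println(takeSample(data, 20, 100L).size()); // 7
    }
}
```

The padding trades a slightly larger intermediate sample for a low chance of having to repeat the sampling pass, which in Spark would mean another job over the whole RDD.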

Example:

List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeSample-----1-------------" + javaRDD.takeSample(true,2));
System.out.println("takeSample-----2-------------" + javaRDD.takeSample(true,2,100));
// With replacement: returns 20 elements, even though the RDD has only 7
System.out.println("takeSample-----3-------------" + javaRDD.takeSample(true,20,100));
// Without replacement: returns all 7 elements, because num exceeds the RDD size
System.out.println("takeSample-----4-------------" + javaRDD.takeSample(false,20,100));
