Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others


scala - How to access Spark RDD Array of elements based on index

I have an RDD whose elements can be treated as tuples, as shown below. I want to access only the 4th element of the first two tuples, and then loop through the RDD in the same way.

Array[(Int, String, String, Int)] = Array(
    (1,Tom,AAA,2000), (2,Tim,AAA,3000),
    (3,Mark,BBB,6000), (4,Jim,BBB,6000), (5,James,CCC,4000))

I want to take the 4th element of tuple 1 (2000) and the 4th element of tuple 2 (3000), run some condition on them, and then do the same for tuple 2 and tuple 3, and so on. Basically, I want to loop through the RDD pair by pair.

I can write a for loop and an if statement in Scala, but I don't understand how to do it on top of an RDD, since an RDD can't be accessed with an index parameter.

Thanks, I appreciate any help. I am new to Spark, so I am still learning.



1 Answer


How to access Spark RDD Array of elements based on index

The answer is simply: don't try. RDDs are not indexed, and depending on the context, the order of values can be nondeterministic.

As far as I understand, what you want is simply a map followed by a sliding window:

import org.apache.spark.mllib.rdd.RDDFunctions._

// A dummy function
def doSomething(xs: Array[Int]) = xs match {
  case Array(x1, x2) => if (x1 <= x2) x1 else x2
}

val rdd = sc.parallelize(Array(
    (1, "Tom", "AAA", 2000),
    (2, "Tim", "AAA", 3000),
    (3, "Mark", "BBB", 6000),
    (4, "Jim", "BBB", 6000),
    (5, "James", "CCC", 4000)))

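// Keep the 4th field of each tuple, group consecutive values in pairs,
// and apply the function to each pair
rdd.map(_._4).sliding(2).map(doSomething)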

The above of course assumes that the order of values is defined; in other words, that the RDD's lineage doesn't include shuffled RDDs.
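
To check what the sliding window yields on the sample data without a cluster, the same logic can be sketched with plain Scala collections. This is only a local illustration, not Spark code: the object name SlidingSketch and the helper smallerOf are hypothetical, and a List stands in for the RDD, so the ordering concerns mentioned above do not arise here.

```scala
object SlidingSketch {
  // Local stand-in for the RDD: the same sample rows as in the question.
  val rows: List[(Int, String, String, Int)] = List(
    (1, "Tom", "AAA", 2000),
    (2, "Tim", "AAA", 3000),
    (3, "Mark", "BBB", 6000),
    (4, "Jim", "BBB", 6000),
    (5, "James", "CCC", 4000))

  // Hypothetical condition mirroring the dummy function in the answer:
  // return the smaller of two neighbouring values.
  def smallerOf(pair: List[Int]): Int = pair match {
    case List(x1, x2) => if (x1 <= x2) x1 else x2
    case other        => sys.error(s"expected a window of 2 values, got $other")
  }

  // Take the 4th field, pair up neighbours (1,2), (2,3), (3,4), (4,5),
  // and apply the condition to each pair.
  val results: List[Int] = rows.map(_._4).sliding(2).map(smallerOf).toList
}
```

With five rows there are four windows, so results holds List(2000, 3000, 6000, 4000).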

