TAGS :Viewed: 7 - Published at: a few seconds ago

[ How to join three RDDs in to a tuple? ]

I am relatively new to Apache Spark in Python, and here is what I am trying to do. I have input data as following.

  • rdd_row is a RDD of row indices (i),
  • rdd_col is a RDD of column indices (j),
  • rdd_values is a RDD of Values (v).

The above three RDDs are huge.

I am trying to convert them to a sparse rdd matrix

rdd_mat= ([rdd_row],[rdd_col],[rdd_values])

i.e,

rdd_mat=([i1,i2,i3 ..],[j1,j2,j3..], [v1,v2,v3 ..])

I have tried:

zip where rdd_row.zip(rdd_col).zip(rdd_val) 

but it ends up giving

[(i1,j1,v1),(i2,j2,v2) ..]

and

rdd1.union(rdd2) 

won't create a tuple.

Help guiding me in the right direction is much appreciated!

Answer 1


Unfortunately at this point (Spark 1.4) Scala and Java are much better choice than Python if you're interested in linear algebra. Assuming you have input as below:

import numpy as np
np.random.seed(323) 

rdd_row = sc.parallelize([0, 1, 1, 2, 3])
rdd_col = sc.parallelize([1, 2, 3, 4, 4])
rdd_vals = sc.parallelize(np.random.uniform(0, 1, size=5))

to get a rdd_mat of the desired shape you can do something like this:

assert rdd_row.count() == rdd_col.count() == rdd_vals.count()
rdd_mat = sc.parallelize(
    (rdd_row.collect(), rdd_row.collect(), rdd_vals.collect()))

but it is a rather bad idea. As already mentioned by @DeanLa parallel processing here is extremely limited not to mention every part (a whole rows list for example) will end up on a single partition / node.

Without knowing how do you want to use the output it is hard to give you a meaningful advice but one approach is to use something as below:

from pyspark.mllib.linalg import Vectors

coords = (rdd_row.
    zip(rdd_col).
    zip(rdd_vals).
    map(lambda ((row, col), val): (row, col, val)).
    cache())

ncol = coords.map(lambda (row, col, val): col).distinct().count()

rows = (coords.
    groupBy(lambda (row, col, val): row)
    .mapValues(lambda values: Vectors.sparse(
        ncol, sorted((col, val) for (row, col, val) in values))))

It will create a rdd of pairs representing row index and sparse vector of values for a given row. If you add some joins or add group by column you can implement some typical linear algebra routines by yourself nevertheless for full featured distributed data structures it is better to use Scala / Java CoordinateMatrix or another class from org.apache.spark.mllib.linalg.distributed