
[ Using Hadoop to run a jar file - Python ]

I have an existing Python program that has a sequence of operations that goes something like this:

  1. Connect to MySQL DB and retrieve files into local FS.
  2. Run a program X that operates on these files. Something like: java -jar X.jar <folder_name> This will open every file in the folder and perform some operation on them and writes out an equal number of transformed files into another folder.
  3. Then, run a program Y that operates on these files as: java -jar Y.jar <folder_name> This creates multiple files of one line each which are then merged into a single file using a merge function.
  4. This merged file is then the input for some further operations and analyses that is not really important for this question.
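The sequence above can be sketched as plain shell calls from Python. This is a minimal sketch, assuming hypothetical folder arguments and jar locations (X.jar / Y.jar on the current path); only the merge step is concrete enough to be exact:

```python
import subprocess
from pathlib import Path

def merge_one_line_files(y_out_dir, merged_path):
    """Step 3's merge: concatenate Y's one-line output files into one file."""
    with open(merged_path, "w") as merged:
        for part in sorted(Path(y_out_dir).iterdir()):
            merged.write(part.read_text())

def run_pipeline(input_dir, y_out_dir, merged_path):
    """Steps 2-4 as shell calls; folder arguments and jar paths are placeholders."""
    subprocess.run(["java", "-jar", "X.jar", str(input_dir)], check=True)  # step 2
    subprocess.run(["java", "-jar", "Y.jar", str(input_dir)], check=True)  # step 3
    merge_one_line_files(y_out_dir, merged_path)                           # merge
```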

I'd like to make use of Hadoop to speed up operation Y, as it takes very long to complete when there are a) a large number of files or b) large input files to be operated upon.

What I'd like to know is whether it is a good idea to go with Hadoop in the first place for something of this nature, or whether threads would make more sense in this case. Bear in mind that X and Y cannot be replaced or changed in any way.

I came up with this idea:

  1. After step 2 above, within a mapper, copy the files into HDFS, run the jar file on them, and have the results written back into HDFS. I would then copy the results back out to the local file system and send them off for further processing.
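The staging idea above boils down to a few HDFS shell commands wrapped in Python. A sketch, where the HDFS paths and the `results` subdirectory name are illustrative assumptions, not actual paths:

```python
import subprocess

def build_staging_commands(local_folder, hdfs_dir="/tmp/y_staging"):
    """Return the shell steps of the proposed flow (paths are illustrative):
    push the local files into HDFS, and later pull the results back out."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", "-f", local_folder, hdfs_dir],
        ["hdfs", "dfs", "-get", hdfs_dir + "/results", local_folder + "_results"],
    ]

def run_staging(local_folder):
    # Execute each staging step, failing fast on a non-zero exit code.
    for cmd in build_staging_commands(local_folder):
        subprocess.run(cmd, check=True)
```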

I would like to know if this makes sense at all, and especially, given that a mapper expects a (key, value) pair, whether I would even have a k-v pair in this scenario.

I know this sounds like a project, and that's because it is. But I'm not looking for code, just some guidance about whether this would even work and, if it did, the right way to go about it if my proposed solution is not accurate (enough).

Thank you!

Answer 1

You absolutely can use the Hadoop MapReduce framework to do this work, but whether it's a good idea depends on the number and sizes of the files you want to process.

Keep in mind that HDFS is not good at dealing with small files; a large number (say, 10 million) of small files (less than 1 KB each) can be a disaster for the NameNode. On the other hand, if the files are very large but only a few need to be processed, it is not a good idea to just wrap step 2 directly in a mapper, because the work won't be spread widely and evenly across the cluster. In that situation, the key-value pair can only be "file number - file content" or "file name - file content", given that you mentioned X can't be changed in any way; "line number - line" would actually be more suitable.
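To make the "line number - line" choice concrete: with Hadoop Streaming, a mapper just reads raw input lines on stdin and writes tab-separated key-value pairs to stdout. A minimal sketch of such a mapper:

```python
#!/usr/bin/env python
import sys

def map_lines(stream, out):
    # Emit tab-separated (key, value) pairs: here "line number <TAB> line",
    # since Hadoop Streaming hands the mapper raw input lines on stdin.
    for i, line in enumerate(stream):
        out.write("%d\t%s\n" % (i, line.rstrip("\n")))

if __name__ == "__main__":
    map_lines(sys.stdin, sys.stdout)
```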

BTW, there are two ways to use the Hadoop MapReduce framework. One is to write the mapper/reducer in Java, compile them into a jar, and run the MapReduce job with hadoop jar your_job.jar . The other is Hadoop Streaming, which lets you write the mapper/reducer in Python.
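To illustrate the streaming route, here is a word-count-style mapper/reducer pair; the two roles communicate only through tab-separated lines on stdin/stdout. The submission command would look roughly like `hadoop jar hadoop-streaming.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py` (exact jar path varies by installation):

```python
import sys
from itertools import groupby

def mapper(lines, out):
    # Mapper: emit "word<TAB>1" for every word seen.
    for line in lines:
        for word in line.split():
            out.write(word + "\t1\n")

def reducer(lines, out):
    # Reducer: Hadoop delivers mapper output sorted by key, so consecutive
    # lines with the same key can be summed with groupby.
    pairs = (l.rstrip("\n").split("\t") for l in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.write("%s\t%d\n" % (key, sum(int(v) for _, v in group)))

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin, sys.stdout)
```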