Within my mapper I'd like to call external software installed on the worker node outside of the HDFS. Is this possible? What is the best way to do this?
I understand that this may take some of the advantages/scalability of MapReduce away, but I'd like to both interact with HDFS and call compiled/installed external software within my mapper to process some data.
Mappers (and reducers) are like any other process on the box: as long as the TaskTracker user has permission to run the executable, there is no problem doing so. There are a few ways to call external processes, but since we are already in Java, ProcessBuilder seems a logical place to start.
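For example, here is a minimal sketch of calling an external binary from inside a mapper with ProcessBuilder. The tool path /usr/local/bin/mytool and the way its output is emitted are just placeholders for whatever is actually installed on your worker nodes:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ExternalToolMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Launch the external program installed on the worker node.
            // "/usr/local/bin/mytool" is a hypothetical path.
            ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/mytool", value.toString());
            pb.redirectErrorStream(true);          // merge stderr into stdout
            Process p = pb.start();

            // Read whatever the tool prints and emit it as map output.
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                context.write(new Text(key.toString()), new Text(line));
            }

            if (p.waitFor() != 0) {
                throw new IOException("External tool exited with non-zero status");
            }
        }
    }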
EDIT: Just found that Hadoop has a class explicitly for this purpose: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Shell.html
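A rough sketch of using that utility via its ShellCommandExecutor inner class (the command being run here is just a placeholder):

    import java.io.IOException;
    import org.apache.hadoop.util.Shell.ShellCommandExecutor;

    public class ShellExample {
        public static void main(String[] args) throws IOException {
            // Run an external command through Hadoop's Shell utility.
            // The command and its arguments are placeholders.
            ShellCommandExecutor exec =
                    new ShellCommandExecutor(new String[] {"ls", "-l", "/tmp"});
            exec.execute();                       // blocks until the command finishes
            System.out.println(exec.getOutput()); // captured stdout of the command
        }
    }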
This is certainly doable. You may find it best to work with Hadoop Streaming. As it says on that website:
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.
I tend to start by wrapping external code with Hadoop Streaming. Depending on your language, there are likely many good examples of how to use it with Streaming; once you are inside your language of choice, you can usually pipe data out to another program if desired. I have had several layers of programs in different languages playing nicely together with no more effort than running them on a normal Linux box, beyond getting the outer layer working with Hadoop Streaming.