Using Pig and Python_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2023-03-18 00:46 出处：网络

Apologies if this question is poorly worded: I am embarking on a large scale machine learning project and I don\'t like programming in Java. I love writing programs in Python. I have heard good things

相关专题：apache-pig

Apologies if this question is poorly worded: I am embarking on a large scale machine learning project and I don't like programming in Java. I love writing programs in Python. I have heard good things about Pig. I was wondering if someone could clarify to me how usable Pig is in combination with Python for mathematically related work. Also, if I am to write "streaming python code",开发者_如何学C does Jython come into the picture? Is it more efficient if it does come into the picture?

Thanks

P.S: I for several reasons would not prefer to use Mahout's code as is. I might want to use a few of their data structures: It would be useful to know if that would be possible to do.

Another option to use Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script as where the data processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. The joins, groupings, etc. work similarly to Pig in spirit, so there is no surprise there if you already know Pig.

A word counting example looks like this:

@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')

    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output

    flow.run()

When you use streaming in pig, it doesn't matter what language you use... all it is doing is executing a command in a shell (like via bash). You can use Python, just like you can use grep or a C program.

You can now define Pig UDFs in Python natively. These UDFs will be called via Jython when they are being executed.

The Programming Pig book discusses using UDFs. The book is indispensable in general. On a recent project, we used Python UDFs and occasionally had issues with Floats vs. Doubles mismatches, so be warned. My impression is that the support for Python UDFs may not be as solid as the support for Java UDFs, but overall, it works pretty well.

Using Pig and Python

精彩评论

关注公众号

热门标签

图文推荐

Using Pig and Python

更多 问答 相关资讯：

精彩评论

关注公众号

热门标签

图文推荐

更多问答相关资讯：