How to replace Unicode values using re in Python?

Developer https://www.devze.com 2023-03-18 15:08 Source: Internet

How can I replace Unicode characters using re in Python? I'm looking for something like this:

line.replace('Ã','')
line.replace('¢','')
line.replace('â','')

Or is there a way to remove all non-ASCII characters from a file? I converted a PDF file to ASCII, and the output contains some non-ASCII characters (e.g. bullets from the PDF).

Please help me.


Edit after feedback in comments.

Another solution would be to check the numeric value of each character and keep only those below 128, since ASCII covers 0 to 127. Like so:

# coding=utf-8

def removeUnicode():
    text = "hejsanäöåbadasd wodqpwdk"
    asciiText = ""
    for char in text:
        # Keep only characters in the ASCII range (0-127).
        if ord(char) < 128:
            asciiText = asciiText + char

    return asciiText

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())

Here's an altered version of jd's answer with benchmarks:

# coding=utf-8

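def removeUnicode():
    text = u"hejsanäöåbadasd wodqpwdk"
    if isinstance(text, str):
        # Python 2 byte string: decode to unicode first, then drop non-ASCII.
        return text.decode('utf-8').encode("ascii", "ignore")
    else:
        # Already a unicode string: encode directly, ignoring non-ASCII.
        return text.encode("ascii", "ignore")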

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())

Output of the first solution with a str string as input:

computer:~ Ancide$ python test1.py
Time taken: 5.88719677925

Output of the first solution with a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 7.21077990532

Output of the second solution with a str string as input:

computer:~ Ancide$ python test1.py
Time taken: 2.67580914497

Output of the second solution with a unicode string as input:

computer:~ Ancide$ python test1.py
Time taken: 1.740680933

Conclusion

Encoding is faster and takes less code, so it is the better solution.
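
For readers on Python 3, where str is always Unicode and the print statement is gone, the comparison above can be reproduced roughly as follows. This is a minimal sketch, not the original benchmark: the function names remove_by_loop and remove_by_encode are made up for illustration, and timings vary by machine, so none are quoted here.

```python
import timeit

TEXT = "hejsanäöåbadasd wodqpwdk"  # same sample string as in the snippets above


def remove_by_loop(text=TEXT):
    # Keep only characters whose code point is in the ASCII range (0-127).
    return "".join(ch for ch in text if ord(ch) < 128)


def remove_by_encode(text=TEXT):
    # Encode to ASCII, dropping unencodable characters, then decode back to str.
    return text.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    for func in (remove_by_loop, remove_by_encode):
        # timeit.timeit accepts a callable; number is kept small for a quick run.
        seconds = timeit.timeit(func, number=100000)
        print(func.__name__, "->", repr(func()), "took", seconds, "s")
```

Both functions should return the same cleaned string; only the elapsed time differs.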


Why replace characters at all when you can re-encode instead:

title.decode('latin-1').encode('utf-8')

or, if you want undecodable bytes replaced with U+FFFD instead of raising an error (use errors='ignore' to drop them):

unicode(title, errors='replace')
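
To make the difference between these calls concrete, here is a small Python 3 sketch (an assumption; the snippets above are Python 2, where unicode() plays the role of str()). The byte string raw is a made-up sample encoded as Latin-1.

```python
raw = b"caf\xe9 menu"  # hypothetical Latin-1 bytes; \xe9 is "é" in Latin-1

# Transcode: interpret the bytes as Latin-1, then write them back as UTF-8.
as_utf8 = raw.decode("latin-1").encode("utf-8")

# Decode as ASCII, substituting U+FFFD for each byte that is not ASCII.
replaced = str(raw, "ascii", errors="replace")

# Decode as ASCII, silently dropping each byte that is not ASCII.
ignored = str(raw, "ascii", errors="ignore")

print(as_utf8)
print(replaced)
print(ignored)
```

The first call preserves the character, the second marks where it was, and the third discards it.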


You have to encode your Unicode string to ASCII, ignoring any error that occurs. Here's how:

>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'
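
On Python 3 the same call returns a bytes object, so a decode step is usually wanted afterwards. Since the question is about a file, the sketch below applies the idea line by line. It assumes Python 3; the names strip_non_ascii and clean_lines are hypothetical, and io.StringIO stands in for a real file.

```python
import io


def strip_non_ascii(text):
    # encode() drops every character that has no ASCII representation;
    # decode() turns the resulting bytes back into a str.
    return text.encode("ascii", "ignore").decode("ascii")


def clean_lines(lines):
    # Works on any iterable of lines, including an open text file.
    return [strip_non_ascii(line) for line in lines]


# io.StringIO stands in for a file opened with open(path, encoding="utf-8").
sample = io.StringIO("• first item\nuéa&à\n")
cleaned = clean_lines(sample)
print(cleaned)
```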


Try passing the re.UNICODE flag when compiling the pattern, like this:

re.compile("pattern", re.UNICODE)

For more information, see the re module documentation.
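
Note that re.UNICODE only changes how classes such as \w and \d match; on its own it does not remove anything. To actually delete the characters with re, as the question asks, one option is re.sub with a negated ASCII range. A minimal Python 3 sketch; the name NON_ASCII and the sample line are illustrative only:

```python
import re

# Matches one or more characters outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

line = "Ãa¢bâc"  # made-up sample using the characters from the question
cleaned = NON_ASCII.sub("", line)
print(cleaned)
```

Passing a different replacement string to sub(), such as "?", would mark the removed spots instead of deleting them.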
