
Managing Mac OS-created filenames with non-ASCII characters in Windows environments?

Developer https://www.devze.com 2023-04-11 13:41 Source: Internet

I deal with large collections of unknown files, and have been learning Python to help me filter, sort, and otherwise wrangle them.

A collection I am looking at has a large number of resource forks, and I wrote a little script to find and delete them (the next step is to find and move them, but that's for another day).

I found that a number of files in this collection have non-ASCII characters in their names, and this seems to be tripping up the os.remove function.

Example file name: ._spec com report 395 (N.B. the 3 has a small dot underneath it; I can't find an example of the character, or figure out how to show the hex of the filename.)

I log all the filenames; this is what the log records for that file: ._spec com report 3?95

The error I get is a WindowsError, as it can't find the file (the string it is passing is not the name the file is known by to the Windows OS). I put in a try clause to let me work around it, but I would really like to deal with it properly.

I also tried passing a Unicode string to os.walk, `os.walk(u'.')`, as per this post: Handling ascii char in python string (top answer), and I see the following error:

Traceback (most recent call last):
 File "<stdin>", line 3, in <module>
 File "c:\python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\uf022' in position 20: character maps to <undefined>

So I am guessing the answer lies in how the filename is parsed, and I am wondering if anyone might be able to point me in the right direction...

code:

import os
import sys

# Raw string, so the "\t" in the path is not read as a tab character.
rootdir = r"c:\target Dir to walk"
destKeep = "Keepers.txt"
destDelete = "Deleted.txt"

# Resource-fork companion files start with "._".
matchingText = "._"
files_removed = 0
for folder, subs, files in os.walk(rootdir):
    outfileKeep = open(destKeep,"a")
    outfileDelete = open(destDelete,"a")
    for filename in files:
        matchScore = filename.find(matchingText)
        src = os.path.join(folder, filename)
        srcNewline = src + ", " + str(filename) + "\n"
        if matchScore == -1:
            outfileKeep.writelines(srcNewline)
        else:
            outfileDelete.writelines(srcNewline)
            try:
                os.remove(src)
                # Count only the files that were actually deleted.
                files_removed += 1
            except WindowsError:
                print "I was unable to delete this file:"
                outfileKeep.writelines(srcNewline)
    outfileKeep.close()
    outfileDelete.close()

if files_removed:
    print '%d files removed' % files_removed
else:
    print 'No files removed'


os.walk(u'.') is the normal way to get native-Unicode filenames and it should work fine; it does for me.

Your problem is here instead:

srcNewline = src + ", " + str(filename) + "\n"

str(filename) will use the default encoding to convert your Unicode string back down to bytes, and because that encoding doesn't have the character U+F022(*) you get a UnicodeEncodeError. You will have to choose which encoding you want to store in your output file, either by encoding the finished line yourself, e.g. srcNewline = (u'%s, %s\n' % (src, filename)).encode('utf-8'), or (perhaps better) by keeping your strings as Unicode and writing them to a file opened with codecs.open.
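The two options above can be sketched as follows. This is only an illustration under stated assumptions: the sample name is hypothetical (the logged "3?95" suggests the odd character sits right after the "3"), the log is written to a temporary directory, and io.open is used in place of codecs.open because it accepts the same encoding argument and is available on Python 2.6+ as well as Python 3.

```python
import io
import os
import tempfile

# Hypothetical name: assumes U+F022 follows the "3", as the log suggests.
filename = u"._spec com report 3\uf02295"

# Encoding to a code page that lacks the character raises an error...
try:
    filename.encode("cp850")
    cp850_ok = True
except UnicodeEncodeError:
    cp850_ok = False

# ...whereas UTF-8 can represent every Unicode character.
utf8_bytes = filename.encode("utf-8")

# Alternatively, keep the text as Unicode and let the file object encode it.
log_path = os.path.join(tempfile.mkdtemp(), "Deleted.txt")
with io.open(log_path, "a", encoding="utf-8") as log:
    log.write(u"%s\n" % filename)

# Reading the log back with the same encoding recovers the original name.
with io.open(log_path, encoding="utf-8") as log:
    logged = log.read().rstrip(u"\n")
```

Letting the file object do the encoding keeps every string in the script as Unicode, so the name passed to os.remove is the same one that os.walk returned.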

(*: which is a Private Use Area character that shouldn't be used, but not much you can do about that now I guess...)
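If it helps to see which names carry such characters, a rough sketch along these lines could flag them and show their code points. The helper names private_use_chars and escaped are made up for this illustration, and the sample name is again an assumption based on the log entry in the question.

```python
import unicodedata


def private_use_chars(name):
    """Return (index, 'U+XXXX') pairs for Private Use Area characters."""
    # Unicode general category "Co" marks private-use code points.
    return [(i, u"U+%04X" % ord(ch))
            for i, ch in enumerate(name)
            if unicodedata.category(ch) == "Co"]


def escaped(name):
    """ASCII-only rendering that can be printed on any console."""
    return name.encode("unicode_escape").decode("ascii")


# Hypothetical name: assumes U+F022 follows the "3", as the log suggests.
sample = u"._spec com report 3\uf02295"
found = private_use_chars(sample)
safe = escaped(sample)
print(safe)
```

Printing the escaped form rather than the raw name sidesteps the console code page entirely, which is one way to inspect the "hex" of a filename without triggering the cp850 error.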
