I have a .csv list of businesses. The file has some strange characters in. For example, in this field: Stocktonon-Tees
, the first hyphen, between Stockton
and on
seems to be a char开发者_Python百科acter with the value 6
rather than a hyphen, with the value 45
. Stack overflow will probably sanatize this so you can't see it, so here is a pastebin:
http://pastebin.com/NuyyaQy9
Can anyone explain why this could be? Is it some encoding issue that I have missed? Or a corruption in the dataset?
Yes, it's almost certainly an encoding issue. A file just consists of binary data - it's how you interpret that binary data that matters. It sounds like Notepad is guessing at the originally-intended encoding, but whatever else you're using isn't.
Unfortunately you haven't said anything about what software is trying to read the file or what wrote it in the first place - but you should look at what encoding Notepad thinks it is, and work from there.
If it's your code that wrote the file out, and you get to decide the encoding, I'd recommend UTF-8 as a good general purpose, platform-portable encoding.
精彩评论