Non-UTF8 files (Google CSV file)

Developer https://www.devze.com 2023-02-06 18:59 Source: Internet
I'm running into weird encoding issues when handling uploaded files.

I need to accept any sort of text file, and be able to read the contents. Specifically having trouble with files downloaded from a Google Contacts export.

I've tried the usual utf8_encode/utf8_decode, mb_detect_encoding, etc. Detection always reports the string as UTF-8. I also tried many iconv options to convert the encoding, without success.

test.php

header('Content-type: text/html; charset=UTF-8');

if ($stream = fopen($_FILES['list']['tmp_name'], 'r'))
{
    $string = stream_get_contents($stream);

    fclose($stream);
}

echo substr($string, 0, 50);
var_dump(substr($string, 0, 50));
echo base64_encode(serialize(substr($string, 0, 50)));

Output

��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==


The string begins with the bytes \xFF \xFE, which are the Byte Order Mark for UTF-16 Little Endian. Every letter is actually a two-byte sequence: for this text, the ASCII character followed by a trailing \0.
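To see this for yourself, it can help to dump the first bytes as hex and compare them against the well-known byte order marks. The following is a minimal sketch, not part of the original code: `bom_encoding` is a hypothetical helper name, and it only knows the UTF-8 and UTF-16 marks (UTF-32 is deliberately left out).

```php
<?php
// Hypothetical helper: name the encoding implied by a leading byte order mark.
// Returns null when no known BOM is present. UTF-32 is not handled here.
function bom_encoding(string $raw): ?string
{
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($raw, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    return null;
}

// The first bytes of the uploaded file, as shown in the question's output.
$head = "\xFF\xFEN\x00a\x00m\x00e\x00";

echo bin2hex($head), "\n";       // fffe4e0061006d006500
echo bom_encoding($head), "\n";  // UTF-16LE
```

In the upload handler you would pass the first few bytes of `$string` instead of the hard-coded sample.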

Printed to a console, the text may look correct, since many terminals silently drop the NUL bytes. For a UTF-8 page, though, you need to decode it explicitly (best via iconv) to make the whole string displayable.
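A minimal sketch of that decoding step, assuming the input really is UTF-16LE as the BOM suggests. The function name `utf16le_to_utf8` is made up for illustration; the sketch strips the BOM by hand and names the source encoding explicitly, so it does not depend on how a particular iconv build treats a leading BOM.

```php
<?php
// Hypothetical helper: convert a UTF-16LE string (with or without BOM) to UTF-8.
function utf16le_to_utf8(string $raw): string
{
    // Drop the byte order mark so it does not end up in the output.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }

    $utf8 = iconv('UTF-16LE', 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid UTF-16LE');
    }
    return $utf8;
}

// Sample bytes taken from the start of the question's output.
echo utf16le_to_utf8("\xFF\xFEN\x00a\x00m\x00e\x00,\x00"); // Name,
```

In test.php this would be applied to `$string` right after `stream_get_contents()`, before any `substr()` call, so that the cut does not land in the middle of a two-byte character.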


When I decoded the base64 piece, I saw a strange mixed string: s:50:"\xff\xfeN\x00a\x00m\x00e\x00,\x00G\x00i\x00v\x00e\x00n\x00 \x00N\x00a\x00m\x00e\x00,\x00A\x00d\x00d\x00i\x00t\x00i\x00o\x00n\x00". The part after the second : is a 2-byte Unicode (UCS-2) string enclosed in ASCII " characters, while "s" and "50" are plain ASCII. The \xff\xfe piece is the byte-order mark of the UCS-2 string. This is insane but parseable.

I suppose you could split the input string on :, strip the " from the beginning and end, and decode each resulting piece separately.
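Note that the s:50:"…" wrapper in the base64 dump is produced by the serialize() call in the question's test.php, not by the uploaded file. So instead of splitting by hand, one option is to undo the debug encoding with base64_decode() and unserialize(), then convert the UTF-16LE payload. A rough sketch using the exact base64 string from the question's output:

```php
<?php
// The base64 line printed by test.php in the question.
$dump = 'czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==';

// Undo base64_encode(serialize(...)); object instantiation is disabled for safety.
$raw = unserialize(base64_decode($dump), ['allowed_classes' => false]);

// $raw is the original 50 bytes: a 2-byte BOM plus 24 UTF-16LE characters.
echo strlen($raw), "\n"; // 50

// Strip the BOM and convert the remainder to UTF-8.
$utf8 = iconv('UTF-16LE', 'UTF-8', substr($raw, 2));
echo $utf8, "\n"; // Name,Given Name,Addition
```

For the real upload none of the unwrapping is needed: the file contents in `$string` are already the raw UTF-16LE bytes.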
