Using EmEditor saving a Unicode file to another format distorts/changes the format. Solution?_问答_开发者

Using EmEditor saving a Unicode file to another format distorts/changes the format. Solution?

开发者 https://www.devze.com 2023-03-19 08:48 出处：网络

There is a MySQL backup file which is a huge file - about 3 GB. There is one table that has a LONGBLOB column that stores JPEG image data.

The file imports successfully if done from MySQL Workbench - Data Import/Restore.

I need to open this file and extract the first few lines (about two rows of INSERTs of the table with the image data) so that I can test if another program can import this data into another MySQL database.

I tried opening the file with EmEditor (which is good at opening large files) and then copy/paste only upto one Insert statement of the script into a new file (upto about line 25, because the table in question is the first table in the backup script), and then Paste the selection into a new file.

Here comes the problem:

However this messes up the encoding (even though I save as utf8). I realize this when I try to import (restore) this new file (again using MySQL Workbench) into a MySQL database, the restore goes ahead without errors, but the JPEG images in the blob column are now destroyed/corrupted.

My guess is that the encoding is different between the original file and new file.

EmEditor does not show the encoding on the original file, there is an option to detect, and it detects it as 'UTF8 Unsigned'. But when saving I save it as UTF8. I tried also saving as ANSI, ISO8859 (windows default), etc, etc.. but everytime the same result.

Do you have any solution for this particular problem? ie I want to only cut the first few lines of the huge backup file and save to a new file keeping the encoding the same, so that the images (blobs) are not changed. Is there any way this can be done with EmEditor (ie do I have the wrong approach [ie Cut-Paste]?) Is there any specialized software that can do this? How can I diagnose what is going 开发者_如何学运维wrong here?

Thanks for any responses.

this messes up the encoding (even though I save as utf8)

UTF-8 is not a good choice for arbitrary binary data. There are many sequences of high-bytes which are not valid in UTF-8, so you will mangle them at some point during the load-alter-save process.

If you load the file using an encoding that maps every single byte to a unique character, and re-save the file using that same encoding, you should preserve the original content(*). ISO-8859-1 is the encoding usually chosen for this purpose, since it simply maps each byte 0..0xFF to the Unicode code point with the same number.

(*: assuming the editor is binary-safe with regard to other tricky points like nulls, \n/\r and other control characters... I believe EmEditor can be.)

When opening the original file in EmEditor, trying selecting the encoding as Binary (ASCII View). The Binary (ASCII View) will, as bobince said, map each byte to a unique character and preserve that when you save the file. I think this should fix your problem.