How can I perform a buffered search and replace?

Developer https://www.devze.com 2023-04-11 11:15 Source: Internet

I have XML files that contain invalid character sequences which cause parsing to fail. They look like &#x10;. To solve the problem, I am escaping them by replacing the whole thing with an escape sequence: &#x10; --> !#~10^. Then, after I am done parsing, I can restore them to what they were.

import re

buffersize = 2**16   # 64 KB buffer

def escape(filename):
    out = open(filename + '_esc', 'w')

    with open(filename, 'r') as f:
        buffer = 'x'     # is there a prettier way to handle the first one?
        while buffer != '':
            buffer = f.read(buffersize)
            out.write(re.sub(r'&#x([a-fA-F0-9]+);', r'!#~\1^', buffer))

    out.close()

The files are very large, so I have to use buffering (mmap gave me a MemoryError). Because the buffer has a fixed size, I run into problems when a buffer boundary happens to fall in the middle of a sequence. Imagine the buffer size is 8, and the file is like:

 123456789
 hello!&#x10;

The buffer will only read hello!&#, allowing &#x10; to slip through the cracks. How do I solve this? I thought of getting more characters if the last few look like they could belong to a character sequence, but the logic I thought of is very ugly.
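
The problem can be reproduced in memory with a small sketch. It assumes Python 3 and uses io.StringIO in place of a real file; escape_naive is a hypothetical name for the same read-and-substitute loop as above.

```python
import io
import re

def escape_naive(text, buffersize):
    """Escape chunk by chunk, with no handling of sequences split across chunks."""
    src = io.StringIO(text)
    out = []
    while True:
        buf = src.read(buffersize)
        if not buf:          # empty read means end of input
            break
        out.append(re.sub(r'&#x([a-fA-F0-9]+);', r'!#~\1^', buf))
    return ''.join(out)

# A large buffer sees the whole sequence; a buffer of 8 splits it after "&#".
whole = escape_naive("hello!&#x10;", 64)
split = escape_naive("hello!&#x10;", 8)
```

With the 8-character buffer the sequence comes out unescaped, because neither "hello!&#" nor "x10;" matches the pattern on its own.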


First, don't bother reading and writing the file. You can create a file-like object that wraps your open file and processes the data before it's handled by the parser. Second, your buffering only needs to take care of the ends of the chunks it reads. Here's some working code:

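import re

class Wrapped(object):
    def __init__(self, f):
        self.f = f
        self.buffer = ""    # unescaped tail held back from the previous read

    def read(self, size=-1):
        data = self.f.read(size)
        buf = self.buffer + data
        self.buffer = ""
        # If there's an unterminated ampersand near the end, hold onto that
        # piece until we have more, to be sure we don't miss one. At end of
        # file (empty read) nothing more is coming, so everything is flushed.
        last_amp = buf.rfind("&", max(0, len(buf) - 10))
        if data and last_amp != -1 and ";" not in buf[last_amp:]:
            self.buffer = buf[last_amp:]
            buf = buf[:last_amp]
        # Escape only what is returned, so held-back text is never escaped twice.
        buf = buf.replace("!", "!!")
        return re.sub(r"&(#x[0-9a-fA-F]+;)", r"!\1", buf)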

Then in your code, replace this:

it = ET.iterparse(open(xml, "rb"))

with this (opened in text mode, since Wrapped operates on strings):

it = ET.iterparse(Wrapped(open(xml, "r")))

Third, I used a substitution replacing "&" with "!", and "!" with "!!", so you can fix them after parsing, and you aren't counting on obscure sequences. This is Stack Overflow data after all, so lots of strange random punctuation could occur naturally.
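
Undoing the substitution after parsing could look like the sketch below. The name unescape is hypothetical and not part of the answer above; it assumes the text was escaped exactly as described ("!" doubled first, then "&#x...;" rewritten to "!#x...;").

```python
import re

# Either a doubled "!" or an escaped character reference such as "!#x10;".
_ESCAPED = re.compile(r"!(!|#x[0-9a-fA-F]+;)")

def unescape(text):
    """Reverse the escaping: "!!" -> "!" and "!#x10;" -> "&#x10;"."""
    def restore(match):
        token = match.group(1)
        return "!" if token == "!" else "&" + token
    return _ESCAPED.sub(restore, text)

restored = unescape("hello!!!#x10; world!!")
```

Matching both forms in a single left-to-right pass matters: a literal "!#x10;" in the original data is escaped to "!!#x10;", and the pass turns it back into "!#x10;" rather than mistaking it for a reference.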


If your sequence is 6 characters long, you can use buffers with 5 overlapping characters. That way, you can be sure no sequence will ever slip between the buffers.

Here is an example to help you visualize:

--&#x10
  &#x10;--

--&#x10;
   #x10;--

As for the implementation, just prepend the last 5 characters of the previous buffer to the new buffer:

buffer = buffer[-5:] + f.read(buffersize)

The only problem is that the concatenation may require a copy of the whole buffer. Another solution, if you have random access to the file, is to rewind a little bit with:

f.seek(-5, os.SEEK_CUR)
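
As a rough sketch of the rewinding variant, the generator below yields chunks that overlap by a fixed number of bytes. It assumes Python 3 and a file opened in binary mode (text-mode files there do not allow relative seeks); overlapping_chunks is a hypothetical helper name.

```python
import io
import os

def overlapping_chunks(f, buffersize, overlap=5):
    """Yield chunks from a binary file, each starting `overlap` bytes before the previous one ended."""
    if buffersize <= overlap:
        raise ValueError("buffersize must be larger than overlap")
    while True:
        chunk = f.read(buffersize)
        if not chunk:
            return
        yield chunk
        if len(chunk) < buffersize:
            return                        # short read: end of file reached
        f.seek(-overlap, os.SEEK_CUR)     # rewind so the next chunk overlaps

chunks = list(overlapping_chunks(io.BytesIO(b"--&#x10;--"), 8))
```

The overlap only guarantees that every sequence appears whole in at least one chunk. The overlapped bytes are still read twice, so the code that writes the output has to avoid emitting them twice.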

In both cases, you'll have to modify the script slightly to handle the first iteration.
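
One way the modified script might look is sketched below. It is a variation on the overlap idea rather than a literal implementation: instead of re-reading a fixed 5 characters, it holds back an unterminated "&..." tail and prepends it to the next buffer, so nothing is written twice. The names escape_stream and MAX_SEQ_LEN are hypothetical, and the length bound of 12 characters is an assumption about the data.

```python
import io
import re

SEQ = re.compile(r'&#x([a-fA-F0-9]+);')
MAX_SEQ_LEN = 12   # assumed upper bound on the length of one "&#x...;" sequence

def escape_stream(src, dst, buffersize=2**16):
    """Copy src to dst, escaping sequences even when they straddle two buffers."""
    carry = ""                            # unterminated tail of the previous buffer
    while True:
        chunk = src.read(buffersize)
        data = carry + chunk
        carry = ""
        if chunk:
            # Look for a "&" in the last MAX_SEQ_LEN - 1 characters that has
            # no ";" after it yet: it may be the start of a split sequence.
            start = data.rfind("&", max(0, len(data) - (MAX_SEQ_LEN - 1)))
            if start != -1 and ";" not in data[start:]:
                carry, data = data[start:], data[:start]
        dst.write(SEQ.sub(r'!#~\1^', data))
        if not chunk:                     # end of input: carry was flushed above
            return

results = []
for size in range(1, 30):
    out = io.StringIO()
    escape_stream(io.StringIO("hello!&#x10;--&#xA;x"), out, buffersize=size)
    results.append(out.getvalue())
```

Because the loop starts with an empty carry, the first iteration needs no special case, and the output is the same for every buffer size tried here.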
