开发者

Advanced C# pattern search in long string (100-25000 char)

开发者 https://www.devze.com 2023-04-13 03:15 出处:网络
L开发者_Go百科et me start with this: I can\'t zip it or anything similar. What I\'m trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color fr

L开发者_Go百科et me start with this: I can't zip it or anything similar.

What I'm trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color from my palette. The g is a space to divide the numbers. The 12 means 12 pixels in a row use that color. The h is to divide the numbers again.)

The problem I'm having is that the blocks aren't all the same length. They range from 0g1h to 2546g115h. Basically I want to create a palette of common patterns to hopefully save space. Say I have: 12g345h19g12h190g11h occurring at least three times, then I could save space if I had something like: a=12g345h19g12h190g11h in the palette array and just put 'a' in the string. Or even not look at the data blocks, as you see in the attached file you get g640h a ton of times.

I could be wrong, but I'm pretty sure this could work. If you have a better idea how I could save space and not lose data, I'm more than open to the ideas.

Here is a great example since you can visually see the pattern: http://pastebin.com/5dbhxZQK. I chose this file because I knew it would have massive redundancy; most aren't this simple.


You could use a dictionary (probably Dictionary<string, int> and just could how many times each pattern occurs, then go back and rewrite the string with the appropriate replacements.

However, I would recommend that you read up a little about compression algorithms, what you are implementing appears to be a Run Length Encoding (RLE) scheme. You are then trying to compress again on top of that, consider looking at how Sliding Window compression works (which is what GZIP does) as an alternative to your RLE. Or look at Huffman encoding as a mechanism to reduce the amount of space needed for the codewords that you are creating (in simple terms Huffman encoding uses shorter symbols for more frequent patterns and longer symbols for less frequent patterns in a 'optimal' way)

This is a fun problem space to play in! Good Luck!

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号