Binary run length encoding_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2023-04-09 14:08 出处：网络

I have a web开发者_如何转开发 form, for the contents of which I would like to generate a short representation in Base64. The form, among other things, contains a list of 264 binary values, the greater part of which are going to be 0 at any single time. (They represent regions on a geographical map). Even in Base64, this 264-bit number generates a long, intimidating string. I want to implement run-length encoding, as efficiently as possible. Can you help me with this? I've googled binary RLE, but have found nothing of use.

What I've tried this far - running RLE on the binary string using decimal counts and "A" as a separator denoting a change between 0 and 1, then converting the result from base 11 to base 64. For example:

00000000001111111000000010000000000000000000000001111111110001111010101000000000000000000000000000000000000111111111110111000000000000111111100000001000000000000000000000000111111111000111101010100000000000000000000000000000000000011111111111011100

becomes

10A5A5AA22A7A1A2AAAAAAA34A9AA1A10A5A5AA22A7A1A2AAAAAAA34A9AA1A

which in turn becomes

CNnbr/FxkgbbOw0LNAKgk65P8SdvaTG+t74o

or, in base 62,

6imo7zq1pqr2mqglTHzXwJRAksm7fvHZHWQK

It's better, but I still can't help but doubt if I'm doing something wrong - is using the digit "A" as a separator is the best way to do this?

And another update:

Thanks to @comingstorm, I have shortened the compressed string some more.

ILHHASCAASBYwwccDASYgAEgWDI=

As I mentioned it in the comments, real usage cases would generally result in an even shorter string.

Since you're coding bits, you probably want to use a bit-based RLE instead of a byte-based one. In this context, you should consider Elias gamma coding (or some variant thereof) to efficiently encode your run lengths.

A reasonable first approximation for your encoding format might be:

first bit = same as the first bit of the uncompressed string (to set initial polarity)
remaining bits: Elias coded lengths of successive bit runs (alternating 1 and 0)

Since you know how many bits are in your uncompressed string, you don't need a termination code; you can just add any necessary binary padding as arbitrary bits.

Note that it is always possible for the run-length "compression" to expand your bit string; if you're concerned about this, you can add another initial bit to indicate whether your data is in compressed or uncompressed format, limiting your compression overhead to 1 bit.

264 bit, that are just 33 byte, and that are in base64 just 44 byte. I think this (very small) amount of information is hardly compressable. The sparse representation nulvinge refers too just stores the non zero elements and their values (as you have just 0/1), i.e. in your case just the index of the non zero bits. But as you have 264 possible bits - you need 9 bits for the index, which means, in case you have more than 29 non zero entries, you need already more than original.

Maybe your question is formulated wrong, but I dont see how 264 bits can lead to an intimidating base64 string (How do you generate it - maybe you translate not the 264 bits, but 264 ASCIIs chars (with the value 0 and 1) - that would explain your long result string?).

An alternative that I think is more what you want is a sparse matrix: http://en.wikipedia.org/wiki/Sparse_matrix