How can I find the starting index of a string within a UTF-8 byte array? (C#)

Developer https://www.devze.com 2023-01-22 02:42 Source: web
I have a UTF-8 byte array of data. I would like to search for a specific string in the array of bytes in C#.

byte[] dataArray = (some UTF-8 byte array of data);

string searchString = "Hello";

How do I find the first occurrence of the word "Hello" in the array dataArray and return an index location where the string begins (where the 'H' from 'Hello' would be located in dataArray)?

Before, I was erroneously using something like:

int helloIndex = Encoding.UTF8.GetString(dataArray).IndexOf("Hello");

Obviously, that code is not guaranteed to work, since it returns a character index within the decoded string rather than a byte offset within the UTF-8 byte array. Are there any built-in C# methods or proven, efficient code I can reuse?

Thanks,

Matt


One of the nice features of UTF-8 is that if a sequence of bytes represents a character and that sequence appears anywhere in valid UTF-8 encoded data, then it always represents that character.
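To make that property concrete, here is a small sketch that dumps the UTF-8 bytes of a string and classifies a byte as a continuation byte. The class and method names (Utf8Bytes, Hex, IsContinuationByte) are made up for illustration. ASCII bytes, lead bytes and continuation bytes occupy disjoint ranges, which is why the encoded form of a whole string cannot match starting in the middle of another character.

```csharp
using System;
using System.Text;

public static class Utf8Bytes
{
    // Hex dump of the UTF-8 encoding of s, e.g. "CA-A4-68" for "ʤh".
    public static string Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    // Continuation bytes always have the bit pattern 10xxxxxx (0x80..0xBF).
    // ASCII bytes are 0xxxxxxx and lead bytes start with 11, so the ranges never overlap.
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }
}
```

For "ʤh", the two bytes CA A4 encode 'ʤ' (U+02A4) and 68 encodes 'h'. A4 is a continuation byte, so it can never be mistaken for the start of a character.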

Knowing this, you can convert the string you are searching for to a byte array and then use the Boyer-Moore string searching algorithm (or any other string searching algorithm you like) adapted slightly to work on byte arrays instead of strings.
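As a rough sketch of that idea, the following uses the Horspool simplification of Boyer-Moore (a swapped-in variant, not the full algorithm) directly on byte arrays. The class name Utf8Search and both method names are hypothetical. It assumes the data is valid UTF-8 and that the search string should match byte-for-byte, with no normalization or case folding.

```csharp
using System;
using System.Text;

public static class Utf8Search
{
    // Returns the byte offset of the first occurrence of needle in haystack, or -1.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (haystack == null) throw new ArgumentNullException(nameof(haystack));
        if (needle == null) throw new ArgumentNullException(nameof(needle));
        if (needle.Length == 0) return 0;
        if (needle.Length > haystack.Length) return -1;

        // Horspool shift table: for each possible byte value, how far the
        // search window may slide when that byte is under the window's last slot.
        int[] shift = new int[256];
        for (int i = 0; i < shift.Length; i++) shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++) shift[needle[i]] = needle.Length - 1 - i;

        int last = needle.Length - 1;
        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare the window against the needle from right to left.
            int j = last;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos; // every byte matched

            pos += shift[haystack[pos + last]];
        }
        return -1;
    }

    // Convenience wrapper: encode the search string as UTF-8, then search the bytes.
    public static int IndexOfUtf8(byte[] utf8Data, string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return IndexOf(utf8Data, Encoding.UTF8.GetBytes(text));
    }
}
```

With the names from the question, the call would be Utf8Search.IndexOfUtf8(dataArray, searchString). The result is a byte offset into dataArray, or -1 if the string does not occur. On runtimes that ship System.Span<T>, the built-in span IndexOf extension performs a comparable byte-sequence search and may be enough on its own.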

There are a number of answers here that can help you:

  • byte[] array pattern search


Try the following snippet:

using System;
using System.Text;

// Set up our little test: 'ʤ' (U+02A4) takes two bytes in UTF-8.
string sourceText = "ʤhello";
byte[] dataBytes = Encoding.UTF8.GetBytes(sourceText);

// Convert the bytes into a string we can search in.
string searchText = Encoding.UTF8.GetString(dataBytes);

// Ordinal comparison keeps the search independent of the current culture.
int position = searchText.IndexOf("hello", StringComparison.Ordinal);

// IndexOf returns -1 when there is no match, and Substring would throw on that.
if (position >= 0)
{
    // Get all text that is before the position we found.
    string before = searchText.Substring(0, position);

    // The length of the encoded bytes is the actual number of UTF-8 bytes
    // instead of the character position.
    int bytesBefore = Encoding.UTF8.GetBytes(before).Length;

    // This outputs "Position is 1 and before is 2".
    Console.WriteLine("Position is {0} and before is {1}", position, bytesBefore);
}
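The same decode-then-count approach could be wrapped in a helper along these lines. The names Utf8DecodeSearch and ByteIndexOf are made up for illustration. One assumption worth stating: the sketch uses a strict UTF8Encoding that throws on invalid bytes, because the default Encoding.UTF8 replaces invalid bytes with U+FFFD, which re-encodes as three bytes and can shift the computed offset.

```csharp
using System;
using System.Text;

public static class Utf8DecodeSearch
{
    // Strict UTF-8: no BOM on encode, throw instead of substituting U+FFFD on decode.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns the byte offset of the first occurrence of text in utf8Data, or -1.
    public static int ByteIndexOf(byte[] utf8Data, string text)
    {
        if (utf8Data == null) throw new ArgumentNullException(nameof(utf8Data));
        if (text == null) throw new ArgumentNullException(nameof(text));

        // Decode everything, then search in the decoded string.
        string decoded = Strict.GetString(utf8Data);
        int charIndex = decoded.IndexOf(text, StringComparison.Ordinal);
        if (charIndex < 0) return -1;

        // The byte offset equals the encoded size of the text before the match.
        return Strict.GetByteCount(decoded.Substring(0, charIndex));
    }
}
```

This decodes the whole array and allocates a prefix string, so for large inputs a direct byte-level search is likely the cheaper option. The decode-based version is mainly attractive for its brevity.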