shortest digest of a string_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2022-12-21 15:37 出处：网络

[Description] Given a string of char type, find a shortest digest, which is defined as: a shortest sub-string which contains all the characters in the original string.

相关专题：algorithm

[Description] Given a string of char type, find a shortest digest, which is defined as: a shortest sub-string which contains all the characters in the original string.

[Example]

A = "aaabedacd"

B = "bedac" is the answer.

[My solution]

Define an integer 开发者_运维技巧table with 256 elements, which is used to record the occurring times for each kind of character in the current sub-string.
Scan the whole string, statistic the total kinds of character in the given string by using the above table.
Use two pointers, start, end, which are initially pointing to the start and (start + 1) of the given string. The current kinds of character is 1.
Expand sub-string[start, end) at the end until it contains all kinds of character. Update the shortest digest if possible.
Contract sub-string[start, end] at the start by one character each time, try to restore its digest property if necessary by step 4.

The time cost is O(n), and the extra space cost is constant.

Any better solution without extra space?

That works very poorly, when you consider the fact that characters are not in fact limited to 256; there are closer to 2^32 codepoints in Unicode; and if you try what you are planning on an UTF-8 string, it is going to blow up. In a big way.

A better approach would be to use a digest algorithm like MD5 or FNV, or doing what you are doing, but rather with a ~~linked list~~ sparse array; adding the codepoints of the characters as you encounter them, and concatenating the codepoints afterwards, converting to, say, UTF-8 as you go.

EDIT:

Counterexample: "På japansk heter regn '雨'."

I don't think your algorithm is correct. Consider the string: "baaabedacdc". The correct answer is still "bedac", but your algorithm will advance the start pointer forward until it finds the "e" (the only character with an occurrence count of 1), and then the end pointer backward until it finds the "e" (the only character with an occurrence count of 1), for a result of "e".

I may have misunderstood the algorithm, though.

Why not use a simpler algorithm for this question, may not be most time efficient, but it works ;) :

step 1. (Asssuming we are dealing with character set of 26 alphabets), create a boolean array of size 26 and scan through the string and check the boolean corresponding to the character. For instance, set elem[0] = true when you run into a, elem[1] = true when you run into b etc.

step 2. create a string using the characters where elem[x] = true. so, the string for this case would be "abcde" and its length = 5.

step 3. traverse through the given string a second time while extracting sub-strings of length 5, sorting them in ascending order and matching them with the string from step 2.