iOS HTML Unicode to NSString?

开发者 https://www.devze.com 2023-04-09 04:18 Source: web
I'm in the process of porting an Android app to iOS and I've hit a small roadblock. I'm pulling HTML-encoded data from a webpage, but some of the data is presented as numeric Unicode character references to display foreign characters... so characters in Russian (Лети за мной) will be parsed out as "&#1051;&#1077;&#1090;..."

On Android I was able to get around this by calling Html.fromHtml(). Is there anything similar in iOS?


It's pretty easy to write your own HTML entity decoder. Just scan the string looking for &, read up to the following ;, then interpret the results. If it's "amp", "lt", "gt", or "quot", replace it with the relevant character. If it starts with #, it's a numeric entity. If the # is followed by an "x", treat the rest as hexadecimal, otherwise as decimal. Read the number, and then insert the character into your string (if you're writing to an NSMutableString you can use [str appendFormat:@"%C", thechar]). NSScanner can make the string scanning pretty easy, especially since it already knows how to read hex numbers.

I just whipped up a function that should do this for you. Note, I haven't actually tested this, so you should run it through its paces:

- (NSString *)stringByDecodingHTMLEntitiesInString:(NSString *)input {
    NSMutableString *results = [NSMutableString string];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    [scanner setCharactersToBeSkipped:nil];
    while (![scanner isAtEnd]) {
        NSString *temp;
        if ([scanner scanUpToString:@"&" intoString:&temp]) {
            [results appendString:temp];
        }
        if ([scanner scanString:@"&" intoString:NULL]) {
            BOOL valid = YES;
            unsigned c = 0;
            NSUInteger savedLocation = [scanner scanLocation];
            if ([scanner scanString:@"#" intoString:NULL]) {
                // it's a numeric entity
                if ([scanner scanString:@"x" intoString:NULL]) {
                    // hexadecimal
                    unsigned int value;
                    if ([scanner scanHexInt:&value]) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                } else {
                    // decimal
                    int value;
                    if ([scanner scanInt:&value] && value >= 0) {
                        c = value;
                    } else {
                        valid = NO;
                    }
                }
                if (![scanner scanString:@";" intoString:NULL]) {
                    // not ;-terminated, bail out and emit the whole entity
                    valid = NO;
                }
            } else {
                if (![scanner scanUpToString:@";" intoString:&temp]) {
                    // &; is not a valid entity
                    valid = NO;
                } else if (![scanner scanString:@";" intoString:NULL]) {
                    // there was no trailing ;
                    valid = NO;
                } else if ([temp isEqualToString:@"amp"]) {
                    c = '&';
                } else if ([temp isEqualToString:@"quot"]) {
                    c = '"';
                } else if ([temp isEqualToString:@"lt"]) {
                    c = '<';
                } else if ([temp isEqualToString:@"gt"]) {
                    c = '>';
                } else {
                    // unknown entity
                    valid = NO;
                }
            }
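            if (!valid) {
                // we errored, just emit the whole thing raw; savedLocation was
                // recorded after the "&" was consumed, so put the "&" back first
                [results appendString:@"&"];
                [results appendString:[input substringWithRange:NSMakeRange(savedLocation, [scanner scanLocation]-savedLocation)]];
            } else if (c >= 0x10000 && c <= 0x10FFFF) {
                // %C takes a 16-bit unichar, so code points beyond the BMP
                // have to be written as a UTF-16 surrogate pair
                unichar pair[2] = { (unichar)(0xD800 + ((c - 0x10000) >> 10)),
                                    (unichar)(0xDC00 + ((c - 0x10000) & 0x3FF)) };
                [results appendString:[NSString stringWithCharacters:pair length:2]];
            } else {
                // values above 0x10FFFF are not valid code points and are truncated here
                [results appendFormat:@"%C", (unichar)c];
            }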
        }
    }
    return results;
}


The &#(number); construct in HTML (and XML) is known as a character reference. It's not Unicode-specific, other than in that all characters in HTML are defined in terms of Unicode, whether included verbatim or encoded as a character or entity reference. (Entity references are the named ones that look like &eacute; or &amp; and if you are scraping an HTML page you will certainly have to deal with those as well.)

There isn't a function in the standard library for decoding character or entity references. See this question for approaches to decoding HTML text content. If you only have character references and the standard XML entities like &amp; you can get away with leveraging NSXMLParser to parse an <element>+yourstring+</element>, but this won't handle HTML-specific entities like &eacute;.
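As a rough sketch of the NSXMLParser trick just described: wrap the string in a dummy element, parse it, and collect whatever the parser reports through the delegate's foundCharacters callback. The class name EntityTextCollector and the function DecodeXMLReferences are hypothetical names made up for this illustration, and the sketch assumes the input contains only character references and the five standard XML entities; anything that is not well-formed XML, including HTML-only entities such as &eacute;, is expected to make the parse fail, in which case the function returns nil.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: accumulates the character data NSXMLParser reports.
@interface EntityTextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation EntityTextCollector
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

// The parser delivers already-decoded text here, possibly in several pieces.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
@end

// Hypothetical function: returns the decoded text, or nil if the wrapped
// input could not be parsed as XML.
static NSString *DecodeXMLReferences(NSString *input) {
    NSString *wrapped = [NSString stringWithFormat:@"<element>%@</element>", input];
    NSData *data = [wrapped dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    EntityTextCollector *collector = [[EntityTextCollector alloc] init];
    parser.delegate = collector; // the delegate is not retained by the parser
    return [parser parse] ? [collector.text copy] : nil;
}
```

Calling DecodeXMLReferences(@"&#1051;&#1077;&#1090; &amp; co") should give back the decoded Cyrillic letters followed by " & co". Note that any real markup inside the input is treated as XML elements, so only the text between the tags ends up in the result.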

In general, screen-scraping is best done using a proper HTML parser, rather than string-hacking. This will convert all text content into text nodes, converting the character and entity references as it goes. However, again, there is no HTML parser available in the standard library. If the target page is well-formed standalone XHTML you can again use NSXMLParser. Otherwise you might like to try libxml2, which offers an HTML parser as well as XML. See this question for some background.
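For the libxml2 route, something along these lines might serve as a starting point. It uses libxml2's HTML parser (htmlReadMemory) in recovery mode and then asks for the text content of the whole tree, which comes back with character and entity references already converted. The function name TextContentOfHTML is hypothetical, and the sketch assumes the project links against libxml2 (-lxml2) with its headers on the include path (typically /usr/include/libxml2).

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical helper: parses an HTML string with libxml2 and returns the
// concatenated text content of the document, or nil on failure.
static NSString *TextContentOfHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    // RECOVER lets the parser cope with tag soup; the other two flags
    // keep it from printing diagnostics to stderr.
    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return nil;
    }
    NSString *result = nil;
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL) {
        // xmlNodeGetContent returns a newly allocated UTF-8 buffer
        xmlChar *content = xmlNodeGetContent(root);
        if (content != NULL) {
            result = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return result;
}
```

This takes the text of the entire document, so on a real page it would also pick up script and style contents; in practice you would walk the tree (or use XPath) to reach the specific nodes you care about before extracting their content.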


If you get data from a website, you will have an NS(Mutable)Data object as your receive buffer. You just have to transform that NSData into an NSString via:
NSString *myString = [[NSString alloc] initWithData:myRecvData encoding:NSUnicodeStringEncoding];
if your server is sending Unicode (NSUnicodeStringEncoding means UTF-16). If your server is sending UTF-8 or some other encoding, adjust the string encoding in your receiving code to match, for example NSUTF8StringEncoding.
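A minimal sketch of that conversion, assuming the server most likely sends UTF-8: try UTF-8 first and fall back to Latin-1 if the bytes are not valid UTF-8. The function name StringFromResponseData is hypothetical, and the fallback choice is only an assumption; ideally the encoding comes from the response's Content-Type header. Keep in mind that picking the right byte encoding is a separate step from decoding character references such as &#1051;, which remain in the resulting string as literal text.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turns a received response body into an NSString.
static NSString *StringFromResponseData(NSData *data) {
    // initWithData:encoding: returns nil when the bytes are not valid
    // in the requested encoding.
    NSString *string = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    if (string == nil) {
        // Assumed fallback: Latin-1 maps each byte to one character,
        // so it should accept any byte sequence.
        string = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    }
    return string;
}
```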

Here is a list of all supported string encoding types.

Edit: take a look at this SO thread.
