Removing allowable whitespace in PHP using parser tokens

开发者 (Developer) https://www.devze.com 2023-02-08 23:35 Source: the web

I am trying to make a simple script that will remove all unnecessary whitespace from a PHP file/string.

I have been successful at parsing the string using tokens, but don't see a good method to remove "extra" whitespace.

For instance,

function test() { return TRUE; }

should be

function test(){return TRUE;}

and NOT

functiontest(){returnTRUE;}

You end up with the last version if you just remove the T_WHITESPACE token.

Is there something I am missing that would let me remove whitespace but keep the spaces after things like "function " and "return "? Thank you!

$tokens    = token_get_all($source);
$newSource = '';
foreach ($tokens as $i => $token) {
    // Single-character tokens such as ; { } ( ) are plain strings: copy them through.
    if (!is_array($token)) {
        $newSource .= $token;
        continue;
    }

    if ($token[0] == T_WHITESPACE) {
        // Keep one space only when the whitespace sits between two labels.
        if (   isset($tokens[$i - 1])      && isset($tokens[$i + 1])
            && is_array($tokens[$i - 1])   && is_array($tokens[$i + 1])
            && isLabel($tokens[$i - 1][1]) && isLabel($tokens[$i + 1][1])
        ) {
            $newSource .= ' ';
        }
    } else {
        $newSource .= $token[1];
    }
}

function isLabel($str) {
    return preg_match('~^[a-zA-Z0-9_\x7f-\xff]+$~', $str);
}

Removing the whitespace is always allowed, apart from the case where there is a LABEL on both sides of it. I check that and either add nothing or a single space character.

There is just one other special case I know of where whitespace matters: T_END_HEREDOC must be followed by either ; or \n. Compacting or stripping the whitespace there is not allowed. So, if that matters to you, you can simply add a check for it ;)
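As a rough sketch of how the label rule and the heredoc rule could be combined, the function below keeps one space between two label-like tokens and forces a newline after a heredoc end marker (plus an optional `;`), similar in spirit to what `php -w` does. The name `stripWhitespace` is made up for this example, and it assumes the input contains no comments, since removing the newline after a `//` comment would swallow the code that follows it.

```php
<?php
// Hypothetical helper names; a sketch, not a complete minifier.
// Assumption: the source contains no comments.

function isLabel(string $str): bool
{
    return preg_match('~^[a-zA-Z0-9_\x7f-\xff]+$~', $str) === 1;
}

function stripWhitespace(string $source): string
{
    $tokens = token_get_all($source);
    $count  = count($tokens);
    $out    = '';

    for ($i = 0; $i < $count; $i++) {
        $token = $tokens[$i];

        // Single-character tokens are plain strings.
        if (!is_array($token)) {
            $out .= $token;
            continue;
        }

        if ($token[0] === T_END_HEREDOC) {
            // Emit the end marker, an optional ';', then a mandatory newline.
            $out .= $token[1];
            if (isset($tokens[$i + 1]) && $tokens[$i + 1] === ';') {
                $out .= ';';
                $i++;
            }
            $out .= "\n";
            // Skip the whitespace token that originally followed, if any.
            if (isset($tokens[$i + 1]) && is_array($tokens[$i + 1])
                && $tokens[$i + 1][0] === T_WHITESPACE) {
                $i++;
            }
            continue;
        }

        if ($token[0] === T_WHITESPACE) {
            $prev = $tokens[$i - 1] ?? null;
            $next = $tokens[$i + 1] ?? null;
            // Keep a single space only between two label-like tokens.
            if (is_array($prev) && is_array($next)
                && isLabel($prev[1]) && isLabel($next[1])) {
                $out .= ' ';
            }
            continue;
        }

        $out .= $token[1];
    }

    return $out;
}
```

With this sketch, `stripWhitespace("<?php\nfunction test() { return TRUE; }\n")` should give `"<?php\nfunction test(){return TRUE;}"`.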

Well, T_WHITESPACE can be spaces or newlines, etc. So one trivial approach would be to automatically replace all T_WHITESPACE instances with a new one that consists of exactly one space.
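A minimal sketch of that trivial approach might look like the following. The function name `collapseWhitespace` is invented for this example, and it assumes the source has no `//` or `#` line comments and no heredocs, because turning the newline after either of those into a space would change the meaning of the code.

```php
<?php
// Hypothetical helper: replace every T_WHITESPACE token with one space.
// Assumption: no line comments and no heredocs in the input.

function collapseWhitespace(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character token such as ; or {
            $out .= $token;
            continue;
        }
        $out .= ($token[0] === T_WHITESPACE) ? ' ' : $token[1];
    }
    return $out;
}
```

The result is not minimal (spaces remain around braces and after semicolons), but every token boundary that needed a separator still has one.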

But for a smarter method, just go through the list of parser tokens and figure out which ones should have whitespace after them and which should not (something like this):

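foreach ($tokens as $k => $val) {
    if (is_array($val) && $val[0] == T_WHITESPACE) {
        if (!isset($tokens[$k - 1]) || !is_array($tokens[$k - 1])) {
            // previous token is a single character such as ; or {: remove this space
            $tokens[$k][1] = '';
        } else {
            switch ($tokens[$k - 1][0]) {
                case T_ABSTRACT:
                case T_FUNCTION:
                //.. other keeps here:
                    // keep this whitespace; "continue 2" targets the foreach,
                    // a bare "continue" inside a switch only acts like "break"
                    continue 2;
                default:
                    // remove the space
                    $tokens[$k][1] = '';
            }
        }
    }
}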
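For illustration, the keep-list idea above can be packaged as a function that also rebuilds the source string. The names `stripWithKeepList` and `$keepAfter` are made up for this sketch, the keep list shown in the usage note is deliberately incomplete, and comments and heredocs are assumed to be absent from the input.

```php
<?php
// Hypothetical helper: drop whitespace unless the previous token's type
// appears in $keepAfter, in which case a single space is kept.
// Assumption: no comments and no heredocs in the input.

function stripWithKeepList(string $source, array $keepAfter): string
{
    $tokens = token_get_all($source);
    $out    = '';

    foreach ($tokens as $k => $val) {
        if (!is_array($val)) {
            // Single-character token such as ; or {
            $out .= $val;
            continue;
        }
        if ($val[0] !== T_WHITESPACE) {
            $out .= $val[1];
            continue;
        }
        $prev = $tokens[$k - 1] ?? null;
        if (is_array($prev) && in_array($prev[0], $keepAfter, true)) {
            $out .= ' ';
        }
    }

    return $out;
}
```

Called as `stripWithKeepList("<?php\nfunction test() { return TRUE; }\n", [T_ABSTRACT, T_FUNCTION, T_RETURN])`, this should return `"<?php\nfunction test(){return TRUE;}"`. A real keep list would need many more token types than these three.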

One more note: don't do this for performance. If you're using an opcode cache (such as APC), you'll get no benefit in exchange for a lot of work. If you're not using one, why aren't you?

Your effort is futile.

php -w

This option already strips comments and whitespace from scripts. It uses somewhat more elaborate logic to remove whitespace from the token stream.
Here's the zend_strip() function as found in zend_highlight.c:

while ((token_type=lex_scan(&token TSRMLS_CC))) {
    switch (token_type) {
        case T_WHITESPACE:
            if (!prev_space) {
                zend_write(" ", sizeof(" ") - 1);
                prev_space = 1;
            }
                    /* lack of break; is intentional */
        case T_COMMENT:
        case T_DOC_COMMENT:
            token.type = 0;
            continue;

        case T_END_HEREDOC:
            zend_write(LANG_SCNG(yy_text), LANG_SCNG(yy_leng));
            efree(token.value.str.val);
            /* read the following character, either newline or ; */
            if (lex_scan(&token TSRMLS_CC) != T_WHITESPACE) {
                zend_write(LANG_SCNG(yy_text), LANG_SCNG(yy_leng));
            }
            zend_write("\n", sizeof("\n") - 1);
            prev_space = 1;
            token.type = 0;
            continue;

        default:
            zend_write(LANG_SCNG(yy_text), LANG_SCNG(yy_leng));
            break;
    }

    if (token.type == IS_STRING) {
        switch (token_type) {
            case T_OPEN_TAG:
            case T_OPEN_TAG_WITH_ECHO:
            case T_CLOSE_TAG:
            case T_WHITESPACE:
            case T_COMMENT:
            case T_DOC_COMMENT:
                break;

            default:
                efree(token.value.str.val);
                break;
        }
    }
    prev_space = token.type = 0;
}
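The same kind of stripping is also available from PHP code through the built-in `php_strip_whitespace()` function, which takes a file name and returns the source with comments and whitespace stripped. The snippet below is only a usage sketch: the temporary file and its contents are made up for the example.

```php
<?php
// Write a small sample script to a temporary file (hypothetical content).
$path = tempnam(sys_get_temp_dir(), 'strip');
file_put_contents(
    $path,
    "<?php\n// a comment\nfunction test() {\n    return TRUE;\n}\n"
);

// php_strip_whitespace() works on a file name, not on a source string.
$stripped = php_strip_whitespace($path);
unlink($path);

echo $stripped;
```

Note that, like `php -w`, this collapses runs of whitespace to a single space rather than removing every optional space, so the output is shorter but not fully minified.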