
How to identify a string pattern within a string but ignore matches that fall inside another identified pattern

开发者 https://www.devze.com 2023-02-11 03:46 Source: web
I want to search a string for occurrences of a substring that matches a specific pattern, and then write out the unique list of found strings, separated by commas. The pattern to look for is "$FOR_something", as long as the match does not fall inside "#LOOKING( )" or "/* */" and the _something part does not contain any other special characters.

For example, if I have this string,

  "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again"

The resulting list of found patterns I'm looking for from the above quoted string would be:

$FOR_five, $FOR_six

I started with this example:

import java.lang.StringBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class testIt {
public static void main(String args[]) {

String myWords = "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again";

StringBuffer sb = new StringBuffer(0);

if ( myWords.toUpperCase().contains("$FOR") )
{
   Pattern p = Pattern.compile("\\$FOR[\\_][a-zA-Z_0-9]+[\\s]*", Pattern.CASE_INSENSITIVE);
   Matcher m = p.matcher(myWords);

   String myFors = "";
   while (m.find())
   {
      myFors = myWords.substring( m.start() , m.end() ).trim();
      if ( sb.length() == 0 ) sb = sb.append(myFors);
      else
      {
         if ( !(sb.toString().contains(myFors))) sb = sb.append(", " + myFors );
      }
   }
}
System.out.println(sb);
}

}

But it is not giving me what I want. What I want is:

$FOR_five, $FOR_six

Instead, I get all of the $FOR_somethings. I don't know how to ignore the occurrences inside /* */ or #LOOKING( ). Any suggestions?


This problem goes beyond what a single regular expression handles comfortably, I would say. The $$$ patterns can be handled with a negative lookbehind; the other cases cannot be handled as easily.

What I would recommend is to first use tokenizing or manual string parsing to discard the unwanted data, such as /* ... */ or #LOOKING( ... ). This could, however, also be removed with further regexes. Note that replaceAll returns a new string, so the result has to be assigned back:

myWords = myWords.replaceAll("/\\*[^*/]+\\*/", "");      // removes /* ... */
myWords = myWords.replaceAll("#LOOKING\\([^)]+\\)", ""); // removes #LOOKING( ... )
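
A small sketch of this stripping step on its own, in case it helps. The class name StripContext and the helper strip are hypothetical. The comment pattern here is a swapped-in variant (a reluctant .*? with the (?s) flag) rather than the character class above, on the assumption that a comment body might itself contain '*' or '/' or span several lines. Nested parentheses inside #LOOKING( ... ) are not handled.

```java
public class StripContext {
    // Hypothetical helper: removes /* ... */ comments and #LOOKING( ... ) blocks.
    static String strip(String input) {
        // (?s) lets '.' match line breaks; the reluctant .*? stops at the first "*/".
        String noComments = input.replaceAll("(?s)/\\*.*?\\*/", "");
        // [^)]* stops at the first ')', so nested parentheses are not supported.
        return noComments.replaceAll("#LOOKING\\([^)]*\\)", "");
    }

    public static void main(String[] args) {
        String myWords = "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four";
        // Strings are immutable, so the stripped text is the return value.
        System.out.println(strip(myWords));
    }
}
```

If the input can contain nested or unbalanced delimiters, a hand-written scanner is probably the safer choice than either regex.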

Once the string is stripped of context-based content, you can use, e.g., the following regex:

(?<!\\$)\\$FOR_\\p{Alnum}+(?=[\\s;])

Explanation:

(?<!\\$)         // Match iff not prefixed with $
\\$FOR_          // Matches $FOR_
\\p{Alnum}+      // Matches one or more alphanumeric characters [a-zA-Z0-9]
(?=[\\s;])       // Match iff followed by space or ';'

Note that the (?<!...) and (?=...) constructs are known as lookbehind and lookahead expressions. They are zero-width, so the text they check is not included in the match itself; in the above sample they act only as prefix/suffix conditions.
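
Putting the two steps together, a minimal sketch could look like the following. The class name FindFors and the helper findFors are hypothetical, and the use of a LinkedHashSet to keep the matches unique in first-seen order is my own choice rather than something taken from the question's code. The regexes are the ones given above.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindFors {
    // Not preceded by '$', and followed by whitespace or ';'.
    private static final Pattern FOR_PATTERN =
            Pattern.compile("(?<!\\$)\\$FOR_\\p{Alnum}+(?=[\\s;])");

    // Hypothetical helper: returns the unique matches joined by ", ".
    static String findFors(String input) {
        // Step 1: discard /* ... */ and #LOOKING( ... ) content.
        String stripped = input
                .replaceAll("/\\*[^*/]+\\*/", "")
                .replaceAll("#LOOKING\\([^)]+\\)", "");

        // Step 2: collect matches; the set drops duplicates and keeps order.
        Set<String> found = new LinkedHashSet<>();
        Matcher m = FOR_PATTERN.matcher(stripped);
        while (m.find()) {
            found.add(m.group());
        }
        return String.join(", ", found);
    }

    public static void main(String[] args) {
        String myWords = "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ "
                + "not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six "
                + "and not $FOR-seven or $FOR_five again";
        System.out.println(findFors(myWords)); // prints: $FOR_five, $FOR_six
    }
}
```

One limitation of the lookahead as written: a $FOR_something at the very end of the input is followed by neither whitespace nor ';', so it would not match. Changing the lookahead to (?=[\\s;]|$) should cover that case if it matters for your data.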
