Extracting a string with a regex is unstable

开发者 (https://www.devze.com), 2023-04-12 01:50, source: web
Good morning,

I have the following Python regex script that we worked out in a previous post. It is meant to extract anything that looks like 'chr' + number + ':' + bignumber + '..' + bignumber (for example, chr1:100000..120000). If chr1 is replaced by chrX, the regex no longer works.

Here is the original script:

    # Opens each file to read/modify
    infile='myfile.txt'
    outfile='outfile.txt'

    #import Regex
    import re

    with open (infile, mode='r', buffering=-1) as in_f, open (outfile, mode='w', buffering=-1) as out_f:
        f = (i for i in in_f if '\t' in i.rstrip())
        for line in f:
            _, k = line.split('\t',1)
            x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$',k)
            if not x:
                continue
            out_f.write(' '.join(x[0]) + '\n')

If I change that line to:

    x = re.findall(r'^1..100\t([+-])chrX(\d+):(\d+)\.\.(\d+).+$',k)

I cannot extract anything that looks like chrX, etc. Also, you should know that some lines may be empty!

Help Please :) Thanks


I don't fully understand your question, but I will attempt to give some advice based on your code.

Here is the most important line:

    x = re.findall(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$',k)

Observations:

0) buffering=-1 is already the default for open(), so passing it explicitly changes nothing. I recommend you get rid of it and rely on the standard behavior. You can still process the file one line at a time by iterating over it, which is what you want for this case.

1) re.findall() returns a list of matches. However, by using $ in your pattern you have guaranteed that you will get at most one match, because each line can only have one end-of-line. So you should probably use re.search(). You could even use re.match() since you have a ^ to anchor to the start of the line.

2) I don't recommend your use of the .split() method to throw away everything up to the first tab. Just fold the tab into your regular expression. It's simpler and faster.

3) Your pattern requires that each line start with a string like this:

    1aa100
    100100
    1xx100
    1xy100

Is this what you wanted? Does each line start with a number that always ends in "100"? If it's always a number you might want to use \d instead of . in the pattern.

4) You require a tab after the number-like thing matched above. Then you have a match group, which matches either a '+' or a '-' and lets you collect the matched value. I'm curious what you will do with it.

5) The pattern chr\d+ will match chr0, chr1, chr11, chr111, etc. Any combination of digits, with a minimum length of 1 digit. I'm not sure if you expect it to actually match a capital 'X' (you talked about matching chrX) but it definitely won't.

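6) You match a number, two actual periods, and another number. This looks perfectly correct and good to me. Then, after the second number, you use a . and a + together. This requires one or more extra characters before the end of the line. I am wondering if this is causing your problem. Note that when nothing follows the second number, the regex backtracks and .+ swallows the number's last digit, so the captured value is silently truncated (and a one-digit number does not match at all). Perhaps you should use .* which matches zero or more extra characters?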

7) If you use re.match() instead of re.findall(), you won't need to use x[0] to get to the match group.

8) If you have a match object m, ' '.join(m) does not work. You get a TypeError. You need to use ' '.join(m.groups()) instead.

9) I think the pattern with chr and two numbers separated by .. is pretty good by itself, so maybe you can relax the rest of the pattern and just match on those.

10) I always like to pre-compile my regular expression patterns. It can be slightly faster in a loop (the re module caches recently used patterns, so the gain is usually small), and then you can use the methods on the compiled pattern. For example, if pat is a pre-compiled regular expression, you can use pat.search(line) to search a line of text.
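Several of the observations above (1, 5, 6, 7 and 8) can be checked directly in a short script. This is only a sketch: the sample lines are invented to resemble the format described in the question and are not taken from the real file.

```python
import re

# Hypothetical sample lines (assumed format: "1..100", tab, strand, chrN:start..end).
line_chr1 = "1..100\t+chr1:100000..120000\tsome text\n"
line_chrX = "1..100\t-chrX:5000..6000\tsome text\n"
line_bare = "1..100\t+chr2:300..400\n"  # nothing after the second number

original = r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).+$'
modified = r'^1..100\t([+-])chrX(\d+):(\d+)\.\.(\d+).+$'  # the asker's change

# Observations 1 and 7: findall gives a list of tuples, match gives a match object.
found = re.findall(original, line_chr1)
m = re.match(original, line_chr1)

# Observation 5: \d+ cannot match the letter X, and chrX(\d+) demands
# digits right after the X, so neither pattern matches "chrX:".
no_x_original = re.match(original, line_chrX)
no_x_modified = re.match(modified, line_chrX)

# Observation 6: with .+ the last digit of the end coordinate is lost
# when nothing follows it; .* keeps the full number.
strict = re.match(original, line_bare)
relaxed = re.match(r'^1..100\t([+-])chr(\d+):(\d+)\.\.(\d+).*$', line_bare)

# Observation 8: a match object is not iterable, so join its groups instead.
try:
    ' '.join(m)
    join_error = None
except TypeError as exc:
    join_error = exc
joined = ' '.join(m.groups())

print(found)
print(no_x_original, no_x_modified)
print(strict.groups(), relaxed.groups())
print(joined)
```

The `strict` result is the one to watch: the line still matches, but the end coordinate comes back as 40 instead of 400.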

Putting my suggestions together, here is some Python code for you to try out:

    import re

    infile='myfile.txt'
    outfile='outfile.txt'

    pat = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')

    with open(infile, mode='r') as in_f, open(outfile, mode='w') as out_f:
        for line in in_f:
            if '\t' not in line.rstrip():
                continue
            m = pat.search(line)
            if not m:
                continue
            out_f.write(' '.join(m.groups()) + '\n')

EDIT: Since you do seem to want to recognize the string chrX as valid, I changed the above example code. Instead of \d+ to match digits, it now uses [^:]+ to match one or more characters other than a colon. The above code should match chr1:, chrX:, or pretty much anything else now.
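To try the relaxed pattern without touching any files, the same logic can be run over a list of strings. This is a minimal sketch: the `extract` helper and the sample lines are hypothetical, made up to include a chrX entry and an empty line like the ones mentioned in the question.

```python
import re

pat = re.compile(r'([+-])chr([^:]+):(\d+)\.\.(\d+)')

# Hypothetical input, invented for illustration; the real file may differ.
sample_lines = [
    "id1\t1..100\t+chr1:100000..120000\tfoo\n",
    "id2\t1..100\t-chrX:5000..6000\n",
    "\n",                           # empty line: no tab, so it is skipped
    "id3\tno coordinates here\n",   # has a tab but nothing the pattern matches
]

def extract(lines):
    """Return one space-joined string per line that contains a match."""
    results = []
    for line in lines:
        if '\t' not in line.rstrip():
            continue
        m = pat.search(line)
        if m:
            results.append(' '.join(m.groups()))
    return results

results = extract(sample_lines)
print(results)
```

Because `[^:]+` accepts any run of non-colon characters, it will also capture names you may not want. If only numbered chromosomes plus X and Y should pass, a narrower group such as `(\d+|[XY])` is one option, though that is an assumption about the data rather than something stated in the question.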
