help with regex - extracting text

Developer https://www.devze.com 2023-02-14 10:26 Source: web

Suppose I have some text files (f1.txt, f2.txt, ...) that look something like

@article {paper1,
author = {some author},
title = {some {T}itle} ,
journal = {journal},
volume = {16},
number = {4},
publisher = {John Wiley & Sons, Ltd.},
issn = {some number},
url = {some url},
doi = {some number},
pages = {1},
year = {1997},
}

I want to extract the content of title and store it in a bash variable (call it $title), that is, "some {T}itle" in the example. Notice that there may be nested curly braces inside the outer set of braces. Also, there might not be whitespace around "=", and there may be extra whitespace before "title".

Thanks so much. I just need a working example of how to extract this and I can extract the other stuff.

Give this a try:

title=$(sed -n '/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' inputfile)

Explanation:

/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ { - if a line matches this regex
s/// - delete the matched portion (an empty regex reuses the last regex that was matched)
s/}[^}]*$//p - delete the last closing curly brace and every character after it up to the end of the line, then print
} - end if
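
To check the command without touching real files, here is a small sketch that recreates a shortened version of the sample entry in a scratch file and runs the same sed call on it. The scratch file made with mktemp and the extra indentation before "title" are assumptions for the demo, and the one-liner is used exactly as written above, which I have only relied on with GNU sed.

```shell
# Scratch copy of the sample entry (shortened); the temp file is only for this demo.
inputfile=$(mktemp)
cat > "$inputfile" <<'EOF'
@article {paper1,
author = {some author},
   title={some {T}itle} ,
journal = {journal},
year = {1997},
}
EOF

# Same command as in the answer: match the title line, drop the prefix,
# then drop the last closing brace and whatever follows it.
title=$(sed -n '/^[[:blank:]]*title[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*$//p}' "$inputfile")

printf '%s\n' "$title"
rm -f "$inputfile"
```

For this input the variable should end up holding some {T}itle, with the inner braces kept.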

title=$(sed -n '/title *=/{s/^[^{]*{\([^,]*\),.*$/\1/;s/} *$//p}' ./f1.txt)

/title *=/: Only act upon lines which have the word 'title' followed by a '=' after an arbitrary number of spaces
s/^[^{]*{\([^,]*\),.*$/\1/: From the beginning of the line look for the first '{' character. From that point save everything you find until you hit a comma ','. Replace the entire line with everything you saved
s/} *$//p: strip off the trailing brace '}' along with any spaces and print the result.
title=$(sed -n ... ): save the result of the above 3 steps in the bash variable named title
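
One caveat with this version: it saves text only up to the first comma, so a title that itself contains a comma gets cut short, and the unanchored /title *=/ would also match a field such as booktitle. The sketch below shows the comma case on a made-up entry (paper2 and its title are hypothetical) next to a possible variant that anchors the field name and strips only the trailing brace plus an optional comma. Treat the variant as one way to do it, not the only one.

```shell
# Hypothetical entry whose title contains a comma.
f=$(mktemp)
cat > "$f" <<'EOF'
@article {paper2,
title = {one, two {T}hree},
year = {2001},
}
EOF

# Command from the answer: the capture stops at the first comma, so the
# second substitution finds no trailing brace and nothing is printed.
orig=$(sed -n '/title *=/{s/^[^{]*{\([^,]*\),.*$/\1/;s/} *$//p}' "$f")

# Variant: anchor the field name at the start of the line, delete everything
# up to the first '{', then delete the last '}' and an optional trailing comma.
fixed=$(sed -n '/^[[:blank:]]*title[[:blank:]]*=/{s/^[^{]*{//;s/}[[:blank:]]*,\{0,1\}[[:blank:]]*$//p;}' "$f")

printf 'orig:  [%s]\n' "$orig"
printf 'fixed: [%s]\n' "$fixed"
rm -f "$f"
```

On this entry the original command leaves the variable empty, while the variant keeps the whole title including the comma.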

There are definitely more elegant ways, but at 2:40AM:

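title=`grep "^\s*title\s*=\s*" test | sed 's/^\s*title\s*=\s*{\?//' | sed 's/}\?\s*,\s*$//'`

Grep for the line that interests us, strip everything up to and including the opening curly, then strip everything from the last curly to the end of the line. Note that in sed's default (basic) regex syntax a bare ? is a literal question mark, so the optional braces have to be written as {\? and }\?; both \? and \s are GNU extensions.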
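
Since the question mentions several files and wanting to extract the other fields too, here is a rough sketch that wraps the first answer's sed expression in a function and loops over the files. The function name get_field, the scratch directory, and the contents of the two sample files are all made up for illustration. The field name is pasted straight into the regex, so this assumes plain names like title or year.

```shell
# get_field FIELD FILE: print the contents of the outer braces of FIELD.
# Hypothetical helper; same sed logic as the first answer, with the field
# name substituted in by the shell (hence the double quotes and \$).
get_field() {
    sed -n "/^[[:blank:]]*$1[[:blank:]]*=[[:blank:]]*{/ {s///; s/}[^}]*\$//p;}" "$2"
}

# Two small sample files in a scratch directory (contents are assumptions).
dir=$(mktemp -d)
printf '%s\n' '@article {paper1,' 'author = {some author},' \
    'title = {some {T}itle} ,' 'year = {1997},' '}' > "$dir/f1.txt"
printf '%s\n' '@article {paper2,' '  title={another title},' \
    'year = {2001},' '}' > "$dir/f2.txt"

for f in "$dir"/f*.txt; do
    title=$(get_field title "$f")
    year=$(get_field year "$f")
    printf '%s: %s (%s)\n' "${f##*/}" "$title" "$year"
done

rm -rf "$dir"
```

With real data you would replace the scratch directory with the directory that holds f1.txt, f2.txt and so on.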