I'm trying to create a regular expression to capture in-text citations.
Here are a few example sentences with in-text citations:
... and the reported results in (Nivre et al., 2007) were not representative ...
... two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that ...
... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).
Currently, the regular expression I have is
\(\D*\d\d\d\d\)
It matches examples 1-3, but not example 4. How can I modify it to capture example 4?
Thanks!
I’ve been using something like this for that purpose lately:
#!/usr/bin/env perl
use 5.010;
use utf8;
use strict;
use autodie;
use warnings qw< FATAL all >;
use open qw< :std IO :utf8 >;
my $citation_rx = qr{
\( (?:
\s*
# optional author list
(?:
# has to start capitalized
\p{Uppercase_Letter}
# then have a lower case letter, or maybe an apostrophe
(?= [\p{Lowercase_Letter}\p{Quotation_Mark}] )
# before a run of letters and admissible punctuation
[\p{Alphabetic}\p{Dash_Punctuation}\p{Quotation_Mark}\s,.] +
) ? # hook if and only if you want the authors to be optional!!
# a reasonable year
\b (18|19|20) \d\d
# citation series suffix, up to a six-parter
[a-f] ? \b
# trailing semicolon to separate multiple citations
; ?
\s*
) +
\)
}x;
while (<DATA>) {
while (/$citation_rx/gp) {
say ${^MATCH};
}
}
__END__
... and the reported results in (Nivré et al., 2007) were not representative ...
... two systems used a Markov chain approach (Sagae and Tsujii 2007).
Nivre (2007) showed that ...
... for attaching and labelling dependencies (Chen et al., 2007; Dredze et al., 2007).
When run, it produces:
(Nivré et al., 2007)
(Sagae and Tsujii 2007)
(2007)
(Chen et al., 2007; Dredze et al., 2007)
Building on Tex's answer, I've written a very simple Python script called Overcite to do this for a friend (end of semester, lazy referencing, you know how it is). It's open source and MIT licensed on Bitbucket.
It covers a few more cases than Tex's, which might be helpful (see the test file), including ampersands and references with page numbers. The whole script is basically:
import re

# Raw strings keep the regex backslashes intact; literal dots are escaped.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"
matches = re.findall(regex, text)  # text holds the document to search
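To see what this kind of pattern does and does not pick up, here is a small self-contained sketch. It rebuilds the same pieces inside a function (with the literal dots escaped, which is my assumption about the intent), and the names `build_citation_regex` and `find_citations` as well as the "Smith & Jones" sample are made up for illustration.

```python
import re

def build_citation_regex():
    """Assemble the author/year pattern piece by piece."""
    author = r"(?:[A-Z][A-Za-z'`-]+)"
    etal = r"(?:et al\.?)"
    additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
    year_num = r"(?:19|20)[0-9][0-9]"
    page_num = r"(?:, p\.? [0-9]+)?"  # always optional
    year = (r"(?:, *" + year_num + page_num
            + r"| *\(" + year_num + page_num + r"\))")
    return re.compile("(" + author + additional + "*" + year + ")")

CITATION = build_citation_regex()

def find_citations(text):
    """Return each author/year citation found in text."""
    return CITATION.findall(text)

# Hypothetical sample mixing the narrative and parenthetical styles.
sample = "Nivre (2007) showed gains (Chen et al., 2007; Smith & Jones, 2010, p. 4)."
print(find_citations(sample))
# ['Nivre (2007)', 'Chen et al., 2007', 'Smith & Jones, 2010, p. 4']

# The year must follow a comma or sit in its own parentheses,
# so the comma-less form from the question is not matched.
print(find_citations("(Sagae and Tsujii 2007)"))
# []
```

Note that each citation inside a shared pair of parentheses comes back as a separate match, without the surrounding parentheses.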
\((.+?)\)
should capture all of them
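One thing to keep in mind is that this pattern keys only on the parentheses, so it picks up every parenthesized span, citation or not. A rough sketch of what that looks like, plus one possible way to filter the results afterwards; the year check and the names `find_parenthesized` and `looks_like_citation` are my own assumptions, not part of the answer:

```python
import re

PAREN = re.compile(r"\((.+?)\)")
# Assumed filter: keep spans containing a plausible year (1800-2099),
# optionally followed by a series suffix such as "2007a".
YEAR = re.compile(r"\b(?:18|19|20)\d{2}[a-f]?\b")

def find_parenthesized(text):
    """Return the contents of every (...) span in text."""
    return PAREN.findall(text)

def looks_like_citation(span):
    """Heuristic: treat a span as a citation if it contains a year."""
    return YEAR.search(span) is not None

text = "The parser (see Figure 2) beat the baseline (Nivre et al., 2007)."
spans = find_parenthesized(text)
print(spans)
# ['see Figure 2', 'Nivre et al., 2007']
print([s for s in spans if looks_like_citation(s)])
# ['Nivre et al., 2007']
```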
/\(\D*\d\d\d\d(?:;\D*\d\d\d\d)*\)/
All you need is to insert a pattern that matches zero or more occurrences of your citation pattern, each preceded by a semicolon. Conceptually, it's: \(cite(; cite)*\)
The pattern is: \(\D*\d{4}(;\D*\d{4})*\)
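As a quick check, here is the same pattern run against the four sentences from the question. This is only a sketch in Python's `re` module; the group is written as non-capturing so that `findall` returns the whole match, and `find_citation_groups` is just a name I picked.

```python
import re

# One citation, then zero or more "; citation" repetitions,
# all inside a single pair of parentheses.
CITATION_GROUP = re.compile(r"\(\D*\d{4}(?:;\D*\d{4})*\)")

SAMPLES = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).",
]

def find_citation_groups(text):
    """Return every parenthesized citation group found in text."""
    return CITATION_GROUP.findall(text)

for sentence in SAMPLES:
    print(find_citation_groups(sentence))
# ['(Nivre et al., 2007)']
# ['(Sagae and Tsujii 2007)']
# ['(2007)']
# ['(Chen et al., 2007; Dredze et al., 2007)']
```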
This is my solution, in C++ with boost regex. Hope it helps someone :-)
#include <string>
#include <boost/algorithm/string.hpp>
#include <boost/algorithm/string_regex.hpp>
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
using namespace boost;
int Section::countCitations() {
string Personsname = "([A-Z][A-Za-z'`-]+)"; // Apostrophes like in "D'Alembert" and hyphens like in "Flycht-Eriksson".
string YearPattern = "(, *(19|20)[0-9][0-9]| ?\\( *(19|20)[0-9][0-9]\\))"; // Either Descartes, 1990 or Descartes (1990) are accepted.
string etal = "( et al.?)"; // You may find this
string andconj = Personsname + " and " + Personsname;
string commaconj = Personsname + ", " + "(" + Personsname + "|"+"et al.?"+")"; // Some authors write citations like "A, B, et al. (1995)". The comma before the "et al" pattern is not rare.
string totcit = Personsname+"?"+etal+"?"+"("+andconj+"|"+commaconj+")*"+etal+"?"+YearPattern;
// Matches the following cases:
// Xig et al. (2004);
// D'Alembert, Rutherford et al (2008);
// Gino, Nino and Rino, Pino (2007)
// (2009)
// Gino, et al. (2005)
cout << totcit << endl;
regex citationform(totcit);
int count = 0;
string_range citation;
string running_text(text.begin(), text.end() );
while ((citation = find_regex(running_text, citationform)) ) { // Getting the last one
++count;
string temp(running_text.begin(), citation.end() );
running_text = running_text.substr( temp.length()-1 );
}
return count;
}
So far, this one works for me:
\s\([^(]*?\d{4}.*?\)
Depending on what you're trying to achieve, you may want to remove the leading white space (\s). I have it there because I want to remove the captured citations, and if I don't include the white space, I will end up with a space between the word before the citation and the punctuation after it.
It captures all the examples mentioned in the question (see https://regex101.com/r/BwBVif/1).
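To illustrate the point about the leading white space, here is a small sketch that removes citations with and without the \s prefix. It uses Python's `re.sub`; the function name `strip_citations` is only for illustration.

```python
import re

WITH_SPACE = re.compile(r"\s\([^(]*?\d{4}.*?\)")
WITHOUT_SPACE = re.compile(r"\([^(]*?\d{4}.*?\)")

def strip_citations(text, pattern=WITH_SPACE):
    """Delete every match of pattern from text."""
    return pattern.sub("", text)

sentence = "... two systems used a Markov chain approach (Sagae and Tsujii 2007)."

print(strip_citations(sentence))
# ... two systems used a Markov chain approach.

print(strip_citations(sentence, WITHOUT_SPACE))
# ... two systems used a Markov chain approach .
```

Without the \s, the space that preceded the citation is left behind in front of the period.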