How can I get XML child elements by matching their contents with a regexp, without knowing their names?

开发者 https://www.devze.com 2023-04-12 04:16 Source: web

I have XML which looks like this when simplified:

node_set = Nokogiri::XML('
<PARENT>
   <SOME_TAG>12:12:1222</SOME_TAG>
   <HOLY_TAG>12:12:1222</HOLY_TAG>
   <MAJOR_TAG>12:12:1222</MAJOR_TAG>
   <FOO_FOO>12:12:1222</FOO_FOO>
</PARENT>'
)

All I know is how to write a regexp for the contents, like this:

(\d+):(\d+):(\d+)

I read some articles on regexp matching on the official site, but they don't explain how to do this. They only describe the mechanism for calling user-defined functions from the xpath method.

How can I use the regexp to get all these tags without knowing their names?


Nokogiri does not support the XPath 2.0 matches function, so you'll need to use Ruby to perform the regex:

hits = node_set.xpath("//text()").grep(/\d+:\d+:\d+/).map(&:parent)
p hits.map(&:name)
#=> ["SOME_TAG", "HOLY_TAG", "MAJOR_TAG", "FOO_FOO"]

Step by step:

  1. Find all text nodes throughout the document.
  2. Reduce the list to only those that match the regex desired.
  3. Map the list to the parent elements of each text node.

The Enumerable#grep method is shorthand for .select{ |text| regex === text }.
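To illustrate that equivalence without needing Nokogiri, here is a minimal sketch in which plain strings stand in for the text nodes (the sample strings are made up for illustration):

```ruby
# Plain strings stand in for the text nodes returned by xpath("//text()").
texts = ["12:12:1222", "\n   ", "no timestamp here", "7:08:09"]
regex = /\d+:\d+:\d+/

# grep uses === on each element, which for a Regexp means "does it match?".
via_grep   = texts.grep(regex)
via_select = texts.select { |text| regex === text }

p via_grep
p via_grep == via_select
```

Both calls keep only the strings that contain a digits:digits:digits run, so the whitespace-only entry is filtered out just as a whitespace-only text node would be.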

Alternatively, note that you can define your own custom XPath functions in Nokogiri that call back to Ruby, so you could pretend to be using XPath 2.0 matches:

module FindWithRegex
  def self.matches(nodes,pattern,flags=nil)
    nodes.grep(Regexp.new(pattern,flags))
  end
end

hits = node_set.xpath('//*[matches(text(),"\d+:\d+:\d+")]',FindWithRegex)
p hits.map(&:name)
#=> ["SOME_TAG", "HOLY_TAG", "MAJOR_TAG", "FOO_FOO"]

However, because the function is called again for each candidate node (and thus builds a new regexp from the string every time), it's not nearly as efficient:

require 'benchmark'
Benchmark.bm(15) do |x|
  N = 10000
  x.report('grep and map'){ N.times{
    node_set.xpath("//text()").grep(/\d+:\d+:\d+/).map(&:parent)
  }}
  x.report('custom function'){ N.times{
    node_set.xpath('//*[matches(text(),"\d+:\d+:\d+")]',FindWithRegex)
  }}
end

#=>                      user     system      total        real
#=> grep and map     0.437000   0.016000   0.453000 (  0.442044)
#=> custom function  1.653000   0.031000   1.684000 (  1.694170)

You can speed it up by caching the Regex:

module FindWithRegex
  REs = {}
  def self.matches(nodes,pattern,flags=nil)
    nodes.grep(REs[pattern] ||= Regexp.new(pattern,flags))
  end
end

#=>                      user     system      total        real
#=> grep and map     0.437000   0.016000   0.453000 (  0.442044)
#=> cached regex     0.905000   0.000000   0.905000 (  0.896090)
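One thing to watch with the cache above: it is keyed by the pattern string only, so the flags passed on the first call for a pattern are reused for later calls with different flags. A possible variant, sketched here with hypothetical names (RegexCache, fetch) and without Nokogiri, keys the cache on both values:

```ruby
# Hypothetical variant of the cached lookup: the cache key includes the flags,
# so "abc" with and without Regexp::IGNORECASE get separate entries.
module RegexCache
  CACHE = {}

  def self.fetch(pattern, flags = nil)
    CACHE[[pattern, flags]] ||= Regexp.new(pattern, flags)
  end
end

strict = RegexCache.fetch("abc")
loose  = RegexCache.fetch("abc", Regexp::IGNORECASE)

p strict.equal?(RegexCache.fetch("abc"))  # same object is reused on a repeat call
p strict.match?("ABC")
p loose.match?("ABC")
```

The same keying could be dropped into FindWithRegex.matches if you ever pass flags from the XPath side.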


Here is a pure XPath 1.0 solution. Although there is no native RegEx facility in XPath 1.0, this is still possible to achieve using the standard XPath 1.0 functions substring-before(), substring-after(), and translate():

/*/*[not(translate(substring-before(.,':'),
                   '0123456789',
                    ''
                    )
         )
   and
     not(translate
           (substring-before(substring-after(.,':'),
                             ':'
                             ),
           '0123456789',
           ''
           )
          )
   and
     not(translate
           (substring-after(substring-after(.,':'),
                             ':'
                             ),
           '0123456789',
           ''
           )
          )
    ]
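To see what each clause of the expression checks, here is a rough Ruby emulation of the three XPath functions. The helper names are hypothetical and only mirror the XPath 1.0 semantics (substring-before and substring-after return an empty string when the separator is absent, and not() of a string is true only for the empty string):

```ruby
# Ruby stand-ins for the XPath 1.0 functions used in the expression.
# The helper names are hypothetical; they only mirror the XPath semantics.

# substring-before(s, sep): text before the first sep, or "" if sep is absent.
def substring_before(s, sep)
  i = s.index(sep)
  i ? s[0...i] : ""
end

# substring-after(s, sep): text after the first sep, or "" if sep is absent.
def substring_after(s, sep)
  i = s.index(sep)
  i ? s[(i + sep.length)..-1] : ""
end

# translate(s, '0123456789', ''): drop every digit from s.
def strip_digits(s)
  s.delete("0123456789")
end

# Each not(translate(...)) clause is true when nothing but digits was there.
def three_digit_parts?(s)
  rest = substring_after(s, ":")
  strip_digits(substring_before(s, ":")).empty? &&
    strip_digits(substring_before(rest, ":")).empty? &&
    strip_digits(substring_after(rest, ":")).empty?
end

samples = ["12:12:1222", "12a:12:1222", "12:12b:1222", "12:12:1222c", "hello"]
samples.each { |s| puts "#{s.inspect} => #{three_digit_parts?(s)}" }
```

Note the last sample: a string with no colon at all makes every substring empty, so all three clauses hold. Under these semantics the expression checks that no part contains a non-digit, not that each part is non-empty, which is fine for the data in the question but worth knowing for messier input.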

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
     <xsl:copy-of select=
     "    /*/*[not(translate(substring-before(.,':'),
                       '0123456789',
                        ''
                        )
             )
       and
         not(translate
               (substring-before(substring-after(.,':'),
                                 ':'
                                 ),
               '0123456789',
               ''
               )
              )
       and
         not(translate
               (substring-after(substring-after(.,':'),
                                 ':'
                                 ),
               '0123456789',
               ''
               )
              )
        ]
"/>
 </xsl:template>

</xsl:stylesheet>

This XSLT transformation just selects using the above expression and outputs the selected nodes. When applied to this XML document (the provided one with added "invalid" elements):

<PARENT>
   <SOME_TAG>12:12:1222</SOME_TAG>
   <SOME_TAG2>12a:12:1222</SOME_TAG2>
   <HOLY_TAG>12:12:1222</HOLY_TAG>
   <HOLY_TAG2>12:12b:1222</HOLY_TAG2>
   <MAJOR_TAG>12:12:1222</MAJOR_TAG>
   <MAJOR_TAG2>12:12:1222c</MAJOR_TAG2>
   <FOO_FOO>12:12:1222</FOO_FOO>
</PARENT>

the wanted, correctly selected nodes are output:

<SOME_TAG>12:12:1222</SOME_TAG>
<HOLY_TAG>12:12:1222</HOLY_TAG>
<MAJOR_TAG>12:12:1222</MAJOR_TAG>
<FOO_FOO>12:12:1222</FOO_FOO>
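For comparison with the Ruby regex approach: the pattern \d+:\d+:\d+ is unanchored, so it also accepts a value such as 12:12:1222c, which the XPath expression rejects. A small sketch using the same sample values as plain strings (anchoring with \A and \z is my suggestion, not part of the original answers):

```ruby
# Element contents from the test document, as plain strings.
values = ["12:12:1222", "12a:12:1222", "12:12b:1222", "12:12:1222c"]

unanchored = /\d+:\d+:\d+/        # matches anywhere inside the string
anchored   = /\A\d+:\d+:\d+\z/    # the whole string must be digits:digits:digits

loose_hits  = values.grep(unanchored)
strict_hits = values.grep(anchored)

p loose_hits
p strict_hits
```

With the anchored pattern, only the fully valid value survives, which matches what the XPath expression selects for this document.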