
R and regexp: Extract the source name from news titles

Developer (https://www.devze.com), 2023-04-13 04:15. Source: the web

I have scraped news headlines with R. They look like this:

> View(mydf$title)
<name of the news> <dash> <source name>    
Матч КХЛ перенесен на 2 дня - Газета.Ru
Всероссийская универсиада 2010 - Interfax Russia
Звезда хоккея снялся в клипе популярного рэпера. ВИДЕО - Ura.ru
Трактор – Тролейбус 2:1 14.04.2011 – YouTube

I need to split mydf$title into the news title and the name of the source (- Газета.Ru, - Interfax Russia, - Ura.ru, etc.).

I tried the following with library(stringr):

mydf$sourse <- str_extract(mydf$title, '\\- [A-Za-zА-Яа-я0-9." ]{0,}$')
mydf$sourse <- str_extract(mydf$title, "\\-[:space:[:alpha:][:punct:][:space:]]{0,}$")
mydf$sourse <- str_extract(mydf$title, '\\-\\s[A-Za-zА-Яа-я0-9[:punct:]\\s]{0,}')
mydf$sourse <- str_extract(mydf$title, "\\s-\\s[\\w+\\s.]{0,}$")
mydf$sourse <- str_extract(mydf$title, "\\s-\\s[:alpha:][:print:]$")

But none of these works very well. What is the best way to split the strings? Thanks for the tips.

Note: mydf is a data.frame:

> str(mydf)
'data.frame':   100 obs. of  6 variables:
 $ title      : Factor w/ 100 levels...
 $ link       : Factor w/ 100 levels...
 $ guid.text  : Factor w/ 100 levels...
 $ guid..attrs: Factor w/ 1 level...
 $ pubDate    : Factor w/ 100 levels...
 $ description: Factor w/ 100 levels...


Try strsplit. Note that your separator is in fact two different characters, a hyphen (-) and an en dash (–):

# mydf$title is a factor, so convert it to character before splitting
strsplit(as.character(mydf$title), split=" [–-] ")

This returns a list with one character vector per title. (As you can see, I couldn't get the encoding to display correctly on my machine.) Even so, it's clear that the news source is always the last element of each vector. The only other issue is that the title itself sometimes contains a dash, as in the fourth example. When that happens, use paste to recombine all but the last element of that vector.

[[1]]
[1] "<U+041C><U+0430><U+0442><U+0447> <U+041A><U+0425><U+041B> <U+043F><U+0435><U+0440><U+0435><U+043D><U+0435><U+0441><U+0435><U+043D> <U+043D><U+0430> 2 <U+0434><U+043D><U+044F>"
[2] "<U+0413><U+0430><U+0437><U+0435><U+0442><U+0430>.Ru"                                                                                                                           

[[2]]
[1] "<U+0412><U+0441><U+0435><U+0440><U+043E><U+0441><U+0441><U+0438><U+0439><U+0441><U+043A><U+0430><U+044F> <U+0443><U+043D><U+0438><U+0432><U+0435><U+0440><U+0441><U+0438><U+0430><U+0434><U+0430> 2010"
[2] "Interfax Russia"                                                                                                                                                                                       

[[3]]
[1] "<U+0417><U+0432><U+0435><U+0437><U+0434><U+0430> <U+0445><U+043E><U+043A><U+043A><U+0435><U+044F> <U+0441><U+043D><U+044F><U+043B><U+0441><U+044F> <U+0432> <U+043A><U+043B><U+0438><U+043F><U+0435> <U+043F><U+043E><U+043F><U+0443><U+043B><U+044F><U+0440><U+043D><U+043E><U+0433><U+043E> <U+0440><U+044D><U+043F><U+0435><U+0440><U+0430>. <U+0412><U+0418><U+0414><U+0415><U+041E>"
[2] "Ura.ru"                                                                                                                                                                                                                                                                                                                                                                                  

[[4]]
[1] "<U+0422><U+0440><U+0430><U+043A><U+0442><U+043E><U+0440>"                               
[2] "<U+0422><U+0440><U+043E><U+043B><U+0435><U+0439><U+0431><U+0443><U+0441> 2:1 14.04.2011"
[3] "YouTube"   
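
The recombining step described above can be sketched as follows. This is a minimal example, assuming a UTF-8 session and using the four sample titles from the question in place of as.character(mydf$title). The names titles, parts, news_source, headline and result are made up for illustration.

```r
# Sample titles copied from the question; with the real data this would be
# titles <- as.character(mydf$title), because mydf$title is a factor.
titles <- c(
  "Матч КХЛ перенесен на 2 дня - Газета.Ru",
  "Всероссийская универсиада 2010 - Interfax Russia",
  "Звезда хоккея снялся в клипе популярного рэпера. ВИДЕО - Ura.ru",
  "Трактор – Тролейбус 2:1 14.04.2011 – YouTube"
)

# Split on a hyphen or an en dash that has a space on each side.
parts <- strsplit(titles, split = " [–-] ")

# The source is the last piece of each split title.
news_source <- vapply(parts, function(p) p[length(p)], character(1))

# The headline is every piece except the last, glued back together.
# Note: this joins with " - ", so an original en dash inside a headline
# comes back as a plain hyphen.
headline <- vapply(
  parts,
  function(p) paste(p[-length(p)], collapse = " - "),
  character(1)
)

result <- data.frame(headline = headline, source = news_source,
                     stringsAsFactors = FALSE)
print(result)
```

A title that contains no separator at all would come back with an empty headline and the whole string as its source, so such rows may need a separate check.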


Perhaps you are overcomplicating things:

strsplit(c("before - after", "123 - 456"), " - ", fixed=TRUE)
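
One caveat with the sample data from the question: a fixed " - " separator does not match the en dash used in the fourth title. As a rough sketch (not part of the original answer, and assuming a UTF-8 session), base R's sub with a greedy prefix can split each string at its last separator only, which also keeps any dash inside the headline untouched. The names titles, pattern, headline and news_source are illustrative.

```r
# Two of the sample titles from the question: one with a hyphen,
# one with en dashes.
titles <- c(
  "Всероссийская универсиада 2010 - Interfax Russia",
  "Трактор – Тролейбус 2:1 14.04.2011 – YouTube"
)

# The fixed separator leaves the second title as a single piece.
fixed_parts <- strsplit(titles, " - ", fixed = TRUE)

# The first group is greedy, so the separator that is matched is the
# LAST " - " or " – " in each string.
pattern <- "^(.*) [–-] (.*)$"
headline    <- sub(pattern, "\\1", titles)
news_source <- sub(pattern, "\\2", titles)

# If a string has no separator, sub() returns it unchanged in both vectors.
print(data.frame(headline = headline, source = news_source,
                 stringsAsFactors = FALSE))
```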