R and regexp: Extract name source from news

开发者 (Developer) https://www.devze.com 2023-04-13 04:15 Source: web

I have scraped news headlines with R; they look like this:

> View(mydf$title)
<name of the news> <dash> <source name>    
Матч КХЛ перенесен на 2 дня - Газета.Ru
Всероссийская универсиада 2010 - Interfax Russia
Звезда хоккея снялся в клипе популярного рэпера. ВИДЕО - Ura.ru
Трактор – Тролейбус 2:1 14.04.2011 – YouTube

I need to split mydf$title into the news title and the source name (- Газета.Ru, - Interfax Russia, - Ura.ru, etc.).

I tried the following with library(stringr):

mydf$sourse <- str_extract(mydf$title, '\\- [A-Za-zА-Яа-я0-9." ]{0,}$')
mydf$sourse <- str_extract(mydf$title, "\\-[:space:[:alpha:][:punct:][:space:]]{0,}$")
mydf$sourse <- str_extract(mydf$title, '\\-\\s[A-Za-zА-Яа-я0-9[:punct:]\\s]{0,}')
mydf$sourse <- str_extract(mydf$title, "\\s-\\s[\\w+\\s.]{0,}$")
mydf$sourse <- str_extract(mydf$title, "\\s-\\s[:alpha:][:print:]$")

But none of these works very well. What is the best way to split the string? Thanks for the tips.

Note: mydf is a data.frame:

> str(mydf)
'data.frame':   100 obs. of  6 variables:
 $ title      : Factor w/ 100 levels...
 $ link       : Factor w/ 100 levels...
 $ guid.text  : Factor w/ 100 levels...
 $ guid..attrs: Factor w/ 1 level...
 $ pubDate    : Factor w/ 100 levels...
 $ description: Factor w/ 100 levels...

Try using strsplit, but note that your separator is in fact two different types of dash (a hyphen and an en dash). Since mydf$title is a factor, convert it with as.character first:

strsplit(as.character(mydf$title), split=" [–-] ")

This will give you a list of character vectors. (As you can see, I couldn't get the encoding to be correct on my machine.) Even so, it's clear that the news agency is always the last element of each vector. The only other issue you will have to deal with is that the title itself can sometimes include a dash, as in the fourth example. If this happens, you will have to use paste to combine all but the last element of each vector.

[[1]]
[1] "<U+041C><U+0430><U+0442><U+0447> <U+041A><U+0425><U+041B> <U+043F><U+0435><U+0440><U+0435><U+043D><U+0435><U+0441><U+0435><U+043D> <U+043D><U+0430> 2 <U+0434><U+043D><U+044F>"
[2] "<U+0413><U+0430><U+0437><U+0435><U+0442><U+0430>.Ru"                                                                                                                           

[[2]]
[1] "<U+0412><U+0441><U+0435><U+0440><U+043E><U+0441><U+0441><U+0438><U+0439><U+0441><U+043A><U+0430><U+044F> <U+0443><U+043D><U+0438><U+0432><U+0435><U+0440><U+0441><U+0438><U+0430><U+0434><U+0430> 2010"
[2] "Interfax Russia"                                                                                                                                                                                       

[[3]]
[1] "<U+0417><U+0432><U+0435><U+0437><U+0434><U+0430> <U+0445><U+043E><U+043A><U+043A><U+0435><U+044F> <U+0441><U+043D><U+044F><U+043B><U+0441><U+044F> <U+0432> <U+043A><U+043B><U+0438><U+043F><U+0435> <U+043F><U+043E><U+043F><U+0443><U+043B><U+044F><U+0440><U+043D><U+043E><U+0433><U+043E> <U+0440><U+044D><U+043F><U+0435><U+0440><U+0430>. <U+0412><U+0418><U+0414><U+0415><U+041E>"
[2] "Ura.ru"                                                                                                                                                                                                                                                                                                                                                                                  

[[4]]
[1] "<U+0422><U+0440><U+0430><U+043A><U+0442><U+043E><U+0440>"                               
[2] "<U+0422><U+0440><U+043E><U+043B><U+0435><U+0439><U+0431><U+0443><U+0441> 2:1 14.04.2011"
[3] "YouTube"
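
The paste step mentioned above could look roughly like this. It is only a minimal sketch: split_title is a hypothetical helper name (not part of base R or stringr), the sample vector just repeats titles from the question, and it assumes the titles are already character strings in a UTF-8 session. Note that the title pieces are rejoined with a plain hyphen, so an en dash inside a title is not preserved.

```r
# Sample titles copied from the question.
titles <- c(
  "Матч КХЛ перенесен на 2 дня - Газета.Ru",
  "Всероссийская универсиада 2010 - Interfax Russia",
  "Трактор – Тролейбус 2:1 14.04.2011 – YouTube"
)

# split_title is a hypothetical helper: the last piece is taken as the
# source, and everything before it is pasted back together as the title.
split_title <- function(x) {
  # "\u2013" is the en dash; the trailing "-" in the class is a literal hyphen.
  parts <- strsplit(x, split = " [\u2013-] ")[[1]]
  n <- length(parts)
  c(title = paste(parts[-n], collapse = " - "), source = parts[n])
}

# One row per headline, with columns "title" and "source".
result <- t(vapply(titles, split_title, character(2), USE.NAMES = FALSE))
result[, "source"]
```

With a data frame like the one in the question, the same helper would be applied to as.character(mydf$title), and the two columns of result could then be assigned back to mydf.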

Perhaps you are overcomplicating things:

strsplit(c("before - after", "123 - 456"), " - ", fixed=TRUE)
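
One caveat with the fixed split: " - " matches only the plain hyphen, so a headline that uses an en dash, like the fourth example in the question, is left unsplit. A minimal sketch of one possible workaround, assuming it is acceptable to normalise the en dash to a hyphen first (the variable names are illustrative):

```r
# Two sample titles from the question; the second uses en dashes.
titles <- c(
  "Всероссийская универсиада 2010 - Interfax Russia",
  "Трактор – Тролейбус 2:1 14.04.2011 – YouTube"
)

# With fixed = TRUE the second title stays in one piece, because " - "
# does not match the en dash (U+2013).
unsplit <- strsplit(titles, " - ", fixed = TRUE)
lengths(unsplit)

# Replace every en dash with a plain hyphen, then split as before.
normalised <- gsub("\u2013", "-", titles, fixed = TRUE)
pieces <- strsplit(normalised, " - ", fixed = TRUE)

# The source is the last piece of each headline.
sources <- vapply(pieces, function(p) p[length(p)], character(1))
sources
```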
