I have a URL (https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine) that I want to scrape posts from. Some of the posts are replies that begin with the text "Originally Posted by ...". I want to scrape all of the text within the posts, excluding that initial "Originally Posted by" text. For example:

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Originally posted by C Heuwi 
      Hellou
 E    Hello guys
 F    Originally posted by A Hi, how are you ?
      I am doing good
 G    Whats going on ?

For user F, the "Originally Posted by ..." text sits inside a div.quote_container element (the child), while "I am doing good" sits directly inside blockquote.postcontent.restore (the parent).

Expected results:

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Hellou
 E    Hello guys
 F    I am doing good
 G    Whats going on ?

I tried the following code:

library(rvest)
url <- "https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
threads <- cbind(review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)") %>% html_text())

I tried a few other approaches too:

threads <- cbind(review %>% html_nodes(xpath = '//div[@class="blockquote.postcontent.restore"]/node()[not(self::div)]') %>% html_text())

or

threads <- review %>% html_nodes(".content")
close_nodes <- threads %>% html_nodes(".quote_container")
chk <- xml_remove(close_nodes)

None of these worked. How can I scrape all of the post text while excluding the child element? Thanks in advance!

1 Answer

This turns out to be relatively easy with the xml_remove() function from the xml2 package. Older versions of rvest attach xml2 automatically; since rvest 1.0.0, xml2 has to be loaded explicitly with library(xml2).

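library(rvest)
library(xml2)  # provides xml_remove(); not attached automatically by rvest >= 1.0.0

# read the page
url <- "https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)

# find the parent nodes (one per post)
threads <- review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)")
# find the child nodes to exclude (html_node() returns the first match in each post)
toremove <- threads %>% html_node("div.bbcode_container")
# remove the child nodes; this modifies the parsed document in place
xml_remove(toremove)

# convert the parent nodes to text
threads %>% html_text(trim = TRUE)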

From the documentation for xml_remove: "Care needs to be taken when using xml_remove()". Please review, use caution and save frequently.
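
To see the effect without requesting the forum page, here is a minimal sketch on a made-up HTML fragment. The markup only imitates the structure described in the question (a blockquote.postcontent.restore parent with a div.bbcode_container child that holds the quote); it is an assumption, not the site's actual HTML. The sketch also uses html_nodes() for the children so that every quoted block in a post is collected, which is a small variation on the code above.

```r
library(rvest)
library(xml2)

# Hypothetical markup imitating two forum posts; the second one quotes the first
html <- '
<div>
  <blockquote class="postcontent restore">Heuwi</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally Posted by C Heuwi</div>
    </div>
    Hellou
  </blockquote>
</div>'

page <- read_html(html)

# Parent nodes: one per post
posts <- html_nodes(page, "blockquote.postcontent.restore")

# Child nodes to drop: every quoted block inside any post
quotes <- html_nodes(posts, "div.bbcode_container")

# Removal happens in place, so both `page` and `posts` lose the quoted blocks
xml_remove(quotes)

# Remaining text of each post, with surrounding whitespace trimmed
post_text <- html_text(posts, trim = TRUE)
print(post_text)
```

Because the removal is done in place, re-run read_html() if the original, unmodified document is needed afterwards.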

  • Thank you. It works perfectly. Can this be used within a loop as well? Commented Feb 20, 2019 at 14:56
  • @gamyanaidu, yes, this method should extend to a loop.
    – Dave2e
    Commented Feb 20, 2019 at 15:19
  • Is it possible to exclude multiple nodes here? I tried the code below, but it does not work:
      url <- "https://forums.vwvortex.com/showthread.php?3087297-How-to-solve-(or-prevent)-Eos-Roof-leaks"
      review <- read_html(url)
      threads <- review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container):not(.cms_table)")
      # find children nodes to exclude
      toremove <- threads %>% html_node("div.bbcode_container")
      tormove2 <- threads %>% html_node("tbody")
      # remove nodes
      xml_remove(toremove)
      xml_remove(tormove2)
      # convert the parent nodes to text
      thread <- threads %>% html_text(trim = TRUE)
    Commented Feb 28, 2019 at 21:05
  • @gamyanaidu, I am not sure. You have to be careful, since this changes the structure of the XML document, so you may see some unexpected behavior.
    – Dave2e
    Commented Mar 1, 2019 at 0:27
  • Replacing "tbody" with the immediate child node made it work. Commented Mar 1, 2019 at 13:25
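
The two follow-up questions in the comments (using the method in a loop, and excluding more than one kind of child) can be sketched together as below. The helper name scrape_posts, its arguments, and the sample markup are all hypothetical; the selectors are taken from the comments, except that "table" is used instead of "tbody". That swap is a guess at why "tbody" failed: browsers insert tbody elements when they render a table, so the raw HTML may not contain one.

```r
library(rvest)
library(xml2)

# Hypothetical helper: return post text with unwanted child elements dropped.
# `exclude` is a CSS selector list; the comma matches any of the alternatives.
scrape_posts <- function(page,
                         parent = "blockquote.postcontent.restore",
                         exclude = "div.bbcode_container, table") {
  posts <- html_nodes(page, parent)
  xml_remove(html_nodes(posts, exclude))  # modifies `page` in place
  html_text(posts, trim = TRUE)
}

# Made-up pages standing in for the results of read_html(url)
pages <- list(
  read_html('<blockquote class="postcontent restore">
               <div class="bbcode_container">Originally Posted by A</div>
               I am doing good
             </blockquote>'),
  read_html('<blockquote class="postcontent restore">
               Roof seal notes
               <table class="cms_table"><tr><td>part</td></tr></table>
             </blockquote>')
)

# Apply the helper to each page; with real threads, each element of `pages`
# would come from calling read_html() on a thread URL
all_posts <- unlist(lapply(pages, scrape_posts))
print(all_posts)
```

Passing one combined selector keeps the removal in a single xml_remove() call, which avoids holding several node sets that point into a document that is being modified.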
