Topic: Question about using Hpricot to scrape content
Hi all,
I'm using Hpricot to scrape some content from one site and reformat it for another. I have the application up and running pretty well, but I'm stumped by a couple of things.
First, how can I get the tag type from an Hpricot element? I want to navigate through the DOM using next_sibling and previous_sibling, but once I've moved to a sibling, I want to test what kind of tag it is and handle it accordingly.
Also, how can I remove a tag while preserving its contents within the tag's parent object? For example, I have content containing some <i> tags that I'd like to strip out while keeping their text. Here's some pseudocode:
content = Hpricot.parse(open(url))  # get the content as an Hpricot tree
content.search('i') { |i|  # find all <i> tags in the content and . . .
  i_text = i.inner_html  # get the text node inside the <i> tag
  between_the_other_text_nodes = ??  # how do I identify the current location within the tree?
  i.remove  # dump the <i> tag
  # put the text that was inside the <i> tag back into the content at the point where the <i> tag was removed
  content.insert(i_text, between_the_other_text_nodes)
}
Any help with this is greatly appreciated!