Topic: question about using hpricot to scrape content

Hi all,

I'm using Hpricot to scrape some content from a site and reformat it for another site.  I've got the application up and running pretty well, but I'm stumped by a couple of things.

1st, how can I get the tag type from an Hpricot element?  I want to navigate through the DOM using next_sibling and previous_sibling, but once I've moved on to that sibling, I want to test to see what kind of tag it is, and handle it accordingly.

Also, how can I remove a tag while preserving its contents within the tag's parent object?  For example, I have content that has some <i> tags in it that I'd like to pull out.  Here's some pseudocode:

content = Hpricot.parse(open(url)) # GET CONTENT AS HPRICOT TREE
content.search('i'){|i| # FIND ALL <I> TAGS IN THE CONTENT AND . . .
  i_text = i.inner_html # GET THE TEXT NODE INSIDE THE <I> TAG
  between_the_other_text_nodes = ?? # HOW DO I IDENTIFY THE CURRENT LOCATION WITHIN THE TREE?
  i.remove # DUMP THE <I> TAG
  # PUT THE TEXT THAT WAS INSIDE THE <I> TAG INTO THE CONTENT AT THE POINT THAT THE <I> TAG WAS REMOVED
  content.insert(i_text, between_the_other_text_nodes)
}

Any help is greatly appreciated with this!

Re: question about using hpricot to scrape content

1st, how can I get the tag type from an Hpricot element?

You can use the 'name' method, e.g.

doc = Hpricot('<p>This is a test of <span>test</span> <i>of sorts</i></p>')
doc.root.each_child { |c| puts(c.name) if c.is_a?(Hpricot::Elem) }

# Outputs
span
i


To replace your <i> elements you could do this:

content = Hpricot.parse(open(url))
content.search('i') { |i| i.swap(i.inner_text) }

Rob Anderton
TheWebFellas

Re: question about using hpricot to scrape content

Thanks!  One more question:

I'm getting question marks replacing any special characters (funky quotes, &nbsp;'s, etc) all over my output once I've run it through Hpricot.  It appears to handle these special chars fine for one use and not the other.  For example, I have this which doesn't replace any special chars:

@title = (page_text/'h2').inner_text # TEXT FOR THE TITLE

But if I do some tweaking of the content, I do get character replacement:

content = (page_text/'div.content')[1] # GET THE CONTENT DIV THAT WE'RE INTERESTED IN        
content.search('em') { |i| i.swap(i.inner_text) } # DROP ANY EM TAGS, KEEP CONTENT
content.search('div'){|i|i.swap(i.inner_text)} # DROP ANY DIV TAGS, KEEP CONTENT
content.search('p'){|i|i.swap(i.inner_text)} # DROP ANY P TAGS, KEEP CONTENT
just_text = content.children.select{|e| e.inner_text} # GET ANY INNER_TEXT LEFT
@content = just_text # NOW WE HAVE QUESTION MARKS EVERYWHERE???

So, if I have a title that's "Fort Lawton (Seattle)

Re: question about using hpricot to scrape content

Where are the question marks showing up? In the console, or are you rendering @content as part of a view?

It could be that the page you're processing isn't UTF-8 encoded. Hpricot assumes UTF-8, so you might need to convert the character set before loading the HTML into Hpricot. Something like this:

require 'open-uri' # adds the charset method to the object returned by open
require 'iconv'

f = open(uri)
page_text = Hpricot(Iconv.conv('UTF-8', f.charset, f.read))
f.close
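
On current Ruby versions, where Iconv is no longer part of the standard library, the same conversion can be sketched with String#encode. This is only a sketch under assumptions: the helper name to_utf8 and the sample bytes are made up for illustration, and the source charset is assumed to be known already (for example from f.charset).

```ruby
# Hypothetical helper (not part of Hpricot or open-uri): transcode raw bytes
# from a known source charset into UTF-8 before handing them to a parser.
def to_utf8(raw, source_charset)
  raw.b.force_encoding(source_charset).encode(
    'UTF-8',
    invalid: :replace, # swap malformed byte sequences for a replacement char
    undef:   :replace  # same for characters with no UTF-8 mapping
  )
end

# Made-up sample: "cafe" with an e-acute, as ISO-8859-1 bytes (0xE9).
latin1_bytes = "caf\xE9".b
utf8_text = to_utf8(latin1_bytes, 'ISO-8859-1')
puts utf8_text
```

The result could then be passed to Hpricot(...) in place of the Iconv.conv call. The :replace options trade strictness for robustness, so drop them if you would rather get an exception on bad input.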

Rob Anderton
TheWebFellas

Re: question about using hpricot to scrape content

Hi, thanks for the response. I decided to modify my Hpricot::XChar like this:

Hpricot::XChar::PREDEFINED_U.merge!({"&nbsp;" => 32, "&rdquo;" => 34, "&ldquo;" => 34, "&lsquo;" => 39, "&rsquo;" => 39})

This substitutes plain ASCII characters for the annoying ones that were turning into question marks when the content is rendered.
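
A different route to a similar result, one that does not touch Hpricot's internals, might be to substitute the entities in the raw HTML before parsing it. A minimal sketch in plain Ruby follows. The constant and method names are made up for illustration, and it only covers the named entities listed, not literal curly-quote characters.

```ruby
# Hypothetical mapping (mirrors the entities listed above): named entities
# and the ASCII characters to use in their place.
ASCII_STAND_INS = {
  '&nbsp;'  => ' ',
  '&ldquo;' => '"',
  '&rdquo;' => '"',
  '&lsquo;' => "'",
  '&rsquo;' => "'"
}.freeze

# Replace the listed entities in raw HTML before it reaches the parser.
def asciify_entities(html)
  pattern = Regexp.union(ASCII_STAND_INS.keys)
  # gsub with a hash looks up each matched entity and inserts its value.
  html.gsub(pattern, ASCII_STAND_INS)
end

# Made-up sample input.
sample = '<p>&ldquo;Fort&nbsp;Lawton&rdquo; isn&rsquo;t far</p>'
puts asciify_entities(sample)
```

Entities that are not in the mapping are left untouched, so the table can be extended as new problem characters turn up.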

Re: question about using hpricot to scrape content

An Iconv question: when I try to implement Rob's code above I get this error:

undefined method 'charset' for #<File:my_file.html> (NoMethodError)

Is there something I need to do to figure out the file's charset type before I try to convert it?

Re: question about using hpricot to scrape content

The charset method is added by open-uri, which is what you're using when you call open with a URI.

If you're opening a local file, you won't have access to it. An alternative would be to scan the HTML for a charset in a meta tag and use that.
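
Scanning for the meta tag could look roughly like this. It is a sketch in plain Ruby, not something Hpricot or open-uri provides: the helper name sniff_charset, the 4096-byte window, and the sample document are all assumptions made for illustration, and it uses String#encode rather than Iconv for the conversion step.

```ruby
# Hypothetical helper: look for a charset declaration near the top of an
# HTML document. Covers both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
def sniff_charset(html_bytes, default = 'UTF-8')
  head = html_bytes.b[0, 4096] # work on raw bytes; declarations are ASCII
  match = head.match(/<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i)
  match ? match[1] : default
end

# Made-up sample document declared as ISO-8859-1 (0xE9 is e-acute there).
sample = %(<html><head><meta http-equiv="Content-Type" ) +
         %(content="text/html; charset=iso-8859-1"></head>) +
         %(<body><p>caf\xE9</p></body></html>)

charset = sniff_charset(sample)
utf8_html = sample.b.force_encoding(charset).encode('UTF-8')
puts charset
```

With a local file you would read the bytes with File.binread and pass them through the same two steps. A declared charset that Ruby does not recognise makes force_encoding raise ArgumentError, so real code would want a rescue or a whitelist of names.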

Rob Anderton
TheWebFellas