[Ruby] gem for extracting main text?

Philip Hallstrom philip at pjkh.com
Thu Jun 12 10:55:20 PDT 2008


> I'm trying to make a little instapaper.com clone for purely personal use,
> and one of the features they added recently was grabbing the main text of a
> page. Is there any ruby gem that makes this easier, or is this going to
> entail some fun DOM parsing and intelligent guessing?

I'm guessing it will require guessing :(  Hpricot might make it easier 
though.  You could have it search for divs with an id or class of 
"content" or "main", etc.
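
Something like this is what I have in mind. It's only a rough sketch: the
selector list is my own guess at common names, and extract_main_text is a
made-up helper, not part of Hpricot. It works with any parsed document that
responds to at(css_selector), which the object returned by Hpricot(html) does.

```ruby
# Candidate CSS selectors, tried in order. These names are guesses, not a
# standard; adjust them for the sites you actually read.
CANDIDATE_SELECTORS = %w[div#content div#main div.content div.main].freeze

# doc is any parsed document that responds to #at(css_selector), such as the
# object returned by Hpricot(html). Returns the stripped text of the first
# matching non-empty element, or nil when nothing matches.
def extract_main_text(doc, selectors = CANDIDATE_SELECTORS)
  selectors.each do |selector|
    node = doc.at(selector)
    next if node.nil?

    text = node.inner_text.strip
    return text unless text.empty?
  end
  nil
end

# Usage, assuming the hpricot gem is installed and html is a String
# holding a page you have already fetched:
#   require 'rubygems'
#   require 'hpricot'
#   doc = Hpricot(html)
#   puts extract_main_text(doc) || "no obvious content div found"
```

Trying the selectors in order means the first hit wins, so put the most
specific guesses first.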

If you read a lot of the same sites you may want to build parsers "per 
site" so you can target them individually.  Or perhaps identify the page's 
blog engine and parse that way.  Just a thought.
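
The "per site" idea could be as simple as a lookup table keyed on the host
name. Again just a sketch: the hosts and selectors below are placeholders I
made up, and selector_for is a hypothetical helper.

```ruby
require 'uri'

# Hypothetical per-site rules: host name => CSS selector for the main text.
# Fill these in by looking at the markup of the sites you read most.
SITE_SELECTORS = {
  "example.com"      => "div#article-body",
  "blog.example.org" => "div.post"
}.freeze

# Fallback guess for sites without a dedicated rule.
DEFAULT_SELECTOR = "div#content"

# Picks the selector for a URL, ignoring case and a leading "www.".
def selector_for(url)
  host = URI.parse(url).host.to_s.downcase.sub(/\Awww\./, "")
  SITE_SELECTORS.fetch(host, DEFAULT_SELECTOR)
end
```

Then you hand the selector to whatever parser you're using, and add a new
entry whenever a site you care about comes out wrong.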

-philip


