[Ruby] gem for extracting main text?

Phil Hagelberg phil at hagelb.org
Thu Jun 12 11:00:07 PDT 2008


"Ben Cohen" <heliostatic at gmail.com> writes:

> I'm trying to make a little instapaper.com clone for purely personal use,
> and one of the features they added recently was grabbing the main text of a
> page. Is there any ruby gem that makes this easier, or is this going to
> entail some fun DOM parsing and intelligent guessing?

I wish. A generalized strategy is very difficult to implement without
some level of natural language parsing and sentence splitting. The best
naive solution I've come up with is to manually keep a mapping of
domains to DOM IDs and tie it into Hpricot, but you can't add a site to
the listing without human intervention. =\
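
Roughly, that mapping approach could look like the sketch below. The
domains and DOM IDs in the table are made up for illustration, and the
names CONTENT_IDS, content_id_for and main_text are hypothetical, not
part of Hpricot. It assumes the hpricot gem is installed; the lookup
itself only needs the standard library.

```ruby
require 'uri'

# Hypothetical, hand-maintained table: host => DOM id of the element
# that holds the main text. These entries are invented examples.
CONTENT_IDS = {
  'example.com'      => 'article-body',
  'blog.example.org' => 'post'
}

# Return the DOM id registered for the URL's host, or nil if the
# site has not been added to the table yet.
def content_id_for(url)
  host = URI.parse(url).host.to_s.downcase.sub(/\Awww\./, '')
  CONTENT_IDS[host]
end

# Return the text of the mapped element from an HTML string, or nil
# when the site is unlisted or the element is missing.
def main_text(url, html)
  id = content_id_for(url)
  return nil unless id

  require 'hpricot' # loaded lazily; assumes the gem is available
  element = Hpricot(html).at("##{id}")
  element && element.inner_text.strip
end
```

Calling main_text('http://www.example.com/story', html) would look up
'article-body' and hand "#article-body" to Hpricot. Any host missing
from the table returns nil, which is exactly the human-intervention
problem: someone has to inspect each new site and add its ID by hand.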

-Phil


