[Ruby] gem for extracting main text?
Phil Hagelberg
phil at hagelb.org
Thu Jun 12 11:00:07 PDT 2008
"Ben Cohen" <heliostatic at gmail.com> writes:
> I'm trying to make a little instapaper.com clone for purely personal use,
> and one of the features they added recently was grabbing the main text of a
> page. Is there any ruby gem that makes this easier, or is this going to
> entail some fun DOM parsing and intelligent guessing?
I wish. A generalized strategy is very difficult to implement without
some level of natural language parsing and sentence splitting. The best
naive solution I've come up with is to manually keep a mapping of
domains to DOM IDs and tie it into Hpricot, but you can't add a site to
the listing without human intervention. =\
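
A minimal sketch of that mapping approach, assuming the hpricot gem is
installed. The table entries, the MAIN_TEXT_IDS constant, and both
method names are made up for illustration; a real table would have to
be filled in by hand for each site:

```ruby
require 'uri'

# Hypothetical hand-maintained table: host name => DOM id of the element
# that holds the main text. These entries are invented examples.
MAIN_TEXT_IDS = {
  'example.com'      => 'article-body',
  'blog.example.org' => 'content'
}.freeze

# Returns the DOM id registered for the URL's host, or nil if the site
# has not been added to the table yet.
def main_text_id_for(url)
  host = URI.parse(url).host
  return nil if host.nil?
  # Treat "www.example.com" and "example.com" as the same site.
  MAIN_TEXT_IDS[host.downcase.sub(/\Awww\./, '')]
end

# Returns the text of the mapped element, or nil when the site is
# unknown or the element is missing from the page.
def extract_main_text(url, html)
  id = main_text_id_for(url)
  return nil unless id
  require 'hpricot' # loaded lazily so the lookup works without the gem
  element = Hpricot(html).at("##{id}")
  element && element.inner_text.strip
end
```

Something like extract_main_text(url, open(url).read) with open-uri
would tie it together. Unknown sites just return nil, which is exactly
the human-intervention problem.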
-Phil