[Ruby] gem for extracting main text?
Ben Cohen
heliostatic at gmail.com
Thu Jun 12 11:19:06 PDT 2008
On Thu, Jun 12, 2008 at 11:00 AM, Phil Hagelberg <phil at hagelb.org> wrote:
> "Ben Cohen" <heliostatic at gmail.com> writes:
>
> > I'm trying to make a little instapaper.com clone for purely personal
> > use, and one of the features they added recently was grabbing the main
> > text of a page. Is there any ruby gem that makes this easier, or is this
> > going to entail some fun DOM parsing and intelligent guessing?
>
> I wish. A generalized strategy is very difficult to implement without
> some level of natural language parsing and sentence splitting. The best
> naive solution I've come up with is to manually keep a mapping of
> domains to DOM IDs and tie it into Hpricot, but you can't add a site to
> the listing without human intervention. =\
>
> -Phil
> _______________________________________________
> Ruby at zenspider.com - Seattle.rb non-commercial list
> http://www.zenspider.com/seattle.rb
> http://www.zenspider.com/mailman/listinfo/ruby
>
Yes, I've been looking at scRUBYt, and it certainly takes some of the
drudgery out. I'm really curious how Marco at instapaper is doing it. Since
mine is for personal use, tweaking the code to accommodate new sites is not
a problem, but I'm interested in a more robust solution for its own sake.
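
For reference, the domain-to-DOM-ID mapping Phil describes might look
roughly like the sketch below. This is only an illustration, not anyone's
actual code: the hostnames, the IDs, and the names CONTENT_IDS,
content_selector, and main_text are all made up, and the HTML parsing is
left to a caller-supplied callable so the lookup itself needs nothing
beyond the standard library.

```ruby
require "uri"

# Hypothetical mapping of hostnames to the DOM ID wrapping the main text.
# Every entry is invented for illustration and must be added by hand.
CONTENT_IDS = {
  "example.com"      => "article-body",
  "blog.example.org" => "post",
}.freeze

# Return the CSS selector for a URL's main-text element, or nil when the
# site has not been added to the mapping yet.
def content_selector(url, mapping = CONTENT_IDS)
  host = URI.parse(url).host
  return nil if host.nil?

  # Treat "www.example.com" and "example.com" as the same site.
  id = mapping[host.downcase.sub(/\Awww\./, "")]
  id && ("#" + id)
end

# Extract the main text of a page. `parser` is any callable that takes
# (html, selector) and returns a string; it is an assumption of this
# sketch, standing in for whatever HTML library is in use.
def main_text(url, html, parser, mapping = CONTENT_IDS)
  selector = content_selector(url, mapping)
  return nil unless selector

  parser.call(html, selector)
end
```

With Hpricot, the parser could plausibly be something like
->(html, sel) { Hpricot(html).at(sel).inner_text }, though I have not
tested that here, and it would need a nil check for pages where the ID is
missing. Sites not in the mapping simply return nil, which is exactly the
"human intervention" limitation Phil mentions.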