[Ruby] gem for extracting main text?
Alex Vollmer
alex.vollmer at gmail.com
Thu Jun 12 10:57:46 PDT 2008
On Jun 12, 2008, at 10:52 AM, Ben Cohen wrote:
> I'm trying to make a little instapaper.com clone for purely personal use,
> and one of the features they added recently was grabbing the main text of a
> page. Is there any ruby gem that makes this easier, or is this going to
> entail some fun DOM parsing and intelligent guessing?
That's a tough problem to solve in the general case. Where I work we
have a guy who works full time just on trying to identify the content
of any given web page. But if you have an idea of what the page looks
like and which parts you want to extract, you can use scrubyt
[http://scrubyt.org/]. The docs are a little lacking, but if you stare
at them and the examples long enough it should make sense. There is a
similar tool called scrAPI, which I've used before but found to be a
little obtuse at times. Google for that, compare it with scrubyt, and
see if that gets you anywhere.
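If you end up going the "intelligent guessing" route instead, here is a minimal sketch of one crude heuristic: split the page on block-level tags and keep the chunk with the most plain text. This uses only the Ruby standard library and has nothing to do with the scrubyt or scrAPI APIs. The method name guess_main_text and the BLOCK_TAGS list are made up for illustration, and regex-based tag stripping is an assumption that will break on messy real-world markup.

```ruby
require "cgi"

# Hypothetical list of tags treated as block boundaries (an assumption,
# not an exhaustive or standard list).
BLOCK_TAGS = /<\/?(?:p|div|td|article|section|li|h[1-6]|br)\b[^>]*>/i

# Hypothetical helper: guess the main text of a page by picking the
# block-level chunk that contains the most plain text.
def guess_main_text(html)
  # Drop script and style bodies, which are never reader-visible text.
  cleaned = html.gsub(/<(script|style)\b[^>]*>.*?<\/\1>/mi, " ")

  # Split on block-level tags, strip any remaining inline tags,
  # decode entities, and collapse whitespace.
  chunks = cleaned.split(BLOCK_TAGS).map do |chunk|
    text = chunk.gsub(/<[^>]+>/, " ")
    CGI.unescapeHTML(text).gsub(/\s+/, " ").strip
  end

  # The longest chunk wins; returns "" when there is nothing to pick.
  chunks.max_by(&:length).to_s
end
```

Call it as guess_main_text(page_source) with the raw HTML string. Navigation bars and footers tend to be short runs of link text, so they usually lose to a real article paragraph, but a page with a long comment thread or sidebar will fool it. That is exactly why the general case is hard.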
Cheers,
Alex V.
----
Musings & Notes
http://blog.livollmers.net