[Ruby] gem for extracting main text?

Alex Vollmer alex.vollmer at gmail.com
Thu Jun 12 10:57:46 PDT 2008


On Jun 12, 2008, at 10:52 AM, Ben Cohen wrote:

> I'm trying to make a little instapaper.com clone for purely personal use,
> and one of the features they added recently was grabbing the main text of a
> page. Is there any ruby gem that makes this easier, or is this going to
> entail some fun DOM parsing and intelligent guessing?

That's a tough problem to solve in the general case. Where I work we
have a guy who works full time on just trying to identify the content
of any given web page. But if you have an idea of what the page looks
like and the parts you want to extract, you can use scrubyt
[http://scrubyt.org/]. The docs are a little lacking, but if you stare
at them and the examples long enough it should make sense. There is a
similar tool called scrAPI which I've used before but found to be a
little obtuse at times. Google for that and compare it with scrubyt
and see if that gets you anywhere.
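
To give a feel for the "intelligent guessing" end of the question, here
is a minimal sketch that uses only the Ruby standard library. It is not
scrubyt or scrAPI code. The names main_text, strip_tags and MIN_LENGTH
are made up for illustration, and the 40-character threshold is an
arbitrary assumption. It keeps <p> elements whose text is reasonably
long, on the theory that navigation and footer paragraphs tend to be
short. Regex-based HTML handling like this is fragile on real pages, so
treat it as a starting point only.

```ruby
# Crude main-text guess: keep only "long" paragraphs.
# All names and the threshold are hypothetical, chosen for this sketch.
MIN_LENGTH = 40 # assumed cutoff; tune it for the pages you care about

# Replace tags with spaces, then collapse runs of whitespace.
def strip_tags(fragment)
  fragment.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
end

# Return the text of every <p> whose stripped text is at least
# min_length characters, joined by blank lines.
def main_text(html, min_length = MIN_LENGTH)
  # Drop script/style blocks first so their contents are never scanned.
  body = html.gsub(%r{<(script|style)\b.*?</\1>}mi, " ")
  paragraphs = body.scan(%r{<p\b[^>]*>(.*?)</p>}mi).flatten
  texts = paragraphs.map { |inner| strip_tags(inner) }
  texts.select { |text| text.length >= min_length }.join("\n\n")
end
```

Calling main_text(page_source) on a fetched page returns the surviving
paragraphs as one string. A real parser would be more robust than the
regular expressions used here.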

Cheers,

Alex V.

----
Musings & Notes
http://blog.livollmers.net





