Extract text from a webpage

Today I am going to discuss some of the libraries that can be used to extract the main textual content of a webpage and remove boilerplate or clutter. Every day we see tons of pages full of advertisements, copyright statements, links, images, etc. These are not the actual relevant content of the webpage but boilerplate.
There are many Java libraries we can use to extract textual content from Wikipedia pages, news articles, blog posts, etc.
Before exploring these libraries, it is important to know that:

  • Each page has a different structure (in terms of tags).
  • The actual data is segregated into different paragraphs, headings, divs with content classes, etc.
  • For example, search for “Obama” and view the source of the first two results, i.e. http://en.wikipedia.org/wiki/Barack_Obama and http://www.barackobama.com/.
    Both pages have different structures.

No parser has any artificial intelligence; each is just a heuristic algorithm with well-defined rules working behind the scenes. They operate on the DOM (Document Object Model). Most parsers or HTML-page strippers either require the user to supply a tag name to get the data of an individual tag, or they return the whole page text.
These libraries don’t work on all pages, because page content varies widely in terms of tags.

We will see example of following libraries:

Boilerpipe: Boilerpipe is a Java library written by Christian Kohlschütter. It is based on the paper “Boilerplate Detection Using Shallow Text Features”. You can read more about shallow text features here.
There is also a test page deployed on Google App Engine where you can enter a link and it will give you the page text.
URL:- http://boilerpipe-web.appspot.com/
Boilerpipe is very easy to use. Add the following dependency to your POM:
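The original dependency snippet is missing here, so the entry below is a sketch. Note that the coordinates and version are assumptions: Boilerpipe was originally distributed via Google Code, and the artifact available on Maven Central may live under a different groupId (for example the com.syncthemall fork), so verify before use.

```xml
<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.1.0</version>
</dependency>
```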

There are five types of extractors:

  • ARTICLE_EXTRACTOR: works very well for most types of article-like HTML.
  • CANOLA_EXTRACTOR: trained on krdwrd Canola (a different definition of “boilerplate”). You may give it a try.
  • DEFAULT_EXTRACTOR: usually worse than ARTICLE_EXTRACTOR, but simpler, with no heuristics.
  • KEEP_EVERYTHING_EXTRACTOR: dummy extractor; it should return the input text unchanged. Use it to double-check whether your problem lies within a particular BoilerpipeExtractor or somewhere else.
  • LARGEST_CONTENT_EXTRACTOR: like DEFAULT_EXTRACTOR, but keeps only the largest text block.

Java Example
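The original code listing is missing, so here is a minimal sketch of Boilerpipe usage, assuming the library is on the classpath. The URL is purely illustrative; swap `ArticleExtractor` for any of the other extractors above to compare results.

```java
import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeExample {
    public static void main(String[] args) throws Exception {
        // Example page; any article-like URL works here
        URL url = new URL("http://en.wikipedia.org/wiki/Barack_Obama");

        // ArticleExtractor fetches the page and strips the boilerplate,
        // returning only the main textual content
        String text = ArticleExtractor.INSTANCE.getText(url);

        System.out.println(text);
    }
}
```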

JSoup: As per the jsoup official page, “jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.”
JSoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.
Test page for JSoup: http://try.jsoup.org/

How to use JSoup:

You can use jQuery-like selectors to get the content of a tag.

Add the following entry to your POM:
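The original snippet is missing; jsoup is published on Maven Central under `org.jsoup:jsoup` (the version below is illustrative — check Maven Central for the latest):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
```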

Java Code
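The original listing is missing; this is a minimal sketch of fetching a page and extracting text with jsoup. The URL and the `div#content p` selector are illustrative assumptions (they happen to match Wikipedia's layout), not part of the jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page into a DOM
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Barack_Obama").get();

        System.out.println("Title: " + doc.title());

        // jQuery-like selector: first paragraph inside the content div
        Element firstPara = doc.select("div#content p").first();
        if (firstPara != null) {
            System.out.println(firstPara.text());
        }

        // Or strip all tags and print the visible text of the whole page
        System.out.println(doc.body().text());
    }
}
```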

Apache Tika: Apache Tika is a content analysis toolkit that can be used to extract metadata and text content from various documents.

Add the following dependency to your POM:
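The original snippet is missing; Tika's parsers module is published under `org.apache.tika:tika-parsers` and pulls in `tika-core` transitively (the version below is illustrative — check Maven Central for the latest):

```xml
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
</dependency>
```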

Java Code
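The original listing is missing; this is a minimal sketch of extracting text and metadata with Tika's `AutoDetectParser`, assuming the Tika jars are on the classpath. The URL is illustrative; the same code works for PDFs, Word documents, etc., since `AutoDetectParser` detects the content type automatically.

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExample {
    public static void main(String[] args) throws Exception {
        // -1 disables BodyContentHandler's default 100k character write limit
        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        try (InputStream stream =
                 new URL("http://en.wikipedia.org/wiki/Barack_Obama").openStream()) {
            parser.parse(stream, handler, metadata);
        }

        // Extracted text content
        System.out.println(handler.toString());

        // Extracted metadata (content type, title, etc.)
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```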

  • It should work if you are able to open the page in a browser. All these APIs simply work on the page content that is loaded in the browser.

  • This is a very helpful blog, but one question: will it work with HTTPS websites?

    When I was working with HttpUnit to log in to an SSL-secured website and perform some automated operations, a special process was required: downloading the certificate and adding it to the cacerts file in jre/lib using the JDK’s keytool.exe program.
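For reference, importing a site certificate into the JRE trust store described in the comment above typically looks like the command below. The alias, file name, and path are illustrative; `changeit` is the default store password of the JRE cacerts file.

```shell
# Export the site's certificate first (e.g. via the browser), then:
keytool -importcert -alias example-site \
        -file example-site.cer \
        -keystore "$JAVA_HOME/jre/lib/security/cacerts" \
        -storepass changeit
```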