Dynamic Programming

What is Dynamic programming?

  • Dynamic programming was invented by the prominent U.S. mathematician Richard Bellman in the 1950s.
  • Dynamic programming has nothing to do with computer programming; here “programming” means planning.
  • It is a method for optimizing a multistage decision-making process.
  • It is an algorithm design technique typically used to solve optimization problems.
  • It is more efficient than brute-force methods, which solve the same subproblems over and over again.

When to use?

  • Dynamic programming is used to solve problems that have overlapping sub-problems.
  • It applies when the problem breaks down into recurring, dependent sub-problems.
  • It applies when the solution can be described recursively in terms of solutions to sub-problems.

How it works?
A dynamic programming algorithm finds solutions to sub-problems and stores them in memory for later use. It is sometimes considered the opposite of recursion: where a recursive solution starts at the top and breaks the problem down, solving smaller problems until the complete problem is solved, a dynamic programming solution starts at the bottom, solving small problems and combining them to form an overall solution to the big problem. It can reduce an algorithm's complexity from O(2^n) to, say, O(n^3) or O(n^2).

Dynamic programming is easiest to understand with the help of problems.
We will solve the following problems with the help of dynamic programming:

  1. Fibonacci Series
  2. Coin Row Problem
  3. Change Making Problem
  4. Minimum Coin Change
  5. Robot Coin Collection
  6. Levenshtein distance

Let’s start with the easiest one, the Fibonacci series.
Fibonacci Series: The Fibonacci series is the sequence

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, . . .

and the nth Fibonacci number is defined by the recurrence relation
f(0) = 0
f(1) = 1
f(n) = f(n-2) + f(n-1) for n > 1

We can define a recursive method using this recurrence relation.

    public int fibonacci_rec(int n) {
        if (n <= 0) {
            return 0;
        }
        if (n == 1) {
            return 1;
        }
        return fibonacci_rec(n - 1) + fibonacci_rec(n - 2);
    }

For example, suppose we want to calculate the 6th Fibonacci number using the above method.

[Figure: recursion tree for F(6)]

To compute F(6), there is one call each to F(5) and F(4). However, since
F(5) recursively calls F(4) and F(3), there are actually two separate calls
to compute F(4). If you trace out the entire algorithm, you will observe that F(4) is computed two times, F(3) three times, F(2) five times, F(1) eight times, and so on. We are re-computing the same values many times, and the amount of redundant calculation grows exponentially.

We can improve this using dynamic programming. We can use a one-dimensional array as a cache to store each computed Fibonacci number; before computing a value we check whether it already exists in the array, and if it does, there is no need to recompute it.

public class FibonacciNumber {

    private static final int MAXN = 45;
    private Integer[] f = new Integer[MAXN];

    public int fibonacci(int n) {
        if (n <= 0) {
            return 0;
        }
        if (n == 1) {
            return 1;
        }
        if (f[n] != null) {
            return f[n]; // already computed: reuse the cached value
        }
        f[n] = fibonacci(n - 1) + fibonacci(n - 2);
        return f[n];
    }
}

Actually, to calculate the nth Fibonacci number we don’t need all the Fibonacci numbers from 0 to n-1; we only need the previous two, f(n-2) and f(n-1).

    public int fibonacci(int n) {
        if (n <= 0) {
            return 0;
        }
        if (n == 1) {
            return 1;
        }
        int back2 = 0; // f(i - 2)
        int back1 = 1; // f(i - 1)
        int next;
        for (int i = 2; i < n; i++) {
            next = back2 + back1;
            back2 = back1;
            back1 = next;
        }
        return back2 + back1;
    }

Coin Row Problem: There is a row of n coins whose values are some positive integers c1, c2, . . . , cn, not necessarily distinct. The goal is to pick up the maximum amount of money subject to the constraint that no two coins adjacent in the initial row can be picked up.

Or in other words

There is an integer array consisting of positive numbers only. Find the maximum possible sum of elements such that no two consecutive elements are present in the sum.

For example: Coins[] = {5, 22, 26, 15, 4, 3, 11}
Max Sum = 22 + 15 + 11 = 48
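Before bringing in dynamic programming, we can sanity-check this example by brute force. The sketch below (class and method names are mine) enumerates every subset of coins with a bitmask, rejects any subset that picks two adjacent coins, and keeps the best sum:

```java
public class CoinRowBruteForce {

    // Try every subset; keep the best sum with no two adjacent picks.
    public static int maxNonAdjacentSum(int[] coins) {
        int best = 0;
        for (int mask = 0; mask < (1 << coins.length); mask++) {
            if ((mask & (mask << 1)) != 0) {
                continue; // two adjacent coins picked: invalid subset
            }
            int sum = 0;
            for (int i = 0; i < coins.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    sum += coins[i];
                }
            }
            best = Math.max(best, sum);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(maxNonAdjacentSum(new int[]{5, 22, 26, 15, 4, 3, 11})); // 48
    }
}
```

This confirms the answer 48, but it examines all 2^n subsets, which is exactly the exponential work dynamic programming avoids.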

To solve this problem using dynamic programming, we first have to define a recurrence relation.
Let F(n) be the maximum sum obtainable from the first n coins. The recurrence relation is:

F(n) = max{Coins[n] + F(n − 2), F(n − 1)} for n > 1,
F(0) = 0, F(1) = Coins[1].

(Coins are indexed from 1 here.) This is easy to understand: while calculating F(n) we take the maximum of two options. Either we pick coin n and add it to the best sum over the first n − 2 coins (its neighbour, coin n − 1, cannot be included), or we skip it and keep the best sum over the first n − 1 coins.
For example, F(2) = max{Coins[2] + F(0), F(1)}, so two adjacent coins are never taken together.

The algorithm:

public class CoinRow {

    public static int getMaximumAmount(int[] coins) {
        assert coins.length > 0 : "no coins to select";

        int[] C = new int[coins.length + 1];
        for (int i = 0; i < coins.length; i++) {
            C[i + 1] = coins[i]; // shift coins to 1-based indexing
        }

        int[] F = new int[coins.length + 1];
        F[0] = 0;
        F[1] = C[1];

        for (int i = 2; i <= coins.length; i++) {
            F[i] = max(C[i] + F[i - 2], F[i - 1]);
        }
        return F[coins.length];
    }

    private static int max(int i, int j) {
        return i > j ? i : j;
    }

    public static void main(String[] args) {
        int[] coins = { 5, 22, 26, 10, 4, 8};
        System.out.println(getMaximumAmount(coins));
    }
}
Output
-------
40

Let’s trace the algorithm for coins {5, 22, 26, 10, 4, 8}:

F[0] = 0
F[1] = 5
F[2] = max(22 + 0, 5) = 22
F[3] = max(26 + 5, 22) = 31
F[4] = max(10 + 22, 31) = 32
F[5] = max(4 + 31, 32) = 35
F[6] = max(8 + 32, 35) = 40

F[6] holds the final value, 40.
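The table F gives the maximum amount, but not which coins achieve it. If you also want the chosen coins, you can walk the table backwards: whenever F[i] differs from F[i-1], coin i must have been taken. A sketch of this idea (the class and method names are mine, not from the original):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CoinRowTrace {

    // Returns the coins that achieve the maximum non-adjacent sum.
    public static List<Integer> pickedCoins(int[] coins) {
        int n = coins.length;
        int[] F = new int[n + 1];
        F[0] = 0;
        F[1] = coins[0];
        for (int i = 2; i <= n; i++) {
            F[i] = Math.max(coins[i - 1] + F[i - 2], F[i - 1]);
        }
        List<Integer> picked = new ArrayList<>();
        int i = n;
        while (i >= 1) {
            if (i == 1 || F[i] != F[i - 1]) {
                picked.add(coins[i - 1]); // coin i was taken
                i -= 2;                   // its neighbour cannot be taken
            } else {
                i -= 1;                   // coin i was skipped
            }
        }
        Collections.reverse(picked);
        return picked;
    }

    public static void main(String[] args) {
        System.out.println(pickedCoins(new int[]{5, 22, 26, 10, 4, 8})); // [22, 10, 8]
    }
}
```

For {5, 22, 26, 10, 4, 8} this recovers the coins {22, 10, 8}, whose sum is the 40 computed above.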


Extract text from a webpage


Today I am going to discuss some libraries that can be used to extract the main textual content from a webpage and remove boilerplate or clutter.
We see tons of pages every day full of advertisements, copyright statements, links, images, etc. These are not the actual relevant content of a webpage but boilerplate.
There are many Java libraries we can use to extract textual content from Wikipedia, news articles, blog posts, etc.
Before exploring the libraries, it is important to know that:

  • Each page has a different structure (in terms of tags).
  • The actual data is segregated into different paragraphs, headings, divs with content classes, etc.
  • For example, search for “Obama” and view the source of the first two results, i.e. http://en.wikipedia.org/wiki/Barack_Obama and http://www.barackobama.com/.
    The two pages have completely different structures.

No parser has any artificial intelligence; behind the scenes each is just a heuristic algorithm with well-defined rules. They work on the DOM (Document Object Model). Most parsers and HTML page strippers either require the user to supply a tag name to get the data of an individual tag, or return the whole page text.
These libraries don’t work on all pages, due to the varying nature of page content in terms of tags.

We will see example of following libraries:

Boilerpipe: Boilerpipe is a Java library written by Christian Kohlschütter. It is based on boilerplate detection using shallow text features; you can read more about shallow text features here.
There is also a test page deployed on Google App Engine where you can enter a link and it will give you the page text.
URL: http://boilerpipe-web.appspot.com/
Boilerpipe is very easy to use. Add the following repository and dependency to your POM:

<repository>
    <id>boilerpipe-m2-repo</id>
    <url>http://boilerpipe.googlecode.com/svn/repo/</url>
</repository>
<dependency>
   <groupId>de.l3s.boilerpipe</groupId>
   <artifactId>boilerpipe</artifactId>
   <version>1.2.0</version>
</dependency>

There are five types of extractors:
ARTICLE_EXTRACTOR: Works very well for most types of Article-like HTML.
CANOLA_EXTRACTOR: Trained on krdwrd Canola (different definition of “boilerplate”). You may give it a try.
DEFAULT_EXTRACTOR: Usually worse than ArticleExtractor, but simpler/no heuristics.
KEEP_EVERYTHING_EXTRACTOR: Dummy Extractor; should return the input text. Use this to double-check that your problem is within a particular BoilerpipeExtractor, or somewhere else.
LARGEST_CONTENT_EXTRACTOR: Like DefaultExtractor, but keeps the largest text block only.

Java Example

package com.test;

import java.net.URL;

import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class BoilerpipeTextExtraction {
	public static void main(String[] args) throws Exception {
		final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL("http://www.basicsbehind.com/stack-data-structure/"));
		final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
		String content = CommonExtractors.ARTICLE_EXTRACTOR.getText(doc);
		System.out.println(content);
	}
}

JSoup: As per the official page, jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
JSoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do.
Test page for JSoup: http://try.jsoup.org/

How to use JSoup:

You can use a jQuery-like selector to get the content of a tag.

Add the following dependency to your POM:

<dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <version>1.7.3</version>
</dependency>

Java Code

package com.test;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupExtractor {
	public static void main(String[] args) throws Exception {

		Document doc = Jsoup.connect("http://www.basicsbehind.com/stack-data-structure/").get();
		// select title of the page
		System.out.println(doc.title());
		// select text of whole page
		System.out.println(doc.text());
		// select text of body
		System.out.println(doc.getElementsByTag("body").text());
		// select text of paragraph
		System.out.println(doc.getElementsByTag("p").text());
	}

}

Apache Tika: Apache Tika is a content analysis toolkit which can be used to extract metadata and text content from various documents.

Add the following dependency to your POM. Note that HtmlParser lives in the tika-parsers artifact (which pulls in tika-core transitively), so tika-core alone is not enough:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.5</version>
</dependency>

Java Code

package com.test;

import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

public class ApacheTikaExtractor {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.basicsbehind.com/stack-data-structure/");
        InputStream input = url.openStream();

        // Collect links, plain text, and cleaned-up HTML in a single parsing pass.
        LinkContentHandler linkHandler = new LinkContentHandler();
        ContentHandler textHandler = new BodyContentHandler();
        ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
        TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        HtmlParser parser = new HtmlParser();
        parser.parse(input, teeHandler, metadata, parseContext);

        System.out.println("title:\n" + metadata.get("title"));
        System.out.println("links:\n" + linkHandler.getLinks());
        System.out.println("text:\n" + textHandler.toString());
        System.out.println("html:\n" + toHTMLHandler.toString());
    }
}