
Java: Use Dom4j and NekoHTML to Read HTML

How do you read HTML in Java? Do you use java.net.URL to open a stream and then use String functions to extract the text?

That is a clumsy and fragile way to read HTML. Here is a method that makes reading an HTML document much easier. It requires some knowledge of XPath. If you don't know what XPath is, you can learn about it at W3Schools.

Link: http://www.w3schools.com/xpath/
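
If XPath is new to you, the following minimal sketch may help. It uses the JDK's built-in javax.xml.xpath API rather than Dom4j so that it runs without extra libraries. The class name XPathBasics, the markup, and the prices are made up for illustration and do not reflect the real structure of any Yahoo page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathBasics {

    // Hypothetical, well-formed stand-in for a quote page.
    static final String SAMPLE =
        "<html><body>"
        + "<div id='head'><h1>Example Corp.</h1></div>"
        + "<div id='data'><table>"
        + "<tr><td>Bid</td><td><span>10.50</span></td></tr>"
        + "<tr><td>Ask</td><td><span>10.75</span></td></tr>"
        + "</table></div>"
        + "</body></html>";

    // Parses the markup as XML and returns the text of the first node
    // matched by the XPath expression (empty string if nothing matches).
    static String select(String markup, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // "//" searches the whole document, [@id='...'] filters by attribute,
        // and tr[2] picks the second tr child (XPath positions start at 1).
        System.out.println(select(SAMPLE, "//div[@id='head']/h1"));
        System.out.println("Bid: " + select(SAMPLE, "//div[@id='data']/table/tr[1]/td[2]/span"));
        System.out.println("Ask: " + select(SAMPLE, "//div[@id='data']/table/tr[2]/td[2]/span"));
    }
}
```

The same path syntax is what the Dom4j selectSingleNode calls later in this tutorial use.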

Before we start, we need to download the Dom4j and NekoHTML libraries.
Dom4j: http://www.dom4j.org/dom4j-1.6.1/
NekoHTML: http://nekohtml.sourceforge.net/

In this tutorial, I have chosen a simple example: getting a stock's bid price and ask price from Yahoo Finance.

Here is the source code:

package com.open_tutorial.www; // hyphens are not allowed in Java package names

import java.io.IOException;

import org.cyberneko.html.parsers.DOMParser;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.DOMReader;
import org.xml.sax.SAXException;

public class Main {

  /**
   * @param args
   */
  public static void main(String[] args) {
    try{
      String url = "http://finance.yahoo.com/q?s=GOOG";

      DOMParser parser = new DOMParser();
      parser.parse(url);

      org.w3c.dom.Document document = parser.getDocument();
      DOMReader domReader = new DOMReader();
      Document doc = domReader.read(document);

      // Element names must be upper case: NekoHTML upper-cases them by default
      Node name = doc.selectSingleNode("//DIV[@id='yfi_investing_head']/H1/node()");
      Node bid = doc.selectSingleNode("//DIV[@id='yfi_quote_summary_data']/TABLE/TBODY/TR[6]/TD[1]/SPAN");
      Node ask = doc.selectSingleNode("//DIV[@id='yfi_quote_summary_data']/TABLE/TBODY/TR[7]/TD[1]/SPAN");

      System.out.println(name.getText());
      System.out.println("Bid: " + bid.getText());
      System.out.println("Ask: " + ask.getText());
    } catch (SAXException e) {
      System.out.println(e.toString());
    } catch (IOException e) {
      System.out.println(e.toString());
    }
  }
}

new DOMParser() and parser.parse(url) – create a NekoHTML DOMParser and use it to fetch and parse http://finance.yahoo.com/q?s=GOOG

domReader.read(document) – convert the W3C Document object to a Dom4j Document object

The three selectSingleNode calls – use XPath to select the target elements. Element names must be written in upper case, because NekoHTML converts all element names to upper case by default.
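
XPath name tests are case-sensitive, which is why the case of the element names matters. The sketch below illustrates this using only the JDK's XPath API instead of NekoHTML and Dom4j. A hand-written upper-case document stands in for the kind of tree NekoHTML produces; the class name CaseSensitivityDemo and the markup are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CaseSensitivityDemo {

    // Stand-in for a parsed page whose element names are upper case.
    // Attribute names are left in lower case.
    static final String UPPER =
        "<HTML><BODY><DIV id='quote'><SPAN>490.60</SPAN></DIV></BODY></HTML>";

    // Returns how many nodes the XPath expression matches in the markup.
    static int count(String markup, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Upper-case names match the upper-case elements.
        System.out.println(count(UPPER, "//DIV[@id='quote']/SPAN")); // prints 1
        // Lower-case names select nothing in the same document.
        System.out.println(count(UPPER, "//div[@id='quote']/span")); // prints 0
    }
}
```

In the Dom4j code above, an expression that matches nothing makes selectSingleNode return null, so a lower-case path would lead to a NullPointerException at the getText() call.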

Result:

Google Inc.
Bid: 490.60
Ask: 493.75