Java: Use Dom4j and NekoHTML to Read HTML

How do you read HTML in Java? Do you use java.net.URL to open a stream and then use String functions to extract the text?

That is a clumsy and fragile way to read HTML. Here is a method that makes reading an HTML document much easier. It requires some knowledge of XPath. If you don't know what XPath is, you can learn about it at W3Schools.

Link: http://www.w3schools.com/xpath/
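
For readers new to XPath, here is a minimal sketch of the kind of expressions this tutorial relies on: an attribute predicate such as [@id='...'] and a positional index such as TR[2]. To keep it runnable without any downloads, it uses the XPath engine built into the JDK (javax.xml.xpath) instead of Dom4j. The class name XPathSketch, the helper select, and the sample markup are made up for illustration and are not taken from any real page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical sample markup; element names are upper case on purpose.
    static final String SAMPLE =
          "<HTML><BODY>"
        + "<DIV id='head'><H1>Example Corp.</H1></DIV>"
        + "<DIV id='data'><TABLE><TBODY>"
        + "<TR><TH>Bid:</TH><TD><SPAN>10.50</SPAN></TD></TR>"
        + "<TR><TH>Ask:</TH><TD><SPAN>10.75</SPAN></TD></TR>"
        + "</TBODY></TABLE></DIV>"
        + "</BODY></HTML>";

    // Evaluates an XPath expression against SAMPLE and returns the text of
    // the first matching node, or an empty string when nothing matches.
    static String select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // [@id='head'] keeps only the DIV whose id attribute equals "head".
        System.out.println(select("//DIV[@id='head']/H1"));
        // TR[1] and TR[2] pick table rows by position (XPath counts from 1).
        System.out.println("Bid: " + select("//DIV[@id='data']/TABLE/TBODY/TR[1]/TD[1]/SPAN"));
        System.out.println("Ask: " + select("//DIV[@id='data']/TABLE/TBODY/TR[2]/TD[1]/SPAN"));
    }
}
```

Running main prints "Example Corp.", "Bid: 10.50" and "Ask: 10.75". The expressions have the same shape as the ones used against the real page later in this tutorial.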

Before starting, download the Dom4j and NekoHTML libraries:
Dom4j: http://www.dom4j.org/dom4j-1.6.1/
NekoHTML: http://nekohtml.sourceforge.net/

In this tutorial, I have chosen a simple example: getting a stock's bid price and ask price from Yahoo Finance.

Here is the source code:

package com.open_tutorial.www;

import java.io.IOException;

import org.cyberneko.html.parsers.DOMParser;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.DOMReader;
import org.xml.sax.SAXException;

public class Main {

    /**
    * @param args
    */
    public static void main(String[] args) {
        try {
            String url = "http://finance.yahoo.com/q?s=GOOG";

            DOMParser parser = new DOMParser();
            parser.parse(url);

            org.w3c.dom.Document document = parser.getDocument();
            DOMReader domReader = new DOMReader();
            Document doc = domReader.read(document);

            // Element names in the XPath must be upper case (NekoHTML default)
            Node name = doc.selectSingleNode("//DIV[@id='yfi_investing_head']/H1/node()");
            Node bid = doc.selectSingleNode("//DIV[@id='yfi_quote_summary_data']/TABLE/TBODY/TR[6]/TD[1]/SPAN");
            Node ask = doc.selectSingleNode("//DIV[@id='yfi_quote_summary_data']/TABLE/TBODY/TR[7]/TD[1]/SPAN");

            System.out.println(name.getText());
            System.out.println("Bid: " + bid.getText());
            System.out.println("Ask: " + ask.getText());
        } catch (SAXException e) {
            System.out.println(e.toString());
        } catch (IOException e) {
            System.out.println(e.toString());
        }
    }
}

Lines 20 and 21 – create a NekoHTML DOMParser and use it to parse http://finance.yahoo.com/q?s=GOOG

Line 25 – convert the W3C Document object to a Dom4j Document object

Lines 28, 29 and 30 – use XPath to select the target elements. The element names must be written in upper case, because NekoHTML converts all element names to upper case by default and XPath is case-sensitive.
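
The case rule is easy to trip over, so here is a small sketch of what happens when the case does not match. It again uses the JDK's built-in XPath engine rather than Dom4j so it runs without extra libraries. CaseSketch, selectFirst, textOrDefault and the sample markup are hypothetical names invented for this illustration. A lookup that matches nothing yields null here, and Dom4j's selectSingleNode also returns null when nothing matches, so calling getText() directly on the result would throw a NullPointerException if the page layout ever changed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CaseSketch {

    // Made-up markup with upper-case element names, as NekoHTML reports them by default.
    static final String SAMPLE =
        "<HTML><BODY><DIV id='quote'><SPAN>490.60</SPAN></DIV></BODY></HTML>";

    // Returns the first node matching the expression, or null when nothing matches.
    static Node selectFirst(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    // Returns the matched node's text, or the fallback when the lookup found nothing.
    static String textOrDefault(String expression, String fallback) {
        Node node = selectFirst(expression);
        return node == null ? fallback : node.getTextContent();
    }

    public static void main(String[] args) {
        // Upper-case names match the parsed tree: prints 490.60
        System.out.println(textOrDefault("//DIV[@id='quote']/SPAN", "n/a"));
        // Lower-case names match nothing, so the fallback is used: prints n/a
        System.out.println(textOrDefault("//div[@id='quote']/span", "n/a"));
    }
}
```

A similar null check around name, bid and ask in the Yahoo Finance example would let the program report a missing value instead of crashing.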

Result:

Google Inc.
Bid: 490.60
Ask: 493.75