Jul 28, 2010

Java: Use Dom4j and NekoHTML to Read HTML




How do you read HTML in Java? Do you use java.net.URL to open a stream and String functions to extract the text?
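For reference, the string-based approach might look something like the minimal sketch below. To stay runnable offline it works on an in-memory string instead of a stream opened with java.net.URL, and the markup, the class name StringScrapeSketch and the helper between are made up for illustration.

```java
class StringScrapeSketch {

    // Returns the text between the first startMarker and the next endMarker,
    // or null when either marker is missing.
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical page fragment, not real Yahoo Finance markup.
        String html = "<div id=\"quote\"><span class=\"bid\">490.60</span></div>";
        System.out.println(between(html, "<span class=\"bid\">", "</span>")); // prints 490.60

        // The same lookup breaks as soon as the markup changes slightly,
        // here only the quote style of the attribute.
        String changed = "<div id=\"quote\"><span class='bid'>490.60</span></div>";
        System.out.println(between(changed, "<span class=\"bid\">", "</span>")); // prints null
    }
}
```

The second call shows the weakness: the lookup depends on the exact characters of the page, not on its structure.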

That is a clumsy way to read HTML. Here is a method that makes reading an HTML document much easier. It requires some knowledge of XPath. If you don't know what that is, you can learn about it at W3Schools.

Link: http://www.w3schools.com/xpath/
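If XPath is new to you, the small sketch below may help. It uses the XPath API built into the JDK (javax.xml.xpath) instead of Dom4j, only so that it runs without extra libraries. The markup, the class name XPathBasicsSketch and the helper select are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathBasicsSketch {

    // Hypothetical, well-formed markup; real pages are rarely this tidy.
    static final String PAGE =
        "<html><body>"
        + "<div id='head'><h1>Example Corp.</h1></div>"
        + "<div id='data'><table>"
        + "<tr><td>Bid</td><td>10.50</td></tr>"
        + "<tr><td>Ask</td><td>10.75</td></tr>"
        + "</table></div>"
        + "</body></html>";

    // Evaluates an XPath expression against PAGE and returns the text of
    // the first matching node (an empty string when nothing matches).
    static String select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "//div[@id='head']" finds a div anywhere in the document by its id.
        System.out.println(select("//div[@id='head']/h1"));                // prints Example Corp.
        // "tr[1]" and "td[2]" pick children by position, counting from 1.
        System.out.println(select("//div[@id='data']/table/tr[1]/td[2]")); // prints 10.50
        System.out.println(select("//div[@id='data']/table/tr[2]/td[2]")); // prints 10.75
    }
}
```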


Before we start, we need to download the Dom4j and NekoHTML libraries.
Dom4j: http://www.dom4j.org/dom4j-1.6.1/
NekoHTML: http://nekohtml.sourceforge.net/
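Dom4j's XPath support also needs the jaxen library at run time (it is in the lib folder of the Dom4j download), and a missing jar shows up as a NoClassDefFoundError. The following sketch is one way to check the classpath before running the tutorial code. The class name ClasspathCheckSketch and the helper isAvailable are hypothetical; the three class names are the ones this tutorial relies on.

```java
class ClasspathCheckSketch {

    // Returns true when the named class can be loaded from the classpath.
    static boolean isAvailable(String className) {
        try {
            // 'false' skips static initializers; we only ask whether it loads.
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.dom4j.Document",                   // Dom4j
            "org.jaxen.JaxenException",             // jaxen, used by Dom4j for XPath
            "org.cyberneko.html.parsers.DOMParser"  // NekoHTML
        };
        for (String name : required) {
            System.out.println(name + ": " + (isAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath you plan to use for the tutorial code; any line marked MISSING points at a jar that still has to be added.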

In this tutorial, I have chosen a simple example: getting a stock's bid price and ask price from Yahoo Finance.

Here is the source code:
package com.open_tutorial.www;

import java.io.IOException;

import org.cyberneko.html.parsers.DOMParser;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.DOMReader;
import org.xml.sax.SAXException;

public class Main {

    /**
    * @param args
    */
    public static void main(String[] args) {
        try{
            String url = "http://finance.yahoo.com/q?s=GOOG";

            DOMParser parser = new DOMParser();
            parser.parse(url);

            org.w3c.dom.Document document = parser.getDocument();
            DOMReader domReader = new DOMReader();  
            Document doc = domReader.read(document);

            //Element names must be upper case, because NekoHTML converts them to upper case
            Node name = doc.selectSingleNode("//DIV[@id='yfi_investing_head']/H1/node()");
            Node bid = doc.selectSingleNode("//DIV[@id='yfi_quote_summary_data']/TABLE/TBODY/TR[6]/TD[1]/SPAN");
            Node ask = doc.selectSingleNode("//DIV[@id='yfi_quote_summary_data']/TABLE/TBODY/TR[7]/TD[1]/SPAN");

            System.out.println(name.getText());
            System.out.println("Bid: " + bid.getText());
            System.out.println("Ask: " + ask.getText());
        } catch (SAXException e) {
            System.out.println(e.toString());
        } catch (IOException e) {
            System.out.println(e.toString());
        }
    }
}

Line 20 - create a NekoHTML DOMParser to read http://finance.yahoo.com/q?s=GOOG

Line 25 - convert the W3C Document object to a Dom4j Document object

Lines 28, 29, 30 - use XPath to select the target elements. Element names must be written in upper case, because NekoHTML converts all element names to upper case.
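The upper-case rule matters because XPath matching is case-sensitive. The sketch below illustrates this with the JDK's built-in XPath API rather than Dom4j and NekoHTML, so that it runs without extra libraries. The upper-case markup stands in for what NekoHTML would produce, and the class name XPathCaseSketch and the helper textOrNull are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathCaseSketch {

    // Hypothetical document with upper-case element names, standing in
    // for the DOM that NekoHTML builds from a page.
    static final String PAGE =
        "<HTML><BODY><DIV id='quote'><SPAN>490.60</SPAN></DIV></BODY></HTML>";

    // Returns the text of the first node matching the expression,
    // or null when nothing matches.
    static String textOrNull(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            Node node = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Upper-case names match the upper-case elements.
        System.out.println(textOrNull("//DIV[@id='quote']/SPAN")); // prints 490.60
        // Lower-case names select nothing, so the result is null.
        System.out.println(textOrNull("//div[@id='quote']/span")); // prints null
    }
}
```

A selection that matches nothing comes back as null, so it is worth checking the result before calling getText() on it, for example when the page layout changes.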

Result:
Google Inc.
Bid: 490.60
Ask: 493.75


5 comments:

brattak said...

Hey it is exactly what I need!

But I have some errors with your code, I uploaded the libraries and I don't understand....:
"Exception in thread "main" java.lang.NoClassDefFoundError: org/jaxen/JaxenException
at org.dom4j.DocumentFactory.createXPath(DocumentFactory.java:230)
at org.dom4j.tree.AbstractNode.createXPath(AbstractNode.java:207)
at org.dom4j.tree.AbstractNode.selectSingleNode(AbstractNode.java:183)
at test.main(test.java:22)
Caused by: java.lang.ClassNotFoundException: org.jaxen.JaxenException
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 4 more
"

Lawrence Cheung said...

The jaxen library is in the lib folder of dom4j.zip.

brattak said...

thanks a lot

Andrew Huang said...

Hi,

I have included the jaxen and Xerces libs.

however, it still has the following err msg:

Could you kindly help?

D:\Programing\>java Main
Exception in thread "main" java.lang.NoClassDefFoundError: org/cyberneko/html/parsers/DOMParser
at Main.main(Main.java:13)
Caused by: java.lang.ClassNotFoundException: org.cyberneko.html.parsers.DOMParser
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 1 more

Andrew Huang said...

I fixed it. Thx
