Many a time you might have felt the need to fetch image links from a web page in your Java apps. I too came across the need to parse an HTML page, and the first thing that came to my mind was to use the SAX parser in Java.
The problem with the SAX parser is that it is an XML parser, not suited to parsing HTML, since HTML is not pure XML!! Obviously it threw a lot of fatal errors.
After surfing the web I came across the "jsoup" library for HTML parsing. Here is a code snippet that uses it:
// This code prints all the links on google.co.in
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) {
        try {
            // Fetch and parse the page
            Document doc = Jsoup.connect("https://www.google.co.in/").get();
            // Select every anchor tag that has an href attribute
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                System.out.println(link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
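Since the original need was fetching image links, here is a minimal sketch along the same lines; it assumes jsoup is on the classpath, uses the img[src] selector, and calls absUrl() to resolve relative src values into absolute URLs. The URL and class name are just placeholders for illustration.

// Minimal sketch: print the absolute URL of every image on a page
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ImageLinks {

    public static void main(String[] args) {
        try {
            // Placeholder URL; point this at whatever page you want to scan
            Document doc = Jsoup.connect("https://www.google.co.in/").get();
            // Select every img tag that has a src attribute
            Elements images = doc.select("img[src]");
            for (Element img : images) {
                // absUrl resolves relative src values against the page's base URI
                System.out.println(img.absUrl("src"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The attr("abs:src") shortcut gives the same result as absUrl("src") if you prefer a single attribute-style call.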
jsoup is free and open source, what else do you need :) . For more information refer to jsoup.org/. Wish you happy HTML parsing :)