Friday, May 4, 2012

HTML Parsing in Java


Many times you may have needed to fetch image links from a web page in your Java apps. I ran into the same need to parse an HTML page, and the first thing that came to my mind was the SAX parser that ships with Java.
The problem is that SAX is an XML parser, and real-world HTML is usually not well-formed XML (unclosed tags such as <br>, unquoted attributes, and so on). Unsurprisingly, it threw a lot of fatal errors.
After searching the web I came across the "jsoup" library for HTML parsing. Here is a code snippet that uses it:
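To make the SAX problem concrete, here is a minimal sketch using only the JDK. The class name SaxOnHtmlDemo, the helper trySax and the sample markup are made up for illustration; they are not what I originally ran. It feeds the same fragment to the SAX parser twice, once as ordinary HTML and once with the empty tags closed XML-style.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxOnHtmlDemo {

    // Hypothetical helper: returns null if the markup parses as XML,
    // otherwise the error message reported by the parser.
    public static String trySax(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Ordinary HTML: <img> and <br> are never closed, which XML does not allow.
        String html = "<html><body><img src=\"logo.png\"><br>"
                + "<a href=\"/about\">About</a></body></html>";
        // The same content written as well-formed XML.
        String xhtml = "<html><body><img src=\"logo.png\"/><br/>"
                + "<a href=\"/about\">About</a></body></html>";

        System.out.println("HTML : " + trySax(html));
        System.out.println("XHTML: " + trySax(xhtml));
    }
}
```

The first call should report a fatal error about a missing end tag, while the second parses cleanly. Pages in the wild mostly look like the first string, so an XML parser is the wrong tool for them.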


// This code prints all the links on the google.co.in home page

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class Main {

    public static void main(String[] args) {
        try {
            // Fetch the page and parse it into a DOM-like Document
            Document doc = Jsoup.connect("https://www.google.co.in/").get();

            // Select every <a> element that has an href attribute
            Elements links = doc.select("a[href]");

            for (Element link : links) {
                // Print the href value exactly as written in the page
                System.out.println(link.attr("href"));
            }
        } catch (IOException e) {
            // Network failure or an HTTP error while fetching the page
            e.printStackTrace();
        }
    }

}
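Since the original goal was image links, here is a small sketch of the same idea for img tags. It assumes the jsoup jar is on the classpath. The class name ImageLinks, the method imageUrls and the example.com markup are my own illustration. It parses an HTML string instead of hitting the network, and uses absUrl so that relative src values are resolved against the base URI.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageLinks {

    // Hypothetical helper: collects the absolute URL of every <img> that has a src.
    public static List<String> imageUrls(String html, String baseUri) {
        // The base URI lets jsoup resolve relative paths such as "/img/logo.png"
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            urls.add(img.absUrl("src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up markup with unclosed tags, which jsoup parses without complaint
        String html = "<html><body>"
                + "<img src=\"/img/logo.png\">"
                + "<img alt=\"no source here\">"
                + "<img src=\"photos/cat.jpg\"><br>"
                + "</body></html>";

        for (String url : imageUrls(html, "https://example.com/gallery/")) {
            System.out.println(url);
        }
    }
}
```

For a live page, the same selector works on the Document returned by Jsoup.connect(...).get(), and the base URI is then taken from the URL that was fetched.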

jsoup is free and open source, what else do you need :) For more information, see jsoup.org. Wish you happy HTML parsing :)
