Thursday, 21 November 2013

Java: Get HTML Page Source from a Website URL

The code in this post grabs the HTML source from a given URL. Replace the URL in the examples with a real address starting with http:// and the program will print the source of the site's index page to the console. Some sites, Google among them, may block connections from applications that do not look like a web browser. Sending a browser-like User-Agent header makes the request look as if it comes from a browser, which lets you reach such sites.

There are several ways to get the HTML content of a URL from Java, and even more if you use open source Java libraries. Suppose the URL is www.google.com and a servlet needs to read its HTML source. You don't need a servlet to read data from a remote server: the java.net.URL and java.net.URLConnection classes are enough to read remote content from an HTTP server.

Some web sites try to stop visitors from viewing the HTML source of their pages. They disable the right mouse button to block the "view source" menu option, or they open their pages in a special window that has no menu bar, so the "Source" option in the "View" menu is unavailable. These tricks only affect the browser's user interface. The server still sends the HTML to any client that requests it, so a Java program can read it directly.
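The browser-like request described above can be sketched with the JDK alone. This is a minimal sketch, not a definitive implementation: the class name BrowserLikeFetch, the User-Agent string, the timeouts, the example.com default address and the UTF-8 decoding are all assumptions chosen for illustration. A site may still reject the request for other reasons.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BrowserLikeFetch {
    // Hypothetical browser-style User-Agent value; any realistic string can be used.
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Prepares a GET request with a browser-like User-Agent header.
    // Nothing is sent over the network until the response is read.
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10_000); // milliseconds, assumed value
        conn.setReadTimeout(10_000);    // milliseconds, assumed value
        return conn;
    }

    // Reads the whole stream as text, assuming the page is UTF-8 encoded.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pass the address as the first argument; example.com is a placeholder.
        String address = args.length > 0 ? args[0] : "http://example.com/";
        HttpURLConnection conn = open(address);
        try (InputStream in = conn.getInputStream()) {
            System.out.print(readAll(in));
        } finally {
            conn.disconnect();
        }
    }
}
```

Run it with the address as the first argument, for example `java BrowserLikeFetch http://example.com/`. The header is set before the response is read because request properties cannot be changed once the connection has been made.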

Apache Commons HttpClient
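You can also use Apache Commons HttpClient, a library that is slightly easier to use. The example below uses the Commons HttpClient 3.x API (the org.apache.commons.httpclient package).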
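import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("http://mohanraghuwanshi.blogspot.in/");
try {
    // Send the GET request; the return value is the HTTP status code
    int status = client.executeMethod(method);
    System.out.println("Status: " + status);
    // Decode the body using the charset declared in the response headers
    System.out.println(method.getResponseBodyAsString());
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // Always release the connection so it can be reused
    method.releaseConnection();
}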
Java Development Kit (JDK)
This class uses only the JDK. It opens a URLConnection, prints a few response headers, and reads the page through an InputStream.
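import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Date;

public class WebsiteSource
{
    public static void main(String[] args) throws IOException {

        URL url = new URL("http://www.infostretch.com");
        // getPort() returns -1 when the URL does not name a port explicitly
        System.out.println("Port: " + url.getPort());
        System.out.println("Host: " + url.getHost());
//      System.out.println("File: " + url.getFile());
        URLConnection connection = url.openConnection();

        // Each of these header getters returns 0 when the header is absent
        long date = connection.getDate();
        if (date == 0) {
            System.out.println("No date found");
        } else {
            System.out.println("Date: " + new Date(date));
        }

        long expiration = connection.getExpiration();
        if (expiration == 0) {
            System.out.println("No expiration date found");
        } else {
            System.out.println("Expiration date: " + new Date(expiration));
        }

        long lastModified = connection.getLastModified();
        if (lastModified == 0) {
            System.out.println("No last modified date found");
        } else {
            System.out.println("Modified date: " + new Date(lastModified));
        }

        // getContentLength() returns -1 when the length is unknown
        int len = connection.getContentLength();
        if (len == -1) {
            System.out.println("Content length unavailable.");
        } else {
            System.out.println("Content-Length: " + len);
        }

        if (len != 0) {
            System.out.println("=== HTML page contents ===");
            // Decode the bytes as UTF-8 (an assumption) rather than casting each byte to char
            try (Reader reader = new InputStreamReader(
                    connection.getInputStream(), StandardCharsets.UTF_8)) {
                int ch;
                while ((ch = reader.read()) != -1) {
                    System.out.print((char) ch);
                }
            }
        } else {
            System.out.println("No content available.");
        }
    }
}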
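When the page is read through an InputStream, the bytes have to be decoded with the right character set, and servers usually declare it in the Content-Type header. The helper below is a minimal sketch of picking that charset out of the header value. The class name ContentTypeCharset, the method name charsetOf and the UTF-8 fallback are assumptions made for illustration, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown one.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                                   .replace("\"", "")
                                   .trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // malformed or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

With a URLConnection, the result of `charsetOf(connection.getContentType())` can be passed to an InputStreamReader wrapped around `connection.getInputStream()`, so that pages in encodings other than UTF-8 print correctly.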
