
org.apache.nutch.util

Class URLUtil


public class URLUtil
extends Object

Utility class for URL analysis

Constructor Summary

  - <code>URLUtil()</code>

Method Summary

  - <code>static String chooseRepr(String src, String dst, boolean temp)</code> - Given two urls, a src and a destination of a redirect, it returns the representative url.
  - <code>static String getDomainName(String url)</code> - Returns the domain name of the url.
  - <code>static String getDomainName(URL url)</code> - Returns the domain name of the url.
  - <code>static DomainSuffix getDomainSuffix(String url)</code> - Returns the DomainSuffix corresponding to the last public part of the hostname.
  - <code>static DomainSuffix getDomainSuffix(URL url)</code> - Returns the DomainSuffix corresponding to the last public part of the hostname.
  - <code>static String getHost(String url)</code> - Returns the lowercased hostname for the url or null if the url is not well formed.
  - <code>static String[] getHostSegments(String url)</code> - Partitions the hostname of the url by ".".
  - <code>static String[] getHostSegments(URL url)</code> - Partitions the hostname of the url by ".".
  - <code>static String getPage(String url)</code> - Returns the page for the url.
  - <code>static String getProtocol(String url)</code>
  - <code>static String getProtocol(URL url)</code>
  - <code>static String getTopLevelDomainName(String url)</code> - Returns the top level domain name of the url.
  - <code>static String getTopLevelDomainName(URL url)</code> - Returns the top level domain name of the url.
  - <code>static boolean isSameDomainName(String url1, String url2)</code> - Returns whether the given urls have the same domain name.
  - <code>static boolean isSameDomainName(URL url1, URL url2)</code> - Returns whether the given urls have the same domain name.
  - <code>static void main(String[] args)</code> - For testing.
  - <code>static URL resolveURL(URL base, String target)</code> - Resolves relative URLs and works around a java.net.URL error in the handling of URLs with pure query targets.
  - <code>static String toASCII(String url)</code>
  - <code>static String toUNICODE(String url)</code>


Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


URLUtil

public URLUtil()

Method Detail


resolveURL

public static URL resolveURL(URL base,
             String target)
                      throws MalformedURLException

Resolves relative URLs and works around a java.net.URL error in the handling of URLs with pure query targets (targets that consist only of a query string, such as "?y=1").

  - Parameters:
  - <code>base</code> - base url
  - <code>target</code> - target url (may be relative) 
  - Returns:
  - resolved absolute url. 
  - Throws: 
  - <code>MalformedURLException</code>       
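
The pure query case can be illustrated with a small sketch. It assumes the workaround is to re-attach the last path segment of the base url before delegating to java.net.URL. The class and method names (`ResolveSketch`, `resolve`) are hypothetical, and this is not necessarily how URLUtil implements the fix.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch, not the URLUtil source.
final class ResolveSketch {

  // Resolves target against base. A target that is only a query string
  // ("?y=1") is first prefixed with the last path segment of the base,
  // so that the page name of the base is preserved in the result.
  static URL resolve(URL base, String target) throws MalformedURLException {
    target = target.trim();
    if (target.startsWith("?")) {
      String path = base.getPath();
      // Text after the last '/', e.g. "c.html" for "/b/c.html".
      String lastSegment = path.substring(path.lastIndexOf('/') + 1);
      target = lastSegment + target;
    }
    return new URL(base, target);
  }
}
```

With a base of `http://a.com/b/c.html`, a target of `?y=1` resolves to `http://a.com/b/c.html?y=1` in this sketch, and an ordinary relative target such as `d.html` resolves to `http://a.com/b/d.html`.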

getDomainName

public static String getDomainName(URL url)

Returns the domain name of the url. The domain name of a url is the substring of the url's hostname without subdomain names. For example,

getDomainName(new URL("http://lucene.apache.org/"))

will return

apache.org
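
How a domain name relates to the public suffix of a hostname can be sketched as follows. This is a minimal illustration that works on hostnames rather than URLs and uses a tiny, hypothetical suffix set (`SUFFIXES`) in place of a real domain suffix table. The class and method names are assumptions and do not come from URLUtil.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the URLUtil source.
final class DomainNameSketch {

  // Assumed stand-in for a real public suffix table.
  static final Set<String> SUFFIXES = Set.of("org", "com", "co.uk");

  // Returns the longest known suffix of the host, or null if none matches.
  static String publicSuffix(String host) {
    String candidate = host.toLowerCase(Locale.ROOT);
    while (true) {
      if (SUFFIXES.contains(candidate)) {
        return candidate;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return null;
      }
      candidate = candidate.substring(dot + 1); // drop the leftmost label
    }
  }

  // Returns the label directly before the public suffix plus the suffix.
  static String domainName(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String suffix = publicSuffix(h);
    if (suffix == null || suffix.equals(h)) {
      return h; // nothing to strip
    }
    String prefix = h.substring(0, h.length() - suffix.length() - 1);
    return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
  }
}
```

In this sketch `domainName("lucene.apache.org")` yields `apache.org`, and `domainName("www.example.co.uk")` yields `example.co.uk` because `co.uk` is in the assumed suffix set.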


getDomainName

public static String getDomainName(String url)
                            throws MalformedURLException

Returns the domain name of the url. The domain name of a url is the substring of the url's hostname without subdomain names. For example,

getDomainName("http://lucene.apache.org/")

will return

apache.org

  - Throws: 
  - <code>MalformedURLException</code>       

getTopLevelDomainName

public static String getTopLevelDomainName(URL url)
                                    throws MalformedURLException

Returns the top level domain name of the url. The top level domain name of a url is the last dot-separated part of the url's hostname. For example,

getTopLevelDomainName(new URL("http://lucene.apache.org/"))

will return

org

  - Throws: 
  - <code>MalformedURLException</code>       

getTopLevelDomainName

public static String getTopLevelDomainName(String url)
                                    throws MalformedURLException

Returns the top level domain name of the url. The top level domain name of a url is the last dot-separated part of the url's hostname. For example,

getTopLevelDomainName("http://lucene.apache.org/")

will return

org

  - Throws: 
  - <code>MalformedURLException</code>       

isSameDomainName

public static boolean isSameDomainName(URL url1,
                       URL url2)

Returns whether the given urls have the same domain name. For example,

isSameDomainName(new URL("http://lucene.apache.org"), new URL("http://people.apache.org/")) will return true.

  - Returns:
  - true if the domain names are equal       

isSameDomainName

public static boolean isSameDomainName(String url1,
                       String url2)
                                throws MalformedURLException

Returns whether the given urls have the same domain name. For example,

isSameDomainName("http://lucene.apache.org", "http://people.apache.org/") will return true.

  - Returns:
  - true if the domain names are equal 
  - Throws: 
  - <code>MalformedURLException</code>       

getDomainSuffix

public static DomainSuffix getDomainSuffix(URL url)

Returns the DomainSuffix corresponding to the last public part of the hostname


getDomainSuffix

public static DomainSuffix getDomainSuffix(String url)
                                    throws MalformedURLException

Returns the DomainSuffix corresponding to the last public part of the hostname

  - Throws: 
  - <code>MalformedURLException</code>       

getHostSegments

public static String[] getHostSegments(URL url)

Partitions the hostname of the url by ".".
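
A minimal sketch of such a partition, using only java.net.URL and String.split. The class and method names are hypothetical, and any special handling the real method may apply (for example to IP addresses) is not covered here.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch, not the URLUtil source.
final class HostSegmentsSketch {

  // Splits the hostname of the url on literal dots.
  static String[] hostSegments(String url) throws MalformedURLException {
    String host = new URL(url).getHost();
    return host.split("\\."); // "." is escaped because split takes a regex
  }
}
```

For `http://lucene.apache.org/` this sketch returns the three segments `lucene`, `apache` and `org`.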


getHostSegments

public static String[] getHostSegments(String url)
                                throws MalformedURLException

Partitions the hostname of the url by ".".

  - Throws: 
  - <code>MalformedURLException</code>       

chooseRepr

public static String chooseRepr(String src,
                String dst,
                boolean temp)

Given two urls, a src and a destination of a redirect, it returns the representative url.

This method implements an extended version of the algorithm used by the Yahoo! Slurp crawler described here:

How does the Yahoo! webcrawler handle redirects?

In the examples, * marks the url that is kept.

  - Choose the destination url if either url is malformed.
  - If the domains differ, keep the destination whether the redirect is temporary or permanent.
    - a.com -> b.com*
  - If the redirect is permanent and the source is root, keep the source.
    - *a.com -> a.com?y=1 || *a.com -> a.com/xyz/index.html
  - If the redirect is permanent, the source is not root and the destination is root, keep the destination.
    - a.com/xyz/index.html -> a.com*
  - If the redirect is permanent and neither the source nor the destination is root, keep the destination.
    - a.com/xyz/index.html -> a.com/abc/page.html*
  - If the redirect is temporary, the source is root and the destination is not root, keep the source.
    - *a.com -> a.com/xyz/index.html
  - If the redirect is temporary, the source is not root and the destination is root, keep the destination.
    - a.com/xyz/index.html -> a.com*
  - If the redirect is temporary and neither the source nor the destination is root, keep the shortest url. First check for the shortest host, and if both hosts are equal then check by path. Paths are compared first by length, then by the number of / path separators.
    - a.com/xyz/index.html -> a.com/abc/page.html*
    - *www.a.com/xyz/index.html -> www.news.a.com/xyz/index.html
  - If the redirect is temporary and both the source and the destination are root, keep the shortest sub-domain.
    - *www.a.com -> www.news.a.com

Separately from this logic, a further piece of representative url logic runs during indexing, after scoring. When the basic fields are created before indexing, if a url has a representative url stored, the url and its representative url (which should never be the same) are compared by their linkrank scores. The higher scoring one is kept as the url and the lower scoring one is held as the orig url inside the index.

  - Parameters:
  - <code>src</code> - The source url.
  - <code>dst</code> - The destination url.
  - <code>temp</code> - Whether the redirect is temporary. 
  - Returns:
  - String The representative url.       
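
The rules listed above can be sketched as a single decision function. This is an illustration written from the rule list, not the URLUtil source. It makes three labeled assumptions: the domain is approximated by the last two host labels (`naiveDomain`), the "shortest host" is the one with fewer dot-separated labels, and the source is kept on any tie. All names in the sketch are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical sketch of the redirect rules, not the URLUtil source.
final class RedirectReprSketch {

  // Assumption: the domain is the last two labels of the host.
  static String naiveDomain(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String[] labels = h.split("\\.");
    int n = labels.length;
    return n <= 2 ? h : labels[n - 2] + "." + labels[n - 1];
  }

  // A url is "root" when nothing but an optional "/" follows the host.
  static boolean isRoot(String file) {
    return file.isEmpty() || file.equals("/");
  }

  static int count(String s, char c) {
    int k = 0;
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) == c) k++;
    }
    return k;
  }

  static String chooseRepr(String src, String dst, boolean temp) {
    URL s;
    URL d;
    try {
      s = new URL(src);
      d = new URL(dst);
    } catch (MalformedURLException e) {
      return dst; // either url malformed: keep the destination
    }
    // Different domains: keep the destination, temporary or permanent.
    if (!naiveDomain(s.getHost()).equals(naiveDomain(d.getHost()))) return dst;

    boolean srcRoot = isRoot(s.getFile());
    boolean dstRoot = isRoot(d.getFile());

    // Permanent: keep a root source, otherwise the destination.
    if (!temp) return srcRoot ? src : dst;

    // Temporary, exactly one side is root: keep the root side.
    if (srcRoot && !dstRoot) return src;
    if (!srcRoot && dstRoot) return dst;

    // Temporary, both root or neither root: prefer the shorter host
    // (assumption: fewer labels means shorter).
    int srcDots = count(s.getHost(), '.');
    int dstDots = count(d.getHost(), '.');
    if (srcDots != dstDots) return dstDots < srcDots ? dst : src;
    if (srcRoot) return src; // both root, same host size: keep source

    // Neither root, same host size: path length, then number of '/'.
    String sp = s.getFile();
    String dp = d.getFile();
    if (sp.length() != dp.length()) return dp.length() < sp.length() ? dst : src;
    return count(dp, '/') < count(sp, '/') ? dst : src;
  }
}
```

Applied to the examples in the list, the sketch keeps the starred url in each case, for instance the source for a temporary redirect from `http://www.a.com` to `http://www.news.a.com`.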

getHost

public static String getHost(String url)

Returns the lowercased hostname for the url or null if the url is not well formed.

  - Parameters:
  - <code>url</code> - The url to check. 
  - Returns:
  - String The hostname for the url.       
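
The described contract (lowercased hostname, null for a url that is not well formed) can be sketched with java.net.URL. The class and method names are hypothetical, and the sketch is not taken from the URLUtil source.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical sketch, not the URLUtil source.
final class HostSketch {

  // Returns the lowercased hostname, or null if the url cannot be parsed.
  static String host(String url) {
    try {
      return new URL(url.trim()).getHost().toLowerCase(Locale.ROOT);
    } catch (MalformedURLException e) {
      return null;
    }
  }
}
```

Here `host("http://Lucene.Apache.ORG/index.html")` gives `lucene.apache.org`, and `host("not a url")` gives null because the string has no protocol.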

getPage

public static String getPage(String url)

Returns the page for the url. The page consists of the protocol, host, and path, but does not include the query string. The host is lowercased but the path is not.

  - Parameters:
  - <code>url</code> - The url to check. 
  - Returns:
  - String The page for the url.       
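
A minimal sketch of the page as described above: protocol, lowercased host and unchanged path, without the query string. The names are hypothetical. Returning null for a malformed url and leaving out any port are assumptions of the sketch, since the description does not cover either case.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Hypothetical sketch, not the URLUtil source.
final class PageSketch {

  // Builds protocol://host/path with the host lowercased and the path
  // left as written; the query string is dropped.
  static String page(String url) {
    try {
      URL u = new URL(url.trim());
      String host = u.getHost().toLowerCase(Locale.ROOT);
      return u.getProtocol() + "://" + host + u.getPath();
    } catch (MalformedURLException e) {
      return null; // assumption: mirror getHost for malformed input
    }
  }
}
```

In this sketch `page("http://Example.COM/Docs/Index.html?q=1")` gives `http://example.com/Docs/Index.html`.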

getProtocol

public static String getProtocol(String url)

getProtocol

public static String getProtocol(URL url)

toASCII

public static String toASCII(String url)

toUNICODE

public static String toUNICODE(String url)
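
The source gives no description for toASCII and toUNICODE. Assuming they convert between the Unicode and ASCII (Punycode) forms of internationalized host names, the underlying conversion can be sketched with the standard java.net.IDN class. The sketch works on bare host names only, and its class and method names are hypothetical.

```java
import java.net.IDN;

// Hypothetical sketch of the assumed conversion, not the URLUtil source.
final class IdnSketch {

  // Converts a Unicode host name to its ASCII-compatible encoding.
  static String hostToAscii(String host) {
    return IDN.toASCII(host);
  }

  // Converts an ASCII-compatible host name back to Unicode.
  static String hostToUnicode(String host) {
    return IDN.toUnicode(host);
  }
}
```

For example, `hostToAscii("b\u00fccher.example")` gives `xn--bcher-kva.example`, and `hostToUnicode` reverses that conversion.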

main

public static void main(String[] args)

For testing

Copyright © 2014 The Apache Software Foundation