Building the list of all links used in the web site


Note: we use the terms URL and link here. They mean the same thing; we write URL when talking about the address of a particular web page, and link when we mean a link found on a web page.
  • 1. Getting web page contents.

Let $strURL be the URL of the web page we want to retrieve.
Then the natural way of getting web page contents is:

$fd = fopen($strURL, "rb");
$buffer = "";
while (!feof($fd))
    $buffer .= fgets($fd, 4096);
fclose($fd);

However, versions prior to PHP 4.0.5 do not handle HTTP redirects. Because of this, URLs of directories must include trailing slashes.
Thus there are two choices. Either write
    if (substr($strURL, -1) != '/') $strURL .= "/";
or use the fsockopen function to connect to the web server, send the request, check for a "Location: ..." line in the response header, and repeat. See the sample in the article Retrieving web page contents handling HTTP redirects.
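
The "check for a Location line" step of the second choice can be sketched as follows. This is only an illustration under stated assumptions: findRedirectLocation is a hypothetical helper name, the header text is a made-up sample, and a real client would first read the headers from the socket opened with fsockopen.

```php
<?php
// Hypothetical helper (not from the original article): returns the redirect
// target found in raw HTTP response headers, or null if the response is not
// a 3xx redirect or carries no Location header.
function findRedirectLocation($strHeaders)
{
    // The status line looks like "HTTP/1.1 301 Moved Permanently".
    if (!preg_match('/^HTTP\/\d\.\d\s+3\d\d\b/', $strHeaders))
        return null;
    // Header names are case-insensitive; "m" lets ^ match at each line start.
    if (preg_match('/^Location:[ \t]*(\S+)/im', $strHeaders, $aMatch))
        return $aMatch[1];
    return null;
}

// Made-up sample response headers, as they would arrive over the socket.
$strHeaders = "HTTP/1.1 301 Moved Permanently\r\n" .
              "Location: http://www.example.com/articles/\r\n" .
              "Content-Length: 0\r\n\r\n";
$strLocation = findRedirectLocation($strHeaders);
```

When the helper returns a URL, the request is repeated for that URL; when it returns null, the body that follows the headers is the page itself.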


  • 2. Building the list of all links at the web page.

Here we have to use the power of regular expressions.
    preg_match_all(
        "/<A[ \r\n\t]{1}[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>/isU",
        $strPageText,
        $aUrls);
Now the array $aUrls[1] contains all the links found in $strPageText.
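
To see what the expression captures, it can be run against a small hand-written fragment. The HTML below is an invented sample, not taken from any real page. Note that the anchor part after # is dropped from the captured link.

```php
<?php
// Invented sample page text with two links: one double-quoted absolute URL,
// one single-quoted relative URL with extra attributes and an anchor.
$strPageText = '<p><a href="http://example.com/a.html">A</a> and ' .
               "<A class='x' HREF='/b.html#top'>B</A></p>";

preg_match_all(
    "/<A[ \r\n\t]{1}[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>/isU",
    $strPageText,
    $aUrls);

// $aUrls[0] holds the full matched tags, $aUrls[1] the captured links.
foreach ($aUrls[1] as $strUrl)
    echo "$strUrl\n";
```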


  • 3. Creating the class to collect all the links in the given web site starting from given URL.

class CLinkScanner
{
    var $aUrlsToProcess = array();
    /* $aUrlsToProcess is an associative array of URLs not yet scanned for links.
        If $link is to be processed, $aUrlsToProcess[$link] = true */

    var $aProcessedUrls = array();
    /* $aProcessedUrls is an associative array of URLs already scanned for links.
        If $url is already processed, $aProcessedUrls[$url] = true */

    var $strSiteBaseUrl;
    /* The algorithm won't process URLs which don't begin with $strSiteBaseUrl. */

    /*
        Function RetrieveLinks scans $strPageText for links.
        If new links are found, they are added to $aUrlsToProcess.
    */
    function RetrieveLinks($strPageText, $strBaseUrl)
    {
        preg_match_all(
            "/<A[ \r\n\t]{1}[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>/isU",
            $strPageText,
            $aUrls);
        foreach ($aUrls[1] as $strUrl)
        {
            $strUrl = trim($strUrl);
            // skipping email addresses
            if (substr($strUrl, 0, 7) == "mailto:") continue;
            // skipping javascript code
            if (substr($strUrl, 0, 11) == "javascript:") continue;
            // if $strUrl is not in the canonical form, prepending the current web page URL
            if (substr($strUrl, 0, 7) != "http://")
            {
                if ($strBaseUrl[strlen($strBaseUrl)-1] != '/' && $strUrl[0] != '/')
                    $strUrl = $strBaseUrl.'/'.$strUrl;
                else
                    $strUrl = $strBaseUrl.$strUrl;
            }
            /* If $strUrl points outside of the web site, skip it. */
            if (strlen($strUrl) < strlen($this->strSiteBaseUrl) ||
                substr($strUrl, 0, strlen($this->strSiteBaseUrl)) !=
                    $this->strSiteBaseUrl) continue;

            /* If web page $strUrl has not been scanned for links yet, adding
                it to the list of not yet processed URLs. */
            if (isset($this->aProcessedUrls[$strUrl]) == false)
                $this->aUrlsToProcess[$strUrl] = true;
        }
    }


    
    /* Now, creating a function which will repeatedly call
        RetrieveLinks until the list of URLs to be processed is empty. */
    function Start()
    {
        do
        {
            // getting the first URL from the list of URLs to be processed
            reset($this->aUrlsToProcess);
            $strUrl = key($this->aUrlsToProcess);
            // removing that URL from the list of URLs to be processed
            unset($this->aUrlsToProcess[$strUrl]);
            // adding that URL to the list of already processed URLs
            $this->aProcessedUrls[$strUrl] = true;

            /* Here using the CDWHttpFile class to retrieve the web page with URL $strUrl.
                You can see the CDWHttpFile source code in the article
                Retrieving web page contents handling HTTP redirects. */
            $httpFile = new CDWHttpFile($strUrl);
            if ($httpFile->bResult == true) // if the web page is retrieved
            {
                /* In case we got to another URL because of an HTTP redirect,
                adding the new URL to the list of processed URLs, and removing it
                (if it exists there) from the list of URLs to be processed. */
                $strUrl = $httpFile->strLocation;
                $this->aProcessedUrls[$strUrl] = true;
                unset($this->aUrlsToProcess[$strUrl]);
                // Finally, retrieving links
                $this->RetrieveLinks($httpFile->strFile, $httpFile->strLocation);
            }
        // Repeating until the list of URLs to be processed is empty.
        } while (count($this->aUrlsToProcess) != 0);
    }
    
    
    /* Finishing up, writing a function which will start the whole process. */
    function Process($strBaseUrl, $strEntryUrl) // starting from $strEntryUrl
    {
        $this->strSiteBaseUrl = $strBaseUrl;
        // Adding the entry point to the list of URLs to be processed.
        $this->aUrlsToProcess[$strEntryUrl] = true;
        $this->Start(); // Starting the link retrieval process.
    }
}
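
The two-list bookkeeping used by the class can be tried without any network access. The sketch below is not the class itself: crawlInMemory and the $aSite array are hypothetical stand-ins, where each "page" is simply a list of the links it contains, so no CDWHttpFile or regular expression is involved.

```php
<?php
// Hypothetical in-memory "site": URL => list of links found on that page.
$aSite = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/b', 'http://other.example.org/'),
    'http://example.com/b' => array('http://example.com/'),
);

// Same worklist idea as CLinkScanner::Start, with page retrieval replaced
// by an array lookup.
function crawlInMemory($aSite, $strSiteBaseUrl, $strEntryUrl)
{
    $aUrlsToProcess = array($strEntryUrl => true);
    $aProcessedUrls = array();
    while (count($aUrlsToProcess) != 0)
    {
        // take the first queued URL and move it to the processed list
        reset($aUrlsToProcess);
        $strUrl = key($aUrlsToProcess);
        unset($aUrlsToProcess[$strUrl]);
        $aProcessedUrls[$strUrl] = true;

        $aLinks = isset($aSite[$strUrl]) ? $aSite[$strUrl] : array();
        foreach ($aLinks as $strLink)
        {
            // skip links pointing outside of the site
            if (substr($strLink, 0, strlen($strSiteBaseUrl)) != $strSiteBaseUrl)
                continue;
            // queue only pages which have not been processed yet
            if (!isset($aProcessedUrls[$strLink]))
                $aUrlsToProcess[$strLink] = true;
        }
    }
    return array_keys($aProcessedUrls);
}

$aFound = crawlInMemory($aSite, 'http://example.com/', 'http://example.com/');
```

Because both lists are keyed by URL, a page that is linked from several places is queued and processed only once, and the cycle between the pages does not cause an endless loop.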


  • 4. Usage.

Now all we have to do is to execute the following code:

    $linkScanner = new CLinkScanner();
    $linkScanner->Process(
        "http://www.digiways.com",
        "http://www.digiways.com/articles/");
    foreach ($linkScanner->aProcessedUrls as $strUrl => $bTrue)
        echo "$strUrl<br>";

which will retrieve all the links from the web site http://www.digiways.com starting from http://www.digiways.com/articles/ and then will print them.

  • 5. Notes.

There are a number of things to take care of before using this code.
    • One can easily create a web site with a virtually infinite number of web pages and working links using the Apache module mod_rewrite. (For instance, this web site doesn't have all the directories you see in the URL. All the pages are generated by a single index.php script.) So, in the do/while loop of the function CLinkScanner::Start() we have to replace the while condition with

          
      while (count($this->aUrlsToProcess) != 0 &&
             (count($this->aUrlsToProcess) +
              count($this->aProcessedUrls) <
                  SOME_CONSTANT));

    • This code doesn't scan image tags. To get the list of all images used in the web site we have to add the member variable var $aImages and the function CLinkScanner::RetreiveImages to the class CLinkScanner, and call it from the function CLinkScanner::Start just before or right after the function CLinkScanner::RetrieveLinks.

      function RetreiveImages($strPageText, $strBaseUrl)
      {
          preg_match_all(
              "/<IMG[ \r\n\t]{1}[^>]*SRC[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[^>]*>/si",
              $strPageText,
              $aImgs);
          foreach ($aImgs[1] as $strImg)
          {
              $strImg = trim($strImg);
              // if $strImg is not in the canonical form, prepending the current web page URL
              if (substr($strImg, 0, 7) != "http://")
              {
                  if ($strBaseUrl[strlen($strBaseUrl)-1] != '/' && $strImg[0] != '/')
                      $strImg = $strBaseUrl.'/'.$strImg;
                  else
                      $strImg = $strBaseUrl.$strImg;
              }
              /* If $strImg points outside of the web site, skip it. */
              if (strlen($strImg) < strlen($this->strSiteBaseUrl) ||
                  substr($strImg, 0, strlen($this->strSiteBaseUrl)) !=
                      $this->strSiteBaseUrl) continue;

              /* Adding $strImg to the list of images. */
              $this->aImages[$strImg] = true;
          }
      }

    • Running this script on anything bigger than a site with a few pages may take a considerable amount of time, so you will probably have to put set_time_limit(SOME_CONSTANT); somewhere inside the loop in CLinkScanner::Start.

    • Here's the regular expression to retrieve links and their associated text from between <a ...> and </a> tags.

      preg_match_all(
          "/<A[ \r\n\t]{1}[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>(.*)<\/a[ \r\n\t]*>/isU",
          $strPageText,
          $aUrls);

      Now $aUrls[1] is the array of all the links and $aUrls[2] is the array of link names. You can output all that using

      for ($i = 0; $i < count($aUrls[1]); $i++)
          echo "URL: ".$aUrls[1][$i]."  Name: ".
              htmlspecialchars($aUrls[2][$i])."<br>";
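
The link-and-name expression from the last note can be tried on a small invented fragment. The page text below is a made-up sample, and collecting the result into $aNamesByUrl with strip_tags is just one possible way to use it, not part of the original code.

```php
<?php
// Invented sample: the first link name contains nested markup.
$strPageText = '<a href="/one.html">First <b>page</b></a> ' .
               '<a href="/two.html" title="t">Second</a>';

preg_match_all(
    "/<A[ \r\n\t]{1}[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>(.*)<\/a[ \r\n\t]*>/isU",
    $strPageText,
    $aUrls);

// Building a map: link => link name with any nested tags removed.
$aNamesByUrl = array();
for ($i = 0; $i < count($aUrls[1]); $i++)
    $aNamesByUrl[$aUrls[1][$i]] = strip_tags($aUrls[2][$i]);
```

The U modifier makes (.*) stop at the nearest closing </a>, so each link gets only its own name even when several links sit on one line.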




Copyright © Val Samko, DigiWays. Written by Valentin Samko mailto:val@digiways.com