
Retrieving web page contents with HTTP headers and handling redirects

  • 1. When do we need it and what for.

When retrieving web page contents, the server may sometimes redirect us to another URL. The most common case is requesting a directory such as http://www.somethingunexistant.com/somedirectory without a trailing slash. When we request a web page using that URL, the server redirects us to the "proper" URL, which in this case is http://www.somethingunexistant.com/somedirectory/ .

There are several ways to get web page contents using PHP. One way is to use the file($url) function to read the entire web page into an array of lines. Another is to use a construction such as

$buffer = "";
$fd = fopen($url, "rb");
while (!feof($fd))
    $buffer .= fgets($fd, 4096);
fclose($fd);

Both approaches share the same two problems.

  1. Versions prior to PHP 4.0.5 do not handle HTTP redirects. Because of this, directory URLs must include the trailing slash. If the trailing slash is missing, you won't get anything at all.
  2. Neither method returns the HTTP headers sent by the server.

  • 2. How do we do it.

Another way to retrieve a web page is to connect directly to the specified (or default) port of the web server using the socket functions, send an HTTP request, and receive all the data, including the HTTP headers and the web page itself. That way we can handle HTTP redirects ourselves.
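Before looking at the full class, the two halves of that exchange can be sketched without any network access. This is only an illustration written for a current PHP release: the helper names buildGetRequest and splitHttpResponse are hypothetical, the response text is canned, and the Host header is an addition that the class in this article does not send.

```php
<?php
// Hypothetical helper: builds a minimal HTTP/1.0 GET request for $strUrl.
function buildGetRequest($strUrl)
{
    $url = parse_url($strUrl);
    $strPath = isset($url["path"]) ? $url["path"] : "/";
    if (isset($url["query"])) {
        $strPath .= "?" . $url["query"];
    }
    // Assumption: we add a Host header. HTTP/1.0 does not require it, but
    // servers hosting several sites on one address rely on it.
    return "GET " . $strPath . " HTTP/1.0\r\n"
         . "Host: " . $url["host"] . "\r\n\r\n";
}

// Hypothetical helper: splits a raw response into header lines and body.
function splitHttpResponse($strRaw)
{
    // The header ends at the first empty line (CRLF CRLF).
    $aParts = explode("\r\n\r\n", $strRaw, 2);
    $aHeaderLines = explode("\r\n", $aParts[0]);
    $strBody = isset($aParts[1]) ? $aParts[1] : "";
    return array($aHeaderLines, $strBody);
}

// Canned response standing in for what a server might send when the
// directory URL is requested without the trailing slash.
$strRaw = "HTTP/1.0 301 Moved Permanently\r\n"
        . "Location: http://www.somethingunexistant.com/somedirectory/\r\n"
        . "\r\n"
        . "<html>moved</html>";

list($aHeaderLines, $strBody) = splitHttpResponse($strRaw);
echo buildGetRequest("http://www.somethingunexistant.com/somedirectory");
echo $aHeaderLines[1], "\n";
```

The second header line of the canned response is the Location: directive that the implementation below looks for.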

  • 3. Implementation.

class CDWHttpFile
{
    /* $strLocation - URL of the last web page retrieved (could be different
        from what was requested in case of HTTP redirect.) */
    var $strLocation;
    var $aHeaderLines; // headers of last web page
    var $strFile; // last web page retrieved
    /* $bResult - contains true if last web page was
        retrieved successfully, false otherwise. */
    var $bResult;

    /* ReadHttpFile - the function that does all the work.
        $strUrl - URL of the page we want to get.
        $iHttpRedirectMaxRecursiveCalls - maximum number of
        HTTP redirections to follow. */
    function ReadHttpFile($strUrl, $iHttpRedirectMaxRecursiveCalls = 20)
    {
        // parsing the URL to get the web server name/IP, path and port.
        $url = parse_url($strUrl);
        // setting path to "/" if not present in $strUrl
        if (isset($url["path"]) == false) $url["path"] = "/";
        // setting port to default HTTP server port 80
        if (isset($url["port"]) == false) $url["port"] = 80;
        // connecting to the server
        $fp = fsockopen($url["host"], $url["port"], $errno, $errstr, 30);

        // resetting class data
        $this->bResult = false;
        $this->strFile = "";
        $this->aHeaderLines = array();
        $this->strLocation = $strUrl;

        /* Return if the socket could not be opened;
            $this->bResult stays set to false. */
        if (!$fp)
            return;
        else
        {
            // composing HTTP request
            $strQuery = "GET ".$url["path"];
            if (isset($url["query"]) == true) $strQuery .= "?".$url["query"];
            $strQuery .= " HTTP/1.0\r\n\r\n";
            // sending the request to the server
            fputs($fp, $strQuery);
            /* $bHeader is set to true while we receive the HTTP header
            and after the empty line (end of HTTP header) it's set to false. */
            $bHeader = true;
            // continuing until there's no more text to read from the socket
            while (!feof($fp))
            {
                /* reading a line of text from the socket,
                    not more than 8192 characters. */
                $strLine = fgets($fp, 8192);
                // removing trailing \n and \r characters.
                $strLine = preg_replace("/[\r\n]/", "", $strLine);
                if ($bHeader == false)
                    $this->strFile .= $strLine."\n";
                else
                    $this->aHeaderLines[] = trim($strLine);
                if (strlen($strLine) == 0) $bHeader = false;
            }
            fclose($fp);
        }

        /* Processing all HTTP header lines and checking for
            HTTP redirect directive 'Location:'. */
        for ($i = 0; $i < count($this->aHeaderLines); $i++)
            if (strcasecmp(substr($this->aHeaderLines[$i], 0, 9), "Location:") == 0)
            {
                $url = trim(substr($this->aHeaderLines[$i], 9));
                // $url now is the URL of the web page we are redirected to
                // If $url is the same page we are requesting, just continue
                if ($url != $strUrl)
                {
                    /* If the maximum number of redirects is reached,
                        just return. $this->bResult is set to false. */
                    if ($iHttpRedirectMaxRecursiveCalls == 0) return;
                    /* Calling the function recursively with the new URL
                    and the maximum number of redirections reduced by one. */
                    return $this->ReadHttpFile(
                                $url,
                                $iHttpRedirectMaxRecursiveCalls - 1);
                }
            }

        /* We get here if no HTTP redirect directive was found.
            Setting $this->bResult to true. Web page was retrieved successfully. */
        $this->bResult = true;

        /* If magic_quotes_runtime is enabled in php.ini, then all the quotes
            in the received text will be prefixed with slashes. */
        if (ini_get("magic_quotes_runtime"))
        {
            $this->strFile = stripslashes($this->strFile);
            for ($i = 0; $i < count($this->aHeaderLines); $i++)
                $this->aHeaderLines[$i] = stripslashes($this->aHeaderLines[$i]);
        }
    }

    /* Just to make this class easier to use, adding a constructor
        which accepts a URL as a parameter and calls the ReadHttpFile function. */
    function CDWHttpFile($strUrl = "")
    {
        if (strlen($strUrl) > 0)
            $this->ReadHttpFile($strUrl);
    }
};
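
The class passes the Location: value straight back to ReadHttpFile, so it expects the server to send an absolute URL. Some servers send only a path. A rough sketch of how such a value could be turned into an absolute URL first is shown here. The helper name resolveLocation is hypothetical, the code assumes a current PHP release, and it deliberately ignores "../" segments and values starting with "//".

```php
<?php
// Hypothetical helper: turns a Location value into an absolute URL,
// using the URL that was just requested as the base.
function resolveLocation($strBaseUrl, $strLocation)
{
    // Already absolute (scheme://...): use it as is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $strLocation)) {
        return $strLocation;
    }
    $base = parse_url($strBaseUrl);
    $strRoot = $base["scheme"] . "://" . $base["host"];
    if (isset($base["port"])) {
        $strRoot .= ":" . $base["port"];
    }
    // Absolute path on the same server.
    if (substr($strLocation, 0, 1) == "/") {
        return $strRoot . $strLocation;
    }
    // Relative path: keep the base path up to and including its last slash.
    $strPath = isset($base["path"]) ? $base["path"] : "/";
    $strDir = substr($strPath, 0, strrpos($strPath, "/") + 1);
    return $strRoot . $strDir . $strLocation;
}

echo resolveLocation("http://www.somethingunexistant.com/somedirectory",
                     "/somedirectory/"), "\n";
```

One possible use is to call it on the trimmed Location: value just before the recursive call, passing $strUrl as the base.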

  • 4. Usage.

    $httpFile = new CDWHttpFile("http://www.digiways.com/arcicles/");
    if ($httpFile->bResult == true)
    {
        echo "URL: $httpFile->strLocation <br>";
        foreach ($httpFile->aHeaderLines as $strHeaderLine)
            echo "Header line: ".htmlspecialchars($strHeaderLine)."<br>";
        echo "Contents: <hr>".htmlspecialchars($httpFile->strFile)."<hr>";
    }
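
Note that $bResult only tells us that a response without a redirect was read. It is also true for an error page such as "404 Not Found". If that matters, the status code can be read from the first header line. The sketch below assumes a current PHP release, and the helper name httpStatusCode is hypothetical.

```php
<?php
// Hypothetical helper: extracts the numeric status code from the first
// header line (for example "HTTP/1.0 404 Not Found").
// Returns 0 if there is no recognizable status line.
function httpStatusCode($aHeaderLines)
{
    if (count($aHeaderLines) == 0) {
        return 0;
    }
    if (preg_match('#^HTTP/\d+\.\d+\s+(\d{3})#', $aHeaderLines[0], $aMatch)) {
        return (int) $aMatch[1];
    }
    return 0;
}

// Sample header lines in the form the class stores them in $aHeaderLines.
$aSampleHeaders = array("HTTP/1.0 404 Not Found", "Content-Type: text/html");
echo httpStatusCode($aSampleHeaders), "\n";
```

With the class above, the check could be combined with $bResult, for example by requiring httpStatusCode($httpFile->aHeaderLines) == 200 before printing the contents.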

  • 5. Notes.

    • This code won't work properly if the web page contains lines of text longer than 8192 characters. To fix that, either increase that number or use the fread function instead of fgets. In the latter case we have to split the retrieved text into lines ourselves.
    • When composing the HTTP request we don't add the login/password part. If you intend to work with password-protected pages, modify that part accordingly.
    • If you are using PHP 4.0.5 or later and you don't need the HTTP headers, don't use this code. Always remember that one built-in PHP function works much faster than a set of different PHP functions you have to call to get the same result.
    • This code is a good starting point if you want to write a web page retriever that also submits some data to the server using the POST method.
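
The fread variant mentioned in the first note might look roughly like this. It is a sketch for a current PHP release, not a drop-in replacement. The helper name readWholeStream is hypothetical, and an in-memory stream stands in for the socket so that the example runs without a network connection.

```php
<?php
// Hypothetical helper: reads everything from an open stream in fixed-size
// chunks, so the length of individual lines no longer matters.
function readWholeStream($fp, $iChunkSize = 8192)
{
    $strData = "";
    while (!feof($fp)) {
        $strChunk = fread($fp, $iChunkSize);
        if ($strChunk === false) {
            break; // read error: stop with what we have
        }
        $strData .= $strChunk;
    }
    return $strData;
}

// An in-memory stream stands in for the socket returned by fsockopen().
// The first body line is deliberately longer than 8192 characters.
$fp = fopen("php://memory", "w+b");
fwrite($fp, "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
          . str_repeat("x", 20000) . "\nsecond line\n");
rewind($fp);
$strData = readWholeStream($fp);
fclose($fp);

// Splitting the header from the body, and both into lines, ourselves.
list($strHeader, $strBody) = explode("\r\n\r\n", $strData, 2);
$aHeaderLines = explode("\r\n", $strHeader);
$aBodyLines = explode("\n", $strBody);
echo count($aHeaderLines), " header lines, first body line has ",
     strlen($aBodyLines[0]), " characters\n";
```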



Copyright © Val Samko, DigiWays. Written by Valentin Samko (val@digiways.com).