Retrieving web page contents with HTTP headers, handling redirects
1. When do we need it and what for.
When retrieving web page contents, the server may sometimes redirect us to another
URL. The most common case is opening http://www.somethingunexistant.com/somedirectory
without a trailing slash. When we request a web page using that URL, the server redirects
us to the "proper" URL, which in this case is http://www.somethingunexistant.com/somedirectory/ .
There are several ways to get web page contents using PHP. One way is to use the
file(URL); function
to read the entire web page into an array of lines; another is to use the
$buffer = ""; $fd = fopen($url, "rb"); while (!feof($fd)) $buffer .= fgets($fd, 4096); fclose($fd);
construction.
Both approaches share the same two problems.
- PHP versions prior to 4.0.5 do not handle HTTP redirects. Because of this, directory URLs must include the trailing slash.
If a directory URL doesn't include the trailing slash, you won't get anything at all.
- Neither method returns the HTTP headers sent by the server.
Another way to retrieve a web page is to connect directly to the specified (or default) port of the web server
using socket functions, send an HTTP request, and receive all the data, including the HTTP headers and
the web page itself. That way we can handle HTTP redirects ourselves.
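As a rough illustration of the idea, the sketch below assumes the whole response has already been read into a string. It splits the response into header lines and body and looks for a Location header. The helper names split_http_response and find_location are hypothetical and are not part of the class that follows.

```php
<?php
// Hypothetical helper: splits a raw HTTP response into header lines and body.
function split_http_response($strResponse)
{
    // The header ends at the first empty line (CRLF CRLF, or LF LF from lax servers).
    $parts = preg_split("/\r?\n\r?\n/", $strResponse, 2);
    $aHeaderLines = preg_split("/\r?\n/", $parts[0]);
    $strBody = isset($parts[1]) ? $parts[1] : "";
    return array($aHeaderLines, $strBody);
}

// Hypothetical helper: returns the value of the Location header, or null if absent.
function find_location($aHeaderLines)
{
    foreach ($aHeaderLines as $strLine) {
        if (strncasecmp($strLine, "Location:", 9) == 0) {
            return trim(substr($strLine, 9));
        }
    }
    return null;
}

// A canned redirect response, as a server might send it for a missing trailing slash.
$strResponse = "HTTP/1.0 301 Moved Permanently\r\n"
             . "Location: http://www.somethingunexistant.com/somedirectory/\r\n"
             . "\r\n"
             . "Moved";
list($aHeaderLines, $strBody) = split_http_response($strResponse);
$strLocation = find_location($aHeaderLines);
```

If $strLocation is not null, the same request would be repeated against that URL.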
class CDWHttpFile
{
    /* $strLocation - URL of the last web page retrieved (could be different
       from what was requested in case of an HTTP redirect). */
    var $strLocation;
    var $aHeaderLines;  // headers of the last web page
    var $strFile;       // last web page retrieved
    /* $bResult - contains true if the last web page was retrieved
       successfully, false otherwise. */
    var $bResult;

    /* ReadHttpFile - the function that does all the work.
       $strUrl - URL of the page we want to get.
       $iHttpRedirectMaxRecursiveCalls - maximum number of HTTP
       redirects to follow. */
    function ReadHttpFile($strUrl, $iHttpRedirectMaxRecursiveCalls = 20)
    {
        // parsing the URL to get the web server name/IP, path and port
        $url = parse_url($strUrl);
        // setting path to "/" if not present in $strUrl
        if (isset($url["path"]) == false) $url["path"] = "/";
        // setting port to the default HTTP server port 80
        if (isset($url["port"]) == false) $url["port"] = 80;
        // connecting to the server
        $fp = fsockopen($url["host"], $url["port"], $errno, $errstr, 30);

        // resetting class data (empty values instead of unset(), so that
        // the later .= and count() calls always have something to work on)
        $this->bResult = false;
        $this->strFile = "";
        $this->aHeaderLines = array();
        $this->strLocation = $strUrl;

        /* Return if the socket was not opened;
           $this->bResult is already set to false. */
        if (!$fp) return;

        // composing the HTTP request
        $strQuery = "GET ".$url["path"];
        if (isset($url["query"]) == true) $strQuery .= "?".$url["query"];
        $strQuery .= " HTTP/1.0\r\n";
        // Host header, needed by servers that host several sites on one IP
        $strQuery .= "Host: ".$url["host"]."\r\n\r\n";
        // sending the request to the server
        fputs($fp, $strQuery);
        /* $bHeader is true while we receive the HTTP header; after the
           empty line (end of the HTTP header) it is set to false. */
        $bHeader = true;
        // continuing until there is no more text to read from the socket
        while (!feof($fp)) {
            // reading a line of text from the socket, at most 8192 characters
            $strLine = fgets($fp, 8192);
            if ($strLine === false) break;
            // removing \r and \n characters (ereg_replace no longer exists in PHP 7+)
            $strLine = preg_replace("/[\r\n]/", "", $strLine);
            if ($bHeader == false)
                $this->strFile .= $strLine."\n";
            else if (strlen($strLine) == 0)
                $bHeader = false;
            else
                $this->aHeaderLines[] = trim($strLine);
        }
        fclose($fp);

        /* Processing all HTTP header lines and checking for the
           HTTP redirect directive 'Location:'. */
        for ($i = 0; $i < count($this->aHeaderLines); $i++)
            if (strcasecmp(substr($this->aHeaderLines[$i], 0, 9), "Location:") == 0) {
                // $url is now the URL of the web page we are redirected to
                $url = trim(substr($this->aHeaderLines[$i], 9));
                // If $url is the same page we are requesting, just continue
                if ($url != $strUrl) {
                    /* If the maximum number of redirects is reached, just
                       return. $this->bResult is set to false. */
                    if ($iHttpRedirectMaxRecursiveCalls == 0) return;
                    /* Calling the function recursively with the new URL and
                       the maximum number of redirects reduced by one. */
                    return $this->ReadHttpFile($url, $iHttpRedirectMaxRecursiveCalls - 1);
                }
            }

        /* We get here if no HTTP redirect directive was found.
           The web page was retrieved successfully. */
        $this->bResult = true;
        /* If magic_quotes_runtime is enabled in php.ini, all the quotes
           in the received text are prefixed with slashes. */
        if (ini_get("magic_quotes_runtime")) {
            $this->strFile = stripslashes($this->strFile);
            for ($i = 0; $i < count($this->aHeaderLines); $i++)
                $this->aHeaderLines[$i] = stripslashes($this->aHeaderLines[$i]);
        }
    }

    /* Just to make this class easier to use, adding a constructor which
       accepts a URL as a parameter and calls ReadHttpFile. */
    function __construct($strUrl = "")
    {
        if (strlen($strUrl) > 0) $this->ReadHttpFile($strUrl);
    }

    // PHP 4 style constructor, kept so the class still works on old versions.
    function CDWHttpFile($strUrl = "")
    {
        $this->__construct($strUrl);
    }
}
$httpFile = new CDWHttpFile("http://www.digiways.com/arcicles/");
if ($httpFile->bResult == true) {
    echo "URL: $httpFile->strLocation <br>";
    foreach ($httpFile->aHeaderLines as $strHeaderLine)
        echo "Header line: ".htmlspecialchars($strHeaderLine)."<br>";
    echo "Contents: <hr>".htmlspecialchars($httpFile->strFile)."<hr>";
}
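The redirect handling in ReadHttpFile is recursive. The same logic can also be written as a loop, which makes the redirect limit easier to see. The sketch below is only an illustration: follow_redirects is a hypothetical function, $fnFetch is an assumed callable that stands in for the socket code and returns array($aHeaderLines, $strBody), and the example.com URLs are placeholders.

```php
<?php
/*
 * Hypothetical sketch: follows redirects in a loop instead of recursion.
 * Returns array($strFinalUrl, $aHeaderLines, $strBody), or false if the
 * redirect limit was reached.
 */
function follow_redirects($strUrl, $fnFetch, $iMaxRedirects = 20)
{
    // One initial request plus at most $iMaxRedirects followed redirects.
    for ($i = 0; $i <= $iMaxRedirects; $i++) {
        list($aHeaderLines, $strBody) = $fnFetch($strUrl);
        $strNext = null;
        foreach ($aHeaderLines as $strLine) {
            if (strncasecmp($strLine, "Location:", 9) == 0) {
                $strNext = trim(substr($strLine, 9));
                break;
            }
        }
        // No redirect, or a redirect to the same page: we are done.
        if ($strNext === null || $strNext == $strUrl) {
            return array($strUrl, $aHeaderLines, $strBody);
        }
        $strUrl = $strNext;
    }
    // Redirect limit reached.
    return false;
}

// Fake fetcher for illustration: a fixed table of canned responses.
$aPages = array(
    "http://example.com/dir"  => array(
        array("HTTP/1.0 301 Moved Permanently", "Location: http://example.com/dir/"),
        ""
    ),
    "http://example.com/dir/" => array(array("HTTP/1.0 200 OK"), "Hello"),
);
$fnFetch = function ($strUrl) use ($aPages) {
    return $aPages[$strUrl];
};

$aResult = follow_redirects("http://example.com/dir", $fnFetch);
$bLimited = follow_redirects("http://example.com/dir", $fnFetch, 0);
```

With the default limit the call ends on the URL with the trailing slash; with a limit of 0 the first redirect already exhausts the budget and the function returns false.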
- This code won't work properly if the web page contains lines of text longer than
8192 characters. To fix that, we either have to increase that number or use the
fread function instead of
fgets, but in that case we will have to split the retrieved text into lines ourselves.
- When composing the HTTP request, we don't add the login/password part. If you intend to work with
password-protected pages, just modify that part.
- If you are not using a PHP version prior to 4.0.5 and you don't need the HTTP headers, don't use this code.
Always remember: one built-in PHP function works much faster than a set of different PHP functions you have
to call to get the same result.
- This code is a good start if you want to write a web page retriever which also submits some data to the server
using the POST method.
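For the first note, reading with fread and splitting the text afterwards could look roughly like this. The helper name read_all_lines is hypothetical, and an in-memory stream is used in place of the socket so the sketch runs without a network connection.

```php
<?php
/*
 * Hypothetical helper: reads everything from an open stream with fread()
 * in fixed-size chunks and splits the result into lines afterwards, so
 * the line length is no longer limited by the read size.
 */
function read_all_lines($fp, $iChunkSize = 8192)
{
    $strData = "";
    while (!feof($fp)) {
        $strChunk = fread($fp, $iChunkSize);
        if ($strChunk === false) {
            break;
        }
        $strData .= $strChunk;
    }
    // Accept both \r\n and bare \n as line terminators.
    return preg_split("/\r?\n/", $strData);
}

// Demonstration on an in-memory stream instead of a socket.
$fp = fopen("php://memory", "w+b");
fwrite($fp, "short line\r\n" . str_repeat("x", 20000) . "\r\nlast");
rewind($fp);
$aLines = read_all_lines($fp, 4096);
fclose($fp);
```

The 20000-character line comes back as a single array element even though it spans several reads.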
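For the login/password and POST notes, one possible way to compose the request is sketched below. It assumes HTTP Basic authentication with the credentials taken from the URL, and a form-encoded POST body. The function build_http_request and the example.com URLs are hypothetical; percent-encoded credentials and non-default ports in the Host header are not handled.

```php
<?php
/*
 * Hypothetical helper: composes an HTTP/1.0 request string.
 * - If the URL carries user:password, a Basic Authorization header is added.
 * - If $aPostFields is an array, the request becomes a form-encoded POST.
 */
function build_http_request($strUrl, $aPostFields = null)
{
    $url = parse_url($strUrl);
    $strPath = isset($url["path"]) ? $url["path"] : "/";
    if (isset($url["query"])) {
        $strPath .= "?" . $url["query"];
    }
    $strMethod = is_array($aPostFields) ? "POST" : "GET";
    $strRequest = "$strMethod $strPath HTTP/1.0\r\n";
    $strRequest .= "Host: " . $url["host"] . "\r\n";
    if (isset($url["user"])) {
        $strPass = isset($url["pass"]) ? $url["pass"] : "";
        $strRequest .= "Authorization: Basic "
            . base64_encode($url["user"] . ":" . $strPass) . "\r\n";
    }
    $strBody = "";
    if (is_array($aPostFields)) {
        // Explicit "&" separator so the result does not depend on php.ini.
        $strBody = http_build_query($aPostFields, "", "&");
        $strRequest .= "Content-Type: application/x-www-form-urlencoded\r\n";
        $strRequest .= "Content-Length: " . strlen($strBody) . "\r\n";
    }
    // An empty line ends the header; the body (if any) follows it.
    return $strRequest . "\r\n" . $strBody;
}

$strGet  = build_http_request("http://user:secret@example.com/private/");
$strPost = build_http_request("http://example.com/form.php",
                              array("name" => "Val", "q" => "a b"));
```

The resulting string would be sent with fputs in place of $strQuery; the response is read exactly as before.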
Copyright © Val Samko, DigiWays. Written by Valentin Samko mailto:val@digiways.com