HTTP Made Really Easy
December 10, 2012: Updated the links about robots.
HTTP is the network protocol of the Web. It is both simple and powerful. Knowing HTTP enables you to write Web browsers, Web servers, automatic page downloaders, link-checkers, and other useful tools.
This tutorial explains the simple, English-based structure of HTTP communication, and teaches you the practical details of writing HTTP clients and servers. It assumes you know basic socket programming. HTTP is simple enough for a beginning sockets programmer, so this page might be a good followup to a sockets tutorial. This Sockets FAQ focuses on C, but the underlying concepts are language-independent.
Since you're reading this, you probably already use CGI. If not, it makes sense to learn that first.
The whole tutorial is about 15 printed pages long, including examples. The first half explains basic HTTP 1.0, and the second half explains the new requirements and features of HTTP 1.1. This tutorial doesn't cover everything about HTTP; it explains the basic framework, how to comply with the requirements, and where to find out more when you need it. If you plan to use HTTP extensively, you should read the specification as well; see the end of this document for more details.
Before getting started, understand the following two paragraphs:
Writing HTTP or other network programs requires more care than programming for a single machine. Of course, you have to follow standards, or no one will understand you. But even more important is the burden you place on other machines. Write a bad program for your own machine, and you waste your own resources (CPU time, bandwidth, memory). Write a bad network program, and you waste other people's resources. Write a really bad network program, and you waste many thousands of people's resources at the same time. Sloppy and malicious network programming forces network standards to be modified, made safer but less efficient. So be careful, respectful, and cooperative, for everyone's sake.
In particular, don't be tempted to write programs that automatically follow Web links (called robots or spiders) before you really know what you're doing. They can be useful, but a badly-written robot is one of the worst kinds of programs on the Web, blindly following a rapidly increasing number of links and quickly draining server resources. If you plan to write anything like a robot, please read more about them. There may already be a working program to do what you want. If you really need to write your own, please support the robots.txt de-facto standard.
Table of Contents
Using HTTP 1.0
    What is HTTP?
        What are Resources?
    Structure of HTTP Transactions
        Initial Request Line
        Initial Response Line (Status Line)
        Header Lines
        The Message Body
    Sample HTTP Exchange
    Other HTTP Methods, Like HEAD and POST
        The HEAD Method
        The POST Method
    HTTP Proxies
    Being Tolerant of Others
    Conclusion
Upgrading to HTTP 1.1
    HTTP 1.1
    HTTP 1.1 Clients
        Host: Header
        Chunked Transfer-Encoding
        Persistent Connections and the Connection: close Header
        The 100 Continue Response
    HTTP 1.1 Servers
        Requiring the Host: Header
        Accepting Absolute URLs
        Chunked Transfer-Encoding
        Persistent Connections and the Connection: close Header
        Using the 100 Continue Response
        The Date: Header
        Handling Requests with If-Modified-Since: or If-Unmodified-Since: Headers
        Supporting the GET and HEAD methods
        Supporting HTTP 1.0 Requests
Appendix
    The HTTP Specification
Several related topics are discussed on a footnotes page:
    Sample HTTP Client
    Using GET to Submit Query or Form Data
    URL-encoding
    Manually Experimenting with HTTP
HTTP stands for Hypertext Transfer Protocol. It's the network protocol used to deliver virtually all files and other data (collectively called resources) on the World Wide Web, whether they're HTML files, image files, query results, or anything else. Usually, HTTP takes place through TCP/IP sockets (and this tutorial ignores other possibilities).
A browser is an HTTP client because it sends requests to an HTTP server (Web server), which then sends responses back to the client. The standard (and default) port for HTTP servers to listen on is 80, though they can use any port.
What are Resources?
HTTP is used to transmit resources, not just files. A resource is some chunk of information that can be identified by a URL (it's the R in URL). The most common kind of resource is a file, but a resource may also be a dynamically-generated query result, the output of a CGI script, a document that is available in several languages, or something else.
While learning HTTP, it may help to think of a resource as similar to a file, but more general. As a practical matter, almost all HTTP resources are currently either files or server-side script output.
Like most network protocols, HTTP uses the client-server model: An HTTP client opens a connection and sends a request message to an HTTP server; the server then returns a response message, usually containing the resource that was requested. After delivering the response, the server closes the connection (making HTTP a stateless protocol, i.e. not maintaining any connection information between transactions).
The formats of the request and response messages are similar, and English-oriented. Both kinds of messages consist of:
    an initial line,
    zero or more header lines,
    a blank line (i.e. a CRLF by itself), and
    an optional message body (e.g. a file, or query data, or query output).
Put another way, the format of an HTTP message is:
    initial line, different for request vs. response
    Header1: value1
    Header2: value2
    Header3: value3

    optional message body goes here, like file contents or query data;
    it can be many lines long, or even binary data $&*%@!^$@
Initial lines and headers should end in CRLF, though you should gracefully handle lines ending in just LF. (More exactly, CR and LF here mean ASCII values 13 and 10, even though some platforms may use different characters.)
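The message layout above can be sketched in a few lines of Python. This is only an illustration, not part of any specification: the function name split_message is made up here, and the sketch assumes the whole message is already in memory as bytes.

```python
import re

def split_message(raw: bytes):
    """Split a raw HTTP message into (initial_line, header_lines, body)."""
    # The header section ends at the first blank line; accept CRLF or bare LF.
    match = re.search(rb"\r?\n\r?\n", raw)
    if match is None:
        head, body = raw, b""
    else:
        head, body = raw[:match.start()], raw[match.end():]
    # Split the header section into lines, again tolerating bare LF.
    lines = [line.decode("latin-1") for line in re.split(rb"\r?\n", head)]
    return lines[0], lines[1:], body

initial, headers, body = split_message(
    b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 5\r\n\r\nhello"
)
```

The body is kept as bytes because it may be binary data; only the initial line and headers are decoded as text.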
Initial Request Line
The initial line is different for the request than for the response. A request line has three parts, separated by spaces: a method name, the local path of the requested resource, and the version of HTTP being used. A typical request line is:
    GET /path/to/file/index.html HTTP/1.0
Notes:
    GET is the most common HTTP method; it says "give me this resource". Other methods include POST and HEAD; more on those later. Method names are always uppercase.
    The path is the part of the URL after the host name, also called the request URI (a URI is like a URL, but more general).
    The HTTP version always takes the form HTTP/x.x, uppercase.
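As a rough sketch of how a server might take a request line apart, the following Python function checks the three parts described above. The name parse_request_line and the choice to raise ValueError on bad input are assumptions made for this example.

```python
def parse_request_line(line: str):
    """Return (method, path, (major, minor)) for a strict request line."""
    # Exactly three parts, separated by single spaces.
    method, path, version = line.split(" ")
    if not method.isupper():
        raise ValueError("method names are always uppercase")
    if not version.startswith("HTTP/"):
        raise ValueError("version must take the form HTTP/x.x")
    major, minor = version[len("HTTP/"):].split(".")
    return method, path, (int(major), int(minor))

request = parse_request_line("GET /path/to/file/index.html HTTP/1.0")
```

This version is deliberately strict; a real server would usually be more forgiving about spacing in what it receives.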
Initial Response Line (Status Line)
The initial response line, called the status line, also has three parts separated by spaces: the HTTP version, a response status code that gives the result of the request, and an English reason phrase describing the status code. A typical status line is:
    HTTP/1.0 200 OK
Notes:
    The HTTP version is in the same format as in the request line, HTTP/x.x.
    The status code is meant to be computer-readable; the reason phrase is meant to be human-readable, and may vary.
    The status code is a three-digit integer, and the first digit identifies the general category of response:
        1xx indicates an informational message only
        2xx indicates success of some kind
        3xx redirects the client to another URL
        4xx indicates an error on the client's part
        5xx indicates an error on the server's part
The most common status codes are:
    200 OK
        The request succeeded, and the resulting resource (e.g. file or script output) is returned in the message body.
    404 Not Found
        The requested resource doesn't exist.
    301 Moved Permanently
    303 See Other (HTTP 1.1 only)
        The resource has moved to another URL (given by the Location: response header), and should be automatically retrieved by the client. This is often used by a CGI script to redirect the browser to an existing file.
    500 Server Error
        An unexpected server error. The most common cause is a server-side script that has bad syntax, fails, or otherwise can't run correctly.
A complete list of status codes is in the HTTP specification (section 9 for HTTP 1.0, and section 10 for HTTP 1.1).
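A client can act on the first digit of the status code without knowing every individual code. The sketch below shows one way to do that in Python; the function name parse_status_line and the category labels are illustrative choices, not terms from the specification.

```python
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def parse_status_line(line: str):
    """Return (version, code, reason, category) for a status line."""
    # Split at most twice so the reason phrase keeps its internal spaces.
    parts = line.split(" ", 2)
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    # The first digit of the three-digit code gives the general category.
    return version, code, reason, CATEGORIES.get(code // 100, "unknown")

status = parse_status_line("HTTP/1.0 404 Not Found")
```

Because the reason phrase may vary, program logic should depend on the numeric code only.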
Header lines provide information about the request or response, or about the object sent in the message body.
The header lines are in the usual text header format, which is: one line per header, of the form "Header-Name: value", ending with CRLF. It's the same format used for email and news postings, defined in RFC 822, section 3. Details about RFC 822 header lines:
    As noted above, they should end in CRLF, but you should handle LF correctly.
    The header name is not case-sensitive (though the value may be).
    Any number of spaces or tabs may be between the ":" and the value.
    Header lines beginning with space or tab are actually part of the previous header line, folded into multiple lines for easy reading.
Thus, the following two headers are equivalent:
    Header1: some-long-value-1a, some-long-value-1b

    HEADER1:    some-long-value-1a,
                some-long-value-1b
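The header rules above (case-insensitive names, optional whitespace after the colon, and folded continuation lines) can be sketched as follows. The function name parse_headers is hypothetical, and as a simplification a repeated header simply overwrites the earlier value.

```python
def parse_headers(lines):
    """Turn header lines into a dict keyed by lowercased header name."""
    headers = {}
    last_name = None
    for line in lines:
        if line[:1] in (" ", "\t") and last_name is not None:
            # A line starting with space or tab continues the previous header.
            headers[last_name] += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        last_name = name.strip().lower()   # names are not case-sensitive
        headers[last_name] = value.strip() # drop spaces/tabs around the value
    return headers

plain = parse_headers(["Header1: some-long-value-1a, some-long-value-1b"])
folded = parse_headers(["HEADER1:    some-long-value-1a,",
                        "            some-long-value-1b"])
```

Both calls produce the same dictionary, which matches the claim that the two header forms are equivalent.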
HTTP 1.0 defines 16 headers, though none are required. HTTP 1.1 defines 46 headers, and one (Host:) is required in requests. For Net-politeness, consider including these headers in your requests:
    The From: header gives the email address of whoever's making the request, or running the program doing so. (This must be user-configurable, for privacy concerns.)
    The User-Agent: header identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the (mostly) alphanumeric version of the program. For example, Netscape 3.0 sends the header "User-agent: Mozilla/3.0Gold".
These headers help webmasters troubleshoot problems. They also reveal information about the user. When you decide which headers to include, you must balance the webmasters' logging needs against your users' needs for privacy.
If you're writing servers, consider including these headers in your responses:
    The Server: header is analogous to the User-Agent: header: it identifies the server software in the form "Program-name/x.xx". For example, one beta version of Apache's server returns "Server: Apache/1.2b3-dev".
    The Last-Modified: header gives the modification date of the resource that's being returned. It's used in caching and other bandwidth-saving activities. Use Greenwich Mean Time, in the format
        Last-Modified: Fri, 31 Dec 1999 23:59:59 GMT
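In Python, one way to produce a timestamp in this format is the standard library's email.utils.formatdate with usegmt=True. The wrapper name http_date below is just a label chosen for this sketch.

```python
import calendar
from email.utils import formatdate

def http_date(timestamp: float) -> str:
    """Format seconds since the epoch as an HTTP date in GMT."""
    return formatdate(timestamp, usegmt=True)

# calendar.timegm converts a UTC time tuple to seconds since the epoch.
last_second_of_1999 = calendar.timegm((1999, 12, 31, 23, 59, 59))
header_line = "Last-Modified: " + http_date(last_second_of_1999)
```

For a file on disk, the timestamp would typically come from os.path.getmtime.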
An HTTP message may have a body of data sent after the header lines. In a response, this is where the requested resource is returned to the client (the most common use of the message body), or perhaps explanatory text if there's an error. In a request, this is where user-entered data or uploaded files are sent to the server.
If an HTTP message includes a body, there are usually header lines in the message that describe the body. In particular,
    The Content-Type: header gives the MIME-type of the data in the body, such as text/html or image/gif.
    The Content-Length: header gives the number of bytes in the body.
To retrieve a file over HTTP, first open a socket to the host named in the URL, port 80 (use the default port of 80 when none is specified in the URL). Then, send something like the following through the socket:
    GET /path/file.html HTTP/1.0
    From:
    User-Agent: HTTPTool/1.0
    [blank line here]
The server should respond with something like the following, sent back through the same socket:
    HTTP/1.0 200 OK
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    Content-Type: text/html
    Content-Length: 1354

    <html>
    <body>
    <h1>Happy New Millennium!</h1>
    (more file contents)
      .
      .
      .
    </body>
    </html>
After sending the response, the server closes the socket.
To familiarize yourself with requests and responses, manually experiment with HTTP using telnet.
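The exchange above might be sketched in Python roughly as follows. The function names build_get_request and fetch are invented for this example, the From: header is left out because its value must come from the user, and the 10-second timeout is an arbitrary choice.

```python
import socket

def build_get_request(path: str, user_agent: str = "HTTPTool/1.0") -> bytes:
    """Build an HTTP 1.0 GET request, ending with the required blank line."""
    lines = [f"GET {path} HTTP/1.0", f"User-Agent: {user_agent}", "", ""]
    return "\r\n".join(lines).encode("ascii")

def fetch(host: str, path: str, port: int = 80, timeout: float = 10.0) -> bytes:
    """Send one GET request and read until the server closes the socket."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # an HTTP 1.0 server closes the socket when done
                break
            chunks.append(data)
    return b"".join(chunks)

request_bytes = build_get_request("/path/file.html")
```

Calling fetch(host, "/path/file.html") with a real host name returns the raw response, status line and headers included; nothing here parses it.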
Besides GET, the two most commonly used methods are HEAD and POST.
The HEAD Method
A HEAD request is just like a GET request, except it asks the server to return the response headers only, and not the actual resource (i.e. no message body). This is useful to check characteristics of a resource without actually downloading it, thus saving bandwidth. Use HEAD when you don't actually need a file's contents.
The response to a HEAD request must never contain a message body, just the status line and headers.
A POST request is used to send data to the server to be processed in some way, like by a CGI script. A POST request is different from a GET request in the following ways:
    There's a block of data sent with the request, in the message body. There are usually extra headers to describe this message body, like Content-Type: and Content-Length:.
    The request URI is not a resource to retrieve; it's usually a program to handle the data you're sending.
    The HTTP response is normally program output, not a static file.
The most common use of POST, by far, is to submit HTML form data to CGI scripts. In this case, the Content-Type: header is usually application/x-www-form-urlencoded, and the Content-Length: header gives the length of the URL-encoded form data (here's a note on URL-encoding). The CGI script receives the message body through STDIN, and decodes it. Here's a typical form submission, using POST:
    POST /path/script.cgi HTTP/1.0
    From:
    User-Agent: HTTPTool/1.0
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 32

    home=Cosby&favorite+flavor=flies
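A form submission like the one above could be assembled in Python as sketched below. The standard urllib.parse.urlencode function does the URL-encoding; the function name build_form_post is an assumption of this example, and the From: header is omitted.

```python
from urllib.parse import urlencode

def build_form_post(path: str, fields: dict) -> bytes:
    """Build an HTTP 1.0 POST request carrying URL-encoded form data."""
    body = urlencode(fields).encode("ascii")
    head = "\r\n".join([
        f"POST {path} HTTP/1.0",
        "User-Agent: HTTPTool/1.0",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",  # length of the encoded body in bytes
        "",
        "",
    ])
    return head.encode("ascii") + body

post_bytes = build_form_post(
    "/path/script.cgi", {"home": "Cosby", "favorite flavor": "flies"}
)
```

Computing Content-Length from the encoded body, rather than from the original field values, keeps the header correct when characters are escaped.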
You can use a POST request to send whatever data you want, not just form submissions. Just make sure the sender and the receiving program agree on the format.
The GET method can also be used to submit forms. The form data is URL-encoded and appended to the request URI. Here are more details.
If you're writing HTTP servers that support CGI scripts, you should read the NCSA's CGI definition if you haven't already, especially which environment variables you need to pass to the scripts.
An HTTP proxy is a program that acts as an intermediary between a client and a server. It receives requests from clients, and forwards those requests to the intended servers. The responses pass back through it in the same way. Thus, a proxy has functions of both a client and a server.
Proxies are commonly used in firewalls, for LAN-wide caches, or in other situations. If you're writing proxies, read the HTTP specification; it contains details about proxies not covered in this tutorial.
When a client uses a proxy, it typically sends all requests to that proxy, instead of to the servers in the URLs. Requests to a proxy differ from normal requests in one way: in the first line, they use the complete URL of the resource being requested, instead of just the path. For example,
    GET [complete URL here] HTTP/1.0
That way, the proxy knows which server to forward the request to (though the proxy itself may use another proxy).
As the saying goes (in network programming, anyway), "Be strict in what you send and tolerant in what you receive." Other clients and servers you interact with may have minor flaws in their messages, but you should try to work gracefully with them. In particular, the HTTP specification suggests the following:
    Even though header lines should end with CRLF, someone might use a single LF instead. Accept either CRLF or LF.
    The three fields in the initial message line should be separated by a single space, but might instead use several spaces, or tabs. Accept any number of spaces or tabs between these fields.
The specification has other suggestions too, like how to handle varying date formats. If your program interprets dates from other programs, read the "Tolerant Applications" section of the specification.
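The two suggestions above might look like this in Python. The helper names tolerant_lines and tolerant_fields are made up for the sketch, which assumes the header section has already been decoded to text.

```python
import re

def tolerant_lines(text: str):
    """Split text into lines, accepting either CRLF or a single LF."""
    return re.split(r"\r?\n", text)

def tolerant_fields(initial_line: str):
    """Split an initial line into at most three fields on runs of spaces or tabs."""
    # maxsplit=2 keeps a multi-word reason phrase together as the third field.
    return initial_line.strip().split(None, 2)

lines = tolerant_lines("GET / HTTP/1.0\nUser-Agent: HTTPTool/1.0\r\n\r\n")
request_fields = tolerant_fields("GET \t /index.html   HTTP/1.0")
status_fields = tolerant_fields("HTTP/1.0  404\tNot Found")
```

When sending, the same program should still emit exactly one space between fields and CRLF at the end of each line.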
That's the basic structure of HTTP. If you understand everything so far, you have a good overview of HTTP communication, and should be able to write simple HTTP 1.0 programs. See this example to get started. Again, before you do anything heavy-duty, read the specification.
The rest of this document tells how to upgrade your clients and servers to use HTTP 1.1. There is a list of new client requirements, and a list of new server requirements. You can stop here if HTTP 1.0 satisfies your current needs (though you'll probably need HTTP 1.1 in the future).
Note: As of early 1997, the Web is moving from HTTP 1.0 to HTTP 1.1. Whenever practical, use HTTP 1.1. It's more efficient overall, and by using it, you'll help the Web perform better for everyone.
Like many protocols, HTTP is constantly evolving. HTTP 1.1 has recently been defined, to address new needs and overcome shortcomings of HTTP 1.0. Generally speaking, it is a superset of HTTP 1.0. Improvements include:
    Faster response, by allowing multiple transactions to take place over a single persistent connection.
    Faster response and great bandwidth savings, by adding cache support.
    Faster response for dynamically-generated pages, by supporting chunked encoding, which allows a response to be sent before its total length is known.
    Efficient use of IP addresses, by allowing multiple domains to be served from a single IP address.
HTTP 1.1 requires a few extra things from both clients and servers. The next two sections detail how to make clients and servers comply with HTTP 1.1. If you're only writing clients, you can skip the section on servers. If you're writing servers, read both sections.
Only requirements for HTTP 1.1 compliance are described here. HTTP 1.1 has many optional features you may find useful; read the specification to learn more.
To comply with HTTP 1.1, clients must
    include the Host: header with each request
    accept responses with chunked data
    either support persistent connections, or include the "Connection: close" header with each request
    handle the "100 Continue" response
Starting with HTTP 1.1, one server at one IP address can be multi-homed, i.e. the home of several Web domains. For example, two different domains can live on the same server.
Several domains living on the same server is like several people sharing one phone: a caller knows who they're calling for, but whoever answers the phone doesn't. Thus, every HTTP request must specify which host name (and possibly port) the request is intended for, with the Host: header. A complete HTTP 1.1 request might be
    GET /path/file.html HTTP/1.1
    Host: [host name here]:80
    [blank line here]
except the ":80" isn't required, since that's the default HTTP port.
Host: is the only required header in an HTTP 1.1 request. It's also the most urgently needed new feature in HTTP 1.1. Without it, each host name requires a unique IP address, and we're quickly running out of IP addresses with the explosion of new domains.
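A minimal HTTP 1.1 request with the Host: header could be built as in the sketch below. The function name build_http11_get and the host name www.example.com are placeholders invented for this example; the port is left out of the header when it is the default 80.

```python
def build_http11_get(host: str, path: str, port: int = 80,
                     close: bool = False) -> bytes:
    """Build an HTTP 1.1 GET request with the required Host: header."""
    # The port only needs to appear when it differs from the default, 80.
    host_value = host if port == 80 else f"{host}:{port}"
    lines = [f"GET {path} HTTP/1.1", f"Host: {host_value}"]
    if close:
        # Tell the server this is the last request on the connection.
        lines.append("Connection: close")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

default_port = build_http11_get("www.example.com", "/path/file.html")
other_port = build_http11_get("www.example.com", "/", port=8080, close=True)
```

The close flag is there for clients that don't support persistent connections, which is one of the other HTTP 1.1 client requirements.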
Chunked Transfer-Encoding
If a server wants to start sending a response before knowing its total length (like with long script output), it might use the simple chunked transfer-encoding, which breaks the complete response into smaller chunks and sends them in series. You can identify such a response because it contains the "Transfer-Encoding: chunked" header. All HTTP 1.1 clients must be able to receive chunked messages.
A chunked message body contains a series of chunks, followed by a line with "0" (zero), followed by optional footers (just like headers), and a blank line. Each chunk consists of two parts:
    a line with the size of the chunk data, in hex, possibly followed by a semicolon and extra parameters you can ignore (none are currently standard), and ending with CRLF.
    the data itself, followed by CRLF.
So a chunked response might look like the following:
    HTTP/1.1 200 OK
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    Content-Type: text/plain
    Transfer-Encoding: chunked

    1a; ignore-stuff-here
    abcdefghijklmnopqrstuvwxyz
    10
    1234567890abcdef
    0
    some-footer: some-value
    another-footer: another-value
    [blank line here]
Note the blank line after the last footer. The length of the text data is 42 bytes (1a + 10, in hex), and the data itself is abcdefghijklmnopqrstuvwxyz1234567890abcdef. The footers should be treated like headers, as if they were at the top of the response.
The chunks can contain any binary data, and may be much larger than the examples here. The size-line parameters are rarely used, but you should at least ignore them correctly. Footers are also rare, but might be appropriate for things like checksums or digital signatures.
For comparison, here's the equivalent to the above response, without using chunked encoding:
    HTTP/1.1 200 OK
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    Content-Type: text/plain
    Content-Length: 42
    some-footer: some-value
    another-footer: another-value

    abcdefghijklmnopqrstuvwxyz1234567890abcdef
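A chunked body can be decoded with a short loop. The Python sketch below assumes the status line and headers have already been read, and that the body is available as a binary file-like object (for a socket, something like sock.makefile("rb")); the function name read_chunked is hypothetical, and io.BytesIO stands in for the network here.

```python
import io

def read_chunked(stream):
    """Decode a chunked message body; return (data, footers)."""
    data = bytearray()
    while True:
        size_line = stream.readline()
        # The size is in hex; anything after ';' is a parameter to ignore.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        data += stream.read(size)
        stream.readline()  # consume the CRLF that follows the chunk data
    footers = {}
    while True:
        line = stream.readline().strip()
        if not line:  # the blank line ends the footers
            break
        name, _, value = line.partition(b":")
        footers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    return bytes(data), footers

EXAMPLE_BODY = (b"1a; ignore-stuff-here\r\nabcdefghijklmnopqrstuvwxyz\r\n"
                b"10\r\n1234567890abcdef\r\n0\r\n"
                b"some-footer: some-value\r\nanother-footer: another-value\r\n\r\n")
decoded, decoded_footers = read_chunked(io.BytesIO(EXAMPLE_BODY))
```

A truncated body makes int() raise ValueError in this sketch; a production client would want a clearer error and a limit on chunk sizes.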
Persistent Connections and the Connection: close Header
In HTTP 1.0 and before, TCP connections are closed after each request and response, so each resource to be retrieved requires its own connection. Opening and closing TCP connections takes a substantial amount of CPU time, bandwidth, and memory. In practice, most Web pages consist of several files on the same server, so much can be saved by allowing several requests and responses to be sent through a single persistent connection.
Persistent connections are the default in HTTP 1.1, so nothing special is required to use them. Just open a connection and send several requests in series (called pipelining), and read the responses in the same order as the requests were sent. If you do this, be very careful to read the correct length of each response, to separate them correctly.
If a client includes the "Connection: close" header in the request, then the connection will be closed after the corresponding response. Use this if you don't support persistent connections, or if you know a request will be the last on its connection. Similarly, if a response contains this header, then the server will close the connection following that response, and the client shouldn't send any more requests through that connection.
A server might close the connection before all responses are sent, so a client must keep track of requests and resend them as needed. When resending, don't pipeline the requests until you know the connection is persistent. Don't pipeline at all if you know the server won't support persistent connections (like if it uses HTTP 1.0, based on a previous response).
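Separating pipelined responses comes down to reading exactly the right number of body bytes for each one. The sketch below handles only the Content-Length case (no chunked bodies, no folded headers), and uses io.BytesIO in place of a socket; the function name read_response is an assumption of this example.

```python
import io

def read_response(stream):
    """Read one response from a persistent connection, or None at end of stream."""
    status_line = stream.readline()
    if not status_line:
        return None  # the server closed the connection
    headers = {}
    while True:
        line = stream.readline().strip()
        if not line:
            break
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    # Read exactly Content-Length bytes so the next response starts cleanly.
    length = int(headers.get(b"content-length", b"0"))
    return status_line.strip(), headers, stream.read(length)

wire = io.BytesIO(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
                  b"HTTP/1.1 404 Not Found\r\nContent-Length: 4\r\n"
                  b"Connection: close\r\n\r\ngone")
first = read_response(wire)
second = read_response(wire)
third = read_response(wire)
```

After a response carrying "Connection: close", as in the second one here, a client should stop sending requests on that connection.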
The 100 Continue Response
During the course of an HTTP 1.1 client sending a request to a server, the server might respond with an interim "100 Continue" response. This means the server has received the first part of the request, and can be used to aid communication over slow links. In any case, all HTTP 1.1 clients must handle the 100 response correctly (perhaps by just ignoring it).
The "100 Continue" response is structured like any HTTP response, i.e. consists of a status line, optional headers, and a blank line. Unlike other responses, it is always followed by another complete, final response.
So, further extending the last example, the full data that comes back from the server might consist of two responses in series, like
    HTTP/1.1 100 Continue

    HTTP/1.1 200 OK
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    Content-Type: text/plain
    Content-Length: 42
    some-footer: some-value
    another-footer: another-value

    abcdefghijklmnopqrstuvwxyz1234567890abcdef
To handle this, a simple HTTP 1.1 client might read one response from the socket; if the status code is 100, discard the first response and read the next one instead.
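The "discard and read the next one" approach might be sketched like this. The function names read_head and read_final_head are invented for the example, which reads only the status line and headers and leaves the body on the stream.

```python
import io

def read_head(stream):
    """Read a status line and headers; return (status_code, header_lines)."""
    status_line = stream.readline().decode("latin-1").strip()
    if not status_line:
        raise EOFError("connection closed before a status line arrived")
    code = int(status_line.split(None, 2)[1])
    header_lines = []
    while True:
        line = stream.readline().decode("latin-1").rstrip("\r\n")
        if not line:  # blank line ends the headers
            break
        header_lines.append(line)
    return code, header_lines

def read_final_head(stream):
    """Skip any interim 100 responses and return the final response head."""
    while True:
        code, header_lines = read_head(stream)
        if code != 100:
            return code, header_lines

wire = io.BytesIO(b"HTTP/1.1 100 Continue\r\n\r\n"
                  b"HTTP/1.1 200 OK\r\nContent-Length: 42\r\n\r\n"
                  b"abcdefghijklmnopqrstuvwxyz1234567890abcdef")
final_code, final_headers = read_final_head(wire)
remaining_body = wire.read()
```

Using a loop rather than a single check also copes with a server that sends more than one interim response.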
To comply with HTTP 1.1, servers must:
    require the Host: header from HTTP 1.1 clients
    accept absolute URLs in a request
    accept requests with chunked data
    either support persistent connections, or include the "Connection: close" header with each response
    use the "100 Continue" response appropriately
    include the Date: header in each response
    handle requests with If-Modified-Since: or If-Unmodified-Since: headers
    support at least the GET and HEAD methods
    support HTTP 1.0 requests
Requiring the Host: Header
Because of the urgency of implementing the new Host: header, servers are not allowed to tolerate HTTP 1.1 requests without it. If a server receives such a request, it must return a "400 Bad Request" response, like
    HTTP/1.1 400 Bad Request
    Content-Type: text/html
    Content-Length: 111

    <html><body>
    <h2>No Host: header received</h2>
    HTTP 1.1 requests must include the Host: header.
    </body></html>
This requirement applies only to clients using HTTP 1.1, not any future version of HTTP. If the request uses an HTTP version later than 1.1, the server can accept an absolute URL instead of a Host: header (see next section). If the request uses HTTP 1.0, the server may accept the request without any host identification.
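On the server side, the check might be sketched as below. The function name missing_host_response is hypothetical, and the sketch assumes the request has already been parsed into a version tuple and a dict of lowercased header names.

```python
# Error page body, using LF line endings inside the HTML.
BODY = (b"<html><body>\n"
        b"<h2>No Host: header received</h2>\n"
        b"HTTP 1.1 requests must include the Host: header.\n"
        b"</body></html>\n")

def missing_host_response(version, headers):
    """Return a 400 response for an HTTP 1.1 request without Host:, else None."""
    if version == (1, 1) and "host" not in headers:
        head = ("HTTP/1.1 400 Bad Request\r\n"
                "Content-Type: text/html\r\n"
                f"Content-Length: {len(BODY)}\r\n"
                "\r\n")
        return head.encode("ascii") + BODY
    return None  # HTTP 1.0 requests, and 1.1 requests with Host:, pass

rejected = missing_host_response((1, 1), {"user-agent": "HTTPTool/1.0"})
accepted = missing_host_response((1, 0), {})
```

A complete HTTP 1.1 server would also add a Date: header to this response, as it must for every response other than those with 100-level status.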
Accepting Absolute URLs
The Host: header is actually an interim solution to the problem of host identification. In future versions of HTTP, requests will use an absolute URL instead of a pathname, like
    GET [absolute URL here] HTTP/1.2
To enable this protocol transition, HTTP 1.1 servers must accept this form of request, even though HTTP 1.1 clients won't send them. The server must still report an error if an HTTP 1.1 client leaves out the Host: header, as described in the previous section.
Chunked Transfer-Encoding
Just as HTTP 1.1 clients must accept chunked responses, servers must accept chunked requests (an unlikely scenario, but possible). See the earlier section on HTTP 1.1 Clients for details of the chunked data format.
Servers aren't required to generate chunked messages; they just have to be able to receive them.
Persistent Connections and the Connection: close Header
If an HTTP 1.1 client sends multiple requests through a single connection, the server should send responses back in the same order as the requests; this is all it takes for a server to support persistent connections.
If a request includes the "Connection: close" header, that request is the final one for the connection and the server should close the connection after sending the response. Also, the server should close an idle connection after some timeout period (can be anything; 10 seconds is fine).
If you don't want to support persistent connections, include the "Connection: close" header in the response. Use this header whenever you want to close the connection, even if not all requests have been fulfilled. The header says that the connection will be closed after the current response, and a valid HTTP 1.1 client will handle it correctly.
Using the 100 Continue Response
As described in the section on HTTP 1.1 Clients, this response exists to help deal with slow links.
When an HTTP 1.1 server receives the first line of an HTTP 1.1 (or later) request, it must respond with either "100 Continue" or an error. If it sends the "100 Continue" response, it must also send another, final response, once the request has been processed. The "100 Continue" response requires no headers, but must be followed by the usual blank line, like:
    HTTP/1.1 100 Continue
    [blank line here]
    [another HTTP response will go here]
Don't send "100 Continue" to HTTP 1.0 clients, since they don't know how to handle it.
Caching is an important improvement in HTTP 1.1, and can't work without timestamped responses. So, servers must timestamp every response with a Date: header containing the current time, in the form
    Date: Fri, 31 Dec 1999 23:59:59 GMT
All responses except those with 100-level status (but including error responses) must include the Date: header.
All time values in HTTP use Greenwich Mean Time.
Handling Requests with If-Modified-Since: or If-Unmodified-Since: Headers
To avoid sending resources that don't need to be sent, thus saving bandwidth, HTTP 1.1 defines the If-Modified-Since: and If-Unmodified-Since: request headers. The former says "only send the resource if it has changed since this date"; the latter says the opposite. Clients aren't required to use them, but HTTP 1.1 servers are required to honor requests that do use them.
Unfortunately, due to earlier HTTP versions, the date value may be in any of three possible formats:
    If-Modified-Since: Fri, 31 Dec 1999 23:59:59 GMT
    If-Modified-Since: Friday, 31-Dec-99 23:59:59 GMT
    If-Modified-Since: Fri Dec 31 23:59:59 1999
Again, all time values in HTTP use Greenwich Mean Time (though try to be tolerant of non-GMT times). If a date with a two-digit year seems to be more than 50 years in the future, treat it as being in the past; this helps with the millennium bug. In fact, do this with any date handling in HTTP 1.1.
Although servers must accept all three date formats, HTTP 1.1 clients and servers must only generate the first kind.
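One way to accept all three formats in Python is to try each in turn with datetime.strptime, as sketched below. The function name parse_http_date is made up for this example. Two caveats: the sketch assumes English day and month names (the default C locale), and for two-digit years it relies on Python's own pivot (69 to 99 map to 1969 to 1999) rather than implementing the 50-year rule described above.

```python
from datetime import datetime, timezone

FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # Fri, 31 Dec 1999 23:59:59 GMT
    "%A, %d-%b-%y %H:%M:%S GMT",  # Friday, 31-Dec-99 23:59:59 GMT
    "%a %b %d %H:%M:%S %Y",       # Fri Dec 31 23:59:59 1999
)

def parse_http_date(value: str):
    """Return an aware UTC datetime, or None if no format matches."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next format
        return parsed.replace(tzinfo=timezone.utc)
    return None  # invalid date: the caller should ignore the header

expected = datetime(1999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
parsed_dates = [
    parse_http_date("Fri, 31 Dec 1999 23:59:59 GMT"),
    parse_http_date("Friday, 31-Dec-99 23:59:59 GMT"),
    parse_http_date("Fri Dec 31 23:59:59 1999"),
]
```

Returning None for an unparseable value fits the rule that a header with an invalid date is simply ignored.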
If the date in either of these headers is invalid, or is in the future, ignore the header.
If, without the header, the request would result in an unsuccessful (non-200-level) status code, ignore the header and send the non-200-level response. In other words, only apply these headers when you know the resource would otherwise be sent.
The If-Modified-Since: header is used with a GET request. If the requested resource has been modified since the given date, ignore the header and return the resource as you normally would. Otherwise, return a "304 Not Modified" response, including the Date: header and no message body, like
    HTTP/1.1 304 Not Modified
    Date: Fri, 31 Dec 1999 23:59:59 GMT
    [blank line here]
The If-Unmodified-Since: header is similar, but can be used with any method. If the requested resource has not been modified since the given date, ignore the header and return the resource as you normally would. Otherwise, return a "412 Precondition Failed" response, like
    HTTP/1.1 412 Precondition Failed
    [blank line here]
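The rules in this section can be collected into one decision function. The sketch below is an interpretation, not the specification's wording: the name conditional_status and its parameters are invented, header dates are passed in already parsed (None meaning absent or invalid), and it assumes the request would otherwise succeed with a 200-level status.

```python
from datetime import datetime, timezone

def conditional_status(method, last_modified, now,
                       if_modified_since=None, if_unmodified_since=None):
    """Return 200, 304, or 412 for a request that would otherwise succeed."""
    # If-Modified-Since applies to GET only; dates in the future are ignored.
    if (method == "GET" and if_modified_since is not None
            and if_modified_since <= now
            and last_modified <= if_modified_since):
        return 304  # not modified since the given date
    # If-Unmodified-Since applies to any method.
    if (if_unmodified_since is not None
            and if_unmodified_since <= now
            and last_modified > if_unmodified_since):
        return 412  # modified since the given date
    return 200

def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)

NOW = utc(2000, 1, 1)
MODIFIED = utc(1999, 6, 1)
not_modified = conditional_status("GET", MODIFIED, NOW, if_modified_since=utc(1999, 12, 31))
precondition_failed = conditional_status("POST", MODIFIED, NOW, if_unmodified_since=utc(1999, 1, 1))
```

The caller is responsible for sending no message body with a 304 response and for including the Date: header.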
Supporting the GET and HEAD methods
To comply with HTTP 1.1, a server must support at least the GET and HEAD methods. If you're handling CGI scripts, you should probably support the POST method too.
Four other methods (PUT, DELETE, OPTIONS, and TRACE) are defined in HTTP 1.1, but are rarely used. If a client requests a method you do