Reverse Proxying With Apache 2.0
December 18, 2003
The
previous Apache-focused tutorial published on ServerWatch discussed the
benefits of a proxy server for the network, and how it can speed up
access, reduce bandwidth requirements, and perform basic information
filtering tasks. This type of proxy is a forward proxy -- it forwards
requests from a network to the Internet.
However,
if the proxy model is flipped on its head, a different type of proxy
server is created -- a reverse proxy. In this instance, instead of
requests from a client being forwarded (and optionally cached) through
the proxy to the Internet, requests are forwarded (and cached) to one
or more Web servers, as illustrated in Figure 1.
Figure 1

Interesting, you're thinking. But what is the benefit of this?
Reverse proxies are useful for much the same reasons as forward
proxies: they offer comparable performance and security benefits. The
other, less obvious, advantage is that a reverse proxy provides a
unified interface to Web servers.
Reverse Proxy Gateway Operation
One of
the problems with supporting a modern Web site is that as the site
grows, the level and quantity of information requested and returned
also increases. A number of solutions have been developed to resolve
this issue. The most obvious is to just build a bigger, more powerful
server by adding more CPUs, RAM, disk space, and network interfaces.
Ultimately, however, a physical or practical limit is reached that
makes it impossible to expand any further.
Other
solutions involve simple, or complex, load balancing techniques,
clustering tools, or manual (and generally complex) methods of
splitting up the site into different areas, and manually redirecting
users to different machines to handle the requests and load.
With a
reverse proxy, a single machine is inserted to act as a gateway to the
real servers in the network. Now, instead of multiple machines directly
handling the requests from clients, a single machine is responsible for
accepting and redirecting the requests to the real servers. This means
that a single domain continues to appear as a single machine, while
still having the flexibility of multiple machines working behind the
scenes to honor the actual requests.
The
unified interface is, in essence, the same as using a forward proxy for
Internet access. However, instead of being a single interface to the
Internet, it becomes a single interface into the Web server network.
Caching of Static Data
Another problem with most Web sites, even those based on static
content, is that the information must be read off the disk each time it
is supplied to a client. With a bit of work within Apache we can use
mod_cache (and the mod_mem_cache module) to keep some documents in
memory.
A reverse proxy can provide an in-memory cache on a single machine that
services client requests for a number of different real servers,
because the proxy server caches only the documents that are actually
requested.
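As a rough illustration of the in-memory cache described above, the fragment below is a minimal sketch using the mod_cache and mod_mem_cache directives shipped with Apache 2.0. It assumes both modules are already loaded, and every size limit shown is an assumed value chosen for illustration rather than a recommendation.

```apache
# Sketch only: cache responses for the whole URL space in memory.
<IfModule mod_cache.c>
    <IfModule mod_mem_cache.c>
        # Use the in-memory storage manager for everything under /
        CacheEnable mem /
        # Total cache size in KBytes (assumed value)
        MCacheSize 4096
        # Maximum number of objects held in the cache (assumed value)
        MCacheMaxObjectCount 100
        # Smallest and largest object to cache, in bytes (assumed values)
        MCacheMinObjectSize 1
        MCacheMaxObjectSize 2048
    </IfModule>
</IfModule>
```

Keeping the object size limit small favors many small static files over a few large ones, which is usually the point of a memory cache.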
Caching Dynamic Data
As with
a forward proxy, we can cache content from the individual Web servers,
enabling the reverse proxy to appear as a single machine. Because the
information is cached, it can be returned significantly quicker than
from a typical static or dynamic solution. In a Web server design where
individual pages are generated from a large number of dynamic
components this can be a significant benefit.
Consider a typical page made up of 10 different dynamic elements. If
100 clients attempt to access that page simultaneously, then 1,000
requests for dynamic elements must be processed. For some sites, these
'dynamic' elements are nothing more than data extracted from a
database. The underlying page data doesn't change much, but on a
dynamic site we must still process that same database request each time
the page is accessed.
If a
reverse proxy is placed between the clients and the Web servers, we can
cache the basic content, reducing the load on the database that
provides the information and on the server that must execute the
application to load that data and convert it into a page.
The
reverse proxy could cache the entire content of the dynamic elements in
memory, or on disk, and return them to clients much quicker than the
dynamic process. You could also set the cache to be updated (through
the cache expiry system) to provide the latest versions of stories and
data for the site.
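One way the expiry-driven cache described above might be set up is sketched below, using the mod_cache and mod_disk_cache directives from Apache 2.0. The cache directory is a hypothetical path and the lifetimes are assumed values. Whether a given dynamic page is actually cached still depends on the headers the back-end application sends.

```apache
# Sketch only: disk cache on the reverse proxy with bounded lifetimes.
<IfModule mod_cache.c>
    <IfModule mod_disk_cache.c>
        # Directory that holds cached files (hypothetical path)
        CacheRoot /var/cache/apache-proxy
        # Use the disk storage manager for everything under /
        CacheEnable disk /
    </IfModule>
    # Lifetime in seconds when a response carries no expiry (assumed: 1 hour)
    CacheDefaultExpire 3600
    # Upper bound on any cached lifetime, in seconds (assumed: 1 day)
    CacheMaxExpire 86400
</IfModule>
```

Shorter lifetimes keep stories and data fresher at the cost of more requests reaching the database.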
Security
Because
all requests give the appearance of coming in through a single server,
not to one of the many back-end servers that actually support the site,
reverse proxying enables us to provide a single point of
authentication. Users log in to the proxy server -- as the gateway to
the Web site -- and need never log in again, even though they may be
accessing other machines to obtain information. This can be done with
either Apache's own authentication systems (using a cookie or
database-based authentication) or SSL-based communication. Using a
reverse proxy, you need manage only one certificate on one machine.
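As a minimal sketch of a single point of authentication, the fragment below protects one proxied path on the front-end server. It uses plain Basic authentication against a password file rather than the cookie or database schemes mentioned above, simply because it is the shortest to show. The /accounts path, the realm name, and the file location are all assumptions made for illustration.

```apache
# Sketch only: require a login at the proxy for everything under /accounts.
<Location /accounts>
    AuthType Basic
    # Realm name shown in the browser prompt (hypothetical)
    AuthName "Accounts"
    # Password file created with htpasswd (hypothetical path)
    AuthUserFile /usr/local/apache2/conf/proxy.users
    Require valid-user
</Location>
```

The back-end servers need no matching configuration as long as they accept connections only from the proxy.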
The
same basic principles also apply when restricting access. If you were
supporting an intranet and wanted to support connectivity from specific
hosts, domains, or through a VPN connection, then you could open up the
connectivity through a reverse proxy without having to open up the main
servers.
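Host-based restrictions of this kind could be sketched on the proxy as follows, using the Apache 2.0 access directives. The path, network prefix, and domain are hypothetical values standing in for whatever hosts should be allowed through.

```apache
# Sketch only: allow /accounts from selected hosts, refuse everyone else.
<Location /accounts>
    Order Deny,Allow
    Deny from all
    # Hypothetical internal network
    Allow from 192.168.1
    # Hypothetical trusted domain
    Allow from .mcslp.com
</Location>
```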
If you
have a firewall, then you can use the proxy server on the public side,
or within your DMZ using secured (VPN) or restricted communication
links between the reverse proxy and the real servers behind the
firewall, as shown in Figure 2.
Figure 2

Basic Reverse Proxy Configuration
From a
client perspective, a reverse proxy looks just like a standard Web
server. It doesn't require any special configuration to operate (and if
it did, it wouldn't be anywhere near as useful).
The only real requirement is to ensure the forward proxy is switched off, which is done using the ProxyRequests directive:
ProxyRequests Off
We do, however, need to tell the reverse proxy where it should fetch
(and optionally cache) content on behalf of the clients that request
it. The system maps specific directories within the hostname assigned
to the proxy server to an alternative host. For example, Figure 3 shows
three back-end servers, and a front-end reverse proxy identified as
www.mcslp.com.
Figure 3

When a
user requests www.mcslp.com/marketing, the admin actually wants the
content on marketing.mcslp.com to be returned instead. For this he must
edit the Apache httpd.conf file on the reverse proxy, or the machine
being used as a front end to the Web site, and then set the ProxyPass
directive for the requested directory to point to the URL of the real
data. For example:
ProxyPass /marketing http://marketing.mcslp.com
The above line causes the proxy server to supply the data from
marketing.mcslp.com whenever an object within /marketing is requested.
For example, the content of the URL www.mcslp.com/marketing/index.html
would actually come from marketing.mcslp.com/index.html.
ProxyPass generates an internal proxy request for the remote URL and
then returns the information, just as a forward proxy does with a proxy
request from a client. This is not redirection -- the information is
loaded to the proxy server from the real host and sent back to the
client from the reverse proxy as if the data were from the proxy
server.
You can also achieve the same effect from within a <Location> section
by simply omitting the directory (because Apache gets the directory
context from the enclosing section):
ProxyPass http://marketing.mcslp.com
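Shown in context, the directive sits inside the section that supplies its local path. The fragment below is a sketch of that form, assuming the /marketing path and the marketing.mcslp.com host used elsewhere in this article.

```apache
# Sketch only: the local path comes from the enclosing <Location>.
<Location /marketing>
    ProxyPass http://marketing.mcslp.com
</Location>
```

This form is convenient when the same section already carries other settings for that path, such as access restrictions.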
Proxying all three directories requires something like:
ProxyPass /marketing http://marketing.mcslp.com
ProxyPass /accounts http://finance.mcslp.com
ProxyPass /sales http://sales.mcslp.com
The second argument is a URL, so it could point to a sub-directory on a remote machine, too. For example, the directive
ProxyPass /contact http://sales.mcslp.com/contact
would pass requests for www.mcslp.com/contact to the same directory on the sales Web server.
You can
also stop subdirectories of a directory being passed through by using
an exclamation mark (!) as the destination URL. For example, to reverse
proxy /marketing, but not /marketing/contact you would use:
ProxyPass /marketing/contact !
ProxyPass /marketing http://marketing.mcslp.com
Proper Reverse Proxy Configuration
The
only problem with the ProxyPass directive is that it's not "clean"
reverse proxying. Although the directive will correctly pass data
through to the remote host, the HTTP headers (some of which contain the
true location of the data) will remain unchanged. So, for example, when
accessing www.mcslp.com/marketing/index.html, the client browser will
be able to identify the true source of the data as
marketing.mcslp.com/index.html just by looking at the HTTP headers
returned.
The downside is that any redirect issued by the true server would send
the client to the true server, not the proxy server we're trying to
hide behind. Solving this problem requires an additional directive,
ProxyPassReverse. This forces the proxy module to rewrite the HTTP
header fields Location, Content-Location, and URI with the address of
the proxy server, not the true server.
A true reverse proxy configuration requires two lines:
ProxyPass /marketing http://marketing.mcslp.com
ProxyPassReverse /marketing http://marketing.mcslp.com
The first line triggers the proxy request for the real data; the second handles the rewriting.
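Putting the pieces together, a front-end configuration for the three back-end servers in Figure 3 might look like the sketch below. It simply pairs each ProxyPass line from the earlier example with a matching ProxyPassReverse line. The redirect shown in the closing comment is illustrative, and its /new/ path is hypothetical.

```apache
# Sketch only: front-end configuration on www.mcslp.com.
# Make sure the server is not acting as an open forward proxy.
ProxyRequests Off

ProxyPass        /marketing http://marketing.mcslp.com
ProxyPassReverse /marketing http://marketing.mcslp.com

ProxyPass        /accounts  http://finance.mcslp.com
ProxyPassReverse /accounts  http://finance.mcslp.com

ProxyPass        /sales     http://sales.mcslp.com
ProxyPassReverse /sales     http://sales.mcslp.com

# Effect of ProxyPassReverse on a redirect from the marketing server:
#   back end sends:   Location: http://marketing.mcslp.com/new/
#   client receives:  Location: http://www.mcslp.com/marketing/new/
```

Keeping each pair adjacent makes it harder to add a new mapping and forget its header rewrite.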
It is important to note that at no point does Apache rewrite the
content of the information it is sending back, which can cause a few
problems if pages contain absolute links to the back-end servers.
Luckily, if you are already using a single server and replacing it with
multiple servers and a reverse proxy interface, you shouldn't have to
make changes on the site, as the references you are already using will
continue to be valid in the new setup.