Wednesday, January 10, 2007

Scaling Dynamic Websites with Apache Modules

This article presents our experience setting up a mod_perl-based server on a VPS. It also touches on the utility of mod_proxy in this context, and that of mod_deflate in general. Deployed correctly, the right technologies will not only keep hosting costs under control, but can also reduce them.

First, the disclaimer: this article is merely for informational purposes. The techniques and tools, while proven to work for us, may not work in your particular case. You are free to try them at your own risk. Furthermore, this article is likely to be useful to those who host and manage their own websites and rely on Perl for CGI, or those resellers who provide Perl CGI applications to their clients, potentially sharing them among the clients.

For the unfamiliar, mod_perl and mod_cgi are two ways to run Perl scripts/applications on a web server. mod_perl is much more efficient, as it keeps a Perl interpreter in memory. Depending on the setup, precompiled scripts and modules may also be resident in memory. mod_cgi, on the other hand, starts a new Perl interpreter on every request, loading and compiling the scripts and modules for execution. Therefore, it is much slower. Benchmark tests have shown mod_perl to be anywhere from 10 to 100 or more times as fast as mod_cgi (results may vary based on your application; see Practical mod_perl, O'Reilly, 2003).

Our Problem and First Solution

We migrated our sites from shared hosting to a VPS a while ago. The shared host did not allow mod_perl, only mod_cgi. In spite of it being a very powerful hosting system (one of the largest providers of its kind), the CGI applications were becoming heavy for it, and we suffered from slow response. Additionally, there was no shell access, which severely restricted our ability to monitor/test the various parts of the system. The obvious choice was to get to a platform where we had better control and could use the power of mod_perl.

We opted for a CPanel-based system to create and manage the few domains we had. With the system initially running under mod_cgi, the 15-minute load factor (referred to as "load factor" from now on) reached values greater than 10 during peak time. About 10 percent of our Perl application gets used 90 percent of the time. Running everything with the mod_perl Registry handler, which caches precompiled scripts in memory, would have been overkill. Instead, we opted to run the highly used 10 percent of the application with mod_perl Registry, and left the rest alone under mod_cgi.

The reason for this was twofold. First, Apache processes that use mod_perl Registry can grow very quickly in size as you cache more precompiled scripts. Keeping 100 percent of the system resident when only 10 percent of it is used 90 percent of the time didn't make sense. Second, our programs were more than 80,000 lines of Perl code. We had written them for mod_cgi, and it would take a while to port all of them to mod_perl and test them. (On a more technical note, PerlRun was a possible replacement for mod_cgi, but that didn't work in our case at that time, for reasons that are beyond the scope of this article.)
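A split of this kind can be sketched in the Apache configuration roughly as follows. This is only an illustration, not our actual configuration: the directory paths and URL prefixes are hypothetical, and the directives are written in mod_perl 2.0 syntax (under mod_perl 1.x the equivalent would be PerlHandler Apache::Registry).

```apache
# Hypothetical paths and URL prefixes, for illustration only.

# Rarely used scripts stay under mod_cgi: a new Perl process per request.
ScriptAlias /cgi-bin/ /home/example/cgi-bin/

# Heavily used scripts run under the Registry handler: each Apache child
# compiles a script once and keeps the compiled code in memory.
Alias /perl/ /home/example/perl/
<Location /perl/>
    SetHandler perl-script
    PerlResponseHandler ModPerl::Registry
    PerlOptions +ParseHeaders
    Options +ExecCGI
</Location>
```

Only the scripts placed under the second directory add to the size of the Apache processes, which is the point of moving just the heavily used part.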

After moving the scripts that handle about 90 percent of our CGI requests to mod_perl Registry, the server load came under control. From peak time values of 10-plus, the load factor dropped to below 1, even though we were still serving the remaining 10 percent of requests with mod_cgi. It seemed like a very good solution.

However, like spring, it didn't last very long. As our overall traffic grew, so did the load on both the parts of the application running under mod_perl and those running under mod_cgi. We have reason to believe that the mod_cgi portion grew faster. The result was that the load factor started rising. Over a period of several months, the increase was sufficient to take it beyond 5 at our peak times. The situation was again getting difficult for the server.

A Better Solution

This time, we wanted a longer-lasting solution. Practical mod_perl has an excellent treatment of deploying mod_perl, as well as of server setup and administration strategies. We concluded that we needed to run two instances of Apache httpd, one to serve plain objects (static HTML, images, CSS, JS, etc.) and one to handle the Perl application. We followed the book's nomenclature, calling the first httpd_docs and the second httpd_perl.

httpd_docs listens on port 80 like any other web server. We also configured it with mod_proxy to act as a reverse proxy between the world and httpd_perl. httpd_perl itself runs on port 8000 and is not directly accessible from the outside.
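As a rough sketch of this arrangement, the two configuration files might contain fragments like the ones below. The /perl/ URL prefix is hypothetical, and binding httpd_perl to the loopback address is just one assumed way of keeping it unreachable from outside (a firewall rule would be another). The LoadModule lines are unnecessary if the modules are compiled in statically.

```apache
# --- httpd_docs (front end): httpd.conf fragment ---
Listen 80
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

# Reverse proxy only; never act as an open forward proxy.
ProxyRequests Off
# Hypothetical URL prefix for the Perl application.
ProxyPass        /perl/ http://127.0.0.1:8000/perl/
ProxyPassReverse /perl/ http://127.0.0.1:8000/perl/
# Pass the original Host header on, so that name-based virtual hosts
# on the back end still match.
ProxyPreserveHost On

# --- httpd_perl (back end): httpd.conf fragment ---
# Listening on the loopback address only (an assumption) keeps this
# server reachable through the proxy but not from outside.
Listen 127.0.0.1:8000
```

ProxyPassReverse rewrites the Location headers of redirects issued by httpd_perl, so clients never see the internal port.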

This setup offers many advantages. In the previous setup, all the Apache processes ran with mod_perl, which meant that the Perl interpreter and parts of our application were present in memory for each and every Apache process. That, in turn, meant that even a request as simple as that for a 300-byte image tied up a large Apache process when a much smaller one would have done. Moreover, mod_proxy buffers the output, so httpd_perl can hand a response to httpd_docs, leave it the mundane task of transferring data to clients with slow connections, and be free right away to serve another Perl request.

With our newly specialized setup, we opted for a VPS without any control panel. Another factor in this decision was that we have very few domains, and by now we were comfortable managing them without a control panel. We believe that the same setup is also possible with CPanel.

On the new VPS, which already came with Apache 2.2.0 and Perl 5.8.0, we built Apache 2.2.2, customized to our needs, in the two flavors (one with mod_proxy, and the other with mod_perl version 2.0). Prior to this, we also downloaded and built version 5.8.8 of Perl. After configuring both httpd_docs and httpd_perl, it was time to bring on the application.

On this occasion, we went with full-blown mod_perl for the more heavily used scripts in the 10 percent part of the application that runs under mod_perl, and made them into Apache Registry handlers. This means that not only is the Perl interpreter loaded at the startup of the web server, but all the handlers are preloaded as well. Code loaded before the server forks is shared among the child processes, so the handlers use less memory this way and also respond faster. For the rest of our application, we are now able to use PerlRun, which uses preloaded modules, and only loads and compiles the scripts as necessary. Our website is available in about half a dozen languages, each served from its own subdomain configured as an Apache virtual host. This setup made it possible to share the entire Perl application among the virtual hosts--rather like an application that a shared hosting provider may make available to their clients.
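A minimal sketch of such an httpd_perl configuration, in mod_perl 2.0 syntax, is shown below. The paths, URL prefixes, host names and the startup file are hypothetical and stand in for details not given here. The sketch also assumes that the front end passes the original Host header through to the back end.

```apache
# Preload the handler modules in the parent process, before it forks,
# so that the children share the compiled code.
PerlModule ModPerl::Registry
PerlModule ModPerl::PerlRun
# Hypothetical startup file that preloads application modules and scripts.
PerlRequire /home/example/app/startup.pl

# Heavily used scripts: compiled once and cached (Registry).
Alias /perl/ /home/example/app/hot/
<Location /perl/>
    SetHandler perl-script
    PerlResponseHandler ModPerl::Registry
    PerlOptions +ParseHeaders
    Options +ExecCGI
</Location>

# Everything else: modules stay preloaded, but each script is loaded
# and compiled on every request (PerlRun).
Alias /cgi-bin/ /home/example/app/rest/
<Location /cgi-bin/>
    SetHandler perl-script
    PerlResponseHandler ModPerl::PerlRun
    PerlOptions +ParseHeaders
    Options +ExecCGI
</Location>

# One virtual host per language subdomain. The Alias and Location
# sections above sit at server level, so every virtual host inherits them.
NameVirtualHost 127.0.0.1:8000
<VirtualHost 127.0.0.1:8000>
    ServerName en.example.com
</VirtualHost>
<VirtualHost 127.0.0.1:8000>
    ServerName fr.example.com
</VirtualHost>
```

Because the application directories are mapped once at server level, a single copy of the code serves all the language subdomains.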

On an unrelated note, we also deployed mod_deflate on httpd_docs, the plain Apache, to compress the data before sending it to clients who accept compressed data.
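For illustration, a basic mod_deflate setup on httpd_docs could look like the fragment below. It is a sketch along the lines of the sample in the Apache documentation, not necessarily the configuration we run. It assumes mod_setenvif is available and that already-compressed image formats should be skipped.

```apache
LoadModule deflate_module modules/mod_deflate.so

<Location />
    # Compress responses. mod_deflate does so only for clients whose
    # Accept-Encoding request header allows it.
    SetOutputFilter DEFLATE
    # Images are already compressed; recompressing them wastes CPU.
    SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip
</Location>
```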

Our Results

In the new setup, we have chosen to run a much larger number of httpd_docs processes than httpd_perl processes--about five times as many. This fits the usage pattern of our site and keeps the memory requirement down. The load factor on the server now is nearly zero most of the time, and below 0.2 almost all the time. Our benchmark tests have revealed that the server can easily serve 8 to 10 times as many requests with the load factor still staying around the 0.2 to 0.3 mark. We would very likely have hit bandwidth limitations before CPU or RAM limitations with this setup, but the deployment of mod_deflate has reduced our bandwidth usage by 40 to 50 percent.
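With the prefork MPM (an assumption here), such a ratio is set through the process limits of the two servers. The absolute numbers below are purely illustrative; only the roughly five-to-one ratio comes from our setup.

```apache
# --- httpd_docs: many small processes ---
StartServers      10
MinSpareServers   10
MaxSpareServers   25
MaxClients       100

# --- httpd_perl: few large processes ---
StartServers       4
MinSpareServers    4
MaxSpareServers    8
MaxClients        20
```

The back end can get by with far fewer processes because the proxy frees each one as soon as its response has been generated, while the small front-end processes wait on slow clients.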

Our experience has shown that using the power of computing can keep hosting costs to a minimum. To be able to increase the number of customers served 10-fold without bringing costs up is really cool, if I may say so. Even better, it can often translate directly into a similar increase in revenue. In the short term, we could, in fact, reduce our hosting costs by downgrading our package to a more suitable level until the expected growth happens. Of course, it goes without saying that the improvement achieved depends on how good or bad the starting point was. In our case, we do believe that the Perl scripts were well written, but not setting them up properly to run under mod_perl was the killer.

Our next step is to increase the use of Ajax to further reduce our bandwidth and CPU usage, and, perhaps, open up the possibility of increasing the customer base another 10-fold without increasing costs.