December 04, 2006 (LinuxWorld) -- Linux has long provided an outstanding operating system for a wide range of users in a variety of settings. However, high-performance computing users, who must run applications on thousands of nodes, historically have faced challenges that Linux could not effectively address.
These issues arise for several reasons. First, installing a full, untuned copy of Linux -- or of any full-scale operating system -- on each node of a large-scale HPC system interferes with the efficient use of processor and communication resources. HPC users have also found that some inherent attributes of Linux, such as various daemons and services that run by default, can impede application performance as the operating system scales to larger numbers of processors.
Given these issues, the largest-scale HPC facilities have traditionally employed alternative specialized lightweight operating systems on compute nodes, while using Linux at the system level. Unfortunately, this strategy is not viable for all types of HPC users. After all, a specialized operating system tuned explicitly for a particular application environment simply cannot provide the breadth of services and features that may be required by users in companies and other types of HPC environments.
The ideal solution for many HPC users would be a combination of full-blown Linux at the system level, with compute nodes employing a lightweight Linux that is optimized for HPC systems. Today, Cray and others in the HPC community are working to deliver just that. In the short term, this "Linux on Compute Node" strategy will offer the greatest benefits to users of larger-scale HPC systems, allowing them to achieve better application performance without sacrificing the familiarity and feature set of Linux. However, as enterprise HPC users and applications continually demand greater scalability and more processors, this innovation ultimately may extend significant advantages to users in all types of HPC environments.
Conventional operating system approaches in HPC systems
The biggest problem that HPC users have with using full-blown Linux on all compute nodes is that Linux was designed to operate primarily in an enterprise environment, supporting desktop and server workloads. As a result, Linux is optimized for "capacity operation," for providing the greatest possible throughput in an environment in which the operating system must handle many small jobs, and for single-node interactive response time, providing, for example, prompt processing of Web server requests. In an HPC environment, however, users are more concerned about "capability operation," or achieving the best possible performance of a single application running across the entire system.
In fact, the very features that make Linux ideal for enterprise environments -- primarily operating system features and daemons that are designed to make the most efficient use of resources both when running many small jobs and when providing good interactive response -- can cause serious performance issues in HPC systems. These performance issues, which tend to arise when any full-featured operating system is used in a large-scale system, are referred to as "operating system jitter." Additionally, while the full implementation of demand-paged virtual memory used in Linux is quite appropriate for the standard Linux target market, it is not as well suited for HPC environments.
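A rough model helps show why jitter that is negligible on one node becomes serious at scale. The sketch below assumes, as an illustration rather than anything measured in this article, that an application proceeds in synchronized steps in which every node waits for the slowest one, and that on each step a node is independently interrupted by a daemon with a small probability. The function names and all parameter values are hypothetical.

```python
def prob_step_delayed(nodes: int, p_interrupt: float) -> float:
    """Probability that at least one of `nodes` is interrupted during a step.

    Assumes each node is interrupted independently with probability
    `p_interrupt` (an illustrative assumption, not a measured value).
    """
    return 1.0 - (1.0 - p_interrupt) ** nodes


def expected_step_time(nodes: int, compute_time: float,
                       jitter_delay: float, p_interrupt: float) -> float:
    """Expected duration of one synchronized step.

    Every node waits for the slowest node, so a single interruption of
    fixed length `jitter_delay` anywhere in the system delays the whole step.
    """
    return compute_time + jitter_delay * prob_step_delayed(nodes, p_interrupt)


# Hypothetical numbers: 1 ms of compute per step, a 0.5 ms daemon
# interruption, and a 0.1% chance of interruption per node per step.
for n in (1, 32, 1024, 8192):
    t = expected_step_time(n, compute_time=1.0, jitter_delay=0.5, p_interrupt=0.001)
    print(f"{n:5d} nodes: expected step time {t:.3f} ms")
```

Under these assumptions, the chance that some node delays a step climbs toward certainty as the node count grows, which is why the same daemon that goes unnoticed on a workstation can slow an entire large system.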
Historically, these problems have been manageable or even negligible in smaller-scale HPC systems, and have primarily affected only the largest-scale system users, such as those at Accelerated Strategic Computing Initiative (ASCI) facilities. However, enterprise-scale HPC users should not assume that they are immune from these issues. According to IDC studies of technical server clusters, the average cluster configuration has jumped from 683 processors (322 nodes) in 2004 to 4,148 processors (954 nodes) in 2006. This represents a sixfold increase in processor count and a threefold jump in node count in just two years, and users can expect these trends to continue. As more systems expand to thousands of nodes, whether through the adoption of multicore processors or the growth of multinode and multisocket systems, these issues will begin to significantly impede application performance for a growing class of users. Naturally, more and more HPC users are beginning to search for an alternative approach.
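The growth figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the IDC numbers cited in this article. The processors-per-node ratio is a derived figure, not one reported by IDC, and the variable names are illustrative.

```python
# Average cluster configuration as quoted in the article:
# year -> (processors, nodes)
clusters = {2004: (683, 322), 2006: (4148, 954)}

procs_2004, nodes_2004 = clusters[2004]
procs_2006, nodes_2006 = clusters[2006]

# Growth factors over the two-year period.
proc_growth = procs_2006 / procs_2004   # roughly sixfold
node_growth = nodes_2006 / nodes_2004   # roughly threefold

# Derived figure: average processors per node in each year.
procs_per_node = {year: p / n for year, (p, n) in clusters.items()}

print(f"Processor growth: {proc_growth:.1f}x")
print(f"Node growth:      {node_growth:.1f}x")
for year, ratio in sorted(procs_per_node.items()):
    print(f"{year}: {ratio:.1f} processors per node")
```

The processor count grows about twice as fast as the node count, so the average node roughly doubles in processor count over the period. That is consistent with the multicore and multisocket trends mentioned above.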
Specialized lightweight operating systems optimized for HPC
Given the scalability issues of full-scale operating systems in HPC environments, the largest supercomputing facilities have long employed alternatives to Linux on compute nodes. For these users, specialized lightweight compute node operating systems, such as Catamount, developed initially by Sandia National Laboratories and now used on its Cray XT3 System, have provided a viable option.
Catamount is well suited for many large-scale supercomputing facilities and offers a number of advantages in these environments. First, it is truly lightweight. The operating system is very small in size and performs only minimal interactions with the virtual memory system, processor context and the network interface. Catamount is not responsible for memory allocation, scheduling or job launch functions. These tasks are performed through a "user mode" process. Since most system processes and services are handled outside of compute nodes, Catamount also produces few sources of operating system jitter.
Unlike full-blown Linux, Catamount ensures that memory allocated on a per-segment basis is physically contiguous. This allows kernel drivers to program direct memory access (DMA) transfers more efficiently and with less overhead. Catamount is also very well tuned for applications written for the Message Passing Interface (MPI) programming environment, which constitute the bulk of ASCI applications. Additionally, although large-scale HPC environments do require file I/O from compute node operating systems, some of them do not require sockets, threads and many other types of conventional operating system services. By omitting such services, Catamount and other specialized operating systems are able to provide significant advantages over full-scale Linux for many HPC applications. In fact, the systems holding the top three spots on the Top500.org list of the 500 most powerful HPC systems all run specialized, lightweight compute operating systems.
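To illustrate why physically contiguous memory reduces DMA overhead, the following sketch counts how many transfer descriptors a driver would need to describe a buffer. It is a simplified model, not Catamount or Linux driver code: it assumes a buffer is described by the list of physical page frames backing it and that each run of adjacent frames can be covered by one descriptor. The function name and frame numbers are hypothetical.

```python
def dma_descriptors(page_frames):
    """Merge physically adjacent page frames into (start_frame, page_count) runs.

    Each run stands for one DMA descriptor in this simplified model.
    """
    runs = []
    for frame in page_frames:
        if runs and runs[-1][0] + runs[-1][1] == frame:
            # Frame directly follows the previous run: extend that run.
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            # Gap in physical memory: a new descriptor is needed.
            runs.append((frame, 1))
    return runs


# A physically contiguous four-page buffer needs a single descriptor.
contiguous = dma_descriptors([100, 101, 102, 103])

# The same buffer scattered across physical memory, as demand paging
# may leave it, needs one descriptor per discontiguous run.
scattered = dma_descriptors([7, 42, 43, 19])

print(len(contiguous), contiguous)
print(len(scattered), scattered)
```

In this model the contiguous buffer is described once, while the scattered buffer requires the driver to build and process several entries for the same amount of data.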
However, while Catamount may be ideal for many large-scale supercomputing applications, the particular programming model-focused tuning of the kernel done for such applications means that many users and other applications will have requirements that Catamount cannot easily meet. For example, because Catamount moves significant functionality into the application code, the specialized operating system may limit the functionality that applications can draw on from the compute nodes, and ultimately, from the system. For the scalable programming models and applications that the specialized compute node operating system was designed and written specifically to support, this will not be an issue. However, in other environments, such as in companies, users may have little control over which programming environment an application is written for and which compute node operating system functions the application will require.
Catamount was designed and optimized specifically for MPI programming. The simplicity and success of Catamount have been based on supporting only critical features. Neither Catamount nor its predecessors have provided support for symmetric multiprocessing, and Catamount provides no support for alternative programming models such as Global Address Space languages (Unified Parallel C; Co-Array Fortran) or for OpenMP, because such support would interfere with the performance of the target applications and programming environment. Catamount also does not support sockets, threading, shared file systems or other traditional operating system services that many enterprise users require -- again, because these features often interfere with the performance of the applications that it targets. Finally, Catamount development has been limited exclusively to Sandia and Cray, so Catamount users cannot benefit from the extensive code review, debugging and ongoing new feature development that characterize the Linux development community.
An alternative strategy: Lightweight Linux implementations
Cray and others in the HPC community have been exploring a new approach to the HPC compute node operating system problem. Lightweight Linux implementations, or what Cray calls Compute Node Linux (CNL), can combine the performance advantages of a specialized compute node operating system with the familiarity and functionality of Linux, while eliminating many of the disadvantages associated with a full-blown operating system. When fully realized, CNL will offer several advantages for large-scale HPC environments, and will allow users of even smaller-scale HPC systems to realize the kind of performance gains that ASCI users have enjoyed for years with products such as Catamount.
First, CNL will provide a performance-tuned operating system in a standard environment, instead of requiring a highly specialized solution. For the thousands of HPC users today who are very comfortable with Linux, the emergence of a "slimmed-down" Linux for compute nodes may present an attractive option. CNL will also provide the rich set of operating system services and system calls that users and developers expect, and that their applications may require. CNL will support sockets, OpenMP and various types of alternative file systems (such as log-structured and parallel file systems).