Supercomputer

This document provides (or will provide) the following information:

A very brief overview of some selected cluster platforms
How I approached the task of building my own supercomputer using Kerrighed
Detailed instructions to build your own supercomputer using Kerrighed
Detailed instructions to benchmark your Kerrighed supercomputer
Detailed instructions to do more advanced things with your Kerrighed supercomputer

This document does not answer the following questions:

What is a supercomputer?
What are the basic concepts of parallel programming?
Why do you have so many computers stacked up in your basement?
Where is the restroom?

Purpose

I have a bunch of older machines. There's nothing wrong with them, and they're not really all that old. I have a need for a multi-CPU system for some development work and for running Virtual Machines. Cluster computing and parallel processing have always interested me, so why not build a supercomputer in my basement?

Choosing A Cluster Platform

There are quite a few open source cluster platforms. My requirements include:

Must be free (as in no monetary cost), preferably Open Source
Must not require any special compilers, procedures, or programming languages - I want to run programs and have them execute somewhere in the cloud automatically, and transparently.
Must support a graphical environment - I want to have a multi-monitor setup (starting with 2, maybe up to 6 at a later date) to fit all the programs I'm running in parallel and to impress the less technically savvy.
Preferably Linux based - since I know it, and it costs nothing
Preferably supports running virtual machines that can use cores from multiple cluster nodes simultaneously, or at least allows some way to specify that the virtual machine's process should consume all available resources on one cluster node
Preferably using a recent kernel, if kernel patches are required
Preferably supported by a reasonably active community, if open source

NOTE: The analysis below was undertaken in mid-November 2009. Things change over time. Do not let my findings discourage you from looking into the systems I have not used. Most of these are really neat, but do not meet my specific requirements for one reason or another.

Kerrighed

Kerrighed will be used for this project. Reasons why others were dismissed are discussed in each section.

Kerrighed is a Single System Image (SSI) system, meaning that the whole cluster behaves like one machine and all the nodes share the same file system. Kerrighed combines process management data from all nodes, which means we can use normal tools (ps, top) to see everything that is happening on the cluster as if it were all on one single machine. Process IDs are unique throughout the cluster, all memory is shared, and all processors are visible from all nodes. Processes can migrate, or move, to other nodes automatically to balance the load.

How it measures up against the requirements:

It's free and open source
Normal processes are migrated - there is no need for special compilers or toolkits
There does not seem to be anything in the way of running a graphical environment, but it may be difficult to set up only one or two nodes with high-powered graphics cards - this requires further investigation
It's Linux based - Kerrighed is a set of kernel patches and modules with some supporting tools
It seems like it's possible to run virtual machines, assuming the virtual machine server can be compiled on the same kernel version as Kerrighed. I'm not sure if it is possible to span a single VM across multiple physical systems yet - this requires further research and experimentation. Even if I can't, so long as VMs can be run, that is acceptable.
Currently using kernel 2.6.20, which is fine
Updates as recent as last month

OpenSSI

Latest Release: August 2006
Dismissed because it does not support recent kernels right now

openMosix

Project terminated on March 1, 2008

MOSIX

Not free or open source
Free only for researchers and students

LinuxPMI

Continuation of openMosix
Work on updating for newer kernels is ongoing
Dismissed because documentation is lacking, and it doesn't seem like the stuff for newer kernels is ready yet

PelicanHPC

A Knoppix-based LiveCD for creating clusters quickly
Dismissed because I want a full installation, not a LiveCD
Seems like an active project, and uses a recent kernel

Chromium

Dismissed because it is specific to graphics and rendering, not general purpose computing

Sun Grid Engine

Grid Engine

Dismissed because it doesn't support moving normal processes around the cluster - all "jobs" are MPI based, and must be explicitly started using a special tool
Jobs can be moved around, and there is load balancing
There are features to move jobs off of systems that are actively being used as workstations

SLURM

Simple Linux Utility for Resource Management

Looks similar to Sun Grid Engine

OpenNebula

A virtual machine based "cloud computing" system
Dismissed because there doesn't seem to be a way to have a single VM use processors from more than one physical system simultaneously. (Meaning I would have a bunch of single-CPU virtual machines with no parallel computation features)
Supports expanding resource usage to the Amazon Cloud

Eucalyptus

Similar concept to OpenNebula
More commercialized

Ubuntu Cloud

Based on Eucalyptus
Fairly commercialized product, minimal documentation

Scyld

Beowulf style cluster
Not free or open source

Implementation Plan

Stage 1 - Proof of Concept (Started 14-Nov-2009, finished 22-Nov-2009): This stage will use minimal hardware (1 storage node, 2 cluster nodes, 10/100 networking) to demonstrate that the solution can meet the requirements set forth above. An installation procedure and power up/down procedures will also be produced during this stage.

Stage 2 - Basic Implementation (Started 22-Nov-2009): This stage will use expanded hardware (maximum number of cluster nodes available - probably about 6, Gigabit networking) to demonstrate the full potential of the cluster. This stage will start with the installation used in the Proof of Concept stage. The setup created here will be expanded into the final implementation. Procedures for managing processes throughout the cluster will be investigated and developed. A benchmark procedure will be investigated, developed, and demonstrated.

Stage 3 - Expanded Implementation: This stage will build on the implementation from the Basic Implementation stage - no new computing hardware will be added. Virtual machine host installation and usage procedures will be investigated, developed, and demonstrated. The graphical environment will be installed and tested. Multi-monitor support will be added. Scripts will be created to automatically power up all nodes, to initialize the cluster, to shutdown all nodes, and for other tasks.

Stage 4 - Performance Evaluation: Benchmark tests will be run to test the system's capability. If problems are found, they will be addressed and the tests will be run again. The affected procedures will be updated accordingly.

Procedures

Setup Overview and Network Configuration

There will be one storage node and an arbitrary number of cluster nodes. The cluster nodes will reside on their own private network, managed by the storage node.

Storage node: The storage node will not run a Kerrighed kernel. Its responsibilities are only to provide the nodes with network addresses, the PXE boot image, and the shared file system. This system will have 2 network cards - one connected to the external network, the other to the cluster network. DHCP and PXE services will only be provided to the cluster network. For my setup, the storage node will be named "frank". Its IP address on the external network will be assigned by DHCP somewhere in the 192.168.80.0/24 subnet. The external network will be connected to the eth0 network adapter. The storage node's IP address on the cluster network will be 192.168.81.1. The cluster network will be connected to the eth1 network adapter.

Cluster node: Each cluster node must be capable of PXE booting. They should all be able to run x86/i386 operating systems, since that's what the storage node will be providing.

External Network: The external network must provide internet access, to make the install process go more quickly and easily. The storage node will be the only system in the cluster that is connected directly to the external network switch. For my setup, the external network will use the 192.168.80.0/24 subnet.

Cluster Network: All cluster nodes will be connected to a private cluster network. The storage node will provide IP addresses to this network using DHCP. The storage node will act as a gateway to the external network for the cluster nodes. For my setup, the cluster network will use the 192.168.81.0/24 subnet.

Installation and Configuration

There is quite a bit of assumed knowledge here. Read all the directions before you start, and make sure you understand what is going on.

Most of this procedure happens on the storage node, as root. The only exception is when you use SSH to connect to a cluster node - even then, you will probably be at the keyboard of the storage node.

Operating System for the Storage Node

Install a Linux distribution - this step takes a while... Have a sandwich ready.
- I've used Debian Lenny, typical install (first CD only)
- It doesn't technically matter what we use, but these directions are tested with Lenny, and use Debian packages
- The cluster nodes will run Debian, regardless of what you pick to run the storage node with.
Configure the network by making /etc/network/interfaces look something like this:

# The loopback network interface
auto lo
iface lo inet loopback

# The external network interface
allow-hotplug eth0
iface eth0 inet dhcp

# The cluster network interface
auto eth1
iface eth1 inet static
	address 192.168.81.1
	netmask 255.255.255.0

Get Software for the Storage Node

Get all the packages we need by running the command below.
- DHCP Server to provide IP addresses and trigger PXE booting on the cluster network
- TFTP (trivial FTP) to provide the boot images to the cluster nodes
- NFS (Network File System) to provide the shared file system, which will be used by all cluster nodes (uses portmap for RPC calls)
- syslinux, a bootloader for our cluster nodes
- debootstrap, a program to make a Linux installation when provided a web URL

apt-get install dhcp3-server tftpd-hpa portmap syslinux nfs-kernel-server nfs-common debootstrap

Configure TFTP Server

Make /etc/default/tftpd-hpa look like this (if it doesn't already):

RUN_DAEMON="yes"
OPTIONS="-l -s /var/lib/tftpboot"

Make /etc/inetd.conf contain this line (if it doesn't already):

tftp           dgram   udp     wait    root  /usr/sbin/in.tftpd /usr/sbin/in.tftpd -s /var/lib/tftpboot

Copy the PXE bootloader to the TFTP directory by running:
- ```
cp /usr/lib/syslinux/pxelinux.0 /var/lib/tftpboot
```
Make the configuration directory:
- ```
mkdir /var/lib/tftpboot/pxelinux.cfg
```
Set up a default configuration
- This will be what is used if no other configurations apply - since we're not making any, this will be the case all the time
- Make /var/lib/tftpboot/pxelinux.cfg/default look like this (substituting in your storage node's IP, of course):

LABEL linux
KERNEL vmlinuz-2.6.20-krg
APPEND console=tty1 root=/dev/nfs nfsroot=192.168.81.1:/nfsroot/kerrighed ip=dhcp rw session_id=1
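
If you ever do want a per-node configuration instead of the catch-all default, pxelinux looks in pxelinux.cfg/ for files named after the client (including the node's IP address written in upper-case hex) before it falls back to "default". Here is a small sketch that works out that file name. The function name pxe_ip_name is my own invention for illustration, not something shipped with syslinux.

```shell
#!/bin/sh
# Print the upper-case hex form of an IPv4 address. This is one of the
# file names pxelinux tries under pxelinux.cfg/ before using "default".
# pxe_ip_name is a hypothetical helper name, not part of syslinux.
pxe_ip_name() {
    (
        # Split the dotted quad into four positional parameters.
        IFS=.
        set -- $1
        printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
    )
}
```

For example, `pxe_ip_name 192.168.81.101` gives the name to use for a file that only applies to the first cluster node.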

Configure as a Gateway

Install the Network Gateway script, per the instructions in the Script Installation section below

Configure DHCP Server

Set the interface that the DHCP server will run on - we don't want to conflict with DHCP on the external network...
- Make /etc/default/dhcp3-server look something like this (substituting your cluster network interface):
- ```
INTERFACES="eth1"
```
Configure the DHCP server to provide network configuration and boot images
- Make /etc/dhcp3/dhcpd.conf look like this:

option domain-name-servers 192.168.80.1; # change this to your DNS server
default-lease-time 86400;
max-lease-time 604800;
authoritative;

subnet 192.168.81.0 netmask 255.255.255.0 { # change this to whatever subnet you like
        range 192.168.81.101 192.168.81.200; # change this to whatever range you like
        filename "pxelinux.0"; # we copied this into the TFTP server earlier
        next-server 192.168.81.1;
        option subnet-mask 255.255.255.0;
        option broadcast-address 192.168.81.255;
        option routers 192.168.81.1; # This means all the cluster nodes will use the storage node as a gateway
}

# if you want to set specific IPs for certain machines, uncomment and modify this to your needs:
#host node1 {
#        fixed-address 192.168.81.101; # pick an address
#        hardware ethernet FF:FF:FF:FF:FF:FF; # put the machine's MAC address here
#}
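
Since the Kerrighed node list later on refers to nodes by IP address, fixed addresses are worth having. Typing a host stanza per machine gets old quickly, so here is a rough sketch that prints the stanzas from a list of MAC addresses. The function name make_host_entries, the node1, node2, ... names, and the choice to start at 192.168.81.101 are my assumptions for illustration. Adjust them to your own subnet.

```shell
#!/bin/sh
# Read one MAC address per line on stdin and print a dhcpd.conf "host"
# stanza for each one. Nodes are numbered from 1 and addresses start at
# 192.168.81.101 (both are illustrative assumptions, not requirements).
make_host_entries() {
    n=1
    while read -r mac; do
        [ -n "$mac" ] || continue   # skip blank lines
        printf 'host node%d {\n' "$n"
        printf '        fixed-address 192.168.81.%d;\n' "$((100 + n))"
        printf '        hardware ethernet %s;\n' "$mac"
        printf '}\n'
        n=$((n + 1))
    done
}
```

Feed it a file of MAC addresses (`make_host_entries < macs.txt`, where macs.txt is a made-up file name), look over the output, and paste it at the end of /etc/dhcp3/dhcpd.conf.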

Configure NFS

Make a place for the shared file system
- ```
mkdir -p /nfsroot/kerrighed
```

Add the following line to /etc/exports (substitute your cluster network's subnet)

/nfsroot/kerrighed 192.168.81.0/255.255.255.0(rw,no_subtree_check,async,no_root_squash)

Export the filesystem
- ```
exportfs -avr
```

Operating System for Cluster Nodes

Make a Debian installation for the cluster nodes

debootstrap --arch i386 lenny /nfsroot/kerrighed http://ftp.us.debian.org/debian

Copy the apt sources list into the new system so we can install new stuff on cluster nodes
- ```
cp /etc/apt/sources.list /nfsroot/kerrighed/etc/apt/sources.list
```
chroot into the new system
- ```
chroot /nfsroot/kerrighed
```
Set the root password
- ```
passwd
```
Get a /proc directory
- ```
mount -t proc none /proc
```
Avoid lots of annoying warnings
- ```
export LC_ALL=C
```
Get all the stuff we'll need to talk to the storage node and let people use the cluster nodes
- ```
apt-get update && apt-get install dhcp3-common nfs-common nfsbooted openssh-server
```
Make the "localhost" hostname work by putting the following into /etc/hosts
- ```
127.0.0.1 localhost
```

Have the filesystem mount automatically for us

ln -sf /etc/network/if-up.d/mountnfs /etc/rcS.d/S34mountnfs

Add a user so we can log in to the cluster nodes
- ```
adduser {username}
```
Configure the network by making /etc/network/interfaces look like this:

auto lo
iface lo inet loopback
iface eth0 inet dhcp

Compile the Kerrighed Kernel

(Still chroot'ed into the cluster node image)

Get the packages that we'll need to build the Kerrighed kernel

apt-get install automake autoconf libtool pkg-config gawk rsync bzip2 libncurses5 libncurses5-dev wget lsb-release xmlto patchutils xutils-dev build-essential

Get in the source directory
- ```
cd /usr/src
```

Get the Kerrighed source, decompress it, and make it easier to find later

wget http://gforge.inria.fr/frs/download.php/23356/kerrighed-2.4.1.tar.gz

gzip -dc kerrighed-2.4.1.tar.gz | tar xf -

```
ln -s kerrighed-2.4.1/ kerrighed
```

Get the Linux 2.6.20 source (this is the ONLY version Kerrighed 2.4.1 works with), and extract it

wget -O /usr/src/linux-2.6.20.tar.bz2 http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.20.tar.bz2

```
tar jxf linux-2.6.20.tar.bz2
```

Get into the Kerrighed source
- ```
cd kerrighed
```
Run the configure script to configure things correctly
- ```
./configure
```
Get into the kernel directory, and run defconfig
- ```
cd kernel
```
- ```
make defconfig
```
Run menuconfig to set all the options that we need
- ```
make menuconfig
```
While in menuconfig, do the following:
- Make sure drivers for any network cards that you need to support are compiled into the kernel - NOT as modules. You can find this under Device drivers --> Network device support
- Make sure NFS filesystem support is compiled into the kernel - NOT as a module. You can find this under File Systems --> Network File Systems
- Make sure Loadable module support --> Automatic kernel module loading is activated
- Exit menuconfig
Go to the Kerrighed directory, and make the kernel
- ```
cd ..
```
- ```
make kernel
```
Assuming there were no errors, it's time for the big one. You'll have time for a slice of pie...
- ```
make
```
Once that's done, do the wrapup work to install everything in the right places
- ```
make kernel-install
```
- ```
make install
```
- ```
ldconfig
```
Exit the chroot
- ```
exit
```

Configure Kerrighed

Make the Kerrighed kernel available via TFTP

cp /nfsroot/kerrighed/boot/vmlinuz-2.6.20-krg /var/lib/tftpboot/

Make a place in the cluster node image where configfs can be mounted
- ```
mkdir /nfsroot/kerrighed/config
```
Mount configfs on the cluster nodes by adding the following line to /nfsroot/kerrighed/etc/fstab (configfs is used by the Kerrighed scheduler)
- ```
configfs /config configfs defaults 0 0
```
Make sure the following line appears in /nfsroot/kerrighed/etc/default/kerrighed
- ```
ENABLE=true
```
Make /nfsroot/kerrighed/etc/kerrighed_nodes look like this:

session=1 #Value can be 1 – 254
nbmin=2 #2 nodes starting up with the Kerrighed kernel.
192.168.81.101:1:eth0
192.168.81.102:2:eth0
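
Once the cluster grows past two nodes, this file is easier to generate than to type. Below is a minimal sketch that prints the same layout for any number of nodes. It assumes the nodes have consecutive addresses starting at 192.168.81.101, that node IDs simply count up from 1, and that nbmin should equal the node count. The function name make_kerrighed_nodes is made up for this example.

```shell
#!/bin/sh
# Print a kerrighed_nodes file for COUNT nodes in the layout shown above.
# Assumptions: addresses run consecutively from 192.168.81.101, node IDs
# count up from 1, and nbmin is set to the full node count.
# Usage: make_kerrighed_nodes COUNT [INTERFACE]
make_kerrighed_nodes() {
    count=$1
    iface=${2:-eth0}
    echo "session=1"
    echo "nbmin=$count"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "192.168.81.$((100 + i)):$i:$iface"
        i=$((i + 1))
    done
}
```

Something like `make_kerrighed_nodes 6 > /nfsroot/kerrighed/etc/kerrighed_nodes` on the storage node would cover a six-node setup, provided the DHCP server really hands out those addresses.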

Power Up

If you just finished the installation procedure, either reboot, or set up the network and start all the servers - like this:

```
ifconfig eth1 192.168.81.1
/etc/network/if-up.d/gateway
/etc/init.d/tftpd-hpa start
/etc/init.d/dhcp3-server start
/etc/init.d/portmap start
/etc/init.d/nfs-kernel-server start
```

Enable PXE boot on each cluster node (usually a BIOS setting)
Make sure the nodes are connected to the correct switch
Turn on the nodes - they should boot
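
The storage node commands above are a natural candidate for one of the startup scripts promised in the implementation plan. Here is a rough sketch of how that might look. The function names run and start_storage_services and the DRY_RUN variable are my own inventions. With DRY_RUN set, the commands are only printed, which is handy for checking the order before running anything as root.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN is set to anything.
# run, start_storage_services and DRY_RUN are hypothetical names.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Bring up the cluster interface, the gateway rules, and the services the
# cluster nodes need in order to boot. Stops at the first failure.
start_storage_services() {
    run ifconfig eth1 192.168.81.1 || return 1
    run /etc/network/if-up.d/gateway || return 1
    for svc in tftpd-hpa dhcp3-server portmap nfs-kernel-server; do
        run "/etc/init.d/$svc" start || return 1
    done
}
```

Try `DRY_RUN=1; start_storage_services` first. If the printed list looks right, unset DRY_RUN and run it for real as root.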

Initialize Cluster

Get a shell connection to a node

ssh {user you made earlier}@192.168.81.101

Do all this to start up the cluster, and allow processes launched in this session to migrate around

krgadm cluster start
krgcapset -d +CAN_MIGRATE
krgcapset -k $$ -d +CAN_MIGRATE
krgcapset -d +USE_REMOTE_MEMORY
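
A quick way to see whether the cluster actually came together is to count processors from the node you are logged in to. This sketch assumes that the processors of all nodes show up in /proc/cpuinfo once the cluster has started, which is how I read the "all processors are visible from all nodes" behaviour described earlier. The function names count_cpus and cluster_has_cpus are made up for illustration.

```shell
#!/bin/sh
# Count "processor" entries in a cpuinfo-style file.
# Defaults to /proc/cpuinfo; the file argument exists to make testing easy.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Succeed if at least EXPECTED processors are visible.
# Usage: cluster_has_cpus EXPECTED [FILE]
cluster_has_cpus() {
    [ "$(count_cpus "$2")" -ge "$1" ]
}
```

With two single-CPU nodes, `cluster_has_cpus 2 && echo "both nodes are in"` should print the message once the cluster has started. If it does not, the second node probably has not joined.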

Power Down

It's pretty simple...

krgadm cluster poweroff

That will shut down all the nodes. Shut down the storage node in the normal way (if you want).

Performance Benchmark

Hardinfo

dhrystone

whetstone

Kerrighed test directory

http://lxr.kerlabs.com/kerrighed/source/tests/benchmark/?v=devel-kdfs

BogoMIPS

Script Installation

Network Gateway

Here's a script to make the storage node act as a network gateway so the cluster nodes can have access to the external network and beyond. The script I suggest here is very restrictive - you may want to allow more incoming connections.

Save this on the storage node in /etc/network/if-up.d/gateway (I didn't write this script [1], I just swapped the interfaces)

#!/bin/sh

PATH=/usr/sbin:/sbin:/bin:/usr/bin

#
# delete all existing rules.
#
iptables -F
iptables -t nat -F
iptables -t mangle -F
iptables -X

# Always accept loopback traffic
iptables -A INPUT -i lo -j ACCEPT


# Allow established connections, and those not coming from the outside
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state NEW -i ! eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow outgoing connections from the LAN side.
iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT

# Masquerade.
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# Don't forward from the outside to the inside.
iptables -A FORWARD -i eth0 -o eth0 -j REJECT

# Enable routing.
echo 1 > /proc/sys/net/ipv4/ip_forward

And allow execution, like so:

chmod +x /etc/network/if-up.d/gateway

Next time the network comes up, this script will run. You can run it manually to apply the changes immediately.
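
If the cluster nodes cannot reach the outside world afterwards, the first thing to check is whether the last line of the script took effect. A minimal sketch follows. The function name forwarding_enabled is my own, and the optional file argument is only there so the check can be tried against a test file.

```shell
#!/bin/sh
# Succeed if IPv4 forwarding is switched on, i.e. the kernel flag reads 1.
# forwarding_enabled is a hypothetical helper name.
# Usage: forwarding_enabled [FILE]
forwarding_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")" = "1" ]
}
```

On the storage node, `forwarding_enabled || echo "routing is off - run the gateway script"` gives a quick yes/no answer.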

VMWare Installation

Log in to one of the nodes as root
Download VMWare Server
Follow the normal installation procedure
Now a server instance will run on each node - we need to make it run on only one node...

Journal

November 2009

Started this page, outlined the plans
Wrote up analysis of cluster platforms
First cluster nodes PXE booted! Nov 17
Started to add information about benchmarking software packages
Wrote installation, configuration, startup, and shutdown guides

References

[1] http://www.debian-administration.org/articles/23