Supercomputer

From JimboWiki
Revision as of 04:34, 13 November 2009 by J (Talk | contribs)


Purpose

I have a bunch of older machines. There's nothing wrong with them, and they're not really all that old. I have a need for a multi-CPU system for some development work and for running Virtual Machines. Cluster computing and parallel processing have always interested me, so why not build a supercomputer in my basement?

Choosing A Cluster Platform

There are quite a few open source cluster platforms. My requirements include:

  • Must be free (as in no monetary cost), preferably Open Source
  • Must not require any special compilers, procedures, or programming languages - I want to run programs and have them execute somewhere in the cloud automatically and transparently.
  • Must support a graphical environment - I want to have a multi-monitor setup (starting with 2, maybe up to 6 at a later date) to fit all the programs I'm running in parallel and to impress the less technically savvy.
  • Preferably Linux based - since I know it, and it costs nothing
  • Preferably supports running virtual machines that can use cores from multiple cluster nodes simultaneously, or at least allows some way to specify that the virtual machine's process should consume all available resources on one cluster node
  • Preferably using a recent kernel, if kernel patches are required
  • Preferably supported by a reasonably active community, if open source

NOTE: The analysis below was undertaken in mid-November 2009. Things change over time. Do not let my findings discourage you from looking into the systems I have not used. Most of these are really neat, but do not meet my specific requirements for one reason or another.

Kerrighed

Kerrighed will be used for this project. Reasons why others were dismissed are discussed in each section.

Kerrighed is a Single System Image (SSI) system, meaning that all the nodes share the same file system and appear as one machine. Kerrighed combines process management data from all nodes, which means we can use normal tools (ps, top) to see everything that is happening on the cluster as if it were all one single machine. Process IDs are unique throughout the cluster, all memory is shared, and all processors are visible from all nodes. Processes can migrate, or move, to other nodes automatically to balance the load.

How it measures up against the requirements:

  • It's free and open source
  • Normal processes are migrated - there is no need for special compilers or toolkits
  • There does not seem to be anything in the way of running a graphical environment, but it may be difficult to set up only one or two nodes with high-powered graphics cards - this requires further investigation
  • It's Linux based - Kerrighed is a set of kernel patches and modules with some supporting tools
  • It seems like it's possible to run virtual machines, assuming the virtual machine server can be compiled on the same kernel version as Kerrighed. I'm not sure if it is possible to span a single VM across multiple physical systems yet - this requires further research and experimentation. Even if I can't, so long as VMs can be run, that is acceptable.
  • Currently using kernel 2.6.20, which is fine
  • Updates as recent as last month

OpenSSI

  • Latest Release: August 2006
  • Dismissed because it does not support recent kernels right now

openMosix

  • Project terminated on March 1, 2008

MOSIX

  • Not free or open source
  • Free only for researchers and students

LinuxPMI

  • Continuation of openMosix
  • Work on updating for newer kernels is ongoing
  • Dismissed because documentation is lacking, and it doesn't seem like the stuff for newer kernels is ready yet

PelicanHPC

  • A Knoppix-based LiveCD for creating clusters quickly
  • Dismissed because I want a full installation, not a LiveCD
  • Seems like an active project, and uses a recent kernel

Chromium

  • Dismissed because it is specific to graphics and rendering, not general purpose computing

Sun Grid Engine

  • Dismissed because it doesn't support transparently moving normal processes around the cluster - "jobs" (batch scripts or MPI programs) must be explicitly submitted using a special tool
  • Jobs can be moved around, and there is load balancing
  • There are features to move jobs off of systems that are actively being used as workstations

SLURM

Simple Linux Utility for Resource Management

  • Similar in design to Sun Grid Engine, so dismissed for the same reason

OpenNebula

  • A virtual machine cluster management system
  • Dismissed because there doesn't seem to be a way to have a single VM use processors from more than one physical system simultaneously. (Meaning I would have a bunch of single-CPU virtual machines with no parallel computation features)

Scyld

  • Not free or open source


Implementation Plan

Stage 1 - Proof of Concept
This stage will use minimal hardware (1 storage server, 2 cluster nodes, 10/100 networking) to demonstrate that the solution can meet the requirements set out above. An installation procedure and power up/down procedures will also be produced during this stage.
Stage 2 - Basic Implementation
This stage will use expanded hardware (maximum number of cluster nodes available - probably about 6, Gigabit networking) to demonstrate the full potential of the cluster. This stage will start with a fresh install, using the procedure created in the Proof-of-concept. The setup created here will be expanded into the final implementation. Procedures for managing processes throughout the cluster will be investigated and developed. A benchmark procedure will be investigated, developed, and demonstrated.
Stage 3 - Expanded Implementation
This stage will build on the implementation from Stage 2 - no new computing hardware will be added. The graphical environment will be installed and tested. Multi-monitor support will be added. Scripts will be created to automatically power up all nodes, to initialize the cluster, to shutdown all nodes, and for other tasks.
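As a starting point for the power-up script, here is a sketch of a Wake-on-LAN sender in plain bash. This assumes the nodes' BIOS and NICs have Wake-on-LAN enabled, and the MAC address shown is a placeholder:

```shell
#!/bin/bash
# Build a Wake-on-LAN "magic packet" for one MAC address and write it to
# stdout: 6 bytes of 0xFF followed by the MAC address repeated 16 times.
wol_packet() {
    # Turn "00:11:22:33:44:55" into printf escapes "\x00\x11\x22\x33\x44\x55"
    escaped=$(echo "$1" | tr -d ':' | sed 's/../\\x&/g')
    printf '\xff\xff\xff\xff\xff\xff'
    for i in $(seq 1 16); do
        printf '%b' "$escaped"
    done
}

# Usage: broadcast the packet as UDP using bash's /dev/udp redirection:
# wol_packet 00:11:22:33:44:55 > /dev/udp/255.255.255.255/9
```

Looping this over each node's MAC address would power up the whole cluster in one command.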
Stage 4 - Performance Evaluation
Benchmark tests will be run to test the system's capability. If problems are found, they will be addressed and the tests will be run again. The affected procedures will be updated accordingly.
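The benchmark procedure could start as simply as timing a fixed amount of CPU-bound work at increasing levels of parallelism: on a single machine, elapsed time should stop improving once the worker count exceeds the local core count, while on the cluster it should keep improving as workers migrate to idle nodes. A minimal sketch, with a busy loop standing in for real work:

```shell
#!/bin/bash
# Time N parallel CPU-bound workers and report wall-clock elapsed time.
run_bench() {
    nworkers=$1
    start=$(date +%s)
    for n in $(seq 1 "$nworkers"); do
        # Placeholder workload: sum integers in a busy loop
        ( i=0; s=0
          while [ "$i" -lt 50000 ]; do s=$((s + i)); i=$((i + 1)); done ) &
    done
    wait
    end=$(date +%s)
    echo "workers=$nworkers elapsed=$((end - start))s"
}

run_bench 2
run_bench 4
```

Plotting elapsed time against worker count before and after adding nodes would make the cluster's scaling visible at a glance.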

Procedures

Installation and Configuration

Power Up

Initialize Cluster

Power Down

Performance Benchmark

Script Installation

Journal

November 2009

  • Started this page, outlined the plans
  • Wrote up analysis of cluster platforms

References