Saturday, October 9, 2010

IBM p550 at home with Debian and Infiniband


It's time for yet another blog post on getting once-expensive machines running at home.

Introduction

I'm working with an IBM pSeries machine called a p550 (specifically, a model 9113-550 in IBM speak). It was built in 2004, had a list price of some 10s of kilo-bucks new, has 4 x 1.5GHz POWER5, 8GB RAM, and runs AIX up through the most recent releases of AIX 7.

It has a built-in hypervisor and what IBM calls "LPAR" support, a mode of virtualization that gives you "Logical PARtitions" of the memory and CPUs in the machine, with a granularity of 1/10th of a CPU. LPAR support requires a desktop machine that IBM calls an HMC, or "Hardware Management Console", which breaks out all of the logical consoles on the machine and lets you configure resource allocation and things like virtual ethernet switches and virtual SCSI adapters. In addition, a piece of software for the machine called VIOS, or "Virtual I/O Server", is required in LPAR mode if you want to share hardware adapters (e.g., ethernet, SCSI, or Fibre Channel adapters) between OSes. Since I have neither of those, I am just running the machine in "bare metal" mode, with only one OS instance.

For I/O, the system has a built-in SCSI RAID controller, gigabit ethernet, a service processor which controls functions like power on the machine, 5 internal hot-plug PCI-X slots, and an external link that allows more I/O trays with disks and PCI-X cards to be added. I have installed a Mellanox InfiniHost Infiniband card to hook up to my Infiniband fabric.

Making the Service Processor work for you

In addition to the serial console port, the system has a pair of ethernet ports, which are designed to connect to a system HMC, but which also allow https-based access to the service processor menus. By default, it will try to get an address via DHCP, or you can configure it through the serial port. The service processor requires you to log in to do anything. I believe that the default username/password combination is admin/admin. That's what we had it set to on the machines at work.

To set the IP address, you need to navigate through the menus:
5. Network Services
1. Network Configuration
1. Configure interface Eth0
Then, choose either static or dynamic, and enter information as needed.

In order to get this to work for me, I had to use Firefox and enable an SSL option, because while it uses https, it uses a somewhat insecure SSL cipher that is disabled by default. To enable it, put "about:config" in the address bar and change the option "security.ssl3.rsa_null_md5" to "true". Once you do that, you can get to the web version of the service processor menus (ASMI in IBM-speak) at https://1.2.3.4 (replacing 1.2.3.4 with the IP address you set above).

One additional thing you will probably want to set up is "Serial Port Snoop" under System Service Aids -> Serial Port Snoop. Setting a "Snoop String" will allow you to enter that string through the serial console to force-reboot the machine if it locks up, or if you do something wrong while booting and the console isn't set to the right place.

Installing Debian

I net-booted the installer. To do this, set up the host in dhcpd.conf with an entry like this:

host p550 {
    hardware ethernet 00:02:55:df:d5:dd;
    fixed-address p550.blah;
    next-server storage.blah;
    filename "/tftpboot/debian-squeeze-ppc64-vmlinuz-chrp.initrd";
}
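
The filename above also has to be fetchable over TFTP from the next-server host. Roughly, something like this on the TFTP server is enough (a sketch only, assuming tftpd-hpa; note that if the daemon runs chrooted to its serving directory, the filename in dhcpd.conf is interpreted relative to that directory rather than as an absolute path):

# apt-get install tftpd-hpa
# mkdir -p /tftpboot
# cp debian-squeeze-ppc64-vmlinuz-chrp.initrd /tftpboot/
# chmod 644 /tftpboot/debian-squeeze-ppc64-vmlinuz-chrp.initrd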

Boot the machine into OpenFirmware (hit "8" at the firmware "IBM IBM IBM IBM ..." screen), and net-boot from there:

0> boot net console=hvsi0

If you don't boot with the right arguments from OpenFirmware, you won't get a working console when you boot into the installer. That's where the "Serial Port Snoop" option from the service processor comes in handy.

Once you get to the end of the installer, you will need to do some magic to get the bootloader (yaboot) installed. Hopefully, the Debian people will get some of this sorted out before the release of Squeeze. Tell the installer that you want a shell, then do this:

# mount --bind /dev /target/dev
# chroot /target
# mount -t proc proc /proc
# yabootconfig
# ybin
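
For reference, yabootconfig writes an /etc/yaboot.conf, and ybin then installs the bootloader according to it. A minimal sketch of the sort of file it generates (the device names, partition number, and kernel paths here are illustrative assumptions, not my actual file):

# /etc/yaboot.conf -- illustrative sketch, not the generated file verbatim
boot=/dev/sda1                  # PReP boot partition that ybin writes yaboot into
partition=3                     # partition holding the root filesystem
root=/dev/sda3
timeout=50
install=/usr/lib/yaboot/yaboot

image=/vmlinux
        label=linux
        initrd=/initrd.img
        read-only
        append="console=hvsi0"  # same console argument as the OpenFirmware boot above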

Upgrading firmware

Debian doesn't include the update_flash binary in its powerpc-utils package, so download the latest powerpc-utils binary release in RPM format.

Convert that to a .deb with alien (apt-get install alien if you don't have it):

# alien powerpc-utils-1.2.3-0.ppc.rpm

then

# apt-get remove powerpc-ibm-utils powerpc-utils
# dpkg -i powerpc-utils_1.2.3-1_powerpc.deb

Now, you can download a new flash image from IBM. Once you get it, use alien to convert and unpack the rpm, and do "update_flash ./tmp/fwupdate/01SF240_403_382", where 01SF240_403_382 is the flash image name from the RPM you downloaded. When you reboot the system, Linux will update the system flash just before rebooting.
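
Putting that together, the sequence looks roughly like this (a sketch; the firmware RPM file name is an assumption based on the image name above, so use whatever IBM's download actually gives you):

# alien 01sf240_403_382.rpm                     # assumed RPM name; convert it to a .deb
# dpkg-deb -x 01sf240-403-382*.deb .            # unpack into the current directory
# update_flash ./tmp/fwupdate/01SF240_403_382   # stages the image; it is written at the next reboot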

Infiniband and beyond

I initially had some problems getting Infiniband set up and going. I'm using a Topspin SDR Infiniband adapter, which is basically a stock Mellanox InfiniHost. It seems that the hypervisor on the machine wasn't allocating all of the resources that the card was asking for.

After some discussion on the linuxppc-dev mailing list, it was pointed out that there are certain slots in the machine which the system calls "super slots", and to which the firmware is willing to allocate more resources than a typical PCI-X card requests. This Redbook (PDF) on IBM's Redbooks site details Infiniband usage on pSeries systems; section 3.4.3 indicates which slots you may install an Infiniband adapter into on certain machines. On a p550, these are slots C2 and C5. I had plugged my IB adapter into slot C1, which is why I was having problems.

After moving the card into the right slot, it was just a matter of getting the right drivers loaded in the host OS. In order to use IP over Infiniband, you'll want the ib_ipoib module. To use RDMA and the verbs interface, you'll want the ib_umad and ib_uverbs modules loaded. At this point, it basically acts like a typical Linux system with Infiniband, just with more I/O bandwidth than you can get out of a typical PCI-X based system.
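
A minimal sketch of bringing the interface up once the card is in the right slot (the IP address is just an example; ibstat comes from the infiniband-diags package, and ib_mthca is the driver for InfiniHost-family cards):

# modprobe ib_mthca                       # InfiniHost HCA driver
# modprobe ib_ipoib                       # IP over Infiniband
# modprobe ib_umad
# modprobe ib_uverbs                      # userspace MAD/verbs access for RDMA
# ip link set ib0 up
# ip addr add 192.168.10.2/24 dev ib0     # example IPoIB address
# ibstat                                  # check that the port comes active (needs a subnet manager on the fabric)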

What next?

Setting up an HMC, and playing around with virtualization on the machine sounds like it could be a good time.

Saturday, October 2, 2010

Running an Altix 4700 supercomputer at home

Introduction

At the beginning of the year, work decommissioned the SGI Altix 4700 system that we put into production around January 2007. It sat around unused, and we had little luck finding a buyer for the system - it seems that no one is really commercially interested in running Linux on big Itanium systems anymore.

What's an Altix 4700?

Briefly, an SGI Altix 4700 is a large multi-processor SSI (single-system image) supercomputer. It puts Intel Itanium 2 processors (in my case, 1.6GHz, dual-core Montecito) and memory on blades, which are interconnected using a "ccNUMA" architecture. That stands for "cache-coherent Non-Uniform Memory Access" - basically, a method of making large SMP-like machines by gluing processors with local memory together over a system interconnect.

With NUMA, unlike SMP, there is memory that is closer to (and thus faster from) each CPU. Like SMP, however, the system is contained in one single address space (unlike, say, a cluster which is connected using Ethernet or Infiniband). It thus runs a single OS image, and looks to the user like one large SMP system.
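
You can see this topology directly from a running Linux system; a quick check (assuming the numactl package is installed):

# numactl --hardware      # lists each NUMA node with its CPUs, local memory, and the inter-node distance matrix
# numastat                # per-node counters of local vs. remote memory allocations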

System Specs

The system that I have is contained within one rack, and has 4 "bricks", each with 8 processor blades, each blade containing 2 dual-core 1.6GHz Montecito processors and 16GB of RAM, plus one system I/O blade with disks, PCI-X slots, gigabit ethernet, USB, etc., and assorted NUMA routing blades and system controllers.

That totals 128 cores and 512GB of RAM. The theoretical peak rating of the system as configured is approximately 820 GFLOPS. Of course, 128 cores and 256 DIMMs draw a bit of power...
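
For the curious, that peak figure follows from each Itanium 2 core retiring 4 double-precision floating-point operations per clock (two fused multiply-adds): 128 cores x 1.6GHz x 4 FLOPs/cycle = 819.2 GFLOPS.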

Powering/cooling a supercomputer

Running as a full system, the computer draws 9kW of power, and requires 2 x 200-240V, 30A power circuits. That's a lot of power. I pay about $0.072/kWh, so running the system for one hour costs me about $0.65.

One issue with running a system that draws 9kW is that you get 9kW of heat output. As a coworker of mine has said, at work we run heaters that produce computation as a side effect. The easiest and most cost-effective solution is to open up some windows, and turn on some fans. With an outside temperature of about 60F, I can set up a few fans, open some windows, and keep the temperature inside below 80F.

It is possible to deal with this problem and make the machine a bit friendlier and less power-hungry to run: you can run less than a full system. By pulling out the blades that you don't want to run, you can cut down the power usage by a proportional amount. For testing purposes, I have run the system with either 1/2 or 1/8 of the blades installed to reduce my power usage.

One thing I've noticed is that EFI state information (its equivalent of a PC's "CMOS" configuration memory) is stored and updated only on one system blade. So, you really want to make sure that you have blade "0" (the bottom left blade in the chassis marked "001c01") installed, or booting will become much more difficult.

Installing Debian

At work, we ran the machine with SuSE. Due to licensing issues, the fact that SuSE sucks to administer, and that I prefer to run Debian on things, I got to installing Debian. The machine runs EFI, Intel's "next-generation BIOS" that is used on Itanium (IA-64) systems and some x86 (PC-like) systems such as Apple's Intel-based machines. The boot process is pretty close to PXE booting, and Debian seems to have pretty good IA-64 support. The install went smoothly - I ended up installing "Squeeze", the next version of Debian to be released.

Kernel changes

In general, the Debian kernel just works. However, it only has support for up to 64 CPUs (cores) built in. I downloaded the latest kernel sources from ftp.kernel.org, primed the configuration with the Debian kernel config, adjusted the maximum number of CPUs, recompiled, and rebooted the system into the new kernel.
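
Roughly, the process looks like this (a sketch, assuming a kernel source tree unpacked in /usr/src/linux and the stock Debian IA-64 config present in /boot):

# cd /usr/src/linux
# cp /boot/config-$(uname -r) .config                           # start from the Debian kernel's configuration
# sed -i 's/^CONFIG_NR_CPUS=.*/CONFIG_NR_CPUS=128/' .config     # raise the CPU limit past 64
# make oldconfig                                                # answer any new options
# make -j 64 && make modules_install install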

I should note that being able to do a make -j 64 does a lot to speed up a kernel compile... :)

Running HPL

HPL is the standard benchmark for testing the effective speed of supercomputers, and is used for the rankings on the "Top500" list at http://top500.org.

HPL is also contained within the "hpcc" benchmark collection, which is how I ran the benchmark. At first, I tried the Debian package for hpcc. I got some fairly poor results, because it doesn't use an optimized "BLAS" math library. To speed things up, I went to TACC's web site and downloaded GotoBLAS2, compiled that with the added gcc option "-mtune=itanium2", copied it to /opt, and built hpcc from its source. The instructions here are useful for building these two software packages.
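
In outline, the build went something like this (a sketch only; the hpcc version, the template makefile name, and the GotoBLAS2 build flags are assumptions from memory rather than my exact commands):

# cd GotoBLAS2
# editor Makefile.rule                                    # add -mtune=itanium2 to the optimization flags
# make && cp libgoto2.a /opt/
# cd ../hpcc-1.4.0                                        # assumed version
# cp hpl/setup/Make.Linux_ATHLON_CBLAS hpl/Make.ia64      # start from one of the shipped template makefiles
# editor hpl/Make.ia64                                    # set CC = mpicc, LAdir = /opt, LAlib = $(LAdir)/libgoto2.a -lpthread
# make arch=ia64
# mpirun -np 128 ./hpcc                                   # reads its parameters from hpccinf.txt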

What next?

The system's CPUs aren't all that fast compared to modern CPUs. For example, the rating in gigaflops of all 128 cores is about 4x the rating of a 3.2GHz IBM/Sony Cell BE CPU. One place where the system does have an advantage, though, is its 512GB of DDR2 memory. Someone could easily plug an Infiniband, 10 gigabit ethernet, or Fibre Channel card into the machine and turn it into a pretty snappy solid-state drive, accessible over iSCSI, FCoE, Infiniband SDP, or straight FC.

The next item that I'm going to work on is getting some FPGA blades from another Altix system working in the 4700, and testing out writing code that uses the FPGAs to speed up compute-intensive tasks. An example of what FPGAs are typically used for is searching a genomics database for a particular DNA sequence. Basically, any algorithm that is applied to a stream of data can be a good candidate for putting into an FPGA.