Thursday, June 26, 2008

A supercomputer reborn




After my last entry, I spent a bit of time poking at Linux kernel versions and found what I had created the last time I tried to do a Debian install on one of the SP high nodes: a custom-compiled 2.6.8 kernel with the Debian Sarge installer thrown in as an initrd. To my amazement, it actually booted, and I had a system running a somewhat useful version of Linux again, instead of AIX.

Now, my next goal is to get this system to run the latest kernel release and up-to-date software. I managed to locate my set of 32GB of ram for the system (arranged as 128 separate 256MB modules!), plug that in, and end up with a system that has more memory than disk space, and that uses more memory at boot (for things like page tables) than the disk even holds. Right now, I seem to be having trouble booting any kernels past 2.6.8; they cause bad page faults while trying to set up the MPIC (that's the interrupt controller for PowerPC systems).
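
To get a feel for why 32GB of ram carries a noticeable fixed cost just at boot, here's a rough back-of-the-envelope sketch in Python. The page size and per-page bookkeeping figure are assumptions about a typical 64-bit kernel of this era, not something I've measured on this machine:

```python
# Rough estimate of the kernel's fixed per-page bookkeeping for 32GB of ram.
# PAGE_SIZE and PER_PAGE_OVERHEAD are assumptions, not measured on the SP node.
RAM_BYTES = 32 * 1024**3        # 32GB of memory
PAGE_SIZE = 4 * 1024            # assuming 4KB pages
PER_PAGE_OVERHEAD = 64          # assumed bytes of struct page per physical page

pages = RAM_BYTES // PAGE_SIZE           # 8,388,608 pages to keep track of
overhead = pages * PER_PAGE_OVERHEAD     # roughly 512MB before anything runs
print("%d pages, ~%dMB of per-page overhead" % (pages, overhead // 1024**2))
```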

After I get both of those in place, I am hoping to take a spare fiber channel card and the SCSI/FC target driver in the current Linux kernel, and turn the system into a 32GB or so solid-state disk. I'm thinking that this would be a good concept to test, as it's way cheaper than getting a modern solid-state disk from Texas Memory Systems. In fact, you can pick up a new commodity x86 system with 128 or 256GB of ram and a fiber channel card or two for significantly less than a commercial solid-state disk solution.

Sunday, June 22, 2008

The death of a supercomputer



In February of 2005, Purdue was given the hardware that used to be Blue Horizon, the San Diego Supercomputing Center's old IBM RS/6000 SP supercomputer, which had been purchased through funding from the NSF. When it was new in 2000, it placed 8th on the list of the top 500 supercomputers in the world. By the time Purdue acquired it, the system was well off the bottom of the chart. The system was a set of 144 8-processor, 375MHz POWER3 "SP high node" (9076-N81) systems with 4GB of ram each.

The people in charge of my department, the Rosen Center for Advanced Computing, had decided that the price of the system (free plus shipping) was good enough to send two of our hardware guys out to condense the system down, maxing out each node by going from two 4-processor modules to four modules (16 processors) and from 4GB to 16GB of memory per system. At the time, we had the free power and floor space, so it seemed like a reasonable idea, and the systems were still computationally useful for a year or so after we got them. We condensed the system down from nearly 40 racks of machines to just 10, each node somewhere around twice as fast as my dual-G5 at doing a Linux kernel compile (one of my standard metrics for testing speed; I did this under a Debian Linux install on both systems).
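
The kernel-compile metric is nothing fancy; something along these lines is what I mean, though the source path and job count here are placeholders rather than my exact setup:

```python
#!/usr/bin/env python
# Rough sketch of timing a Linux kernel build as a crude speed metric.
# KERNEL_TREE and JOBS are placeholders; adjust for the machine under test.
import subprocess
import time

KERNEL_TREE = "/usr/src/linux"   # hypothetical path to an unpacked kernel tree
JOBS = 8                         # e.g. one make job per CPU on an 8-way node

def timed_kernel_build(tree=KERNEL_TREE, jobs=JOBS):
    subprocess.check_call(["make", "-C", tree, "clean"])
    start = time.time()
    subprocess.check_call(["make", "-C", tree, "-j%d" % jobs, "vmlinux"])
    return time.time() - start

if __name__ == "__main__":
    print("kernel build took %.1f seconds" % timed_kernel_build())
```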

Unfortunately, the amount of time and effort necessary to set up an IBM SP and AIX as a useful compute resource is non-trivial. Adding to the problem, we lost two of our senior systems administrators in the Summer of 2005, one of whom was our AIX guru and had set up our existing IBM SP systems. I had played around a bit with our testing SP cluster, including reinstalling it, and discovered just how much effort is required to make a useful system. The software necessary to do a base OS install on an SP alone has an install manual that's over an inch thick.

By the Summer of 2006, our management finally decided that we should get rid of the system, which meant that some of it would be coming home with me. So, I purchased the system from Purdue's surplus store for somewhere around $500.

The first rack of nodes (there are four nodes per rack) went onto eBay and sold to a researcher in China for about enough money to pay for my endeavor. I shipped one more rack to a computer collector in New Jersey, and on a Saturday evening after the annual Vintage Computer Festival/Midwest show that I ran, one more rack of systems was loaded into the back of a Toyota minivan and headed up with a lawnmower to Ontario, Canada. Later, a second rack would go to Canada, a few nodes would get scrapped for parts, some were stripped of ram to sell to a reseller, another rack went to a company in Minnesota, and the rest sat around in my warehouse until they got scrapped or used for parts. All that remain are two of the original nodes and a few boxes of boards, heat sinks, memory, and other parts that need to go to a scrapper so they can be recycled.

I'm still keeping the two nodes and 8 or so SP thin nodes (9076-270), partly because the POWER architecture is neat, and partly because a machine with 16GB of ram in it is still kinda pricy. Plus, in a bit more than a month, we'll be retiring our remaining SP system, which has memory modules that can push my two nodes to 64GB each. Sure, you can buy faster machines with the same amount of memory, smaller and less power-hungry, from people like Sun, but I can also pay for a lot of power with the amount of money that one new machine would cost.

I've actually managed to get Debian Linux running on them; they don't work with many kernel versions, but a 2.6.8 kernel that came with a past Debian installer seemed to work OK. I'm trying to revive the nodes I have, but I don't seem to be able to get the bootable installer anymore from the usual Debian places, and I'm not sure if I have it archived off somewhere accessible. Still, I should be able to take an installed copy running on a different machine, clone it, rebuild and install the correct kernel version, and boot that on the system. I just haven't had enough time or round tuits to do that yet.

So it seems, even in the death of a supercomputer, the machine still lives on: dissected and disseminated to other countries, providing what help it can to further science, serving as an interesting conversation piece, or being recycled into parts for tomorrow's supercomputing hardware.

Friday, June 20, 2008

"My" SiCortex SC5832


Well, I actually just run it; my employer, Purdue, owns it for now. But in 5 years or so, maybe I'll get a chance to buy it for my own enjoyment, just like other things I've managed to collect.

It is a fun little MIPS processor cluster in a box; the one we have is populated with 540 nodes, each with six 64-bit MIPS cores and 8GB of ram. The cores are kinda poky (500MHz), but each node CPU chip (with 6 cores, a PCI-E controller, two DDR2 memory controllers, and the high-speed interconnect fabric switch) draws a whole 600mW. The system draws less than 11kW as we have it configured now, and looks quite pretty. That's approximately the same amount of power that a single rack of 24 of our new "steele" cluster nodes uses (Dell 1950s with dual quad-core Xeon E5410 CPUs and 16 or 32GB of ram each).
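
Just to put those numbers together (only the node count, cores per chip, chip power, and system power come from above; the rest is back-of-the-envelope):

```python
# Back-of-the-envelope arithmetic from the SC5832 figures quoted above.
NODES = 540
CORES_PER_NODE = 6
CHIP_POWER_W = 0.6          # 600mW per six-core node chip
SYSTEM_POWER_W = 11000      # a bit under 11kW for the system as configured

cores = NODES * CORES_PER_NODE              # 3,240 cores total
chip_power = NODES * CHIP_POWER_W           # ~324W in the CPU chips themselves
per_core = SYSTEM_POWER_W / float(cores)    # ~3.4W per core, counting memory,
                                            # fabric, fans, and power conversion
print(cores, chip_power, round(per_core, 1))
```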

We got the system as part of a "green" computing initiative that we've been working on, as our new President seems quite interested in doing green things. It is also one of the few things we have for HPC that's not a commodity x86 cluster. We got this just after replacing three of our old compute clusters (dual-processor, single-core things) with a new dual-socket, quad-core cluster, also in the name of green computing. Hopefully this will stave off our need for a new datacenter for at least another year or two.

There is some hope that the system will help revive an old project, started by the late David Moffett, called the "High Performance Classroom": letting students learn about parallel computing through the use of dedicated compute time on the system. The idea was suggested by Matt Reilly of SiCortex when he came to visit Purdue and talk about the system we have. This is one of the few machines my work owns that weren't purchased with funds from users or specific grants, and for which we actually have some leeway in specifying how the system gets used.

Our other discretionary resource is a cluster of 3-year-old desktop computers that gets updated every year; old machines from student computer labs get rotated out of use after 3 years and into a cluster that's about 500 machines in size. It's the only dedicated resource we currently have running that anyone on campus can get an account on to run their own jobs. Some of these machines are pictured in the background of the picture of the SiCortex above.

In any case, hopefully our higher-up management decides that the SiCortex is a good enough machine to keep around (we got in on a sort-of rent-to-own plan) and to use to help further one of the fundamental goals of Purdue: not the Research that keeps bringing in money, fancy awards, and prizes, but Teaching Students, which sometimes seems like it gets forgotten behind the gold-plated awards from the NSF and other organizations for research.