Have you ever wondered why a supercomputer is called a supercomputer? Is it the number of processors or the amount of RAM? Must a supercomputer occupy a certain amount of space, or consume a specific amount of power?
The first supercomputer, the Control Data Corporation (CDC) 6600, only had a single CPU. Released in 1964, the CDC 6600 was actually fairly small — about the size of four filing cabinets. It cost $8 million — around $60 million in today’s money — and operated at up to 40MHz, squeezing out a peak performance of 3 million floating point operations per second (flops).
In comparison, the CDC 6600 was up to 10 times faster than the fastest computer of the time, the $13-million ($91m today!) IBM 7030 Stretch, which occupied 2,000 square feet — thus earning the title of supercomputer. At this point, Intel was still seven years away from releasing the 740KHz 4004 CPU. (For a bit of fun, definitely read the original 1960 IBM 7030 press release.)
The CDC 6600 was super for other reasons, too. It was cooled with Freon that circulated in pipes around the four cabinets and was then heat-exchanged with a chilled external water supply (you can see some pipework in the bottom right corner of the image above). While there was only one CPU (which in those days was constructed from multiple circuit boards, not a single chip!), the CDC 6600 had 10 Peripheral Processors, each dedicated to managing I/O and keeping the CPU’s queue full. The CPU itself contained 10 parallel functional units, each dedicated to a different task: floating point add, floating point divide, boolean logic, and so on. In other words, the architecture was superscalar (though that word didn’t exist at the time).
The CPU had a 60-bit word length and 60-bit registers, but a very small instruction set, because it only dealt with information that had been pre-processed by the Peripheral Processors. It is this simplicity that allowed the CDC 6600 CPU to be clocked so high. By today’s standards, we would call the CDC 6600 the first RISC system.
The CDC 6600, incidentally, was designed by Seymour Cray — a name that will pop up more than once in the next few pages.
CDC followed up with the 7600 four years later in 1968, but Cray left soon after to set up his own supercomputer company, Cray Research. In 1976, the Cray 1 was released. It was installed at Los Alamos National Laboratory, where it was primarily tasked with nuclear weapons modeling (hooray for the Cold War!).
Clocked at 80MHz, the Cray 1 used integrated circuits (chips) and increased word size (64-bit) to obtain performance of 136 megaflops — a lot faster than the 3-megaflops CDC 6600. 1,662 printed circuit boards with up to 144 ICs on each were crammed into one of the most distinctive-looking supercomputers ever made. Again, Freon liquid cooling was used.
The shape, incidentally, wasn’t a homage to Star Trek — it actually served a purpose. Speed-dependent modules were placed on the inside edge of the computer, where wire lengths were shorter — without this arrangement, signal timing would be all wrong and 80MHz wouldn’t have been achievable. The modern-day equivalent is the laying out of motherboard traces so that everything works in perfect synchrony at billions of hertz.
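To see why wire length mattered, here is an illustrative back-of-the-envelope calculation (the propagation speed is an assumed round figure, not from the original design documents): at 80MHz, one clock cycle is 12.5 nanoseconds, and a signal travelling at roughly 0.7 times the speed of light covers only a couple of metres in that time.

```python
# Back-of-the-envelope: how far a signal can travel in one 80MHz clock cycle.
# The propagation speed (~0.7c in wire) is an assumed illustrative figure.
clock_hz = 80e6
cycle_ns = 1e9 / clock_hz                 # one clock cycle, in nanoseconds
signal_speed = 0.7 * 3e8                  # signal speed in wire, m/s (assumption)
distance_m = signal_speed * (cycle_ns * 1e-9)

print(f"One cycle: {cycle_ns:.1f} ns")            # 12.5 ns
print(f"Distance per cycle: {distance_m:.2f} m")  # ~2.62 m
```

With a chassis several metres around, a signal path on the outside edge could easily eat a meaningful fraction of the cycle, which is why the speed-critical modules hugged the inside curve.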
The Cray 1 would go on to be one of the most successful supercomputers of all time, with over 80 units sold between 1976 and 1982, at between $5 million and $8 million apiece (about $25 million in today’s money — a significant reduction from the $60-million CDC 6600).
It’s important to note that, at this stage, an entire supercomputer still meant a single CPU. The Cray X-MP, released in 1982, supported up to four CPUs, all housed inside the same Cray 1-style chassis. The X-MP CPUs were very similar to the Cray 1’s, but with a clock speed bump from 80 to 105MHz and a more-than-doubling of memory bandwidth, each of the X-MP CPUs pushed up to 200 megaflops. For $15 million ($32 million today), you could get your hands on a grand total of 800 megaflops.
By the end of the Cray X-MP’s run it could support up to 16 million 64-bit words of memory — in SRAM! — which is equivalent to around 128MB of today’s RAM. It’s also worth noting that none of the costs mentioned so far include permanent storage — just the computer itself. The Cray X-MP, for example, supported up to 32 disk storage units, each about the size of a filing cabinet (pictured above) and capable of storing 1.2 gigabytes. Each unit cost $270,000 in today’s money — about $225k per gig — but with an impressive transfer rate of around 10MB/sec, they were probably worth it.
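The figures above are easy to verify with a few lines of arithmetic (a quick sanity check on the article's numbers, nothing more):

```python
# Check the X-MP memory figure: 16 million 64-bit words of SRAM.
words = 16_000_000
bytes_total = words * 8               # 64 bits = 8 bytes per word
mb = bytes_total / (1024 * 1024)
print(f"{mb:.0f} MB")                 # ~122 MB, i.e. "around 128MB" in decimal units

# Check the per-gigabyte cost of a 1.2GB disk unit.
unit_cost = 270_000                   # per disk unit, in today's dollars
per_gb = unit_cost / 1.2
print(f"${per_gb:,.0f} per GB")       # $225,000 per GB
```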
By now you’re probably a bit bored of Cray computers — but the fact is, the company dominated supercomputing from its inception in the ’70s through until the early ’90s. In 1985, the Cray 2 was released. The technology used was fairly similar to the Cray 1 and Cray X-MP — ICs packed together on logic boards — and again it had a similar horseshoe-shaped chassis.
To boost performance, though, the logic boards were packed incredibly tightly (pictured below), meaning air cooling and Freon heat exchanging were no good — instead, the entire computer was submerged in Fluorinert. In the picture above, the device on the right is the Fluorinert “waterfall” radiator.
With increased performance (and up to 8 CPUs), Cray Research also had to overcome a memory bottleneck. Basically, the Cray 2 used “foreground” processors to load data from main memory to local memory (similar to a cache but not quite) via a very fast gigabit-per-second bus, and then pass instructions off to “background” processors which would actually perform computation. In today’s nomenclature, foreground processors would be similar to modern CPU load/store units. The peak performance of the Cray 2 was 1.9 gigaflops — about twice the Cray X-MP, and fast enough to retain the title of world’s fastest supercomputer until 1990.
The Cray 2 is notable for being the first supercomputer to run “mainstream” software, thanks to UniCOS, a Unix System V derivative with some BSD features. Until this point, Cray supercomputers had only really been used by US governmental agencies like the DoE and DoD (for nuclear modeling — what else?), but the Cray 2 found a home in many universities and corporations.
Here come the Japanese
After some 20 years of American dominance, the early ’90s would see the emergence of a new king of supercomputing: the Japanese. These computers, such as the NEC SX-3 (pictured below), Fujitsu Numerical Wind Tunnel, and Hitachi SR2201, used architectures very similar to Cray’s — i.e. highly parallel arrays of vector processors attached to fast memory — and each in turn became the fastest super in the world. The SR2201 (pictured above — check out the self-adulating “H” chassis!), released in 1996, had 2,048 processors and a peak performance of 600 gigaflops — by comparison, a modern Sandy Bridge Core i5 or i7 CPU can perform around 100-200 gigaflops.
During this period there was a shift away from a single shared bus to massive parallelism, where 2D and 3D networks (such as Cray’s Torus interconnect) connected together hundreds of CPUs. This was the origin of MIMD — multiple instruction, multiple data — which eventually led to multi-core CPUs.
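The MIMD idea can be sketched in a few lines: each worker executes its own instruction stream on its own data, unlike SIMD, where a single instruction drives every lane in lockstep. (This toy example uses Python threads purely as an illustration; the function names are hypothetical, not from any real supercomputer API.)

```python
# Toy MIMD sketch: two workers run *different* instruction streams
# on *different* data. In SIMD, by contrast, one instruction would
# be applied identically across all lanes.
from concurrent.futures import ThreadPoolExecutor

def vector_add(data):
    # instruction stream 1: element-wise self-addition
    return [a + b for a, b in zip(data, data)]

def vector_scale(data):
    # instruction stream 2: element-wise scaling
    return [2 * x for x in data]

tasks = [(vector_add, [1, 2, 3]), (vector_scale, [4, 5, 6])]
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fn, d) for fn, d in tasks]
    outputs = [f.result() for f in futures]

print(outputs)   # [[2, 4, 6], [8, 10, 12]]
```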
Meanwhile, Seymour Cray had broken away from Cray Research to form Cray Computer Corporation (CCC) to build the Cray 3, the first computer built with gallium arsenide chips. The project failed, and then CCC went bankrupt during the production of the Cray 4. As you’re probably aware, though, Cray Research most definitely lives on — but more on that later.
But what about Intel?
We’re now up to the mid-’90s, and yet Intel — the king of microprocessors since the ’70s — hasn’t been mentioned once. The main reason for this is that supercomputers and PCs are generally at odds with each other: where supers want as much processing power as possible, PCs have lots of cost and heat constraints. For the most part, it just didn’t make sense to use Intel chips in early supercomputers.
Throughout its history, Intel has occasionally tried to launch chips based on a non-x86 architecture, usually without success. In 1989 it released the i860, a 32- and 64-bit RISC chip designed for use in large computers. The i860 would become the basis for the Intel Paragon, a supercomputer that supported up to 4,000 processors in a 2D MIMD topology. Paragon was a commercial failure, but it led to the creation of ASCI Red in 1996 (pictured above), which was the first supercomputer made from off-the-shelf CPUs — Pentium Pros, and then Pentium II Xeons — and other readily available commercial components.
ASCI Red, with over 6,000 200MHz Pentium Pros, was the first supercomputer to break the 1-teraflop barrier. Later upgraded to 9,298 Pentium II Xeons, ASCI Red reached 3.1 teraflops. It was the fastest supercomputer in the world for four years, and also the first supercomputer installation to use more than 1 megawatt of power. It was only decommissioned in 2006, after 10 years of use by Sandia National Laboratories.
Once supercomputers could be built with off-the-shelf components, it was only a matter of time until everyone started building supercomputers. Beowulf clusters — networks with any number of commodity PCs, generally running Linux — quickly emerged, and Linux soon replaced Unix as the supercomputing OS of choice.
The commoditization of supercomputers (and compute clusters) almost certainly played a key role in computer animated films like Toy Story, and the increasing use of CGI in cinema and TV throughout the ’90s.
While continued improvements to CPUs obviously helped supercomputers break new records, high-performance computing (HPC) in the 2000s mostly focused on squeezing more and more CPUs into a single system. This meant developing ever-more-complex interconnects, and reducing power usage (and thus heat production).
Japan briefly retook the crown from the US ASCI Red and ASCI White in 2002 with the 35-teraflops NEC Earth Simulator, but then in 2004 IBM released Blue Gene/L, the first of a series of supercomputers that would blow the doors off the competition until 2008. The first version of Blue Gene/L, located at Lawrence Livermore National Laboratory, had 16,000 compute nodes (each with two CPUs) and was capable of 70 teraflops — but the final iteration in 2007 had more than 100,000 compute nodes and peak performance of 600 teraflops.
Blue Gene/L was exceptional for two main reasons: instead of fast, power-hungry chips, it used low-power RISC PowerPC cores — and, except for RAM, each compute node was entirely integrated into a single system-on-a-chip (SoC). The image above shows the incredible density of a 2U Blue Gene/L rack — each heatsink covers a CPU, and you’ll notice that there are no fans or water cooling blocks.
Blue Gene/L would lead the pack until it was succeeded by IBM Roadrunner, a 20,000-CPU PowerPC/AMD Opteron hybrid that was the first computer to break the 1-petaflop barrier.
Don’t forget the Chinese
It took them a while, but in 2010 China eventually topped the supercomputing charts (The Top500) with the 2.5-petaflops Tianhe-1A. Tianhe-1A is notable for being one of the few heterogeneous supercomputers in operation — it houses 14,336 Intel Xeon X5670 CPUs and 7,168 Nvidia Tesla GPUs — apparently saving lots of power in the process.
More importantly, though, China recently unveiled Sunway, a 1-petaflops supercomputer built entirely out of homegrown ShenWei CPUs. China has repeatedly stated that it wants to lessen its reliance on Western high technology, and Sunway is a very important step in that direction. Russia has also stated that it would like to build its own homegrown supercomputers, but so far it lacks China’s manufacturing prowess.
The return of Cray, and the Japanese
The current undisputed champion of the high-performance computing world is Fujitsu’s K, housed at the RIKEN institute in Japan, which clocks in at 10 petaflops — some four times faster than Tianhe-1A. K does away with the low-power approach pioneered by Blue Gene and simply throws 88,128 8-core SPARC64 processors into the mix. Each CPU has 16GB of local RAM, for a total of 1,377 terabytes of memory. K draws almost 10 megawatts of power — about the same as 10,000 suburban homes — and the whole thing (some 864 cabinets!) is, understandably, water cooled.
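K's memory total is another figure that checks out neatly (a quick sanity check on the article's numbers):

```python
# Check K's memory figure: 88,128 CPUs with 16GB of local RAM each.
cpus = 88_128
gb_each = 16
total_gb = cpus * gb_each
print(total_gb / 1024, "TB")   # 1377.0 TB, matching the figure in the text
```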
Looking forward, the next target is exaflops — 1,000 petaflops. Realistically, we should hit 100 petaflops in the next few years, and exaflops a few years after that (2018-2020). The USA’s fastest supercomputer, the 1.7-petaflops Cray Jaguar at Oak Ridge National Laboratory, is currently being upgraded to become the 20-petaflops Cray Titan. Titan will be built with Cray XK6 blades, which marry AMD Opteron CPUs and Nvidia Kepler GPUs up to a theoretical peak of 35 petaflops.
Meanwhile, DARPA, recognizing that current silicon technology might not even be capable of exaflops, has summoned researchers to reinvent computing. IBM, on the other hand, is building an exascale supercomputer to process the exabytes of astronomical data produced by the world’s largest telescope, the Square Kilometre Array. The telescope goes online in 2024, which will hopefully give IBM enough time to work out how to multiply the performance of current computers by more than 100.
So there you have it: From 3 megaflops to 10 petaflops in 48 years. The world’s fastest supercomputer is 3.3 billion times faster than the first.
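That closing ratio is simple to verify:

```python
# From the CDC 6600's 3 megaflops (1964) to K's 10 petaflops, 48 years later.
first = 3e6        # flops, CDC 6600
latest = 10e15     # flops, Fujitsu K
speedup = latest / first
print(f"{speedup:.2e}x")   # ~3.33e9, i.e. about 3.3 billion times faster
```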
[Image credit: CDC 6600, Wikipedia]