A year ago, I wrote a story hypothesizing that AMD’s long-term success might hinge on future Bobcat-based processors rather than on Bulldozer. I wrote it before we learned that Krishna and Wichita, the two 28nm follow-ups to Brazos built at GlobalFoundries, had been canceled. AMD eventually admitted this and put two new Brazos-based designs on the roadmap: Kabini and Temash. Kabini targets netbook/notebook form factors, with Temash planned as the follow-up to AMD’s first tablet SoC, codenamed Hondo.
To date, AMD’s tablet APUs have found very little market, though the company claims it’ll show off multiple design wins at CES. After spending some time with both Surface and Samsung’s Ativ Smart PC, I think AMD has a real opportunity to win back market share in 2013 — provided that Kabini can ship on time. The 28nm laptop chip is expected by Q2 of next year; Temash, the tablet part, will probably launch in the back half of 2013.
A bit of history
AMD’s Bobcat CPU was designed to compete with Intel’s Atom at the upper end of that CPU’s performance and power curve. Every microprocessor design can be thought of as a balance between power consumption, performance, and manufacturing difficulty. Of the three new chips AMD delivered in 2011, Bobcat is the only one that hit all three. Bulldozer missed its power consumption and performance targets; Llano hit both of these, but was difficult to manufacture.
Brazos (that’s the APU) showed up with its game face on, right as netbook sales began to slump. It’s still an important part of AMD’s sales, but the spotlight has mostly been on AMD’s big-core x86 hardware. With Atom, Intel has focused on improving power consumption and moving to SoCs rather than driving raw performance (the first out-of-order Atom, Valleyview, arrives in 2014). That means 28nm Kabini/Temash has a chance to reignite a performance battle in this segment of the market.
AMD disclosed a significant amount of information on Jaguar at Hot Chips last August. The new core refines and polishes much of what made Brazos successful without making major changes to the underlying hardware. From a high-level perspective, the two are nearly identical.
Bobcat’s block diagram
Bobcat is above, Jaguar below.
Jaguar’s block layout
Almost — but not quite. And that’s actually encouraging. CPU architecture analyst Agner Fog describes Bobcat [PDF, page 168] as having “a well balanced pipeline design with no obvious bottlenecks.” When it comes to CPU design, most changes are evolutionary and iterative.
One front-end improvement to Jaguar is the addition of four 32-byte loop buffers. Loop buffers are used to hold a small number of already-decoded instructions. This is useful when the CPU is executing tight loops; it ensures that the decoders aren’t tasked with decoding the same instructions repeatedly. This saves power and speeds overall execution.
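To make the decode savings concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not a description of Jaguar's actual hardware: the function name `decode_count` and the `buffer_capacity` parameter (counted in instructions rather than bytes) are assumptions made for illustration.

```python
def decode_count(body_len, iterations, buffer_capacity=0):
    """Return how many instruction decodes a loop costs in this toy model.

    body_len        -- instructions in the loop body
    iterations      -- how many times the loop runs
    buffer_capacity -- hypothetical loop-buffer size, in instructions
                       (0 means no loop buffer at all)
    """
    if body_len <= 0 or iterations <= 0:
        return 0
    if body_len <= buffer_capacity:
        # The body is decoded once and captured; every later
        # iteration is replayed from the buffer without decoding.
        return body_len
    # The body does not fit, so every iteration is decoded again.
    return body_len * iterations


without_buffer = decode_count(body_len=6, iterations=1000)
with_buffer = decode_count(body_len=6, iterations=1000, buffer_capacity=8)
print(without_buffer, with_buffer)
```

In this model a six-instruction loop run 1,000 times costs 6,000 decodes without a buffer and six with one. Real hardware is more complicated, but the shape of the saving is the same: the tighter and longer-running the loop, the more decode work is avoided.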
Jaguar adds a pipeline stage to increase frequency but keeps the two-issue decoder from Bobcat. On the integer side, the core picks up Llano’s hardware divider unit. Previously, integer division was handled via the floating point unit, which caused a significant delay. Jaguar also adds support for SSE4.2 and AVX, and features a larger reorder buffer (ROB).
The biggest changes between Jaguar and Bobcat are on the FPU side of the equation. The FPU units are now 128 bits wide, compared to 64 bits on Bobcat. The chip supports 256-bit AVX by breaking the operations into a pair of 128-bit uops, just like Bulldozer and Piledriver do. FPU performance won’t match Trinity — Jaguar can only decode two instructions per clock cycle, compared to four for the larger core — but it should substantially improve over Bobcat.
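The 256-bit splitting can be pictured with a short Python sketch. This is purely a software analogy, assuming eight 32-bit float lanes per 256-bit vector; the names `add_128` and `add_256` are hypothetical and stand in for the two 128-bit uops, not for any real instruction or API.

```python
LANES_128 = 4  # four 32-bit floats fit in one 128-bit operation
LANES_256 = 8  # eight 32-bit floats fit in one 256-bit AVX register


def add_128(a, b):
    """Model a single 128-bit uop: add four float lanes at once."""
    if len(a) != LANES_128 or len(b) != LANES_128:
        raise ValueError("add_128 expects exactly four lanes per operand")
    return [x + y for x, y in zip(a, b)]


def add_256(a, b):
    """Model a 256-bit AVX add issued as two 128-bit halves."""
    if len(a) != LANES_256 or len(b) != LANES_256:
        raise ValueError("add_256 expects exactly eight lanes per operand")
    low = add_128(a[:LANES_128], b[:LANES_128])    # first uop: lanes 0-3
    high = add_128(a[LANES_128:], b[LANES_128:])   # second uop: lanes 4-7
    return low + high


result = add_256([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [10.0] * 8)
print(result)
```

The result is identical to a native 256-bit add; the cost is that each AVX instruction occupies the 128-bit hardware twice, so the approach buys compatibility rather than doubled throughput.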
Next up are the L1/L2 caches. The L1 improvements listed here are all designed to reduce latency penalties and improve FPU bandwidth. Like Bobcat, Jaguar’s L1 is split into a 32K instruction cache and 32K data cache and is two-way set associative.
The L2 cache is a bit different.
Each Bobcat core had a 512K L2 directly attached, clocked at half CPU speed. With Jaguar, AMD has opted to attach a single shared cache to the CPUs. This cache pool is connected via an L2 interface unit, running at full processor speed. The L2 cache itself still runs at 50% core clock.
Going this route has several advantages for AMD. First, it makes more total L2 available to any single core in a single-threaded program. Second, it raises the maximum number of supported cores to four (Bobcat was strictly a dual-core design). Third, it simplifies the chip’s layout. Data lookups should be faster and L2 cache misses less frequent with the new design.
AMD is projecting a 15% IPC gain as well as a 10% frequency gain for the new part. That puts the core in a very interesting position.
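As a rough back-of-the-envelope check, the two projections can be combined. The sketch below assumes that per-core performance scales as IPC times clock frequency and that the workload is limited by the core rather than by memory or the GPU; the helper name `combined_speedup` is made up for this example.

```python
def combined_speedup(ipc_gain, freq_gain):
    """Combine fractional IPC and frequency gains multiplicatively.

    Assumes performance = IPC x frequency, so a 15% IPC gain is
    passed as 0.15 and a 10% clock gain as 0.10.
    """
    return (1 + ipc_gain) * (1 + freq_gain)


speedup = combined_speedup(0.15, 0.10)
print(f"{(speedup - 1) * 100:.1f}% faster than Bobcat in this model")
```

Under those assumptions the gains compound to roughly 26.5% rather than simply adding up to 25%. Actual results will depend on the workload and on the clocks AMD ships.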
AMD is talking about Jaguar/Kabini solely as a quad-core part, but we expect the company will release a dual-core SKU. It’s a sensible way to boost yield and improve availability.