


[Image: Bond vs Jaws]

A year ago, I wrote a story hypothesizing that AMD’s long-term success might hinge on future Bobcat-based processors rather than on Bulldozer. I wrote it before we learned that Krishna and Wichita, the two 28nm follow-ups to Brazos built at GlobalFoundries, had been canceled. AMD eventually admitted this and put two new Brazos-based designs on the roadmap: Kabini and Temash. Kabini targets netbook/notebook form factors, with Temash planned as the follow-up to AMD’s first tablet SoC, codenamed Hondo.

To date, AMD’s tablet APUs have found very little market, though the company claims it’ll show off multiple design wins at CES. After spending some time with both Surface and Samsung’s Ativ Smart PC, I think AMD has a real opportunity to win back market share in 2013 — provided that Kabini can ship on time. The 28nm laptop chip is expected by Q2 of next year; Temash, the tablet part, will probably launch in the back half of 2013.

A bit of history

AMD’s Bobcat CPU was designed to compete with Intel’s Atom at the upper end of that CPU’s performance and power curve. Every microprocessor design can be thought of as a balance between power consumption, performance, and manufacturing difficulty. Of the three new chips AMD delivered in 2011, Bobcat is the only one that hit all three targets. Bulldozer missed its power consumption and performance targets; Llano hit both of these, but was difficult to manufacture.

Brazos (that’s the APU) showed up with its game face on, right as netbook sales began to slump. It’s still an important part of AMD’s sales, but the spotlight has mostly been on AMD’s big-core x86 hardware. With Atom, Intel has focused on improving power consumption and moving to SoCs rather than driving raw performance (the first out-of-order Atom, Valleyview, arrives in 2014). That means 28nm Kabini/Temash has a chance to reignite a performance battle in this segment of the market.

AMD’s Jaguar

AMD disclosed a significant amount of information on Jaguar at Hot Chips last August. The new core refines and polishes much of what made Brazos successful without significantly changing the underlying hardware. From a high-level perspective, the two are nearly identical.

Bobcat’s block diagram

Bobcat is above, Jaguar below.

Jaguar’s block layout

Almost — but not quite. And that’s actually encouraging. CPU architecture analyst Agner Fog describes Bobcat [PDF, page 168] as having “a well balanced pipeline design with no obvious bottlenecks.” When it comes to CPU design, most changes are evolutionary and iterative.

One front-end improvement to Jaguar is the addition of four 32-byte loop buffers. Loop buffers are used to hold a small number of already-decoded instructions. This is useful when the CPU is executing tight loops; it ensures that the decoders aren’t tasked with decoding the same instructions repeatedly. This saves power and speeds overall execution.

[Image: Jaguar integer design]

Jaguar adds a pipeline stage to increase frequency but keeps the two-issue decoder from Bobcat. On the integer side, the core picks up Llano’s hardware divider unit. Previously, integer division was handled via the floating point unit, which caused a significant delay. Jaguar also supports SSE4.2 and AVX, and features a larger reorder buffer (ROB).

[Image: Jaguar's FPU]

The biggest changes between Jaguar and Bobcat are on the FPU side of the equation. The FPU units are now 128 bits wide, compared to 64 bits on Bobcat. The chip supports 256-bit AVX by breaking the operations into a pair of 128-bit uops, just like Bulldozer and Piledriver do. FPU performance won’t match Trinity — Jaguar can only decode two instructions per clock cycle, compared to four for the larger core — but it should substantially improve over Bobcat.
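As a rough illustration of the split described above, the following Python sketch models a 256-bit add over eight single-precision lanes as two independent 128-bit operations. The function names are hypothetical, and the model ignores scheduling, latency, and every other detail of the real hardware; it only shows the decomposition.

```python
from typing import List, Tuple

LANES_256 = 8  # eight 32-bit floats fill a 256-bit AVX register
LANES_128 = 4  # four 32-bit floats fill a 128-bit datapath


def split_256(vec: List[float]) -> Tuple[List[float], List[float]]:
    """Split a 256-bit vector into its low and high 128-bit halves."""
    if len(vec) != LANES_256:
        raise ValueError("expected eight 32-bit lanes")
    return vec[:LANES_128], vec[LANES_128:]


def add_128(a: List[float], b: List[float]) -> List[float]:
    """Model one 128-bit uop: a lane-wise add over four floats."""
    return [x + y for x, y in zip(a, b)]


def avx_add_256(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """Model a 256-bit add as two 128-bit uops; return result and uop count."""
    a_lo, a_hi = split_256(a)
    b_lo, b_hi = split_256(b)
    lo = add_128(a_lo, b_lo)  # first 128-bit uop
    hi = add_128(a_hi, b_hi)  # second 128-bit uop
    return lo + hi, 2


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * LANES_256
result, uops = avx_add_256(a, b)
print(result, uops)
```

The point of the sketch is that software sees one 256-bit instruction, while a 128-bit datapath does the work in two passes.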

[Image: Jaguar L1 cache]

Next up are the L1/L2 caches. The L1 improvements listed here are all designed to reduce latency penalties and improve FPU bandwidth. Like Bobcat, Jaguar’s L1 is split into a 32K instruction cache and 32K data cache and is two-way set associative.

The L2 cache is a bit different.

[Image: Jaguar's L2 cache]

Each Bobcat core had a 512K L2 directly attached, clocked at half CPU speed. With Jaguar, AMD has opted to attach a single shared cache to the CPUs. This cache pool is connected via an L2 interface unit, running at full processor speed. The L2 cache itself still runs at 50% core clock.

Going this route has several advantages for AMD. First, it makes more total L2 available to any single core in a single-threaded program. Second, it raises the total number of supported cores to four (Bobcat was strictly a dual-core design). Third, it simplifies the chip’s layout. Data lookups and the handling of L2 cache misses should both improve with the new design.
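To make the single-threaded benefit concrete, here is a small Python sketch comparing how much L2 one thread can reach under each arrangement. The 512K private figure comes from the Bobcat description above; the shared pool size is an assumption made purely for illustration (four cores' worth of Bobcat-style cache pooled together), since the text here does not state Jaguar's L2 capacity. The function names are hypothetical.

```python
BOBCAT_L2_PER_CORE_KB = 512  # from the text: each Bobcat core has its own 512K L2


def l2_reachable_private(per_core_kb: int) -> int:
    """With private caches, a thread can only use its own core's L2."""
    return per_core_kb


def l2_reachable_shared(per_core_kb: int, cores: int) -> int:
    """With a shared pool, a single thread can in principle use all of it.

    ASSUMPTION: pool size = cores * per_core_kb. This is a hypothetical
    figure chosen for illustration, not a number taken from the article.
    """
    return per_core_kb * cores


private_kb = l2_reachable_private(BOBCAT_L2_PER_CORE_KB)
shared_kb = l2_reachable_shared(BOBCAT_L2_PER_CORE_KB, cores=4)
print(f"private: {private_kb}K per thread, shared: up to {shared_kb}K per thread")
```

Under that assumption, a lone thread on a shared design could use up to four times the L2 it would get from a private cache, at the cost of contending with other cores when they are busy.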

[Image: Jaguar IPC]

AMD is projecting a 15% IPC gain as well as a 10% frequency gain for the new part. That puts the core in a very interesting position.
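The two projected gains compound rather than add. A minimal Python sketch of the arithmetic follows, assuming performance scales with IPC multiplied by clock frequency, which ignores memory and I/O bottlenecks. The helper name is hypothetical.

```python
def combined_speedup(ipc_gain: float, freq_gain: float) -> float:
    """Combine fractional IPC and frequency gains multiplicatively.

    ASSUMPTION: performance is proportional to IPC * frequency.
    """
    return (1.0 + ipc_gain) * (1.0 + freq_gain)


# AMD's projected figures from the text: 15% IPC, 10% frequency.
speedup = combined_speedup(0.15, 0.10)
print(f"Projected speedup over Bobcat: {speedup:.3f}x")
```

On those projections the idealized uplift works out to roughly 26.5%, slightly more than the 25% a simple sum would suggest.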

[Image: Jaguar's overall positioning]

AMD is talking about Jaguar/Kabini solely as a quad-core part, but we expect the company will release a dual-core SKU. It’s a sensible way to boost yield and improve availability.

Source


AMD’s Analyst Day kicks off today and the news is starting to flow. Additional details will be disclosed throughout the various presentations, so we’ll update this story or publish follow-ups as appropriate.

So what are the headlines so far? As we expected, AMD has canceled its Krishna and Wichita APUs that were to follow Brazos, in favor of what it calls Brazos 2.0. Brazos 2.0, as it turns out, looks just like Brazos 1.0, but with minimally faster clock speeds and USB 3.0 tossed in. We spoke with the company yesterday in a pre-briefing.

This could be problematic for the company’s lower-end products. Qualcomm has given notice that it intends to push into the netbook market late this year or early next, while Brazos’ 40nm technology will face competition from Intel’s 32nm Atom, as well as 28nm Qualcomm and Cortex-A15 chips.

[Image: AMD 2013 roadmap]

Hondo — a chip we discussed last August, and rumored to be cancelled — is still on target. It’s a respun version of Brazos that’s been rearchitected for low-power operation. AMD has had several wins with Desna, its 5.9W TDP tablet option; Hondo brings this down to 4.9W. With Microsoft’s Windows 8 not expected until the end of the year, AMD has time to ready something more competitive before the x86 tablet market really takes off.

Come 2013, we’ve got debuts from Temash, Kabini, Kaveri, and Sea Islands, AMD’s next-generation graphics core. Temash will use the next-generation Jaguar CPU core and will be AMD’s first SoC, building on the expertise that AMD gains from Hondo. Kabini, meanwhile, uses the same core but fits into a slightly higher power envelope. It’s not clear if Kabini is also an SoC or not — keeping a separate APU part would give AMD more die space to devote to CPU/GPU processing cores.

Finally, there’s Steamroller, a third-generation Bulldozer core, along with what AMD calls “HSA” (Heterogeneous System Architecture) features. Based on the current rate of progression, the GPU at the heart of Temash, Kabini, and Kaveri will be based on AMD’s Tahiti (aka 7900). The Trinity GPU is based on Cayman.

[Image: AMD Financial Analyst Day slide]

This slide breaks down the differences between mobile and desktop. One surprising factor in AMD’s pre-briefing is that the third-generation CPU at the heart of Kabini and Kaveri doesn’t appear to have a high-end variant — at least not in 2013. AMD also intends to move to 28nm production in 2013. GlobalFoundries has a 28nm-SHP process that uses SOI, but everything we’ve heard from the foundry suggests that 28nm is a very modest improvement over 32nm as far as power consumption is concerned. As we’ve explored recently, however, modest improvements are the best the semiconductor industry can deliver these days.

[Image: AMD Analyst Day server roadmap slide]

The left-hand column shows server plans for 2012; the right-hand column shows 2013. This new roadmap is significantly different from slides that leaked back in August. At that point, AMD’s plan was to release new platforms, with 10- and 20-core Bulldozer chips launching in 2012 on 32nm, followed by 28nm die shrinks in 2013. As the new slide shows, AMD’s G34 and C32 platforms will survive through 2012. According to company executives, the performance improvements from Piledriver are significant enough to make the switch to deca-core and icosa-core processors unnecessary. Instead, AMD will hold upper core counts steady at octal and hexadeca levels. (This crash course in Greek nomenclature brought to you by the letter Qoppa.)

This is good news. AMD’s previous guidance implied Piledriver would deliver a 10-15% improvement in performance-per-watt. Hopefully the company managed to exceed that target — but even if it didn’t, what BD needs is a combination of improved architectural efficiency, faster caches, and higher clock speeds. AMD’s roadmap doesn’t show anything beyond 32nm — a discrepancy that may be explained by the following older slide.

[Image: Old Fusion slide]

“Bulldozer NG,” in this case, is Piledriver. Given that the company has canceled its original plan to move to a new platform and 10/20-core architecture in 2013, it’s possible that AMD’s server platforms will move directly from the configuration on the far left to the SoC-style implementation on the far right. Historically, AMD’s desktop and server CPUs have been tightly linked as far as their CPU architectures are concerned. The fact that we don’t see a third-generation CPU core anywhere in 2013 could mean that the company will move to a unified SoC for servers and high-end desktop in 2014.

There’s still considerable question as to Trinity’s CPU performance and whether it’ll be strong enough to keep AMD competitive with Intel through 2012. The good news is that things should improve in 2013 with the launch of new 28nm hardware across the company’s entire product line.

Source