Hot Chips: IBM's Next Generation z14 CPU Mainframe Live Blog (5pm PT, 12am UTC)
by Ian Cutress on August 22, 2017 6:55 PM EST- Posted in
- CPUs
- Enterprise CPUs
- IBM
- IBM z14
- Mainframe
07:56PM EDT - Sitting down, ready to go
08:01PM EDT - This is the last set of talks at Hot Chips. Starting with IBM, then Intel Xeon, AMD EPYC and Qualcomm Centriq
08:02PM EDT - We've covered Xeon, EPYC and Centriq in recent articles, and nothing new is being announced for the show for them except some minor things that we'll summarize in a news post
08:02PM EDT - But the IBM z14 will be interesting
08:02PM EDT - To clarify, the z series is IBM's mainframe product line
08:02PM EDT - So this isn't POWER8 or POWER9
08:04PM EDT - IBM's z-series has central processors and system control chips with integrated fabric and off-compute chip caches
08:05PM EDT - This is under a 'mainframe' setup, rather than a standard CPU/co-processor setup.
08:05PM EDT - Dr Christian Jacobi to the stage, Chief Architect
08:06PM EDT - z14 was technically announced a few weeks ago
08:06PM EDT - A lot of mainframes still exist
08:06PM EDT - Still used in large corporations for transactional data, e.g. a credit card transaction has a mainframe involved. 90% of airline booking systems involve mainframes
08:07PM EDT - Run large databases and large virtualised linux
08:07PM EDT - Have to make design decisions tailored for those workloads
08:07PM EDT - z10 was high frequency, z196 had OoO, z13 had SMT and now z14
08:08PM EDT - The mainframe uses two different chips - the CP (cores and shared L3) and SCP (large L4 and interconnect logic)
08:08PM EDT - Picture is a deep drawer with DRAM, PCIe, and six CP chips under cold plates and one SC (SCP)
08:08PM EDT - Two clusters of CP chips connect to the SC. Can connect four drawers together
08:09PM EDT - CP and SC are large chips, 17 layer metal in 14nm SOI
08:09PM EDT - 10 cores, each with private 2MB L2-I and 4MB L2-D, and a 128 MB shared L3
08:09PM EDT - SC chip has 672MB of L4 and coherency logic
08:10PM EDT - Up to 24 sockets in the system, 32 TB RAIM protected memory, 40 PCIe lane fanouts, 320 IO cards
08:10PM EDT - New translation and TLB design over z13, and general pipeline optimizations. Changes in instruction set too
08:10PM EDT - Pauseless garbage collection for Java, single and quad vector precision for crypto
08:11PM EDT - Register to register arithmetic
08:11PM EDT - Optimizing for COBOL performance (........)
08:11PM EDT - E.g. gazillions of lines of COBOL in online booking systems
08:11PM EDT - Compression acceleration
08:11PM EDT - This is the pipeline diagram
08:12PM EDT - 5.2 GHz, super long pipeline
08:12PM EDT - 6 instruction parse and decode, CISC instruction cracking
08:12PM EDT - 4-cycle load/use
08:12PM EDT - Directory and TLB pipeline changes
08:13PM EDT - Most designs use logical indexed, absolute tagged directory
08:13PM EDT - Use of partial compare set-predict array reduces latency of data return from L1 cache - TLB and L1 directory access happen in parallel with L1 cache read
08:13PM EDT - (doesn't that sound like way-prediction?)
08:14PM EDT - A highly associative TLB is area and power inefficient, which limits the L1 TLB size
08:14PM EDT - Sorry, I misread the slide. This is how the L1 cache looks today
08:14PM EDT - This new slide shows how IBM is using it in z14
08:15PM EDT - I-cache and D-cache are now logically tagged, combining TLB1 and cache directory into a single structure
08:15PM EDT - Significant area and power reduction for L1 hit
08:15PM EDT - Now a super large L2 TLB
08:16PM EDT - L2 and TLB2 can be large - 2MB L2I and 4MB L2D, 6k entries TLB2 for 4KB pages
08:16PM EDT - 8 cycle L2 hit latency (that's only 1.5 ns) ...
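As a quick sanity check on that latency figure: cycles convert to time as cycles divided by clock frequency. A minimal sketch, using the 5.2 GHz clock, the 8-cycle L2 hit and the 4-cycle load/use numbers quoted in the talk (the function and variable names are just for illustration):

```python
def cycles_to_ns(cycles: int, freq_ghz: float) -> float:
    """Convert a cycle count to nanoseconds; 1 GHz is one cycle per nanosecond."""
    return cycles / freq_ghz

CLOCK_GHZ = 5.2  # clock speed quoted in the talk

l2_hit_ns = cycles_to_ns(8, CLOCK_GHZ)    # 8-cycle L2 hit
load_use_ns = cycles_to_ns(4, CLOCK_GHZ)  # 4-cycle load/use

print(f"L2 hit: {l2_hit_ns:.2f} ns, load/use: {load_use_ns:.2f} ns")
# prints "L2 hit: 1.54 ns, load/use: 0.77 ns"
```

So the "only 1.5 ns" above is 8 cycles at 5.2 GHz, rounded down slightly.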
08:17PM EDT - Now crypto
08:17PM EDT - Now redesigned for 4-7x bandwidth
08:17PM EDT - make it simple and fast enough to be able to encrypt all data
08:17PM EDT - combination of OS, firmware and hardware implementation
08:18PM EDT - Execute 2 AES in 3 cycles
08:18PM EDT - Copy up to 256B per instruction from D-cache to coprocessor
08:18PM EDT - can execute multiple AES at once, multiple engines on die
08:19PM EDT - 13.2GB/sec per core (so 132GB/s per CP, and about 1TB/s per 6-socket server)
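The scaling behind those throughput numbers is simple multiplication. A rough sketch, assuming the per-core figure scales linearly across the 10 cores per CP and the six CP chips per drawer mentioned earlier (that linear scaling is an assumption here, not something measured in the talk):

```python
PER_CORE_GB_S = 13.2   # per-core crypto bandwidth quoted in the talk
CORES_PER_CP = 10      # cores per CP chip
CPS_PER_DRAWER = 6     # CP chips per drawer

# Assumes perfect linear scaling with core and chip count
per_cp_gb_s = PER_CORE_GB_S * CORES_PER_CP
per_drawer_gb_s = per_cp_gb_s * CPS_PER_DRAWER

print(f"per CP: {per_cp_gb_s:.0f} GB/s, per drawer: {per_drawer_gb_s:.0f} GB/s")
# prints "per CP: 132 GB/s, per drawer: 792 GB/s"
```

The six-socket figure works out to 792 GB/s, so "about 1TB/s" is a generous rounding.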
08:19PM EDT - Use new instructions to feed crypto engine to avoid branches
08:19PM EDT - Avoid pipeline bubbles using new instructions
08:19PM EDT - Significant effort in prefetching as well
08:20PM EDT - New GCM instruction
08:20PM EDT - Algorithm that does encryption and signature authentication
08:20PM EDT - Implemented using the AES and GHASH engines
08:20PM EDT - the 2 engines used in concert rather than independently
08:21PM EDT - Now key protection - most CPUs work with keys in memory. CryptoExpress6S is a tamper responding PCIe crypto accelerator. Master key is in physically protected memory on card
08:21PM EDT - 'Clear Key Cryptography'
08:22PM EDT - Root key access usually means can steal key through mem access or core dump. This method means that the key is protected by tamper protection
08:23PM EDT - Secure Key is another mode, which diverts all crypto off the CPU onto the card instead
08:23PM EDT - This way the application never sees the key, just sees the encrypted data
08:24PM EDT - Creates a key token from the data, which remains in tamper resistant memory, and when data is decrypted, key is thrown away and new key generated
08:24PM EDT - Data Compression Accelerator
08:24PM EDT - Dictionary based data compression
08:25PM EDT - Reduces bandwidth need between memory and disks, increases efficiency, implemented as firmware and co-processor specialized hardware
08:25PM EDT - z14 performance at peak throughput and start up latency. Optimized compression status return to firmware
08:26PM EDT - Order-preserving compression: Allows data to still be compared when compressed
08:26PM EDT - Allows compressed directory/tree structures to do comparisons between elements without decompression
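To illustrate the general idea of comparing data without decompressing it, here is a toy sketch of order-preserving dictionary encoding. This is not IBM's algorithm, which was not detailed in the talk; it is a simpler, well-known technique where codes are assigned in sorted order so that comparing codes gives the same answer as comparing the original values. All names are hypothetical.

```python
def build_order_preserving_dict(values):
    """Map each distinct value to an integer code assigned in sorted order."""
    return {v: code for code, v in enumerate(sorted(set(values)))}

names = ["delta", "alpha", "charlie", "bravo", "alpha"]
codes = build_order_preserving_dict(names)
encoded = [codes[n] for n in names]

print(encoded)  # prints [3, 0, 2, 1, 0]

# Comparisons on the small integer codes agree with comparisons on the strings,
# so a sorted index or tree can be searched without decoding its keys.
assert all((codes[a] < codes[b]) == (a < b) for a in names for b in names)
```

The catch with this toy version is that the dictionary has to be built up front from all values. A real implementation would need to handle new values arriving later.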
08:27PM EDT - CP has 7b transistors, SC has 10b transistors
08:27PM EDT - water cooled
08:28PM EDT - of 240 CPUs in a full system, 170 can be customer configured
08:28PM EDT - +35% capacity, +10% single thread, +25% SMT2 perf over z13
08:29PM EDT - Now for Q&A
08:29PM EDT - Q: Please generate workstations. I want to swap out x86 with z14
08:29PM EDT - (at same price, insert laughs)
08:29PM EDT - Not a serious question
08:30PM EDT - Q: What power for the chips?
08:31PM EDT - A: You can get the chips to run at any power you need. Could go 400-500W on high workload. We aim around 300-350W. We don't bin - there's only one product and we stay within the drawer power
08:31PM EDT - The chips themselves are water cooled, but customers can run an aircooled system, or you can hook up datacenter water
08:32PM EDT - Q: Doesn't going over the PCIe card cause extra latency?
08:32PM EDT - A: Card only has the master key - the data has a key token, which doesn't need to keep going back and forth
08:32PM EDT - Q: Have you considered something like SGX?
08:33PM EDT - A: That's not an apples to apples comparison. We consider the tamper resistant element a key feature of our products.
08:34PM EDT - Q: But SGX prevents someone with a logic analyzer going in
08:34PM EDT - A: Our solution does not need recoding - our customers use older software and it is transparent
08:34PM EDT - Q: What would you do to make COBOL run faster?
08:35PM EDT - A: COBOL spends a lot of time doing BCD arithmetic, but there are traditional issue queue limitations, so we use packed BCD compute to reduce that bottleneck
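For readers unfamiliar with packed BCD: it stores two decimal digits per byte, with the final nibble holding the sign (0xC for positive, 0xD for negative in the usual mainframe packed-decimal convention). A minimal software sketch of the storage format follows. It says nothing about how the z14 hardware executes these operations, and the helper names are hypothetical.

```python
def pack_bcd(value: int) -> bytes:
    """Encode an integer as packed decimal: one digit per nibble, sign nibble last."""
    sign = 0xD if value < 0 else 0xC
    nibbles = [int(d) for d in str(abs(value))] + [sign]
    if len(nibbles) % 2:          # pad with a leading zero digit to fill whole bytes
        nibbles.insert(0, 0)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

def unpack_bcd(data: bytes) -> int:
    """Decode packed decimal back to an integer."""
    nibbles = []
    for b in data:
        nibbles += [b >> 4, b & 0xF]
    sign = nibbles.pop()          # last nibble is the sign
    value = int("".join(str(n) for n in nibbles))
    return -value if sign == 0xD else value

print(pack_bcd(123).hex())   # prints "123c"
print(pack_bcd(-45).hex())   # prints "045d"
print(unpack_bcd(pack_bcd(-45)))  # prints -45
```

Because the decimal digits are stored directly, values like currency amounts keep their exact decimal representation, which is why COBOL code leans on this format so heavily.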
08:36PM EDT - Q: What did +35% capacity and +25% SMT2 mean?
08:37PM EDT - A: +35% is instructions for a whole system. The +10% single thread is a large scale number for benchmarks on capacity planning. +25% SMT2 from tuning and tweaking in our implementation due to maturity
08:37PM EDT - That seems to be a wrap. This is our last live blog on Hot Chips - I'll be writing up some of these talks on my flight home tomorrow. Hope you enjoyed them :)
67 Comments
takeshi7 - Tuesday, August 22, 2017 - link
I'm keeping my fingers crossed for a Power Mac G6 or a POWER based Windows PC. I've been waiting since 2005. Don't let me down IBM.
ddriver - Tuesday, August 22, 2017 - link
Why, despite IBM claims POWER performance was never really exceptional or even offering a tangible lead over x86, and at prices that make it very poor value product.
takeshi7 - Tuesday, August 22, 2017 - link
performance just has to be competitive, not exceptional. And prices would be lower if they produced and sold more for PCs. Competition is good for everyone.
ddriver - Tuesday, August 22, 2017 - link
It won't do much good without software. It would take a considerable software base to make a useful platform. Just look at linux, it rules servers, supercomputers, mobiles, but it sucks for things like content creation or engineering, because there is just no prosumer software for it. And it's still x86, just different OS.
Now imagine throwing in a different platform in that equation. The effort that would be needed to catch up on the software size will dwarf the hardware development of the platform in the last 10 years.
tuxRoller - Wednesday, August 23, 2017 - link
Linux is heavily used in workstations in the animation industry.
Santoval - Saturday, August 26, 2017 - link
Do you really think that Maya, Arnold, Houdini, Modo, Mudbox, Softimage(RIP), Indigo, Nuke, Katana etc (all professional proprietary software, used extensively in Hollywood films and many computer games, among others, without even mentioning the open source usual suspects - Blender, POV-Ray etc) support Linux just for the heck of it? Or that it might be possible that CGI studios often prefer to model, render, and animate on Linux machines?
michael2k - Wednesday, August 23, 2017 - link
While what you say is technically true, ARM is more likely than POWER to be in a PC or Mac
SarahKerrigan - Tuesday, August 22, 2017 - link
Power wins by quite a bit for some loads, especially per-core. 12c Power8 vs 15c Ivy-EX when P8 launched tended to show big wins for Power8.
ats - Wednesday, August 23, 2017 - link
But that isn't exactly apples to apples. Power8 tended to have a much higher power budget than Ivy-EX. For instance, Skylake-X has much higher performance when you are willing to give it 300W per chip than 150W per chip.