Copyright (C) 2000 by Darek Mihocka
President and Founder, Emulators Inc.
Updated December 29, 2000
How Intel Blew It
Limitations of the Pentium III
Pentium 4 - Generation 7 or complete stupidity?
Analyzing the results
According to Gateway's web site, the Pentium 4 is "the most powerful processor available for your PC". Unfortunately for most computer users, it's simply not true.
Merry Christmas and brace for impact! The PC industry is taking a huge leap backwards as Intel's new flagship Pentium 4 processor turns out to be an engineering disaster that will hurt both consumers and computer manufacturers for some time to come. The effects of Intel's heavily delayed Pentium 4 release and this summer's aborted high-end Pentium III release are already being felt, with sharp drops in PC sales this season and a migration to competing AMD Athlon based systems. Intel, and Intel-exclusive vendors such as DELL, have already suffered crippling drops in stock price due to Intel's mis-steps, with each company's stock falling well over 50% in the past few months and hitting yearly lows this month. Don't say I didn't warn you folks about this back in September. I did.
As has been confirmed by other independent sources such as Tom's Hardware and most recently http://www.eet.com/story/OEG20001213S0045, and as a month of close study of the chip reveals, the Pentium 4 does not live up to its speed claims, in fact loses terribly to months-old AMD Athlon chips, and lacks some of the crucial features originally designed into the Pentium 4 spec. The only thing the chip does live up to is the claim that it is based on a new redesigned architecture - something that Intel does every few years, but usually to increase speed. The new architecture has serious fatal flaws that in some common cases can throttle the speed of a 1.5 GHz Pentium 4 chip down to the equivalent speed of a mere 200 MHz Pentium chip of 5 years ago. In fact this terrible performance is below the level of any Celeron, Pentium II, or Pentium III chip ever released! It's a huge setback for Intel.
As Tom's Hardware points out, in some cases a massive rewrite of the code does put the Pentium 4 on top, barely, over the Athlon, but for most "bread and butter" code it loses horribly. As I've found out this past month in rewriting both FUSION PC and SoftMac 2000 code, no amount of code rewriting can make up for the simple fact that the Pentium 4 contains serious defects in design and implementation. Other developers who have followed Intel's architectures and optimization guidelines and optimized their code for the Pentium II and III will also find that no amount of rewriting will make their code run faster on the Pentium 4 than it currently runs on the Pentium III. And this is not the fault of the developer. In cutting corners to rush to release the Pentium 4, Intel made numerous cuts to the design to reduce transistor counts and reduce manufacturing costs. And in the process crippled the chip.
What's worse, popular compiler tools, such as Microsoft's Visual C++ 6.0 are still producing code optimized for obsolete 486 and Pentium classic processors - they haven't even caught up to Pentium III levels yet. Since most developers do not write in low-level machine language as I do, most Windows software released for the next year or two will not be Pentium 4 (or even Pentium III) optimized. Far from it. And since Microsoft traditionally takes about 3 to 5 years from the release of a processor to the time when their compiler tools are optimized for that processor, well, it'll be a long wait for all of us waiting for Intel and Microsoft to get things right.
Good news is, the AMD Athlon processor is still the fastest x86 processor on the planet and works around many of the problems in Microsoft's compilers and Intel's flawed Pentium III and Pentium 4 designs. If you weren't an AMD fan before, you will be.
What happened? In an attempt to regain the coveted PC processor speed crown which the Intel Pentium III lost to the AMD Athlon in late 1999, Intel seems to have lost all sense of reason and no longer allows engineering to dictate product design. Under pressure from stockholders to prop up its sagging stock price due to product delays, and under pressure from PC users to deliver a chip faster than the AMD Athlon, Intel made two horrible mistakes in 2000 trying to push new chips out the door that were not technically sound:
#1 - in the summer of 2000 it tried to push the aging "P6" architecture too far. The P6 design, or 6th generation of x86 processor which since 1996 has been the heart of all Pentium Pro, Pentium II, Celeron, and Pentium III processors, simply does not scale well above 1 GHz. As the aborted 1.13 GHz Pentium III launched stillborn this summer showed, Intel tried to overclock an aging architecture without doing thorough enough testing to make sure it would work. The chip was recalled on the day of the launch, costing computer manufacturers such as DELL millions of dollars in lost sales and causing more users to migrate to the more stable, and faster, AMD Athlon.
#2 - after numerous postponements, Intel finally launched the Pentium 4 chip on November 20, 2000, under pressure to ship by the end of 2000, but only after engineering cut so many features from the chip as to effectively render it useless.
What it boils down to is this - just like at Microsoft and just like at Apple, the marketing scumbags at Intel have prevailed and pushed sound engineering aside. With the 1.13 GHz Pentium III chip dead on arrival, and the Pentium 4 crippled beyond repair, Intel may have just set itself back a good 3 to 5 years. Don't get me wrong, I've liked Intel's processors for years. I rode their stock up when their engineers were allowed to innovate. After all, they invented the processor that powers the PC. For almost a decade the 486 and Pentium architectures have been superior to any competitors' efforts - better and faster than the AMD K5 and K6 chips, far more backward compatible than Motorola 68K and PowerPC chips, and almost as fast or faster than the previous generation of chips they replaced. But as past history shows, it takes an Intel or an AMD or a Motorola a good 3 to 5 years to design a new processor. And when you blow it, you sit in second place for those next 3 to 5 years.
What users get today, buying either the 1.4 or 1.5 GHz systems from DELL or Gateway or whoever, is an over-priced, under-engineered, and very costly computer. A basic 1.5 GHz Pentium 4 computer runs for well over $3000, while comparable Athlon and Pentium III based systems literally cost 1/3 as much and run faster. Given the price the PC manufacturers pay Intel for the Pentium 4 chip (a few hundred dollars more than the Pentium III), and given the $1000 to $2000 premium consumers pay for Pentium 4 systems, the only ones who benefit from the Pentium 4 are the PC manufacturers themselves! That is, if people will be stupid enough to fall for it.
The Pentium 4 fails miserably on all counts. In terms of speed and running existing Windows code, the Pentium 4 is as slow or slower than existing Pentium III and AMD Athlon processors. In terms of price, an entry level Pentium 4 system from DELL or Gateway sells for about double the cost of a similar Pentium III or AMD Athlon based system, with little or no benefit to the consumer. And most sadly of all, from the engineering viewpoint, the Pentium 4 design is very disappointing and casts serious doubts on whether any intelligent life exists in Intel's engineering department. After a month of using them, I was so disgusted with the two Pentium 4 machines I purchased in November that both machines have since been returned to DELL and Gateway. I personally own dozens of PCs and hundreds of PC peripherals, and never have I been so disgusted with a product as to return it.
Both DELL and Gateway falsely advertise their Pentium 4 based systems as somehow being superior or better than their Pentium III and/or Athlon based systems. The only thing that is superior is the price. I urge all computer consumers to BOYCOTT THE PENTIUM 4 and BOYCOTT ALL INTEL PRODUCTS until such time as Intel redesigns their chips to work as advertised. If you have already purchased a Pentium 4 system and sadly found out that it doesn't work as fast as expected, RETURN IT IMMEDIATELY for a refund.
In hindsight, it is not surprising then that prior to the November 20th launch of the Pentium 4, Intel delayed the chip numerous times, and Intel, DELL, Gateway, and COMPAQ all warned of potential earnings problems in the coming quarter, probably knowing full well of the defects in the Pentium 4. Remember, the engineers at those companies have had Pentium 4 chips to play with for several months and all knew of its defects.
It is also not surprising that a week before the Pentium 4 launch, at COMDEX Las Vegas, neither Intel nor Gateway, who both had huge displays at that show, would give much information about the Pentium 4 systems. While Intel did display Pentium 4 based computers, they were locked up during show hours and not available for inspection by the general public. At Gateway's booth, many of the salespeople appeared ignorant, apparently not even aware that the Pentium 4 was being launched the following week. Even DELL, usually a big exhibitor, chose to pull their show exhibit completely, holding only closed door private sessions with the press. Shareholders and software developers were barred from these secret meetings.
Don't be a sucker. Don't buy a Pentium 4. Do as we have suggested here for over a year. If you need a fast cheap PC, buy one that uses an inexpensive Celeron processor. If you require maximum speed, buy one based on the AMD Athlon. Under no circumstances should you purchase a Pentium II, Pentium III, or Pentium 4 based computer! In fact, with the cheaper AMD Duron now available to rival the Celeron, it makes more sense to boycott Intel completely. Buy AMD based systems. AMD has worked hard to outperform Intel and they deserve your business!
How Intel Blew It
Before I start the next section and get very technical, I'll explain briefly how over the past 5 years Intel dug itself into the hole it is in now. When you understand the trouble Intel is in, their erratic behavior will make a little more sense, if you can call it that.
Let's go back 2 or 3 years, back to when the basic Pentium and Pentium MMX chips were battling with AMD's K6 chips. AMD knew (I'm sure) that it had inferior chips on its hands with the K6 line. With the goal of producing a chip truly superior to anything from Intel, its engineers went back to the drawing board and designed a chip architecture from scratch they codenamed the "K7", which in late 1999 was eventually released as the AMD Athlon processor. It took 5 years of work, but they hit their goal. Faster than the best Pentium III at the time, the Athlon delivered 20% to 50% more speed at only slightly higher cost than the basic Pentium III. Mission accomplished.
Intel on the other hand, not content with 90% market share, focused not on FASTER chips, but on CHEAPER SLOWER chips. Monopolistic actions, much like Microsoft's, designed not to deliver a better product to the consumer but rather to wipe out the competition. The Pentium II, while easily the fastest chip on the market at the time, was also more expensive than the AMD K6 and its own Pentium chips. And thus started a comical series of brain dead marketing blunders:
Intel launched the Celeron processor, a marketing gimmick aimed directly at AMD, which consisted of nothing more than taking a Pentium II and removing the entire 512K of L2 cache memory. Basically what they did was chop off the most expensive part of the chip to reduce costs, with no regard to side effects. The result was a smaller less expensive processor, which unfortunately had the nasty side effect of running far slower than other AMD or Intel chips on the market at the time! Buying a Celeron was like buying an old 486 system, it was that slow.
When that didn't pan out, Intel kept selling its older line of Pentium MMX chips. While running at the same 233, 266, and 300 MHz clock speeds as the Pentium II, the Pentium MMX was based on the older design of the original Pentium, and thus delivered about 30% less performance than the Pentium II. Again, it lost out to the AMD K6.
In an effort to fix the Celeron problem, Intel re-launched the Celeron as the Celeron-A (starship Enterprise anyone?), which now featured, surprise, surprise, an L2 cache right on the chip. The new chip was indeed much faster than the original Celeron, and unfortunately for Intel, due to a better designed L2 cache, was even faster than the more expensive Pentium II chip. Intel now shot itself in the foot by offering a "low end" Celeron processor that outperformed the "high end" Pentium II. Confusion!
Finally, in 1999, Intel killed off the slower more expensive Pentium II by introducing the "new" Pentium III, which for all intents and purposes is simply a Pentium II with a higher number to justify the higher cost relative to the Celeron.
In other words, Intel succeeded so well at producing a low cost version of the Pentium II, that it not only put the AMD K6 to shame, it also killed off the Pentium II and was forced to fraudulently remarket the chip as the Pentium III! For all intents and purposes, the Pentium II, the Celeron, and the Pentium III are ONE AND THE SAME CHIP. They're based on the same basic P6 architecture, with only things like clock speed and cache size to differentiate the chips. This is why we tell you not to purchase a Pentium II or Pentium III based system. If you must buy Intel, buy a Celeron. Same chip, lower cost.
Sure, sure, the Pentium III has new and innovative features, like, oooh, a unique serial number on each chip. Well guess what? The serial number idea was so poorly received, and rightfully so, that the serial number is already dead. The Pentium 4 has no such feature.
What Intel FAILED TO DO during these past 5 years is anticipate that the end of the line for the P6 architecture would come as quickly as it did. The P6 hits an upper limit around 1 GHz and can no longer compete with faster AMD chips which people already have running in the 1.5 GHz range.
Here is how and why Intel REALLY blew it. Intel has known since the Athlon first came out that its P6 architecture was doomed. Intel was already well under way to developing the Pentium 4. Remember, these chips take 3 to 5 years to design and implement and it had already been 3 years since the P6 architecture was launched. Intel had about two more years of work left, but that meant losing badly to the Athlon for those next two years.
So instead of focusing on engineering - doing what AMD did and biting the bullet while it developed the new chip - Intel went ahead and first tried to ship a faster Pentium III chip. That backfired. So as a last resort they pulled another Celeron-type stunt and shipped a crippled chip that cut so many features as to result in a chip that is neither fast nor cheap and benefits no one but greedy computer makers.
I've been studying Intel's publicly available white papers on the Pentium 4 for the better part of 6 months now, and while the chip looked promising on paper, the actual first release of the chip is at best a castrated version of the ideal chip that Intel set out to design. Intel selectively left out important implementation details of the Pentium 4, which they finally revealed in November with the posting of the Intel Pentium 4 Processor Optimization manual on their web site.
In an attempt to cover up their design defects, and with no backup plan in place (since the demise of the 1+ GHz Pentium III chip), Intel has been forced to carefully word their optimization document. I encourage all software developers and technically literate computer users to download the Pentium 4 optimization manual mentioned above, and for comparison to also download and study the Pentium III manuals as well as the AMD Athlon manual. It does not take a rocket scientist to read and compare the three sets of documents and realize the fatal design flaws in the Pentium 4.
The flaws in the Pentium 4 are so serious in fact that this is far beyond earlier fiascos. This is not a simple Pentium floating point bug that can be fixed by replacing the processor. This is not a 486SX scam where Intel was selling crippled 486DX chips as SX chips and then selling you a second processor (a real 486DX) as an upgrade. No, in both those past cases the defective chip still delivered the true speed performance advertised. One was simply the result of a minor design error while the other was a marketing scam, but in the end, the chips lived up to spec.
In the case of the Pentium 4, the chip contains design flaws which aren't easily fixed, and it is marketed fraudulently since the speed claims are pulled out of thin air. No quick upgrade or chip fix exists to deliver the true performance that the Pentium 4 was supposed to have. Users will have to wait another year or two while Intel cranks out new silicon which truly implements the full Pentium 4 spec and fixes some of the glaring flaws of the Pentium 4 design.
If you do not have a good technical background on Pentium processors, I recommend you read my Processor Basics section. It will give you a good outline of the history of PC processors over the past 20 years and will allow you to read and understand most of the Intel and AMD processor documents. You have to have at least a basic understanding of the concepts in order to understand why the Pentium 4 is the disaster that it is.
Or, skip ahead to the Pentium 4 - Generation 7 section.
Processor Basics - the various generations of processors over the past 20 years
Generation 1 - 8086 and 68000
In the beginning, the computer dark ages of two decades ago, there was the 8086 chip, Intel's first 16-bit processor which delivered 8 16-bit registers and could manipulate 16 bits of data at a time. It could also address 16 bits of address space at a time (or 64K, much like the Atari 800 and Apple II of the same time period). Using a trick known as segment registers, a program could simultaneously address 4 such 64K segments at a time and have a total of 1 megabyte of addressable memory in the computer. Thus was born the famous 640K RAM limitation of DOS, since the remaining 384K was used for hardware and video.
A lower cost and slower variant, the 8088, was used in early PCs, providing only an 8-bit bus externally to limit the number of pins on the chip and reduce costs. As I incorrectly stated here before, the 8086 was not used in the original IBM PC. It was actually the lower cost 8088.
The original Motorola 68000 chip, while containing 16 32-bit registers and being essentially a 32-bit processor, used a similar trick of having only 16 external data pins and 24 external address pins to reduce the pin count on the chip. An even smaller 68008 chip addressed only 20 bits of address space externally and had the same 1 megabyte memory limitation as the 8086.
While these first generation processors from Intel and Motorola ran at speeds of 4 to 8 MHz, they each required multiple clock cycles to execute any given machine language instruction. This is because these processors lacked any of the modern features we know today such as caches and pipelines. A typical instruction took 4 to 8 cycles to execute, really giving the chips an equivalent speed of 1 MIPS (i.e. 1 million instructions per second).
Generation 2 - 80286 and 68020
By 1984, Intel released the 80286 chip used in the IBM AT and clones. The 80286 introduced the concept of protect mode, a way of protecting memory so that multiple programs could run at the same time and not step on each other. This was the base chip that OS/2 was designed for and which was also used by Windows/286. The 286 ran at 8 to 16 MHz, offering over double the speed of the original 8086 and could address 16 megabytes of memory.
Motorola meanwhile developed the 68020, the true 32-bit version of the 68000, with a full 32-bit data bus and 32-bit address bus capable of addressing 4 gigabytes of memory.
By the way, both companies did release a "1" version of each processor - the 80186 and 68010 - but these were minor enhancements over the 8086 and 68000 and not widely used in home computers.
Generation 3 - 80386 and 68030
The world of home computers didn't really become interesting until late 1986 when Intel released its 3rd generation chip - the 80386, or simply the 386. This chip, although almost 15 years old now, is the base on which OS/2 2.0, Windows 95, and the original Windows NT run. It was Intel's first true 32-bit x86 chip, extending the registers to a full 32 bits in size and increasing addressable memory to 4 gigabytes. In effect it caught up to the 68020 in a big way, also adding things like paging (which is the basis of virtual memory) and support for true multi-tasking and mode switching between 16-bit and 32-bit modes.
The 386 is really the chip, I feel, that put Intel in the lead over Motorola for good. It opened the door to things like OS/2 and Windows NT and Linux - truly pre-emptive, multi-tasking, memory protected operating systems. It was a 286 on steroids, so much more powerful, so much faster, so much more capable than the 286, that at over $20,000 a machine, people were dying to get their hands on them. I remember reading the review of the first Compaq 386 machine, again, a $20,000+ machine that today you can buy for $50, and the reviewer would basically kill to get one.
What made the 386 so special? Well, Intel did a number of things right. First they made the chip more orthogonal. What that means is that they extended the machine language instructions so that in 32-bit mode, almost any of the 8 32-bit registers could be used for anything - storing data, addressing memory, or performing arithmetic operations. Compare this to the 8086 and 80286, whose 16-bit instructions could only use certain registers for certain operations. The orthogonality of the 386 registers made up for the extra registers in the Motorola chips, which specifically had 8 registers which could be used for data and 8 for addressing memory. While you could use an address register to hold data or use data registers to address memory, it was more costly in terms of clock cycles.
The 386 allowed the average programmer to do away with segment registers and 640K limitations. In 386 protect mode, which is what most Windows, OS/2, and Linux programs run in today, a program has the freedom to address up to 4 gigabytes of memory. Even when such memory is not present, the chip's paging feature allows the OS to implement virtual memory by swapping memory to hard disk, what most people know as the swap file.
Another innovation of the 386 chip was the code cache, the ability of the chip to buffer up to 256 bytes of code on the chip itself and eliminate costly memory reads. This is especially useful in tight loops that are smaller than 256 bytes of code.
Motorola countered with the 68030 chip, a similar chip which added built-in paging and virtual memory support, memory protection, and a 256 byte code cache. The 68030 also added a pipeline, a way of executing parts of multiple instructions at the same time, to overlap instructions, in order to speed up execution.
Both the 386 and 68030 ran at speeds ranging from 16 MHz to well above 40 MHz, easily bringing the speed of the chips to over 10 MIPS. Both chips still required multiple clock cycles to execute even the simplest machine language instructions, but were still an order of magnitude faster than their first generation counterparts. Microsoft quickly developed Windows/386 (and later OS/2 2.0 and Windows NT) for the 386, and Apple added virtual memory support to Mac OS.
Both chips also introduced something known as a barrel shifter, a circuit in the chip which can shift or rotate any 32-bit number in one clock cycle. Something used often by many different machine language instructions.
The 386 chip is famous for unseating IBM as the leading PC developer and for causing the breakup with Microsoft. IBM looked at the 386, decided it was too powerful for the average user, and decided not to use it in PCs and not to write operating systems for it. Instead it chose to keep using the 286 and to develop OS/2 for the 286. Microsoft on the other hand developed Windows/386 with improved multitasking, Compaq and other clone makers did use the 386 to deliver the horsepower needed to run such a graphical operating system, and the rest is history. By the time IBM woke up, it was too late. Microsoft won. Compaq, DELL, and Gateway won.
Generation 4 - 486 and 68040
This generation is famous for integrating the floating point co-processor, previously a separate external chip, into the main processor. This generation also refined the existing technology to run faster. The pipelines on the Intel 486 and Motorola 68040 were improved to in effect give the appearance of 1 clock cycle per instruction execution. 20 MIPS. 25 MIPS. 33 MIPS. Double or triple the speed of the previous generation with virtually no change in instruction set! As far as the typical programmer or computer user is concerned, the 386 and 486, or 68030 and 68040, were the same chips, except that the 4th generation ran quicker than the 3rd. And speed was the selling point and the main reason you upgraded to these chips.
These chips gained their speed in a number of ways. First, the caches were increased in size to 8K, and made to handle both code and data. Suddenly relatively large amounts of data (several thousand bytes) could be manipulated without incurring the costly penalty of accessing main memory. Great for mathematical calculations and other such applications. This is why many operating systems today and many video games don't support anything prior to the 4th generation. Mac OS 8 and many Macintosh games require a 68040. Windows 98, Windows NT 4.0, and most Windows software today requires at least a 486. The caches made that huge a difference in speed! Remember this for later!
With the ability to read memory in a single clock cycle now came the ability to execute instructions in a single clock cycle. By decoding one instruction while finishing the execution of the previous instruction, both the 486 and 68040 could give the appearance of executing 1 instruction per cycle. Any given instruction still takes multiple clock cycles to execute, but by overlapping several instructions at once at different stages of execution, you get the appearance of one instruction per cycle. This is the job of the pipeline.
Keeping the pipeline full is of extreme importance! If you have to stop and wait for memory (i.e. the data or code being executed isn't in the cache) or you execute a complex instruction such as a square root, you introduce a bubble into the pipeline - an empty step where no useful work is being done. This is also known as a stall. Stalls are bad. Remember that.
One of the great skills of writing assembly language code, or writing a compiler, is knowing how to arrange the machine language instructions in such an order so that the steps you ask the processor to perform are done as efficiently as possible.
The rules for optimizing code on the 486 and 68040 are fairly simple:
keep loops small to take advantage of the code cache
avoid referencing memory by using the chip's 32-bit registers
avoid referencing memory blocks larger than the size of the data cache
avoid complex instructions - for example where possible use simple instructions such as ADD numerous times in place of a multiply
The techniques used in the 4th generation are very similar to techniques used by RISC (reduced instruction set) processors. The concept is to use as simple instructions as possible. Use several simple instructions in place of one complex instruction. For example, to multiply by 2 simply add a value to itself instead of forcing the chip to use its multiply circuitry. Multiply and divide take many clock cycles, which is fine when multiplying by a large number. But if you simply need to double a number, it is faster to tell the chip to add two numbers than to multiply two numbers.
Another reason to follow the optimization rules is because both the 486 and 68040 introduced the concept of clock doubling, or in general, using a clock multiplier to run the processor internally at several times the speed of the main computer clock. The computer may run at say, 33 MHz, the bus speed, but a typical 486 or 68040 chip is actually running at 66 MHz internally and delivering a whopping 66 MIPS of speed.
The year is now 1990. Windows 3.0 and Macintosh System 7 are about to be released.
Generation 5 - the Pentium and PowerPC
With the first decade and the first 4 generations of chips now in the bag, both Motorola and Intel looked for new ways to squeeze speed out of their chips. Brick walls were being hit in terms of speed. For one, memory chips weren't keeping up with the rapidly increasing speed of processors. Even today, most memory chips are barely 10 or 20 times faster than the memory chips used in computers two decades ago, yet processor speeds are up by a factor of a thousand!
Worse, the remaining hardware in the PC, things like video cards and sound cards and hard disks and modems, run at fixed clock speeds of 8 MHz or 33 MHz or some sub multiple of bus speed. Basically, any time the processor has to reference external memory or hardware, it stalls. The faster the clock multiplier, the more instructions that execute each bus cycle, and the higher the chances of a stall.
This is why, for example, upgrading from a 33 MHz 486 to a 66 MHz 486 only offers about a 50% speed increase in general, and similarly when upgrading from the 68030 to the clock doubled 68040.
It's been said many times by many people, but by now you should have realized that CLOCK SPEED IS NOT EVERYTHING!!
Two chips running at the same speed (a 33 MHz 386 and a 33 MHz 486) do not necessarily give the same level of performance, and
Doubling the internal clock speed of a chip (486 from 33 to 66 MHz) does not always double the performance.
What can affect speed far more than mere clock speed is the rate at which the chip can process instructions. The 4th generation brought the chip down to one instruction per clock cycle. The 5th generation developed the concept of superscalar execution. That is, executing more than one instruction per clock cycle by executing instructions in parallel.
Intel and Motorola chose different paths to achieve this. After an aborted 68050 chip and short lived 68060 chip, Motorola abandoned its 68K line of processors and designed a new chip based on IBM's POWER RISC chip. A RISC processor (or Reduced Instruction Set) does away with complicated machine language instructions which can take multiple clock cycles to execute, and replaces them with simpler instructions which execute in fewer cycles. The advantage of this is the chip achieves a higher throughput in terms of instructions per second or instructions per clock cycle, but the down side is it usually takes more instructions to do the same thing as on a CISC (or Complex Instruction Set) processor.
The theory with RISC processors, which has long since proven to be bullshit, was that by making the instructions simpler the chip could be clocked at a higher clock speed. But this in turn only made up for the fact that more instructions were now required to implement any particular algorithm, and worse, the code grew bigger and thus used up more memory. In reality a RISC processor is no more or less powerful than a CISC processor.
Intel engineers realized this and continued the x86 product line by introducing the Pentium chip, a superscalar version of the 486. The original Pentium was for all intents and purposes a faster 486, executing up to 2 instructions per clock cycle, compared to the 1 instruction per cycle limit of the 486. Once again, CLOCK SPEED IS NOT EVERYTHING.
By executing multiple instructions at the same time, the design of the processor gets more complicated. It is no longer a serial operation. While earlier processors essentially followed this process:
fetch an instruction from memory or the code cache
decode the instruction
execute the instruction in either the floating point unit (FPU), integer unit (ALU), or branch unit
a superscalar processor now has additional steps to worry about:
fetch two instructions from memory or the code cache
decode the two instructions
execute the first instruction
if the second instruction does not depend on the results of the first instruction, and if the second instruction does not require an execution unit being used by the first instruction, execute the second instruction
The extra checks are necessary to make sure that the code executes in the correct order. If two ADD operations follow one another, and the second ADD depends on the result of the first, the two ADD operations cannot execute in parallel. They must execute in serial order.
Intel gave special names to the two "pipes" that instructions execute in - the U pipe and the V pipe. The U pipe is the main path of execution. The V pipe executes "paired" instructions, that is, the second instruction sent from the decoder and which is determined not to conflict with the first instruction.
Since the concept of superscalar execution was new to most programmers, and to Microsoft's compilers, the original Pentium chip only delivered about 20% faster speed than a 486 at the same speed. Not 100% faster speed as expected. But faster nevertheless. The problem was very simply that most code was written serially.
Code written today on the other hand does execute much faster, since compilers now generate code that "schedules" instructions correctly. That is, it interleaves pairs of mutually exclusive instructions so that most of the time two instructions execute each clock cycle.
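The scheduling idea can be sketched at the source level. Below is a minimal C illustration of my own (a hypothetical example, not real compiler output): the first loop forms one long dependency chain, while the second uses two mutually exclusive accumulators that a superscalar chip can execute as a pair.

```c
#include <stddef.h>

/* Serial version: each add depends on the previous sum, so the adds
   cannot pair - at best one executes per clock cycle. */
long sum_serial(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Scheduled version: the two accumulators are mutually exclusive, so
   a superscalar chip can execute both adds in the same clock cycle. */
long sum_paired(const long *a, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];      /* first instruction of the pair */
        s1 += a[i + 1];  /* independent, so it can pair */
    }
    if (i < n)
        s0 += a[i];      /* odd element left over */
    return s0 + s1;
}
```

Both functions compute the same total; only the shape of the dependency chains differs, which is exactly what instruction scheduling is about.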
The original PowerPC 601 chip similarly had the ability to execute two instructions per cycle, an arithmetic instruction pair with a branch instruction. The PowerPC 603 and later versions of the PowerPC added additional arithmetic execution units in order to execute 2 math instructions per cycle.
With the ability to execute twice as much code as before comes greater demand on memory. Twice as many instructions need to be fed into the processor, and potentially twice as much data memory is processed.
Intel and Motorola found that as clock speeds increased, performance didn't scale, even on older chips. A 66 MHz 486 did not deliver twice the speed of a 33 MHz 486. Why not?
The reason again has to do with memory speed. When you double the speed of a processor, the speed of main memory stays the same. That means that a cache miss, which forces the processor to read main memory, now takes TWICE the number of clock cycles. With today's fast processors, a memory read can literally take 100 or more clock cycles. That means 100, or worse, 200 instructions not being executed.
The way Intel and Motorola attacked this problem was to increase the size of the L1 cache, the very high speed on-chip level one cache. For example, the original 486 had an 8K cache. The newer 100 MHz 486 chips had a 16K cache.
But 8K or 16K is nothing compared to the megabytes that a processor can suck in every second. So computers started to include a second level cache, the L2 cache, which was made up of slightly slower but larger memory. Typically 256K. The L2 cache is still on the order of 10 times faster than main memory, and allows most code to operate at near to full speed.
When the L2 cache is disabled (which most PC users can do in the BIOS setup), or when it is left out completely, as Apple did in the original Power Macintosh 6100, performance suffers.
Generation 6 - the P6 architecture and PowerPC G3/G4
By 1996 as processor speeds hit 200 MHz, more brick walls were being hit. Programmers simply weren't optimizing their code and as processor speeds increased, the processors simply spent more time waiting on memory or waiting for instructions to finish executing. Intel and Motorola adopted a whole new set of tricks in their 6th generation of processors. Tricks such as "register renaming", "out of order execution", and "predication".
In other words, if the programmer won't fix the code, the chip will do it for him. The Intel P6 architecture, first released in 1996 in the Pentium Pro processor, is at the heart of all of Intel's current processors - the Pentium II, the Celeron, and the Pentium III. Even AMD's Athlon processor uses the same tricks.
What they did is as follows:
expand the L2 cache to a full 512K of memory. The Pentium II, the original Pentium III, and the original AMD Athlon all did this. Big speed win with no burden on the programmer.
expand the L1 cache. The P6 processors have 32K of L1 cache (16K for data, 16K for code), while the AMD Athlon has a whopping 128K of L1 cache (64K data, 64K code). Another big speed win, more so for the Athlon. Again with no burden on the programmer.
expand the decoder to handle 3 instructions at once. This places a burden on the programmer because instructions now have to be grouped in sets of 3, not just in pairs. Potential 50% speed increase if the code is written properly.
allow decoded instructions to execute out of order provided they are mutually exclusive. This is a huge speed win because it can make up for poor scheduling on the part of the programmer. It also allows code to execute around "bubbles", or instructions which are stalled due to a memory access. Big big big big big speed win.
improved branch prediction. The processor can "guess" with pretty good reliability whether a branch instruction (an "if/else" in a higher level language) will go one way or the other. Higher rates of branch prediction mean fewer stalls caused by branching to the wrong code.
conditional execution or "predication" allows the processor to conditionally execute an instruction based on the result of a previous operation. This is similar to branching, except no branch occurs. Instead data is either moved or not moved. This reduces the number of "if/else" style branch conditions, which is a big win. Unfortunately, predication is new to the P6 family and is not supported on the 486 or earlier versions of the Pentium.
add additional integer execution units so that up to 3 integer instructions can execute at once. Big speed win thanks to out of order execution.
in the AMD Athlon, add additional floating point units to allow up to 3 floating point instructions to execute at once. Big speed win for the Athlon, allowing it to trounce the Intel chips on 3-D and math intensive tasks.
allow registers to be mapped to a larger set of internal registers, a process called "register renaming". Internally, the P6 and K7 architectures do away with the standard 8 x86 32-bit general purpose registers. Instead they contain something like 40 32-bit registers. The processor decides how to assign the 8 registers which the programmer "sees" to the 40 internal registers. This is a speed win for cases where the same register is used serially by mutually exclusive instructions. The two uses of the register get renamed to two different internal registers, thus allowing superscalar out-of-order execution to take place. This trick works best on older 386 and 486 code, or poorly optimized C code which tends to make heavy use of only one or two registers.
From an engineering standpoint, the enhancements in the 6th generation processors are truly amazing. Through the use of brute force (larger caches and faster clock speed), parallel execution (multiple execution units and 3 decoders), and clever interlocking circuitry to allow out-of-order execution, Intel has been able to stick with the same basic architecture for 5 years now, catapulting CPU throughput from the 100 to 150 MHz range in 1995 to over 1 GHz today. Most code, even poorly written unoptimized code, executes at a throughput of over 1 instruction per clock cycle, or roughly 1000 MIPS on today's fastest Pentium III processors.
The PowerPC G3 and G4 chips use much the same tricks (after all, all these silicon engineers went to the same schools and read the same technical papers) which is why the G3 runs faster than a similarly clocked 603 or 604 chip.
Limitations of the Pentium III
AMD, calling the Athlon a "7th generation" processor, something I don't fully agree with since they really didn't have a 6th generation processor, took the basic ideas behind the Pentium II/III and PowerPC G3 and used them to implement the Athlon. Having the benefit of seeing the original Pentium Pro's faults, they fixed many of the bottlenecks of the P6 design which even today limit the full speed of the Pentium III.
These are the same problems that Intel of course is trying to address in the Pentium 4. Understanding why the AMD Athlon is a faster chip and what AMD did right helps us understand why Intel needed to design the Pentium 4, and that is what I shall discuss in this section.
Not counting the unbuffered segment register problem in the original Pentium Pro (which was fixed in the far more popular Pentium II chip), what are the bottlenecks? What can possibly slow down the processor when instructions are being executed out-of-order 3 at a time!?!?
Well, keep in mind that a chain is only as strong as its weakest link. In the case of the processor, each stage can be considered a link in a chain. The main memory. The L2 cache. The L1 cache. The decoder. The scheduler which takes decoded micro-ops and feeds them into the various execution units. The two main bottlenecks in the P6 architecture are the 4-1-1 limitation of the decoder and the dreaded partial register stall.
If you read the Pentium III optimization document, you will see reference to the 4-1-1 rule for decoding instructions. When the Pentium III (for example) fetches code, it pulls in up to three instructions through the decoders each clock cycle. Decoder 1 can decode any machine language instruction. Decoders 2 and 3 can decode only simple, RISC-like instructions that break down into 1 micro-op. A micro-op is a basic operation performed inside the processor. For example, adding two registers takes one micro-op. Adding a memory location to a register requires two micro-ops: a load from memory, then an add. It uses two execution units inside the processor, the load/store unit on one clock cycle, and then an ALU on the next clock cycle. Micro-ops translate roughly into clock cycles per instruction, but don't think of it that way. Since several instructions are being executed in parallel and out of order, the concept of clock cycles per instruction becomes rather fuzzy.
Instead, think of it like this. What is the limitation of each link? How frequently does that link get hit? Main memory, for example, may not be accessed for thousands of clock cycles at a time. So while accessing main memory may cost 100 clock cycles, that penalty is taken infrequently thanks to the buffering performed by the L1 and L2 caches. Only when dealing with large amounts of memory at a time, such as when processing a multi-megabyte bitmap, does it start to hurt.
Intel and AMD have addressed this problem in two ways. First, over the years they have gradually increased the speed of the "front side bus", the data path between main memory and the processor, from 66 MHz in the Celeron and Pentium II, to 100 and 133 MHz in the Pentium III, to 200 MHz in the AMD Athlon. Second, Intel produces a version of the Pentium II and III called the "Xeon", which contains up to 2 megabytes of L2 cache. The Xeon is used most frequently in servers, as it supports 8-way multi-processing, but even on the desktop the Xeon offers considerable speed advantages over the standard Pentium III when large amounts of data are involved. The PowerPC G4 has up to 1 megabyte of L2 cache, which explains why a slower clocked Power Mac G4 blows away a Pentium III in applications such as Photoshop.
Basically, the larger the working set of an application, that is, the amount of code and data in use at any given time, the larger the L2 cache needs to be. To keep costs low, Intel and AMD have both actually DECREASED the sizes of their L2 caches in newer versions of the Pentium III and Athlon, which I believe is a mistake.
The top level cache, the L1 cache, is the most crucial, since it is accessed first for any memory operation. The L1 cache uses extremely high speed memory (which has to keep up with the internal speed of the processor), so it is very expensive to put on chip and tends to be relatively small. Again, from 8K in the 486 to 128K in the Athlon. But as my tests have shown, the larger the L1 cache, the better.
The next step is the decoder, and this is one of the two major flaws of the P6 family. The 4-1-1 rule prevents more than one "complex" instruction from being decoded each clock cycle. Much like the U-V pairing rules for the original Pentium, Intel's documents contain tables showing how many micro-ops are required by every machine language instruction and give guidelines on how to group instructions.
Unlike main memory, the decoder is always in use. Every clock cycle, it decodes 1, 2, or 3 instructions of machine language code. This limits the throughput of the processor to at most 3 times the clock speed. For example, a 1 GHz Pentium III can execute at most 3 billion instructions per second, or 3000 MIPS. In reality, most programmers and most compilers write code that is less than optimal, and which is usually grouped for the complex-simple-complex-simple pairing rules of the original Pentium. As a result, the typical throughput of a P6 family processor is more like double the clock speed. For example, 2000 MIPS for a 1 GHz processor.
By sticking to simpler instruction forms and simpler instructions in general (which in turn decode to fewer micro-ops) a machine language programmer can achieve close to the 3x MIPS limit imposed by the decoder. In fact, this simple technique (along with elimination of partial register stalls) is the reason our SoftMac 2000 Macintosh emulator runs so much faster than other emulators, and why in the summer of 2000, when I re-wrote the FUSION PC emulator, I was able to achieve about a 50% increase in the speed of that emulator in only a few days' worth of work. By simply understanding how the decoder works and writing code appropriately, one can achieve near optimal speeds from the processor.
Once again, let me repeat: CLOCK SPEED IS NOT EVERYTHING! So many people stupidly go out and buy a new computer every year expecting faster clock speed to solve their problems, when the main problem is not clock speed. The problem is poorly written code, uneducated programmers, and out of date compilers (that's YOU Microsoft) that target obsolete processors. How many people still run Microsoft Office 95? Ok, do a DUMPBIN on WINWORD.EXE or EXCEL.EXE to get the version number of the compiler tools. That product was written in an old version of Visual C++ which targets now obsolete 486 processors. Do the same thing with Office 97 or Office 2000. Newer tools that target P6. Wonder why your Office 97 runs faster than your Office 95 on the same Pentium III computer? Ditto for Windows 98 over Windows 95. Windows 2000 over Windows 98. Etc. etc. The newer the compiler tools, the better optimized the code is for today's processors.
The next bottleneck is the actual execution units - the guts of the processor. They determine how many micro-ops of a given type can execute in one clock cycle. For example, the P6 family can load or store one memory location per clock cycle. It can execute one floating point instruction per clock cycle because there is only one FPU. This means that even the most optimized code, code that caches perfectly and decodes perfectly, can still hit a bottleneck simply because too many instructions of the same type are trying to execute. Again, one needs to mix instructions - integer and floating point and branch - to make best use of the processor.
Finally, that dreaded partial register stall! The one serious bug in the P6 design that can cause legacy code to run slower. By "legacy code" I mean code written for a previous version of the processor. See, until now, every generation improved on the design of previous generations. No matter what, you were almost 100% guaranteed that a newer processor, even running at the same clock speed as a previous processor, would deliver more speed. Why a 68040 is faster than a 68030. Why a Pentium is faster than a 486.
Not so with generation 6. While every other optimization in the P6 family, even the 4-1-1 decode rule, pretty much boosts performance without requiring the programmer to rewrite a single line of code, the register renaming optimization has one fatal flaw that kills performance: partial register stalls! A partial register stall occurs when a partial register (that is, the AL, AH, and AX parts of the EAX register, the BL, BH, and BX parts of the EBX register, etc.) and the full register get renamed to different internal registers because the processor believes the uses are mutually exclusive.
For example, a C compiler will typically read an 8-bit or 16-bit integer from memory into the AL or AX register. It will then perform some operation on that integer, for example, incrementing it or testing a value. A typical C code sequence to test a byte for zero goes something like this:
int foo(unsigned char ch)
{
    return (ch == 0) ? 1 : -1;
}
Microsoft's compilers for years have used a "clever" little trick with conditional expressions, and that is to use a compare instruction to set the carry flag based on the result of an expression, then to use the x86 SBB instruction to set a register to all 1's or 0's. Once set, the register can be masked and manipulated to generate any two desired resulting values. MMX code makes heavy use of this trick as well, although MMX registers are not subject to the partial register stall.
Anyway, when you compile the above code using Microsoft's Visual C++ 4.2 compiler with full Pentium optimizations (-O2 -G5), you get the following code:
_ch$ = 8
; 4 : return (ch == 0) ? 1 : -1;
00000 80 7c 24 04 01 cmp BYTE PTR _ch$[esp-4], 1
00005 1b c0 sbb eax, eax
00007 83 e0 02 and eax, 2
0000a 48 dec eax
0000b c3 ret 0
and when compiled with Microsoft's latest Visual C++ 6.0 SP4 compiler you get code like this:
_ch$ = 8
; 4 : return (ch == 0) ? 1 : -1;
00000 8a 44 24 04 mov al, BYTE PTR _ch$[esp-4]
00004 f6 d8 neg al
00006 1b c0 sbb eax, eax
00008 83 e0 fe and eax, -2 ; fffffffeH
0000b 40 inc eax
0000c c3 ret 0
Notice in both cases the use of the SBB instruction to set EAX to either $FFFFFFFF or $00000000. Internally the processor reads the EAX register, subtracts it from itself, then writes the value back out to EAX. (Yes, it is stupid that a processor subtracting a register from itself reads the register first, but I have verified that it does.) In the VC 4.2 case, the processor may or may not stall because we don't know how far back the EAX register was last updated and whether all or part of it was updated.
But interestingly, with the latest 6.0 compiler, even using the -G6 (optimize for P6 family) flag, a partial register stall results. AL is written to, then all of EAX is used by the SBB instruction. This is perfectly valid code, and runs perfectly fine on the 486, Pentium classic, and AMD processors, but suffers a partial register stall on any of the P6 processors. On the Pentium Pro a stall of about 12 clock cycles, and on the newer Pentium III about 4 clock cycles.
Why does the partial register stall occur? Because internally the AL register and the EAX registers get mapped to two different internal registers. The processor does not discover the mistake until the second micro-op is about to execute, at which point it needs to stop and re-execute the instruction properly. This results in the pipeline being flushed and the processor having to decode the instructions a second time.
How to solve the problem? Well, Intel DID tell developers how to avoid the problem. Most didn't listen. The way you work around a partial register stall is to clear a register, either using an XOR operation on itself, a SUB on itself, or moving the value 0 into the register. (Ironically, SBB, which is almost identical to SUB, does not do the trick!) Using one of these three tricks will flag the register as being clear, i.e. zero. This allows the second use of the register to be mapped to the same internal register. No stall.
So what is the correct code? Something like the code below is correct (generated with the Visual C++ 7.0 beta):
_ch$ = 8
; 4 : return (ch == 0) ? 1 : -1;
00000 8a 4c 24 04 mov cl, BYTE PTR _ch$[esp-4]
00004 33 c0 xor eax, eax
00006 84 c9 test cl, cl
00008 0f 94 c0 sete al
0000b 8d 44 00 ff lea eax, DWORD PTR [eax+eax-1]
0000f c3 ret 0
Until every single Windows application out there gets re-compiled with Visual C++ 7.0, or gets hand coded in assembly language, your brand spanking new Pentium III processor will not run as fast as it can. And even then, the fix comes at the expense of code size and larger memory usage. Note the extra XOR instruction needed to prevent the partial register stall on the SETE instruction. While this does eliminate the partial register stall, it does so at the expense of larger code. You eliminate one bottleneck, you end up increasing another.
Why the AMD Athlon doesn't suck
Guess what folks? The AMD Athlon has no partial register stall!!!! Woo hoo! AMD's engineers attacked the problem and eliminated it. I've verified that to be true by checking several different releases of the Athlon. That simple design fix, which affects just about every single Windows program ever written, along with the larger L1 cache and better parallel execution, is why the AMD Athlon runs circles around the Pentium III.
Floating point code especially, which means many 3-D games, runs faster on the Athlon because the code runs into fewer bottlenecks inside the processor.
That's it, three simple things that AMD did right. They stuck to the principle of making every generation of processor faster than every previous generation without forcing programmers to re-write their code. Don't force the programmer to group code a certain way. Don't force code to execute serially because you were too lazy to add a second or third floating point unit. Don't let perfectly legal code cause the processor to have a cardiac.
Intel? Are you listening? HELLO?
Pentium 4 - Generation 7 or complete stupidity?
Let's get to the meat of it. WHY THE PENTIUM 4 SUCKS. If you've read this far I expect you have downloaded the Intel and AMD manuals I mentioned above, you've read them, and you have a good understanding of how the Pentium III, AMD Athlon, and Pentium 4 work internally. If not, start over!
You've read my previous section on the cool tricks introduced in the 6th generation processors (Pentium II, AMD Athlon, PowerPC G3) and the kinds of bottlenecks that can slow down the code:
partial register stalls, big big minus on Pentium II and Pentium III (including Celeron and Xeon)
decoder limitation - forcing the programmer to re-arrange code a certain way
lack of execution units - such as only having one floating point unit available each clock cycle
small caches - the faster the processor, the larger the cache you need to keep up with the data coming in
As I mentioned, AMD, to their well deserved credit, attacked all these problems head on in the Athlon by detecting and eliminating the partial register stall, by relaxing limitations on the decoder and instruction grouping, and by making the L1 caches larger than ever.
So, after 5 years of deep thought, billions of dollars in R&D, months of delays, hype beyond belief, how did Intel respond?
In what can only be considered a monumental lapse in judgment, Intel went ahead and threw out the many tried and tested ideas implemented in both the PowerPC and AMD Athlon processor families and literally took a step back 5 years to the days of the 486.
It seems that Intel is taking an approach similar to that of their upcoming Itanium chip - that the chip should do less optimization work and that the programmer should be responsible for that work. An idea not unfamiliar to RISC chip programmers, but Intel really went a little too far. They literally stripped the processor bare and tried to use brute force clock speed to make up for it!
Except the idea doesn't work. Benchmark after benchmark after benchmark shows the 1.5 GHz Pentium 4 chip running slower than a 900 MHz Athlon, in some cases slower than a 533 MHz Celeron, and in rare cases even as slow as a 200 MHz Pentium.
Intel did throw a few new ideas into the Pentium 4. The two ALUs (arithmetic logic units), which perform adds and other simple integer operations, run at twice the clock speed of the processor. In other words, the Pentium 4 is in theory capable of executing 6 billion integer arithmetic operations per second. As I'll explain below, the true limit is actually much lower and no better than what you can already get out of a Pentium III or Athlon.
Another new idea is the concept of a "trace cache", which is basically a code cache for decoded micro-ops. Rather than constantly decoding the instructions in a loop over and over again, the Pentium 4 does away with the L1 code cache. Instead, it caches the output of the decoder, the raw micro-ops. This sounds like a good idea at first, but again, in reality it does not prove any better than simply having an 8K code cache, and certainly falls short of the Athlon's 64K code cache.
As Tom's Hardware site documented last month, the Pentium 4 lost miserably against the AMD Athlon at MPEG4 video encoding. Only after Intel engineers personally modified the code did the Pentium 4 suddenly win the benchmarks. A side effect of this is that the benchmarks on the Athlon improved considerably as well, indicating that the code was very poorly written in the first place.
However, this brings up the point again that Intel now expects software developers to completely rewrite their code in order to see performance gains on the Pentium 4. And we don't all have the luxury of an Intel engineer showing up on our doorstep to re-write our code for us. With thousands of Windows applications out there, not to mention the growing number of Linux applications for the PC, and sadly out of date compiler tools, does Intel seriously expect millions of lines of code to be rewritten just for the Pentium 4?
I downloaded the modified MPEG4 FlasK encoder and ran it on my own 150 megabyte sample video file. I used 8 similarly configured Windows Millennium computers as the test machines, which had the following processors and memory sizes, roughly sorted by cost:
1.5 GHz Pentium 4 Gateway 1500xl with 512 MB of RDRAM (cost: over $4000)
650 MHz Pentium III Gateway Solo notebook with 288 MB of RAM (cost: about $4000)
670 MHz (overclocked) Pentium III home built with 1 GB of RAM (cost: about $2500)
600 MHz Crusoe based Sony Picturebook with 128 MB of RAM (cost: about $2200)
900 MHz AMD Athlon Thunderbird homebuilt with 384 MB of RAM (cost: about $1500)
600 MHz AMD Athlon home built with 256 MB of RAM (cost: about $1000)
500 MHz Pentium III home built with 256 MB of RAM (cost: about $1000)
533 MHz Celeron (one of our COMDEX demo machines) with 128 MB of RAM (cost: under $600 including DVD-ROM and RAM upgrade)
The encoding times (in seconds) for the same sample piece of video were as follows:
Chip speed and type       Elapsed time (seconds)   Clock cycles (billions)
1.5 GHz Pentium 4         484                      726
900 MHz AMD Athlon        544                      490
670 MHz Pentium III       743                      498
650 MHz Pentium III       757                      492
533 MHz Celeron           858                      457
600 MHz AMD Athlon        922                      553
500 MHz Pentium III       946                      473
600 MHz Crusoe            1369                     821
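The clock cycles column is derived directly from the elapsed times: seconds times clock rate in GHz gives billions of cycles spent on the job. As a quick sanity check (a simple helper of my own, not part of the FlasK benchmark):

```c
/* How the clock cycles column was computed: elapsed seconds times the
   clock rate in GHz gives billions of cycles spent on the job.
   (Simple helper of mine, not part of the FlasK benchmark itself.) */
double cycles_billions(double seconds, double ghz)
{
    return seconds * ghz;
}
/* e.g. the Pentium 4 row: cycles_billions(484, 1.5) = 726 billion
   cycles, versus about 490 billion for the 900 MHz Athlon's 544
   seconds, even though the Athlon finished only 60 seconds behind. */
```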
The 1.5 GHz Pentium 4 won of course, but only barely, over a 900 MHz AMD Athlon at about 1/3 the price and 60% of the clock speed. Worse, the Pentium 4 fails to even cut the processing time in half compared to the much slower clocked Pentium III and Celeron systems. The Pentium 4 is barely twice as fast at this benchmark as a 500 MHz Pentium III.
This benchmark illustrates several important concepts I discussed earlier, especially when we calculate the total number of clock cycles executed on each processor. Counting the total number of cycles equalizes out the differences in clock speed between the various systems.
First, CLOCK SPEED IS NOT EVERYTHING!! Just because one processor runs at a faster clock speed than another, does not mean you will get proportionally faster performance.
Second, the Pentium 4 seems to require almost 50% more clock cycles than the Athlon or Pentium III, indicating either that floating point operations each take more cycles on the Pentium 4, that the Pentium 4 does not execute as many floating point instructions in parallel as the Athlon, or that the Pentium 4 is being throttled by the cache or decoder. An MPEG encode deals with a lot of data (in this case, 150 megabytes of it) and my guess is the small L1 cache on the Pentium 4 hurts it here. Intel's optimization document addresses this issue, referring to techniques known as "cache blocking" and "strip mining" that minimize cache usage by working on small portions of data at a time. Again, something that requires a code rewrite to implement.
For another floating point test, I ran the widely used Prime95 utility for finding Mersenne primes (see http://www.mersenne.org). Setting up the same machines, I launched PRIME95 on each machine at the same time and had them begin testing the primality of numbers of roughly the same length, several million digits long. The number being worked on requires about 24 hours of processing time. After about an hour of running time, it was clear that the Pentium 4 was neck and neck with the 900 MHz Athlon. After several hours, still tied. After 24 hours of running time both the Pentium 4 and 900 MHz Athlon completed, while the others were still part way through processing the number. I recorded the relative progress of each, with the Pentium 4 and 900 MHz Athlon shown as complete and roughly tied:
Chip speed and type       Relative speed
1.5 GHz Pentium 4         100% (complete, tied)
900 MHz AMD Athlon        100% (complete, tied)
670 MHz Pentium III       90%
650 MHz Pentium III       90%
533 MHz Celeron           60%
600 MHz AMD Athlon        99%
500 MHz Pentium III       45%
600 MHz Crusoe            60%
Here, the clear floating point dominance of the AMD Athlon over the Pentium III and Pentium 4 is evident. Since the source code to PRIME95 can be freely downloaded, I looked at it. It contains a lot of hand coded assembly code, and more importantly, a LOT OF FLOATING POINT instructions. The Athlon, with its ability to execute 3 floating point instructions per clock cycle, just about keeps up with the Pentium 4 even at 60% of the clock speed. At 600 MHz, the Athlon blows away a Pentium III chip clocked over 10% faster.
In a third floating point test, running the SoftMac 2000 emulator and then running a heavily floating point based benchmark on the Mac OS, the Pentium 4 fails to keep up with even the 600 MHz chips, losing badly (82 seconds vs. 49 seconds) against the 670 MHz Pentium III and losing worse (82 seconds vs. 36 seconds) against the 900 MHz AMD Athlon.
Running other tests using various emulators, I found that in general the Pentium 4 runs emulators such as SoftMac 2000 SLOWER in most cases than the 650 MHz Pentium III and 600 MHz AMD Athlon.
A small tangent about the Transmeta Crusoe
I should stop and mention a few things about the biggest surprise to me from COMDEX Las Vegas, which was not the Pentium 4 chip but the Transmeta Crusoe chip.
See, the folks over at Transmeta have their own ideas how to build processors, especially for portable devices that have to restrict their power use. After about 5 years of secret development, these guys came up with a chip that works slightly differently from the Intel and AMD chips.
Rather than waste millions of transistors on a chip for out-of-order schedulers and other fancy tricks, they decided to strip all this out of the chip and eliminate about 90% of the chip's power consumption. Instead, a piece of software performs the code optimizations at run time. This is essentially the concept behind a dynamically recompiling emulator, and the concept behind a JIT (just in time) compiler such as what is found in Java.
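The idea can be sketched in a few lines of C. This is a toy illustration of a translation cache, not Transmeta's actual code morphing software: "guest instructions" are just small opcodes, and "translation" is modeled as looking up a host function once and caching the result for every later execution.

```c
#include <stddef.h>

#define N_OPCODES 4

/* Host implementations of four toy "guest" operations. */
static long op_nop(long x)  { return x; }
static long op_add1(long x) { return x + 1; }
static long op_dbl(long x)  { return x * 2; }
static long op_neg(long x)  { return -x; }

typedef long (*translated_fn)(long);

/* The translation cache: one slot per guest opcode, filled lazily. */
static translated_fn cache[N_OPCODES];
static int translations_done;  /* counts how often the translation cost was paid */

/* "Translate" a guest opcode into host code (here: just pick a C function,
   standing in for real run-time code generation and optimization). */
static translated_fn translate(int opcode) {
    static const translated_fn table[N_OPCODES] = { op_nop, op_add1, op_dbl, op_neg };
    translations_done++;
    return table[opcode];
}

/* Execute a guest opcode, translating it only on first encounter. */
static long execute(int opcode, long x) {
    if (cache[opcode] == NULL)
        cache[opcode] = translate(opcode);  /* slow path, taken once per opcode */
    return cache[opcode](x);                /* fast path every time thereafter */
}
```

The point is the shape of execute(): the expensive path runs once per piece of code, and everything after that is a cheap lookup - which is also why the longer the Crusoe runs the same code, the faster it appears to get.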
What Transmeta has done is taken this a step further to JIT the entire Windows operating system at run time, rather than say, a tiny 100K Java applet. And they pulled it off. Using a 600 MHz chip that performs software based optimizations, I find the Picturebook consistently performing at about the speed of a 300 MHz Pentium class processor, as perfectly demonstrated by the MPEG4 results above.
In addition to that, the Crusoe chip has the peculiar property of running faster as time goes by: as more time elapses, the chip's JIT compiler appears to optimize the code further. I've actually noticed this running the SoftMac emulator - as I repeat benchmarks under the Mac OS, they get slightly faster with each run.
This shows up in the PRIME95 benchmark quite clearly, where after 24 hours of run time the Crusoe keeps right up with a 533 MHz Celeron chip - almost keeping up clock cycle for clock cycle with Intel's chip!
As I said a few weeks ago, hats off to the geniuses at Transmeta for pulling off such an amazing feat of emulation. This idea of software-assisted execution may well in fact be the solution for Intel's woes in future generations of chips, as it takes the burden of code optimization off the hands of millions of software developers and puts in back in the chip without requiring millions of extra transistors.
Analyzing and understanding the results
But back to the Pentium 4 and figuring out why it sucks. I finally pulled out my big gun: a custom CPU analyzer utility which I use to analyze various processors. It measures things like the sizes and speeds of the caches, and it executes hundreds of different sample code sequences in order to measure the throughput of each piece of code on each processor. These code sequences consist of code that is commonly emitted by Microsoft's Visual C++ compiler and code that is commonly found in emulation code. I've used this utility for years to hand tune my emulators for various processors and it's served me well.
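My analyzer isn't public, but the technique is easy to sketch. Below is a minimal, portable approximation in C (the names are mine, for illustration; a real analyzer would read the processor's cycle counter rather than use clock()):

```c
#include <time.h>

/* Time one code sequence by running it many times; returns seconds elapsed.
   Dividing by the iteration count and the clock speed gives an estimate of
   cycles per sequence, which is what gets compared across processors. */
static double time_sequence(void (*seq)(void), long iterations) {
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        seq();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* A sample code sequence of the kind a compiler emits: load, shift, add, store.
   The volatile sink keeps the compiler from optimizing the sequence away. */
static volatile unsigned long sink;
static void sample_shift_add(void) {
    unsigned long x = sink;
    sink = (x << 3) + x;   /* multiply by 9 via shift-and-add */
}
```

Running time_sequence() over hundreds of such sample sequences, on each processor, is all the "big gun" amounts to in principle.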
After just a few minutes on the Pentium 4 it gave me the results I needed. I then read over Intel's Pentium 4 documents again and corroborated my results, in order to finally determine the fatal design flaws of the Pentium 4.
MISTAKE #1 - Small L1 data cache - I couldn't believe it myself when I first saw the results, but Intel's own statements confirm it. The Pentium 4 has a grossly under-sized 8K L1 data cache. That was the size of the L1 cache back in the 486, more than TEN YEARS AGO. Some idiots never learn. The L1 cache is the most important block of memory in the whole computer. It is the first level of memory that the processor goes to and it is the memory that the processor spends most of its time accessing. Intel learned back in the 486 days that 8K of cache was grossly inadequate, raising the size of the cache from 8K to 16K in later versions of the 486 and to 32K (16K code, 16K data) in the P6 family. AMD went a step further with their 128K L1 cache in the Athlon and Duron processors.
Going back to 8K is just plain idiotic. Sure, you save a few transistors on the chip. You also cripple the speed of just about every Windows program out there! The problem is, 8K is not a lot of data. At a 1024x768 screen resolution and a 32-bit color depth, 2 scan lines of video consume 8K of data. Simply manipulating more than two scan lines of video at a time will overflow the L1 cache on the Pentium 4.
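The scan line arithmetic works out like this (the numbers are the ones quoted above):

```c
/* Bytes in one scan line: pixels across times bytes per pixel.
   At 1024x768 in 32-bit color, that's 1024 * 4 = 4096 bytes. */
static int scanline_bytes(int width_px, int bytes_per_pixel) {
    return width_px * bytes_per_pixel;
}

/* How many whole scan lines fit in a data cache of the given size? */
static int scanlines_in_cache(int cache_bytes, int width_px, int bytes_per_pixel) {
    return cache_bytes / scanline_bytes(width_px, bytes_per_pixel);
}
```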
My testing shows that while the Pentium 4 has extremely fast memory access for working sets of data up to 8K in size, at 16K and 32K sizes it is no faster than a 650 MHz Pentium III. The Pentium III's L1 cache, even though running at a much slower clock speed, keeps up with the Pentium 4's L2 cache, and the 900 MHz Athlon's 64K data cache in fact outperforms the Pentium 4's L2 cache. When manipulating sound or video data, the AMD Athlon can therefore work on 8 times as much data as the Pentium 4 without ever leaving its fastest cache.
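A minimal sketch of the kind of memory test involved (illustrative, not my actual test code): sweep a buffer repeatedly and time the sweeps as the buffer grows.

```c
#include <stddef.h>

/* Touch every 64-byte cache line of a `size`-byte buffer, `passes` times,
   returning a checksum so the compiler cannot discard the loop.  Timing
   this call for sizes of 4K, 8K, 16K, 32K, ... shows a sharp knee at each
   point where the working set falls out of a cache level - at 8K on the
   Pentium 4's L1 data cache, but not until 64K on the Athlon's. */
static unsigned long sweep(const unsigned char *buf, size_t size, int passes) {
    unsigned long sum = 0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < size; i += 64)
            sum += buf[i];
    return sum;
}
```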
MISTAKE #2 - No L3 cache - Intel originally specified a 1 megabyte L3 cache for the Pentium 4. This third level cache, much like a G4's back side cache or the large L2 cache in the Pentium III and Athlon, provides an extra level of fast memory to help keep the chip from having to access slow main memory. The L3 cache was completely removed from the released Pentium 4 chip. Like I said, some idiots at Intel never learn. Intel learned the hard way, when it released the original crippled Celeron processor, NOT to cut or eliminate the cache. It's quite obvious that Intel realized early on that an 8K L1 cache would hurt, and compensated by adding the L3 cache. When push came to shove and Intel needed to cut corners, they cut the L3 cache.
How significant a cut is this? Well, consider that Intel DOES make versions of the Pentium III with 1 and 2 megabytes of L2 cache - the Pentium III Xeon. It's more expensive than the regular Pentium III chip, but ask anyone with a Xeon if they'd trade it in for a regular Pentium III. My testing shows that at working sets between 256K and 2M, a 700 MHz Xeon processor easily outperforms the Pentium 4 at memory operations. How much is 256K or 2M? Well, that's about the typical size of an uncompressed bitmap. It's the reason a Power Mac G4 running Photoshop kills a typical Pentium III running Photoshop. And axing the L3 cache is a main reason why the Pentium 4 is not the G4 killer it could have been.
MISTAKE #3 - Decoder is crippled - In another step back to the 486 days of 10 years ago, Intel took a rather idiotic approach to the U-V pairing and 4-1-1 grouping limitations of past decoders. They simply eliminated the extra decoders and went back to a single decoder. Only one machine language instruction can be decoded per clock cycle. The twisted logic behind this is that the trace cache eliminates the need to decode instructions every clock cycle. True, IF and only if the code being executed has already been decoded and cached in the trace cache.
But guess what my friends? When a new piece of code is called that is not in the trace cache (or in a traditional code cache), the processor must reach into the L2 cache or into main memory to pull in 64 bytes of code. Then it has to decode those 64 bytes. Well, a typical x86 instruction is about 3 bytes in size, so 64 bytes of memory hold about 21 machine language instructions. Assuming all 64 bytes of code execute, how long will it take a Pentium 4 to decode all of the instructions? 21 clock cycles. How long will it thus take that piece of code to execute? More than 21 clock cycles. Now compare this to the Pentium III or Athlon. How long will those chips need to decode the same bytes? Roughly 7 to 11 cycles.
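The decode arithmetic above can be written out as a quick check (3 bytes is the rough average instruction size used in the text, and the P6 figure is its best case of 3 instructions per cycle):

```c
/* Instructions in one 64-byte fetch, at an average x86 instruction size. */
static int insns_per_line(int line_bytes, int avg_insn_bytes) {
    return line_bytes / avg_insn_bytes;
}

/* Cycles to decode them at a given decode width (instructions per cycle),
   rounding up for the final partial group. */
static int decode_cycles(int insns, int decode_width) {
    return (insns + decode_width - 1) / decode_width;
}
```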
MISTAKE #4 - Trace cache throughput too low - Remember my analogy about the weak link in the chain? We've already found that the decoder can only feed 1 instruction's worth of micro-ops to the trace cache per clock cycle. Then, reading Intel's specs some more, we can see that the trace cache itself can only feed at most 3 micro-ops to the execution units per clock cycle.
The trace cache feeds these micro-ops to the processor core which then executes them in one or more dedicated execution units. Intel's Pentium 4 overview mentions that the Pentium 4 processor core contains 7 execution units:
the two double-speed ALUs to handle adds and subtracts
a normal speed ALU to handle shift and rotate operations
a load unit to read memory
a store unit to write to memory
a floating point move unit (which also handles floating point loads and stores), and
a general floating point unit to handle all floating point and MMX operations
Together, these execution units can in theory process 9 micro-ops per clock cycle - 4 simple integer operations, 1 integer shift/rotate, a read and a write to memory, a floating point operation, and an MMX operation.
Sounds pretty sweet, except for the problem that the trace cache feeds only 3 micro-ops at a time! While on the Pentium III the decoder can feed up to 3 instructions and 6 micro-ops (4+1+1) to the core per clock cycle, the Pentium 4 is crippled to the point of decoding one instruction per cycle and feeding at most 3 micro-ops to the core per clock cycle.
For well optimized code, code which follows the 4-1-1 rule and runs optimally on both Pentium III and AMD Athlon processors, the Pentium 4 is virtually guaranteed to run slower at the same clock speed. I verified this with some common code sequences. No wonder the 900 MHz Athlon keeps beating the Pentium 4 in the benchmarks.
MISTAKE #5 - Wrong distribution of execution units - This is a direct result of mistake #4, and that is that the breakdown of the execution units themselves is completely wrong.
Think about it. 5 of the 7 execution units are dedicated to handling the integer registers, the 8 "classic" registers EAX EBX ECX EDX ESI EDI EBP and ESP. Yet as is already clear, the Pentium 4 does a horrific job of executing legacy code.
Intel's own documents put heavy emphasis on the use of the new MMX registers, both the 64-bit and 128-bit MMX registers introduced in the P6 family. Yet only one single execution unit handles MMX. And if you read Intel's specs in more detail, they state that the unit can only accept a micro-op every other clock cycle. In other words, the 1.5 GHz Pentium 4 can at most execute 750 million floating point or MMX operations per second. But MMX is one of the things Intel hypes up!
So why cripple the very feature you're trying to hype?
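The throughput arithmetic is simple enough to write down (the 2-cycle figure is Intel's stated issue rate for the single FP/MMX unit):

```c
/* Peak FP/MMX micro-ops per second when a single unit accepts
   one micro-op every `cycles_per_uop` clock cycles. */
static long peak_fp_ops(long clock_hz, int cycles_per_uop) {
    return clock_hz / cycles_per_uop;
}
```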
In a related act of stupidity, Intel put 3 integer ALUs in the core, two of which operate at double the chip speed. So between them, the three ALUs can accept up to 5 micro-ops per clock cycle. But we've already learned that the trace cache can provide at most 3. So one or more integer ALUs sit idle each clock cycle. It is impossible to even feed 4 micro-ops into the two double-speed units. So why did Intel waste transistors to implement a redundant ALU, but then cut corners by eliminating a much more needed second floating point unit? It's just plain idiotic.
MISTAKE #6 - Shifts and rotates are slow - It seems Intel has taken yet another step back to the days of the 486, even the days of the 286, by eliminating the high-speed barrel shifter found in all previous 386, 486, Pentium, 68020, 68030, 68040, and PowerPC chips. Instead, they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor.
How bad is this mistake? For emulation code, it's absolutely devastating. Shift operations are used for table lookups, for bit extractions, for byte swapping, and for any number of other operations. For some reason, Intel's engineers just could not spare a few extra transistors to keep shifts fast, yet they waste transistors on idle double speed ALUs.
Intel's own documentation is now contradictory. On the one hand, Intel has for years advocated the use of shift and add operations to avoid costly multiply operations. For example, to multiply by 10, it is quicker on the 486 and Pentium to use shifts to quickly multiply by 2 and 8 and then add the results. However, on the Pentium 4 a shift can take as long as 6 clock cycles while a multiply operation completes in 7 cycles. So it is quicker now to perform a multiply, even though the end result is slower than on a Pentium III.
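The shift-and-add trick in question, written in C (a compiler would emit two shifts and an add for this on a 486 or Pentium):

```c
/* Multiply by 10 without a MUL instruction: 10x = 2x + 8x,
   i.e. a shift left by 1 plus a shift left by 3. */
static unsigned mul10(unsigned x) {
    return (x << 1) + (x << 3);
}
```

On a Pentium 4, where each shift can cost 4 to 6 cycles, the two shifts alone already exceed the 7-cycle multiply - hence the reversal described above.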
This appears to have something to do with the fact that the original Pentium 4 design called for two address generation units, which are circuits that quickly calculate addresses for memory operations. In previous chips, the AGU contained a barrel shifter to quickly handle indexed table lookups, which the Pentium 4 now handles using the much slower ALU. The "add and shift" trick was usually accomplished by the AGU via a programming trick using the LEA (load effective address) instruction. This trick is now rendered useless thanks to Intel cutting the barrel shifter out of the AGU.
MISTAKE #7 - Fixed the partial register stall with a worse solution - While it is true that the partial register stall is finally a thing of the past in the Pentium 4, Intel's solution is less than elegant. It is not only worse than AMD's solution, but actually worse than the problem it tries to fix. Accessing certain partial registers now involves the shift/rotate unit, meaning that a simple 8-bit register read or write can take longer than accessing L1 cache memory! It's backwards!
MISTAKE #8 - Instructions take more clock cycles to complete - This is less a specific mistake than an overall side effect of the first 7 idiotic mistakes. The end result of all the cost cutting and silicon chopping is that typical code sequences now take more clock cycles to execute than on the P6 architecture. Intel relies on the much faster clock speed of the Pentium 4 to overcome this problem, but this only works against the Pentium III and slower Intel processors. Against the AMD Athlon, it loses badly.
As I mentioned above, typical code sequences generated by C++ compilers now take more clock cycles to execute. This is due partly to the brain dead decisions to decode only one instruction per clock cycle and to feed only 3 micro-ops to the core per clock cycle, and partly to the longer pipeline used in the Pentium 4: flow control operations (such as branches, jumps, and calls) take longer because it takes longer to refill the processor pipeline.
For example, an indirect call through a general purpose register, common when making member function calls in C++, now takes about 40 clock cycles on the Pentium 4. Compare this to only 10 to 14 cycles on P6 family and AMD Athlon processors. Even at the faster clock speed, the Pentium 4 function calls are slower overall. Similarly, Windows API calls, which call indirectly through an import table, are now slower. Several Windows APIs that I tested literally took 2 to 3 times the number of clock cycles to execute on the Pentium 4. This is because not only do all the internal function calls within Windows take longer, but you have to remember that Windows 2000 and Windows Millennium are compiled using C++ compilers that optimize for Pentium III and Athlon processors. So as I mentioned at the beginning, until such time as most Windows code is recompiled using as-yet-non-existent Pentium 4 optimized C++ compilers, the performance of Windows applications will be terrible on the Pentium 4 processor.
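The pattern being measured is an indirect call through a pointer loaded from a table, which is exactly how calls through a Windows import table (and C++ member function calls through a vtable) are compiled. A minimal C model (the names here are mine, for illustration):

```c
/* An import-table-style indirect call: the caller loads a function
   pointer from a table and calls through a register, rather than
   jumping to a fixed address the processor can predict cheaply. */
typedef int (*api_fn)(int);

static int api_impl(int x) { return x + 100; }   /* stands in for a real API */

static api_fn import_table[1] = { api_impl };

static int call_api(int x) {
    return import_table[0](x);   /* the indirect call the text prices at ~40 cycles */
}
```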
If it isn't clear already, the Pentium 4 is a terrible choice for PC users. It is a severely crippled processor that does not live up to its original design specifications. It makes inefficient use of available transistors and chip space. It places a higher burden on software developers to optimize code, contrary to the trends being set by AMD and Transmeta processors. It reverts to 10 year old techniques which Intel abandoned and apparently forgot why. And it just plain runs slower than existing Pentium III, Celeron, and AMD Athlon chips.
Intel needs to heavily beef up the L1 cache size, add the missing L3 cache, add more decoders, raise the transfer rate from the trace cache to the core, lower the cost of shift operations, and add additional FPU and MMX execution units. Once these changes are made, and only then, will the Pentium 4 begin to even be a threat to its own Pentium III and Pentium III Xeon processors and be a viable contender for the speed crown currently held by the AMD Athlon.
Intel, Dell, Gateway, and other computer manufacturers have intentionally misled consumers about the performance of the Pentium 4 processor and are currently involved in selling useless overpriced hardware to unsuspecting consumers. Do what I did with my two Pentium 4 machines. Send them right back to Gateway and Dell and let them know they're selling crap. A $1500 Athlon system is a far better choice of computer than a $4000 Pentium 4 system. No doubt about it and I challenge anyone from Intel, Dell, or Gateway to prove my statements wrong.
Comments? Flame mail? Got a Pentium 4 horror story to share? Email firstname.lastname@example.org
Also I'd like to point people at an excellent web site that I've read for years - http://x86.org - through which I tracked down some of the info for this review. Robert's site is updated almost daily with fresh news about Intel and also includes some excellent technical information about Intel's processors.
Some other hardware related sites and pages you might want to check out (yes, I'm repeating a few sites I already mentioned):