Author Topic: Why is the P4 bad? (part 2)  (Read 230 times)

bloom25
Why is the P4 bad? (part 2)
« on: April 28, 2001, 01:30:00 AM »
>Continued from part 1<

Now, back to the pipeline.  (Remember that from WAY up in my post?)  The pipeline is the real heart of a processor.  When a processor wants to do some work it needs to do 3 things:  get the next instruction from memory, figure out what the instruction is, and finally execute the instruction.  The formal names for these three processes are "fetch", "decode", and "execute."  The pipeline is basically a series of stages the processor follows to do these three things.
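To make fetch, decode, and execute a bit more concrete, here is a minimal sketch of a toy "processor" loop in Python.  The instruction names (LOAD, ADD) and the single accumulator register are made up for illustration and have nothing to do with the real x86 instruction set; a real CPU does these steps in hardware, not in a loop like this.

```python
# Toy machine: a "program" is a list of (opcode, operand) tuples.
# The opcodes LOAD and ADD are hypothetical, chosen only for this example.
def run(program):
    acc = 0   # single accumulator register
    pc = 0    # program counter: index of the next instruction
    while pc < len(program):
        instruction = program[pc]        # fetch: get the next instruction
        opcode, operand = instruction    # decode: figure out what it is
        if opcode == "LOAD":             # execute: actually do it
            acc = operand
        elif opcode == "ADD":
            acc += operand
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        pc += 1
    return acc


# "Add 2 to 5"
print(run([("LOAD", 5), ("ADD", 2)]))
```

This toy version finishes one instruction completely before it fetches the next.  A pipelined processor overlaps the three steps across several instructions at once.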

You are probably wondering why this is important.  Hang on, I'm getting there.  Hopefully you have made it this far without falling asleep.  At this point you are going to learn the main reason why the P4 sucks so much compared to the Athlon.  The reason is its pipeline!  (Bet you didn't see that coming.  I'll bet it surprised you just about as much as a c hog going for the HO.)  The original Pentium processor had a 5 stage pipeline.  The Pentium II and III use a 10 stage pipeline.  The Athlon and Duron use 11 stages.  The P4 has a 20 stage pipeline.  Why, you ask, is that bad?  Soon you will understand.  Remember those three things I mentioned that a processor must do: "fetch", "decode", "execute."  The first stage in the pipeline is always going to be a "fetch."  The processor will get the next program instruction from memory.  Several stages in the pipeline may be "fetch" stages.  In the middle there will be "decode" stages.  This is where the processor figures out what it is supposed to do.  At the end of the pipeline there are the "execute" stages, where the processor actually does the instruction.  (For example, add 2 to 5.)  Here's why a longer pipeline is not automatically better: each stage takes 1 clock cycle.  This means that while an Athlon can take up to 11 clock cycles to get a simple instruction from start to finish, the P4 can take up to 20!  If that were the whole story, an Athlon at 1 GHz would do about as much work as a P4 at almost 2 GHz, and a P4 at 2 GHz would be the same speed as a 1 GHz P3.  Fortunately it isn't the whole story.  A pipeline works like an assembly line: while one instruction is executing, the next ones are already being decoded and fetched behind it, so a full pipeline finishes roughly one instruction every clock no matter how long it is.  The catch is keeping it full.  Programs are full of branches ("if this, do that, otherwise do something else"), and the processor doesn't know for sure which way a branch goes until that instruction reaches the execute stages.  Rather than sit and wait, processor makers use a process called "branch prediction."  Basically what this does is guess which way the program is going to go and start fetching instructions from that path right away.  This does work in practice, but it also carries the potential for disaster.  What do you think would happen if the branch predictor guessed wrong?
Well, what happens is the processor has to throw away the work it started on the wrong guess and start all over from the beginning of the pipeline, regardless of how far along those instructions were.  The longer the pipeline, the more work gets thrown away on every misprediction.  Guess which processor has the longest pipeline?  Hmm, P4.  (If you guessed wrong, consider yourself as just getting shot down by a c202 while you were flying a n1k.)  Fortunately the branch predictors in processors have been much improved in the Athlon/Duron and P4 series processors.  Both currently "guess" correctly about 90 - 95% of the time.  One of the biggest faults of the P2 and P3s was their poor branch prediction and resulting performance hit.  The Athlon has an extremely advanced predictor unit compared to the P3.  This accounts for much of its performance improvement when compared to that processor.  The P4's branch predictor has so far been even a bit better, but considering the pipeline is almost twice as long, it had better be.
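Here is a rough back-of-the-envelope sketch of how pipeline length and branch prediction interact.  It is a simplified model, not a measurement: it assumes a full pipeline finishes one instruction per clock, and that every wrong guess wastes one clock per pipeline stage.  The branch fraction (1 in 5 instructions) and the 92% accuracy are assumed numbers picked for illustration, and the function name is made up.

```python
def cycles_per_instruction(pipeline_depth, branch_fraction, prediction_accuracy):
    """Average clock cycles per instruction in an idealised pipeline.

    Assumptions (illustrative only):
      - a full pipeline completes one instruction per cycle
      - a mispredicted branch costs `pipeline_depth` wasted cycles
        (everything in flight is thrown away and the pipeline refills)
    """
    mispredict_rate = branch_fraction * (1.0 - prediction_accuracy)
    return 1.0 + mispredict_rate * pipeline_depth


# Assumed workload: 20% of instructions are branches, 92% predicted correctly.
for name, depth in [("Athlon (11 stages)", 11), ("P4 (20 stages)", 20)]:
    cpi = cycles_per_instruction(depth, branch_fraction=0.20, prediction_accuracy=0.92)
    print(f"{name}: {cpi:.3f} cycles per instruction")
```

With the same predictor accuracy, the deeper pipeline always comes out with more cycles per instruction in this model, because each wrong guess costs it more.  Real processors have other stalls (cache misses, for one) that this sketch ignores.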

At this point I'll bet you are wondering why anyone would ever want to make the pipeline longer.  That's actually very simple: clock speed.  Most consumers treat raw clock speed alone as the measure of how fast a processor is.  If you make the pipeline longer, each stage does less work per clock cycle, making it much easier to run the processor at a higher clock speed.  If all things were ideal the pipeline would be 3 stages long, eliminating the need for the predictor altogether.  Unfortunately it is VERY hard to get the whole processor to work at gigahertz speeds without things breaking down.  If only small pieces at a time have to work at high speed, that makes it a lot easier.  It also reduces power consumption.
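The trade between pipeline length and clock speed can be sketched with some simple arithmetic.  This is only an illustration: the 10 ns of total logic delay and the 0.1 ns of per-stage overhead are invented numbers, not figures for any real chip, and the function name is hypothetical.  The idea is that the clock can only tick as fast as the slowest stage, so splitting the same work into more stages makes each stage shorter.

```python
def max_clock_mhz(total_logic_delay_ns, stages, stage_overhead_ns):
    """Highest clock speed for a pipeline, in MHz, under a simple model.

    Assumes the total work is split evenly across `stages`, and each
    stage adds a fixed overhead for passing results to the next stage.
    """
    cycle_time_ns = total_logic_delay_ns / stages + stage_overhead_ns
    return 1000.0 / cycle_time_ns


# Invented numbers: 10 ns of total work, 0.1 ns overhead per stage.
for stages in (3, 11, 20):
    mhz = max_clock_mhz(10.0, stages, 0.1)
    print(f"{stages:2d} stages: about {mhz:.0f} MHz")
```

In this model more stages always means a higher possible clock, but nothing here says the chip gets more work done per clock.  That is exactly the trade the P4 makes.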

Do you now see a problem with the P4s we currently have?  You should.  Compare for the moment an Athlon at 1.33 GHz and a P4 at, say, 1.5 GHz.  From what you've just learned above, tell me which processor is going to be the obvious winner in performance tests.  That's right, the Athlon.

Oh, but it gets even worse for Intel.  The AMD Athlon has 3 more things going for it.  One is its far superior floating point unit.  The next two are its larger instruction and data caches.  The caches basically store data and instructions in very high speed memory on the CPU itself, waiting to be used.  These are what are known as L1, or level 1, caches.  The Athlon has 64 KB of data cache and 64 KB of instruction cache.  Compare that to 16 KB of each in the P3 and only 8 KB, YES 8 KB, of data cache in the P4.  Why would Intel only put 8 KB into the P4?  The simple reason is cost.  Nothing takes up more space on the CPU die than memory caches.  The P4 die is already nearly twice as large as the Athlon's and prices were spinning out of control.  Intel was forced to make cuts somewhere, and this is one thing they cut.

Now we come to the floating point unit.  This is best explained in an example:  What is the simplest way to multiply any number by 10 or divide by 10?  That's easy, just move the decimal point to the right or left one position.  (Same as adding a zero or removing one.)  This is what a floating point unit in the CPU does: it moves the point around to do multiplication and division.  The difference is that in binary, moving the point one position is like multiplying or dividing by 2, not 10.  It's time for another example:  What is the easiest way for a computer to multiply by 3?  The answer is to move the point to the right one position (which doubles the number) and then add the original number to this.  The Athlon can do this MUCH faster than the P3 or P4.  Basically it can do multiplication and division about 30% faster than the P3 and about 33% faster than the P4.  This means that for software that basically does a lot of math, like the engineering software that I use, the Athlon just blows the doors off the P4.  Guess what, AH does a lot of math in its flight model calculations.  This gives the Athlon an additional edge over the P4 in AH and most DirectX games.
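The "move the point and add" trick is easy to try out.  Here is a minimal sketch in Python using whole numbers, where shifting the bits left by one position doubles the value.  It is only meant to show the arithmetic idea; a real floating point unit works on numbers stored as a fraction plus an exponent, and the function names here are made up.

```python
def times_two(n):
    # Shifting the bits left one position doubles a whole number,
    # just like adding a zero multiplies a decimal number by 10.
    return n << 1


def times_three(n):
    # Double the number, then add the original back: 2n + n = 3n.
    return times_two(n) + n


print(times_two(7))    # same as 7 * 2
print(times_three(7))  # same as 7 * 3
```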

Now you understand why the Athlon tends to be overall faster than the P4.  The P4 is a processor designed for one thing, clock speed.  It does this at the expense of performance per clock.  Only if Intel can release P4s running at double the clock speed of the Athlon will it actually outperform it.  This will never happen.  Intel, for all its faults, is very smart when it comes to marketing.  All they have to do is release P4s at a few hundred MHz above the competition and all but the extremely CPU savvy people will ooh and aah over it and fork over $500.

>Continue to part 3<

------------------
bloom25
-MAW-
(Formerly of the)
THUNDERBIRDS