This time around I'll describe some of the techniques newer x86 CPUs use to keep themselves as busy as possible. Unfortunately some of these concepts are very technical and it's quite difficult to describe them in a way that makes sense, but I'll try my best.
Definitely ask questions if you're confused. I've tried to arrange the information so that each part builds on the one before it.
One technique used by the P4, Athlon XP/64, and even the last of the P3s to increase performance is hardware prefetch. In short, hardware prefetch improves the effectiveness of the CPU's data cache by trying to guess what data the CPU will be working with in the near future. Data the hardware prefetch unit thinks will be used is retrieved from main memory and placed in the CPU's L1 or L2 (level 1 or level 2) data cache. If the prefetch unit guesses correctly, the CPU already has the data it needs in its cache and does not have to wait for it to be transferred from main memory. If the guess is incorrect, the prefetched data is eventually discarded. As you might imagine, some applications gain more from this technique than others. Hardware prefetch can also hurt performance slightly at times, because the bandwidth used by the prefetch unit reduces the bandwidth available to the rest of the CPU.
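
To make the guessing idea concrete, here is a rough toy model in Python. It is only a sketch under my own assumptions: a tiny fully associative cache with LRU replacement, 64-byte lines, and a "next line" prefetcher that always fetches the line after the one just used. Real prefetch units are more sophisticated than this, and the names (hit_rate, LINE_SIZE, and so on) are made up for the example.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)


def hit_rate(addresses, num_lines=8, prefetch=False):
    """Return the fraction of accesses that hit in a toy LRU cache."""
    cache = OrderedDict()  # cache line number -> None, oldest first
    hits = 0

    def touch(line):
        # Insert or refresh a line, evicting the least recently used if full.
        if line in cache:
            cache.move_to_end(line)
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)
            cache[line] = None

    for addr in addresses:
        line = addr // LINE_SIZE
        if line in cache:
            hits += 1
        touch(line)
        if prefetch:
            # The guess: the next line in memory will be needed soon.
            touch(line + 1)
    return hits / len(addresses)


# Walk through 16 kB of memory 8 bytes at a time (prefetch-friendly).
sequential = list(range(0, 64 * 256, 8))
# Jump 128 bytes each time, so the "next line" guess is always wrong.
strided = list(range(0, 64 * 256, 128))

print(hit_rate(sequential), hit_rate(sequential, prefetch=True))
print(hit_rate(strided), hit_rate(strided, prefetch=True))
```

In this model the sequential walk misses once per cache line without prefetch and only on the very first access with it. The strided walk gains nothing, and every prefetched line is fetched and then thrown away, which is the wasted bandwidth mentioned above.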
Since we have touched on cache memory, I'll discuss it a bit, as there are some significant differences between the Athlon and P4 when it comes to CPU caches. Cache memory is extremely high speed memory (many times faster than main system memory) that stores data and instructions the CPU is expected to need in the very near future. When the CPU attempts to fetch either an instruction or data from main memory, its caches are checked first to see if the needed information is there. If it is, that is known as a cache "hit", meaning the cache contains the information requested by the CPU. If it isn't, that is a cache "miss" and the CPU has to wait for main memory. It's important to note that there are both instruction and data caches, because the instruction cache the P4 uses is very special. I'll come back to that very soon.
You've also probably seen terms like L1, L2, and L3 cache thrown around when reading CPU posts and reviews. The "L" means "Level", and the biggest difference between L1, L2, and L3 caches is how quickly they can be accessed. The L1 cache is extremely high speed memory that can be accessed in only a few CPU clock cycles. L2 cache takes significantly more clock cycles to access than L1 cache, usually around 20, but is typically much larger. L3 cache is slower still, at around 30 to 40 clock cycles to access. Both the Athlon XP/64 and the Pentium 4 have L1 and L2 caches. (The P4 Extreme Edition also has an L3 cache.)
Having cache memory has the net effect of reducing the amount of time the CPU spends waiting for instructions and data, and thus increases performance. In general, more cache = faster performance, but this isn't always true. If both the data and instructions used by an application already fit completely within the CPU's caches, adding more cache won't gain you any performance. Adding more cache memory also has significant drawbacks. On both the Athlon and P4, cache memory takes up more than 1/2 of the entire CPU die! Adding more cache means you must expand the die area of the CPU, which directly results in higher manufacturing costs. Thus it is important to balance the amount of cache against the cost of production. (The amount of cache memory is one of the main differences between the Athlon and Duron. Having less cache makes the Duron quite a bit cheaper to produce.)
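
If you want to see why cache hits matter so much, here is a back-of-the-envelope calculation in Python. It is a simplified sketch, not a model of any real CPU: the 20 cycle L2 figure comes from above, but the 3 cycle L1 latency, the 200 cycle main memory latency, and the hit rates are numbers I made up for illustration, and the function name is hypothetical.

```python
def average_access_cycles(levels, memory_cycles):
    """Estimate the average cycles per memory access.

    levels is a list of (hit_rate, latency_cycles) pairs, L1 first.
    Each hit_rate applies only to accesses that missed the levels before it.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, latency in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    # Whatever missed every cache level goes to main memory.
    return total + reach * memory_cycles


# Assumed numbers: L1 = 3 cycles, 90% hits; L2 = 20 cycles, 80% of the rest.
with_caches = average_access_cycles([(0.9, 3), (0.8, 20)], memory_cycles=200)
no_caches = average_access_cycles([], memory_cycles=200)
print(with_caches, no_caches)
```

With those assumed numbers the average works out to roughly 8.3 cycles per access instead of 200. It also shows the point about diminishing returns: once the hit rate is already 100%, a bigger cache has nothing left to improve.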
Now that we've described what cache memory is, we can talk about how the P4 and Athlon differ with respect to their caches. The Athlon XP with the Thoroughbred core has 256kB of L2 cache and 128kB (64kB data & 64kB instruction) of L1 cache. The Athlon XP with the Barton core has 512kB of L2 cache and 128kB of L1 cache. The Athlon 64s have 1MB of L2 cache and 128kB of L1 cache. The Pentium 4s with the original Willamette core have 256kB of L2 cache, and the Northwood core P4s have 512kB of L2 cache. (The P4 Extreme Edition also has a 2MB L3 cache.)
Notice that I didn't list the amount of L1 cache for the P4s. This gets back to what I said above about the instruction cache in the P4 being special. The L1 data cache in the P4 is only 8kB in size, and the instruction cache is what is known as a "trace cache", about 20kB in size. The L1 instruction cache in the P3 and Athlons stores x86 instructions, but as I'm sure most of you are aware (and I noted in an earlier post in this thread), modern CPUs break up x86 instructions into simpler instructions used within the CPU itself. A trace cache stores these simpler instructions rather than the more complicated x86 instructions. Thus in the P4, once an x86 instruction has been decoded, the simpler instructions used to perform it are stored in the trace cache. (These simpler instructions are known as Macro ops by AMD and Micro ops by Intel.) The trace cache is a significant improvement over a conventional L1 instruction cache, as it improves the ability to schedule those instructions to keep as much of the CPU busy as possible.
There are a couple of other significant differences between the Athlons and the Pentium 4 when it comes to their caches. The cache on the Athlon is "exclusive", which is different from most other CPUs. In a typical processor the contents of the L1 cache are duplicated in the L2 cache, which is what is known as an "inclusive" cache. As you can imagine, duplicating the contents of the L1 cache in the L2 cache is really a waste of valuable L2 cache space. In the Athlons the L1 and L2 cache contents are not duplicated, which effectively gives the Athlon more cache than it would have with inclusive caches.
The other significant difference is that the caches in the P4 are MUCH faster when it comes to bandwidth. Latency, or the number of CPU cycles needed for the CPU to get data from its caches, is similar between the two. The Athlon 64's L2 cache is significantly faster than the one in the Athlon XP, but it still only offers about 60% of the bandwidth of the L2 cache in the P4. As you can see from the above, the Athlons generally have more cache memory, but the P4 has faster cache memory. Which is better depends mainly on the application.
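
To put numbers on the exclusive vs. inclusive point, here is a quick bit of arithmetic in Python using the Athlon XP cache sizes listed earlier. It is a simplification that assumes a strictly inclusive design (everything in L1 is also in L2, and L2 is at least as large as L1), and the function name is just something I made up.

```python
def effective_cache_kb(l1_kb, l2_kb, exclusive):
    """Approximate how much unique data the L1 and L2 caches hold together."""
    if exclusive:
        # L1 and L2 never hold the same line, so the capacities add up.
        return l1_kb + l2_kb
    # Strictly inclusive (assumes L2 >= L1): L1 is a copy of part of L2.
    return l2_kb


# Athlon XP (Barton): 128 kB L1 + 512 kB L2, exclusive.
print(effective_cache_kb(128, 512, exclusive=True))   # 640
# The same sizes if the caches were inclusive instead.
print(effective_cache_kb(128, 512, exclusive=False))  # 512
```

So with the Barton's sizes, the exclusive design holds 640kB of unique data where an inclusive design of the same size would hold 512kB.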