Author Topic: PC Architecture (Read 2690 times)

DES · « **Reply #15 on:** November 01, 2003, 09:01:56 PM »

I'd certainly be interested in reading more.

des

Wanker · « **Reply #16 on:** November 02, 2003, 08:51:28 AM »

Looks like another good candidate for a sticky thread has been created.

WTG Bloom!

zmeg · « **Reply #17 on:** November 02, 2003, 11:41:23 AM »

WTFG bloom; I can't tell you how much we all appreciate you sharing your hard-earned knowledge with all us wannabes.
I learned more from your post than any other source I've come across. Starving for more. BTW am I correct in assuming that AH is branch intensive? Will a 400MHz chipset improve performance over a 266MHz chipset if the processor is 266MHz FSB?

qts · « **Reply #18 on:** November 02, 2003, 04:35:37 PM »

Just a small point, Bloom, but I haven't spotted you stating that you work for Intel. Have you moved on? If not, it's probably worth stating this up front, for integrity's sake.

Sixpence · « **Reply #19 on:** November 02, 2003, 07:17:36 PM »

I have an AMD Barton 2500, Corsair 2700 DDR mem (2 twin 512 sticks on a dual channel mobo), Asus A7N8X Deluxe with nForce chipset.

If I read it correctly, I am not bottlenecking?

bloom25 · « **Reply #20 on:** November 02, 2003, 07:24:03 PM »

I've never worked for Intel. AKDejaVu (MiniD?) is the only person who frequents this BBS who does that I know of.

I'm working on another post as we speak.

bloom25 · « **Reply #21 on:** November 02, 2003, 09:22:19 PM »

This time around I'll describe some of the techniques newer x86 CPUs use to keep themselves as busy as possible. Unfortunately some of these concepts are very technical and it's quite difficult to describe them in a way that makes sense, but I'll try my best.

Definitely ask questions if you're confused. I've tried to arrange the information here so that it builds on itself, which should help it make more sense.

One technique used by the P4, Athlon XP/64, and even the last of the P3s to increase performance is hardware prefetch. In short, hardware prefetch improves the effectiveness of the CPU's data cache. It is a method by which the CPU attempts to guess what data it will be working with in the future. Data the hardware prefetch unit thinks will be used is retrieved from main memory and placed in the CPU's L1 or L2 (level 1 or level 2) data cache. If the hardware prefetch unit guesses correctly, the CPU has the data it needs in its data cache and does not need to wait for the data to be transferred from main memory. If the guess is incorrect, the prefetched data is eventually discarded. It's probably not too difficult to imagine that certain software applications gain more from this technique than others. Unfortunately hardware prefetch can sometimes also hurt performance slightly, as the bandwidth used by the prefetch unit reduces the bandwidth available to the rest of the CPU.
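
To make the guessing idea concrete, here is a toy Python sketch of a "next-line" prefetcher sitting in front of a small LRU cache. It is only an illustration, not how any real P4 or Athlon prefetch unit works: the 64-byte line size, the 128-line cache, and the `simulate` function are all assumptions made up for the example.

```python
from collections import OrderedDict
import random

LINE_SIZE = 64  # bytes per cache line (assumed for this toy model)


def simulate(addresses, cache_lines=128, prefetch=False):
    """Return (hit_rate, lines_fetched_from_memory) for a small LRU cache.

    With prefetch=True, every access to line N also pulls in line N+1,
    a simple guess that nearby data will be wanted next.
    """
    cache = OrderedDict()  # line number -> None, ordered oldest to newest
    hits = 0
    fetched = 0

    def load(line):
        nonlocal fetched
        if line not in cache:
            fetched += 1  # this line had to come from main memory
        cache[line] = None
        cache.move_to_end(line)
        if len(cache) > cache_lines:
            cache.popitem(last=False)  # evict the least recently used line

    for addr in addresses:
        line = addr // LINE_SIZE
        if line in cache:
            hits += 1
        load(line)
        if prefetch:
            load(line + 1)
    return hits / len(addresses), fetched


# One access per cache line, walking memory in order.
sequential = [i * LINE_SIZE for i in range(1000)]

# Accesses scattered over a large address range (fixed seed for repeatability).
rng = random.Random(0)
scattered = [rng.randrange(1_000_000) * LINE_SIZE for _ in range(1000)]

for name, pattern in (("sequential", sequential), ("scattered", scattered)):
    for use_prefetch in (False, True):
        rate, fetched = simulate(pattern, prefetch=use_prefetch)
        print(f"{name:10s} prefetch={use_prefetch!s:5s} "
              f"hit rate={rate:.3f} lines fetched={fetched}")
```

Walking memory in order, the prefetcher turns nearly every access into a hit. With scattered addresses it guesses wrong almost every time and roughly doubles the number of lines pulled from memory, which is the wasted bandwidth mentioned above.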

Since we have touched a bit on cache memory, I'll discuss it a bit as there are some significant differences between the Athlon and P4 when it comes to CPU caches. Cache memory is extremely high speed memory (many times faster than main system memory) that stores data and instructions expected to be needed by the CPU in the very near future. When the CPU attempts to fetch either an instruction or data from main memory, its caches are first checked to see if the needed bit of information is there. If it is, that is what is known as a cache "hit", basically meaning the cache contains the information requested by the CPU. If it isn't (a cache "miss"), the CPU has to wait for main memory. It's important to note that there are both instruction and data caches because the instruction cache the P4 uses is very special. I'll come back to that very soon.

You've also probably seen terms like L1, L2, and L3 cache thrown around when reading CPU posts and reviews. The "L" means "Level" and the biggest difference between L1, L2, and L3 caches is how quickly they can be accessed. The L1 cache is extremely high speed memory that can be accessed in only a few CPU clock cycles. L2 cache memory takes significantly more clock cycles to access than L1 cache, usually around 20, but is typically much larger in size. L3 cache is slower still than L2 cache at around 30 to 40 clock cycles to access. Both the Athlon XP/64 and Pentium 4 have L1 and L2 caches. (The P4 Extreme Edition also has an L3 cache.)

Having cache memory has the net effect of reducing the amount of time the CPU spends waiting for instructions and data, and thus increases performance. In general, more cache = faster performance, but this isn't always true. If both the data and instructions used by an application completely fit within the CPU's caches, having more than that amount won't gain you performance. Adding more cache memory also has significant drawbacks. On both the Athlon and P4 cache memory takes up more than 1/2 of the entire CPU die! Adding more cache memory means you must expand the die area of the CPU, which directly results in higher manufacturing costs. Thus it is important to balance the amount of cache with the costs of production. (The amount of cache memory is one of the main differences between the Athlon and Duron. Having less cache makes the Duron quite a bit cheaper to produce.)
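
A rough way to see why cache hits matter so much is to work out the average cost of a memory access. The sketch below is a simplified model, not a measurement: the hit rates, the 3-cycle L1 figure, and the 200-cycle main memory figure are assumed purely for illustration (only the roughly 20-cycle L2 figure comes from the text above), and `average_access_cycles` is a hypothetical helper.

```python
def average_access_cycles(levels, memory_cycles):
    """Average cycles per access for a simple cache hierarchy.

    levels: list of (hit_rate, latency_cycles) pairs, L1 first.
    Simplification: an access that hits at a level costs only that
    level's latency, and anything that misses every level costs
    memory_cycles.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, latency in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_cycles


# Assumed numbers: L1 hits 90% of the time at 3 cycles, L2 catches 80% of
# what L1 misses at 20 cycles, main memory costs 200 cycles.
no_cache = average_access_cycles([], memory_cycles=200)
l1_and_l2 = average_access_cycles([(0.90, 3), (0.80, 20)], memory_cycles=200)

print(f"no cache: {no_cache:.1f} cycles per access")
print(f"L1 + L2:  {l1_and_l2:.1f} cycles per access")
```

With these made-up numbers the average drops from 200 cycles to roughly 8, even though one access in fifty still goes all the way to main memory.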

Now that we've described what cache memory is we can talk about how the P4 and Athlon differ with respect to their caches. The Athlon XP using the Thoroughbred core has 256kB of L2 cache and 128kB (64kB data & 64kB instruction) of L1 cache. The Athlon XP with the Barton core has 512kB of L2 cache and 128kB of L1 cache. The Athlon 64s have 1MB of L2 cache and 128kB of L1 cache. The Pentium 4s with the original Willamette core have 256kB of L2 cache, and the Northwood core P4s have 512kB of L2 cache. (The P4 Extreme Edition also has a 2 MB L3 cache.)

Notice that I didn't list the amount of L1 cache for the P4s. This gets back to what I said above about the instruction cache in the P4 being special. The L1 data cache in the P4 is only 8 kB in size. The instruction cache in the P4 is known as a "trace cache" and is about 20kB in size. The L1 instruction cache in the P3 and Athlons stores x86 instructions, but as I'm sure most of you are aware (and I noted in an earlier post in this thread), modern CPUs break up x86 instructions into simpler instructions used within the CPU itself. A trace cache stores these simpler instructions, rather than the more complicated x86 instructions. Thus in the P4, once an x86 instruction has been decoded, the simpler instructions used to perform the x86 instruction are stored in the trace cache. (These simpler instructions are known as Macro ops by AMD and Micro ops by Intel.) This trace cache used by the P4 is a significant improvement over a conventional L1 instruction cache, as it improves the ability to schedule those instructions to keep as much of the CPU busy as possible.

There are a couple other significant differences between the Athlons and Pentium 4 when it comes to their caches. The cache on the Athlon is "exclusive", which is different from most other CPUs. In a typical processor the contents of the L1 cache are duplicated in the L2 cache, which is what is known as an "inclusive" cache. As you can imagine, the contents of the L1 cache being duplicated in the L2 cache is really a waste of valuable L2 cache space. In the Athlons, the L1 and L2 cache contents are not duplicated, which effectively gives the Athlon more cache than if it had inclusive caches. There is one other significant difference between the Athlon and P4 caches. The caches in the P4 are MUCH faster when it comes to bandwidth. Latency, or the number of CPU cycles needed for the CPU to get data from its caches, is similar between the two. The Athlon 64's L2 cache is significantly faster than that in the Athlon XP, but still only offers about 60% of the bandwidth of the L2 cache in the P4. As you can see from the above, the Athlons generally have more cache memory, but the P4 has faster cache memory. Which is better depends mainly on the application.
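
The exclusive versus inclusive point comes down to simple arithmetic, sketched here using the Athlon cache sizes quoted above. The `unique_capacity_kb` helper is hypothetical, and the inclusive figures are a what-if for comparison, not a description of any shipping chip.

```python
def unique_capacity_kb(l1_kb, l2_kb, exclusive):
    """Amount of distinct data the L1 and L2 caches can hold together.

    Assumes L2 is at least as large as L1.
    """
    if exclusive:
        return l1_kb + l2_kb  # nothing is stored twice
    return l2_kb  # L1 only holds copies of lines that are already in L2


# (core name, L1 kB, L2 kB) as listed in the post above
for name, l1, l2 in (("Thoroughbred", 128, 256), ("Barton", 128, 512)):
    as_built = unique_capacity_kb(l1, l2, exclusive=True)
    what_if = unique_capacity_kb(l1, l2, exclusive=False)
    print(f"{name}: {as_built} kB exclusive vs {what_if} kB if it were inclusive")
```

On these figures the exclusive design is worth an extra 128 kB of usable cache on either core.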

bloom25 · « **Reply #22 on:** November 02, 2003, 09:23:19 PM »

If you've read posts and reviews of the Athlon 64 you've probably noticed that one of the new additions of the Athlon 64 is an "on die memory controller." This is probably the single most significant architectural improvement of the Athlon 64 over any other x86 CPU. Just as a recap of what I discussed in the first few posts above: in a typical PC the memory controller for main system memory (DDR SDRAM for most new PCs) is located in what is known as the Northbridge. The CPU and the Northbridge communicate with each other by means of the FSB (front side bus). In the Athlon 64 the memory controller is part of the CPU itself. Why do this? To put it simply, most modern CPUs spend most of their time waiting for information from memory. Putting the memory controller on the CPU dramatically reduces the amount of time needed to get information from memory (what is known as latency). It also significantly improves bandwidth (the amount of data that can be transferred in a given amount of time). It also completely removes the FSB as a bottleneck to memory performance. Having an on die memory controller also allows the Athlon 64 to scale much better than other CPUs as the clock frequency is increased. If you want to know just how much the on die memory controller improves performance you can take a look at this page out of Aces Hardware's Athlon 64 review: http://www.aceshardware.com/read.jsp?id=60000258 . Pay close attention to the 128 bit and 256 bit memory latency numbers. Notice that they are less than 1/2 of that of the 3.2 GHz P4 and about 1/3 better than the 3200+ Athlon XP. Some other things to note are the memory bandwidth chart at the top of the page and the L2 cache bandwidth chart.
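
To put rough numbers on the FSB bottleneck, peak bandwidth is just transfers per second times bus width. The sketch below assumes 64-bit wide buses and uses the nominal marketing ratings ("333" is really 333.33), so treat the results as theoretical peaks, not measured throughput. `peak_mb_per_s` is a hypothetical helper.

```python
def peak_mb_per_s(mega_transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak bandwidth in MB/s for a bus or memory channel."""
    return mega_transfers_per_s * (bus_width_bits // 8) * channels


ddr400_single = peak_mb_per_s(400)            # one channel of DDR400 (PC3200)
ddr400_dual = peak_mb_per_s(400, channels=2)  # two channels of DDR400
fsb_333 = peak_mb_per_s(333)                  # a 333 "MHz" front side bus

# With the memory controller in the Northbridge, every byte still has to
# cross the FSB, so the slower of the two links sets the limit.
behind_fsb = min(fsb_333, ddr400_dual)

print(f"DDR400 single channel: {ddr400_single} MB/s")
print(f"DDR400 dual channel:   {ddr400_dual} MB/s")
print(f"333 FSB:               {fsb_333} MB/s")
print(f"Dual DDR400 seen through a 333 FSB: {behind_fsb} MB/s")
```

In this model a CPU on a 333 FSB cannot see more than about 2.7 GB/s no matter how fast the memory behind the Northbridge is. With the controller on the die, that cap goes away and the memory channel itself sets the limit.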

Why doesn't every x86 CPU have an on die memory controller? Probably the biggest reason is cost. Adding the memory controller to the CPU greatly increases the number of pins on the CPU, which makes for a much more expensive package. (The Athlon XP uses a 462 pin socket and the current P4 uses a 478 pin socket. Contrast this with the Athlon 64 3200+, which has a single channel DDR400 controller and uses a 754 pin socket. The Athlon 64 FX with a dual channel memory controller uses a 940 pin socket now and will use a 939 pin socket in the near future.) An on die memory controller is also more beneficial when the clockspeed difference between the CPU and memory is higher. In the last few years the clockspeeds of x86 CPUs have skyrocketed. In contrast, memory clockspeeds haven't risen nearly as quickly. This means CPUs of a few years ago wouldn't have gained as much by having an on die memory controller as those of today. Two other minor drawbacks of an on die memory controller are that the CPU itself must be updated to support newer memory types, and that onboard graphics solutions that share system memory will probably lose some performance. (Remember that onboard graphics used to be a part of the Northbridge, along with the memory controller. Now that the memory controller is part of the CPU they will have to communicate with the CPU to access memory.) If I had to make a guess, I'd say that an on die memory controller will be something you will begin to see more of in the future.
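
The clockspeed argument can be seen by converting a fixed memory latency into CPU cycles. The 100 ns latency and the clock speeds below are assumed round numbers for illustration only, and `stall_cycles` is a hypothetical helper.

```python
def stall_cycles(latency_ns, cpu_mhz):
    """CPU clock cycles that tick by while waiting latency_ns on memory."""
    return latency_ns * cpu_mhz / 1000.0


# Same assumed 100 ns memory access, seen from CPUs of different speeds.
for mhz in (500, 1000, 2000, 3000):
    cycles = stall_cycles(100, mhz)
    print(f"{mhz} MHz CPU: {cycles:.0f} cycles spent waiting")
```

The same 100 ns wait costs a 3 GHz CPU six times as many cycles as a 500 MHz one, so shaving memory latency pays off more as clocks rise.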

In my next post I'll try to cover SIMD instructions (MMX, SSE, SSE2, SSE3, and 3DNow!). I will also talk about Hyperthreading, so stay tuned!

Roscoroo · « **Reply #23 on:** November 03, 2003, 12:59:00 AM »

Bloom

good job explaining the cache in relationship to the execution of memory vs the held amounts

~~(cant wait til the next installment )~~

beet1e · « **Reply #24 on:** November 04, 2003, 07:08:06 AM »

Quote

Originally posted by bloom25
This is why a Athlon 2500+ (333 MHz FSB) runs slower with DDR 400 memory than it does when using DDR 333 memory. The Athlon architecture is very sensitive to latency

ruh-roh. My new system is on order. I based my order on recommendations I've read here - A7N8X deluxe mobo, XP2600, ATi Radeon 9800 Pro etc... but 1xPC3200 512MB DDR 400MHz memory. Are you saying I would have been better off with the PC2700 DDR 333MHz? How bad a mistake was it to get the 400MHz? Should I return it and do an exchange? Or is the difference only very slight? If it's going to make a difference of 2fps in AH, I'm not that bothered.

boxboy28 · « **Reply #25 on:** November 04, 2003, 08:26:40 AM »

Beetle that chip (XP2500) is unlocked and will run at a FSB of 400 in sync with the RAM! Stay with the XP2500 and the 3200 DDR
+the 2500 is the Barton core with the 512 L2 cache !!!!

beet1e · « **Reply #26 on:** November 04, 2003, 08:39:25 AM »

Quote

Originally posted by boxboy28
Beetle that chip (XP2500) is unlocked and will run at a FSB of 400 in sync with the RAM! Stay with the XP2500 and the 3200 DDR
+the 2500 is the Barton core with the 512 L2 cache !!!!

It got buried, but I actually ordered the XP2600, not the XP2500 - am I going to be OK?

vorticon · « **Reply #27 on:** November 04, 2003, 03:57:07 PM »

long read (and I'm a fast reader) but well worth it...thanks a lot bloom...

acetnt-2nd · « **Reply #28 on:** November 04, 2003, 05:00:11 PM »

Quote

Originally posted by beet1e
It got buried, but I actually ordered the XP2600, not the XP2500 - am I going to be OK?

you should be able to run the memory at 333MHz

Roscoroo(work) · « **Reply #29 on:** November 04, 2003, 05:18:23 PM »

you can run the mem at 333 with that mb ... and the 2600+ is an unlocked CPU also ... w/ the 512 L2 cache I don't think there's much difference between the 2500 and the 2600

the 2500 has been around longer so there's more info on it.

you should have a great running system w/ the 2600