Aces High Bulletin Board

General Forums => Hardware and Software => Topic started by: bloom25 on October 20, 2003, 04:14:27 PM

Title: PC Architecture
Post by: bloom25 on October 20, 2003, 04:14:27 PM
(This thread started out as a post in a CPU and Video card recommendation thread.  Rather than completely hijacking that thread, I think it's best if I move what I posted there to here before I expand on it.)  

This all started out as a simple comment about the Front Side Bus on a Pentium 4 vs the Athlon XP, so the first post mainly deals with that. To fully understand this subject I'll have to dig into the actual architectures of the Athlon and Pentium 4, so I'll be adding information about that as soon as I can get around to it.

(Reposted from other thread)

Here you go! I'll try to condense 4 years of college computer architecture classes down to 1000 words. I just hope that some of it makes sense when I'm done.  

The first thing I should probably do is explain what a "front side bus" is in the first place.

A "front side bus" is the link between the CPU and the rest of the system, specifically what is typically known as the "Northbridge".

Most chipsets (the chips on the motherboard itself) consist of two parts, the Northbridge and the Southbridge. The Northbridge has historically contained the memory controller (SDRAM, DDR SDRAM, Rambus, etc.) and, more recently, the controller for the AGP slot as well. The Southbridge typically controls just about everything else in the system (PS/2 ports, USB ports, LPT port, onboard sound, IDE controller, floppy controller, onboard network, etc.). The Northbridge and Southbridge are typically linked by the PCI bus, which most of the expansion cards in the PC also connect to. (There are exceptions to this: some single chip solutions now exist (SiS chipsets), and sometimes a different separate bus links the NB and SB, as is the case with the nForce chipsets, which use a Hypertransport link, and VIA chipsets, which use "Vlink".)

Why does this matter? Basically the front side bus is the critical link between the CPU and the entire system, so the faster the FSB is, the faster the CPU can communicate with everything else in the system. If the CPU wants data or instructions from system RAM, that data travels over the FSB. If the CPU needs data from the hard drive, that data travels to the Southbridge, over the PCI bus to the Northbridge, and then over the FSB to the CPU. As you would expect, the faster this link is, the faster the system will be. If this is true, why would I say that a faster FSB gives diminishing returns in system speed beyond a certain point? I'll get to that.  

(BTW for all you tech historians: There used to be a "back side bus" which linked the CPU to its Level 2 (L2) cache. The term is now obsolete, because just about every modern CPU since the Coppermine Pentium 3 core has had its L2 cache as part of the CPU itself, meaning the BSB is part of the CPU as well. If you want to get really technical, the term front side bus is no longer valid in its original context, because it is now the only bus.)

Perhaps the first thing I should cover when trying to explain why a faster FSB doesn't always give a corresponding increase in system performance is the case where the CPU needs data from the hard drive. (Which happens quite a bit when loading programs and when the data the CPU needs doesn't fit into system memory.) I'm sure all of you know that the hard drive is many orders of magnitude slower at transferring data than system memory is. The delay imposed by the data traveling over the FSB is nearly negligible compared to the amount of time it takes the hard drive to retrieve and store information. This makes the FSB speed itself very much a non-factor here.

The next case is when the CPU needs data from main memory. There are two key concepts to understand here: "Latency" and "Bandwidth".

Latency is essentially the amount of time the CPU must wait between issuing a request for information and when that information actually becomes available to the processor. This time is generally measured in nanoseconds, but it's far more useful to look at it in terms of the clock cycles the CPU executes, because the CPU is essentially wasting those clock cycles while it waits for data and/or instructions from memory. (For example, a 50 ns round trip to memory costs a 2 GHz CPU about 100 clock cycles of doing nothing.) I'll come back to this later, because it is probably the most important thing to understand.

Bandwidth is the amount of data that can be transferred in a given unit of time.

Let's look at this from a more intuitive example. Consider a highway where vehicles travel from one point to another. In this example, bandwidth is essentally the number of lanes of the highway. Latency is essentially its length. Lets say you have a contest to get the most vehicles from one end of the highway to the other. Unfortunately, only a certain number of vehicles can enter the highway per second. This start of the highway is roughly analogous to main memory in a computer. The end of the highway is the CPU itself, and compared to main memory, is far faster. As you can well imagine, if you make the highway shorter (lower the latency) you can get more vehicles to the end (data to the CPU) in the same amount of time as that of a longer highway. Given you can get enough vehicles onto the highway, having more lanes will also get more vehicles to the end of the highway. Consider this though, what happens when you have 800 lanes on your freeway, but only 400 cars can enter it at any given time? Basically, 400 lanes are wasted. (Ok, enough car talk.  I'm getting bored with it... )

Real memory in a computer cannot transmit data continuously. It takes a certain amount of time from when the CPU (or more correctly the memory controller in the Northbridge, acting on behalf of the CPU) requests data until the memory can begin sending that information. This amount of time is the memory latency. DDR SDRAM is arranged as a giant grid of rows and columns, and to read data from it, it takes a certain number of clock cycles to charge the part of the array the data is in (precharge), a certain amount of time to activate the row the data is in (RAS - row address strobe), and a certain amount of time to access the column the data is in (CAS - column address strobe, a term most people who buy memory have heard). The final factor is the command rate (the time between issuing a command to memory and when the command is executed, usually only a cycle or two). All of this together is what is collectively known as memory latency. (You see it printed on memory and on review sites as a string of 4 or 5 numbers.) The lower the latency, the less time it takes for the memory to begin transferring data to the Northbridge.

DDR memory currently runs at 100 MHz (PC1600), 133 MHz (PC2100), 166 MHz (PC2700), and 200 MHz (PC3200) as standard rates, and the latency is measured in memory clock cycles. (You probably think I'm wrong here, and that PC3200 memory runs at 400 MHz. That's not actually true, and I'm getting to that.) DDR (double data rate) memory has the capability of transferring data on both the rising edge (low to high) of the clock pulse and on the falling edge (high to low). If it could do this all the time, it would have the same bandwidth as regular SDRAM running at twice the clock speed, since regular SDRAM transfers data on only the rising (low to high) clock edge. This is why PC3200 is also known as DDR400: it is capable of transferring, at a maximum, at the same bandwidth as SDRAM running at 400 MHz. This also explains why you sometimes see DDR memory with a CAS latency of 2.5 cycles; it means the column can be accessed after 5 clock edges (rising or falling).

DDR memory can transfer data on both the rising and falling edges when it is performing a burst transfer of more than one location in memory, and most of the time it does, for a very good reason. Typically when a CPU wants data from memory, the next access will be from a location very close to that of the first. For this reason, SDRAM (and the older fast page memory) will transfer the entire contents of the memory row. This boosts performance, because if the CPU does end up needing the data in the next cell, that data has already been transferred. If it turns out the next access is not from the same row, nothing is really lost, as the CPU just discards the data it doesn't need. Note that I've hugely simplified this. This is what is known as "spatial locality" in computer architecture classes, which basically says that a CPU will, most of the time, request data from memory in a location near the last access. SDRAM and DDR SDRAM assume this and just transfer all the data near what the CPU requests. Wow, that's a lot of information to try to condense and "dumb" down, but hopefully those of you who stuck with it now better understand what memory latency is.
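
If you like seeing actual numbers, here's a quick back-of-the-envelope sketch in C (my own illustration, using made-up 3-3-3 timings rather than any real module's datasheet) of how those timing numbers turn into real waiting time:

#include <stdio.h>

int main(void) {
    double mem_clock_mhz = 200.0;              /* DDR400 runs a 200 MHz clock */
    double cycle_ns = 1000.0 / mem_clock_mhz;  /* 5 ns per memory clock */

    /* Hypothetical 3-3-3 timings: CAS, RAS-to-CAS delay, precharge */
    double cas = 3.0, trcd = 3.0, trp = 3.0;

    /* Worst case (wrong row open): precharge + activate row + access column */
    double latency_ns = (trp + trcd + cas) * cycle_ns;
    printf("First data arrives after roughly %.0f ns\n", latency_ns); /* ~45 ns */

    /* At 2.2 GHz that's roughly 99 CPU clock cycles spent waiting */
    printf("That's about %.0f cycles on a 2.2 GHz CPU\n", latency_ns * 2.2);
    return 0;
}

Lower those timing numbers and the nanoseconds (and wasted CPU cycles) drop right along with them - that's all "low latency" memory really means.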

Now, let's briefly touch on bandwidth. Individual DDR memory modules in modern computers are 64 bits wide, meaning they transfer data in 64 bit chunks on both the rising and falling edges of their data clock. This is the amount of data transferred over a single channel. For a DDR400 module, that bandwidth works out to 3.2 Gigabytes per second. If we have two independent channels (dual channel) transferring data at the same time, that's 6.4 Gigabytes per second.
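
The arithmetic behind those numbers is simple enough to show as a little C sketch (again, just my own illustration):

#include <stdio.h>

int main(void) {
    double clock_hz   = 200e6;  /* DDR400's actual clock: 200 MHz */
    double bytes_wide = 8.0;    /* 64-bit module = 8 bytes per transfer */
    double per_clock  = 2.0;    /* DDR: one transfer per clock edge */

    double one_channel = clock_hz * bytes_wide * per_clock;
    printf("Single channel: %.1f GB/sec\n", one_channel / 1e9);     /* 3.2 */
    printf("Dual channel:   %.1f GB/sec\n", 2 * one_channel / 1e9); /* 6.4 */
    return 0;
}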
Title: PC Architecture
Post by: bloom25 on October 20, 2003, 04:15:38 PM
(Too long for a single post)

Consider this: for best performance, the CPU's FSB should be capable of transferring data at the same rate at which it can, at a maximum, be transferred from main memory. For dual channel DDR400 (PC3200) memory, this is 6.4 GB/sec. The bandwidth offered by the P4 'C' type's 800 MHz equivalent FSB is also 6.4 GB/sec (64 bits wide at an effective 800 MHz). You can imagine that if the FSB were slower than this, you'd have a traffic jam with dual channel DDR400 memory when both channels are transferring at the same time, which means you'd be losing performance. (This is why a 'C' type P4 performs best with dual channel PC3200 memory.)

The opposite case is also true: if the FSB is capable of transferring more data than the RAM is capable of delivering, you aren't gaining much by having the capability to do so. This is the case on many systems. Consider a 'C' type P4 with an 800 MHz FSB, but using only PC2700 DDR333 modules. If you neglect the influence of Hyperthreading, the 'C' type P4 will perform no better than the 'B' type P4! This is also true with Athlons. Thus we have two cases where we are losing potential performance - faster memory than FSB, or a faster FSB than memory - and the best system memory performance occurs when the FSB is equal to (or faster than) the memory. If it is faster you don't gain much though, and you may actually lose performance, because the Northbridge must wait until the next clock edge to transfer data between the CPU's FSB and the memory bus. This adds latency. This is why an Athlon 2500+ (333 MHz FSB) runs slower with DDR400 memory than it does with DDR333 memory. The Athlon architecture is very sensitive to latency, more so than the P4.

I'm afraid I'm going to have to stop here for the night. It's 1:20 AM and I need to get up in 6 hours. At this point I haven't tied all the loose ends up, but I think you may begin to see where this is going. Tomorrow I'll try to post about neat little things like: Hyperthreading, integrated memory controller, hardware prefetch, cache memory influence, and if anyone actually reads and gets something from this, maybe more!
Title: PC Architecture
Post by: bloom25 on October 20, 2003, 04:19:46 PM
Since it may be a while until I get around to this tomorrow, I think I should mention this:

The 'C' type P4 has an 800 MHz FSB, with enough bandwidth to handle the amount of data transferred from two channels of DDR400 memory. As I mentioned before, SDRAM type memory transfers the entire row of data, which means that in a dual channel setup two entire rows of data will be sent for each memory request. If the CPU doesn't actually need all of that data, the benefit of transferring all of it - and thus the performance advantage of an 800 MHz FSB over a 400 MHz FSB (single channel DDR400) - is wasted. Remember that the CPU simply throws away what it doesn't need. To be technically correct, it stores all the data in its L2 cache (L3 as well, if it has one) until it runs out of room, at which point it dumps the oldest data. It also must discard portions or all of the data stored in its cache when it writes back to memory. (If a CPU needs to write to memory, the data in the cache that was transferred from the location it wants to write to is no longer correct, and is discarded. The cache can also be flushed when the CPU switches processing from one thread to another. If this sounds related to Hyperthreading, it is, and more on that tomorrow...  )
Title: CPUs
Post by: bloom25 on October 20, 2003, 04:42:10 PM
I think at this point it is best to try to explain just what a CPU does, and how the two most common CPU types - Athlons and Pentium 4s - actually go about doing their job.  I think if you can understand how each CPU type works, you will understand just how hard it is to compare the two.  I think I can really simplify this and still get the main points across, but be aware that there are a lot of special cases and exceptions to everything.

First of all, let's talk about what a CPU does.  The simplest definition would be that a CPU executes instructions and makes decisions based on the results of those instructions.  There are quite a few types of instructions that a CPU can execute:

Arithmetic instructions - These instructions are basically addition, subtraction, multiplication, division, and a few others (sine, cosine, tangent, square root, etc).  Arithmetic instructions fall into two types: integer and floating point.  Integer instructions are those that act on whole numbers, 1 + 2 for example.  Floating point instructions are those involving fractions, 1.234 + 2.954 for example.

Memory load and store instructions - These are exactly what they sound like.  These instructions read or write to memory (that can be any type of "memory" in a system - ram, hard drive, video card, other expansion cards, etc).

Branch instructions - These instructions essentially allow the CPU to make decisions.  For example, if A + B is greater than 10, execute instruction C, but if A + B is not greater than 10, execute instruction D.  It is these instructions that really make a CPU, and thus the PC, more than a simple calculator.  They give the CPU the ability to make decisions based on the results of other calculations.
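
If it helps to see what a branch looks like in practice, here's a trivial C fragment (my own example - the compiler turns the if/else into a compare followed by a conditional branch instruction):

#include <stdio.h>

int main(void) {
    int a = 7, b = 5;
    if (a + b > 10)
        printf("took the branch (instruction C)\n");
    else
        printf("fell through (instruction D)\n");
    return 0;
}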

Athlons and Pentium 4s both use the same instruction set, x86, which means they can run the same programs.  The basic steps they use to execute those instructions are the same as in every other CPU out there.  All CPUs do 4 basic things to execute an instruction:  Fetch, Decode, Execute, and Retire.  The Athlon and Pentium 4 just go about these 4 basic tasks very differently.

(I'll continue this later on today.)
Title: PC Architecture
Post by: mrblack on October 21, 2003, 01:15:19 AM
Thx darn good read.
Even for an MCSE and A+ dude like myself there is always neat stuff to learn.
Heck I forgot most of what i learned in school anyway:D
Title: PC Architecture
Post by: bloom25 on October 21, 2003, 02:13:45 AM
What do I mean when I say "fetch", "decode", "execute", and "retire"?  Actually this is pretty simple.

Fetch - Retrieve the next instruction from memory.  Pretty obvious...

Decode - Essentially, figure out what the instruction is.  
This step has become much more important in recent CPUs.   Even though Athlons and Pentium 4s both execute x86 instructions, internally they break each x86 instruction up into a bunch of smaller, simpler tasks.  I'll give you an example:  let's say you get an instruction asking you to add A to B and store the result in C.  To execute an instruction like this, the following steps must be done:  get the data in location A, get the data in location B, add A and B together, and store the result in location C.  Notice there are 4 smaller operations that must be carried out to do A + B = store in C.  I'll come back to this later, but consider for a minute what happens if you have the capability to do more than one operation at once.  In that case you could actually get the data in locations A and B at the same time, but you can't add the two together until you have them both.  A CPU capable of performing more than one task at the same time could save a step by getting both A and B at once, adding them in the next step, and then writing the result to C.  (There's a small code sketch of this example just after this list.)  This is critical to realize, because modern CPUs CAN execute more than one operation at the same time, and it is ABSOLUTELY vital for them to properly "schedule" (this is the technical term - pretty obvious what it means) these operations to make the most use of the processor's resources.  This is an area where big differences exist between the Athlon and Pentium 4.  I'll get back to this later.

Execute -  Carry out the operation.  This would be the step in which my previous example retrieved A and B and added them.  Athlons and Pentium 4s have big differences in the number of cycles it takes to execute various instructions and in the number of instructions they can execute at once.

Retire -  This isn't really obvious, and some people will group this under Execute, but I prefer it to be thought of as a separate step.  Basically this is the step where the CPU writes the result back into memory.  (Store C in my example.)
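
To tie those four steps back to the A + B example under Decode, here's that example written out as C, with the internal operations as comments (a loose illustration only - the real internal micro-ops look nothing like this):

#include <stdio.h>

int main(void) {
    int a = 3, b = 4, c;
    /* This single statement breaks into roughly four internal operations:
         1: load a   \  a CPU that can issue two loads at once does
         2: load b   /  these in the same step
         3: add a and b (needs both loads finished first)
         4: store the result in c (the "retire" step) */
    c = a + b;
    printf("%d\n", c);
    return 0;
}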

Ok, now that we understand the basic steps necessary to execute an instruction, we can begin to get a feel for just how vastly different the Athlon and Pentium 4 are in how they go about accomplishing these 4 things.  The Athlon and Pentium 4 belong to two essentially different schools of thought on how to maximize performance.  Personally, I feel that both approaches have their own unique advantages and disadvantages, so right off I should say that both approaches are equally valid ways of achieving maximum performance.

Let's introduce the concept of "pipelining".  Pipelining can be thought of as roughly the same thing as an assembly line.  Your instruction always has to go through the 4 steps (fetch, decode, execute, retire), so the most obvious number of stages to have in your pipeline is 4.  You do one of these steps every clock cycle, and if your CPU is only capable of executing one instruction at a time, it takes 4 clock cycles to execute a single instruction.  This is actually the case with most inexpensive microcontrollers, which are found in just about everything these days.  (Some of you may have heard of or played with PIC microcontrollers.  These cheap little microcontrollers take 4 clock cycles to execute each instruction.)  Unfortunately, having only 4 stages in your pipeline severely limits your maximum clockspeed, as the decode and execute stages typically take longer than the other two.  Your CPU's clock can't run any faster than the slowest pipeline stage allows.  For example, if your fetch stage takes 0.1 seconds, decode takes 1/2 second, execute takes 1 second, and retire takes 0.1 seconds, your maximum clock rate is only 1 Hz, because the execute stage is the limiting factor at 1 second per instruction.  As you might be catching on by now, if you broke execute up into more than one pipeline stage you could achieve higher clockspeeds and still get the same level of performance.
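
You can boil the "slowest stage sets the clock" rule down to a few lines of C (using the same made-up stage times as the example above):

#include <stdio.h>

int main(void) {
    /* fetch, decode, execute, retire - times in seconds, from the example */
    double stage[4] = { 0.1, 0.5, 1.0, 0.1 };
    double slowest = 0.0;
    for (int i = 0; i < 4; i++)
        if (stage[i] > slowest) slowest = stage[i];
    /* The clock period can't be shorter than the slowest stage */
    printf("Max clock rate: %.1f Hz\n", 1.0 / slowest); /* prints 1.0 Hz */
    return 0;
}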

Now let's talk about modern CPUs.  The Athlon has a 10 stage integer pipeline (the Athlon 64's is 12 stages) and the Pentium 4 has a 20 stage integer pipeline.  This means that the 4 main tasks I detailed above are broken up into smaller tasks.  Notice just how much longer the pipeline is in the Pentium 4 than in the Athlon.

Now things are going to start to get more technical, and I'll do my best to try to keep things as basic as possible.  (Please do post if you have questions; I'm sure others will be wondering the same things you are.)

Assume for a moment that the Athlon can only execute one instruction at a time and every stage in the pipeline is actually doing something.  This means that every clock cycle a finished instruction completes stage 10 of the pipeline (ignoring the very first 9 clock cycles).  For the P4, finishing stage 20 completes the instruction.  Now, we all know that AMD uses a rating system, and the XP 3200+ runs at a true clockspeed of 2.2 GHz.  The current top end P4 runs at 3.2 GHz.  If these two CPUs could only execute 1 instruction at a time and every stage in their pipelines was busy, we could conclude that the Pentium 4 is completing 3.2 billion instructions per second and the Athlon only 2.2 billion instructions per second.  Since having a longer pipeline allows a CPU to run at a higher clockspeed, with our assumptions in place, the Pentium 4 walks all over the Athlon in performance.  We can also conclude something else from this simplest of examples:  each instruction in the Pentium 4 takes 20 cycles at 3.2 GHz to complete, and each instruction in the Athlon takes 10 cycles at 2.2 GHz.  The interesting thing to note here is that from start to finish, the Athlon completes an individual instruction in less total time (10 / 2.2 GHz is about 4.5 ns, versus 20 / 3.2 GHz, about 6.3 ns).
Title: PC Architecture
Post by: bloom25 on October 21, 2003, 02:15:11 AM
Now let's throw two gigantic monkey wrenches into the equation.  1.  Real CPUs execute more than one instruction at once.
2.  Not every stage in the pipeline can be working every clock cycle.

Number 1 might seem simple enough, but you might be asking why on number 2.  Consider this: since a CPU has branch type instructions, which depend on the results of previous instructions, it must know the final result of an instruction before executing a branch that depends on it.  Put another way, consider the A + B, store in C example above.  Let's say the next instruction is:  if C is greater than 100, subtract 1 from B; if C is less than 100, add 1 to B.  If you don't know the result of A + B, you don't know whether you should add or subtract 1 from B in the next instruction.  This is bad.  (Really bad if your pipeline is long.)  You can't begin working on the branch instruction until the A + B instruction is done.  Since you can't do the branch instruction, you also don't know whether the instruction after that is to add 1 to B or to subtract 1 from B.  This means the A + B instruction must go through all 10 or 20 stages of your pipeline before the branch instruction can even start, and you need to wait another 10 or 20 cycles to start the instruction after that.  As you can imagine, a branch instruction has the potential to hurt the P4 a LOT worse than the Athlon.  In this (very simple) example the P4 will waste almost 40 clock cycles with nothing to do, waiting for the results of other instructions; the Athlon would only waste about 20 cycles.  The technical term for a pipeline stage with nothing to do is a pipeline stall, or bubble.  In this scenario, with these assumptions, the Athlon will be faster.  If only it were this simple though.  

One of the main jobs of CPU designers is to come up with ways to keep the CPU as busy as possible.  One method all modern CPUs since the Pentium Pro (which later became the P2 and P3) have employed is "branch prediction."  The idea here is actually really simple and very smart:  make an educated guess at the result of the branch instruction and act accordingly.  If you assume A + B will be greater than 100, you can assume you will be subtracting 1 from B.  Rather than have stages in your pipeline doing nothing, if you can track and execute more than one instruction at once, you can just assume you will be subtracting 1 and check whether your assumption was true once A + B completes.  If you guessed correctly, you keep the predicted subtract-1 instruction's result.  The advantage is that the predicted instruction can be nearly finished in the pipeline by the time the A + B result is finally known, which means that if you guess correctly you haven't wasted any clock cycles waiting for A + B to finish executing.  If you guess wrong, you just discard the predicted result and execute the correct instruction.  Basically, with branch prediction you have a lot to gain and nothing to lose.

As you can imagine, the Pentium 4 devotes considerable resources to the prediction of branch instructions.  (SSE2 even adds instructions which tell the P4 that a branch is "strongly taken", "weakly taken", "weakly not-taken", or "strongly not-taken".)  Both the Athlon and Pentium 4 employ very advanced branch prediction schemes that track the history of similar branches and guess whether a branch will be taken or not.  Back in the days of the P2 you basically just assumed a branch would be taken and acted accordingly.  The Pentium 4 lives or dies by its branch prediction unit's success in correctly guessing which instruction to execute next.  The Pentium 4 and Athlon can typically predict branches with well over 90% accuracy.  Even so, you can probably imagine that extremely branch intensive code will execute faster on an Athlon than on a P4.
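
The history-tracking idea is easy to sketch with the classic "2-bit saturating counter" from the textbooks (this is NOT the actual P4 or Athlon predictor - those are far more elaborate - just the simplest scheme that remembers history):

#include <stdio.h>

/* Counter values 0,1 predict not-taken; 2,3 predict taken. */
static int counter = 2; /* start at "weakly taken" */

int predict(void) { return counter >= 2; }

void update(int taken) {
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    /* A typical loop branch: taken 9 times, then not taken once at exit. */
    int outcome[10] = { 1,1,1,1,1,1,1,1,1,0 };
    int correct = 0;
    for (int i = 0; i < 10; i++) {
        if (predict() == outcome[i]) correct++;
        update(outcome[i]);
    }
    printf("Predicted %d of 10 branches correctly\n", correct); /* 9 of 10 */
    return 0;
}

Notice that even this toy scheme gets a loop branch right 90% of the time, which gives you some idea of why the real predictors, with much deeper history, do so well.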

Moving on to something else I touched on above when I said that the Athlon and P4 can execute more than one instruction at once:  in actuality the Athlon can track and execute 6 (!!!) instructions at once.  Specifically, 3 of these can be integer or floating point instructions and 3 can be memory read/write [address] instructions.  The Pentium 4 can only execute 2 integer or floating point instructions at the same time, plus 2 address instructions.  This might seem like a huge advantage for the Athlon, and it is definitely a strong point of the architecture, but unfortunately it is quite rare that 6 instructions can actually be executed at the same time, for many different reasons.  You might remember that I talked about both floating point and integer operations.  (Floating point numbers are those like 1.023, and integers are obviously just that, whole numbers like 1.)  There are quite a few x86 instructions that the CPU internally executes as a mix of both integer and floating point operations.  The current Athlon and Athlon XP can process either 3 FP or 3 integer operations at the same time, but not both.  This means that if an x86 instruction involved 1 integer add and 2 floating point operations, the integer operation would have to wait for the next clock cycle to begin.  The Athlon 64 does not have this limitation.

Since we are on the subject of floating point, let's go ahead and talk some more about it.  In a CPU, floating point operations are executed by a special unit, the FPU (floating point unit).  Floating point operations, besides mathematical operations involving a decimal point, include x87 (the standard FP operations every modern CPU can execute) and special instruction sets like MMX, SSE, SSE2, and 3dnow!.  The FPU has 3 main components:  one unit handles floating point additions and subtractions, another handles multiplication and division, and the final unit handles FP memory operations (load/store).

The FPUs of the Athlon and the P3/P4 have some major differences.  The Athlon's FPU is what is known as "fully pipelined."  Basically this means that the add/subtract, multiply/divide, and load/store units are separate from each other and can all work at the same time.  The FPU in the P3 and P4 is not fully pipelined:  the multiply/divide unit must make use of the add/subtract unit to execute multiply and divide instructions.  A multiply instruction can (and almost always does, unless the number is being multiplied or divided by a power of 2) take a lot longer to execute than an addition or subtraction.  In the Athlon, the multiply/divide unit can be busy processing a multiply while the add/subtract unit is busy processing add or subtract instructions.  Like the integer pipelines described above (10 stages in the Athlon, 20 in the P4), the FPU itself also has several stages.  The Athlon has 15 stages in the multiply/divide unit.  Unfortunately I don't know the exact number in the P4, but I do know there are more stages than that.  Since the Athlon can do adds and subtracts while working on a multiply, it has the capability to schedule up to 32 floating point instructions to maximize this capability.  (In case you were wondering, trig instructions and some of the others are often broken up into simpler instructions involving all three units, or the result is retrieved from a table.)  When it comes to raw x87 FP performance, the Athlon can literally run circles around the P3 and P4.

Making use of the Pentium 4's special SSE2 instructions can make up for this performance gap, however.  SSE and SSE2 are special instruction sets introduced with the P3 and P4 respectively.  These are what's called SIMD (single instruction/multiple data) type instructions.  Basically they can greatly speed up the execution of code that performs the same basic operation on many different bits of data, essentially cutting down on the number of instructions needed.  (Video encoding applications are probably the best candidate for this.)  The Athlon XP can make use of SSE instructions, but cannot execute SSE2 instructions.  This lets the P4 catch up to the Athlon, and in many cases surpass it, when code is specially optimized to make use of SSE2 instructions.  One of the key improvements in the Athlon 64 is that it can now execute SSE2 instructions.  Unfortunately for AMD, since SSE2 was designed for the P4, the Athlon 64 doesn't gain as much percentage-wise from code that uses it as the P4 does.  (You might wonder why Intel didn't make the FPU in the P4 fully pipelined.  The reason was primarily cost savings:  sharing hardware between the add/subtract and multiply/divide units saves a lot of space on the die.)
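
To give you a feel for what SIMD means in practice, here's a minimal SSE2 fragment in C (my own sketch; it needs an SSE2-capable compiler and CPU to build and run).  One addpd instruction adds two pairs of doubles at once - single instruction, multiple data:

#include <stdio.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

int main(void) {
    double a[2] = { 1.0, 2.0 };
    double b[2] = { 10.0, 20.0 };
    double c[2];

    __m128d va = _mm_loadu_pd(a);          /* load both elements of a */
    __m128d vb = _mm_loadu_pd(b);          /* load both elements of b */
    _mm_storeu_pd(c, _mm_add_pd(va, vb));  /* one instruction, two adds */

    printf("%.1f %.1f\n", c[0], c[1]);     /* prints 11.0 22.0 */
    return 0;
}

Scale that up to thousands of pixels or audio samples per frame and you can see where the video encoding speedups come from.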

I wish I could think of a way to explain all of the above more clearly, so please do ask questions.  In general, the following is true:  the Athlon gets more work done per clock (a higher IPC, or instructions per clock, ratio) than the P4, while the P4 runs at a higher clockspeed.  Probably the simplest analogy I can come up with is that of auto engines.  The Athlon is a big V8 running at 3000 RPM and the P4 is a 4 cylinder running at 6000 RPM.  They can both put out the same amount of horsepower, but the 4 cylinder has to rev higher to do it.
Title: PC Architecture
Post by: bloom25 on October 21, 2003, 02:30:53 AM
(Teaser post, besides I've got 6 hours to wait while 1.9901 downloads at 28.8... :( )

I think I'll leave you guys with this, just to relate everything back to the FSB posts at the beginning of this monster of a thread.  Do you think the Athlon would be more sensitive to latency (the time between requesting and getting data, or requesting a write and writing the data) or to bandwidth?  Remember that it can execute 6 instructions at the same time, versus the P4's 4.  It also has a shorter pipeline, meaning it takes fewer clock cycles to finish an instruction.  I'm sure that if you think about it, you'll realize that latency is very important to both the P4 and the Athlon, but the Athlon really needs low latencies for top performance.  You can't get much work done if you can't get all those multiple instructions started, or written back to memory when finished, as quickly as possible.  Is it any wonder that AMD removed the memory controller from the Northbridge on the Athlon 64 and placed it on the CPU?  This drastically reduces latency, keeping as many of those multiple execution units busy as much of the time as possible.

On the plus side, now that we've discussed how the Athlon and P4 go about executing instructions, we can talk a little bit about some of the neat tricks that modern CPUs use to keep all those functional units and pipeline stages busy as much as possible. :)  I'll start covering that tomorrow.  I might also have time to talk about the tradeoffs of one design philosophy over the other.
Title: PC Architecture
Post by: boxboy28 on October 21, 2003, 03:22:33 PM
Bloom you are my Hero to several  excellent posts!

on another note, is LAZ running an AMD or Intel?

"Probably the simplest analogy I can come up with is that of auto engines. The Athlon is a big V8 running at 3000 RPM and the P4 is a 4 cylinder running at 6000 RPM. They can both put out the same amount of horsepower, but the 4 cylinder has to rev higher to do it."  

LOL well im an AMD fanboy so i had too!
:aok
Title: PC Architecture
Post by: Thorns on October 21, 2003, 09:26:35 PM
Thanks Bloom, good stuff.  Now why doesn't someone buy Win98 from Microbloat, and keep improving it?

Thorns
Title: PC Architecture
Post by: Flacke on November 01, 2003, 02:45:04 PM
Wonderful post Bloom, lots of work for you but a lot of learning for me and others. Thanks a lot.:aok
Title: PC Architecture
Post by: bloom25 on November 01, 2003, 04:38:27 PM
I was thinking about adding a little more about things like hyperthreading, SIMD instructions, hardware prefetch, etc.  Are any of you interested in reading it?
Title: PC Architecture
Post by: Mini D on November 01, 2003, 04:48:56 PM
The gauntlet has been thrown down.  Out-geek this guy skuzzy.

MiniD
Title: PC Architecture
Post by: Roscoroo on November 01, 2003, 05:26:11 PM
Go ahead Bloom .... I'm reading :aok
Title: PC Architecture
Post by: bloom25 on November 01, 2003, 06:36:32 PM
He can try MiniD, but if he does I'll be forced to post pictures of my alarm clock.  (Actually to get the full effect I'd need to make an AVI.)

Only I would spend almost $100 designing and building an alarm clock with the following features:

1.  Contains a 90 second digital voice IC with 22 Homer Simpson quotes I recorded.  The alarm is Homer screaming and then yelling "DOH!" until I turn off the alarm.   It also randomly plays about 20 or so different "Homerisms" plus other things when the alarm time is programmed.  (It can actually be programmed to say anything I want.)

2.  It can turn on my computer in the morning when the alarm goes off.

3.  Yellow backlight LCD display.

4.  Accurate to around 10 seconds a month or so.

5.  Nobody else on the planet has one. ;)

It only took me 2 days (literally 20+ hours) to solder it all together, not counting the time it took me to write 1000+ lines of assembly.
Title: PC Architecture
Post by: DES on November 01, 2003, 09:01:56 PM
I'd certainly be interested in reading more.

des
Title: PC Architecture
Post by: Wanker on November 02, 2003, 08:51:28 AM
Looks like another good candidate for a sticky thread has been created.

WTG Bloom!
Title: PC Architecture
Post by: zmeg on November 02, 2003, 11:41:23 AM
WTFG bloom; I can't tell you how much we all appreciate you sharing your hard earned knowledge with all us wannabes,
 I learned more from your post than any other source I've come across. Starving for more. BTW am I correct in assuming that AH is branch intensive? Will a 400 MHz chipset improve performance over a 266 MHz chipset if the processor has a 266 MHz FSB?
Title: PC Architecture
Post by: qts on November 02, 2003, 04:35:37 PM
Just a small point, Bloom, but I haven't spotted you stating that you work for Intel. Have you moved on? If not, it's probably worth stating this up front, for integrity's sake.
Title: PC Architecture
Post by: Sixpence on November 02, 2003, 07:17:36 PM
I have an AMD barton 2500, corsair 2700 ddr mem( 2 twin 512 sticks on a dual channel mobo) asus a7n8x deluxe with nforce chipset.

If I read it correctly, I am not bottlenecking?
Title: PC Architecture
Post by: bloom25 on November 02, 2003, 07:24:03 PM
I've never worked for Intel.  AKDejaVu (MiniD?) is the only person who frequents this BBS who does that I know of.

I'm working on another post as we speak.
Title: PC Architecture
Post by: bloom25 on November 02, 2003, 09:22:19 PM
This time around I'll describe some of the techniques newer x86 CPUs use to keep themselves as busy as possible.  Unfortunately some of these concepts are very technical and it's quite difficult to describe them in a way that makes sense, but I'll try my best. ;)  Definitely ask questions if you're confused.  I've tried to arrange the information here so it kind of builds on itself, to help it make more sense.

One technique used by the P4, Athlon XP/64, and even the last of the P3s to increase performance is hardware prefetch.  In short, hardware prefetch improves the effectiveness of the CPU's data cache.  Hardware prefetch is a method by which the CPU attempts to guess what data it will be working with in the future.  Data the hardware prefetch unit thinks will be used is retrieved from main memory and placed in the CPU's L1 or L2 (level 1 or level 2) data cache.  If the hardware prefetch unit guesses correctly, the CPU has the data it needs in its data cache and does not need to wait for it to be transferred from main memory.  If the guess is incorrect, the prefetched data is eventually discarded.  It's probably not too difficult to imagine that certain software applications gain more from this technique than others.  Unfortunately, hardware prefetch can sometimes also hurt performance slightly, as the bandwidth used by the prefetch unit reduces the bandwidth available to the rest of the CPU.
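
Hardware prefetch needs no help from the programmer - the CPU watches the access pattern on its own - but compilers expose a software cousin that shows the same idea.  Here's a GCC-specific sketch (my own illustration; __builtin_prefetch is a GCC extension, not a standard C function):

#include <stdio.h>

int main(void) {
    static double data[100000];
    double sum = 0.0;
    for (long i = 0; i < 100000; i++) {
        /* Ask for data 16 iterations ahead, so it's (hopefully) already
           sitting in cache by the time the loop gets there. */
        if (i + 16 < 100000)
            __builtin_prefetch(&data[i + 16]);
        sum += data[i];
    }
    printf("%f\n", sum);
    return 0;
}

The hardware prefetch unit does essentially this automatically by noticing the nice sequential access pattern - and just like the hint above, its guesses cost memory bandwidth whether or not they turn out to be right.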

Since we have touched a bit on cache memory, I'll discuss it a bit more, as there are some significant differences between the Athlon and P4 when it comes to CPU caches.  Cache memory is extremely high speed memory (many times faster than main system memory) that stores data and instructions expected to be needed by the CPU in the very near future.  When the CPU attempts to fetch either an instruction or data from main memory, its caches are first checked to see if the needed bit of information is there.  If it is, that is what is known as a cache "hit", basically meaning the cache contains the information requested by the CPU.  It's important to note that there are both instruction and data caches, because the instruction cache the P4 uses is very special.  I'll come back to that very soon.

You've also probably seen terms like L1, L2, and L3 cache thrown around when reading CPU posts and reviews.  The "L" means "Level", and the biggest difference between L1, L2, and L3 caches is how quickly they can be accessed.  The L1 cache is extremely high speed memory that can be accessed in only a few CPU clock cycles.  L2 cache takes significantly more clock cycles to access than L1 cache, usually around 20, but is typically much larger in size.  L3 cache is slower still, at around 30 to 40 clock cycles to access.  Both the Athlon XP/64 and Pentium 4 have L1 and L2 caches.  (The P4 Extreme Edition also has an L3 cache.)

Having cache memory has the net effect of reducing the amount of time the CPU spends waiting for instructions and data, and thus increases performance.  In general, more cache = faster performance, but this isn't always true.  If both the data and instructions used by an application completely fit within the CPU's caches, having more than that amount won't gain you performance.  Adding more cache memory also has significant drawbacks.  On both the Athlon and P4, cache memory takes up more than 1/2 of the entire CPU die!  Adding more cache memory means you must expand the die area of the CPU, which directly results in higher manufacturing costs.  Thus it is important to balance the amount of cache against the cost of production.  (The amount of cache memory is one of the main differences between the Athlon and the Duron.  Having less cache makes the Duron quite a bit cheaper to produce.)    
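
If it helps to see the hit/miss check as code, here's a toy direct-mapped cache in C (purely my own simplification - real L1/L2 caches are set-associative and implemented in hardware, but the lookup logic has the same shape):

#include <stdio.h>

#define LINES      64   /* toy cache: 64 lines... */
#define LINE_BYTES 64   /* ...of 64 bytes each = 4kB total */

static unsigned long tag[LINES];
static int valid[LINES];

int lookup(unsigned long addr) {
    unsigned long block = addr / LINE_BYTES; /* which 64-byte block of memory */
    unsigned long index = block % LINES;     /* which cache line it maps to */
    unsigned long t     = block / LINES;     /* tag identifying the block */
    if (valid[index] && tag[index] == t)
        return 1;        /* cache hit: the data is already here */
    valid[index] = 1;    /* miss: fetch from memory and fill the line */
    tag[index]   = t;
    return 0;
}

int main(void) {
    printf("%d\n", lookup(0x1234)); /* 0: first touch always misses */
    printf("%d\n", lookup(0x1238)); /* 1: same 64-byte line, so it hits */
    return 0;
}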

Now that we've described what cache memory is, we can talk about how the P4 and Athlon differ with respect to their caches.  The Athlon XP using the Thoroughbred core has 256kB of L2 cache and 128kB (64kB data & 64kB instruction) of L1 cache.  The Athlon XP with the Barton core has 512kB of L2 cache and 128kB of L1 cache.  The Athlon 64s have 1MB of L2 cache and 128kB of L1 cache.  The Pentium 4s with the original Willamette core have 256kB of L2 cache, and the Northwood core P4s have 512kB of L2 cache.  (The P4 Extreme Edition also has a 2 MB L3 cache.)

Notice that I didn't list the amount of L1 cache for the P4s.  This gets back to what I said above about the instruction cache in the P4 being special.  The L1 data cache in the P4 is only 8 kB in size, but the instruction cache is known as a "trace cache" and is about 20kB in size.  The L1 instruction cache in the P3 and the Athlons stores x86 instructions, but as I'm sure most of you are aware (and I noted in an earlier post in this thread), modern CPUs break x86 instructions up into simpler instructions used within the CPU itself.  A trace cache stores these simpler instructions rather than the more complicated x86 instructions.  Thus in the P4, once an x86 instruction has been decoded, the simpler instructions used to perform it are stored in the trace cache.  (These simpler instructions are known as macro-ops by AMD and micro-ops by Intel.)  The trace cache used by the P4 is a significant improvement over a conventional L1 instruction cache, as it improves the ability to schedule those instructions to keep as much of the CPU busy as possible.

There are a couple of other significant differences between the Athlons and the Pentium 4 when it comes to their caches.  The cache on the Athlon is "exclusive", which is different from most other CPUs.  In a typical processor the contents of the L1 cache are duplicated in the L2 cache, which is what is known as an "inclusive" cache.  As you can imagine, duplicating the contents of the L1 cache in the L2 cache is really a waste of valuable L2 cache space.  In the Athlons, the L1 and L2 cache contents are not duplicated, which effectively gives the Athlon more cache than if it had inclusive caches.  The other significant difference is that the caches in the P4 are MUCH faster when it comes to bandwidth.  Latency, the number of CPU cycles needed for the CPU to get data from its caches, is similar between the two.  The Athlon 64's L2 cache is significantly faster than the Athlon XP's, but still only offers about 60% of the bandwidth of the P4's L2 cache.  As you can see from the above, the Athlons generally have more cache memory, but the P4 has faster cache memory.  Which is better depends mainly on the application.
Title: PC Architecture
Post by: bloom25 on November 02, 2003, 09:23:19 PM
If you've read posts and reviews of the Athlon 64, you've probably noticed that one of its new additions is an "on die memory controller."  This is probably the single most significant architectural improvement of the Athlon 64 over any other x86 CPU.  Just as a recap of what I discussed in the first few posts above:  in a typical PC the memory controller for main system memory (DDR SDRAM for most new PCs) is located in what is known as the Northbridge, and the CPU and the Northbridge communicate with each other by means of the FSB (front side bus).  In the Athlon 64, the memory controller is part of the CPU itself.  Why do this?  To put it simply, most modern CPUs spend most of their time waiting for information from memory.  Putting the memory controller on the CPU dramatically reduces the amount of time needed to get information from memory (what is known as latency).  It also improves bandwidth significantly (the amount of data that can be transferred in a given amount of time), and it completely removes the FSB as a bottleneck to memory performance.  Having an on die memory controller also allows the Athlon 64 to scale much better than other CPUs as the clock frequency is increased.  If you want to know just how much the on die memory controller improves performance, take a look at this page from Aces Hardware's Athlon 64 review:  http://www.aceshardware.com/read.jsp?id=60000258 .  Pay close attention to the 128 bit and 256 bit memory latency numbers.  Notice that they are less than 1/2 of those of the 3.2 GHz P4 and about 1/3 better than the 3200+ Athlon XP's.  Some other things to note are the memory bandwidth chart at the top of the page and the L2 cache bandwidth chart.

Why doesn't every x86 CPU have an on die memory controller?  Probably the biggest reason is cost.  Adding the memory controller to the CPU greatly increases the number of pins on the CPU, which makes for a much more expensive package.  (The Athlon XP uses a 462 pin socket and the current P4 uses a 478 pin socket.  Contrast this with the Athlon 64 3200+, which has a single channel DDR400 controller and uses a 754 pin socket.  The Athlon 64 FX, with a dual channel memory controller, uses a 940 pin socket now and will use a 939 pin socket in the near future.)  An on die memory controller is also more beneficial when the clockspeed difference between the CPU and memory is higher.  In the last few years the clockspeeds of x86 CPUs have skyrocketed, while memory clockspeeds haven't risen nearly as quickly.  This means CPUs of a few years ago wouldn't have gained as much from an on die memory controller as those of today.  Two other minor drawbacks of an on die memory controller are that the CPU itself must be updated to support newer memory types, and that onboard graphics solutions which share system memory will probably lose some performance as well.  (Remember that onboard graphics used to be part of the Northbridge, along with the memory controller.  Now that the memory controller is part of the CPU, they have to communicate with the CPU to access memory.)  If I had to make a guess, I'd say an on die memory controller is something you will begin to see more of in the future.

In my next post I'll try to cover SIMD instructions (MMX, SSE, SSE2, SSE3, and 3dnow!).  I will also talk about Hyperthreading, so stay tuned! :)
Title: PC Architecture
Post by: Roscoroo on November 03, 2003, 12:59:00 AM
:aok   Bloom  :aok     good job explaining the cache in relationship to the execution of memory vs the held amounts


(cant wait til the next installment )
Title: PC Architecture
Post by: beet1e on November 04, 2003, 07:08:06 AM
Quote
Originally posted by bloom25
This is why an Athlon 2500+ (333 MHz FSB) runs slower with DDR400 memory than it does with DDR333 memory. The Athlon architecture is very sensitive to latency
ruh-roh.  My new system is on order. I based my order on recommendations I've read here - A7N8X deluxe mobo, XP2600, ATi Radeon 9800 Pro etc... but 1xPC3200 512MB DDR 400MHz memory. Are you saying I would have been better off with the PC2700 DDR 333MHz? How bad a mistake was it to get the 400MHz? Should I return it and do an exchange? Or is the difference only very slight? If it's going to make a difference of 2fps in AH, I'm not that bothered.
Title: PC Architecture
Post by: boxboy28 on November 04, 2003, 08:26:40 AM
Beetle that chip(XP2500) is unlocked and will run at a FSB of 400 in sync with the ram! stay with the XP2500 and the 3200DDR
+the 2500 is tha Barton core with the 512 L2 cache !!!!
Title: PC Architecture
Post by: beet1e on November 04, 2003, 08:39:25 AM
Quote
Originally posted by boxboy28
Beetle that chip(XP2500) is unlocked and will run at a FSB of 400 in sync with the ram! stay with the XP2500 and the 3200DDR
+the 2500 is tha Barton core with the 512 L2 cache !!!!
It got buried, but I actually ordered the XP2600, not the XP2500 - am I going to be OK?
Title: PC Architecture
Post by: vorticon on November 04, 2003, 03:57:07 PM
long read (and im a fast reader) but well worth it...thanks a lot bloom...
Title: PC Architecture
Post by: acetnt-2nd on November 04, 2003, 05:00:11 PM
Quote
Originally posted by beet1e
It got buried, but I actually ordered the XP2600, not the XP2500 - am I going to be OK?


you should be able to run the memory at 333 MHz
Title: PC Architecture
Post by: Roscoroo(work) on November 04, 2003, 05:18:23 PM
you can run the mem at 333 with that mb ... and the 2600+ is an unlocked cpu also ... w/ the 512 L2 cache I don't think there's much difference between the 2500 and the 2600

the 2500 has been around longer so there's more info on it.

you should have a great running system w/ the 2600
Title: PC Architecture
Post by: bloom25 on November 04, 2003, 06:10:28 PM
Just run the memory at 333 MHz and you'll be fine.  With the DDR400 memory you can try overclocking the FSB and still run the memory within specs.  You could also try changing memory timings with the memory running at 333 MHz to something more aggressive.  With the DDR400 memory you can also reuse the memory if you ever move to a 3200+ or Athlon 64.
Title: PC Architecture
Post by: bloom25 on November 04, 2003, 07:54:34 PM
This time around I'll talk about threading and "Hyperthreading."  First of all, what is a thread?  It's probably best to answer that by first talking about a "process."  A process is, in general, a program.  Each process can be made up of a series of separate tasks, and each of those tasks is known as a thread.  So basically a thread is a task, and each running process can involve one to many threads.  It is the job of the operating system to divide the resources of the processor among those threads (though each thread can have a different priority level).  Each thread is given a slice of time to work, then processing on that thread is stopped and another thread is given a time slice.  This is generally done in one of two ways, but I'll get back to that in a second.

It is important to realize that a typical processor can only process one thread at a time.  This means that even though it appears a computer is running several programs at once, the CPU is actually only running one thread at any given time, and the operating system is switching between threads to make it seem like multiple programs (processes/threads) are running at the same time.  This switching of threads is usually called a "context switch."  When the CPU switches from processing one thread to another, it must first save the intermediate results of what it was working on to memory before it can begin processing the second thread.  Depending on the CPU, a context switch can be quite painful to overall system performance:  the CPU basically has to stop what it is doing and save its intermediate results to memory (the data being processed, along with processor status flags and the contents of several key registers within the CPU itself), so there is a substantial performance hit.  It's worth mentioning that a processor with a longer pipeline is, in general, going to take a bigger performance hit from a context switch than one with a shorter pipeline.  That's because most of those pipeline stages sit idle while the CPU saves results to memory and prepares to execute the next thread.

As I said above, the operating system can handle this in two ways:  1.  It can depend on each process to share nicely with the other programs that are running.  The operating system still divides up processor time, but does not strictly enforce the switching between threads; it simply requests that the currently running thread stop executing so another thread can be processed.  2.  The operating system itself can set how long each thread has to be processed, and the operating system switches between threads.  Each thread has no idea how many other threads are running, and to each thread it appears that it has the full resources of the CPU when it is running.  Modern operating systems generally use the second approach, and for a very good reason:  under approach one, a single badly behaved thread can make the whole system seem to hang.  Macintosh OSes 9 and below used method 1.  (Any surprise then why they seemed to hang when a program crashed or a driver misbehaved... ;) )  Unfortunately, approach 2 does sacrifice a bit of performance, as the overhead of the operating system enforcing the switching of threads results in a bit of a performance hit.
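
To make the process/thread distinction concrete, here's a minimal POSIX threads sketch in C (my own example; Windows uses a different API, but the idea is identical).  One process, two threads; the operating system hands out the time slices and context switches between them, and neither thread gets any say in when:

#include <stdio.h>
#include <pthread.h>

void *work(void *name) {
    for (int i = 0; i < 3; i++)
        printf("%s: step %d\n", (const char *)name, i);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, "thread A");
    pthread_create(&t2, NULL, work, "thread B");
    pthread_join(t1, NULL);  /* wait for both threads to finish */
    pthread_join(t2, NULL);
    return 0;
}

Run it a few times and the two threads' output can interleave differently each run - that's the OS scheduler deciding who gets the CPU and when.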

Ok, now that we've got an idea of threading and how a CPU switches between threads, let's quickly get back to something I briefly touched on in a previous post.  Modern CPUs can issue and execute multiple instructions at once.  If one instruction does not depend on another instruction, it is possible to execute them at the same time.  Unfortunately, if one instruction does depend on the results of another, processing cannot complete on that instruction until the results of the instruction it depends on are known.  It's probably not hard to imagine that instructions in one thread are far less likely to depend on instructions in a totally different thread.  This means that a CPU working on more than one thread at once (or a dual CPU system doing the same) can deliver higher performance if the CPU is capable of executing more instructions than the dependencies within a single thread allow it to.  Remember, I said above that a P4 can issue and execute at most 4 instructions at the same time, and the Athlon 6.  (Those 4 or 6 instructions must fall into certain types to actually execute that many at once, but we won't get into that.)  If the CPU is capable of issuing and executing more instructions than it is actually executing, it is wasting valuable resources.  The number I generally see floating around is that the typical x86 CPU averages about 2.5 instructions at the same time.  That means that a lot of the time, neither the P4 nor the Athlon can execute at peak efficiency, because for one reason or another they can't execute 4 (or 6) instructions at once.

Now let's talk about what Intel calls "Hyperthreading."  Hyperthreading essentially fools the operating system into believing that a single CPU is actually two separate CPUs.  (The 3.06 GHz 'B' type P4 and all 'C' type P4s are Hyperthreading capable.)  This allows a Hyperthreaded P4 to be fed more than one thread at once, and if the CPU has free resources it can use them to execute instructions from a second thread.  Basically, Hyperthreading can allow the CPU to make use of free resources to work on a separate thread, and it also reduces the performance penalty of a context switch, as the entire CPU need not sit essentially idle while one occurs.  (Technically this is a reduction in the latency, or the amount of time it takes, to execute a context switch.)  The net effect of this, from an end user perspective, is that the system is more responsive and feels quicker when executing multiple tasks (i.e. running more than one program at the same time).  Describing the technical aspects of how this is done is far beyond the scope of this post, but basically certain key portions of the CPU are duplicated, and other key portions (most importantly the execution units and cache memory) are shared.  

Unfortunately there are drawbacks to Hyperthreading, and I'm sure some of you have noticed that running several benchmarks with Hyperthreading enabled results in slightly lower scores.  That's mainly because not all the resources of the CPU are devoted solely to running the benchmark, so the benchmark score drops slightly.  There are a couple of reasons for this.  Probably the biggest is that cache memory requirements jump significantly when working with multiple threads.  Cache memory works on the principle that a thread will tend to work with portions of memory relatively close to each other in physical location.  When processing multiple threads this assumption doesn't work as well, as each thread may be working with portions of memory far from each other.  This reduces the effective amount of cache memory that each thread has to work with, and in some circumstances results in far more accesses to main memory (which take a lot of time) than would have occurred with Hyperthreading disabled.  Also, if a thread would fit completely within the cache with Hyperthreading disabled and won't with it enabled, you get a very significant performance hit.  Again, this is because the number of accesses to main memory goes up significantly.  Since the execution units are also shared, there can be other complications.  For example, the P4 does not have a fully pipelined floating point unit like the Athlon, and must alternate multiply (divide) and addition (subtraction) operations for best performance.  (A fully pipelined floating point unit has totally separate multiplication and addition units, meaning they don't share resources and can work at the same time.)  Many P4 optimized programs know this and properly alternate the execution of add/multiply instructions for top performance.  If multiple threads are executed, since each thread is fooled into believing it is the only thread executing, this can keep those optimizations from being as effective.  If I had to guess at the main cause of the drop in benchmark scores with Hyperthreading turned on, I would say the cache memory issue is by far the more critical restriction.
Title: PC Architecture
Post by: bloom25 on November 04, 2003, 07:55:17 PM
I'm sure many of you already know that the successor to the current Northwood P4 is called Prescott.  Among the improvements in Prescott is supposedly improved Hyperthreading performance.  The biggest improvement is undoubtedly the increase in cache memory compared to Northwood: the L2 cache is doubled to 1 MB, and the L1 data cache is doubled as well.  This will definitely improve Hyperthreading's effectiveness.  Prescott also includes new instructions (SSE3, originally called the "Prescott New Instructions"), a few of which are specifically designed to increase Hyperthreading performance.  (I haven't studied them in detail yet, but my guess would be that they allow one thread to temporarily halt execution of another thread.)  It's also possible to specially optimize applications to take better advantage of Hyperthreading.

I think I'll also quickly note that AMD has not as of yet decided to implement some version of Hyperthreading in their new CPUs.  I can't say that I blame them, because the Athlon and especially the Athlon 64 won't gain nearly as much from it with current software.  An on-die memory controller, as noted in previous posts, greatly reduces memory access latencies, which reduces the performance lost when executing a context switch.  The Athlon and Athlon 64 also have significantly shorter pipelines than the P4, again reducing the advantage of Hyperthreading a bit.  (Athlon - 10 stages, Athlon 64 - 12 stages, P4 - 20 stages)  In addition, one of the new pipeline stages in the Athlon 64 analyzes instruction dependencies to attempt to schedule them to take better advantage of the CPU's resources.  However, as new software begins to take better advantage of Hyperthreading, I would not be surprised to see AMD eventually come up with some way to gain a bit of performance from it in future CPUs.  (I'm not even considering any Intel patents on the technology.)
Title: PC Architecture
Post by: beet1e on November 05, 2003, 04:20:32 AM
Very interesting posts, Bloom25. Thanks for your advice, and to the other guys who answered my query. Last time I put a system together, the FSB and clock settings were in the BIOS (Asus A7V133). I didn't attempt any overclocking, and made no mods to the default speeds. All was well, with FPS in AH being 50-60 typically. As AH is about the most demanding app that I'm running right now, I hope to be OK with what I've bought. I'll look in the mobo manual to find out how to change settings, and if need be will post back.

As far as I can tell, the XP2600 I have is the thoroughbred, not the Barton. I got the 2600 even though I could have bought the 3200, because the 3200 cost about 5 times as much at the time! In the past few months it's dropped from £364 to £261 inclusive of tax. The 2600 I now have was about £75. I chose to apportion the main expense to the Radeon 9800 Pro vid card. Not much change out of £300. :(

Bloom. I was interested to read about multithreading. Same thing has existed on mainframes since the 70s, possibly earlier. But can you now explain to us about these "Dual Processors" that are being offered by AMD? Is this a hardware function to allow two threads to run at the same time? I bought single, not dual...
Title: PC Architecture
Post by: jonnyb on November 05, 2003, 12:16:45 PM
Intel's hyperthreading model attempts to mimic a dual processor system.  It does this as bloom described in his posts.  In a real dual processor system (Intel Xeon, AMD Opteron, etc) there is no need for that mimicry.  The operating system sees the two processors and each gets its own processes to work on.  Unlike the hyperthreaded model, a dual processor system can truly work on multiple processes in parallel.

The advantages of having two processors are numerous.  For example, if a program is written to take advantage of multiple processors, it will complete its work in far less time than if it were working on a single processor (or even on a hyperthreaded one).  Take 3D rendering.  Producing animation is extremely processor-intensive simply because of the amount of math involved.  Programs like 3DStudioMax and Bryce work extremely well with multiple processors.  They can split tasks up across the processors and thereby reduce the total rendering time.

Databases and application servers also benefit greatly from multiprocessor systems.  For example, I have built many large scale e-commerce type applications (bn.com, columbiahouse.com, kinkos.com, to name a few).  Each of these applications serves many thousands of people.  By using multiprocessor systems, these applications can respond much more efficiently to consumers.

Back to the bloom show... btw bloom, your commentary on architecture is exceptional.  Ever think of writing a book or becoming a professor -- or perhaps you already have/are.  I've thoroughly enjoyed the reading, and the chance to brush up on my knowledge is invaluable.
Title: PC Architecture
Post by: bloom25 on November 05, 2003, 09:38:36 PM
Let's go ahead and talk a bit about true multiprocessor architectural issues, since JohnnyB mentioned that briefly.  There are some interesting differences between AMD and Intel CPUs here as well.

Many software applications are multithreaded, meaning they make use of more than one thread.  If you have a multiprocessor capable operating system (Linux, Win2k Pro, and WinXP Pro being the most common), along with multithreaded applications, having more than one CPU can deliver much higher performance than a single CPU.  Unfortunately the boost from dual CPUs will rarely be anywhere near 2x that of a single CPU system, and I can't say I've ever seen a hardware review talk about why that is.  I'm sure quite a few of you may have wondered why dual CPU systems aren't twice as fast as single CPU systems, or why it is that dual CPU systems are sometimes slower than a single CPU system.  There are several reasons for this.

First, let's talk about how the operating system makes use of dual (or more) CPUs.  The simplest explanation would be to say that a multiprocessor capable operating system runs one thread on one CPU and another thread on the second CPU.  This is essentially true, and as you might imagine, applications that make use of more than one thread can realize tremendous performance gains on multiprocessor systems.  But why isn't the boost 2x?  There are several reasons, some relating to architectural limitations of the hardware being used (i.e. the CPUs themselves) and others relating to software issues.  Unfortunately I don't even begin to consider myself an expert on software issues relating to multiple CPUs, but I'll do my best there.  (A good SMP programmer will probably point out tons of flaws in my software explanation if I go too far, so I'll keep it simple.)  Fortunately I do know hardware, so I'll talk about those issues first.

Hardware issue 1:  Both CPUs in a dual processor system (the notable exception being the new AMD Opterons - more on that later) share the same system memory and disk drives.  I'm sure most of you can see that this will result in significantly more memory accesses and very slow disk accesses as the number of CPUs goes up.  This, of course, results in a performance decrease.  Unfortunately this gets even worse on real hardware platforms, because in the case of the P4 Xeons the FSB bandwidth is shared between all processors in the system.  This means each CPU shares bandwidth with every other CPU on the bus, which increases memory latencies and decreases bandwidth - a very bad thing.  Unfortunately life isn't much better for the Athlon MP (which is simply a dual processor certified Athlon XP with some additional testing).  I'm sure some of you know that the Athlon architecture is largely based on a server processor known as the Alpha (which was designed by Digital Equipment Corp, later acquired by HP and Intel).  Many of the engineers AMD hired to design the Athlon had previously worked on the EV6 and EV7 Alphas, which are 64 bit multiprocessor capable server CPUs.  One of the key carryovers from the Alpha EV6 processor was its bus protocol, also called the EV6 bus.  The EV6 bus has the advantage that each CPU gets its full bandwidth to memory, rather than sharing it as on the P4 Xeon.  Unfortunately AMD has squandered this very significant advantage by failing to keep pace with advancements in DDR memory in their only Athlon MP chipset (the 760MP/MPX).  The 760 chipset only supports a 266 MHz FSB, and thus only officially supports DDR266 memory.  For comparison, the Athlon XP 3200+ has a 400 MHz FSB and supports DDR400 memory.  This means that even though the Athlon MPs do not share FSB bandwidth with each other, the chipset itself is out of date in its memory support.  (This is also why the Athlon MP 2800+ has a 266 MHz FSB, where the XP 2800+ has a 333 MHz FSB.)  Obviously a dual processor system can make better use of a faster FSB and faster memory than a single processor system can.  (As I hinted above, the Opteron is different, and I'll get to that - I promise. :) )

Hardware issue number 2/Software issue - Let's now talk about a tremendous issue that every multiprocessor system must deal with.  Consider what happens if both CPUs in a dual processor system were to access the same data in memory, and both wanted to make changes to that particular bit of data.  The potential issues here are monumental.  Think about this for a minute.  Let's say CPU 1, in its thread, is asked to make a decision based on the data value in a particular location in memory.  Let's also say that CPU 2 just happens to be working with a thread that makes a change to that same location in memory.  There is a very real chance that CPU 1 may not do what the programmer intended if CPU 2 happens to change the bit of data that CPU 1 is making decisions based on.  There is more to this though; think about CPU cache memory.  Remember that CPU cache memory is basically very high speed memory that is filled with the bits of main memory that the CPU will be working with in the near future.  Basically, cache memory can be thought of as high speed temporary memory that the CPU works with.  If the CPU makes a change to something in cache memory, like writing the result of an instruction there, that particular bit of data in main system memory must also be updated before some other thread (or, much worse, some other CPU) works with the same bit of data.  If CPU 1 made a change to something temporarily located in its cache memory, and CPU 2 needed that same bit of memory and happened to read it before CPU 1 could write its cache back to main memory, the system could crash.  Basically what I'm getting at is that if one CPU tries to work with the same data as a second CPU at the same time, neither one can be sure the data it is working with is correct unless it can be absolutely sure the other CPU isn't going to make changes to that data until it has finished working with it.  The software term for this potential nightmare is a "race condition," meaning that two separate processes or threads are trying to work with the same bit of data at once.  Obviously multithreaded programs must be very carefully written to ensure that one of their threads isn't working with the same bit of data as another thread without coordination.  Fortunately the multithreaded program has the advantage of knowing that a properly written multiprocessor operating system is supposed to ensure that other programs don't tamper with locations in memory reserved for it.  So there we've talked a bit about software issues with multithreaded programs, but this still doesn't solve the CPU cache memory issue.  That issue is resolved by the CPUs themselves (the mechanism is known as cache coherency): on every memory read and write, a CPU in a multiprocessor system checks that no other CPU in the system holds the data it is working with in its cache.  If another CPU does, and the data in that CPU's cache is more recent than main memory, main memory is updated and the most up to date value is used.  As you can probably imagine, this puts even more traffic on the CPU FSB(es) and more accesses to memory, which relates back to issue number one.  You can probably imagine that the Athlon MP, with its non-shared FSB, is a bit better here than the Xeon.  Once again, the Opteron has a big advantage here, and again - I'll get to that... ;)
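
For the software side of this, here's a minimal C sketch of a race condition using POSIX threads (a made-up example, nothing from any real application).  Two threads increment the same counter; each increment is really a read-modify-write, so without the lock, updates get lost:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* without this lock, the two    */
        counter++;                    /* read-modify-write sequences   */
        pthread_mutex_unlock(&lock);  /* interleave and lose updates   */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);
    return 0;
}

With the lock, this always prints 2000000; comment out the lock/unlock lines and the result becomes unpredictable.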

Hardware issue 3 - Since main memory is shared, the chipset must arbitrate between the CPUs.  There's not much to say here; basically the chipset must take requests from each CPU to read and write memory and grant control of memory to each CPU as it requests it.  If both CPUs need to access memory at the same time (which they nearly always will, since memory accesses take so long), one must wait until the other is done.  This means higher memory read and write latencies, again hurting performance.  Again, the Opteron is better, and I'm finally going to talk about why... but first I'm taking a break, so stay tuned - the Opteron has some really "cool" ways of increasing the efficiency of multiprocessor systems. :D  (The on-die memory controller in each CPU should be fairly obvious, but there's more to it than that...)
Title: PC Architecture
Post by: bloom25 on November 05, 2003, 11:48:03 PM
Ok, now let's talk about the Opteron and what makes it so special when it comes to multiprocessing.

I'm sure you all read above that the Athlon 64 family (Athlon 64, Athlon 64 FX, & Opteron) features an on-die memory controller.  The truth is that there's a lot more than just that in there.  Remember that in a traditional multiprocessor system each CPU shares main memory with every other processor, which hurts performance for many reasons.  With the Opteron, EACH CPU has its own memory controller and its own DDR memory modules.  This is a tremendous advantage: every CPU in a multiprocessor setup has its own memory modules, rather than sharing them.  This advantage becomes even more important as you add more CPUs.  (If 2 CPUs sharing the same memory is bad, picture 4, 8, or even more.)  There's even more than that though: in a traditional multiprocessor system each CPU must communicate with the Northbridge portion of the chipset to gain access to the shared memory.  What's more, the CPUs have to go through the Northbridge to communicate with each other as well, meaning every CPU you add results in a smaller and smaller performance boost, percentage wise.  In the Opteron, each CPU can communicate directly with the others over a fast 6.4 GB/sec Hypertransport link.  6.4 GB/sec is the bandwidth offered by dual channel DDR400 memory, so in effect every CPU acts as a Northbridge all by itself.  This means every CPU can access its own memory directly, and can communicate through its extra Hypertransport links with every other CPU with only a minimal performance penalty.  This high speed link also improves the efficiency of each individual processor's cache memory, as the other CPUs in the system can access another CPU's cache much more rapidly than in other multiprocessor setups.  Not only is this tremendously faster than any other multiprocessing scheme today, it also eliminates the need for an increasingly complicated Northbridge, which in a traditional MP setup every CPU must communicate with.  This means multiprocessor chipsets can be MUCH simpler; basically they become only I/O controllers (controlling hard drives, USB ports, the PCI bus, the AGP slot, etc.).  The chipset communicates with one or more of the CPUs over another fast Hypertransport link built into the Opteron.  This means that in an Opteron system, only the drives and the rest of the I/O are shared.

There are actually 4 different series of Opterons being produced:  The 100 series, which is only single processor capable, has only 1 active Hypertransport link.  (Making it currently identical to the Athlon 64 FX CPU)  This single link hooks the CPU to the motherboard chipset.

The 200 series, which can work in dual processor systems and has 2 active Hypertransport links.  The extra link hooks the two CPUs together.

The 800 series, which can work in 4 or 8 way systems, has a 3rd Hypertransport link.  In a 4 way setup the links are arranged like a square: picture a square with one CPU at each corner.  Each CPU uses 2 of its HT links to connect to its 2 nearest neighbors, making up the sides of the square, and the 3rd link on one (or more) of the CPUs goes to the rest of the system.  In an 8 way system the CPUs are arranged as a cube (or a sort of rectangle), with each CPU using its 3rd HT link to connect across to the corresponding CPU in the top or bottom square respectively.  Again, one or more of the CPUs communicates with the rest of the system.

The last series is a special version of the Opteron being used for supercomputers.  Cray, IBM, Sandia National Labs, and others are building or planning to build supercomputers with it.  This chip has even more HT links, enough to build systems arranged as a giant 3 dimensional grid, with each CPU's HT links reaching out to its neighbor CPUs.  The best way I can describe this arrangement is to picture each CPU in the middle of a 3d "+" sign.  There are supercomputers with well over 1000 individual Opterons either planned or under construction, which would rank them among the fastest in the world.  These supercomputers generally run 64-bit Linux or Unix.  (Windows currently does not support what's known as NUMA (non-uniform memory access), which such a setup with multiple memory controllers requires.  Windows Server 2003 is the first version to include some NUMA support.)  Basically the Opteron is the first x86 compatible CPU designed primarily with multiprocessing in mind.  As you can see, it eliminates or minimizes the disadvantages of adding additional CPUs compared to other x86 multiprocessor CPUs.

This is really easy to picture with a simple diagram.  If I find a good one, I'll link to it here.  Basically the Opteron is capable of scaling in performance far better than any other multiprocessor capable CPU available today as you add more CPUs.
Title: PC Architecture
Post by: bloom25 on November 05, 2003, 11:58:31 PM
Here's a quick bit of info out of Anandtech's early Opteron article from back in April.  (Note that Opteron now supports DDR400 memory.)

http://www.anandtech.com/cpu/showdoc.html?i=1815&p=7

Here's some more links about the supercomputers planned:

This is the computer Cray is building for Sandia National Labs, which uses 10000 (!!!) Opterons and would be the fastest computer in the world if running today.

http://zdnet.com.com/2100-1103-962787.html
http://www.cpuplanet.com/knowledge/casestudies/article.php/2198311
Title: PC Architecture
Post by: jonnyb on November 06, 2003, 10:38:15 AM
As I had briefly mentioned in my post, and bloom has now expanded upon, the benefits to a multiprocessor system are plain to see.  What I didn't touch upon in too much detail was why your average MP system will not see a linear growth curve of application performance to number of processors.  One would expect that adding a second processor would double the speed.  Four processors must then quadruple it, right?  Unfortunately, no.  Bloom has described quite correctly the hardware limitations involved in multiple processor systems.  To summarize, x86-based processors (until the Opteron, that is) shared system resources.  They were forced to utilize the same memory, the same FSB, the same Northbridge, the same I/O controllers.  All of this sharing leads to a lot of wasted time on the CPUs while they wait for the rest of the system to catch up.

Another issue that was mentioned (hardware/software issue 2 from the above post) was programmatic access to memory by multiple CPUs.  I will expand on the programming issue as that is where my expertise lies.

First and foremost, probably 99.999% of all programs you run on your home PC are multi-threaded.  It would just take way too long for programs to execute if they were not.  Let's look at an example that we are all familiar with: this bulletin board.  The architecture of this board involves a graphical user interface (GUI) that provides the look and feel of the board, an API to retrieve and store data, and an API to accept user input and perform actions based on that input.  (There are more things involved, but this list is enough to get us started).

Let's first assume that this bulletin board is single-threaded.  When a user accessed the board by typing in the URL, the application server would receive that request, perform a lookup in the database to verify the user's existence, retrieve information from the database about that user, retrieve information about forums, check to see if there are any forums that have not been read since the user's last visit, manage the retrieved data, compile the data into a usable format, generate the HTML to display the data (based on the GUI) and finally send that data back to you.  Wow.  Just by typing in the URL of this board, you've caused at least 9 major events to happen.  Each of these events requires time to complete.  Furthermore, a lot of this time is spent waiting on the retrieval of information needed to proceed to the next step of the program.  During these waiting times the CPU would sit idle.

Compound the above everyday scenario by adding multiple users.  In a single threaded application all of you would have to wait until my request had been completed.  Assuming each request takes 6 seconds to complete and there are 100 users trying to access this board, the poor guy that came in at number 100 would have to wait an unbearable 10 minutes for his request to finally be processed (based on the 100 users coming into the application simultaneously and a first-in-first-out queue).

Can you imagine having to wait 10 minutes for the board to load?  Obviously nobody would wait that long.  Notice, too, that throughout that 10 minute period, the CPU of the app server would be mostly idle because of the waiting time to retrieve all of that data.  The database servers would also sit idly by, waiting for more requests from the app server....

continued below
Title: PC Architecture
Post by: jonnyb on November 06, 2003, 11:03:25 AM
Adding another processor (or 100 more) to the single-threaded model does not help our cause any.  Each processor would still be sitting around idle for extended periods of time.

Multi-threading a process attempts to keep a CPU busy at all times.  Breaking a process down into parts and giving each of those parts its own thread allows the process to execute much more quickly.  The best example of this type of behavior can be seen in a macro sense with the SETI application.  I'm sure most of you have seen/used/heard about this app.  Basically, it takes one giant chunk of radio telescope data, breaks it down into smaller pieces and farms those small pieces out to processors around the world.  The efficiency of crunching the data in this way is leaps and bounds ahead of doing it all in a single bite.

In our bulletin board example, multi-threading allows our application to service many more readers concurrently.  The application spawns multiple threads to handle user requests.  Let's assume that there are 100 threads spawned, to match the number of users trying to access the board.  Remember that I told you the CPU was sitting around idle most of the time?  Well, here's where we take advantage of it.  When the first thread is waiting on the fetch of data from the database, it gives up its hold on the CPU.  The second thread now utilizes the CPU.  When it's waiting, it gives up control, and thread 3 comes in.  This goes on and on.  The CPU services each thread and spends far less of its time idle.  So what does this mean to us?  It means that we are serviced much faster.  Now the poor guy who came in 100th doesn't have to wait an unbearable 10 minutes.
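
Here's a tiny hypothetical C sketch of the thread-per-request idea using POSIX threads (nothing like the board's real code, which I've never seen).  The sleep() call stands in for a blocking database fetch; while one thread waits, the OS schedules another, so the waits overlap instead of stacking up:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *handle_request(void *arg)
{
    int user = *(int *)arg;
    sleep(1);   /* stand-in for a blocking database fetch; the thread
                   gives up the CPU while it waits */
    printf("served user %d\n", user);
    return NULL;
}

int main(void)
{
    enum { USERS = 100 };
    pthread_t t[USERS];
    int id[USERS];
    for (int i = 0; i < USERS; i++) {
        id[i] = i + 1;
        pthread_create(&t[i], NULL, handle_request, &id[i]);
    }
    for (int i = 0; i < USERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

All 100 "requests" finish in about 1 second of wall clock time instead of 100 seconds, because the waiting overlaps.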

If you've followed to this point you will know that my example has been based on a single CPU.  We've seen that multi-threading a program gets things done faster by reducing CPU idle time.  This can be expanded to multiple CPUs....to a point.  As bloom mentioned, there are setbacks.  The more CPUs you add to the system, the more overhead is involved.  Now program states, thread locations, memory accesses, and I/O accesses have to be communicated throughout the system.  The system spends more time managing than processing.  Programs become bloated because they have to deal with handling threads more carefully.  Operating systems become more complex.  The list goes on.

I think I'll break here as I've hijacked bloom's thread long enough.  If people are interested, I'll start another thread that deals with software development.
Title: PC Architecture
Post by: bloom25 on November 06, 2003, 05:42:12 PM
Go ahead and continue in this thread if you want, JohnnyB.  Hardware and software are so closely intertwined that I think it would be best to just keep all the info in one thread.  That allows hardware discussion to build on software issues and vice versa.

It would probably also be interesting to discuss some of the x86 architectural limitations and annoyances.  Things like segmentation and the limited number of software accessible x86 registers to work with.  That would also be a good primer to build the SIMD instruction info off of.

BTW:  I found the software discussion very interesting.
Title: PC Architecture
Post by: Roscoroo on November 10, 2003, 12:03:20 PM
I vote ya keep going in the same thread here ... Still reading :aok
Title: PC Architecture
Post by: jonnyb on November 10, 2003, 12:53:40 PM
Alright, I'll continue in this post.  Hopefully, I'll get a chance to post something later today.  I think the discussion will be regarding the low-level software interactions with the hardware (MMX, SSE and SSE2).
Title: PC Architecture
Post by: bloom25 on November 10, 2003, 05:59:56 PM
That's something I was working on myself, but honestly it's not an easy topic to simplify and still get any useful information across.
Title: PC Architecture
Post by: jonnyb on November 11, 2003, 10:40:39 AM
lol... I hear that.  The basic idea was to convey the advantages of the MMX, SSE and SSE2 instruction sets.  When I realized that to do so would require a far greater amount of detail than I care to post, I started thinking of other ways around it.

The basic premise for the introduction of SIMD (Single Instruction, Multiple Data) instructions was to speed up complex operations on a CPU.  If porn makes the internet go 'round, then games do the same for hardware advancement, albeit more discreetly.  To understand why these instruction sets were added to x86 CPUs, one must understand the enormous amount of processing that must take place in a typical game.  As before, I will use an example we are all familiar with: Aces High.  Before I go on, I must include a disclaimer: I do not work for Hitech Creations, and I do not have any knowledge of, or access to, any of the source code that makes up Aces High.  What I am going to write are general concepts, and nothing specific to the AH engine.

Ok, now that the disclaimer is out of the way, let's begin.  In AH a huge amount of data must be processed to create the virtual reality we enter each time we start up the game.  The processing power goes to rendering the world around you, computing flight characteristics, registering and tabulating damage inflicted both to and by you, etc.  This is no small feat to accomplish, and tends to keep a CPU quite busy churning away.

What does it all boil down to?  The answer, yes I'm sure you didn't believe your teacher when s/he told you this, is math.  That's right.  It's all about the math.  Let's look at one specific part of AH: the graphics engine.  There are many graphics engines on the market, some of which are quite famous: Quake3, Half-Life, Unreal, etc.  The single purpose of a graphics engine is to render graphics (duh).  How it does this is through the use of math.  I know I'm digressing here, but I'll get back to my original point, I swear.  It really is related...

Let's consider a simple graphics engine.  There are two components involved, a modeler and a renderer.  The modeler is responsible for the generation of shapes and creation of coordinates.  The renderer then takes the information produced by the modeler and produces the images on screen.  For the sake of convenience, this is a grossly over-simplified explanation of the function of the components of a graphics engine.  For example, I am not going into z-buffers, pixel shaders, dynamic light sourcing, ray-tracing, etc.

Anyway, the simplest representable polygon is a triangle.  It is planar and has only three points (or, in the graphics world, vertices).  We can represent any shape we wish with enough triangles.  For example, a square is nothing more than two triangles.  The more triangles we use, the higher the detail we can achieve.  This is easily visible when creating curved surfaces.  Fewer triangles, and the edges are jagged.  More, and they smooth out.

This, however, comes at a price: computing power.  The modeler must create all the vertices for all the points of all the triangles of all the shapes in an object.  This can also include base color saturation levels for each vertex, and other pertinent information regarding points or shapes.  Can you see where this gets computationally expensive?  Take the creation of a sphere as an example of a very common shape.  Our modeler must decide how many triangles will compose the sphere.  With that knowledge, and some other inputs (like the radius of the sphere and the formula for calculating surface area -- 4*pi*r^2) the modeler generates the vertices of the triangles composing the sphere.
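
As a made-up illustration of how fast the triangle count grows, here's a small C sketch that walks the latitude/longitude patches of a unit sphere the way a very naive modeler might (the slices/stacks knobs are invented for this example):

#include <math.h>
#include <stdio.h>

typedef struct { float x, y, z; } Vertex;

/* one point on a sphere of radius r, from two angles */
static Vertex sphere_vertex(float r, float theta, float phi)
{
    Vertex v = { r * sinf(theta) * cosf(phi),
                 r * sinf(theta) * sinf(phi),
                 r * cosf(theta) };
    return v;
}

int main(void)
{
    const float pi = 3.14159265f;
    int slices = 16, stacks = 16;   /* detail knobs */
    long triangles = 0;

    for (int i = 0; i < stacks; i++) {
        for (int j = 0; j < slices; j++) {
            /* each lat/long patch becomes two triangles; computing
               one corner shows the per-vertex trig cost */
            Vertex corner = sphere_vertex(1.0f, pi * i / stacks,
                                          2 * pi * j / slices);
            (void)corner;
            triangles += 2;
        }
    }
    printf("%d x %d sphere -> %ld triangles\n", slices, stacks, triangles);
    return 0;
}

A 16x16 sphere is already 512 triangles, and doubling both knobs quadruples the count - with every vertex costing trig calls like the ones above.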

Once all the vertices have been computed, that information is fed to the renderer.  Its job is to put all of those objects onto the screen.  It has to deal with interaction between objects, including light sources, shadowing, obscurity, and so on.  Again, it is the power of math.  Without going into a college-level linear algebra and matrices course, suffice it to say that a LOT of math happens here.  Very complex math, including calculus, vector operations, and the 4-dimensional (homogeneous coordinate) matrix math used for 3D transformations.

continued below...
Title: PC Architecture
Post by: jonnyb on November 11, 2003, 12:58:58 PM
With all that math going on, especially the matrix operations, the CPU is kept exceptionally busy.  It is so busy that today's games are completely unplayable on older systems.  So, for all you guys out there wondering why your Pentium 166 with integrated 4 MB video won't play AH2, there's your answer.

So, what can we do about this?  The answer is that we group like operations and data together.  This is where SIMD comes into play.  (See, I told you I would get back to this :)).  SIMD allows a CPU to perform a single operation (like multiplication) on multiple data instead of a one-to-one relationship of operation to data (SISD).  The key to SIMD and its success is known as data parallelism.  You get data parallelism when you have a large mass of data of a uniform type that needs the same instruction performed on it.  In my rendering example, one would have many dot products to perform on a scene.  The basic unit of SIMD processing is the vector, which is why SIMD computing is also known as vector processing.  A vector is nothing more than a row of individual numbers, or scalars.  A regular CPU operates on scalars, one at a time.  A vector processor, on the other hand, lines up a whole row of these scalars, all of the same type, and operates on them as a unit.

The introduction of this hardware, and the accompanying software instruction sets, allowed for huge gains in processing power over raw clock speed increases alone.  MMX was first out of the gate.  I will leave the hardware details to bloom, but will provide a brief summary.  MMX is SIMD strictly for integer calculations.  SSE, on the other hand, handles floating point calculations.  MMX offers the ability to operate on 2x32 bit integers simultaneously; SSE offers 4x32 bit floats.  Furthermore, MMX defined no new registers, opting instead to reuse the existing x87 floating point registers, while SSE introduced 8 new registers.
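
To make the 4x32 idea concrete, here's a minimal C sketch using the SSE compiler intrinsics (the _mm_* calls map essentially one-to-one onto SSE instructions; addps is the instruction behind _mm_add_ps):

#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void)
{
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float r[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into an XMM register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);  /* one instruction, four additions */
    _mm_storeu_ps(r, vr);

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);  /* 11 22 33 44 */
    return 0;
}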

With these new instruction sets, software developers could compile their code to take advantage of the benefits offered.  Programs like Photoshop, 3DStudio, mWave, etc that had been re-written to utilize the new instruction sets saw huge performance gains.  Games also saw huge improvements in frame rates as the new hardware could process data that much quicker.

Ok, I've covered quite a bit here, so I'll stop.  After all, I do have to develop some software :).
Title: PC Architecture
Post by: bloom25 on November 11, 2003, 05:11:39 PM
Great job JohnnyB!  :)  That definitely gives a good base to build on.

I think at this point it might be a good idea to describe in a bit more detail some key concepts to understanding how SIMD instructions (MMX, SSE, SSE2, 3dnow, SSE3) are useful.

Let's go down to the absolute basics here and define what's meant by bit, byte, word, float, integer, double, etc.  If you know what these mean it will be much easier to understand how SIMD instructions can greatly boost performance.  These concepts are actually quite simple once you understand how to think in binary.  To get there, let's send everyone back to basic math class. ;)

First, the "bit":  A bit is roughly analogous to a digit in our normal decimal way of thinking.  For example the number "410" has three digits, the ones, tens, and hundreds places.  If you think about this what we are really saying is that 410 is made up of 4x100 + 1x10 + 0x1 = 410.  (4 times 100 plus 1 times 10 plus 0 times 1 equals 410.)  The same holds true for binary numbers (and all other numbering schemes like octal and hexidecimal, of which the later is commonly used in software).  If I want to represent a number in binary, which only has 0 and 1 (versus 0,1,2,3,4,5,6,7,8, and 9 for decimal) we simply do exactly what we do for decimal.  There's only two possible choices for each place though, 0 times some place and 1 times some place.  Think about this:  In decimal each place (ones, tens, hundreds, thousands) is a power of 10.  The ones place is 10^0 (or 10 to the 0 power, which equals 1.)  The tens place is 10^1 (or 10 to the 1st power, which equals 10.) The hundreds place is 10^2 (equals 100) and so on.  Thus again when we write 410 we are saying 4x10^2 + 1x10^1 + 0x10^0 which equals 410.  (4 times 10 to the second power = 400.  1 times 10 to the first power = 10.  0 times 10 to the 0th power = 0.)  Now lets talk about binary.  Binary does not have places that are powers of 10, but rather places that are powers of 2.  Thus for binary the first place is still the ones digit (2^0), but the next is the 2s place (2^1), then 4s (2^2) place, then 8 (2^3), 16, 32, 64, 128, 256, 512, etc.  This means since you only have a 0 or 1 as a multiplier for each of these places it's really easy to count in binary.  Lets start simple, in binary if I want to represent the number 10 I would write that as 1010.  (8 times 1 plus 4 times 0 plus 2 times 1 plus 1 times 0 = 10 (decimal))  For reference, here's 0 to 15 in binary:  0000 = 0, 0001 = 1, 0010 = 2, 0011 = 3, 0100 = 4, 0101 = 5, 0110 = 6, 0111 = 7, 1000 = 8, 1001 = 9, 1010 = 10, 1011 = 11, 1100 = 12, 1101 = 13, 1110 = 14, 1111 = 15.  Once you catch on to this you understand just how basic it really is.  Now (finally) back to the bit.  A "bit" is simply any individual binary digit.  Thus to represent numbers from 0 to 15, you need to use 4 bits to do that.  (Look above, 15 = 1111, which is the highest number I can represent using only 4 bits.  If I wanted to write 16, that would be 10000, which needs 5 bits.)

Ok, I think we now have a fairly good understanding of a bit; now on to the byte.  A byte is simply a number using 8 bits.  This means that a single byte of information can have one of 256 possible values, ranging from 00000000 = 0 to 11111111 = 255.  The byte is a very commonly used term in computers and tech literature.  (A Megabyte is technically one million bytes.  A Gigabyte is one billion bytes.)  Remember I just said that a single byte can represent 256 different numbers (the integers 0 to 255).  Given that, is it hard to understand that an 8 bit CPU can work only with (positive) numbers from 0 to 255 in each operation it performs?  For example, if an 8 bit CPU wants to add two numbers, neither number can be greater than 255.  Now obviously you CAN work with numbers bigger than 255, but you have to use more than one operation to do so.  The first byte can represent the lower digits and the second byte the higher digits.  Explaining this further would drag the discussion way off topic, so let's move on, shall we.  The important concept to gather is that an 8 bit CPU must perform more operations to work with numbers greater than 255.  This puts it at a disadvantage against CPUs that can work with 16 bits (or more) at a time.
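
A quick C sketch of the wraparound (illustrative only) - an 8 bit value simply can't hold 260, which is why an 8 bit CPU needs extra operations and a carry for bigger numbers:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t x = 250;
    x = (uint8_t)(x + 10);  /* 260 doesn't fit in 8 bits... */
    printf("%d\n", x);      /* ...so this prints 4 (260 - 256) */
    return 0;
}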

Now, how about 16 bits?  With 16 bits (2 bytes) we can describe up to 65536 possible numbers (0 to 65535).  16 bits is actually a rather special number of bits for x86 CPUs, because originally the x86 instruction set was only 16 bit.  (This covers CPUs including the 8086, 80186, and 80286.)  The x86 instruction set was expanded to 32 bit with the 80386 (386), so x86 CPUs from the 386 onwards can process either 16 bit or 32 bit instructions and data.  16 bits is also relevant for another reason, and that is the concept of the word.  The basic definition of the binary word is the number of bits the CPU natively works with.  In the x86 world this is no longer simple, as x86 CPUs started out as 16 bit and were expanded to 32 bit (and now 64 bit with the Athlon 64).  In the x86 world, the "word" is 2 bytes or 16 bits long.  In other architectures this is not the case: for the PowerPC (Macs) the word is 32 bits, or 4 bytes.  Why is the "word" important?  Simply put, the word is the natural unit of data the CPU works with in its operations.

Ok, let's go further.  In software programming languages (like C or Java) you must define how many bits are contained in a variable.  To do this, software "declares" a variable, which essentially tells the computer how much memory to allocate for that particular variable.  Now we can define what a "doubleword" is.  A "doubleword" (or dword) is a value made up of 2 words, which in the x86 world means 32 bits.  With a 32 bit variable you can describe just over 4 billion possible values.  (Don't confuse the x86 doubleword with the C data type called "double," which is a 64 bit floating point number.)

Now, in JohnnyB's excellent posts above he mentions a "float."  What is a "float?"  A float is short for a floating point number, which is simplest to explain as a number with a decimal point.  Thus 12.286 is a floating point number, versus an integer like 410.  Representing a floating point number in binary is very interesting.  Picture this: I can also represent 12.286 as 12286 with the decimal point shifted 3 places to the left.  Thus a floating point number can be represented as the digits that make up the number plus the number of places to move the decimal point.  This is how a computer treats floating point numbers.  (The technical terms for these two pieces of information are the mantissa (the 12286 in my example) and the exponent (the number of places to move the decimal point).)  Obviously, to do any serious math, and realistically any kind of division, you need the capability of working with floating point numbers.  Every x86 CPU since the 486DX has had a special unit within the processor dedicated to working with these numbers: the Floating Point Unit (FPU).  The instruction set primarily used to work with floating point numbers is the x87 instruction set.  (Why is it called x87?  Because some computers, like 286 based PCs, had the option of using a separate chip called the 80287, which handled floating point instructions.  The instructions this chip used were naturally called x87 instructions, and after the FPU was integrated with the rest of the CPU in the 486DX, the x87 name stuck.)  The FPU can perform a LOT of different instructions.  It can add, subtract, multiply, divide (with or without remainder), perform trig instructions like sine, cosine, tangent, and arctangent, raise numbers to powers of 2 or powers of e (if you don't know what e is, don't worry), perform log(arithm) instructions, compare numbers, and read and write memory.  (And much more.)  As you might imagine, the floating point unit is MUCH more complicated (and takes longer on some instructions) than the integer unit.  (The integer unit is known as the ALU, or arithmetic logic unit.  The ALU performs most of the same tasks as the floating point unit, but only on integers - numbers without a decimal point.)  Games make heavy use of the floating point unit for just about all mathematical operations; the ALU simply lacks the precision (since it can only work with integers) for many calculations that games need to perform.  Thus, for gaming in particular, the performance of the FPU (at least for the CPU's part) is key.
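
C even exposes the mantissa/exponent split directly.  Here's a quick sketch (note it uses a binary exponent, a power of 2 rather than the power of 10 in my decimal example, but it's the same idea):

#include <stdio.h>
#include <math.h>

int main(void)
{
    int exp;
    double m = frexp(12.286, &exp);          /* 12.286 = m * 2^exp */
    printf("12.286 = %f x 2^%d\n", m, exp);  /* ~0.767875 x 2^4 */
    return 0;
}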

Why have I spent all this time describing all this?  Quite simply, the SIMD instructions are designed to allow an x86 CPU to perform a single instruction on more than one value at the same time.  The differences between the SIMD instruction sets lie in the types and number of bits they can deal with.
Title: PC Architecture
Post by: bloom25 on November 11, 2003, 05:59:46 PM
Ok, on to the SIMD (single instruction, multiple data) instruction sets themselves.  The first to be introduced was MMX.  MMX (MultiMedia Extensions) instructions, added with the Pentium MMX, give an x86 CPU the ability to perform integer (and integer only) operations on up to two 32 bit numbers at the same time.  In a way you could say that MMX added 64 bit integer operations to the x86 instruction set, but that's not really the case, as most of the CPU itself was still 32 bit, meaning that to work with 64 bit numbers it had to perform multiple internal operations.  MMX did give greater performance on integer operations, mainly because it added mov (move) instructions that transfer more than one 32 bit integer value to and from memory at the same time.  Thus integer heavy programs could, using MMX instructions, perform "work" on two 32 bit values with the same instruction.  Unfortunately, games depend heavily on the floating point unit, and MMX does nothing directly to improve the performance of floating point instructions.

Enter 3dNow!.  Back in the days of the AMD K6 series processors, the x87 floating point unit in that CPU was not nearly as good as the Pentium 2's FPU.  Thus in gaming tests the P2 nearly always beat up on the AMD K6.  The K6 did have good integer performance, and thus did well on general office applications and the like that don't stress the FPU.  To help the K6 out, AMD added new instructions they called 3dNow! to the K6 and renamed it the K6-2 with 3dNow!.  Basically, 3dNow! instructions were like MMX instructions for floating point numbers.  If games made heavy use of 3dNow! instructions, the K6-2 could catch up to or even pass the P2.  Unfortunately for AMD, the 3dNow! instruction set had 2 main problems going against it:  1.  Intel held a commanding marketshare advantage over AMD, so few software developers spent the considerable time needed to heavily optimize their programs for 3dNow!.  2.  The 3dNow! instruction set did not have the same precision as the x87 floating point instruction set.  x87 is actually internally 80 bit, and the FPU uses those extra 16 bits to avoid error when working with 64 bit numbers (for example, a divide operation like 1/3, where one number does not go evenly into another).  3dNow! was originally 64 bit only.  This meant that mixing 3dNow! and x87 instructions together added a small amount of error to each calculation.  For games this was very small and didn't matter a whole lot, but for scientific applications and some video editing work this loss of precision could be major.  Again, this greatly limited the use of 3dNow!.  (Think about it: if an x87 result is 80 bit and you pass that data to a 3dNow! instruction, the least significant 16 bits are simply rounded off.  If you then perform another x87 instruction on that data, those 16 bits come back as 0s.  Each operation adds a bit of rounding error, which, if you execute thousands of instructions on a single piece of data, can add up to significant error in some cases.)

With the P3, Intel added their own version of 3dNow! and called it SSE.  SSE did not sacrifice any precision in its calculations, and thus was much better suited to the job than 3dNow!.  Again, SSE is like MMX for floating point numbers.  SSE instructions can work with 4 single precision (32 bit) floating point values in a single instruction, and likewise allow the transfer of 4 floating point numbers to and from memory at the same time.  SSE also added special 128 bit registers (known as XMM registers) for working with these instructions.  AMD implemented partial SSE support with the original Athlon and added full SSE support with the Athlon XP (they just called it 3dNow! Professional).

SSE2 extends SSE to new data types: SSE2 instructions can work with 2 double precision (64 bit) floating point values in a single instruction, and also add 128 bit integer operations on the XMM registers.  SSE2 instructions were added with the Pentium 4, which in a strange twist of fate suffers from the same basic limitation as the AMD K6 did: the Athlon's x87 FPU is far superior to the P4's x87 FPU.  (As noted in previous posts, the Athlon's FPU is fully pipelined, allowing it to execute multiply/divide and add/subtract instructions at the same time.)  Like 3dNow! before it, using SSE2 instructions allows the P4 to narrow the gap and in some cases exceed the FPU performance of the Athlon.  AMD finally added SSE2 support to the Athlon 64 CPUs and expanded it by adding more XMM registers.

The next Pentium chip (Prescott) will add 13 new instructions, called SSE3.  These build upon SSE2 and include a few instructions specifically meant to help Hyperthreading performance.  Prescott is currently scheduled for a Q1 2004 introduction (probably February).

(There are some other interesting things I could talk about if anyone is interested.  For example, how CPUs handle negative numbers and how subtraction, multiplication, and division are executed.)
Title: PC Architecture
Post by: bloom25 on November 11, 2003, 06:42:53 PM
I just realized that both JohnnyB and I have mentioned "registers," but neither of us really defined what that means.  A "register" is essentially a small amount of memory on the CPU itself that can be directly manipulated by the execution units of the CPU.  The x86 instruction set (at least up until the Athlon 64) defines 8 general purpose registers (GPRs) that are directly accessible to software.  (These have names like EAX, EBX, ECX, etc.  If you get a bluescreen on a computer, you will often be presented with the values located in these registers.)  There are also 8 XMM registers used by SSE and SSE2 instructions.  (The Athlon 64 increases this to 16 GPRs and 16 XMM registers.)

It's probably best to describe what a register is for by simply giving an example: Let's say a CPU wants to add two values, A and B.  To do this it would move the value of A from memory into one of the general purpose registers, which is called a move (mov) instruction.  The value of B would be placed in another of the 8 general purpose registers.  The ALU would then be given an instruction telling it to add the contents of register A and register B and store the results of that operation somewhere, say register C.  The contents of register C could then be used in another instruction to make a decision for example, or have some other instruction performed on it.  
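
In C terms, the whole A + B -> C example is just the function below; the comments sketch roughly how a compiler might map it onto the general purpose registers (the actual register choices are up to the compiler):

int add_values(int a, int b)
{
    int c = a + b;   /* mov eax, [a]   ; load A into register EAX   */
                     /* mov ebx, [b]   ; load B into register EBX   */
                     /* add eax, ebx   ; the ALU adds, result in EAX */
                     /* mov [c], eax   ; store the result back to C  */
    return c;
}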

There are also additional registers in the CPU which store other useful pieces of information.  Among these are registers holding information about the segments of memory the CPU is currently using, the capabilities and type of the CPU, and flags describing the result of the current instruction (for example, whether the result was 0, or whether an addition or multiplication produced a number bigger than the register could hold - a carry condition).
Title: PC Architecture
Post by: WhiteHawk on November 19, 2003, 05:19:51 PM
damm bllon..you should get a job doin this sht.:D
Title: PC Architecture
Post by: Roscoroo on December 29, 2003, 06:08:08 PM
Great thread, Skuzzy sticky this puppy ... :aok
Title: PC Architecture
Post by: ker on December 29, 2003, 10:56:43 PM
For Bloom or Skuzzy, with the PCI express video cards and PCI-X expansion slots due out early next year, what kind of performance changes do you expect to see from all this new technology?  Are we going to need new cases for the new BTX form factor motherboards?  Are these new boards going to use regular DDR memory or something else??
Title: PC Architecture
Post by: bloom25 on December 30, 2003, 08:07:12 PM
There isn't a lot of information available right now, but I do know a few tidbits about what's coming in mid-2004.

Intel is planning to launch the next Pentium 4 socket, Socket T, which is a 775 "pin" (more like a bump) socket for their Prescott processors above 3.6 GHz.  With this socket you will see the introduction of the Grantsdale and Alderwood chipsets.  (Grantsdale is the budget chipset; Alderwood is the high end chipset.)  Both of these will support PCI Express, which is the replacement for the current PCI and AGP buses on current motherboards.  Along with this, they offer support for DDR2 533 MHz memory.  I'd expect to see VIA, SIS, and ALI offer competing chipsets with similar feature sets within a couple months of the release of Grantsdale and Alderwood.  I don't know when AMD plans on changing the Athlon 64 to support DDR2.  Remember that one of the key improvements in the Athlon 64 is its on-die memory controller, which means memory support is determined by the CPU, not by the motherboard chipset.  Certainly, future Athlon 64 chipsets will offer PCI Express support.

AMD is also planning to consolidate the Athlon 64 line onto a single unified Socket 939 in mid-2004.  This socket will supersede the current Socket 754, used for the Athlon 64, and Socket 940, used for the Athlon 64 FX.  Socket 940 will likely continue to be used for the Opteron.  AMD plans to offer the current Athlon 64 up to the 3700+ speed in Socket 754, and they have also indicated the possibility of offering 32 bit Athlon XP based CPUs in the 754 pin package as well.

Now, on to the question of whether these improvements will result in a significant immediate performance boost: the answer is most likely no.  What they do offer is the removal of a few bottlenecks that would start to become significant in the near future.

PCI Express is probably the most important in the very near future.  I'm sure a few of you know that the PCI bus in current computers is limited to no more than a 133 MB/sec transfer rate.  (Though some Intel and VIA chipsets actually impose a cap at around 90 MB/sec to avoid data corruption issues.)  This bandwidth is shared between all devices on the PCI bus.  What many of you probably don't know is that the PCI bus links not only the PCI expansion cards, but in nearly all cases the hard drive controller - which can be either Parallel ATA (the current standard) or Serial ATA (which is beginning to show up on new drives) - and most onboard peripherals.  This means that in current systems the hard drive controller (capable of 133 MB/sec for ATA7 or 150 MB/sec for Serial ATA) shares the PCI bus with other high bandwidth devices like the soundcard and network card.  Remember that only 133 MB/sec is available to all devices on the PCI bus.  This means that the 150 MB/sec offered by Serial ATA cannot be achieved (even if current drives could transfer data that fast), and even the 133 MB/sec offered by the ATA7 standard can't be fully realized.  PCI Express will offer increased bandwidth, which will provide greater system performance in the near future as high speed Serial ATA drives and Gigabit network cards become more common.  On the first motherboards to offer PCI Express, you will see connectors of both the 1x and 16x type.  The PCI Express 16x connector is the replacement for AGP, and the 1x connectors replace the current PCI connectors.  With current graphics cards, the extra bandwidth offered over AGP 8x by the PCI Express 16x connection will probably result in nearly no significant performance improvement, but in the future it will offer more.  Another advantage of PCI Express is that the 1x connectors are much smaller than current PCI connectors.  This is significant for small form factor computers, like the Shuttle SFF systems, which are becoming more popular.  (With everything PC related, the fact that the connector is most likely cheaper may also have something to do with it.)

DDR 2 533 MHz memory is a different story.  For the Athlon 64 family CPUs, DDR 2 533 MHz memory will offer a very significant performance improvement when AMD adds support for DDR2 memory.  For the current 800 MHz FSB C type P4s, it will not, as an 800 MHz FSB lacks the bandwidth to take full advantage of dual channel DDR 2 533 MHz memory.  For this reason, expect to see 1066 MHz FSB Pentium chips at some point.  Where the Pentium 4 might see some benefit from DDR 2 memory is in the lower end of the new Prescott P4s, due for introduction in early February.  These will be offered with a 533 MHz FSB option, which means a single stick of DDR 2 533 memory will be all that is required for top performance, rather than 2 sticks of DDR266 or higher.  DDR2 memory will also offer improved integrated video performance, which is important to low end OEM systems.

As for the new BTX form factor, I don't know whether current ATX cases and power supplies will be compatible.  It is possible a BTX motherboard will mount in some ATX cases, but it's really too soon to tell.  The point of BTX is mainly to improve cooling.  BTX moves the CPU to just behind the front case fan, rather than just below the power supply as in current ATX cases.  The PCI Express 16x slot is also placed just behind the CPU for the same reason.  Basically, BTX looks just like an ATX motherboard rotated 180 degrees.  (CPU in the lower right corner, rather than the top left.)
Title: PC Architecture
Post by: Roscoroo on February 04, 2004, 07:03:20 PM
Puntola ......
Title: PC Architecture
Post by: bloom25 on February 04, 2004, 11:56:35 PM
Just a minor update, in case anyone cares: March 29th is the current launch date for Socket 939 Athlon 64s, as well as a speed bump to 2.4 GHz (3700+ is the current expected rating).  Not only will Socket 939 render Socket 754 all but end-of-lifed, the new chips should be a bit faster as well.  The Socket 939 Athlon 64 processors will have dual channel on-die memory controllers, making them like the current Athlon 64 FX, except that they will support regular DDR400 memory.  (The current Athlon 64 FX requires registered memory, which is marginally slower than non-registered (unbuffered) memory.)  The standard Athlon 64s will all have their L2 cache size reduced to 512kB, but to compensate will run at higher clockspeeds than their identically rated Socket 754 counterparts.  Currently the only difference between the Athlon 64 3000+ and 3200+ is the size of their L2 cache (the 3200+ has 1 MB and the 3000+ has 512kB), and current benchmarks show the performance drop is quite small.  So overall I expect the higher clockspeed and additional memory bandwidth to more than make up for the reduction in cache size.  The other direct benefit of reducing cache size is increased yields, which mean a significant reduction in production costs, as well as more frequency headroom.

(Just as a note, March 29th is my birthday.  If anyone would like to send me a 3700+ as a birthday present, I wouldn't mind. ;)  )
Title: PC Architecture
Post by: Kaz on February 05, 2004, 11:55:54 AM
March 28th is mine.  I don't mind a late present though! :)
Title: PC Architecture
Post by: Furious on June 12, 2004, 11:57:03 PM
Bump for XJ.
Title: PC Architecture
Post by: MOSQ on June 22, 2004, 07:00:06 PM
Bloom,
Impressive thread!

I just noted that Dell is now selling the Dimension 8400 in PCI/PCI Express only using the 925x chipset. The AGP slot is gone!

Where the market leader goes the rest of the industry will shortly follow.

So the question for everyone who is looking to upgrade because of AHII: Should they wait for PCI Express mobos and cards to be available widely, or go ahead now with AGP?

The question is important because future video cards starting with the top end ones will be PCI Express, not AGP. In a couple of years all the new cards across the price range will be PCI Express, with only low end cards still AGP.

From PC Mag:

In the most significant bus transition since PCI replaced ISA, PCI Express arrives with the new chipsets. Similar to the way hard drives transitioned from the parallel IDE connection to the serial ATA (SATA) connection, PCI's parallel connection is moving to PCI Express's serial connection. While a typical 33-MHz, 32-bit PCI bus has a unidirectional total bandwidth of 133 MBps (megabytes per second), each pair of wires on PCI Express is capable of transferring 2 Gbps (gigabits per second) in both upstream and downstream (read and write) directions for an effective total bandwidth of 4 Gbps—or 500 MBps.

Since PCI Express is a point-to-point serial connection, that bandwidth is available for each card connected to it, rather than being shared among all the cards as it is for PCI. These pairs of wires can also be grouped together to make x1, x2, x4, x8, x16, and x32 (pronounced "by 1, by 2" and so on) connections, with each pairing doubling the throughput.

Notably, the next generation of high-end graphics cards will no longer be on the 8X AGP bus but on the x16 PCI Express bus. Today's 8X AGP graphics bus has a unidirectional bandwidth of 2 GBps (gigabytes per second). With x16 PCI Express, bandwidth is 4 GBps in each direction, for a cumulative bidirectional bandwidth of 8 GBps. Besides graphics, the 915 and 925 chipsets will also support up to four x1 PCI Express slots in addition to up to six PCI slots.

http://www.pcmag.com/article2/0,1759,1615226,00.asp
Title: PC Architecture
Post by: Roscoroo on July 18, 2004, 06:38:54 PM
Puntola
Title: PC Architecture
Post by: Kev367th on July 19, 2004, 01:47:43 AM
Another update -
1) AMD has said it will NOT be adding DDR2 support to the Athlon 64 (due to high latency problems), but will instead jump straight to DDR3.
2) Athlon 64 systems have no front side bus; the memory controller is on the CPU itself.
3) If you build an Athlon 64 based system, buy the fastest memory you can afford.
Title: PC Architecture
Post by: Roscoroo on August 05, 2004, 12:45:02 AM
Punting again ...