Author Topic: PC Architecture  (Read 2055 times)

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« on: October 20, 2003, 04:14:27 PM »
(This thread started out as a post in a CPU and Video card recommendation thread.  Rather than completely hijacking that thread, I think it's best if I move what I posted there to here before I expand on it.)  

This all started out as a simple comment about the Front Side Bus on a Pentium 4 vs the Athlon XP, so the first post mainly deals with that.  To fully understand this subject I'll have to dig into the actual architecture of the Athlon and Pentium 4, so I'll be adding information about that as soon as I can get around to it.

(Reposted from other thread)

Here you go! I'll try to condense 4 years of college computer architecture classes down to 1000 words. I just hope that some of it makes sense when I'm done.  

The first thing I should probably do is explain what a "front side bus" is in the first place.

A "front side bus" is the link between the CPU and the rest of the system, specifically what is typically known as the "Northbridge".

Most chipsets (which live on the motherboard itself) consist of two parts, the Northbridge and the Southbridge. The Northbridge has historically contained the memory controller (SDRAM, DDR SDRAM, Rambus, etc.) and, more recently, the controller for the AGP slot as well. The Southbridge typically controls just about everything else in the system: PS/2 ports, USB ports, the LPT port, onboard sound, the IDE controller, the floppy controller, onboard networking, etc. The Northbridge and Southbridge are typically linked by the PCI bus, to which most of the expansion cards in the PC also connect. (There are exceptions to this: some single-chip solutions now exist, such as SiS chipsets, and sometimes a separate dedicated bus links the NB and SB, as with the nForce chipsets, which use a HyperTransport link, and VIA chipsets, which use "V-Link.")

Why does this matter? Basically, the front side bus is the critical link between the CPU and the entire system. This means that the faster the FSB is, the faster the CPU can communicate with everything else in the system. If the CPU wants data or instructions from system RAM, that data travels over the FSB. If the CPU needs data from the hard drive, that data travels to the Southbridge, over the PCI bus to the Northbridge, and then over the FSB to the CPU. As you would expect, the faster this link is, the faster the system will be. If this is true, why would I say that a faster FSB gives diminishing returns in system speed beyond a certain point? I'll get to that.

(BTW for all you tech historians: There used to be a "back side bus" which linked the CPU to its Level 2 (L2) cache. The term is now obsolete, because just about every modern CPU since the Coppermine Pentium 3 core has had its L2 cache as part of the CPU itself, meaning the BSB is part of the CPU as well. If you want to get really technical, the term front side bus is no longer valid in its original context, because it is now the only bus.)

Perhaps the first thing I should cover when explaining why a faster FSB doesn't always bring a corresponding increase in system performance is the case where the CPU needs data from the hard drive. (This happens quite a bit when loading programs and when the data the CPU needs does not fit into system memory.) I'm sure all of you know that the hard drive is many orders of magnitude slower at transferring data than system memory is. The delay imposed by the data traveling over the FSB is nearly negligible compared to the time it takes the hard drive to retrieve information. This makes the FSB speed itself very much a non-factor in that case.

The next case is when the CPU needs data from main memory. There are two key concepts to understand here: "Latency" and "Bandwidth".

Latency is essentially the amount of time the CPU must wait between issuing a request for information and the moment the information actually becomes available to the processor. This time is generally measured in nanoseconds, but it's far more useful to look at it in terms of the clock cycles the CPU executes. This is because the CPU is essentially wasting time during the clock cycles it spends waiting for data and/or instructions from memory. I'll come back to this later, because it is probably the most important thing to understand.
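To put some rough numbers on that, here's the back-of-the-envelope math in C. (The clock speed and latency figures here are just illustrative values I picked, not measurements of any particular system.)

[code]
#include <stdio.h>

int main(void) {
    /* Illustrative values - not measurements of a real system. */
    double cpu_clock_hz   = 3.2e9;   /* a 3.2 GHz P4-class CPU          */
    double mem_latency_ns = 120.0;   /* a full trip out to main memory  */

    /* Every nanosecond spent waiting costs the CPU this many cycles: */
    double wasted = mem_latency_ns * 1e-9 * cpu_clock_hz;
    printf("Cycles wasted per memory access: %.0f\n", wasted);
    return 0;
}
[/code]

Run it and you get a few hundred cycles wasted per access - a few hundred opportunities to execute instructions, gone.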

Bandwidth is the amount of data that can be transferred in a given unit of time.

Let's look at this with a more intuitive example. Consider a highway where vehicles travel from one point to another. In this example, bandwidth is essentially the number of lanes on the highway, and latency is essentially its length. Let's say you have a contest to get the most vehicles from one end of the highway to the other. Unfortunately, only a certain number of vehicles can enter the highway per second. The start of the highway is roughly analogous to main memory in a computer; the end of the highway is the CPU itself, which is far faster than main memory. As you can well imagine, if you make the highway shorter (lower the latency) you can get more vehicles to the end (data to the CPU) in the same amount of time as a longer highway. Provided you can get enough vehicles onto the highway, having more lanes will also get more vehicles to the end. Consider this though: what happens when you have 800 lanes on your freeway, but only 400 cars can enter it at any given time? Basically, 400 lanes are wasted. (Ok, enough car talk.  I'm getting bored with it... )

Real memory in a computer cannot transmit data continuously. It takes a certain amount of time from when the CPU (or more correctly, the memory controller in the Northbridge acting on behalf of the CPU) requests data until the memory can begin sending that information. This amount of time is the memory latency. DDR SDRAM is arranged as a giant grid of rows and columns, and reading data from it takes a certain number of clock cycles to precharge the bank the data is in (precharge), a certain amount of time to activate the row the data is in (RAS - row address strobe), and a certain amount of time to access the column the data is in (CAS - column address strobe, a term most people who buy memory have heard). The final factor is the command rate (the time between issuing a command to memory and the command being executed, usually only a cycle or two). All of this together is what is known as memory latency. (You see it printed on memory and on review sites as a string of 4 or 5 numbers.) The lower the latency, the less time it takes for the memory to begin transferring data to the Northbridge.

DDR memory currently runs at 100 MHz (PC1600), 133 MHz (PC2100), 166 MHz (PC2700), and 200 MHz (PC3200) as standard rates, and the latency is measured in memory clock cycles. (You probably think I'm wrong here, and that PC3200 memory runs at 400 MHz. That's not actually true, and I'm getting to that.) DDR (double data rate) memory has the capability of transferring data on both the rising edge (low to high) of the clock pulse and on the falling edge (high to low). If it could do this all the time, it would match the bandwidth of regular SDRAM running at twice its clock speed, since regular SDRAM transfers data on only the rising (low to high) clock edge. This is why PC3200 is also known as DDR400: it is capable of transferring, at a maximum, at the same bandwidth as SDRAM running at 400 MHz. This also explains why you sometimes see DDR memory with a CAS latency of 2.5 cycles - it means the column can be accessed after 5 clock edges (rising or falling).

DDR memory can transfer data on both the rising and falling edges when it is performing a burst transfer of more than one location in memory, and most of the time it does, for a very good reason. Typically when a CPU wants data from memory, its next access will be from a location very close to that of the first. For this reason, SDRAM (and the older fast page memory) will transfer the entire contents of the memory row. This boosts performance, because if the CPU does end up needing data in the next cell, that data has already been transferred. If the next access turns out not to be from the same row, nothing is really lost, as the CPU just discards the data it doesn't need. Note that I've hugely simplified this. This is what is known as "spatial locality" in computer architecture classes, which basically says that most of the time a CPU will request data from memory in a location near its last access. SDRAM and DDR SDRAM assume this and simply transfer all the data near what the CPU requests. Wow, that's a lot of information to try to condense and "dumb" down, but hopefully those of you who stuck with it now better understand what memory latency is.
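If you want to see how those individual timings add up, here's a quick sketch in C. (The 3-3-3 timings and 1-cycle command rate are just example values for a hypothetical PC3200 module, not the spec of any real stick.)

[code]
#include <stdio.h>

int main(void) {
    /* Example timings for a hypothetical PC3200 module:           */
    /* precharge (tRP), RAS-to-CAS (tRCD), CAS (CL), command rate. */
    int precharge = 3, ras_to_cas = 3, cas = 3, command = 1;
    double clock_mhz = 200.0;   /* PC3200's actual clock: 200 MHz  */

    int cycles = precharge + ras_to_cas + cas + command;
    double ns = cycles * (1000.0 / clock_mhz);
    printf("Worst-case first access: %d cycles = %.0f ns\n", cycles, ns);
    return 0;
}
[/code]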

Now, let's briefly touch on bandwidth. Individual DDR memory modules in modern computers are 64 bits wide, meaning they transfer data in 64-bit chunks on both the rising and falling edges of their data clock. This is the amount of data transferred over a single channel. For a DDR400 module this bandwidth works out to 3.2 gigabytes per second. With two independent channels (dual channel) transferring data at the same time, it becomes 6.4 gigabytes per second.
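The arithmetic behind those numbers, as a quick C sketch (standard DDR400 figures, nothing exotic):

[code]
#include <stdio.h>

int main(void) {
    double clock_hz  = 200e6;  /* DDR400 clock: 200 MHz          */
    double transfers = 2.0;    /* two transfers per clock (DDR)  */
    double width     = 8.0;    /* 64-bit module = 8 bytes wide   */

    double single = clock_hz * transfers * width;
    printf("Single channel: %.1f GB/sec\n", single / 1e9);
    printf("Dual channel:   %.1f GB/sec\n", 2.0 * single / 1e9);
    return 0;
}
[/code]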

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #1 on: October 20, 2003, 04:15:38 PM »
(Too long for a single post)

Consider this: for best performance, the CPU's FSB should be capable of transferring data at the same rate at which it can be transferred, at a maximum, from main memory. For dual channel DDR400 (PC3200) memory, this is 6.4 GB/sec. The bandwidth offered by the P4 'C' type 800 MHz equivalent FSB is 6.4 GB/sec. You can imagine that if the FSB were slower than this, you'd have a traffic jam with dual channel DDR400 memory when both channels are transferring at the same time. This means you are losing performance. (This is why a 'C' type P4 performs best with dual channel PC3200 memory.)

The opposite case is also true: if the FSB is capable of transferring more data than the RAM can deliver, you aren't gaining much by having the capability to do so. This is the case on many systems. Consider a 'C' type P4 with an 800 MHz FSB, but using only PC2700 (DDR333) modules. If you neglect the influence of Hyperthreading, the 'C' type P4 will perform no better than the 'B' type P4! This is also true with Athlons. Thus we have two cases where we are losing potential performance: faster memory than FSB, or a faster FSB than memory. The best system memory performance occurs when the FSB is equal to (or faster than) the memory. If it is faster you don't gain much, though, and you may actually lose performance, because the Northbridge must wait until the next clock edge to transfer data between the CPU's FSB and the memory bus. This adds latency. This is why an Athlon 2500+ (333 MHz FSB) runs slower with DDR400 memory than it does with DDR333 memory. The Athlon architecture is very sensitive to latency, more so than the P4.
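You can check which side of the Northbridge is the traffic jam with the same kind of arithmetic. Here's a sketch in C for the 'C' type P4 with PC2700 example above (peak theoretical numbers only - real transfers never quite hit them):

[code]
#include <stdio.h>

int main(void) {
    /* P4 'C' FSB: quad-pumped 200 MHz, 64 bits wide.          */
    double fsb_bw = 200e6 * 4 * 8;
    /* Dual channel DDR333 (PC2700): 166.67 MHz, DDR, 64 bits. */
    double mem_bw = 166.67e6 * 2 * 8 * 2;

    printf("FSB: %.2f GB/sec   Memory: %.2f GB/sec\n",
           fsb_bw / 1e9, mem_bw / 1e9);
    printf("Bottleneck: %s\n", fsb_bw < mem_bw ? "the FSB" : "the memory");
    return 0;
}
[/code]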

I'm afraid I'm going to have to stop here for the night. It's 1:20 AM and I need to get up in 6 hours. At this point I haven't tied all the loose ends up, but I think you may begin to see where this is going. Tomorrow I'll try to post about neat little things like: Hyperthreading, integrated memory controller, hardware prefetch, cache memory influence, and if anyone actually reads and gets something from this, maybe more!

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #2 on: October 20, 2003, 04:19:46 PM »
Since it may be a while until I get around to this tomorrow, I think I should mention this:

The 'C' type P4 has an 800 MHz FSB, with enough bandwidth to handle the amount of data transferred from two channels of DDR400 memory. As I mentioned before, SDRAM-type memory transfers the entire row of data, which means that in a dual channel setup two entire rows of data will be sent for each memory request. If the CPU doesn't actually need all of that data, the benefit of transferring all of it - and thus the performance advantage of an 800 MHz FSB over a 400 MHz FSB (single channel DDR400) - is wasted. Remember that the CPU simply throws away what it doesn't need. To be technically correct, it stores all the data in its L2 cache (L3 as well, if it has one) until it runs out of room, at which point it dumps the oldest data. It also must discard portions or all of the data stored in cache when it writes back to memory. (If the CPU needs to write to memory, the data in the cache that was transferred from the location it wants to write to is no longer correct, and is discarded. The cache can also be flushed when the CPU switches processing from one thread to another. If this sounds related to Hyperthreading, it is, and more on that tomorrow...  )
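You can actually watch spatial locality pay off on your own machine. This little C program walks the same array twice: once row by row (each access lands right beside the last one, on data the memory system already sent over) and once column by column (every access jumps to a different row). The exact timings will vary with your CPU and memory, and a clever compiler may shrink the gap, so treat it as a demonstration rather than a benchmark:

[code]
#include <stdio.h>
#include <time.h>

#define N 2048
static int grid[N][N];   /* 16 MB - much bigger than any cache */

int main(void) {
    long sum = 0;   /* summed so the loops aren't optimized away */

    clock_t t0 = clock();
    for (int r = 0; r < N; r++)        /* row by row: every access lands */
        for (int c = 0; c < N; c++)    /* beside the previous one        */
            sum += grid[r][c];

    clock_t t1 = clock();
    for (int c = 0; c < N; c++)        /* column by column: every access */
        for (int r = 0; r < N; r++)    /* jumps to a different row       */
            sum += grid[r][c];
    clock_t t2 = clock();

    printf("row-major: %ld ticks, column-major: %ld ticks (sum %ld)\n",
           (long)(t1 - t0), (long)(t2 - t1), sum);
    return 0;
}
[/code]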

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
CPUs
« Reply #3 on: October 20, 2003, 04:42:10 PM »
I think at this point it is best to try to explain just what a CPU does, and how the two most common CPU types - Athlons and Pentium 4s - actually go about doing their job.  I think if you can understand how each CPU type works, you will understand just how hard it is to compare the two.  I think I can really simplify this and still get the main points across, but be aware that there are a lot of special cases and exceptions to everything.

First of all, let's talk about what a CPU does.  The simplest definition would be that a CPU executes instructions and makes decisions based on the results of those instructions.  There are quite a few types of instructions that a CPU can execute:

Arithmetic instructions - These instructions are basically addition, subtraction, multiplication, division, and a few others (sine, cosine, tangent, square root, etc.).  Arithmetic instructions fall into two types, integer and floating point.  Integer instructions are those that act on whole numbers, 1 + 2 for example.  Floating point instructions are those involving fractions, 1.234 + 2.954 for example.

Memory load and store instructions - These are exactly what they sound like.  These instructions read or write to memory (that can be any type of "memory" in a system - RAM, hard drive, video card, other expansion cards, etc.).

Branch instructions - These instructions essentially allow the CPU to make decisions.  For example, if A + B is greater than 10, execute instruction C, but if A + B is not greater than 10, execute instruction D.  It is these instructions that really make a CPU, and thus the PC, more than a simple calculator.  They give the CPU the ability to make decisions based on the results of other calculations.
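To make those three categories concrete, here's a trivial C function with each line labeled by the kind of instruction it mostly turns into. (The mapping is loose - the compiler is free to arrange things differently.)

[code]
#include <stdio.h>

/* Each line is labeled with the kind of instruction it mostly becomes. */
int decide(int a, int b) {
    int c = a + b;            /* integer arithmetic                   */
    double z = c * 1.5;       /* floating point arithmetic            */
    if (z > 10.0)             /* branch: compare, then jump           */
        c = c - 1;
    else
        c = c + 1;
    return c;                 /* loads/stores happen throughout as    */
}                             /* values move between memory and CPU   */

int main(void) {
    printf("%d\n", decide(4, 8));  /* 4+8=12, 12*1.5=18 > 10, so 11 */
    return 0;
}
[/code]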

Both the Athlon and the Pentium 4 use the same instruction set, x86, which means they can run the same programs.  The basic steps they go through to execute those instructions are the same as for every other CPU out there.  All CPUs do 4 basic things to execute an instruction:  Fetch, Decode, Execute, and Retire.  The Athlon and Pentium 4 just do these 4 basic tasks very differently.

(I'll continue this later on today.)

Offline mrblack

  • Parolee
  • Gold Member
  • *****
  • Posts: 2191
PC Architecture
« Reply #4 on: October 21, 2003, 01:15:19 AM »
Thx darn good read.
Even for an MCSE and A+ dude like myself there is always neat stuff to learn.
Heck I forgot most of what i learned in school anyway:D

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #5 on: October 21, 2003, 02:13:45 AM »
What do I mean when I say "fetch", "decode", "execute", and "retire"?  Actually this is pretty simple.

Fetch - Retrieve the next instruction from memory.  Pretty obvious...

Decode - Essentially, figure out what the instruction is.  
This step has become much more important with recent CPUs.   Even though both Athlons and Pentium 4s execute x86 instructions, internally they break x86 instructions up into a bunch of smaller, simpler tasks.  I'll give you an example:  let's say you get an instruction asking you to add A to B and store the result in C.  To execute an instruction like this, the following steps must be done:  get the data in location A, get the data in location B, add A and B together, and store the result in location C.  Notice that there are 4 smaller operations that must be carried out to do A + B = store in C.  I'll come back to this later, but consider for a minute what happens if you have the capability to do more than one operation at once.  You could actually get the data in locations A and B at the same time, but you can't add the two together until you have them both.  If your CPU is capable of performing more than one task at the same time, you save a step: get both A and B at once, add them in the next step, and then write the result to C.  This is critical to realize, because modern CPUs CAN execute more than one operation at the same time, and it is ABSOLUTELY vital for them to properly "schedule" (this is the technical term - pretty obvious what it means) these operations to make the most use of the processor's resources.  This is an area where big differences exist between the Athlon and Pentium 4.  I'll get back to this later.
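Here's a toy version of that scheduling idea in C: the four micro-ops for "A + B, store in C" with their dependencies, where anything with no unfinished dependencies can start. The one-cycle-per-micro-op assumption is mine, purely to keep the example simple:

[code]
#include <stdio.h>

int main(void) {
    /* Toy schedule for "A + B, store in C": four micro-ops.            */
    /* A micro-op can start once everything it depends on has finished. */
    enum { LOAD_A, LOAD_B, ADD, STORE, NOPS };
    const char *name[NOPS] = { "load A", "load B", "add", "store C" };
    int deps[NOPS][2] = { {-1,-1}, {-1,-1}, {LOAD_A,LOAD_B}, {ADD,-1} };
    int done[NOPS];

    for (int i = 0; i < NOPS; i++) {
        int start = 0;  /* earliest cycle this micro-op may begin */
        for (int d = 0; d < 2; d++)
            if (deps[i][d] >= 0 && done[deps[i][d]] > start)
                start = done[deps[i][d]];
        done[i] = start + 1;  /* assume each micro-op takes one cycle */
        printf("%-7s finishes in cycle %d\n", name[i], done[i]);
    }
    printf("Total: %d cycles instead of 4 strictly serial ones\n", done[STORE]);
    return 0;
}
[/code]

It prints both loads finishing in cycle 1, the add in cycle 2, and the store in cycle 3 - three cycles instead of four, exactly the step we saved.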

Execute -  Carry out the operation.  This would be the step in which my previous example retrieved A and B and added them.  The Athlon and Pentium 4 differ greatly both in the number of cycles various instructions take to execute and in the number of instructions they can execute at once.

Retire -  This isn't really obvious, and some people will group this under Execute, but I prefer it to be thought of as a separate step.  Basically this is the step where the CPU writes the result back into memory.  (Store C in my example.)

Ok, now that we understand the basic steps necessary to execute an instruction, we can begin to get a feel for just how vastly differently the Athlon and Pentium 4 go about accomplishing these 4 things.  The Athlon and Pentium 4 belong to two essentially different schools of thought on how to maximize performance.  Personally, I feel that both approaches have their own unique advantages and disadvantages, so right off I should say that both are equally valid ways of achieving maximum performance.

Let's introduce the concept of "pipelining".  Pipelining can be thought of as roughly the same thing as an assembly line.  Your instruction always has to go through the 4 steps (fetch, decode, execute, retire), so the most obvious number of stages to have in your pipeline is 4.  You do one of these steps every clock cycle, so if your CPU is only capable of executing one instruction at a time, it takes 4 clock cycles to execute a single instruction.  This is actually the case with most inexpensive microcontrollers, which are found in just about everything these days.  (Some of you may have heard of or played with PIC microcontrollers.  These cheap little microcontrollers take 4 clock cycles to execute each instruction.)  Unfortunately, having only 4 stages in your pipeline severely limits your maximum clockspeed, as the decode and execute stages typically take longer than the other two.  Your CPU's clock can't run any faster than its slowest pipeline stage allows.  For example, if your fetch stage takes 0.1 seconds, decode takes 1/2 second, execute takes 1 second, and retire takes 0.1 seconds, your maximum clock rate is only 1 Hz, because the execute stage is the limiting factor at 1 second per instruction.  As you might be catching on by now, if you break execute up into more than one pipeline stage, you can achieve higher clockspeeds and still get the same level of performance.
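My 1 Hz example, done as a few lines of C so you can see the effect of splitting the slow stage (the stage delays are the made-up ones from the paragraph above):

[code]
#include <stdio.h>

int main(void) {
    /* Stage delays, in seconds, from the example above. */
    double stage[4] = { 0.1, 0.5, 1.0, 0.1 };  /* fetch, decode, execute, retire */

    double slowest = 0.0;
    for (int i = 0; i < 4; i++)
        if (stage[i] > slowest)
            slowest = stage[i];

    /* The clock can tick no faster than the slowest stage finishes.   */
    printf("Max clock with 4 stages: %.1f Hz\n", 1.0 / slowest);

    /* Split the 1 s execute stage into two 0.5 s stages and decode    */
    /* (0.5 s) becomes the new limit - the clock rate doubles to 2 Hz. */
    printf("Max clock after splitting execute: %.1f Hz\n", 1.0 / 0.5);
    return 0;
}
[/code]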

Now let's talk about modern CPUs.  The Athlon has a 10-stage integer pipeline (the Athlon 64 has 12 stages) and the Pentium 4 has a 20-stage integer pipeline.  This means that the 4 main tasks I detailed above are broken up into smaller tasks.  Notice just how long the pipeline is in the Pentium 4 compared to the Athlon.

Now things are going to start to get more technical, and I'll do my best to try to keep things as basic as possible.  (Please do post if you have questions; I'm sure others will be wondering the same things you are.)

Assume for a moment that the Athlon can only execute one instruction at a time and every stage in the pipeline is actually doing something.  This means that every clock cycle a finished instruction completes stage 10 of the pipeline (ignoring the very first 9 clock cycles).  For the P4, finishing stage 20 completes the instruction.  Now, we all know that AMD uses a rating system and the XP 3200+ runs at a true clockspeed of 2.2 GHz.  The current top-end P4 runs at 3.2 GHz.  If these two CPUs could only execute 1 instruction at a time and every stage in their pipelines was busy, we could conclude that the Pentium 4 is completing 3.2 billion instructions per second and the Athlon only 2.2 billion.  Since having a longer pipeline allows a CPU to run at a higher clockspeed, with our assumptions in place the Pentium 4 walks all over the Athlon in performance.  We can also conclude something else from this simplest of examples:  each instruction in the Pentium 4 takes 20 cycles at 3.2 GHz to complete, and each instruction in the Athlon takes 10 cycles at 2.2 GHz.  The interesting thing to note here is that, from start to finish, the Athlon completes an individual instruction in less total time.
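Here's that comparison as a small C calculation, under the same (very idealized) assumption that one instruction completes every cycle:

[code]
#include <stdio.h>

int main(void) {
    /* Idealized: every stage busy, one instruction completes per cycle. */
    double p4_hz = 3.2e9;  int p4_stages = 20;
    double xp_hz = 2.2e9;  int xp_stages = 10;

    printf("P4:     %.1f billion instr/sec, %.2f ns start-to-finish\n",
           p4_hz / 1e9, p4_stages / p4_hz * 1e9);
    printf("Athlon: %.1f billion instr/sec, %.2f ns start-to-finish\n",
           xp_hz / 1e9, xp_stages / xp_hz * 1e9);
    return 0;
}
[/code]

It prints 6.25 ns per instruction (start to finish) for the P4 and about 4.55 ns for the Athlon, even though the P4 finishes more instructions per second.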

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #6 on: October 21, 2003, 02:15:11 AM »
Now let's throw two gigantic monkey wrenches into the equation:
1.  Real CPUs execute more than one instruction at once.
2.  Not every stage in the pipeline can be working every clock cycle.

Number 1 might seem simple enough, but you might be asking why about number 2.  Consider this: since a CPU has branch-type instructions, which depend on the results of previous instructions, you must know the final result of those instructions before executing the branch.  Put another way, consider the A + B, store in C example above.  Let's say the next instruction is:  if C is greater than 100, subtract 1 from B; if C is less than 100, add 1 to B.  If you don't know the result of A + B, you don't know whether you should add or subtract 1 from B in the next instruction.  This is bad.  (Really bad if your pipeline is long.)  You can't begin working on the branch instruction until the A + B instruction is done.  Since you can't do the branch instruction, you also don't know whether your next instruction is to add 1 to B or to subtract 1 from B.  This means the A + B instruction must go through all 10 or 20 stages of your pipeline before the branch instruction can even start, and you need to wait another 10 or 20 cycles to start the instruction after that.  As you can imagine, a branch instruction has the potential to hurt the P4 a LOT worse than the Athlon.  In this (very simple) example the P4 will waste almost 40 clock cycles with nothing to do, waiting for the results of other instructions; the Athlon would only waste about 20.  The technical term for a pipeline stage with nothing to do is a pipeline stall, or bubble.  In this scenario, with these assumptions, the Athlon will be faster.  If only it were this simple though.  

One of the main jobs of CPU designers is to come up with ways to keep the CPU as busy as possible.  One method all modern CPUs since the Pentium Pro (which later became the P2 and P3) have employed is "branch prediction."  The idea here is actually really simple and very smart: make an educated guess about the result of the branch instruction and act accordingly.  If you assume A + B will be greater than 100, you can assume that you will be subtracting 1 from B.  Rather than have stages in your pipeline doing nothing, if you can track and execute more than one instruction at once, you can just assume you will be subtracting 1 and check whether your assumption was true once A + B completes.  If you guessed correctly, you keep your predicted subtract-1 result.  The advantage is that the predicted instruction can be nearly finished in the pipeline by the time the A + B result is finally known.  This means that if you guess correctly, you haven't wasted any clock cycles waiting for A + B to finish executing.  If you guess wrong, you just discard the predicted result and execute the correct instruction.  Basically, with branch prediction you have a lot to gain and nothing to lose.  As you can imagine, the Pentium 4 devotes considerable resources to the prediction of branch instructions.  (SSE2 even adds instructions which tell the P4 that a branch is "strongly taken", "weakly taken", "weakly not-taken", or "strongly not-taken".)  Both the Athlon and Pentium 4 employ very advanced branch prediction schemes that track the history of similar branches and guess whether a branch will be taken or not.  Back in the days of the P2 you basically just assumed a branch would be taken and acted accordingly.  The Pentium 4 lives or dies by its branch prediction unit's success in correctly guessing which instruction to execute next.  The Pentium 4 and Athlon can typically predict branches with well over 90% accuracy.  Even still, you can probably imagine that extremely branch-intensive code will execute faster on an Athlon than on a P4.
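You can demonstrate this on a real machine.  The classic experiment is to run the exact same branchy loop over random data and then over sorted data - sorted data makes the branch almost perfectly predictable.  This C sketch shows the idea; the array size, the 50 repetitions, and the 128 threshold are arbitrary choices of mine, and the measured gap will vary by CPU:

[code]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)   /* about a million elements */
static int data[N];

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* The if() below is the branch under test. */
static long count_big(void) {
    long hits = 0;
    for (int rep = 0; rep < 50; rep++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)
                hits++;
    return hits;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;   /* random: branch taken ~half the time */

    clock_t t0 = clock();
    long r = count_big();         /* predictor has to guess blind */
    clock_t t1 = clock();

    qsort(data, N, sizeof(int), cmp);  /* same data, now sorted */

    clock_t t2 = clock();
    long s = count_big();         /* same branch, now nearly always predicted */
    clock_t t3 = clock();

    printf("random: %ld ticks, sorted: %ld ticks (hits %ld vs %ld)\n",
           (long)(t1 - t0), (long)(t3 - t2), r, s);
    return 0;
}
[/code]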

Moving on to something else I touched on above when I said that the Athlon and P4 can execute more than one instruction at once.  In actuality, the Athlon can track and execute 6 (!!!) instructions at once: specifically, 3 of these can be integer or floating point instructions and 3 can be memory read/write (address) instructions.  The Pentium 4 can only execute 2 integer or floating point instructions at the same time, plus 2 address instructions.  This might seem like a huge advantage for the Athlon, and it is definitely a strong point of the architecture, but unfortunately it is quite rare that 6 instructions can actually be executed at the same time, for many different reasons.  You might remember that I talked about both floating point and integer operations.  (Floating point numbers are those like 1.023, and integers are obviously just that, whole numbers like 1.)  There are quite a few x86 instructions that the CPU internally executes as a mix of both integer and floating point operations.  The current Athlon and Athlon XP can process either 3 FP or 3 integer operations at the same time, but not both.  This means that if an x86 instruction involved 1 integer add and 2 floating point operations, the integer operation would have to wait for the next clock cycle to begin.  The Athlon 64 does not have this limitation.

Since we are on the subject of floating point, let's go ahead and talk some more about it.  In a CPU, floating point operations are executed by a special unit, the FPU (floating point unit).  Floating point operations, besides mathematical operations involving a decimal point, include x87 (the standard FP operations every modern CPU can execute) and special instruction sets like MMX, SSE, SSE2, and 3DNow!.  The FPU has 3 main components: one unit handles floating point additions and subtractions, another handles multiplication and division, and the final unit handles FP memory operations (load/store).

The FPUs of the Athlon and the P3/P4 have some major differences.  The Athlon's FPU is what is known as "fully pipelined": the add/subtract, multiply/divide, and load/store units are separate from each other and can all work at the same time.  The FPU in the P3 and P4 is not fully pipelined; the multiply/divide unit must make use of the add/subtract unit to execute multiply and divide instructions.  A multiply instruction can (and almost always does, unless the number is being multiplied or divided by a power of 2) take a lot longer to execute than an addition or subtraction.  In the Athlon, the multiply/divide unit can be busy processing a multiply while the add/subtract unit is busy processing add or subtract instructions.  Like the integer pipelines described above (10 stages in the Athlon, 20 in the P4), the FPU itself also has several stages; the Athlon has 15 stages in its multiply/divide unit.  Unfortunately I don't know the exact number in the P4, but I do know there are more stages than that.  Since the Athlon can do adds and subtracts while working on a multiply, it has the capability to schedule up to 32 floating point instructions to maximize this capability.  (In case you were wondering, trig instructions and some of the others are often broken up into simpler instructions involving all three units, or the result is retrieved from a table.)  When it comes to raw x87 FP performance, the Athlon can literally run circles around the P3 and P4.

Making use of the Pentium 4's special SSE2 instructions can make up for this performance gap, however.  SSE and SSE2 are special instruction sets introduced with the P3 and P4 respectively.  These are what's called SIMD (single instruction, multiple data) instructions: they can greatly speed up the execution of code that performs the same basic operation on many different pieces of data, essentially cutting down on the number of instructions needed.  (Video encoding applications are probably the best candidate for this.)  The Athlon XP can make use of SSE instructions, but cannot execute SSE2 instructions.  This lets the P4 catch up to the Athlon, and in many cases surpass it, when code is specially optimized to make use of SSE2 instructions.  One of the key improvements in the Athlon 64 is that it can now execute SSE2 instructions.  Unfortunately for AMD, since SSE2 instructions were designed for the P4, the Athlon 64 doesn't gain as much from them, percentage-wise, as the P4 does.  (You might wonder why Intel didn't make the FPU in the P4 fully pipelined.  The reason was primarily cost savings: sharing hardware between the add/subtract and multiply/divide units saves a lot of space on the die.)
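To show what SIMD means in practice, here's a minimal SSE2 sketch using C intrinsics (the way compilers expose these instructions; in 2003 you'd be just as likely to see hand-written assembly).  One ADDPD instruction adds two doubles at once, so four additions take two instructions instead of four x87 adds:

[code]
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void) {
    double a[4] = {  1.0,  2.0,  3.0,  4.0 };
    double b[4] = { 10.0, 20.0, 30.0, 40.0 };
    double c[4];

    /* x87 style would be four separate adds.  SSE2's ADDPD adds  */
    /* two doubles per instruction, so this loop runs just twice. */
    for (int i = 0; i < 4; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
    }

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}
[/code]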

I wish I could think of a way to explain all of the above more clearly, so please do ask questions.  In general, the following is true:  the Athlon gets more work done per clock (a higher IPC - instructions per clock) than the P4 does.  Of course, the P4 runs at a higher clockspeed.  Probably the simplest analogy I can come up with is that of auto engines.  The Athlon is a big V8 running at 3000 RPM and the P4 is a 4 cylinder running at 6000 RPM.  They can both put out the same amount of horsepower, but the 4 cylinder has to rev higher to do it.
« Last Edit: October 21, 2003, 02:29:28 AM by bloom25 »

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #7 on: October 21, 2003, 02:30:53 AM »
(Teaser post, besides I've got 6 hours to wait while 1.9901 downloads at 28.8... :( )

I think I'll leave you guys with this, just to relate everything back to the FSB posts at the beginning of this monster of a thread.  Do you think the Athlon would be more sensitive to latency (the time between requesting and getting data, or requesting a write and writing the data) or to bandwidth?  Remember that it can execute 6 instructions at the same time, versus the P4's 4.  It also has a shorter pipeline, meaning it takes fewer clock cycles to finish an instruction.  I'm sure that if you think about it, you'll realize that latency is very important to both the P4 and the Athlon, but the Athlon really needs low latencies for top performance.  You can't get much work done if you can't get all those multiple instructions started, or written back to memory when finished, as quickly as possible.  Is it any wonder that AMD removed the memory controller from the Northbridge on the Athlon 64 and placed it on the CPU?  This drastically reduces latency, keeping as many of those multiple execution units busy as much of the time as possible.

On the plus side, now that we've discussed how the Athlon and P4 go about executing instructions, we can talk a little bit about some of the neat tricks that modern CPUs use to keep all those functional units and pipeline stages busy as much as possible. :)  I'll start covering that tomorrow.  I might also have time to talk about the tradeoffs of one design philosophy over the other.

Offline boxboy28

  • Gold Member
  • *****
  • Posts: 2265
      • http://none
PC Architecture
« Reply #8 on: October 21, 2003, 03:22:33 PM »
Bloom you are my Hero - several excellent posts!

On another note, is LAZ running an AMD or an Intel?

"Probably the simplest analogy I can come up with is that of auto engines. The Athlon is a big V8 running at 3000 RPM and the P4 is a 4 cylinder running at 6000 RPM. They can both put out the same amount of horsepower, but the 4 cylinder has to rev higher to do it."  

LOL well im an AMD fanboy so i had to!
:aok
^"^Nazgul^"^    fly with the undead!
Jaxxo got nice tata's  and Lyric is Andre the giant with blond hair!

Offline Thorns

  • Nickel Member
  • ***
  • Posts: 429
      • http://members.cox.net/computerpilot/
PC Architecture
« Reply #9 on: October 21, 2003, 09:26:35 PM »
Thanks Bloom, good stuff.  Now why doesn't someone buy Win98 from Microbloat, and keep improving it?

Thorns

Offline Flacke

  • Zinc Member
  • *
  • Posts: 1
PC Architecture
« Reply #10 on: November 01, 2003, 02:45:04 PM »
Wonderful post Bloom, lots of work for you but a lot of learning for me and others. Thanks a lot.:aok

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #11 on: November 01, 2003, 04:38:27 PM »
I was thinking about adding a little more about things like hyperthreading, SIMD instructions, hardware prefetch, etc.  Are any of you interested in reading it?

Offline Mini D

  • Parolee
  • Platinum Member
  • ******
  • Posts: 6897
      • Fat Drunk Bastards
PC Architecture
« Reply #12 on: November 01, 2003, 04:48:56 PM »
The gauntlet has been thrown down.  Out-geek this guy skuzzy.

MiniD

Offline Roscoroo

  • Plutonium Member
  • *******
  • Posts: 8424
      • http://www.roscoroo.com/
PC Architecture
« Reply #13 on: November 01, 2003, 05:26:11 PM »
Go ahead Bloom .... I'm reading :aok
Roscoroo ,
"Of course at Uncle Teds restaurant , you have the option to shoot them yourself"  Ted Nugent
(=Ghosts=Scenariroo's  Patch donation

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #14 on: November 01, 2003, 06:36:32 PM »
He can try MiniD, but if he does I'll be forced to post pictures of my alarm clock.  (Actually to get the full effect I'd need to make an AVI.)

Only I would spend almost $100 designing and building an alarm clock with the following features:

1.  Contains a 90 second digital voice IC with 22 Homer Simpson quotes I recorded.  The alarm is Homer screaming and then yelling "DOH!" until I turn off the alarm.   It also randomly plays about 20 or so different "Homerisms" plus other things when the alarm time is programmed.  (It can actually be programmed to say anything I want.)

2.  It can turn on my computer in the morning when the alarm goes off.

3.  Yellow backlight LCD display.

4.  Accurate to around 10 seconds a month or so.

5.  Nobody else on the planet has one. ;)

It only took me 2 days (literally 20+ hours) to solder it all together, not counting the time it took me to write 1000+ lines of assembly.
« Last Edit: November 01, 2003, 06:43:09 PM by bloom25 »