Author Topic: PC Architecture  (Read 2069 times)

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #30 on: November 04, 2003, 06:10:28 PM »
Just run the memory at 333 MHz and you'll be fine.  With DDR400 memory you can try overclocking the FSB and still keep the memory within spec, or you can try more aggressive memory timings while running it at 333 MHz.  You can also reuse DDR400 memory if you ever move up to a 3200+ or an Athlon 64.

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #31 on: November 04, 2003, 07:54:34 PM »
This time around I'll talk about threading and "Hyperthreading."  First of all, what is a thread?  It's probably best to answer that by first talking about a "process."  A process is, in general, a program.  Each process can be made up of a series of separate tasks, and each of those tasks is known as a thread.  So basically a thread is a task, and each running process can involve anywhere from one to many threads.  It is the job of the operating system to divide the processor's time among all of those threads (though each thread can have a different priority level).  Each thread is given a slice of time to work, then processing on that thread stops and another thread gets a time slice.  This is generally done in one of two ways, but I'll get back to that in a second.

It is important to realize that a typical processor can only process one thread at a time.  Even though it appears a computer is running several programs at once, the CPU is actually only running one thread at any given moment, and the operating system switches between threads to make it seem like multiple programs (processes/threads) are running at the same time.  This switching of threads is usually called a "context switch."  When the CPU switches from processing one thread to another, it must first save the intermediate results of what it was working on to memory before it can begin processing the second thread.  Depending on the CPU, a context switch can be quite painful to overall system performance.  Since the CPU basically has to stop what it is doing and save the intermediate results to memory (the data being processed, along with processor status flags and the contents of several key registers within the CPU itself), there is a substantial performance hit.  It's worth mentioning that a processor with a longer pipeline is, in general, going to take a bigger performance hit from a context switch than one with a shorter pipeline.  That's because most of those pipeline stages sit idle while the CPU saves results to memory and prepares to execute the next thread.

As I said above, the operating system can handle the switching in two ways:

1.  It can depend on each process to share nicely with the other programs that are running.  The operating system still divides up processor time, but does not strictly enforce the switching between threads; it simply requests that the currently running thread stop executing so another thread can be processed.

2.  The operating system itself sets how long each thread has to be processed and enforces the switch between threads.  Each thread has no idea how many other threads are running, and to each thread it appears that it has the full resources of the CPU while it is running.

Modern operating systems generally use the second approach, and for a very good reason: a badly behaved thread under approach one can make the whole system seem to hang.  Macintosh OS 9 and below used method 1.  (Any surprise then why they seemed to hang when a program crashed or a driver misbehaved... ;) )  Unfortunately approach 2 does sacrifice a bit of performance, as the overhead of the operating system enforcing the thread switching results in a small performance hit.
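
To make the process/thread idea concrete, here's a tiny sketch (modern C++, purely illustrative; the names are made up): one process, two threads, and the operating system deciding when each one gets CPU time.

[code]
// One process, two threads.  The OS scheduler decides when each thread
// gets a time slice on the CPU; on a single CPU they take turns.
#include <cstdio>
#include <thread>

void count_down(const char* name, int n) {
    for (int i = n; i > 0; --i)
        std::printf("%s: %d\n", name, i);   // the "work" this thread does
}

int main() {
    std::thread a(count_down, "thread A", 5);   // first thread of the process
    std::thread b(count_down, "thread B", 5);   // second, independent thread
    a.join();   // wait for both threads to finish
    b.join();
    return 0;
}
[/code]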

Ok, now that we've got an idea of threading and how a CPU switches between threads, let's quickly get back to something I briefly touched on in a previous post.  Modern CPUs can issue and execute multiple instructions at once.  If one instruction does not depend on another instruction, it is possible to execute them at the same time.  Unfortunately, if one instruction does depend on the results of another, processing cannot complete on that instruction until the results of the instruction it depends on are known.  It's probably not hard to imagine that instructions from one thread are far less likely to depend on instructions from a totally different thread.  This means that a CPU (or a pair of CPUs) working on more than one thread at once can deliver higher performance whenever dependencies within a single thread keep the CPU from issuing as many instructions as it is capable of.  Remember from above that a P4 can issue and execute at most 4 instructions at the same time, and the Athlon 6.  (Those 4 or 6 instructions must fall into certain types to actually execute that many at once, but we won't get into that.)  If the CPU is capable of issuing and executing more instructions than it is actually executing, it is wasting valuable resources.  The number I generally see floating around is that a typical x86 CPU averages about 2.5 instructions executed at the same time.  That means that much of the time neither the P4 nor the Athlon can run at peak efficiency, because for one reason or another they can't execute 4 (or 6) instructions at once.
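
To put the dependency idea in source-level terms (just a toy C++ sketch; the CPU actually sees this as machine instructions, not C statements):

[code]
#include <cstdio>

int main() {
    int a = 1, b = 2, c = 3, d = 4;

    // Independent: neither result depends on the other, so a superscalar
    // CPU is free to execute these two operations at the same time.
    int x = a + b;
    int y = c * d;

    // Dependent: z needs both x and y, so it can't complete until the
    // two instructions above have finished.
    int z = x - y;

    std::printf("%d\n", z);
    return 0;
}
[/code]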

Now let's talk about what Intel calls "Hyperthreading."  Hyperthreading essentially fools the operating system into believing that a single CPU is actually two separate CPUs.  (The 3.06 GHz 'B' type P4 and all 'C' type P4s are Hyperthreading capable.)  This allows a Hyperthreaded P4 to be fed more than one thread at once, and if the CPU has free resources it can use them to execute instructions from a second thread.  Basically, Hyperthreading allows the CPU to use otherwise idle resources to work on a separate thread, and it also reduces the performance penalty of a context switch, since the entire CPU need not sit essentially idle while one happens.  (Technically this is a reduction in latency, the amount of time it takes to execute a context switch.)  The net effect, from an end user perspective, is that the system is more responsive and feels quicker when executing multiple tasks (i.e. running more than one program at the same time).  Describing the technical details of how this is done is far beyond the scope of this post, but basically certain key portions of the CPU are duplicated and other key portions (most importantly the execution units and cache memory) are shared.

Unfortunately there are drawbacks to Hyperthreading, and I'm sure some of you have noticed that running several benchmarks with Hyperthreading enabled results in slightly lower scores.  That's mainly because not all the resources of the CPU are devoted solely to running the benchmark, so the score drops slightly.  There are a couple of reasons for this.  Probably the biggest is that cache memory requirements jump significantly when working with multiple threads.  Cache memory works on the principle that a thread will tend to work with portions of memory relatively close to each other in physical location.  When processing multiple threads this assumption doesn't hold as well, since each thread may be working with portions of memory far from each other.  This reduces the effective amount of cache memory each thread has to work with, and in some circumstances results in far more accesses to main memory (which take a lot of time) than would have occurred with Hyperthreading disabled.  Also, if a thread would fit completely within cache with Hyperthreading disabled and won't with it enabled, you will see a very significant performance hit, again because the number of accesses to main memory goes up significantly.  Since the execution units are also shared, there can be other complications.  For example, the P4 does not have a fully pipelined floating point unit like the Athlon, and for top performance it must alternate multiply (division) and addition (subtraction) operations.  (A fully pipelined floating point unit has totally separate multiplication and addition units, meaning they don't share resources and can work at the same time.)  Many P4-optimized programs know this and properly alternate add/multiply instructions for top performance.  If multiple threads are executed, since each thread is fooled into believing it is the only thread running, these optimizations can't be as effective.  If I had to guess at the main cause of the drop in benchmark scores with Hyperthreading turned on, I would say the cache memory issue is by far the more critical restriction.
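
Purely as an illustration of the alternating add/multiply idea (a toy C++ loop; in reality the compiler and the CPU's own scheduler handle this at the machine-instruction level):

[code]
#include <cstdio>

// Illustrative only: each loop iteration contains an independent addition and
// an independent multiplication, so a CPU with separate add and multiply units
// can keep both busy at once instead of stalling one of them.
int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float sums[8], prods[8];

    for (int i = 0; i < 8; ++i) {
        sums[i]  = a[i] + b[i];   // work for the addition unit
        prods[i] = a[i] * b[i];   // independent work for the multiplication unit
    }

    std::printf("%f %f\n", sums[0], prods[0]);
    return 0;
}
[/code]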

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #32 on: November 04, 2003, 07:55:17 PM »
I'm sure many of you already know that the successor to the current Northwood P4 is called Prescott.  Among the improvements in Prescott is supposedly better Hyperthreading performance.  The biggest improvement is undoubtedly that the amount of cache memory has been doubled compared to Northwood: 1 MB of L2 cache, plus a doubling of the L1 data cache and the trace (instruction) cache.  This will definitely improve Hyperthreading's effectiveness.  Prescott also includes new instructions (the "Prescott New Instructions," a.k.a. SSE3), a few of which are specifically designed to increase Hyperthreading performance.  (I haven't studied them in detail yet, but my guess would be that they allow one thread to temporarily halt execution of another thread.)  It's also possible to specially optimize applications to take better advantage of Hyperthreading as well.

I think I'll also quickly note that AMD has not, as of yet, decided to implement some version of Hyperthreading in their new CPUs.  I can't say that I blame them, because the Athlon, and especially the Athlon 64, wouldn't gain nearly as much from it with current software.  An on-die memory controller, as noted in previous posts, greatly reduces memory access latencies, which reduces the performance lost when executing a context switch.  The Athlon and Athlon 64 also have significantly shorter pipelines than the P4, again reducing the advantage of Hyperthreading a bit.  (Athlon - 10 stages, Athlon 64 - 12 stages, P4 - 20 stages.)  In addition, one of the new pipeline stages in the Athlon 64 analyzes instruction dependencies to try to schedule instructions to take better advantage of the CPU's resources.  However, as new software begins to take better advantage of Hyperthreading, I would not be surprised to see AMD eventually come up with some way to gain a bit of performance from a similar approach in future CPUs.  (I'm not even considering any Intel patents on the technology.)

Offline beet1e

  • Persona Non Grata
  • Platinum Member
  • ******
  • Posts: 7848
PC Architecture
« Reply #33 on: November 05, 2003, 04:20:32 AM »
Very interesting posts, Bloom25. Thanks for your advice, and to the other guys who answered my query. Last time I put a system together, the FSB and clock settings were in the BIOS (Asus A7V133). I didn't attempt any overclocking, and made no mods to the default speeds. All was well, with FPS in AH being 50-60 typically. As AH is about the most demanding app that I'm running right now, I hope to be OK with what I've bought. I'll look in the mobo manual to find out how to change settings, and if need be will post back.

As far as I can tell, the XP2600 I have is the thoroughbred, not the Barton. I got the 2600 even though I could have bought the 3200, because the 3200 cost about 5 times as much at the time! In the past few months it's dropped from £364 to £261 inclusive of tax. The 2600 I now have was about £75. I chose to apportion the main expense to the Radeon 9800 Pro vid card. Not much change out of £300. :(

Bloom. I was interested to read about multithreading. The same thing has existed on mainframes since the 70s, possibly earlier. But can you now explain these "Dual Processors" being offered by AMD? Is this a hardware function to allow two threads to run at the same time? I bought single, not dual...

Offline jonnyb

  • Nickel Member
  • ***
  • Posts: 593
PC Architecture
« Reply #34 on: November 05, 2003, 12:16:45 PM »
Intel's hyperthreading model attempts to mimic a dual processor system.  It does this as bloom described in his posts.  In a real dual processor system (Intel Xeon, AMD Opteron, etc) there is no need for that mimicry.  The operating system sees the two processors and each gets its own processes to work on.  Unlike the hyperthreaded model, a dual processor system can truly work on multiple processes in parallel.

The advantages of having two processors are numerous.  For example, if a program is written to take advantage of multiple processors, it will complete its work in far less time than if it were running on a single processor (or even on a hyperthreaded one).  Take 3D rendering.  Producing animation is extremely processor-intensive simply because of the amount of math involved.  Programs like 3DStudioMax and Bryce work extremely well with multiple processors.  They can split tasks up across each processor and thereby reduce the total rendering time.

Databases and application servers also benefit greatly from multiprocessor systems.  For example, I have built many large scale e-commerce type applications (bn.com, columbiahouse.com, kinkos.com, to name a few).  Each of these applications serves many thousands of people.  By using multiprocessor systems, these applications can respond much more efficiently to consumers.

Back to the bloom show... btw bloom, your commentary on architecture is exceptional.  Ever think of writing a book or becoming a professor -- or perhaps you already have/are.  I've thoroughly enjoyed the reading, and the chance to brush up on my knowledge is invaluable.

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #35 on: November 05, 2003, 09:38:36 PM »
Let's go ahead and talk a bit about true multiprocessor architectural issues, since JohnnyB mentioned them briefly.  There are some interesting differences between AMD and Intel CPUs here as well.

Many software applications are multithreaded, meaning they make use of more than one thread.  If you have a multiprocessor-capable operating system (Linux, Win2k Pro, and WinXP Pro being the most common) along with multithreaded applications, having more than one CPU can deliver much higher performance than a single CPU.  Unfortunately the boost from dual CPUs will rarely be anywhere near 2x that of a single CPU system, but I can't say I've ever seen a hardware review explain why that is.  I'm sure quite a few of you have wondered why dual CPU systems aren't twice as fast as single CPU systems, or why dual CPU systems are sometimes even slower than a single CPU system.  There are several reasons for this.

First, let's talk about how the operating system makes use of dual (or more) CPUs.  The simplest explanation would be to say that a multiprocessor-capable operating system simply runs one thread on one CPU and another thread on the second CPU.  This is essentially true, and as you might imagine, applications that make use of more than one thread can see tremendous performance gains on multiprocessor systems.  But why isn't the boost 2x?  There are several reasons, some relating to architectural limitations of the hardware being used (i.e. the CPUs themselves) and others relating to software issues.  Unfortunately I don't even begin to consider myself an expert on the software issues relating to multiple CPUs, but I'll do my best there.  (A good SMP programmer will probably point out tons of flaws in my software explanation if I go too far, so I'll keep it simple.)  Fortunately I do know hardware, so I'll talk about those issues first.

Hardware issue 1:  Both CPUs in a dual processor system (the notable exception being the new AMD Opterons - more on that later) share the same system memory and disk drives.  I'm sure most of you can see that this results in significantly more contention for memory accesses and for (very slow) disk accesses as the number of CPUs goes up, which of course hurts performance.  Unfortunately it gets even worse on real hardware platforms, because in the case of the P4 Xeon the FSB bandwidth is actually shared between all processors in the system.  Each CPU shares bus bandwidth with every other CPU, which increases memory latencies and decreases bandwidth per CPU - a very bad thing.  Unfortunately life isn't much better for the Athlon MP (which is simply a dual-processor-certified Athlon XP with some additional testing).  I'm sure some of you know that the Athlon architecture is largely based on a server processor known as the Alpha (designed by Digital Equipment Corp, later acquired by HP and Intel).  Many of the engineers AMD hired to design the Athlon had previously worked on the EV6 and EV7 Alphas, which are 64-bit multiprocessor-capable server CPUs.  One of the key carryovers from the Alpha EV6 processor was its bus protocol, also called the EV6 bus.  The EV6 bus has the advantage that each CPU gets its full bandwidth to memory, rather than sharing it as the P4 Xeons do.  Unfortunately AMD has squandered this very significant advantage by failing to keep pace with advancements in DDR memory in their only Athlon MP chipset (the 760MP/MPX).  The 760 chipset only supports a 266 MHz FSB, and thus only officially supports DDR266 memory.  For comparison, the Athlon XP 3200+ has a 400 MHz FSB and supports DDR400 memory.  This means that even though the Athlon MPs do not share FSB bandwidth with each other, the chipset itself is out of date in its memory support.  (This is also why the Athlon MP 2800+ has a 266 MHz FSB, where the XP 2800+ has a 333 MHz FSB.)  Obviously a dual processor system can make better use of a faster FSB and faster memory than a single processor system can.  (As I hinted above, the Opteron is different, and I'll get to that - I promise. :) )

Hardware issue number 2 / software issue:  Now let's talk about a tremendous issue that every multiprocessor system must deal with.  Consider what happens if both CPUs in a dual processor system access the same data in memory, and both want to make changes to that particular bit of data.  The potential issues here are monumental.  Think about this for a minute.  Say CPU 1, in its thread, is asked to make a decision based on the data value at a particular location in memory.  Say CPU 2 just happens to be working on a thread that makes a change to that same location in memory.  There is a very real chance that CPU 1 may not do what the programmer intended if CPU 2 happens to change the bit of data that CPU 1 is basing its decision on.

There is more to this though; think about CPU cache memory.  Remember that cache memory is basically very high speed memory filled with the bits of main memory the CPU happens to be working with.  Basically, cache memory can be thought of as high speed temporary memory that the CPU works with.  If the CPU makes a change to something in cache memory, like writing the result of an instruction, that particular bit of data in main system memory must also be updated before some other thread (or, much worse, some other CPU) works with the same bit of data.  If CPU 1 made a change to something temporarily located in its cache, and CPU 2 needed that same bit of memory and happened to read it before CPU 1 could write its cache back to main memory, the system could crash.  Basically, if one CPU tries to work with the same data as a second CPU at the same time, neither one can be sure the data it is working with is correct unless it can be absolutely sure that the other CPU isn't going to make changes to that data until it has finished working with it.

The software term for this potential nightmare is a "race condition": two separate processes or threads trying to work with the same bit of data at the same time.  Obviously multithreaded programs must be very carefully written to ensure that one of their threads isn't stepping on the same bit of data as another thread.  Fortunately, a multithreaded program has the advantage of knowing that a properly written multiprocessor operating system is supposed to ensure that other programs don't tamper with locations in memory reserved for it.  So there we've talked a bit about the software issues with multithreaded programs, but this still doesn't solve the CPU cache memory issue.  That issue is resolved by the CPUs themselves: a CPU in a multiprocessor system checks, on every memory read and write, whether another CPU in the system holds the data it is working with in its cache.  If another CPU does, and the data in that CPU's cache is more recent than main memory, main memory is updated and the most up-to-date value is used.  As you can probably imagine, this puts even more traffic on the CPU FSB(es) and more accesses on memory, which relates right back to issue number one.  You can probably imagine that the Athlon MP, with its non-shared FSB, is a bit better here than the Xeon.  Once again, the Opteron has a big advantage here, and again - I'll get to that... ;)
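
On the software side, here's a minimal sketch of a race condition and the usual fix, a lock (modern C++, purely illustrative; the names are made up):

[code]
#include <cstdio>
#include <mutex>
#include <thread>

// Sketch of a shared counter touched by two threads (possibly on two CPUs).
// Without the mutex, both threads can read the same value of 'counter',
// both add 1, and one of the updates is silently lost.
long counter = 0;
std::mutex counter_lock;

void add_many(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(counter_lock);  // only one thread at a time
        ++counter;                                         // read-modify-write is now safe
    }
}

int main() {
    std::thread t1(add_many, 100000);
    std::thread t2(add_many, 100000);
    t1.join();
    t2.join();
    std::printf("%ld\n", counter);  // always 200000 with the lock; unpredictable without it
    return 0;
}
[/code]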

Hardware issue 3 - Since main memory is shared, the chipset must arbitrate between the CPUs.  There's not much to say here: basically the chipset must take requests from each CPU to read and write memory, and give control of memory to each CPU when it requests it.  If both CPUs need to access memory at the same time (which they nearly always will, since memory accesses take so long), one must wait until the other is done.  This means higher memory read and write latencies, again hurting performance.  Again, the Opteron is better, and I'm finally going to talk about why ... but first I'm taking a break, so stay tuned, as the Opteron has some really "cool" ways of increasing the efficiency of multiprocessor systems. :D  (The on-die memory controller in each CPU should be fairly obvious, but there's more to it than that...)

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #36 on: November 05, 2003, 11:48:03 PM »
Ok, now lets talk about the Opteron and what makes it so special when it comes to multiprocessing.

I'm sure you all read above that the Athlon 64 family (Athlon 64, Athlon 64 FX, & Opteron) features an on-die memory controller.  The truth is that there's a lot more than just that in there.  Remember that in a traditional multiprocessor system each CPU shares main memory with every other processor, which hurts performance for many reasons.  With the Opteron, EACH CPU has its own memory controller and its own DDR memory modules.  This is a tremendous advantage: every CPU in a multiprocessor setup can have its own memory rather than sharing it, and the advantage gets even more important as you add more CPUs.  (If 2 CPUs sharing the same memory is bad, picture 4, 8, or even more.)  There's more than that, though.  In a traditional multiprocessor system each CPU must communicate with the Northbridge portion of the chipset to gain access to the shared memory.  What's more, the CPUs have to go through the Northbridge to communicate with each other as well, meaning every CPU you add gives less and less of a performance boost percentage-wise.  In the Opteron, each CPU can communicate directly with the others over a fast 6.4 GB/s Hypertransport link.  6.4 GB/s is the bandwidth offered by dual channel DDR400 memory, so in effect every CPU acts as a Northbridge all by itself.  This means every CPU can access its own memory directly, and can communicate through its extra Hypertransport links with every other CPU with only a minimal performance penalty.  This high speed link also improves the efficiency of each individual processor's cache memory, as the other CPUs in the system can access another CPU's cache much more rapidly than in other multiprocessor setups.  Not only is this tremendously faster than any other multiprocessing scheme today, it also eliminates the need for the increasingly complicated Northbridge that every CPU in a traditional MP setup must go through.  Multiprocessor chipsets can therefore be MUCH simpler; basically they become only I/O controllers (handling hard drives, USB ports, the PCI bus, the AGP slot, etc).  The chipset communicates with one or more of the CPUs over another fast Hypertransport link built into the Opteron.  This means that in an Opteron system, only the drives and the rest of the I/O are shared.

There are actually 4 different series of Opterons being produced:  The 100 series, which is only single processor capable, has only 1 active Hypertransport link.  (Making it currently identical to the Athlon 64 FX CPU)  This single link hooks the CPU to the motherboard chipset.

The 200 series can work in dual processor systems and has 2 active Hypertransport links.  The extra link hooks the two CPUs together.

The 800 series can work in 4- or 8-way systems and has a 3rd Hypertransport link.  In a 4-way setup the links are arranged like a square: picture one CPU in each corner, with each CPU using 2 of its HT links to reach the 2 nearest CPUs.  That accounts for the links along the sides of the square, and the 3rd link on one (or, in some cases, more) of the CPUs goes to the rest of the system.  In an 8-way system the CPUs are arranged as a cube (or a sort of rectangle), with each CPU using its 3rd HT link across the cube to connect to its counterpart in the top or bottom layer of 4 CPUs, respectively.  Again, one or more of the CPUs communicates with the rest of the system.

The last series is a special version of the Opteron being used for supercomputers.  Cray, IBM, Sandia National Labs, and others are building, or have planned, supercomputers with it.  This chip has even more HT links, which gives it enough links to build systems arranged as a giant 3-dimensional grid, with each CPU's HT links reaching out to its neighbor CPUs.  The best way I can describe this arrangement is to picture each CPU in the middle of a 3D "+" sign.  There are supercomputers with well over 1000 individual Opterons either planned or under construction, which would rank them among the fastest in the world.  These supercomputers generally run 64-bit Linux or Unix.  (Windows currently does not support what's known as NUMA (non-uniform memory access), which a setup with multiple memory controllers like this requires.  Windows Server 2003 is the first to include some NUMA support.)  Basically the Opteron is the first x86-compatible CPU designed primarily with multiprocessing in mind.  As you can see, it eliminates or minimizes the disadvantages of adding additional CPUs compared to other x86 multiprocessor CPUs.

This is really easy to picture with a simple diagram.  If I find a good one, I'll link to it here.  Basically the Opteron is capable of scaling in performance far better than any other multiprocessor capable CPU available today as you add more CPUs.
« Last Edit: November 06, 2003, 12:02:36 AM by bloom25 »

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #37 on: November 05, 2003, 11:58:31 PM »
Here's a quick bit of info out of Anandtech's early Opteron article from back in April.  (Note that Opteron now supports DDR400 memory.)

http://www.anandtech.com/cpu/showdoc.html?i=1815&p=7

Here's some more links about the supercomputers planned:

This is the computer Cray is building for Sandia National Labs.  It uses 10,000 (!!!) Opterons and would be the fastest computer in the world if it were running today.

http://zdnet.com.com/2100-1103-962787.html
http://www.cpuplanet.com/knowledge/casestudies/article.php/2198311
« Last Edit: November 06, 2003, 12:15:06 AM by bloom25 »

Offline jonnyb

  • Nickel Member
  • ***
  • Posts: 593
PC Architecture
« Reply #38 on: November 06, 2003, 10:38:15 AM »
As I briefly mentioned in my post, and bloom has now expanded upon, the benefits of a multiprocessor system are plain to see.  What I didn't touch upon in much detail was why your average MP system will not see application performance grow linearly with the number of processors.  One would expect that adding a second processor would double the speed.  Four processors must then quadruple it, right?  Unfortunately, no.  Bloom has described quite correctly the hardware limitations involved in multiple processor systems.  To summarize, x86-based processors (until the Opteron, that is) shared system resources.  They were forced to use the same memory, the same FSB, the same Northbridge, the same I/O controllers.  All of this sharing leads to a lot of wasted time on the CPUs while they wait for the rest of the system to catch up.

Another issue that was mentioned (hardware/software issue 2 from the above post) was programmatic access to memory by multiple CPUs.  I will expand on the programming issue as that is where my expertise lies.

First and foremost, probably 99.999% of all programs you run on your home PC are multi-threaded.  It would just take way too long for programs to execute if they were not.  Let's look at an example we are all familiar with: this bulletin board.  The architecture of this board involves a graphical user interface (GUI) that provides the look and feel of the board, an API to retrieve and store data, and an API to accept user input and perform actions based on that input.  (There are more pieces involved, but this list is enough to get us started.)

Let's first assume that this bulletin board is single-threaded.  When a user accessed the board by typing in the URL, the application server would receive that request, perform a lookup in the database to verify the user's existence, retrieve information from the database about that user, retrieve information about forums, check to see if there are any forums that have not been read since the user's last visit, manage the retrieved data, compile the data into a usable format, generate the HTML to display the data (based on the GUI) and finally send that data back to you.  Wow.  Just by typing in the URL of this board, you've caused at least 9 major events to happen.  Each of these events requires time to complete.  Furthermore, a lot of this time is spent waiting on the retrieval of information needed to proceed to the next step of the program.  During these waiting times the CPU would sit idle.

Compound the above everyday scenario by adding multiple users.  In a single threaded application all of you would have to wait until my request had been completed.  Assuming each request takes 6 seconds to complete and there are 100 users trying to access this board, the poor guy that came in at number 100 would have to wait an unbearable 10 minutes for his request to finally be processed (based on the 100 users coming into the application simultaneously and a first-in-first-out queue).
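
Here's a rough sketch of that single-threaded, first-in-first-out situation (C++, with the 6 seconds of database and rendering work faked by a sleep; everything here is made up for illustration):

[code]
#include <chrono>
#include <cstdio>
#include <thread>

// Single-threaded server: requests are handled strictly one at a time,
// so user number 100 waits behind the 99 ahead of him.
void handle_request(int user) {
    std::this_thread::sleep_for(std::chrono::seconds(6));  // lookup, fetch, build HTML...
    std::printf("served user %d\n", user);
}

int main() {
    for (int user = 1; user <= 100; ++user)   // first-in-first-out queue
        handle_request(user);                 // user 100 waits ~600 s, i.e. 10 minutes
    return 0;
}
[/code]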

Can you imagine having to wait 10 minutes for the board to load?  Obviously nobody would wait that long.  Notice, too, that throughout that 10 minute period the CPU of the app server would be mostly idle because of all the time spent waiting to retrieve data.  The database servers would also sit idly by, waiting for more requests from the app server....

continued below

Offline jonnyb

  • Nickel Member
  • ***
  • Posts: 593
PC Architecture
« Reply #39 on: November 06, 2003, 11:03:25 AM »
Adding another processor (or 100 more) to the single-threaded model does not help our cause any.  Each processor would still be sitting around idle for extended periods of time.

Multi-threading a process attempts to keep the CPU busy at all times.  Breaking a process down into parts and giving each of those parts its own thread allows the process to execute much more quickly.  The best example of this type of behavior can be seen, in a macro sense, with the SETI application.  I'm sure most of you have seen/used/heard about this app.  Basically, it takes one giant chunk of radio telescope data, breaks it down into smaller pieces, and farms those small pieces out to processors around the world.  The efficiency of crunching the data this way is leaps and bounds ahead of doing it all in a single bite.

In our bulletin board example, multi-threading allows our application to service many more readers concurrently.  The application spawns multiple threads to handle user requests.  Let's assume that there are 100 threads spawned, to match the number of users trying to access the board.  Remember that I told you the CPU was sitting around idle most of the time?  Well, here's where we take advantage of it.  When the first thread is waiting on a fetch of data from the database, it gives up its hold on the CPU.  The second thread now utilizes the CPU.  When it has to wait, it gives up control, and thread 3 comes in.  This goes on and on.  The CPU services each thread and spends far less of its time idle.  So what does this mean to us?  It means that we are serviced much faster.  Now the poor guy who came in 100th doesn't have to wait an unbearable 10 minutes.
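
A rough sketch of the multi-threaded version (again C++ with the database wait faked by a sleep; all the names are made up).  While a thread is blocked waiting, it isn't using the CPU, so the other threads get their time slices:

[code]
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// One thread per request.  Threads that are waiting on (faked) database work
// give up the CPU, so all 100 requests are serviced roughly concurrently.
void handle_request(int user) {
    std::this_thread::sleep_for(std::chrono::seconds(6));  // waiting on the database
    std::printf("served user %d\n", user);
}

int main() {
    std::vector<std::thread> workers;
    for (int user = 1; user <= 100; ++user)
        workers.emplace_back(handle_request, user);  // spawn a thread per request
    for (auto& t : workers)
        t.join();
    return 0;   // total time is on the order of 6 seconds instead of 10 minutes
}
[/code]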

If you've followed to this point you will know that my example has been based on a single CPU.  We've seen that multi-threading a program gets things done faster by reducing CPU idle time.  This can be expanded to multiple CPUs....to a point.  As bloom mentioned, there are setbacks.  The more CPUs you add to the system, the more overhead is involved.  Now program states, thread locations, memory access, I/O access have to be communicated throughout the system.  The system spends more time managing than processing.  Programs become bloated because they have to deal with handling threads more carefully.  Operating systems become more complex.  The list goes on.

I think I'll break here as I've hijacked bloom's thread long enough.  If people are interested, I'll start another thread that deals with software development.

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #40 on: November 06, 2003, 05:42:12 PM »
Go ahead and continue in this thread if you want, JohnnyB.  Hardware and software are so closely intertwined that I think it would be best to just keep all the info in one thread.  That allows the hardware discussion to build on software issues and vice versa.

It would probably also be interesting to discuss some of the x86 architectural limitations and annoyances - things like segmentation and the limited number of software-accessible x86 registers to work with.  That would also be a good primer to build the SIMD instruction info on.

BTW:  I found the software discussion very interesting.

Offline Roscoroo

  • Plutonium Member
  • *******
  • Posts: 8424
      • http://www.roscoroo.com/
PC Architecture
« Reply #41 on: November 10, 2003, 12:03:20 PM »
I vote ya keep going in the same thread here ... Still reading :aok
Roscoroo ,
"Of course at Uncle Teds restaurant , you have the option to shoot them yourself"  Ted Nugent
(=Ghosts=Scenariroo's  Patch donation

Offline jonnyb

  • Nickel Member
  • ***
  • Posts: 593
PC Architecture
« Reply #42 on: November 10, 2003, 12:53:40 PM »
Alright, I'll continue in this post.  Hopefully, I'll get a chance to post something later today.  I think the discussion will be regarding the low-level software interactions with the hardware (MMX, SSE and SSE2).

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
PC Architecture
« Reply #43 on: November 10, 2003, 05:59:56 PM »
That's something I was working on myself, but honestly it's not an easy topic to simplify and still get any useful information across.

Offline jonnyb

  • Nickel Member
  • ***
  • Posts: 593
PC Architecture
« Reply #44 on: November 11, 2003, 10:40:39 AM »
lol... I hear that.  The basic idea was to convey the advantages of the MMX, SSE and SSE2 instruction sets.  When I realized that doing so would require a far greater amount of detail than I care to post, I started thinking of other ways around it.

The basic premise for the introduction of SIMD (Single Instruction, Multiple Data) instructions was to speed up complex operations on a CPU.  If porn makes the internet go 'round, then games do the same for hardware advancement, albeit more discreetly.  To understand why these instruction sets were added to x86 CPUs, one must understand the enormous amount of processing that must take place in a typical game.  As before, I will use an example we are all familiar with: Aces High.  Before I go on, I must include a disclaimer: I do not work for Hitech Creations, and I do not have any knowledge of, or access to, any of the source code that makes up Aces High.  What I am going to write are general concepts, and nothing specific to the AH engine.

Ok, now that the disclaimer is out of the way, let's begin.  In AH a huge amount of data must be processed to create the virtual reality we enter each time we start up the game.  The processing power goes to rendering the world around you, computing flight characteristics, registering and tabulating damage inflicted both to and by you, etc.  This is no small feat to accomplish, and tends to keep a CPU quite busy churning away.

What does it all boil down to?  The answer, yes I'm sure you didn't believe your teacher when s/he told you this, is math.  That's right.  It's all about the math.  Let's look at one specific part of AH: the graphics engine.  There are many graphics engines on the market, some of which are quite famous: Quake3, Half-Life, Unreal, etc.  The single purpose of a graphics engine is to render graphics (duh).  How it does this is through the use of math.  I know I'm digressing here, but I'll get back to my original point, I swear.  It really is related...

Let's consider a simple graphics engine.  There are two components involved, a modeler and a renderer.  The modeler is responsible for the generation of shapes and creation of coordinates.  The renderer then takes the information produced by the modeler and produces the images on screen.  For the sake of convenience, this is a grossly over-simplified explanation of the function of the components of a graphics engine.  For example, I am not going into z-buffers, pixel shaders, dynamic light sourcing, ray-tracing, etc.

Anyway, the simplest representable polygon is a triangle.  It is planar and has only three points (or, in the graphics world, vertices).  We can represent any shape we wish with enough triangles.  For example, a square is nothing more than two triangles.  The more triangles we use, the higher the detail we can achieve.  This is easily visible when creating curved surfaces.  Fewer triangles, and the edges are jagged.  More, and they smooth out.

This, however, comes at a price: computing power.  The modeler must create all the vertices for all the points of all the triangles of all the shapes in an object.  This can also include base color saturation levels for each vertex, and other pertinent information regarding points or shapes.  Can you see where this gets computationally expensive?  Take the creation of a sphere as an example of a very common shape.  Our modeler must decide how many triangles will compose the sphere.  With that knowledge, and some other inputs (like the radius of the sphere and the formula for calculating surface area -- 4*pi*r^2) the modeler generates the vertices of the triangles composing the sphere.
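
A very rough sketch of what the modeler's vertex generation might look like (C++, stepping through latitude/longitude bands; a real modeler would also build the triangle list and do much more, and all of these names are made up):

[code]
#include <cmath>
#include <cstdio>
#include <vector>

// Generate the vertices of a sphere by stepping through latitude/longitude
// bands.  More bands means more triangles and a smoother sphere, but also
// far more per-vertex math for the CPU to chew through.
struct Vertex { float x, y, z; };

int main() {
    const float radius = 1.0f;
    const int   bands  = 32;
    const float pi     = 3.14159265f;
    std::vector<Vertex> verts;

    for (int i = 0; i <= bands; ++i) {
        float lat = pi * i / bands;              // 0 .. pi
        for (int j = 0; j <= bands; ++j) {
            float lon = 2.0f * pi * j / bands;   // 0 .. 2*pi
            Vertex v;
            v.x = radius * std::sin(lat) * std::cos(lon);
            v.y = radius * std::sin(lat) * std::sin(lon);
            v.z = radius * std::cos(lat);
            verts.push_back(v);                  // several float ops per vertex
        }
    }
    std::printf("generated %zu vertices\n", verts.size());
    return 0;
}
[/code]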

Once all the vertices have been computed, that information is fed to the renderer.  Its job is to put all of those objects onto the screen.  It has to deal with interaction between objects, including light sources, shadowing, occlusion, and so on.  Again, it is the power of math.  Without going into a college-level linear algebra and matrices course, suffice it to say that a LOT of math happens here: very complex math involving calculus, vectors, and 4-dimensional (homogeneous coordinate) matrix operations.
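
As a small taste of the renderer's math, here's a single vertex being transformed by a 4x4 matrix in homogeneous coordinates (C++, purely illustrative).  Every vertex of every triangle goes through transforms like this each frame, which is exactly the kind of repetitive 4-float arithmetic the SIMD instruction sets mentioned above were added to speed up:

[code]
#include <cstdio>

// Transform one vertex by a 4x4 matrix in homogeneous coordinates.
// The example matrix simply translates the point by (5, 0, 0).
int main() {
    float m[4][4] = {
        {1, 0, 0, 5},
        {0, 1, 0, 0},
        {0, 0, 1, 0},
        {0, 0, 0, 1},
    };
    float v[4] = {1, 2, 3, 1};   // vertex (1,2,3) with a 4th (w) component of 1
    float r[4] = {0, 0, 0, 0};

    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            r[row] += m[row][col] * v[col];   // 4 multiplies + 4 adds per row

    std::printf("(%g, %g, %g, %g)\n", r[0], r[1], r[2], r[3]);  // prints (6, 2, 3, 1)
    return 0;
}
[/code]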

continued below...