Great job JohnnyB!

That definitely gives a good base to build on.
I think at this point it might be a good idea to describe in a bit more detail some of the key concepts behind how SIMD instructions (MMX, SSE, SSE2, 3DNow!, SSE3) are useful.
Let's go down to the absolute basics here and define what's meant by bit, byte, word, float, integer, double, etc. If you know what these mean, it will be much easier to understand how SIMD instructions can greatly boost performance. These concepts are actually quite simple once you understand how to think in binary. To get there, let's send everyone back to basic math class.
First, the "bit": A bit is roughly analogous to a digit in our normal decimal way of thinking. For example, the number "410" has three digits: the ones, tens, and hundreds places. If you think about it, what we are really saying is that 410 is made up of 4x100 + 1x10 + 0x1 = 410. (4 times 100 plus 1 times 10 plus 0 times 1 equals 410.) The same holds true for binary numbers (and all other numbering schemes like octal and hexadecimal, the latter of which is commonly used in software). If I want to represent a number in binary, which only has 0 and 1 (versus 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 for decimal), we simply do exactly what we do for decimal. There are only two possible choices for each place though: 0 times some place and 1 times some place. Think about this: in decimal, each place (ones, tens, hundreds, thousands) is a power of 10. The ones place is 10^0 (10 to the 0th power, which equals 1). The tens place is 10^1 (10 to the 1st power, which equals 10). The hundreds place is 10^2 (equals 100), and so on. Thus again, when we write 410 we are saying 4x10^2 + 1x10^1 + 0x10^0, which equals 410. (4 times 10 to the second power = 400. 1 times 10 to the first power = 10. 0 times 10 to the 0th power = 0.)

Now let's talk about binary. Binary does not have places that are powers of 10, but rather places that are powers of 2. Thus for binary the first place is still the ones digit (2^0), but the next is the 2s place (2^1), then the 4s place (2^2), then 8s (2^3), 16s, 32s, 64s, 128s, 256s, 512s, etc. Since you only have a 0 or a 1 as a multiplier for each of these places, it's really easy to count in binary. Let's start simple: in binary, if I want to represent the number 10 I would write that as 1010. (8 times 1 plus 4 times 0 plus 2 times 1 plus 1 times 0 = 10 decimal.) For reference, here's 0 to 15 in binary: 0000 = 0, 0001 = 1, 0010 = 2, 0011 = 3, 0100 = 4, 0101 = 5, 0110 = 6, 0111 = 7, 1000 = 8, 1001 = 9, 1010 = 10, 1011 = 11, 1100 = 12, 1101 = 13, 1110 = 14, 1111 = 15. Once you catch on to this, you understand just how basic it really is.

Now (finally) back to the bit. A "bit" is simply any individual binary digit. Thus to represent numbers from 0 to 15, you need 4 bits. (Look above: 15 = 1111, which is the highest number I can represent using only 4 bits. If I wanted to write 16, that would be 10000, which needs 5 bits.)
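If you happen to know a little C, here's a tiny sketch (the variable name and the 4-bit loop are just my choices for illustration) that spells out the number 10 bit by bit, exactly the way we just did by hand:

    #include <stdio.h>

    int main(void)
    {
        int n = 10;   /* decimal 10, which is 1010 in binary: 8x1 + 4x0 + 2x1 + 1x0 */

        /* print the low 4 bits, most significant first */
        for (int bit = 3; bit >= 0; bit--)
            printf("%d", (n >> bit) & 1);

        printf(" (binary) = %d (decimal)\n", n);   /* prints: 1010 (binary) = 10 (decimal) */
        return 0;
    }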
Ok, I think we now have a fairly good understanding of a bit, so now onto the byte. A byte is simply a number using 8 bits. This means that a single byte of information can have up to 256 possible values, ranging from 00000000 = 0 to 11111111 = 255. The byte is a very commonly used term in computers and tech literature. (A megabyte is technically one million bytes; a gigabyte is one billion bytes.) Remember, just above I said that with a single byte you can represent up to 256 different numbers (integers 0 to 255). Given that piece of information, is it hard to understand that an 8 bit CPU can work only with (positive) numbers from 0 to 255 in each operation it performs? For example, if an 8 bit CPU wants to add two numbers, each number can be no greater than 255. Now obviously you CAN work with numbers bigger than 255, but you have to use more than one operation to do so: the first byte can represent the lower digits and the second byte the higher digits. Unfortunately, explaining this further would drag the discussion way off topic, so let's move on, shall we? The important concept to gather is that an 8 bit CPU must perform more operations to work with numbers greater than 255, which puts it at a disadvantage against CPUs that can work with 16 bits (or more) at a time.
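To make that concrete, here's a small C sketch (unsigned char is C's one-byte type; the carry trick shown is just a simplified illustration of what an 8 bit CPU has to do in two steps):

    #include <stdio.h>

    int main(void)
    {
        unsigned char a = 200, b = 100;   /* one byte each, so 0 to 255 */

        /* the true sum (300) does not fit in 8 bits, so it wraps around */
        unsigned char sum8 = a + b;
        printf("one-byte sum: %u\n", sum8);               /* prints 44 (300 - 256) */

        /* the 8 bit CPU way: keep a low byte plus a carry byte -- two steps */
        unsigned char low   = a + b;                      /* lower 8 bits */
        unsigned char carry = (a + b) > 255 ? 1 : 0;      /* did it overflow? */
        printf("two-byte sum: %u\n", carry * 256 + low);  /* prints 300 */

        return 0;
    }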
Now, how about 16 bits? With 16 bits (2 bytes) we can describe up to 65536 possible numbers (0 to 65535). 16 bits is actually a rather special number of bits for x86 CPUs, because originally the x86 instruction set was only 16 bit. (This would be CPUs like the 8086, 80186, and 80286.) The x86 instruction set was expanded to 32 bit with the 80386 (386), so x86 CPUs from the 386 onwards could process either 16 bit or 32 bit instructions and data. 16 bits is also relevant for another reason, and that is the concept of the word. The basic definition of the binary word is the number of bits the CPU natively works with. In the x86 world this is not simple anymore, as x86 CPUs started out as 16 bit and were expanded to 32 bit (and now 64 bit with the Athlon 64). In the x86 world, the "word" is 2 bytes, or 16 bits, long. In other architectures this is not the case; for the PowerPC (Macs) the word is 32 bits or 4 bytes. Why is the "word" important? Simply put, the word is the (smallest) number of bits the CPU works with in each operation.
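Here's the same idea at 16 bits, sketched in C (uint16_t from stdint.h is an exact 16 bit type, so this behaves the same on any compiler):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t w = 65535;            /* the largest value 16 bits can hold */
        w = w + 1;                     /* one more than the maximum wraps back to 0 */
        printf("%u\n", (unsigned)w);   /* prints 0 */
        return 0;
    }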
Ok, let's go further. In software programming languages (like C or Java) you must define how many bits are contained in a variable. To do this, software "declares" a variable, which essentially tells the computer how much memory to allocate for that particular variable. Thus if you declare a typical whole-number variable for an x86 CPU, it is going to take up at least one word, or 16 bits, of memory. Now we can define what a "double word" (often abbreviated dword) is: a variable made up of 2 words, or in the x86 world 32 bits. With a 32 bit variable you can describe just over 4 billion possible values.
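A quick C sketch of the idea (the exact sizes of short and int aren't guaranteed by the C language itself, but on a typical 32 bit x86 compiler they come out to one word and one double word):

    #include <stdio.h>

    int main(void)
    {
        short word_var  = 32000;        /* usually one word  = 2 bytes = 16 bits */
        int   dword_var = 2000000000;   /* usually one dword = 4 bytes = 32 bits */

        printf("short takes %u bytes\n", (unsigned)sizeof(word_var));   /* usually 2 */
        printf("int   takes %u bytes\n", (unsigned)sizeof(dword_var));  /* usually 4 */
        return 0;
    }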
Now, in JohnnyB's excellent posts above he mentions a "float." What is a "float"? A float is short for a floating point number. A floating point number is simplest to explain, for those of us familiar with decimal numbers, as a number with a decimal point. Thus 12.286 is a floating point number, versus an integer like 410. Representing a floating point number in binary is very interesting. Picture this: I can also represent 12.286 as 12286 with the decimal point shifted 3 places to the left. Thus a floating point number can be represented as the digits that make up the number plus the number of places to move the decimal point, and this is how a computer treats floating point numbers. (The technical terms for these two pieces of information are the mantissa (the 12286 in my example) and the exponent (the number of places to move the decimal point).)

Now, obviously, to do any serious math (and realistically any kind of division) you need the capability of working with floating point numbers. Every x86 CPU since the 486DX has had a special unit within the processor dedicated to working with these numbers: the Floating Point Unit (FPU). The instruction set primarily used to work with floating point numbers is the x87 instruction set. (Why is it called x87? Because some computers, like 286 based PCs, had the option of using a separate CPU of sorts, the 80287 chip, which handled floating point instructions. The instructions this chip used were naturally called x87 instructions. After the FPU was integrated with the rest of the CPU in the 486DX, the x87 name for the instructions stuck.)

The FPU can perform a LOT of different instructions. It can add, subtract, multiply, divide (with or without remainder), perform trig instructions like sine, cosine, tangent, and arctangent, raise numbers to powers of 2 or powers of e (if you don't understand what e is, don't worry), perform logarithm instructions, compare numbers, and read and write to memory. (And much, much more.) As you might imagine, the floating point unit is MUCH more complicated (and takes longer to do some instructions) than the integer unit. (The integer unit is known as the ALU, or arithmetic logic unit. The ALU performs most of the same tasks as the floating point unit, but only on integer numbers, i.e. those without a decimal point.) Games make heavy use of the floating point unit to perform just about all mathematical operations; the integer unit (ALU) simply lacks the precision (since it can only work with integers) for many calculations games need to perform. Thus for gaming in particular, the performance of the FPU (at least for the CPU's part) is key.
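If you want to see the mantissa/exponent split for yourself, C's standard frexp() function will pull a number apart for you (note the hardware actually works in powers of 2 rather than the powers of 10 in my decimal-point example; you may need to compile with -lm):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 12.286;
        int exponent;

        /* frexp() splits x into a mantissa between 0.5 and 1.0
           and a power-of-2 exponent, so that x = mantissa * 2^exponent */
        double mantissa = frexp(x, &exponent);

        printf("%f = %f x 2^%d\n", x, mantissa, exponent);   /* 12.286 = 0.767875 x 2^4 */
        return 0;
    }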
Why have I spent all this time describing all this? Quite simply, SIMD (Single Instruction, Multiple Data) instructions are designed to let an x86 CPU perform a single instruction on several of these numbers at the same time. The differences between the various SIMD instruction sets lie in the types and number of bits their instructions can deal with.
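To tie it all together, here's a small C sketch using the SSE intrinsics from xmmintrin.h (just one illustrative example: it adds four 32 bit floats to four others with a single instruction, where plain x87 code would need four separate additions):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        float a[4] = {  1.0f,  2.0f,  3.0f,  4.0f };
        float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        float r[4];

        __m128 va  = _mm_loadu_ps(a);    /* load four floats into one 128 bit register */
        __m128 vb  = _mm_loadu_ps(b);
        __m128 sum = _mm_add_ps(va, vb); /* one instruction, four additions at once */
        _mm_storeu_ps(r, sum);

        printf("%.0f %.0f %.0f %.0f\n", r[0], r[1], r[2], r[3]);   /* prints: 11 22 33 44 */
        return 0;
    }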