Author Topic: FPU Analysis article  (Read 813 times)

Offline Lephturn

  • Silver Member
  • ****
  • Posts: 1200
      • http://lephturn.webhop.net
FPU Analysis article
« on: August 16, 2001, 06:39:00 AM »
I found this linked off of http://www.iamnotageek.com/
 http://www.idius.net/fpucomparison/


I'd like to see the numbers with SSE2 factored into the equation, though.  Interesting comparison, but I recommend you take it with a grain of salt.  :)

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
FPU Analysis article
« Reply #1 on: August 16, 2001, 12:23:00 PM »
The only problem with SSE2 is that it will take some time before it's used by most FPU-intensive software (including AH).  The P4's FPU is very weak without it.

The guy's chart seems a little off to me though.  The Athlon is either too low or the P6 cores (Pentium Pro, P2, P3, Celeron) are too high.  The Athlon's FPU is much more powerful than the P6's (it's fully pipelined in the Athlon).  I also think the K6-2 is too high.

You might be able to find some more info on www.arstechnica.com  about this too.

Offline Ozark

  • Silver Member
  • ****
  • Posts: 1176
FPU Analysis article
« Reply #2 on: August 16, 2001, 12:30:00 PM »
Anyone have a clue what they're talking about?   :p

Offline AKDejaVu

  • Platinum Member
  • ******
  • Posts: 5049
      • http://www.dbstaines.com
FPU Analysis article
« Reply #3 on: August 16, 2001, 12:37:00 PM »
Hey bloom.. in regards to SSE2.. they were saying the exact same thing about the '3DNow!' stuff too.

I also tend to agree with you about the numbers... K6-2/3 look too high and Athlon looks too low.

AKDejaVu

Offline Lephturn

  • Silver Member
  • ****
  • Posts: 1200
      • http://lephturn.webhop.net
FPU Analysis article
« Reply #4 on: August 16, 2001, 01:28:00 PM »
Yeah, I noticed the K6/2 was way high myself.

I'd like to see a real comparison with actual tests.  :)

Offline bloom25

  • Silver Member
  • ****
  • Posts: 1675
FPU Analysis article
« Reply #5 on: August 19, 2001, 12:26:00 AM »
Yeah, the K6-2 is WAY too high here.  My guess is 3DNow! must have been used in the test.   :confused:

As for SSE2, I do think it will eventually become common, but it's important to remember that software must be recompiled, at a minimum, for it to be used.  That will take time, and by then all modern CPUs will support the technology.  Take SSE1 as an example: at one time only the P3 had it and very few programs used it.  The original Athlon added support for 24 of those instructions, and the new Athlon 4 ( LOL ) supports the rest of them as well.  SSE1 support is only now starting to show up in software.

Ozark, FPU is an acronym meaning "floating point unit."  Basically this is the part of the processor that does math on non-whole (floating point) numbers.

Take this as an example:

I can represent the number 1201 as 0000001201.00000000; they are the same.  Now let's say I wanted to multiply this number by 10.  All I have to do is move the decimal point one position to the right.  Now let's extend this concept to a CPU and work in binary.  In a CPU all numbers are represented in binary and have a fixed length (for example 32, 64, or 128 bits, meaning they are composed of a series of 1s and 0s of that length).  Just as moving the decimal point in decimal notation multiplies or divides by 10, moving the point in binary multiplies or divides by 2.

In a computer a floating point number is composed of 2 parts.  One is known as the mantissa, which is the 1201 in my example.  The other part is the exponent, which basically tells the computer where to place the point.  There are two ways of doing this: one is called an integer mantissa, the other a fractional mantissa.  This refers to where the point starts, and it's easy to see in an example.  I can write 1201 as .12010000 and move the point 4 places to the right to get the correct number.  That's a fractional mantissa.  The other way would be 12010000. and move 4 places to the left.  That's an integer mantissa.

Since in a computer moving the point is like multiplying by a power of two, you can imagine how simple it is to multiply by 2, 4, 8, 16, etc.  All you have to do is increment or decrement the exponent (essentially move the point).
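
If you want to see the mantissa and exponent for yourself, here's a toy C sketch (my own made-up example, nothing from AH or the article) using the standard library call frexp(), which splits a double into a fractional mantissa and a power-of-two exponent:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double value = 1201.0;
        int exponent;

        /* frexp() splits value into mantissa * 2^exponent,
           with the mantissa in the range [0.5, 1.0) */
        double mantissa = frexp(value, &exponent);

        printf("%g = %g * 2^%d\n", value, mantissa, exponent);
        /* prints: 1201 = 0.586426 * 2^11 */
        return 0;
    }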

For this reason, software writers tend to make heavy use of power-of-2 multiplications and divisions when trying to optimize mathematical operations.
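
To put that in code form (again just a toy sketch of mine, not anything real), the standard C call ldexp() does exactly this: it scales a number by a power of two by adjusting the exponent, and gives the same answer as an ordinary multiply:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 1201.0;

        double fast = ldexp(x, 3);   /* multiply by 2^3 = 8 via the exponent */
        double slow = x * 8.0;       /* ordinary multiply, same result       */

        printf("%g %g\n", fast, slow);   /* prints: 9608 9608 */
        return 0;
    }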

A CPU normally contains 3 different floating point execution units.  One handles addition (and subtraction), the next multiplication (and division), and the last loads and stores the numbers.  One of the touted features of the Athlon is a "fully pipelined" floating point unit.  This means (unlike the P3) that no unit depends on any part of another unit, so the Athlon can perform an addition, a multiplication, and a load/store all at the same time.  In the P3 the multiplication unit shares parts of the addition unit, so this is not possible.  Now let's consider an example: say I wanted to take the number 48 and multiply it by 3.  Knowing that in a CPU multiplications and divisions by powers of 2 are very simple, the fastest way to do this would be to multiply by 2, then add 48.  Now let's compare the P3 and the Athlon.  (Keep in mind this is EXTREMELY generalized, but the basics are not too hard to grasp.)

In the P3 I would have to do the following: first I need to grab the numbers 48 and 3 from memory.  Next I schedule the operations needed to complete the task: first I'll multiply by 2, then I'll add 48 to that result, then I'll have to write the result back into memory somewhere.  In the P3 I must do the addition and multiplication in separate steps, because the multiplication unit requires the use of some of the components in the addition unit.  In the Athlon I can do the 48 + 48 addition AND the 48 x 2 multiplication at the exact same time, then combine the two.  Thus I've saved a step in the process and increased speed.
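
Here's the 48 x 3 trick written out as a little C sketch (my own illustration, obviously not AH's actual code):

    #include <stdio.h>

    int main(void)
    {
        double x = 48.0;

        double twice  = x * 2.0;     /* the cheap power-of-two multiply: 96 */
        double result = twice + x;   /* plus one add: 48*2 + 48 = 144       */

        /* The P3 vs. Athlon difference above: the Athlon's add, multiply,
           and load/store units are independent, so it can keep all three
           busy at once; on the P3 the multiplier borrows pieces of the
           adder, so an add and a multiply can't be issued together. */
        printf("48 * 3 = %g\n", result);
        return 0;
    }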

Now for the concept of SIMD instructions (3DNow!, SSE, and SSE2 are examples).  SIMD means "Single Instruction, Multiple Data".  What this basically means is that I can take several separate pieces of data and perform the same operation on all of them at once.  Let's try an AH example here.  (A bad one.  ;) )  Accept for the moment that a group of clouds is moving at 30 miles per hour.  This means that each individual cloud in the group must have its position altered by a set amount in one direction.  To keep it simple, let's say every cloud in the group moves south by 4 feet.  To do the job the CPU needs to take the position of every cloud in the group and add 4 feet.  Not too big a deal to do really, but there is a lot of repetition involved.  Now imagine we could add 4 feet to the position of EVERY cloud in the group at the same time.  As you can imagine, we've saved a TON of time.  That is the concept of SIMD, and the instructions to do just this are what 3DNow!, SSE, and SSE2 are.  (Some load multiple data elements into memory, some do additions on several pieces of data at the same time, etc.)

As you can probably guess, not every task is going to benefit from the use of SIMD-type instructions.  (What if we only needed to move one single cloud?)  On the other hand, some tasks are highly repetitive and the use of SIMD can speed things up greatly.  These applications are usually multimedia related, and probably the very best example is MPEG or MP3 encoding.  In these tasks you perform the same mathematical operations over and over again until the job is done, so the use of SIMD can really speed things up.  (This is the main reason why MPEG-4 encoding is one of the P4's strengths in benchmarks: the applications being used support the advanced SSE2 instructions, and only the P4 supports them right now.)
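
To make the cloud idea concrete, here's a toy sketch using SSE intrinsics (the clouds, positions, and names here are completely made up by me; only the _mm_* calls are real SSE):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    #define NUM_CLOUDS 8     /* kept a multiple of 4 to keep the sketch simple */

    int main(void)
    {
        float pos[NUM_CLOUDS] = { 10, 20, 30, 40, 50, 60, 70, 80 };

        /* Scalar way: one add per cloud.
           for (int i = 0; i < NUM_CLOUDS; i++) pos[i] += 4.0f;       */

        /* SIMD way: four clouds per instruction. */
        __m128 shift = _mm_set1_ps(4.0f);          /* {4, 4, 4, 4}         */
        for (int i = 0; i < NUM_CLOUDS; i += 4) {
            __m128 p = _mm_loadu_ps(&pos[i]);      /* load 4 positions     */
            p = _mm_add_ps(p, shift);              /* move all 4 by 4 feet */
            _mm_storeu_ps(&pos[i], p);             /* store them back      */
        }

        for (int i = 0; i < NUM_CLOUDS; i++)
            printf("%g ", pos[i]);                 /* 14 24 34 ... 84      */
        printf("\n");
        return 0;
    }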

I hope I've explained the concept of an FPU in a way that doesn't confuse you as much as it did me the first time I was introduced to it.  :D