Great job JohnnyB!

That definitely gives a good base to build on.
I think at this point it might be a good idea to describe in a bit more detail some of the key concepts behind how SIMD instructions (MMX, SSE, SSE2, 3DNow!, SSE3) are useful.
Let's go down to the absolute basics here and define what's meant by bit, byte, word, float, integer, double, etc. If you know what these mean, it will be much easier to understand how SIMD instructions can greatly boost performance. These concepts are actually quite simple once you understand how to think in binary. To get there, let's send everyone back to basic math class.
First, the "bit": A bit is roughly analogous to a digit in our normal decimal way of thinking. For example, the number "410" has three digits: the ones, tens, and hundreds places. If you think about it, what we are really saying is that 410 is made up of 4x100 + 1x10 + 0x1 = 410. (4 times 100 plus 1 times 10 plus 0 times 1 equals 410.) The same holds true for binary numbers (and all other numbering schemes like octal and hexadecimal, the latter of which is commonly used in software). If I want to represent a number in binary, which only has 0 and 1 (versus 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 for decimal), we simply do exactly what we do for decimal. There are only two possible choices for each place though: 0 times some place and 1 times some place. Think about this: in decimal, each place (ones, tens, hundreds, thousands) is a power of 10. The ones place is 10^0 (10 to the 0th power, which equals 1). The tens place is 10^1 (10 to the 1st power, which equals 10). The hundreds place is 10^2 (equals 100), and so on. Thus again, when we write 410 we are saying 4x10^2 + 1x10^1 + 0x10^0, which equals 410. (4 times 10 to the second power = 400. 1 times 10 to the first power = 10. 0 times 10 to the 0th power = 0.)

Now let's talk about binary. Binary does not have places that are powers of 10, but rather places that are powers of 2. Thus for binary the first place is still the ones digit (2^0), but the next is the 2s place (2^1), then the 4s place (2^2), then 8s (2^3), 16s, 32s, 64s, 128s, 256s, 512s, etc. Since you only have a 0 or a 1 as a multiplier for each of these places, it's really easy to count in binary. Let's start simple: in binary, if I want to represent the number 10 I would write that as 1010. (8 times 1 plus 4 times 0 plus 2 times 1 plus 1 times 0 = 10 decimal.) For reference, here's 0 to 15 in binary: 0000 = 0, 0001 = 1, 0010 = 2, 0011 = 3, 0100 = 4, 0101 = 5, 0110 = 6, 0111 = 7, 1000 = 8, 1001 = 9, 1010 = 10, 1011 = 11, 1100 = 12, 1101 = 13, 1110 = 14, 1111 = 15. Once you catch on to this, you understand just how basic it really is.

Now (finally) back to the bit. A "bit" is simply any individual binary digit. Thus to represent numbers from 0 to 15, you need 4 bits. (Look above: 15 = 1111, which is the highest number I can represent using only 4 bits. If I wanted to write 16, that would be 10000, which needs 5 bits.)
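If you happen to know a little C, here's a tiny sketch (the variable name and the 4-bit loop are just my choices for illustration) that spells out the number 10 bit by bit, exactly the way we just did by hand:

    #include <stdio.h>

    int main(void)
    {
        int n = 10;   /* decimal 10, which is 1010 in binary: 8x1 + 4x0 + 2x1 + 1x0 */

        /* print the low 4 bits, most significant first */
        for (int bit = 3; bit >= 0; bit--)
            printf("%d", (n >> bit) & 1);

        printf(" (binary) = %d (decimal)\n", n);   /* prints: 1010 (binary) = 10 (decimal) */
        return 0;
    }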
Ok, I think we now have a fairly good understanding of a bit, so now onto the byte. A byte is simply a number using 8 bits. This means that a single byte of information can have up to 256 possible values, ranging from 00000000 = 0 to 11111111 = 255. The byte is a very commonly used term in computers and tech literature. (A megabyte is technically one million bytes; a gigabyte is one billion bytes.) Remember, just above I said that with a single byte you can represent up to 256 different numbers (integers 0 to 255). Given that piece of information, is it hard to understand that an 8 bit CPU can work only with (positive) numbers from 0 to 255 in each operation it performs? For example, if an 8 bit CPU wants to add two numbers, each number can be no greater than 255. Now obviously you CAN work with numbers bigger than 255, but you have to use more than one operation to do so: the first byte can represent the lower digits and the second byte the higher digits. Unfortunately, explaining this further would drag the discussion way off topic, so let's move on, shall we? The important concept to gather is that an 8 bit CPU must perform more operations to work with numbers greater than 255, which puts it at a disadvantage against CPUs that can work with 16 bits (or more) at a time.
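To make that concrete, here's a small C sketch (unsigned char is C's one-byte type; the carry trick shown is just a simplified illustration of what an 8 bit CPU has to do in two steps):

    #include <stdio.h>

    int main(void)
    {
        unsigned char a = 200, b = 100;   /* one byte each, so 0 to 255 */

        /* the true sum (300) does not fit in 8 bits, so it wraps around */
        unsigned char sum8 = a + b;
        printf("one-byte sum: %u\n", sum8);               /* prints 44 (300 - 256) */

        /* the 8 bit CPU way: keep a low byte plus a carry byte -- two steps */
        unsigned char low   = a + b;                      /* lower 8 bits */
        unsigned char carry = (a + b) > 255 ? 1 : 0;      /* did it overflow? */
        printf("two-byte sum: %u\n", carry * 256 + low);  /* prints 300 */

        return 0;
    }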
Now, how about 16 bits? With 16 bits (2 bytes) we can describe up to 65536 possible numbers (0 to 65535). 16 bits is actually a rather special number of bits for x86 CPUs, because originally the x86 instruction set was only 16 bit. (This would be CPUs like the 8086, 80186, and 80286.) The x86 instruction set was expanded to 32 bit with the 80386 (386), so x86 CPUs from the 386 onwards could process either 16 bit or 32 bit instructions and data. 16 bits is also relevant for another reason, and that is the concept of the word. The basic definition of the binary word is the number of bits the CPU natively works with. In the x86 world this is not simple anymore, as x86 CPUs started out as 16 bit and were expanded to 32 bit (and now 64 bit with the Athlon 64). In the x86 world, the "word" is 2 bytes, or 16 bits, long. In other architectures this is not the case; for the PowerPC (Macs) the word is 32 bits or 4 bytes. Why is the "word" important? Simply put, the word is the (smallest) number of bits the CPU works with in each operation.
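Here's the same idea at 16 bits, sketched in C (uint16_t from stdint.h is an exact 16 bit type, so this behaves the same on any compiler):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t w = 65535;            /* the largest value 16 bits can hold */
        w = w + 1;                     /* one more than the maximum wraps back to 0 */
        printf("%u\n", (unsigned)w);   /* prints 0 */
        return 0;
    }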
Ok, let's go further. In software programming languages (like C or Java) you must define how many bits are contained in a variable. To do this, software "declares" a variable, which essentially tells the computer how much memory to allocate for that particular variable. Thus if you declare a typical whole-number variable for an x86 CPU, it is going to take up at least one word, or 16 bits, of memory. Now we can define what a "double word" (often abbreviated dword) is: a variable made up of 2 words, or in the x86 world 32 bits. With a 32 bit variable you can describe just over 4 billion possible values.
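A quick C sketch of the idea (the exact sizes of short and int aren't guaranteed by the C language itself, but on a typical 32 bit x86 compiler they come out to one word and one double word):

    #include <stdio.h>

    int main(void)
    {
        short word_var  = 32000;        /* usually one word  = 2 bytes = 16 bits */
        int   dword_var = 2000000000;   /* usually one dword = 4 bytes = 32 bits */

        printf("short takes %u bytes\n", (unsigned)sizeof(word_var));   /* usually 2 */
        printf("int   takes %u bytes\n", (unsigned)sizeof(dword_var));  /* usually 4 */
        return 0;
    }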
Now, in JohnnyB's excellent posts above he mentions a "float." What is a "float"? A float is short for a floating point number. A floating point number is simplest to explain, for those of us familiar with decimal numbers, as a number with a decimal point. Thus 12.286 is a floating point number, versus an integer like 410. Representing a floating point number in binary is very interesting. Picture this: I can also represent 12.286 as 12286 with the decimal point shifted 3 places to the left. Thus a floating point number can be represented as the digits that make up the number plus the number of places to move the decimal point, and this is how a computer treats floating point numbers. (The technical terms for these two pieces of information are the mantissa (the 12286 in my example) and the exponent (the number of places to move the decimal point).)

Now, obviously, to do any serious math (and realistically any kind of division) you need the capability of working with floating point numbers. Every x86 CPU since the 486DX has had a special unit within the processor dedicated to working with these numbers: the Floating Point Unit (FPU). The instruction set primarily used to work with floating point numbers is the x87 instruction set. (Why is it called x87? Because some computers, like 286 based PCs, had the option of using a separate CPU of sorts, the 80287 chip, which handled floating point instructions. The instructions this chip used were naturally called x87 instructions. After the FPU was integrated with the rest of the CPU in the 486DX, the x87 name for the instructions stuck.)

The FPU can perform a LOT of different instructions. It can add, subtract, multiply, divide (with or without remainder), perform trig instructions like sine, cosine, tangent, and arctangent, raise numbers to powers of 2 or powers of e (if you don't understand what e is, don't worry), perform logarithm instructions, compare numbers, and read and write to memory. (And much, much more.) As you might imagine, the floating point unit is MUCH more complicated (and takes longer to do some instructions) than the integer unit. (The integer unit is known as the ALU, or arithmetic logic unit. The ALU performs most of the same tasks as the floating point unit, but only on integer numbers, i.e. those without a decimal point.) Games make heavy use of the floating point unit to perform just about all mathematical operations; the integer unit (ALU) simply lacks the precision (since it can only work with integers) for many calculations games need to perform. Thus for gaming in particular, the performance of the FPU (at least for the CPU's part) is key.
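If you want to see the mantissa/exponent split for yourself, C's standard frexp() function will pull a number apart for you (note the hardware actually works in powers of 2 rather than the powers of 10 in my decimal-point example; you may need to compile with -lm):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 12.286;
        int exponent;

        /* frexp() splits x into a mantissa between 0.5 and 1.0
           and a power-of-2 exponent, so that x = mantissa * 2^exponent */
        double mantissa = frexp(x, &exponent);

        printf("%f = %f x 2^%d\n", x, mantissa, exponent);   /* 12.286 = 0.767875 x 2^4 */
        return 0;
    }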
Why have I spent all this time describing all this? Quite simply, SIMD (Single Instruction, Multiple Data) instructions are designed to let an x86 CPU perform a single instruction on several of these numbers at the same time. The differences between the various SIMD instruction sets lie in the types and number of bits their instructions can deal with.
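To tie it all together, here's a small C sketch using the SSE intrinsics from xmmintrin.h (just one illustrative example: it adds four 32 bit floats to four others with a single instruction, where plain x87 code would need four separate additions):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        float a[4] = {  1.0f,  2.0f,  3.0f,  4.0f };
        float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        float r[4];

        __m128 va  = _mm_loadu_ps(a);    /* load four floats into one 128 bit register */
        __m128 vb  = _mm_loadu_ps(b);
        __m128 sum = _mm_add_ps(va, vb); /* one instruction, four additions at once */
        _mm_storeu_ps(r, sum);

        printf("%.0f %.0f %.0f %.0f\n", r[0], r[1], r[2], r[3]);   /* prints: 11 22 33 44 */
        return 0;
    }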