Skuzzy, about the SSE2 optimizations in the Catalyst drivers: in the CPU benchmarks I've seen from before the Athlon 64, where both ATI and nVidia cards were tested, the Athlon XP typically did better with ATI cards than with nVidia cards. I'm sure there are some significant SSE2 optimizations there, because the Athlon 64 reviews using ATI 9800 Pros (Aces Hardware, Anandtech) showed a much bigger gain over the Athlon XP than the reviews using 5900 Ultras (Tom's Hardware), but it looks like ATI's non-SSE2 Athlon XP codepath was not bad at all either.
Realistically though, I'm wondering just how heavily a video driver can be SSE2 optimized. (I wish I knew exactly what operations the video drivers are doing, but I'd assume it's primarily memory read and write operations.) In the cases where SSE2 isn't used, the Athlon can dispatch 3 load/store ops plus 3 integer/FP ops (6 total) per clock versus the P4's 2 load/store and 2 int/FP (4 total). If the P4 weren't using SSE2 instructions, it would seem to be at a significant disadvantage to the Athlon on a per-clock basis.
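To put a rough number on that per-clock gap, here is a back-of-the-envelope sketch in Python. It only uses the peak dispatch figures quoted above, so it says nothing about real driver workloads, cache behavior, or sustained throughput. The dictionary and function names are made up for illustration.

```python
from fractions import Fraction

# Peak per-clock dispatch figures as quoted in the post (not measured here).
ATHLON = {"load_store": 3, "int_fp": 3}
P4_NO_SSE2 = {"load_store": 2, "int_fp": 2}


def peak_ops_per_clock(cpu):
    """Sum the peak ops a CPU can dispatch in one clock."""
    return sum(cpu.values())


def break_even_clock_ratio(narrow, wide):
    """Clock-speed multiple the narrower CPU needs to match the wider
    CPU's peak dispatch rate (theoretical peak only)."""
    return Fraction(peak_ops_per_clock(wide), peak_ops_per_clock(narrow))


if __name__ == "__main__":
    ratio = break_even_clock_ratio(P4_NO_SSE2, ATHLON)
    print(f"Athlon peak ops/clock: {peak_ops_per_clock(ATHLON)}")
    print(f"P4 (no SSE2) peak ops/clock: {peak_ops_per_clock(P4_NO_SSE2)}")
    print(f"P4 needs about {float(ratio):.2f}x the clock to match peak dispatch")
```

On those figures alone, a P4 running non-SSE2 code would need roughly 1.5 times the Athlon's clock speed just to match its peak dispatch rate. That is an upper-bound comparison, not a prediction of benchmark results.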
Regardless, adding SSE2 support (and actually adding 8 more SSE2 registers in 64-bit mode) to the Athlon 64 was definitely the right thing for AMD to do. They were considering their own new SIMD instruction set for the Athlon 64, but SSE2 support gives the Athlon 64 an immediate performance boost in heavily P4-optimized applications. From a gaming standpoint, though, I think the on-die memory controller, which greatly reduces memory latency, is probably just as important.