"Intel has chosen a markedly different direction than Alpha. Intel is
introducing a new 64-bit instruction set architecture called IA64. They have called the architecture EPIC, for Explicitly Parallel Instruction Computing, but it is essentially a VLIW (Very Long Instruction Word) architecture. The IA64 architecture is very similar to the machine built by Cydrome, a failed minisupercomputer company of the 1980s. The first implementation of IA64 is called Merced, with a follow-on implementation called McKinley. With the IA64, Intel is focusing on a compiler-driven technology to increase instruction-level
parallelism, and is ignoring other proven ways to improve performance on large
applications. IA64 is developed for an in-order execution model, with a set of new architectural extensions to permit compilers to identify more
instruction-level parallelism. These architectural extensions will make it very difficult for IA64 processors to implement out-of-order execution or simultaneous multithreading efficiently. For most applications, the small benefit that these architectural extensions give compilers does not equal the performance lost by not using these dynamic techniques."
"The IA64 design is a derivative of the VLIW machines designed by
Multiflow and Cydrome in the 1980s. The key idea is a generalization of horizontal microcode: in a wide instruction word the processor presents control of all of the functional units to the compiler, and the compiler precisely schedules where every operation, every register file read, every bypass, will occur. In effect, the compiler creates a record of execution for the program, and the machine plays that record. In the early VLIWs, if the compiler made
a mistake, the machine generated the wrong results; the machine had no logic to check that registers were read in the correct order or that resources were not oversubscribed. In more modern machines such as the IA64 processors, the machine will run slowly (but correctly) when the compiler is wrong.
The IA64 design requires the compiler to predict at compile-time how a
program will behave. Traditionally, VLIW-style machines have been built without caches and focused on loop-intensive, vectorizable code. These restrictions mean the memory latency is fixed and branch behavior is very predictable at compile-time. However, IA64 will be implemented as a general-purpose processor, with a data cache, running a wide variety of applications. In most applications, the latency of a memory operation is very difficult to
predict; a cache miss may have a latency 100 times longer than that of a
cache hit. Alpha's out-of-order design can dynamically adjust to the cache pattern of the program; on an IA64 processor, when the compiler makes a mistake, the machine will stall. Similarly, the IA64 design requires the compiler to move code across branches to find parallelism. However, this decision requires the compiler to predict branch direction at compile-time. This is very difficult to do, and even with elaborate profile-feedback systems, where a program is run to gather information about its behavior before it is compiled, compile-time branch prediction rates are at best 85%. Without feedback, the
compile-time rates are much closer to 50%. In contrast, hardware branch predictors are 95-98% accurate. An IA64 design will be executing unprofitable
speculative instructions 3-10x more frequently than an Alpha design.
The IA64 is an architectural idea that was developed for vectorizable programs. Intel has tried to extend it to commercial applications, but it is fundamentally the wrong design for these problems."
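The "3-10x" figure in the passage above can be checked against the accuracy numbers it quotes. The sketch below is one plausible reading, not the white paper's stated derivation: it assumes that unprofitable speculation scales with the branch misprediction rate, and the function name `misprediction_ratio` is a hypothetical helper introduced here for illustration.

```python
def misprediction_ratio(static_accuracy, dynamic_accuracy):
    """Ratio of the compile-time misprediction rate to the hardware one.

    Assumption (not stated in the source): wasted speculative work is
    proportional to the misprediction rate, i.e. 1 - accuracy.
    """
    return (1.0 - static_accuracy) / (1.0 - dynamic_accuracy)


# Accuracy figures quoted in the passage above.
STATIC_ACCURACY = {"with profile feedback": 0.85, "without feedback": 0.50}
HARDWARE_ACCURACY = (0.95, 0.98)

for label, static in STATIC_ACCURACY.items():
    for hardware in HARDWARE_ACCURACY:
        ratio = misprediction_ratio(static, hardware)
        print(f"compiler {label} ({static:.0%}) vs hardware ({hardware:.0%}): "
              f"{ratio:.1f}x as many mispredictions")
```

Against a 95% accurate hardware predictor, the ratio is about 3x with profile feedback and about 10x without it, which matches the quoted range. Against a 98% accurate predictor the same arithmetic gives about 7.5x and 25x, so the quoted range appears to use the conservative end of the hardware figures.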
"An explicit goal in the development of the Alpha architecture was to enable innovative performance improvements in compilers, architecture, and circuit implementation. We did not add features to the instruction set architecture that make compiler improvements easy but hardware improvements difficult. In the early 1990s, we designed a VLIW version of Alpha similar to IA64 [1,2,3,4,5,6]. During this process we discovered that most of the compiler technology for a VLIW processor could equally well be applied to a RISC processor, and that by avoiding IA64-style extensions to Alpha, we could also
implement an out-of-order processor.
Alpha is designed to exploit both compile-time and run-time information. We agree with the IA64 designers that the compiler should create a record of execution for a program. However, we also recognized that the processor will know at run-time additional information about a program's behavior, for example, whether a memory reference is a cache miss and which direction a branch takes. Rather than stall the processor when the compiler is wrong, we designed an out-of-order issue mechanism that allowed the machine to adapt to the run-time behavior of the program. In addition, a compiler has a restricted view of the program and often cannot optimize across routine or module boundaries. At run-time, an out-of-order processor can find parallelism across these boundaries. Compiler technology must be combined with out-of-order execution to extract the most instruction-level parallelism from a program."
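The stall-versus-adapt argument can be illustrated with a toy model. The sketch below is a deliberately simplified, single-issue simulation and does not model either the Alpha or any IA64 implementation; the function `run`, the program encoding, and the latencies of 1 and 100 cycles are all assumptions chosen to mirror the "100 times longer" cache-miss figure quoted earlier. The program order stands in for a compile-time schedule that assumed the load would hit in the cache.

```python
def run(program, in_order):
    """Return the cycle at which the last result becomes available.

    program: list of (latency, deps) tuples in program order, where deps
    lists the indices of earlier instructions whose results are needed.
    At most one instruction issues per cycle (a toy assumption).
    """
    n = len(program)
    done_at = [None] * n      # cycle at which each result is available
    issued = [False] * n
    remaining = n
    cycle = 0
    while remaining:
        for i in range(n):
            if issued[i]:
                continue
            latency, deps = program[i]
            ready = all(done_at[d] is not None and done_at[d] <= cycle
                        for d in deps)
            if ready:
                issued[i] = True
                done_at[i] = cycle + latency
                remaining -= 1
                break         # single issue: one instruction per cycle
            if in_order:
                break         # oldest unissued instruction blocks the rest
            # out-of-order: keep scanning for a younger ready instruction
        cycle += 1
    return max(done_at)


def make_program(load_latency, independent_ops=50):
    """A load, one consumer of the load, then independent one-cycle ops."""
    return [(load_latency, []), (1, [0])] + [(1, [])] * independent_ops


hit, miss = make_program(1), make_program(100)
for name, program in (("cache hit", hit), ("cache miss", miss)):
    print(f"{name}: in-order {run(program, True)} cycles, "
          f"out-of-order {run(program, False)} cycles")
```

In this model both machines finish in the same number of cycles when the load hits. When it misses, the in-order machine waits behind the load's consumer for the full miss latency before starting the independent work, while the out-of-order machine completes that work under the shadow of the miss. A compiler that knew about the miss could reorder the code to the same effect, which is the point of the quoted argument: the information is only reliably available at run-time.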