Chipsets Background

The list of functions needed to make a microprocessor into a computer reads like a laundry list for a three-ring circus: long and exotic, with a trace of the unexpected. A fully operational computer requires a clock or oscillator to generate the signals that lock the circuits together; a memory mastermind that ensures each byte goes to the proper place and stays there; a traffic cop for the expansion bus and other interconnecting circuitry to control the flow of data in and out of the chip and the rest of the system; office-assistant circuits for taking care of the daily routines and lifting the responsibility for minor chores from the microprocessor; and, in most modern computers, communications, video, and audio circuitry that let you and the computer deal with one another like rational beings.

Every computer manufactured since the first amphibians crawled from the primeval swamp has required these functions, although in the earliest computers they did not take the form of a chipset. The micro-miniaturization technology was nascent, and the need for combining so many functions in so small a package was negligible. Instead, computer engineers built the required functions from a variety of discrete circuits—small, general-purpose integrated circuits such as logic gates—and a few functional blocks, which are larger integrated circuits designed to a particular purpose but not necessarily one that had anything to do with computers. These garden-variety circuits, together termed support chips, were combined to build all the necessary computer functions into the first computer. The modern chipset not only combines these support functions onto a single silicon slice but also adds features beyond the dreams of the first computer’s designers—things such as USB ports, surround-sound systems, and power-management circuits. This all-in-one design makes for simpler and far cheaper computers.

Trivial as it seems, encapsulating a piece of silicon in an epoxy plastic package is one of the more expensive parts of making chips. For chipmakers, it’s much more expensive than adding more functions to the silicon itself. The more functions the chipmaker puts on a single chip, the less each function costs. Those savings dribble down to computer-makers and, eventually, you. Moreover, designing a computer motherboard with discrete support circuitry was a true engineering challenge because it required a deep understanding of the electronic function of all the elements of a computer. Using a chipset, a computer engineer need only be concerned with the signals going in and out of a few components. The chipset might be a magical black box for all the designer cares. In fact, in many cases the only skill required to design a computer from a chipset is the ability to navigate from a roadmap. Most chipset manufacturers provide circuit designs for motherboards to aid in the evaluation of their products. Many motherboard manufacturers (all too many, perhaps) simply take the chipset-maker’s evaluation design and turn it into a commercial product.

You can trace the chipset genealogy all the way back to the first computer. This heritage results from the simple need for compatibility. Today’s computers mimic the function of the earliest computers—and IBM’s original Personal Computer of 1981 in particular—so that they can all run the same software. Although seemingly anachronistic in an age when those who can remember the first computer also harbor memories of Desotos and dinosaurs, even the most current chipsets must precisely mimic the actions of early computers so that the oldest software will still operate properly in new computers—provided, of course, all the other required support is also present. After all, some Neanderthal will set switchboards glowing from I-95 to the Silicon Valley with threats of lawsuits and aspersions about the parenthood of chipset designers when the DOS utilities he downloaded in 1982 won’t run on his new Pentium 4.

Microprocessors History

No matter the designation or origin, all microprocessors in today’s Windows-based computers share a unique characteristic and heritage. All are direct descendants of the very first microprocessor. The instruction set used by all current computer microprocessors is rooted in the instructions selected for that first-ever chip. Even the fastest of today’s Pentium 4 chips has, hidden in its millions of transistors, the capability of acting exactly like that first chip.

In a way, that’s good because this backward-looking design assures us that each new generation of microprocessor remains compatible with its predecessors. When a new chip arrives, manufacturers can plug it into a computer and give you reasonable expectations that all your old software will still work. But holding to the historical standard also heaps extra baggage on chip designs that holds back performance. By switching to a radically new design, engineers could create a faster, simpler microprocessor—one that could run circles around any of today’s chips, but, alas, one that can’t use any of your current programs or operating systems.

Initial Development

The history of the microprocessor stretches back to a 1969 request to Intel by a now-defunct Japanese calculator company, Busicom. The original plan was to build a series of calculators, each one different and each requiring a custom integrated circuit. Using conventional IC technology, the project would have required the design of 12 different chips. The small volumes of each design would have made development costs prohibitive.

Intel engineer Marcian E. (Ted) Hoff had a better idea, one that could slash the necessary design work. Instead of a collection of individually tailored circuits, he envisioned creating one general-purpose device that would satisfy the needs of all the calculators. Hoff laid out an integrated circuit with 2,300 transistors using 10-micron design rules with four-bit registers and a four-bit data bus. Using a 12-bit multiplexed addressing system, it was able to address 640 bytes of memory for storing subproducts and results.

Most amazing of all, once fabricated, the chip worked. It became the first general-purpose microprocessor, which Intel put on sale as the 4004 on November 15, 1971.

The chip was a success. Not only did it usher in the age of low-cost calculators, it also gave designers a single solid-state programmable device for the first time. Instead of designing the digital decision-making circuits in products from scratch, developers could buy an off-the-shelf component and tailor it to their needs simply by writing the appropriate program.

With the microprocessor’s ability to handle numbers proven, the logical next step was to enable chips to deal with a broader range of data, including text characters. The 4004’s narrow four-bit design was sufficient for encoding only numbers and basic operations, a total of 16 symbols. The registers would need to be wider to accommodate a wider repertory. Rather than simply bump up the registers a couple of bits, Intel’s engineers chose to go double and design a full eight-bit microprocessor with eight-bit registers and an eight-bit data bus. In addition, this endowed the chip with the ability to address a full 16KB of memory using 14 multiplexed address lines. The result, which required a total of 3,450 transistors, was the Intel 8008, introduced in April 1972.

Intel continued development (as did other integrated circuit manufacturers) and, in April 1974, created a rather more drastic revision, the 8080, which required nearly twice as many transistors (6,000) as the earlier chip. Unlike the 8008, the new 8080 chip was planned from the start for byte-size data. Intel gave the 8080 a 16-bit address bus that could handle a full 64KB of memory and a richer command set, one that embraced all the commands of the 8008 but went further. This set a pattern for Intel microprocessors: Every increase in power and range of command set enlarged on what had gone before rather than replacing it, thus ensuring backward compatibility (at least to some degree) of the software. To this day, the Intel-architecture chips used in personal computers can run program code written using 8080 instructions. From the 8080 on, the story of the microprocessor is simply one of improvements in fabrication technology and increasingly complex designs.
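The memory limits quoted for these early chips follow directly from the number of address lines: each extra line doubles the number of bytes the chip can reach. The short sketch below illustrates that arithmetic. The function names are invented for this example and are not drawn from any Intel documentation.

```python
def addressable_bytes(address_lines: int) -> int:
    """Distinct byte addresses reachable with the given number of address lines."""
    return 2 ** address_lines


def human(n: int) -> str:
    """Format a byte count using the binary units this chapter uses (KB, MB, GB)."""
    for unit, size in (("GB", 2 ** 30), ("MB", 2 ** 20), ("KB", 2 ** 10)):
        if n >= size and n % size == 0:
            return f"{n // size}{unit}"
    return f"{n} bytes"


# A four-bit value, as in the 4004, can encode 2**4 distinct symbols.
symbols_4bit = 2 ** 4

if __name__ == "__main__":
    print("4-bit symbols:", symbols_4bit)
    for chip, lines in (("8008", 14), ("8080", 16)):
        print(chip, lines, "address lines ->", human(addressable_bytes(lines)))
```

The same function applies to the wider address buses of later chips discussed in this chapter.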

With each new generation of microprocessor, manufacturers relied on improving technology in circuit design and fabrication to increase the number and size of the registers in each microprocessor, broadening the data and address buses to match. When that strategy stalled, they moved to superscalar designs with multiple pipelines. Improvements in semiconductor fabrication technology made the increasing complexity of modern microprocessor designs both practical and affordable. In the three decades since the introduction of the first microprocessor, the linear dimensions of semiconductor circuits have decreased to about 1/77th their original size, from 10-micron design rules to 0.13 micron, which means microprocessor-makers can squeeze nearly 6,000 transistors where only one fit originally. This size reduction also facilitates higher speeds. Today’s microprocessors run more than 23,000 times faster than the first chip out of the Intel foundry, 2.5GHz in comparison to the 108kHz of the first 4004 chip.
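The scaling figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that transistor density scales with the square of the linear shrink, which is an idealization, and uses only the design rules and clock speeds quoted in the text. The variable names are illustrative.

```python
# Design rules quoted in the text, in microns.
old_rule_um = 10.0
new_rule_um = 0.13

# Linear shrink factor: how many times smaller each feature has become.
linear_shrink = old_rule_um / new_rule_um

# Idealized density gain: area scales with the square of the linear dimension.
density_gain = linear_shrink ** 2

# Clock speeds quoted in the text: 2.5GHz today versus 108kHz for the 4004.
clock_ratio = 2.5e9 / 108e3

if __name__ == "__main__":
    print(f"linear shrink: about {linear_shrink:.0f}x")
    print(f"density gain:  about {density_gain:.0f}x")
    print(f"clock ratio:   about {clock_ratio:.0f}x")
```

Under these assumptions the linear shrink comes to roughly 77, the density gain to a little under 6,000, and the clock ratio to a little over 23,000.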

Personal Computer Influence

The success of the personal computer marked a major turning point in microprocessor design. Before the PC, microprocessor engineers designed what they regarded as the best possible chips. Afterward, they focused their efforts on making chips for PCs. This change came between what is now regarded as the first generation of Intel microprocessors and the third generation, in the years 1981 to 1987.

8086 Family

The engineers who designed the IBM Personal Computer chose to use a chip from the Intel 8086 family. Intel introduced the 8086 chip in 1978 as an improvement over its first chips. Intel’s engineers doubled the size of the registers in its 8080 to create a chip with 16-bit registers and about 10 times the performance. The 16-bit design carried through completely, also doubling the size of the data bus of earlier chips to 16 bits to move information in and out twice as fast.

In addition, Intel broadened the address bus from 16 bits to 20 bits to allow the 8086 to directly address up to one megabyte of RAM. Intel divided this memory into 64KB segments to make programming and the transition to the new chip easier. A single 16-bit register could address any byte in a given segment. Another, separate register indicated which of the segments that address was in.
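A minimal sketch of how the two registers combine may help. The text does not spell out the arithmetic. The example relies on the documented 8086 convention that the segment value is shifted left four bits (multiplied by 16) and added to the 16-bit offset, with the result truncated to 20 bits. The function name is invented for this illustration.

```python
def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit 8086 address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Segment * 16 + offset, wrapped to the 20 address lines of the 8086.
    return ((segment << 4) + offset) & 0xFFFFF


if __name__ == "__main__":
    # Any offset from 0x0000 to 0xFFFF stays inside one 64KB segment.
    print(hex(physical_address(0x1234, 0x0010)))
    # The highest segment plus a small offset wraps past the 1MB limit.
    print(hex(physical_address(0xFFFF, 0x0010)))
```

Because segments start every 16 bytes, many different segment and offset pairs name the same physical byte, which is one reason real-mode pointer arithmetic was notoriously awkward.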

A year after the introduction of the 8086, Intel introduced the 8088. The new chip was identical to the 8086 in every way—16-bit registers, 20 address lines, and the same command set—except one. Its data bus was reduced to eight bits, enabling the 8088 to exploit readily available eight-bit support hardware. At that, the 8088 broke no new ground and should have been little more than a footnote in the history of the microprocessor. However, its compromise design that mated 16-bit power with cheap 8-bit support chips made the 8088 IBM’s choice for its first personal computer. With that, the 8088 entered history as the second most important product in the development of the microprocessor, after the ground-breaking 4004.

286 Family

After the release of the 8086, Intel’s engineers began to work on a successor chip with even more power. Designated the 80286, the new chip was to feature several times the speed and 16 times more addressable memory than its predecessors. Inherent in its design was the capability of multitasking, with new instructions for managing tasks and a new operating mode, called protected mode, that made its full 16MB of memory fodder for advanced operating systems.

The 80286 chip itself was introduced in 1982, but its first major (and most important) application didn’t come until 1984 with the introduction of IBM’s Personal Computer AT. Unfortunately, development of the 80286 began before the PC arrived, and few of the new features were compatible with the personal computer design. The DOS operating system for PCs and all the software that ran under it could not take advantage of the chip’s new protected mode, which effectively put most of the new chip’s memory off limits to PC programs.

With all its innovations ignored by PCs, the only thing the 80286 had going for it was its higher clock speed, which yielded better computer performance. Initially released at 6MHz, the 80286 quickly climbed to 8MHz, and then 10MHz. Versions operating at 12.5MHz, 16MHz, 20MHz, and ultimately 24MHz were eventually marketed.

The 80286 proved to be an important chip for Intel, although not because of any enduring success. It taught the company’s engineers two lessons. First was the new importance of the personal computer to Intel’s microprocessor market. Second was licensing. Although the 80286 was designed by Intel, the company licensed the design to several manufacturers, including AMD, Harris Semiconductor, IBM, and Siemens. Intel granted these licenses not only for income but also to assure the chip buyers that they had alternate sources of supply for the 80286, just in case Intel went out of business. At the time, Intel was a relatively new company, one of many struggling chipmakers. With the success of the PC and its future ensured, however, Intel would never again license its designs so freely.

Even before the 80286 made it to the marketplace, Intel’s engineers were working on its successor, a chip designed with the power of hindsight. By then they could see the influence that the personal computer’s primeval DOS operating system had on the microprocessor market, so they designed to match DOS instead of some vaguely conceived successor. They also added in enough power to make the chip a fearsome competitor.

386 Family

The next chip, the third generation of Intel design, was the 80386. Two features distinguish it from the 80286: a full 32-bit design, for both data and addressing, and the new Virtual 8086 mode. The first gave the third generation unprecedented power. The second made that power useful.

Moreover, Intel learned to tailor the basic microprocessor design to specific niches in the marketplace. In addition to the mainstream microprocessor, the company saw the need to introduce an “entry level” chip, which would enable computer makers to sell lower-cost systems, and a version designed particularly for the needs of battery-powered portable computers. Intel renamed the mainstream 80386 as the 386DX, designated an entry-level chip the 386SX (introduced in 1988), and reengineered the same logic core for low-power applications as the 386SL (introduced in 1990).

The only difference between the 386DX and 386SX was that the latter had a 16-bit external data bus whereas the former had a 32-bit external bus. Internally, however, both chips had full 32-bit registers. The origin of the D/S nomenclature is easily explained. The external bus of the 386DX handled double words (32 bits), and that of the 386SX, single words (16 bits).

Intel knew it had a winner and severely restricted its licensing of the 386 design. IBM (Intel’s biggest customer at the time) got a license only by promising not to sell chips. It could only market the 386-based microprocessors it built inside complete computers or on fully assembled motherboards. AMD won its license to duplicate the 386 in court based on technology-sharing agreements with Intel dating before even the 80286 had been announced. Another company, Chips and Technologies, reverse-engineered the 386 to build clones, but these were introduced too late (well after Intel advanced to its fourth generation of chips) to see much market success.

Age of Refinement

The 386 established Intel Architecture in essentially its final form. Later chips differ only in details. They have no new modes. Although Intel has added new instructions to the basic 386 command set, almost any commercial software written today will run on any Intel processor all the way back to the 386—but not likely any earlier processor, if the software is Windows based. The 386 design had proven itself and had become the foundation for a multibillion-dollar software industry. The one area for improvement was performance. Today’s programs may run on a 386-based machine, but they are likely to run very slowly. Current chips are about 100 times faster than any 386.

486 Family

The next major processor after the 386 was, as you might expect, the 486. Even Intel conceded its new chip was basically an improved 386. The most significant difference was that Intel added three features that could boost processing speed by working around handicaps in circuitry external to the microprocessor. These innovations included an integral Level One cache that helped compensate for slow memory systems, pipelining within the microprocessor to get more processing power from low clock speeds, and an integral floating-point unit that eliminated the handicap of an external connection. As this generation matured, Intel added one further refinement that let the microprocessor race ahead of laggardly support circuits—splitting the chip so that its core logic and external bus interface could operate at different speeds.

Intel introduced the first of this new generation in 1989 in the form of a chip then designated 80486, continuing with its traditional nomenclature. When the company added other models derived from this basic design, it renamed the then-flagship chip as the 486DX and distinguished lower-priced models by substituting the SX suffix and low-power designs for portable computers using the SL designation, as it had with the third generation. Other manufacturers followed suit, using the 486 designation for their similar products—and often the D/S indicators for top-of-the-line and economy models.

In the 486 family, however, the D/S split does not distinguish the width of the data bus. The designations had become disconnected from their origins. In the 486 family, Intel economized on the SX version by eliminating the integral floating-point unit. The savings from this strategy were substantial: without the floating-point circuitry, the 486SX required only about half the silicon of the full-fledged chip, making it cheaper to make. In the first runs of the 486SX, however, the difference was more a matter of marketing. The SX chips were identical to the DX chips except that their floating-point circuitry was either defective or deliberately disabled to make a less capable processor.

As far as hardware basics are concerned, the 486 series retained the principal features of the earlier generation of processors. Chips in both the third and fourth generations have three operating modes (real, protected, and virtual 8086), full 32-bit registers, and a 32-bit address bus enabling up to 4GB of memory to be directly addressed. Both support virtual memory that extends their addressing to 64TB. Both have built-in memory-management units that can remap memory in 4KB pages.
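To make the 4KB page remapping concrete, the sketch below splits a 32-bit address into a page number and an offset within the page, then swaps the page number for a physical frame number. It is a deliberate simplification: the flat dictionary used as a page table is an assumption for illustration only, and the real 386 and 486 walk a two-level table structure in memory. All names are hypothetical.

```python
PAGE_SIZE = 4 * 1024   # 4KB pages, as stated in the text
OFFSET_BITS = 12       # 2**12 == 4096, so the low 12 bits are the offset


def split_linear_address(addr: int) -> tuple[int, int]:
    """Return (page number, offset within page) for a 32-bit address."""
    return addr >> OFFSET_BITS, addr & (PAGE_SIZE - 1)


def remap(addr: int, page_table: dict[int, int]) -> int:
    """Translate an address using a flat page-number -> frame-number table."""
    page, offset = split_linear_address(addr)
    frame = page_table[page]  # raises KeyError for an unmapped page
    return (frame << OFFSET_BITS) | offset


# A 32-bit address space divided into 4KB pages holds 2**20 pages.
pages_in_4gb = 2 ** 32 // PAGE_SIZE

if __name__ == "__main__":
    table = {0x403: 0x1F}  # hypothetical mapping: page 0x403 lives in frame 0x1F
    print(hex(remap(0x00403ABC, table)))
```

Only the upper 20 bits of the address change during translation. The offset within the page passes through untouched, which is why remapping can be done at page granularity without copying data.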

But the hardware of the 486 also differs substantially from the 386 (or any previous Intel microprocessor). The pipelining in the core logic allows the chip to work on parts of several instructions at the same time. At times the 486 could carry out one instruction every clock cycle. Tighter silicon design rules (smaller details etched into the actual silicon that makes up the chip) gave the 486 more speed potential than preceding chips. The small but robust 8KB integral primary cache helped the 486 work around the memory wait states that plagued faster 386-based computers.

The streamlined hardware design (particularly pipelining) meant that the 486-level microprocessors could think faster than 386 chips when the two operated at the same clock speed. On most applications, the 486 proved about twice as fast as a 386 at the same clock rate, so a 20MHz 486 delivered about the same program throughput as a 40MHz 386.

Pentium Family

In March 1993, Intel introduced its first superscalar microprocessor, the first chip to bear the designation Pentium. At the time the computer industry expected Intel to continue its naming tradition and label the new chip the 80586. In fact, the competition was banking on it. Many had already decided to use that numerical designation for their next generation of products. Intel, however, wanted to distinguish its new chip from any potential clones and establish its own recognizable brand on the marketplace. Getting trademark protection for the 586 designation was unlikely. A federal court had earlier ruled that the 386 numeric designation was generic (that is, it described a type of product rather than something exclusive to a particular manufacturer), so trademark status was not available for it. Intel coined the word Pentium because it could get trademark protection. It also implied the number 5, signifying fifth generation, much as “586” would have.

Intel has used the Pentium name quite broadly as the designation for mainstream (or desktop performance) microprocessors, but even in its initial usage the singular Pentium designation obscured changes in silicon circuitry. Two very different chips wear the plain designation “Pentium.” The original Pentium began its life under the code name P5 and was the designated successor to the 486DX. The P5 was characterized by 5-volt operation, low operating speeds, and high power consumption, and Intel made it available at only three speeds: 60MHz, 66MHz, and 90MHz. Later, Intel refined the initial Pentium design as the P54C (another internal code name), with tighter design rules and lower voltage operation. These innovations raised the speed potential of the design, and commercial chips gradually stepped up from 100MHz to 200MHz. The same basic design underlies the Pentium OverDrive (or P24T) processor used for upgrading 486-based PCs.

In January 1997, Intel enhanced the Pentium instruction set to better handle multimedia applications and created the Pentium Processor with MMX Technology (code-named P55C during development). These chips also incorporated a larger on-chip primary memory cache, 32KB.

To put the latest in Pentium power in the field, Intel reengineered the Pentium with MMX Technology chip for low-power operation to make the Mobile Pentium with MMX Technology chip, also released in January 1997. Unlike that of the deskbound version, the addressing capability of the mobile chip was enhanced by four more lines to allow direct access to 64GB of physical memory.

Mature Design

The Pentium was Intel’s last CISC design. Other manufacturers were adapting RISC designs to handle the Intel instruction set and achieving results that put Intel on notice. The company responded with its own RISC-based design in 1995 that became the standard Intel core logic until the introduction of the Pentium 4 in the year 2000. Intel developed this logic core under the code name P6, and it has appeared in a wide variety of chips, including those bearing the names Pentium Pro, Pentium II, Celeron, Xeon, and Pentium III.

That’s not to say all these chips are the same. Although the entire series uses essentially the same execution units, the floating-point unit continued to evolve throughout the series. The Pentium Pro incorporates a traditional floating-point unit. That of the Pentium II is enhanced to handle the MMX instruction set. The Pentium III adds Streaming SIMD Extensions. In addition, Intel altered the memory cache and bus of these chips to match the requirements of particular market segments to distinguish the Celeron and Xeon lines from the plain Pentium series.

The basic P6 design uses its own internal circuits to translate classic Intel instructions into micro-ops that can be processed in a RISC-based core, which has been tuned using all the RISC design tricks to massage extra processing speed from the code. Intel called this design Dynamic Execution. In the standard language of RISC processors, Dynamic Execution merely indicates a combination of out-of-order instruction execution and the underlying technologies that enable its operation (branch prediction, register renaming, and so on).

The P6 pipeline has 12 stages, divided into three sections: an in-order fetch/decode stage, an out-of-order execution/dispatch stage, and an in-order retirement stage. The design is superscalar, incorporating two integer units and one floating-point unit.

Pentium Pro

One look and there’s no mistaking the Pentium Pro. Instead of a neat square chip, it’s a rectangular giant. Intel gives this package the name Multi-Chip Module (MCM). It is also termed a dual-cavity PGA (pin-grid array) package because it holds two distinct slices of silicon, the microprocessor core and secondary cache memory. This was Intel’s first chip with an integral secondary cache. Notably, this design results in more pins than any previous Intel microprocessor and a new socket requirement, Socket 8 (discussed earlier).

The main processor chip of the Pentium Pro uses the equivalent of 5.5 million transistors. About 4.5 million of them are devoted to the actual processor itself. The other million provide the circuitry of the chip’s primary cache, which provides a total of 16KB storage bifurcated into separate 8KB sections for program instructions and data. Compared to true RISC processors, the Pentium Pro uses about twice as many transistors. The circuitry that translates instructions into RISC-compatible micro-ops requires the additional transistor logic.

The integral secondary RAM cache fits onto a separate slice of silicon in the other cavity of the MCM. Its circuitry involves another 15.5 million transistors for 256KB of storage and operates at the same speed as the core logic of the rest of the Pentium Pro.

The secondary cache connects with the microprocessor core logic through a dedicated 64-bit bus, termed a back-side bus, that is separate and distinct from the 64-bit front-side bus that connects to main memory. The back-side bus operates at the full internal speed of the microprocessor, whereas the front-side bus operates at a fraction of the internal speed of the microprocessor.
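A rough calculation shows why running the back-side bus at full core speed matters. The sketch below computes peak theoretical bandwidth for a 64-bit bus, assuming one transfer per clock cycle. The 200MHz core clock is a hypothetical example value rather than a figure from the text, and the 66MHz front-side clock is the Pentium Pro's maximum bus rate. Megabytes here mean millions of bytes.

```python
def peak_bandwidth_mb_s(bus_width_bits: int, clock_mhz: int) -> int:
    """Peak bandwidth in millions of bytes per second, one transfer per clock."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * clock_mhz


core_mhz = 200  # hypothetical core clock, chosen only for illustration
fsb_mhz = 66    # front-side bus clock assumed for this example

back_side = peak_bandwidth_mb_s(64, core_mhz)
front_side = peak_bandwidth_mb_s(64, fsb_mhz)

if __name__ == "__main__":
    print("back-side bus: ", back_side, "MB/s")
    print("front-side bus:", front_side, "MB/s")
```

With these example clocks, the cache path offers roughly three times the peak bandwidth of the path to main memory, even though both buses are the same width.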

The Pentium Pro bus design superficially appears identical to that of the Pentium with 32-bit addressing, a 64-bit data path, and a maximum clock rate of 66MHz. Below the surface, however, Intel enhanced the design by shifting to a split-transaction protocol. Whereas the Pentium (and, indeed, all previous Intel processors) handled memory accessing as a two-step process (on one clock cycle the chip sends an address out the bus, and reads the data at the next clock cycle), the Pentium Pro can put an address on the bus at the same time it reads data from a previously posted address. Because the address and data buses use separate lines, these two operations can occur simultaneously. In effect, the throughput of the bus can nearly double without an increase in its clock speed.
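The claim that throughput can nearly double is easy to see in an idealized cycle count. The sketch below is a toy model, not a description of the actual Pentium Pro bus protocol: it assumes every read needs exactly one address cycle and one data cycle, and that the split-transaction bus can always overlap the next address with the current data. The function names are invented for the example.

```python
def cycles_two_step(n_reads: int) -> int:
    """Address cycle, then data cycle, for each read, with no overlap."""
    return 2 * n_reads


def cycles_split_transaction(n_reads: int) -> int:
    """The address for read i+1 goes out during the data cycle of read i."""
    if n_reads == 0:
        return 0
    # One cycle to post the first address, then one data cycle per read.
    return n_reads + 1


if __name__ == "__main__":
    for n in (1, 10, 100):
        old = cycles_two_step(n)
        new = cycles_split_transaction(n)
        print(f"{n:>3} reads: {old} cycles vs {new} cycles "
              f"(speedup {old / new:.2f}x)")
```

For a single read the two schemes take the same time. The gain appears only with a stream of back-to-back accesses, where the ratio approaches but never quite reaches two.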

The internal bus interface logic of the Pentium Pro is designed for multiprocessor systems. Up to four Pentium Pro chips can be directly connected together, pin for pin, without any additional support circuitry. The computer’s chipset arbitrates the combination.

Pentium II

The chief distinctions of the Pentium II are its addition of MMX technology and a new package called the Single Edge Contact cartridge, or SEC cartridge. The socket it plugs into is termed Slot 1.

One underlying reason for the cartridge-style design is to accommodate the Pentium II’s larger secondary cache, which is not integral to the chip package but rather co-mounted on the circuit board inside the cartridge. The 512KB of static cache memory connects through a 64-bit back-side bus. Note that the secondary cache memory of a Pentium II operates at one-half the speed of the core logic of the chip itself. This reduced speed is, of course, a handicap, but it was a design expediency. It lowers the cost of the technology, allowing Intel to use off-the-shelf cache memory (from another manufacturer, at least initially) in a lower-cost package. The Pentium II secondary cache design has another limitation. Although the Pentium II can address up to 64GB of memory, its cache can track only 512MB. The Pentium II also has a 32KB primary cache that’s split with 16KB assigned to data and 16KB to instructions. Table 5.3 summarizes the Intel Pentium II line.

Mobile Pentium II

To bring the power of the Pentium II processor to notebook computers, Intel reengineered the desktop chip to reduce its power consumption and altered its packaging to fit slim systems. The resulting chip, the Mobile Pentium II, introduced on April 2, 1998, preserved the full power of the Pentium II while sacrificing only its multiprocessor support. The power savings come from two changes. The core logic of the Mobile Pentium II is specifically designed for low-voltage operation and has been engineered to work well with higher external voltages. It also incorporates an enriched set of power-management modes, including a new QuickStart mode that essentially shuts down the chip, except for the logic that monitors for bus activity by the PCI bridge chip, and allows the chip to wake up when it’s needed. This design, because it does not monitor for other processor activity, prevents the Mobile Pentium II from being used in multiprocessor applications. The Mobile Pentium II can also switch off its cache clock during its sleep or QuickStart states.

Initially, the Mobile Pentium II shared the same P6 core logic design and cache design with the desktop Pentium II (full-speed 32KB primary cache and half-speed 512KB secondary cache inside its mini-cartridge package). However, as fabrication technology improved, Intel was able to integrate the secondary cache on the same die as the processor core, and on January 25, 1999, the company introduced a new version of the Mobile Pentium II with an integral 256KB cache operating at full core speed. Unlike the Pentium II, the mobile chip has the ratio between its core and bus clocks fixed at the factory to operate with a 66MHz front-side bus. Table 5.4 lists the introduction dates and basic characteristics of the Mobile Pentium II models.

Pentium II Celeron

Introduced on March 4, 1998, the Pentium II Celeron was Intel’s entry-level processor derived from the Pentium II. Although it had the same processor core as what was at the time Intel’s premier chip (the second-generation Pentium II with 0.25-micron design rules), Intel trimmed the cost of building the chip by eliminating the integral 512KB secondary (Level Two) memory cache installed in the Pentium II cartridge. The company also opted to lower the packaging cost of the chip by omitting the metal outer shell of the full Pentium II and instead leaving the Celeron’s circuit board substrate bare. In addition, the cartridge-based Celeron package lacked the thermal plate of the Pentium II and the latches that secure it to the slot. Intel terms the Celeron a Single Edge Processor Package to distinguish it from the Single Edge Contact cartridge used by the Pentium II.

In 1999, Intel introduced a new, lower-cost package for the Celeron, a plastic pin-grid array (PPGA) shell that looks like a first-generation Pentium on steroids. It has 370 pins and mates with Intel’s PGA370 socket. The chip itself measures just under two inches square (nominally 49.5 millimeters) and about three millimeters thick, not counting the pins, which hang down another three millimeters or so (the actual specification is 3.05 to 3.30 millimeters).

When the Celeron chip was initially introduced, the absence of a cache took such a toll on performance that Intel was forced by market pressure to revise its design. In August 1998, the company added a 128KB on-die cache operating at full core speed to the Celeron. Code names distinguished the two chips: The first Celeron was code-named Covington during development; the revised chip was code-named Mendocino. Intel further increased the cache to 256KB on October 2, 2001, with the introduction of a 1.2GHz Celeron variant.

Intel also distinguished the Celeron from its more expensive processor lines by limiting its front-side bus speed to 66MHz. All Celerons sold before January 3, 2001 were limited to that speed. With the introduction of the 800MHz Celeron, Intel kicked the chip’s front-side bus up to 100MHz. With the introduction of a 1.7GHz Celeron on May 2, 2002, Intel started quad-clocking the chip’s front-side bus, yielding an effective data rate of 400MHz.

Intel also limited the memory addressing of the Celeron to 4GB of physical RAM by omitting the four highest address bus signals used by the Pentiums II and III from the Celeron pin-out. The Celeron does not support multiprocessor operation, and, until Intel introduced the Streaming SIMD Extensions to the 1.2GHz version, the Celeron understood only the MMX extension to the Intel instruction set.
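The arithmetic behind the 4GB limit is easy to check. The following is a minimal sketch, not Intel documentation: it assumes the full Pentium II and III address bus has 36 lines, a width inferred from the 64GB figure the text gives for the Xeon rather than stated outright, and the function name is purely illustrative.

```python
GIB = 2 ** 30  # bytes in one gigabyte, as the text uses the term


def addressable_gb(address_lines: int) -> int:
    """Memory directly reachable with the given number of address lines, in GB."""
    return (2 ** address_lines) // GIB


# Assumption: 36 address lines on the full Pentium II/III bus (inferred, see above).
FULL_ADDRESS_LINES = 36

print(addressable_gb(FULL_ADDRESS_LINES))      # 64
print(addressable_gb(FULL_ADDRESS_LINES - 4))  # 4, with the four highest lines omitted
```

Each address line removed halves the reachable memory, so dropping four lines divides it by sixteen.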

Table 5.5 lists the features and introduction dates of various Celeron models.

Pentium II Xeon

In 1998, Intel sought to distinguish its higher performance microprocessors from its economy line. In the process, the company created the Xeon, a refined Pentium II microprocessor core enhanced by a higher-speed memory cache, one that operated at the same clock rate as the core logic of the chip.

At heart, the Xeon is a full 32-bit microprocessor with a 64-bit data bus, as with all Pentium-series processors. Its address bus provides for direct access to up to 64GB of RAM. The internal logic of the chip allows for up to four Xeons to be linked together without external circuitry to form powerful multiprocessor systems.

A sixth generation processor, the Xeon is a Pentium Pro derivative by way of the standard Pentium II. It incorporates two 12-stage pipelines to make what Intel terms Dynamic Execution micro-architecture.

The Xeon incorporates two levels of caching. One is integral to the logic core itself, a primary 32KB cache split 16KB for instructions, 16KB for data. In addition, a separate secondary cache is part of the Xeon processor module but is mounted separately from the core logic on the cartridge substrate. This integral-but-separate design allows flexibility in configuring the Xeon. Current chips are available equipped with either 512KB or 1MB of L2 cache, and the architecture and slot design allow for secondary caches of up to 2MB. This integral cache runs at the full core speed of the microprocessor.

This design required a new interface, tagged Slot 2 by Intel.

Initially, the core operating speed of the Xeon started where the Pentium II left off (at the time, 400MHz) and followed the Pentium II up to 450MHz.

The front-side bus of the Xeon was initially designed for 100MHz operation, although higher speeds are possible and expected. A set of contacts on the SEC cartridge allows the motherboard to adjust the multiplier that determines the ratio between front-side bus and core logic speed.
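The relationship between bus speed, multiplier, and core speed is simple multiplication. The sketch below uses the 100MHz bus and the 400MHz and 450MHz core speeds quoted above; the multiplier values are derived from those figures rather than taken from an Intel specification, and the function names are illustrative.

```python
def core_mhz(fsb_mhz: float, multiplier: float) -> float:
    """Core clock produced by multiplying the front-side bus clock."""
    return fsb_mhz * multiplier


def multiplier_for(target_core_mhz: float, fsb_mhz: float) -> float:
    """Multiplier needed to reach a target core clock from a given bus clock."""
    return target_core_mhz / fsb_mhz


FSB_MHZ = 100.0  # front-side bus speed given in the text

for target in (400.0, 450.0):
    ratio = multiplier_for(target, FSB_MHZ)
    print(f"{target:.0f}MHz core on a {FSB_MHZ:.0f}MHz bus needs a {ratio}x multiplier")
```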

The independence of the logic core and cache is emphasized by the power requirements of the Xeon. Each section requires its own voltage level. The design of the Xeon allows Intel flexibility in the power requirements of the chip through a special coding scheme. A set of pins indicates the core voltage and the cache voltage required by the chip, and the chip expects the motherboard to read those requirements and deliver the required voltages. The Xeon design allows for core voltages as low as 1.8 volts or as high as 2.1 volts (the level required by the first chips). Cache voltage requirements may reach as high as 2.8 volts. Nominally, the Xeon is a 2-volt chip.

Overall, the Xeon is optimized for workstations and servers and features built-in support for up to four identical chips in a single computer. Table 5.6 summarizes the original Xeon product line.

Electrical Characteristics

At its heart, a microprocessor is an electronic device. This electronic foundation has important ramifications in the construction and operation of chips. The “free lunch” principle (that is, there is none) tells us that every operation has its cost. Even the quick electronic thinking of a microprocessor takes a toll. The thinking involves the switching of the state of tiny transistors, and each state change consumes a bit of electrical power, which gets converted to heat. The transistors are so small that the process generates a minuscule amount of heat, but with millions of them in a single chip, the heat adds up. Modern microprocessors generate so much heat that keeping them cool is a major concern in their design.

Thermal Constraints

Heat is the enemy of the semiconductor because it can destroy the delicate crystal structure of a chip. If a chip gets too hot, it will be irrevocably destroyed. Packing circuits tightly concentrates the heat they generate, and the small size of the individual circuit components makes them more vulnerable to damage.

Heat can cause problems more subtle than simple destruction. Because the conductivity of semiconductor circuits also varies with temperature, the effective switching speed of transistors and logic gates also changes when chips get too hot or too cold. Although this temperature-induced speed change does not alter how fast a microprocessor can compute (the chip must stay locked to the system clock at all times), it can affect the relative timing between signals inside the microprocessor. Should the timing get too far off, a microprocessor might make a mistake, with the inevitable result of crashing your system. All chips have rated temperature ranges within which they are guaranteed to operate without such timing errors.

Because chips generate more heat as speed increases, they can produce heat faster than it can radiate away. This heat buildup can alter the timing of the internal signals of the chip so drastically that the microprocessor will stop working and—as if you couldn’t guess—cause your system to crash. To avoid such problems, computer manufacturers often attach heatsinks to microprocessors and other semiconductor components to aid in their cooling.

A heatsink is simply a metal extrusion that increases the surface area from which heat can radiate from a microprocessor or other heat-generating circuit element. Most heatsinks have several fins (or rows of pins) or some other geometry that increases their surface area. Heatsinks are usually made from aluminum because that metal is one of the better thermal conductors, enabling the heat from the microprocessor to quickly spread across the heatsink.

Heatsinks provide passive cooling (passive because cooling requires no power-using mechanism). Heatsinks work by convection, transferring heat to the air that circulates past the heatsink. Air circulates around the heatsink because the warmed air rises away from the heatsink and cooler air flows in to replace it.

In contrast, active cooling involves some kind of mechanical or electrical assistance to remove heat. The most common form of active cooling is a fan, which blows a greater volume of air past the heatsink than would be possible with convection alone. Nearly all modern microprocessors require a fan for active cooling, typically built into the chip’s heatsink.

The makers of notebook computers face another challenge in efficiently managing the cooling of their computers. Using a fan to cool a notebook system is problematic. The fan consumes substantial energy, which trims battery life. Moreover, the heat generated by the fan motor itself can be a significant part of the thermal load of the system. Most designers of notebook machines have turned to more innovative passive thermal controls, such as heat pipes and using the entire chassis of the computer as a heatsink.

Operating Voltages

In desktop computers, overheating rather than excess electrical consumption is the major power concern. Even the most wasteful of microprocessors uses far less power than an ordinary light bulb. The most that any computer-compatible microprocessor consumes is about nine watts, hardly more than a night light and of little concern when the power grid supplying your computer has megawatts at its disposal.

If you switch to battery power, however, every last milliwatt is important. The more power used by a computer, the shorter the time its battery can power the system or the heavier the battery it will need to achieve a given life between charges. Every degree a microprocessor raises its case temperature clips minutes from its battery runtime.

Battery-powered notebooks and sub-notebook computers consequently caused microprocessor engineers to do a quick about-face. Whereas once they were content to use bigger and bigger heatsinks, fans, and refrigerators to keep their chips cool, today they focus on reducing temperatures and wasted power at the source.

One way to cut power requirements is to make the design elements of a chip smaller. Smaller digital circuits require less power. But shrinking chips is not an option; microprocessors are invariably designed to be as small as possible with the prevailing technology.

To further trim the power required by microprocessors to make them more amenable to battery operation, engineers have come up with two new design twists: low-voltage operation and system-management mode. Although founded on separate ideas, both are often used together to minimize microprocessor power consumption. All new microprocessor designs incorporate both technologies.

Since the very beginning of the transistor-transistor logic family of digital circuits (the design technology that later blossomed into the microprocessor), digital logic has operated with a supply voltage of 5 volts. That level is essentially arbitrary. Almost any voltage would work. But 5-volt technology offers some practical advantages. It’s low enough to be both safe and frugal with power needs but high enough to avoid noise and allow for several diode drops, the inevitable reduction of voltage that occurs when a current flows across a semiconductor junction.

Every semiconductor junction, which essentially forms a diode, reduces or drops the voltage flowing through it. Silicon junctions impose a diode drop of about 0.7 volts, and there may be one or more such junctions in a logic gate. Other materials impose smaller drops—that of germanium, for example, is 0.4 volts—but the drop is unavoidable.

There’s nothing magical about using 5 volts. Reducing the voltage used by logic circuits dramatically reduces power consumption because power consumption in electrical circuits increases by the square of the voltage. That is, doubling the voltage of a circuit increases the power it uses by fourfold. Reducing the voltage by one-half reduces power consumption by three-quarters (providing, of course, that the circuit will continue to operate at the lower voltage).
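The square-law relationship can be checked with a few lines of arithmetic. This is a minimal sketch that assumes power scales exactly with the square of the supply voltage and that nothing else about the circuit changes; the function name is illustrative.

```python
def power_ratio(v_new: float, v_old: float) -> float:
    """Power at the new voltage as a fraction of power at the old voltage.

    Assumes power is proportional to the square of the supply voltage.
    """
    return (v_new / v_old) ** 2


print(power_ratio(10.0, 5.0))  # 4.0: doubling the voltage quadruples the power
print(power_ratio(2.5, 5.0))   # 0.25: halving the voltage cuts power by three-quarters
print(power_ratio(2.0, 5.0))   # about 0.16: a 2-volt design next to 5-volt logic
```

The last line suggests why the move from 5-volt logic to 2-volt designs pays off so well: under this simple model, the same circuit would draw roughly one-sixth of the power.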

All current microprocessor designs operate at about 2 volts or less. The Pentium 4 operates at just over 1.3 volts with minor variations, depending on the clock frequency of the chip. For example, the 2.53GHz version requires 1.325 volts. Microprocessors designed for mobile applications typically operate at about 1.1 volts; some as low as 0.95 volt.

To minimize power consumption, Intel sets the operating voltage of the core logic of its chips as low as possible—some, such as Intel’s ultra-low-voltage Mobile Pentium III-M, to just under 1 volt. The integral secondary caches of these chips (which are fabricated separately from the core logic) usually require their own, often higher, voltage supply. In fact, operating voltage has become so critical that Intel devotes several pins of its Pentium II and later microprocessors to encoding the voltage needs of the chip, and the host computer must adjust its supply to the chip to precisely meet those needs.

Most bus architectures and most of today’s memory modules operate at the 3.3 volt level. Future designs will push that level lower. Rambus memory systems, for example, operate at 2.5 volts (see Chapter 6, “Chipsets,” for more information).

Power Management

Trimming the power needs of a microprocessor both reduces the heat the chip generates and increases how long it can run off a battery supply, an important consideration for portable computers. Reducing the voltage and power use of the chip is one way of keeping the heat down and the battery running, but chipmakers have discovered they can save even more power through frugality. The chips use power only when they have to, thus managing their power consumption.

Chipmakers have two basic power-management strategies: shutting off circuits when they are not needed and slowing down the microprocessor when high performance is not required.

The earliest form of power-savings built into microprocessors was part of system management mode (SMM), which allowed the circuitry of the chip to be shut off. In terms of clock speed, the chip went from full speed to zero. Initially, chips switched off after a period of system inactivity and woke up to full speed when triggered by an appropriate interrupt.

More advanced systems cycled the microprocessor between on and off states as they required processing power. The chief difficulty with this design is that nothing gets done when the chip isn’t processing. This kind of power management only works when you’re not looking (not exactly a salesman’s dream) and is a benefit you should never be able to see. Intel gives this technique the name QuickStart and claims that it can save enough energy between your keystrokes to significantly reduce overall power consumption by briefly cutting the microprocessor’s electrical needs by 95 percent. Intel introduced QuickStart in the Mobile Pentium II processor, although it has not widely publicized the technology.

In the last few years, chipmakers have approached the power problem with more advanced power-saving systems that take an intermediary approach. One way is to reduce microprocessor power when it doesn’t need it for particular operations. Intel slightly reduces the voltage applied to its core logic based on the activity of the processor. Called Intel Mobile Voltage Positioning (IMVP), this technology can reduce the thermal design power—which means the heat produced by the microprocessor—by about 8.5 percent. According to Intel, this reduction is equivalent to reducing the speed of a 750MHz Mobile Pentium III by 100MHz.

Another technique for saving power is to reduce the performance of a microprocessor when its top speed is not required by your applications. Instead of entirely switching off the microprocessor, the chipmakers reduce its performance to trim power consumption. Each of the three current major microprocessor manufacturers puts its own spin on this performance-as-needed technology, labeling it with a clever trademark. Intel offers SpeedStep, AMD offers PowerNow!, and Transmeta offers LongRun. Although at heart all three are conceptually much the same, in operation you’ll find distinct differences between them.

Intel SpeedStep

Internal mobile microprocessor power savings started with SpeedStep, introduced by Intel on January 18, 2000, with the Mobile Pentium III microprocessors, operating at 600MHz and 650MHz. To save power, these chips can be configured to reduce their operating speed when running on battery power to 500MHz. All Mobile Pentium III and Mobile Pentium 4 chips since that date have incorporated SpeedStep into their designs. Mobile Celeron processors do not use SpeedStep.

The triggering event is a reduction of power to the chip. For example, the initial Mobile Pentium III chips go from the 1.6 volts required for operating at their top speeds to 1.35 volts. As noted earlier, the M-series step down from 1.4 to 1.15 volts, the low-voltage M-series from 1.35 to 1.1 volts, and the ultra-low-voltage chips from 1.1 to 0.975 volts. Note that this reduction in voltage of roughly 15 percent in itself reduces power consumption by about 29 percent, with a further reduction that’s proportional to the speed decrease. The 600MHz Pentium III, for example, cuts its power consumption an additional 17 percent thanks to the speed reduction when slipping down from 600MHz to 500MHz.
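The two effects described above, a saving that follows the square of the voltage and a further saving proportional to clock speed, can be combined in a short calculation. The sketch below assumes power is proportional to voltage squared times frequency, a common simplification rather than an Intel formula. It uses the M-series voltage pairs and the 600MHz-to-500MHz step quoted in the text; the names are illustrative.

```python
def relative_power(v_old: float, v_new: float,
                   f_old: float = 1.0, f_new: float = 1.0) -> float:
    """Power at the new operating point as a fraction of the old one.

    Assumes power is proportional to (voltage squared) x (clock frequency).
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


def saving_percent(fraction: float) -> float:
    """Convert a remaining-power fraction into a percentage saved."""
    return 100.0 * (1.0 - fraction)


# Voltage pairs (full speed, battery mode) as quoted in the text.
VOLTAGE_STEPS = {
    "M-series": (1.4, 1.15),
    "low-voltage M-series": (1.35, 1.1),
    "ultra-low-voltage": (1.1, 0.975),
}

for name, (v_full, v_battery) in VOLTAGE_STEPS.items():
    saved = saving_percent(relative_power(v_full, v_battery))
    print(f"{name}: voltage step alone saves about {saved:.1f}%")

# Speed step alone, 600MHz down to 500MHz, with the voltage held constant.
speed_saved = saving_percent(relative_power(1.0, 1.0, 600.0, 500.0))
print(f"600MHz to 500MHz alone saves about {speed_saved:.1f}%")
```

Because the two factors multiply, the combined saving is somewhat less than the sum of the two percentages.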

Intel calls the two modes Maximum Performance Mode (for high speed) and Battery Optimized Mode (for low speed). According to Intel, switching between speeds requires about one two-thousandth of a second. The M-series of Mobile Pentium III adds an additional step to provide an intermediary level of performance when operating on battery. Intel calls this technology Enhanced SpeedStep. Table 5.1 lists the SpeedStep capabilities of many Intel chips.

AMD PowerNow!

Advanced Micro Devices devised its own power-saving technology, called PowerNow!, for its mobile processors. The AMD technology differs from Intel’s SpeedStep by providing up to 32 levels of speed reduction and power savings. Note that 32 levels is the design limit. Actual implementations of the technology from AMD have far fewer levels. All current AMD mobile processors—both the Mobile Athlon and Mobile Duron lines—use PowerNow! technology.

PowerNow! operates in one of three modes:

Hi-Performance. This mode runs the microprocessor at full speed and full voltage so that the chip maximizes its processing power.

Battery Saver. This mode runs the chip at a lower speed using a lower voltage to conserve battery power, exactly as a SpeedStep chip would, but with multiple levels. The speed and voltage are determined by the chip’s requirements and programming of the BIOS (which triggers the change).

Automatic. This mode makes the changes in voltage and clock speed dynamic, responding to the needs of the system. When an application requires maximum processing power, the chip runs at full speed. As the need for processing declines, the chip adjusts its performance to match. Current implementations allow for four discrete speeds at various operating voltages. Automatic mode is the best compromise for normal operation of portable computers.

The actual means of varying the clock frequency involves dynamic control of the clock multiplier inside the AMD chip. The external oscillator or clock frequency the computer supplies the microprocessor does not change, regardless of the performance demand. In the case of the initial chip to use PowerNow! (the Mobile K6-2+ chip, operating at 550MHz), its actual operating speed could vary from 200MHz to 550MHz in a system that takes full advantage of the technology.

Control of PowerNow! starts with the operating system (which is almost always Windows). Windows monitors processor usage, and when it dips below a predetermined level, such as 50 percent, Windows signals the PowerNow! system to cut back the clock multiplier inside the microprocessor and then signals a programmable voltage regulator to trim the voltage going to the chip. Note that even with PowerNow!, the chip’s supply voltage must be adjusted externally to the chip to achieve the greatest power savings.

If the operating system detects that the available processing power is still underused, it signals to cut back another step. Similarly, should processing needs reach above a predetermined level (say, 90 percent of the available ability), the operating system signals PowerNow! to kick up performance (and voltage) by a notch.
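The step-down and step-up behavior described above amounts to a simple control loop. The following sketch is only an illustration of that logic, not AMD's or Microsoft's implementation. The 50 and 90 percent thresholds come from the text; the four operating points are hypothetical, with speeds chosen to span the 200MHz-to-550MHz range quoted for the Mobile K6-2+ and voltages invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    mhz: int
    volts: float


# Hypothetical operating points, slowest first (voltages are made up).
OPERATING_POINTS = (
    OperatingPoint(200, 1.4),
    OperatingPoint(300, 1.5),
    OperatingPoint(400, 1.7),
    OperatingPoint(550, 2.0),
)

LOW_WATERMARK = 0.50   # below this utilization, step down one notch
HIGH_WATERMARK = 0.90  # above this utilization, step up one notch


def next_level(level: int, utilization: float) -> int:
    """Pick the next operating-point index from the current one and CPU utilization."""
    if utilization < LOW_WATERMARK and level > 0:
        return level - 1
    if utilization > HIGH_WATERMARK and level < len(OPERATING_POINTS) - 1:
        return level + 1
    return level


def simulate(samples, level=len(OPERATING_POINTS) - 1):
    """Return the clock speed chosen after each utilization sample, starting at full speed."""
    trace = []
    for utilization in samples:
        level = next_level(level, utilization)
        trace.append(OPERATING_POINTS[level].mhz)
    return trace


print(simulate([0.30, 0.20, 0.70, 0.95, 0.95]))  # [400, 300, 300, 400, 550]
```

Moving only one notch per sample keeps the chip from swinging between extremes on a brief spike or lull. A real system would also have to sequence the voltage change and the multiplier change safely, which this sketch ignores.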

Transmeta LongRun

Transmeta calls its proprietary power-saving technology LongRun. It is a feature of both current Crusoe processors, the TM5500 and TM5800. In concept, LongRun is much like AMD’s PowerNow! The chief difference is control. Because of the design of Transmeta’s Crusoe processors, Windows instructions are irrelevant to their power usage—Crusoe chips translate the Windows instructions into their own format. To gauge its power needs, the Crusoe chip monitors the flow of its own native instructions and adjusts its speed to match the processing needs of that code stream. In other words, the Crusoe chip does its own monitoring and decision-making regarding power savings without regard to Windows power-conservation information.

According to Transmeta, LongRun allows its microprocessors to adjust their power consumption by changing their clock frequency on the fly, just as PowerNow! does, as well as to adjust their operating voltage. The processor core steps down processor speed in 33MHz increments, and each step holds the potential of reducing chip voltage. For example, trimming the speed of a chip from 667MHz to 633MHz also allows for reducing the operating voltage from 1.65 to 1.60 volts.

Packaging

The working part of a microprocessor is exactly what the nickname “chip” implies: a small flake of a silicon crystal no larger than a postage stamp. Although silicon is a fairly robust material with moderate physical strength, it is sensitive to chemical contamination. After all, semiconductors are grown in precisely controlled atmospheres, the chemical content of which affects the operating properties of the final chip. To prevent oxygen and contaminants in the atmosphere from adversely affecting the precision-engineered silicon, the chip itself must be sealed away. The first semiconductors, transistors, were hermetically sealed in tiny metal cans.

The art and science of semiconductor packaging has advanced since those early days. Modern integrated circuits (ICs) are often surrounded in epoxy plastic, an inexpensive material that can be easily molded to the proper shape. Unfortunately, microprocessors can get very hot, sometimes too hot for plastics to safely contain. Most powerful modern microprocessors are consequently cased in ceramic materials that are fused together at high temperatures. Older, cooler chips reside in plastic. The most recent trend in chip packaging is the development of inexpensive tape-based packages optimized for automated assembly of circuit boards.

The most primitive of microprocessors (that is, those of the early generation that had neither substantial signal nor power requirements) fit in the same style of housing popular for other integrated circuits: the infamous dual in-line package (DIP). The packages grew more pins (or legs, as engineers sometimes call them) to accommodate the ever-increasing number of signals in data and address buses.

The DIP package is far from ideal for a number of reasons. Adding more connections, for example, makes for an ungainly chip. A centipede microprocessor would be a beast measuring a full five inches long. Not only would such a critter be hard to fit onto a reasonably sized circuit board, it would require that signals travel substantially farther to reach the end pins than those in the center. At modern operating frequencies, that difference in distance can amount to a substantial fraction of a clock cycle, potentially putting the pins out of sync.

Modern chip packages are compact squares that avoid these problems. Engineers developed several separate styles to accommodate the needs of the latest microprocessors.

The most common is the pin grid array (PGA), a square package that varies in size with the number of pins that it must accommodate (typically about two inches square). The first PGA chips had 68 pins. Pentium 4 chips in PGA packages have up to 478.

No matter their number, the pins are spaced as if they were laid out on a checkerboard, making the “grid array” of the package name (see Figure 5.1).

To fit the larger number of pins used by wider-bus microprocessors into a reasonable space, Intel rearranged the pins of some processors (notably the Pentium Pro), staggering them so that they can fit closer together. The result is a staggered pin grid array (SPGA) package, as shown in Figure 5.2.

Pins take up space and add to the cost of fabrication, so chipmakers have developed a number of pinless packages. The first of these to find general use was the Leadless Chip Carrier (LCC) socket. Instead of pins, this style of package has contact pads on one of its surfaces. The pads are plated with gold to avoid corrosion or oxidation that would impede the flow of the minute electrical signals used by the chip (see Figure 5.3). The pads are designed to contact special springy mating contacts in a special socket. Once installed, the chip itself may be hidden in the socket, under a heat sink, or perhaps only the top of the chip may be visible, framed by the four sides of the socket.

A related design, the Plastic Leaded Chip Carrier (PLCC), substitutes epoxy plastic for the ceramic materials ordinarily used for encasing chips. Plastic is less expensive and easier to work with. Some microprocessors with low thermal output use a housing designed to be soldered down: the Plastic Quad Flat Package (PQFP), sometimes called simply the quad flat pack because the chips are flat (they fit flat against the circuit board) and have four sides (making them quadrilaterals, as shown in Figure 5.4).

The Tape Carrier Package takes the advantage of the quad flat pack a step further, reducing the chip to what looks like a pregnant bulge in the middle of a piece of photographic film (see Figure 5.5).

Another way to deal with the problem of pins is to reduce them to vestigial bumps, substituting precision-formed globs of solder that can mate with socket contacts. Alternatively, the globs can be soldered directly to a circuit board using surface-mount technology. Because the solder contacts start out as tiny balls but use a variation on the PGA layout, the package is termed a solder-ball grid array. (Note that solder is often omitted from the name, thus yielding the abbreviation BGA.)

When Intel’s engineers first decided to add secondary caches to the company’s microprocessors, they used a separately housed slice of silicon for the cache. Initially Intel put the CPU and cache chips in separate chambers in one big, black chip. The design, called the Multi-Cavity Module (MCM), was used only for the Pentium Pro chip.

Next, Intel shifted to putting the CPU and cache on a small circuit board inside a cartridge, initially called the Single Edge Contact cartridge or SEC cartridge (which Intel often abbreviates SECC) when it was used for the Pentium II chip. Figure 5.6 shows the Pentium II microprocessor SEC cartridge.

Intel used a functionally similar but physically different design for its Pentium II Xeon chips and a similar cartridge but slightly different bus design for the Pentium III.

To cut the cost of the cartridge for the inexpensive Celeron line, Intel eliminated the case around the chip to make the Single-Edge Processor (SEP) package (see Figure 5.7).

When Intel developed the capability to put the CPU and secondary cache on a single piece of silicon, called a die, the need for cartridges disappeared. Later versions of both the Celeron and the Pentium III had on-die caches and were packaged both as cartridges and as individual chips in PGA and similar packages. With the Pentium 4, the circle was complete. Intel offers the latest Pentiums only in compact chip-style packages.

The package that the chip is housed in has no effect on its performance. It can, however, be important when you want to replace or upgrade your microprocessor with a new chip or upgrade card. Many of these enhancement products require that you replace your system’s microprocessor with a new chip or adapter cable that links to a circuit board. If you want the upgrade or a replacement part to fit on your motherboard, you may have to specify which package your computer uses for its microprocessor.

Ordinarily you don’t have to deal with microprocessor sockets unless you’re curious and want to pull out the chip, hold it in your hand, and watch a static discharge turn a $300 circuit into epoxy-encapsulated sand. Choose to upgrade your computer to a new and better microprocessor, and you’ll tangle with the details of socketry, particularly if you want to improve your Pentium.

Intel recognizes nine different microprocessor sockets for its processors, from the 486 to the Pentium Pro. In 1999, it added a new socket for some incarnations of the Pentium II Celeron. Other Pentium II and Pentium III chips, packaged as modules or cartridges, mate with slots instead of sockets. Table 5.2 summarizes these socket types, the chips that use them, and the upgrades appropriate to them.

Performance Enhancing Architectures

Functionally, the first microprocessors operated a lot like meat grinders. You put something in such as meat scraps, turned a crank, and something new and wonderful came out—a sausage. Microprocessors started with data and instructions and yielded answers, but operationally they were as simple and direct as turning a crank. Every operation carried out by the microprocessor clicked with a turn of the crank—one clock cycle, one operation.

Such a design is straightforward and almost elegant. But its wonderful simplicity imposes a heavy constraint. The computer’s clock becomes an unforgiving jailor, locking up the performance of the microprocessor. A chip with this turn-the-crank design is locked to the clock speed and can never improve its performance beyond one operation per clock cycle. The situation is worse than that. The use of microcode almost ensures that at least some instructions will require multiple clock cycles.

One way to speed up the execution of instructions is to reduce the number of internal steps the microprocessor must take for execution. That idea was the guiding principle behind the first RISC microprocessors and what made them so interesting to chip designers. Actually, however, step reduction can take one of two forms: making the microprocessor more complex so that steps can be combined or making the instructions simpler so that fewer steps are required. Both approaches have been used successfully by microprocessor designers—the former as CISC microprocessors, the latter as RISC.

Ideally, it would seem, executing one instruction every clock cycle would be the best anyone could hope for, the ultimate design goal. With conventional microprocessor designs, that would be true. But engineers have found another way to trim the clock cycles required by each instruction—by processing more than one instruction at the same time.

Two basic approaches to processing more instructions at once are pipelining and superscalar architecture. All modern microprocessors take advantage of these technologies as well as several other architectural refinements that help them carry out more instructions for every cycle of the system clock.

Clock Speed

The operating speed of a microprocessor is usually called its clock speed, which describes the frequency at which the core logic of the chip operates. Clock speed is usually measured in megahertz (one million hertz or clock cycles per second) or gigahertz (a billion hertz). All else being equal, a higher number in megahertz means a faster microprocessor.

Faster does not necessarily mean the microprocessor will compute an answer more quickly, however. Different microprocessor designs can execute instructions more efficiently because there’s no one-to-one correspondence between instruction processing and clock speed. In fact, each new generation of microprocessor has been able to execute more instructions per clock cycle, so a new microprocessor can carry out more instructions at a given megahertz rating. At the same megahertz rating, a Pentium 4 is faster than a Pentium III. Why? Because of pipelining, superscalar architecture, and other design features.

Sometimes microprocessor-makers take advantage of this fact and claim that megahertz doesn’t matter. For example, AMD’s Athlon processors carry out more instructions per clock cycle than Intel’s Pentium III, so AMD stopped using megahertz numbers to describe its chips. Instead, it substituted model designations that hinted at the speed of a comparable Pentium chip. An Athlon XP 2200+ processes data as quickly as a Pentium 4 chip running at 2200MHz, although the Athlon chip actually operates at less than 2000MHz. With the introduction of its Itanium series of processors, Intel also asserted that megahertz doesn’t matter because Itanium chips have clock speeds substantially lower than Pentium chips.
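One way to read a model number such as 2200+ is as a statement about instructions per clock. The sketch below treats throughput as clock speed multiplied by average instructions per clock, which is a rough model at best. The 1800MHz figure is an outside assumption (the published clock of the Athlon XP 2200+, not stated in the text), and the function name is illustrative.

```python
def relative_throughput(clock_mhz: float, instructions_per_clock: float) -> float:
    """Instructions completed per microsecond under a simple clock x IPC model."""
    return clock_mhz * instructions_per_clock


ATHLON_CLOCK_MHZ = 1800.0   # assumption: actual clock of the Athlon XP 2200+
PENTIUM4_CLOCK_MHZ = 2200.0 # the speed the model number points to

# How much more work per clock the Athlon must do to match the faster-clocked chip.
implied_ipc_advantage = PENTIUM4_CLOCK_MHZ / ATHLON_CLOCK_MHZ
print(f"Implied per-clock advantage: {implied_ipc_advantage:.2f}x")  # 1.22x

# With that advantage, the two chips come out even in this model.
athlon = relative_throughput(ATHLON_CLOCK_MHZ, implied_ipc_advantage)
pentium4 = relative_throughput(PENTIUM4_CLOCK_MHZ, 1.0)
print(abs(athlon - pentium4) < 1e-6)  # True
```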

A further complication is software overhead. Microprocessor speed doesn’t affect the performance of Windows or its applications very much. That’s because the performance of Windows depends on the speed of your hard disk, video system, memory system, and other system resources as well as your microprocessor. Although a Windows system using a 2GHz processor will appear faster than a system with a 1GHz processor, it won’t be anywhere near twice as fast.

In other words, the megahertz rating of a microprocessor gives only rough guidance in comparing microprocessor performance in real-world applications. Faster is better, but a comparison of megahertz (or gigahertz) numbers does not necessarily express the relationship between the performance of two chips or computer systems.
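The point that megahertz alone does not settle the matter can be put into rough numbers. The sketch below is illustrative only: the function name and the instructions-per-cycle figures are assumptions invented for the example, not measurements of any real chip.

```python
def instructions_per_second(clock_hz: float, instructions_per_cycle: float) -> float:
    """Rough throughput: clock rate times average instructions completed per cycle."""
    return clock_hz * instructions_per_cycle


# Hypothetical chips; the per-cycle figures are assumptions chosen for illustration.
older_chip = instructions_per_second(1_000_000_000, 0.8)    # 1GHz, older design
newer_chip = instructions_per_second(1_000_000_000, 1.5)    # same clock, newer design
slower_clock = instructions_per_second(800_000_000, 1.5)    # newer design, lower clock

print(newer_chip > older_chip)      # same megahertz, more work done
print(slower_clock > older_chip)    # fewer megahertz can still come out ahead
```

Real-world performance also depends on the disk, video, and memory systems described above, which this simple product ignores.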

Pipelining

In older microprocessor designs, a chip works single-mindedly. It reads an instruction from memory, carries it out, step by step, and then advances to the next instruction. Each step requires at least one tick of the microprocessor’s clock. Pipelining enables a microprocessor to read an instruction, start to process it, and then, before finishing with the first instruction, read another instruction. Because every instruction requires several steps, each in a different part of the chip, several instructions can be worked on at once and passed along through the chip like a bucket brigade (or its more efficient alternative, the pipeline). Intel’s Pentium chips, for example, have four levels of pipelining. Up to four different instructions may be undergoing different phases of execution at the same time inside the chip. When operating at its best, pipelining reduces the multiple-step/multiple-clock-cycle processing of an instruction to a single clock cycle.
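The savings are easy to estimate. The following sketch assumes an idealized pipeline in which every step takes exactly one clock cycle and nothing ever stalls; the function names are made up for the illustration.

```python
def cycles_unpipelined(num_instructions: int, steps_per_instruction: int) -> int:
    """Each instruction finishes all of its steps before the next one starts."""
    return num_instructions * steps_per_instruction


def cycles_pipelined(num_instructions: int, stages: int) -> int:
    """The first instruction needs `stages` cycles to fill the pipeline;
    after that, one instruction completes on every cycle."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


# A four-stage pipeline running 1,000 instructions.
print(cycles_unpipelined(1000, 4))   # 4000
print(cycles_pipelined(1000, 4))     # 1003
```

Over a long run of instructions, the pipelined count approaches one cycle per instruction, which is the best case the text describes.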

Pipelining is very powerful, but it is also demanding. The pipeline must be carefully organized, and the parallel paths kept carefully in step. It’s sort of like a chorus singing a canon such as Frère Jacques: one missed beat and the harmony falls apart. If one of the execution stages is delayed, all the rest are delayed as well. The demands of pipelining push microprocessor designers to make all instructions execute in the same number of clock cycles. That way, keeping the pipeline in step is easier.

In general, the more stages to a pipeline, the greater acceleration it can offer. Intel has added superlatives to the pipeline name to convey the enhancement. Super-pipelining is Intel’s term for breaking the basic pipeline stages into several steps, resulting in a 12-stage design used for its Pentium Pro through Pentium III chips. Later, Intel further sliced the stages to create the current Pentium 4 chip with 20 stages, a design Intel calls hyper-pipelining.

Real-world programs conspire against lengthy pipelines, however. Nearly all programs branch. That is, their execution can take alternate paths down different instruction streams, depending on the results of calculations and decision-making. A pipeline can load up with instructions of one program branch before it discovers that another branch is the one the program is supposed to follow. In that case, the entire contents of the pipeline must be dumped and the whole thing loaded up again. The result is a lot of logical wheel-spinning and wasted time. The bigger the pipeline, the more time that’s wasted. The waste resulting from branching begins to outweigh the benefits of bigger pipelines in the vicinity of five stages.
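A back-of-the-envelope model shows why longer pipelines suffer more. This sketch assumes that every time the pipeline has loaded the wrong branch, refilling it costs one cycle for each stage except the last. That penalty, the function name, and the sample fractions are all assumptions made for the illustration, not figures for any particular chip.

```python
def average_cycles_per_instruction(stages: int,
                                   branch_fraction: float,
                                   wrong_path_fraction: float) -> float:
    """One cycle per instruction in the ideal case, plus the average cost
    of dumping and refilling the pipeline after loading the wrong branch."""
    flush_penalty = stages - 1   # assumed cost, in cycles, of one refill
    return 1.0 + branch_fraction * wrong_path_fraction * flush_penalty


# Assume one instruction in five is a branch and the pipeline loads
# the wrong path half the time.
short_pipeline = average_cycles_per_instruction(5, 0.2, 0.5)
long_pipeline = average_cycles_per_instruction(20, 0.2, 0.5)
print(short_pipeline, long_pipeline)
```

Under these assumptions, the 20-stage design spends noticeably more cycles per instruction than the five-stage design, even though both are ideal when no branch goes wrong.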

Branch Prediction

Today’s most powerful microprocessors are adopting a technology called branch prediction logic to deal with this problem. The microprocessor makes its best guess at which branch a program will take as it is filling up the pipeline. It then executes these most likely instructions. Because the chip is guessing at what to do, this technology is sometimes called speculative execution.

When the microprocessor’s guesses turn out to be correct, the chip benefits from the multiple-pipeline stages and is able to run through more instructions than clock cycles. When the chip’s guess turns out wrong, however, it must discard the results obtained under speculation and execute the correct code. The chip marks the data in later pipeline stages as invalid and discards it. Although the chip doesn’t lose time—the program would have executed in the same order anyway—it does lose the extra boost bequeathed by the pipeline.
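How a chip makes its guess is not spelled out here. One common textbook scheme is the two-bit saturating counter, sketched below. It is offered only as an example of the idea and is not claimed to be the method used by any chip mentioned in this chapter; the class and function names are invented for the illustration.

```python
class TwoBitPredictor:
    """Textbook two-bit saturating counter for a single branch.
    Counter values 0 and 1 predict 'not taken'; 2 and 3 predict 'taken'."""

    def __init__(self) -> None:
        self.counter = 2   # start out weakly predicting 'taken'

    def predict(self) -> bool:
        return self.counter >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the actual outcome, without leaving 0..3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_correct(outcomes) -> int:
    """Run one predictor over a sequence of actual branch outcomes."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# A loop branch that is taken nine times and then falls through once.
loop_branch = [True] * 9 + [False]
print(count_correct(loop_branch))   # 9
```

The predictor guesses wrong only on the final pass through the loop, so the pipeline stays full for the other nine.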

Speculative Execution

To further increase performance, more modern microprocessors use speculative execution. That is, the chip may carry out an instruction in a predicted branch before it confirms whether it has properly predicted the branch. If the chip’s prediction is correct, the instruction has already been executed, so the chip wastes no time. If the prediction was incorrect, the chip will have to execute a different instruction, which it would have to have done anyhow, so it suffers no penalty.

Superscalar Architectures

The steps in a program normally are listed sequentially, but they don’t always need to be carried out exactly in order. Just as tough problems can be broken into easier pieces, program code can be divided as well. If, for example, you want to know the larger of two rooms, you have to compute the volume of each and then make your comparison. If you had two brains, you could compute the two volumes simultaneously. A superscalar microprocessor design does essentially that. By providing two or more execution paths for programs, it can process two or more program parts simultaneously. Of course, the chip needs enough innate intelligence to determine which problems can be split up and how to do it. The Pentium, for example, has two parallel, pipelined execution paths.

The first superscalar computer design was the Control Data Corporation 6600 mainframe, introduced in 1964. Designed specifically for intense scientific applications, the initial 6600 machines were built from eight functional units and were the fastest computers in the world at the time of their introduction.

Superscalar architecture gets its name because it goes beyond the incremental increase in speed made possible by scaling down microprocessor technology. An improvement to the scale of a microprocessor design would reduce the size of the microcircuitry on the silicon chip. The size reduction shortens the distance signals must travel and lowers the amount of heat generated by the circuit (because the elements are smaller and need less current to effect changes). Some microprocessor designs lend themselves to scaling down. Superscalar designs get a more substantial performance increase by incorporating a more dramatic change in circuit complexity.

Cycle-saving techniques such as pipelining and superscalar architecture have dramatically cut the number of clock cycles required for the execution of a typical microprocessor instruction. Early microprocessors needed, on average, several cycles for each instruction. Today’s chips can often carry out multiple instructions in a single clock cycle. Engineers describe pipelined, superscalar chips by the number of instructions they can retire per clock cycle. They look at the number of instructions that are completed, because this best describes how much work the chip has actually accomplished (or can accomplish).

Out-of-Order Execution

No matter how well the logic of a superscalar microprocessor divides up a program, each pipeline is unlikely to get an equal share of the work. One or another pipeline will grind away while another finishes in an instant. Certainly the chip logic can shove another instruction down the free pipeline (if another instruction is ready). But if the next instruction depends on the results of the one before it, and that instruction is the one stuck grinding away in the other pipeline, the free pipeline stalls. It is available for work but can do no work, thus potential processor power gets wasted.

Like a good Type-A employee, a microprocessor can always look for something to do. It can check the program for the next instruction that doesn’t depend on previous work that’s not finished and work on the new instruction. This sort of ambitious approach to programs is termed out-of-order execution, and it helps microprocessors take full advantage of superscalar designs.
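The search for an instruction that is ready to run can be sketched in a few lines. This is a deliberately simplified model: the Instruction record, the register names, and the next_ready function are hypothetical, and the sketch looks only at whether an instruction’s inputs are still being computed. It ignores conflicts over destination registers.

```python
from typing import NamedTuple, Optional, Sequence, Set, Tuple


class Instruction(NamedTuple):
    name: str
    dest: str                  # register the instruction writes
    sources: Tuple[str, ...]   # registers the instruction reads


def next_ready(waiting: Sequence[Instruction],
               pending_registers: Set[str]) -> Optional[Instruction]:
    """Return the first waiting instruction, in program order, whose
    inputs do not depend on a result that is still being computed."""
    for instruction in waiting:
        if not any(src in pending_registers for src in instruction.sources):
            return instruction
    return None   # everything is stalled behind unfinished work


# A slow load into r1 is still grinding away in the other pipeline.
add = Instruction("add", dest="r2", sources=("r1", "r3"))   # needs r1
mul = Instruction("mul", dest="r4", sources=("r5", "r6"))   # independent
print(next_ready([add, mul], {"r1"}).name)   # mul
```

Here the free pipeline skips the stalled add and picks up the independent multiply instead of sitting idle.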

This sort of ambitious microprocessor faces a problem, however. It is no longer running the program in the order it was written, and the results might be other than the programmer had intended. Consequently, microprocessors capable of out-of-order execution don’t immediately post the results from their processing into their registers. The work gets carried out invisibly and the results of the instructions that are processed out of order are held in a buffer until the chip has finished the processing of all the previous instructions. The chip puts the results back into the proper order, checking to be sure that the out-of-order execution has not caused any anomalies, before posting the results to its registers. To the program and the rest of the outside world, the results appear in the microprocessor’s registers as if they had been processed in normal order, only faster.
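The rule for posting results can be modeled very simply. In this sketch, which uses made-up instruction names and a hypothetical retire_in_order function, a result may be posted only when every earlier instruction in the program has also finished.

```python
from typing import List, Sequence, Set


def retire_in_order(program_order: Sequence[str], finished: Set[str]) -> List[str]:
    """Return the instructions whose results may be posted to the registers:
    the longest run of finished instructions from the start of the program."""
    retired = []
    for name in program_order:
        if name not in finished:
            break              # an earlier instruction is still in progress
        retired.append(name)
    return retired


program = ["load", "add", "mul"]
print(retire_in_order(program, {"mul"}))           # []
print(retire_in_order(program, {"load", "mul"}))   # ['load']
```

The multiply finished first, but its result waits in the buffer until the load and the add ahead of it are done.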

Register Renaming

Out-of-order execution often runs into its own problems. Two independently executable instructions may refer to or change the same register. In the original program, one would carry out its operation, then the other would do its work later. During superscalar out-of-order execution, the two instructions may want to work on the register simultaneously. Because that conflict would inevitably lead to confusing results and errors, an ordinary superscalar microprocessor would have to ensure the two instructions referencing the same register executed sequentially instead of in parallel, thus eliminating the advantage of its superscalar design.

To avoid such problems, advanced microprocessors use register renaming. Instead of a small number of registers with fixed names, they use a larger bank of registers that can be named dynamically. The circuitry in each chip converts the references made by an instruction to a specific register name to point instead to its choice of physical register. In effect, the program asks for the EAX register, and the chip says, “Sure,” and gives the program a register it calls EAX. If another part of the program asks for EAX, the chip pulls out a different register and tells the program that this one is EAX, too. The program takes the microprocessor’s word for it, and the microprocessor doesn’t worry because it has several million transistors to sort things out in the end.

And it takes several million transistors because the chip must track all references to registers. It does this to ensure that when one program instruction depends on the result in a given register, it has the right register and results dished up to it.
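The bookkeeping behind renaming can be sketched as a lookup table. The RenameTable class below is a hypothetical, much-reduced model: it hands out a fresh physical register for every write and never reclaims one, which a real chip would have to do.

```python
class RenameTable:
    """Maps the register names a program uses to physical register numbers."""

    def __init__(self, num_physical: int) -> None:
        self.free = list(range(num_physical))   # unused physical registers
        self.mapping = {}                       # program name -> physical number

    def rename_dest(self, name: str) -> int:
        """An instruction writes `name`: give it a fresh physical register."""
        physical = self.free.pop(0)
        self.mapping[name] = physical
        return physical

    def rename_source(self, name: str) -> int:
        """An instruction reads `name`: use the most recent physical register."""
        return self.mapping[name]


table = RenameTable(num_physical=8)
first_eax = table.rename_dest("EAX")     # one part of the program writes EAX
first_read = table.rename_source("EAX")  # ...and then reads it back
second_eax = table.rename_dest("EAX")    # an unrelated part writes EAX, too
print(first_eax, first_read, second_eax)  # 0 0 1
```

Because the two writes land in different physical registers, the two parts of the program no longer collide and can run in parallel.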

Explicitly Parallel Instruction Computing

With Intel’s shift to 64-bit architecture for its most powerful line of microprocessors (aimed, for now, at the server market), the company introduced a new instruction set to complement the new architecture. Called Explicitly Parallel Instruction Computing (EPIC), the design follows the precepts of RISC architecture by putting the hard work into software (the compiler), while retaining the advantages of longer instructions used by SIMD and VLIW technologies.

The difference between EPIC and older Intel chips is that the compiler takes a swipe at each program and determines where parallel processes can occur. It then optimizes the program code to sort out separate streams of execution that can be routed to different microprocessor pipelines and carried out concurrently. This not only relieves the chip from working to figure out how to divide up the instruction stream, but it also allows the software to more thoroughly analyze the code rather than trying to do it on the fly.

By analyzing and dividing the instruction streams before they are submitted to the microprocessor, EPIC trims the need for, and the use of, speculative execution and branch prediction. The compiler can look ahead in the program, so it doesn’t have to speculate or predict. It knows how best to carry out a complex program.

Front-Side Bus

The instruction stream is not the only bottleneck in a modern computer. The core logic of most microprocessors operates much faster than other parts of most computers, including the memory and support circuitry. The microprocessor links to the rest of the computer through a connection called the system bus or the front-side bus. The speed at which the system bus operates sets a maximum limit on how fast the microprocessor can send data to other circuits (including memory) in the computer.

When the microprocessor needs to retrieve data or an instruction from memory, it must wait for this to come across the system bus. The slower the bus, the longer the microprocessor has to wait. More importantly, the greater the mismatch between microprocessor speed and system bus speed, the more clock cycles the microprocessor needs to wait. Applications that involve repeated calculations on large blocks of data—graphics and video in particular—are apt to require the most access to the system bus to retrieve data from memory. These applications are most likely to suffer from a slow system bus.

The first commercial microprocessors of the current generation operated their system buses at 66MHz. Through the years, manufacturers have boosted this speed and increased the speed at which the microprocessor can communicate with the rest of the computer. Chips now use clock speeds of 100MHz or 133MHz for their system buses.

With the Pentium 4, Intel added a further refinement to the system bus. Using a technology called quad-pumping, Intel forces four data bits onto each line of the system bus during every clock cycle. Each data line of a quad-pumped system bus operating at 100MHz can actually move 400Mb of data in a second. At 133MHz, each line of a quad-pumped system bus achieves a 533Mbps data rate.

The performance of the system bus is often described in its bandwidth, the number of total bytes of data that can move through the bus in one second. The data buses of all current computers are 64 bits wide—that’s 8 bytes. Multiplying the clock speed or data rate of the bus by its width yields its bandwidth. A 100MHz system bus therefore has an 800MBps (megabytes per second) bandwidth. A 400MHz bus has a 3.2GBps bandwidth.
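The arithmetic is simple enough to capture in a one-line function. The function and parameter names below are made up for the illustration; the 64-bit width and the four transfers per clock of quad-pumping come from the text.

```python
def bus_bandwidth_bytes_per_second(clock_hz: int,
                                   width_bits: int = 64,
                                   transfers_per_clock: int = 1) -> int:
    """Bandwidth = transfers per second times bytes moved per transfer."""
    return clock_hz * transfers_per_clock * width_bits // 8


# A plain 100MHz bus, 64 bits wide.
print(bus_bandwidth_bytes_per_second(100_000_000))                          # 800000000
# The same bus quad-pumped, for a 400MHz data rate.
print(bus_bandwidth_bytes_per_second(100_000_000, transfers_per_clock=4))   # 3200000000
```

The two results match the 800MBps and 3.2GBps figures given above.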

Intel usually locks the system bus speed to the clock speed of the microprocessor. This synchronous operation optimizes the transfer rate between the bus and the microprocessor. It also explains some of the odd frequencies at which some microprocessors are designed to operate (for example, 1.33GHz or 2.56GHz). Sometimes the system bus operates at a speed that’s not an even divisor of the microprocessor speed—for example, the microprocessor clock speed may be 4.5, 5, or 5.5 times the system bus speed. Such a mismatch can slow down system performance, although such mismatches are minimized by effective microprocessor caching (see the upcoming section “Caching” for more information).

Translation Look-Aside Buffers

Modern pipelined, superscalar microprocessors need to access memory quickly, and they often repeatedly go to the same address in the execution of a program. To speed up such operations, most newer microprocessors include a quick lookup list of the pages in memory that the chip has addressed most recently. This list is termed a translation look-aside buffer. It is a small block of fast memory inside the microprocessor that stores a table cross-referencing the virtual addresses in programs with the corresponding real addresses in physical memory that the program has most recently used. The microprocessor can take a quick glance away from its normal address-translation pipeline, effectively “looking aside,” to fetch the addresses it needs.

The translation look-aside buffer (TLB) appears to be very small in relation to the memory of most computers. Typically, a TLB may be 64 to 256 entries. Each entry, however, refers to an entire page of memory, which with today’s Intel microprocessors, totals four kilobytes. The amount of memory that the microprocessor can quickly address by checking the TLB is the TLB address space, which is the product of the number of entries in the TLB and the page size. A 256-entry TLB can provide fast access to a megabyte of memory (256 entries times 4KB per page).
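Both the size calculation and the lookup itself can be sketched briefly. In this illustration an ordinary dictionary stands in for the TLB hardware, and the function names and sample page numbers are assumptions made for the example.

```python
PAGE_SIZE = 4 * 1024   # four-kilobyte pages, as described in the text


def tlb_address_space(entries: int, page_size: int = PAGE_SIZE) -> int:
    """Memory reachable through the TLB: number of entries times page size."""
    return entries * page_size


def translate(virtual_address: int, tlb: dict, page_size: int = PAGE_SIZE):
    """Return the physical address if the page is in the TLB, else None.
    `tlb` maps virtual page numbers to physical page numbers."""
    page, offset = divmod(virtual_address, page_size)
    physical_page = tlb.get(page)
    if physical_page is None:
        return None   # miss: the chip must do the full, slower translation
    return physical_page * page_size + offset


print(tlb_address_space(256))   # 1048576 bytes, or one megabyte

tlb = {1: 7}   # virtual page 1 currently lives in physical page 7
print(translate(0x1234, tlb) == 7 * PAGE_SIZE + 0x234)   # True
print(translate(0x5000, tlb))                            # None
```

Only the page number is translated; the offset within the page passes through unchanged.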

Caching

The most important means of matching today’s fast microprocessors to the speeds of affordable memory, which is inevitably slower, is memory caching. A memory cache interposes a block of fast memory, typically high-speed static RAM, between the microprocessor and the bulk of primary storage. A special circuit called a cache controller (which current designs make into an essential part of the microprocessor) attempts to keep the cache filled with the data or instructions that the microprocessor is most likely to need next. If the information the microprocessor requests next is held within the cache, it can be retrieved without waiting.

This fastest possible operation is called a cache hit. If the needed data is not in the cache memory, it is retrieved from outside the cache, typically from ordinary RAM at ordinary RAM speed. The result is called a cache miss.

Not all memory caches are created equal. Memory caches differ in many ways: size, logical arrangement, location, and operation.

Cache Level

Caches are sometimes described by their logical and electrical proximity to the microprocessor’s core logic. The closest physically and electrically to the microprocessor’s core logic is the primary cache, also called a Level One cache. A secondary cache (or Level Two cache) fits between the primary cache and main memory. The secondary cache usually is larger than the primary cache but operates at a lower speed (to make its larger mass of memory more affordable). Rarely is a tertiary cache (or Level Three cache) interposed between the secondary cache and memory.

In modern microprocessor designs, both the primary and secondary caches are part of the microprocessor itself. Older designs put the secondary cache in a separate part of a microprocessor module or in external memory.

Primary and secondary caches differ in the way they connect with the core logic of the microprocessor. A primary cache invariably operates at the full speed of the microprocessor’s core logic with the widest possible bit-width connection between the core logic and the cache. Secondary caches in older designs often operated at a rate slower than the chip’s core logic, although all current chips operate the secondary cache at full core speed.

Cache Size

A major factor that determines how successful the cache will be is how much information it contains. The larger the cache, the more data that is in it and the more likely any needed byte will be there when your system calls for it. Obviously, the best cache is one that’s as large as, and duplicates, the entirety of system memory. Of course, a cache that big is also absurd. You could use the cache as primary memory and forget the rest. The smallest cache would be a byte, also an absurd situation because it all but guarantees the next read is not in the cache. Chipmakers try to make caches as large as possible within the constraints of fabricating microprocessors affordably.

Today’s primary caches are typically 64 or 128KB. Secondary caches range from 128 to 512KB for chips for desktop and mobile applications and up to 2MB for server-oriented microprocessors.

Instruction and Data Caches

Modern microprocessors subdivide their primary caches into separate instruction and data caches, typically with each assigned one-half the total cache memory. This separation allows for a more efficient microprocessor design. Microprocessors handle instructions and data differently and may even send them down different pipelines. Moreover, instructions and data typically use memory differently—instructions are sequential whereas data can be completely random. Separating the two allows designers to optimize the cache design for each.

Write-Through and Write-Back Caches

Caches also differ in the way they treat writing to memory. Most caches make no attempt to speed up write operations. Instead, they push write commands through the cache immediately, writing to cache and main memory (with normal wait-state delays) at the same time. This write-through cache design is the safe approach because it guarantees that main memory and cache are constantly in agreement. Most Intel microprocessors through the current versions of the Pentium use write-through technology.

The faster alternative is the write-back cache, which allows the microprocessor to write changes to its cache memory and then immediately go back about its work. The cache controller eventually writes the changed data back to main memory as time allows.
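The difference between the two policies comes down to when main memory gets updated. The two classes below are a bare-bones sketch in which a dictionary stands in for main memory; the class names and the flush method are invented for the illustration, and reads, line sizes, and eviction are left out entirely.

```python
class WriteThroughCache:
    """Every write goes to the cache and to main memory at the same time."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory
        self.lines = {}

    def write(self, address: int, value: int) -> None:
        self.lines[address] = value
        self.memory[address] = value    # memory and cache always agree


class WriteBackCache:
    """Writes stay in the cache until the controller copies them back later."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory
        self.lines = {}
        self.dirty = set()              # addresses changed but not yet written back

    def write(self, address: int, value: int) -> None:
        self.lines[address] = value
        self.dirty.add(address)         # return at once; memory is now stale

    def flush(self) -> None:
        """Copy every changed value back to main memory, as time allows."""
        for address in sorted(self.dirty):
            self.memory[address] = self.lines[address]
        self.dirty.clear()


memory = {}
cache = WriteBackCache(memory)
cache.write(16, 5)
print(16 in memory)   # False: main memory has not seen the change yet
cache.flush()
print(memory[16])     # 5
```

The window between the write and the flush is what makes the write-back design faster, and also what makes write-through the safer of the two.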

Cache Mapping

The logical configuration of a cache involves how the memory in the cache is arranged and how it is addressed (that is, how the microprocessor determines whether needed information is available inside the cache). The major choices are direct-mapped, full-associative, and set-associative.

The direct-mapped cache divides the fast memory of the cache into small units, called lines (corresponding to the lines of storage used by Intel 32-bit microprocessors, which allow addressing in 16-byte multiples, blocks of 128 bits), each of which is identified by an index value. Main memory is divided into blocks the size of the cache, and the lines in the cache correspond to the locations within such a memory block. Each line can be drawn from a different memory block, but only from the location corresponding to the location in the cache. Which block the line is drawn from is identified by a tag. For the cache controller (the electronics that ride herd on the cache), determining whether a given byte is stored in a direct-mapped cache is easy. It just checks the tag for a given index value.

The problem with the direct-mapped cache is that if a program regularly moves between addresses with the same indexes in different blocks of memory, the cache needs to be continually refreshed—which means cache misses. Although such operation is uncommon in single-tasking systems, it can occur often during multitasking and slow down the direct-mapped cache.
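A toy model makes the problem concrete. The sketch below assumes the 16-byte lines mentioned earlier and a tiny four-line cache; the class name and the sample addresses are invented for the illustration, and only the tags are tracked, not the data.

```python
LINE_SIZE = 16   # bytes per cache line, as in the text


class DirectMappedCache:
    """Each memory line can live in exactly one cache line, chosen by its index."""

    def __init__(self, num_lines: int) -> None:
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # which memory block each line holds

    def access(self, address: int) -> bool:
        """Return True on a cache hit, False on a miss (and load the line)."""
        line_number = address // LINE_SIZE
        index = line_number % self.num_lines    # position within the cache
        tag = line_number // self.num_lines     # which block of memory it came from
        hit = self.tags[index] == tag
        if not hit:
            self.tags[index] = tag              # replace whatever was there
        return hit


cache = DirectMappedCache(num_lines=4)   # a 64-byte cache
# Addresses 0 and 64 share index 0 but come from different blocks.
print([cache.access(address) for address in (0, 64, 0, 64)])
# [False, False, False, False]
```

Each access throws out the line the next access needs, so a program that bounces between the two addresses never gets a hit.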

The opposite design approach is the full-associative cache. In this design, each line of the cache can correspond to (or be associated with) any part of main memory. Lines of bytes from diverse locations throughout main memory can be piled cheek-by-jowl in the cache. The major shortcoming of the full-associative approach is that the cache controller must check the addresses of every line in the cache to determine whether a memory request from the microprocessor is a hit or miss. The more lines there are to check, the more time it takes. A lot of checking can make cache memory respond more slowly than main memory.

A compromise between direct-mapped and full-associative caches is the set-associative cache, which essentially divides up the total cache memory into several smaller direct-mapped areas. The cache is described as the number of ways into which it is divided. A four-way set-associative cache, therefore, resembles four smaller direct-mapped caches. This arrangement overcomes the problem of moving between blocks with the same indexes. Consequently, the set-associative cache has more performance potential than a direct-mapped cache. Unfortunately, it is also more complex, making the technology more expensive to implement. Moreover, the more “ways” there are to a cache, the longer the cache controller must search to determine whether needed information is in the cache. This ultimately slows down the cache, mitigating the advantage of splitting it into sets. Most computer-makers find a four-way set-associative cache to be the optimum compromise between performance and complexity.
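The same toy access pattern shows what the extra ways buy. This sketch is a simplified model with invented names. When a set is full it throws out the least recently used line; that replacement rule is an assumption of the sketch, because the text does not say how a victim is chosen.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Each index selects a set that can hold up to `ways` lines at once."""

    def __init__(self, num_sets: int, ways: int, line_size: int = 16) -> None:
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered collection of tags per set, oldest use first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address: int) -> bool:
        """Return True on a cache hit, False on a miss (and load the line)."""
        line_number = address // self.line_size
        index = line_number % self.num_sets
        tag = line_number // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # assumed policy: evict least recently used
        lines[tag] = True
        return False


cache = SetAssociativeCache(num_sets=4, ways=2)
# Addresses 0 and 64 share an index, but now both fit in the same set.
print([cache.access(address) for address in (0, 64, 0, 64)])
# [False, False, True, True]
```

After the first two misses, both lines stay resident, so the back-and-forth pattern that defeated the direct-mapped cache now hits every time.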

Microprocessors Background

Every modern microprocessor starts with the basics—clocked-logic digital circuitry. The chip has millions of separate gates combined into three basic function blocks: the input/output unit (or I/O unit), the control unit, and the arithmetic/logic unit (ALU). The last two are sometimes jointly called the central processing unit (CPU), although the same term often is used as a synonym for the entire microprocessor. Some chipmakers further subdivide these units, give them other names, or include more than one of each in a particular microprocessor. In any case, the functions of these three units are an inherent part of any chip. The differences are mostly a matter of nomenclature, because you can understand the entire operation of any microprocessor as a product of these three functions.

All three parts of the microprocessor interact. In all but the simplest microprocessor designs, the I/O unit is under the control of the control unit, and the operation of the control unit may be determined by the results of calculations of the arithmetic/logic unit. The combination of the three parts determines the power and performance of the microprocessor.

Each part of the microprocessor also has its own effect on the processing speed of the system. The control unit operates the microprocessor’s internal clock, which determines the rate at which the chip operates. The I/O unit determines the bus width of the microprocessor, which influences how quickly data and instructions can be moved in and out of the microprocessor. And the registers in the arithmetic/logic unit determine how much data the microprocessor can operate on at one time.

Input/Output Unit

The input/output unit links the microprocessor to the rest of the circuitry of the computer, passing along program instructions and data to the registers of the control unit and arithmetic/logic unit. The I/O unit matches the signal levels and timing of the microprocessor’s internal solid-state circuitry to the requirements of the other components inside the computer. The internal circuits of a microprocessor, for example, are designed to be stingy with electricity so that they can operate faster and cooler. These delicate internal circuits cannot handle the higher currents needed to link to external components. Consequently, each signal leaving the microprocessor goes through a signal buffer in the I/O unit that boosts its current capacity.

The input/output unit can be as simple as a few buffers, or it may involve many complex functions. In the latest Intel microprocessors used in some of the most powerful computers, the I/O unit includes cache memory and clock-doubling or -tripling logic to match the high operating speed of the microprocessor to slower external memory.

The microprocessors used in computers have two kinds of external connections to their input/output units: those connections that indicate the address of memory locations to or from which the microprocessor will send or receive data or instructions, and those connections that convey the meaning of the data or instructions. The former is called the address bus of the microprocessor; the latter, the data bus.

The number of bits in the data bus of a microprocessor directly influences how quickly it can move information. The more bits that a chip can use at a time, the faster it is. The first microprocessors had data buses only four bits wide. Pentium chips use a 32-bit data bus, as do the related Athlon, Celeron, and Duron chips. Itanium and Opteron chips have 64-bit data buses.

The number of bits available on the address bus influences how much memory a microprocessor can address. A microprocessor with 16 address lines, for example, can directly work with 2^16 addresses; that’s 65,536 (or 64K) different memory locations. The different microprocessors used in various computers span a range of address bus widths from 32 to 64 or more bits.

The range of bit addresses used by a microprocessor and the physical number of address lines of the chip no longer correspond. That’s because people and microprocessors look at memory differently. Although people tend to think of memory in terms of bytes, each comprising eight bits, microprocessors now deal in larger chunks of data, corresponding to the number of bits in their data buses. For example, a Pentium chip chews into data 32 bits at a time, so it doesn’t need to look at individual bytes. It swallows them four at a time. Chipmakers consequently omit the address lines needed to distinguish chunks of memory smaller than their data buses. This bit of frugality reduces the number of connections the chip needs to make with the computer’s circuitry, an issue that becomes important once you see (as you will later) that the modern microprocessor requires several hundred external connections, each prone to failure.
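Both calculations, the number of locations an address bus can reach and the number of low-order address lines a chip can leave out, reduce to powers of two. The function names below are made up for the illustration, and the second function assumes the data bus width is a power of two.

```python
def addressable_locations(address_lines: int) -> int:
    """Each extra address line doubles the number of locations."""
    return 2 ** address_lines


def omitted_address_lines(data_bus_bits: int) -> int:
    """Address lines not needed because the chip fetches a whole
    bus-width chunk at once (assumes a power-of-two bus width)."""
    bytes_per_transfer = data_bus_bits // 8
    return bytes_per_transfer.bit_length() - 1   # log base 2 of the byte count


print(addressable_locations(16))    # 65536
print(omitted_address_lines(32))    # 2
print(omitted_address_lines(64))    # 3
```

A chip that swallows bytes four at a time never has to tell those four bytes apart, so the two lowest address lines can go.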

Control Unit

The control unit of a microprocessor is a clocked logic circuit that, as its name implies, controls the operation of the entire chip. Unlike more common integrated circuits, whose function is fixed by hardware design, the control unit is more flexible. The control unit follows the instructions contained in an external program and tells the arithmetic/logic unit what to do. The control unit receives instructions from the I/O unit, translates them into a form that can be understood by the arithmetic/logic unit, and keeps track of which step of the program is being executed.

With the increasing complexity of microprocessors, the control unit has become more sophisticated. In the basic Pentium, for example, the control unit must decide how to route signals between what amounts to two separate processing units called pipelines. In other advanced microprocessors, the function of the control unit is split among other functional blocks, such as those that specialize in evaluating and handling branches in the stream of instructions.

Arithmetic/Logic Unit

The arithmetic/logic unit handles all the decision-making operations (the mathematical computations and logic functions) performed by the microprocessor. The unit takes the instructions decoded by the control unit and either carries them out directly or executes the appropriate microcode (see the section titled “Microcode” later in this chapter) to modify the data contained in its registers. The results are passed back out of the microprocessor through the I/O unit.

The first microprocessors had but one ALU. Modern chips may have several, which commonly are classed into two types. The basic form is the integer unit, one that carries out only the simplest mathematical operations. More powerful microprocessors also include one or more floating-point units, which handle advanced math operations (such as trigonometric and transcendental functions), typically at greater precision.

Floating-Point Unit

Although functionally a floating-point unit is part of the arithmetic/logic unit, engineers often discuss it separately because the floating-point unit is designed to process only floating-point numbers and not to take care of ordinary math or logic operations.

Floating-point describes a way of expressing values, not a mathematically defined type of number such as an integer, rational, or real number. The essence of a floating-point number is that its decimal point “floats” between a predefined number of significant digits rather than being fixed in place the way dollar values always have two decimal places.

Mathematically speaking, a floating-point number has three parts: a sign, which indicates whether the number is greater or less than zero; a significand (sometimes called a mantissa), which comprises all the digits that are mathematically meaningful; and an exponent, which determines the order of magnitude of the significand (essentially the location to which the decimal point floats). Think of a floating-point number as being like those represented by scientific notation. But whereas scientists are apt to deal in base-10 (the exponents in scientific notation are powers of 10), floating-point units think of numbers digitally in base-2 (all ones and zeros in powers of two).

As a practical matter, the form of floating-point numbers used in computer calculations follows standards laid down by the Institute of Electrical and Electronics Engineers (IEEE). The widest format in common use takes values that can be represented in binary form using 80 bits. Although 80 bits seems somewhat arbitrary in a computer world that’s based on powers of two and a steady doubling of register size from 8 to 16 to 32 to 64 bits, it’s exactly the right size to accommodate 64 bits of the significand, with 15 bits left over to hold an exponent value and an extra bit for the sign of the number held in the register. Although the IEEE standard also defines 32-bit and 64-bit floating-point values, the floating-point units of Intel-architecture processors are designed to accommodate the full 80-bit values. The floating-point unit (FPU) carries out all its calculations using the full 80 bits of the chip’s registers, unlike the integer unit, which can independently manipulate its registers in byte-wide pieces.
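The 1 + 15 + 64 arithmetic can be made concrete by assembling an 80-bit value by hand. The sketch below is illustrative only. It assumes the layout of Intel's extended-precision format, in which the exponent is stored with a bias of 16383 and the top bit of the 64-bit significand is stored explicitly. The name to_extended80 is hypothetical, and the function deliberately ignores zero, infinities, and not-a-number values.

```python
import math

EXPONENT_BIAS = 16383  # assumed bias of the 15-bit exponent field

def to_extended80(value: float) -> int:
    """Encode a finite, nonzero float as one 80-bit integer (hypothetical helper)."""
    if value == 0 or math.isinf(value) or math.isnan(value):
        raise ValueError("this sketch handles only finite, nonzero values")
    sign = 1 if value < 0 else 0
    # frexp returns m and e with abs(value) == m * 2**e and 0.5 <= m < 1.
    mantissa, exp = math.frexp(abs(value))
    significand = int(math.ldexp(mantissa, 64))  # 64 bits, top bit always set
    exponent = exp - 1 + EXPONENT_BIAS           # fits in 15 bits
    # Pack the fields: 1 sign bit, then 15 exponent bits, then 64 significand bits.
    return (sign << 79) | (exponent << 64) | significand

encoded = to_extended80(1.0)
print(f"{encoded:020x}")
```

Twenty hexadecimal digits are exactly 80 bits, so the printed string shows the whole register image: four digits for the sign and exponent, followed by sixteen for the significand.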

The floating-point units of Intel-architecture processors have eight of these 80-bit registers in which to perform their calculations. Instructions in your programs tell the microprocessor whether to use its ordinary integer ALU or its floating-point unit to carry out a mathematical operation. The different instructions are important because the eight 80-bit registers in Intel floating-point units also differ from integer units in the way they are addressed. Commands for integer unit registers are directly routed to the appropriate register as if sent by a switchboard. Floating-point unit registers are arranged in a stack, sort of an elevator system. Values are pushed onto the stack, and with each new number the old one goes down one level. Stack machines are generally regarded as lean and mean computers. Their design is austere and streamlined, which helps them run more quickly. The same holds true for stack-oriented floating-point units.
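The push-down behavior of a register stack can be sketched in a few lines. This is a toy model written for illustration, not a description of the real instruction semantics: the class FloatStack and its method names are hypothetical, and only the depth of eight registers comes from the text.

```python
class FloatStack:
    """Toy model of an eight-level register stack (hypothetical names)."""

    DEPTH = 8  # the text describes eight floating-point registers

    def __init__(self) -> None:
        self._regs: list[float] = []  # the last element is the top of the stack

    def push(self, value: float) -> None:
        # Each new value goes on top; older values move down one level.
        if len(self._regs) == self.DEPTH:
            raise OverflowError("register stack is full")
        self._regs.append(value)

    def pop(self) -> float:
        if not self._regs:
            raise IndexError("register stack is empty")
        return self._regs.pop()

    def add(self) -> None:
        # Replace the top two entries with their sum.
        right, left = self.pop(), self.pop()
        self.push(left + right)

    def multiply(self) -> None:
        # Replace the top two entries with their product.
        right, left = self.pop(), self.pop()
        self.push(left * right)

# Evaluate (1.5 + 2.5) * 4.0 without ever naming a register.
stack = FloatStack()
stack.push(1.5)
stack.push(2.5)
stack.add()
stack.push(4.0)
stack.multiply()
result = stack.pop()
```

Note that the arithmetic operations never name a register. They always work on whatever sits at the top of the stack, which is the essential difference from the directly addressed integer registers.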

Until the advent of the Pentium, a floating-point unit was not a guaranteed part of a microprocessor. Some 486 and all previous chips omitted floating-point circuitry. The floating-point circuitry simply added too much to the complexity of the chip, at least for the state of fabrication technology at that time. To cut costs, chipmakers simply left the floating-point unit as an option.

When it was necessary to accelerate numeric operations, the earliest microprocessors used in computers allowed you to add an optional chip to your computer to speed the calculation of floating-point values. These external floating-point units were termed math coprocessors.

The floating-point units of modern microprocessors have evolved beyond mere number-crunching. They have been optimized to reflect the applications for which computers most often crunch floating-point numbers—graphics and multimedia (calculating dots, shapes, colors, depth, and action on your screen display).

Instruction Sets

Instructions are the basic units for telling a microprocessor what to do. Internally, the circuitry of the microprocessor has to carry out hundreds, thousands, or even millions of logic operations to carry out one instruction. The instruction, in effect, triggers a cascade of logical operations. How this cascade is controlled marks the great divide in microprocessor and computer design.

The first electronic computers used a hard-wired design. An instruction simply activated the circuits appropriate for carrying out all the steps required. This design has its advantages. It optimizes the speed of the system because the direct hard-wire connection adds nothing to slow down the system. Simplicity means speed, and the hard-wired approach is the simplest. Moreover, the hard-wired design was the practical and obvious choice. After all, computers were so new that no one had thought up any alternative.

However, the hard-wired computer design has a significant drawback. It ties the hardware and software together into a single unit. Any change in the hardware must be reflected in the software. A modification to the computer means that programs have to be modified. A new computer design may require that programs be entirely rewritten from the ground up.

Microcode

The inspiration for breaking away from the hard-wired approach was the need for flexibility in instruction sets. Throughout most of the history of computing, determining exactly what instructions should make up a machine’s instruction set was more an art than a science. IBM’s first commercial computers, the 701 and 702, were designed more from intuition than from any study of which instructions programmers would need to use. Each machine was tailored to a specific application. The 701 ran instructions thought to serve scientific users; the 702 had instructions aimed at business and commercial applications.

When IBM tried to unite its many application-specific computers into a single, more general-purpose line, these instruction sets were combined so that one machine could satisfy all needs. The result was, of course, a wide, varied, and complex set of instructions. The new machine, the IBM 360 (introduced in 1964), was unlike previous computers in that it was created not as hardware but as an architecture. IBM developed specifications and rules for how the machine would operate but enabled the actual machine to be created from any hardware implementation designers found most expedient. In other words, IBM defined the instructions that the 360 would use but not the circuitry that would carry them out. Previous computers used instructions that directly controlled the underlying hardware. To adapt the instructions defined by the architecture to the actual hardware that made up the machine, IBM adopted an idea called microcode, originally conceived by Maurice Wilkes at Cambridge University.

In the microcode design, an instruction causes a computer to execute a small program to carry out the logic instructions required by the instruction. The collection of small programs for all the instructions the computer understands is its microcode.
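The idea can be sketched as a lookup table that maps each instruction to its small program of micro-steps. Everything in this example is an assumption made for illustration: the three-instruction machine, the state dictionary, and all function names are hypothetical and do not describe any real processor's microcode.

```python
# Each micro-step is one primitive action inside the hypothetical machine.
def read_memory(state, address):
    state["data"] = state["memory"][address]   # memory to internal data latch

def add_to_accumulator(state, address):
    state["acc"] += state["data"]

def subtract_from_accumulator(state, address):
    state["acc"] -= state["data"]

def write_memory(state, address):
    state["memory"][address] = state["acc"]    # accumulator to memory

# The microcode: one small program of micro-steps per instruction.
MICROCODE = {
    "ADD": [read_memory, add_to_accumulator],
    "SUB": [read_memory, subtract_from_accumulator],
    "STORE": [write_memory],
}

def run(program, state):
    """Execute each instruction by stepping through its microprogram."""
    micro_steps = 0
    for opcode, address in program:
        for step in MICROCODE[opcode]:
            step(state, address)
            micro_steps += 1
    return micro_steps

state = {"acc": 0, "data": 0, "memory": [5, 7, 0]}
steps_taken = run([("ADD", 0), ("ADD", 1), ("STORE", 2)], state)
```

In this sketch, three instructions expand into five micro-steps. Changing how an instruction works means editing only its entry in the table, not the program that uses it.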

Although the additional layer of microcode made machines more complex, it added a great deal of design flexibility. Engineers could incorporate whatever new technologies they wanted inside the computer, yet still run the same software with the same instructions originally written for older designs. In other words, microcode enabled new hardware designs and computer systems to have backward compatibility with earlier machines.

After the introduction of the IBM 360, nearly all mainframe computers used microcode. When microprocessors came along, they followed the same design philosophy, using microcode to match instructions to hardware. Using this design, a microprocessor actually has a smaller processor inside it, sometimes called a nanoprocessor, that runs the microcode.

This microcode-and-nanoprocessor approach makes creating a complex microprocessor easier. The powerful data-processing circuitry of the chip can be designed independently of the instructions it must carry out. The manner in which the chip handles its complex instructions can be fine-tuned even after the architecture of the main circuits is laid into place. Bugs in the design can be fixed relatively quickly by altering the microcode, which is an easy operation compared to the alternative of developing a new design for the whole chip (a task that’s not trivial when millions of transistors are involved). The rich instruction set fostered by microcode also makes writing software for the microprocessor (and computers built from it) easier by reducing the number of instructions needed for each operation.

Microcode has a big disadvantage, however. It makes computers and microprocessors more complicated. In a microprocessor, the nanoprocessor must go through several of its own microcode instructions to carry out every instruction you send to the microprocessor. More steps means more processing time taken for each instruction. Extra processing time means slower operation. Engineers found that microcode had its own way to compensate for its performance penalty—complex instructions.

Using microcode, computer designers could easily give an architecture a rich repertoire of instructions that carry out elaborate functions. A single, complex instruction might do the job of half a dozen or more simpler instructions. Although each instruction would take longer to execute because of the microcode, programs would need fewer instructions overall. Moreover, adding more instructions could boost speed. One result of this microcode “more is merrier” instruction approach is that typical computer microprocessors have seven different subtraction commands.

RISC

Although long the mainstay of computer and microprocessor design, microcode is not necessary. While system architects were staying up nights concocting ever more powerful and obscure instructions, a counter force was gathering. Starting in the 1970s, the microcode approach came under attack by researchers who claimed it takes a greater toll on performance than its benefits justify.

By eliminating microcode, this design camp believed, simpler instructions could be executed at speeds so much higher that no degree of instruction complexity could compensate. By necessity, such hard-wired machines would offer only a few instructions because the complexity of their hard-wired circuitry would increase dramatically with every additional instruction added. Practical designs are best made with small instruction sets.

John Cocke at IBM’s Yorktown Research Laboratory analyzed the usage of instructions by computers and discovered that most of the work done by computers involves relatively few instructions. Given a computer with a set of 200 instructions, for example, two-thirds of its processing involves using as few as 10 of the total instructions. Cocke went on to design a computer that was based on a few instructions that could be executed quickly. He is credited with inventing the Reduced Instruction Set Computer (RISC) in 1974. The term RISC itself is credited to David Patterson, who used it in a course in microprocessor design at the University of California at Berkeley in 1980.

The first chip to bear the label and to take advantage of Cocke’s discoveries was RISC-I, a laboratory design that was completed in 1982. To distinguish this new design approach from traditional microprocessors, microcode-based systems with large instruction sets have come to be known as Complex Instruction Set Computers (CISC).

Cocke’s research showed that most of the computing was done by basic instructions, not by the more powerful, complex, and specialized instructions. Further research at Berkeley and Stanford Universities demonstrated that there were even instances in which a sequence of simple instructions could perform a complex task faster than a single complex instruction could. The result of this research is often summarized as the 80/20 Rule, meaning that about 20 percent of a computer’s instructions do about 80 percent of the work. The aim of the RISC design is to optimize a computer’s performance for that 20 percent of instructions, speeding up their execution as much as possible. The remaining 80 percent of the commands could be duplicated, when necessary, by combinations of the quick 20 percent. Analysis and practical experience have shown that the 20 percent could be made so much faster that the overhead required to emulate the remaining 80 percent was no handicap at all.

To enable a microprocessor to carry out all the required functions with a handful of instructions requires a rethinking of the programming process. Instead of simply translating human instructions into machine-readable form, the compilers used by RISC processors attempt to find the optimum instructions to use. The compiler takes a more in-depth look at the requested operations and finds the best way to handle them. The result was the creation of optimizing compilers discussed in Chapter 3, “Software.”

In effect, the RISC design shifts a lot of the processing from the microprocessor to the compiler: a lot of the work in running a program gets taken care of before the program actually runs. Of course, the compiler does more work and takes longer to run, but that’s a fair tradeoff. A program needs to be compiled only once but runs many, many times, and every run benefits from the streamlined execution.

RISC microprocessors have several distinguishing characteristics. Most instructions execute in a single clock cycle—or even faster with advanced microprocessor designs with several execution pathways. All the instructions are the same length with similar syntax. The processor itself does not use microcode; instead, the small repertory of instructions is hard-wired into the chip. RISC instructions operate only on data in the registers of the chip, not in memory, making what is called a load-store design. The design of the chip itself is relatively simple, with comparatively few logic gates that are themselves constructed from simple, almost cookie-cutter designs. And most of the hard work is shifted from the microprocessor itself to the compiler.

Micro-Ops

Both CISC and RISC have a compelling design rationale and performance, desirable enough that engineers working on one kind of chip often looked over the shoulders of those working in the other camp. As a result, they developed hybrid chips embodying elements of both the CISC and RISC design. All the latest processors—from the Pentium Pro to the Pentium 4, Athlon, and Duron as well—have RISC cores mated with complex instruction sets.

The basic technique involves converting the classic Intel instructions into RISC-style instructions to be processed by the internal chip circuitry. Intel calls the internal RISC-like instructions micro-ops. The term is often abbreviated as uops (strictly speaking, the initial u should be the Greek letter mu, which is an abbreviation for micro) and pronounced you-ops. Other companies use slightly different terminology.

By design, the micro-ops sidestep the primary shortcomings of the Intel instruction set by making the encoding of all commands more uniform, converting all instructions to the same length for processing, and eliminating arithmetic operations that directly change memory by loading memory data into registers before processing.
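The last of these changes, splitting an instruction that operates directly on memory into register-only steps, can be sketched as a simple translation function. This is a hedged illustration, not the way any actual decoder works. The tuple format, the bracket notation for memory operands, the temporary register name tmp, and the function to_micro_ops are all hypothetical.

```python
def to_micro_ops(instruction):
    """Translate one (operation, destination, source) tuple into load-store steps."""
    op, dest, src = instruction
    if dest.startswith("["):
        # Memory destination: load the value, operate in a register, store it back.
        return [("LOAD", "tmp", dest), (op, "tmp", src), ("STORE", dest, "tmp")]
    if src.startswith("["):
        # Memory source: load it into a register first, then operate.
        return [("LOAD", "tmp", src), (op, dest, "tmp")]
    # Register-to-register instructions already fit the load-store model.
    return [(op, dest, src)]

# An add that changes memory directly becomes three simpler operations.
micro_ops = to_micro_ops(("ADD", "[100]", "eax"))
```

After translation, every arithmetic operation names only registers, and memory is touched only by explicit loads and stores.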

The translation to RISC-like instructions allows the microprocessor to function internally as a RISC engine. The code conversion occurs in hardware, completely invisible to your applications and out of the control of programmers. In other words, it reverses the RISC approach of moving work into the compiler and puts that work back into the chip. There’s a good reason for this backward shift: It lets the RISC core deal with existing programs, including those compiled before the RISC designs were created.

Single Instruction, Multiple Data

In a quest to improve the performance of Intel microprocessors on common multimedia tasks, Intel’s hardware and software engineers analyzed the operations multimedia programs most often required. They then sought the most efficient way to enable their chips to carry out these operations. They essentially worked to enhance the signal-processing capabilities of their general-purpose microprocessors so that they would be competitive with dedicated processors, such as digital signal processor (DSP) chips. They called the technology they developed Single Instruction, Multiple Data (SIMD). In effect a new class of microprocessor instructions, SIMD is the enabling element of Intel’s MultiMedia Extensions (MMX) to its microprocessor command set. Intel further developed this technology to add its Streaming SIMD Extensions (SSE, once known as the Katmai New Instructions) to its Pentium III microprocessors to enhance their 3D processing power. The Pentium 4 further enhances SSE with more multimedia instructions to create what Intel calls SSE2.

As the name implies, SIMD allows one microprocessor instruction to operate across several bytes or words (or even larger blocks of data). In the MMX scheme of things, the SIMD instructions are matched to the 64-bit data buses of Intel’s Pentium and newer microprocessors. All data, whether it originates as bytes, 16-bit words, or 32-bit double-words, gets packed into 64-bit form. Eight bytes, four words, or two double-words get packed into a single 64-bit package that, in turn, gets loaded into a 64-bit register in the microprocessor. One microprocessor instruction then manipulates the entire 64-bit block.
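The packing step and a single operation over the whole package can be imitated in software. The sketch below is an illustration under stated assumptions: it packs eight bytes into one 64-bit integer and adds two such packages lane by lane, with each byte wrapping around on overflow. The function names are hypothetical, and the masking trick stands in for what dedicated hardware would do in one instruction.

```python
LOW_BITS = 0x7F7F7F7F7F7F7F7F   # the low seven bits of every byte lane
HIGH_BITS = 0x8080808080808080  # the top bit of every byte lane

def pack_bytes(values):
    """Pack eight byte values into one 64-bit integer, first value lowest."""
    packed = 0
    for lane, value in enumerate(values):
        packed |= (value & 0xFF) << (8 * lane)
    return packed

def unpack_bytes(packed):
    """Split a 64-bit integer back into its eight byte lanes."""
    return [(packed >> (8 * lane)) & 0xFF for lane in range(8)]

def packed_add_bytes(a, b):
    """Add all eight lanes at once; each lane wraps around independently."""
    # Adding only the low seven bits keeps carries from crossing into the next lane.
    partial = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is then corrected with an exclusive OR.
    return partial ^ ((a ^ b) & HIGH_BITS)

left = pack_bytes([1, 2, 3, 4, 5, 6, 7, 250])
right = pack_bytes([10, 20, 30, 40, 50, 60, 70, 10])
sums = unpack_bytes(packed_add_bytes(left, right))
```

The last lane holds 250 plus 10, which wraps around to 4 without disturbing its neighbors. One call handles eight additions, which is the point of the single-instruction, multiple-data approach.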

Although the approach at first appears counterintuitive, it improves the handling of common graphic and audio data. In video processor applications, for example, it can trim the number of microprocessor clock cycles for some operations by 50 percent or more.

Very Long Instruction Words

Just as RISC started flowing into the product mainstream, a new idea started designers thinking in the opposite direction. Very long instruction word (VLIW) technology at first appears to run against the RISC stream by using long, complex instructions. In reality, VLIW is a refinement of RISC meant to better take advantage of superscalar microprocessors. Each very long instruction word is made from several RISC instructions. In a typical implementation, eight 32-bit RISC instructions combine to make one instruction word.

Ordinarily, combining RISC instructions would add little to overall speed. As with RISC, the secret of VLIW technology is in the software—the compiler that produces the final program code. The instructions in the long word are chosen so that they execute at the same time (or as close to it as possible) in parallel processing units in the superscalar microprocessor. The compiler chooses and arranges instructions to match the needs of the superscalar processor as best as possible, essentially taking the optimizing compiler one step further. In essence, the VLIW system takes advantage of preprocessing in the compiler to make the final code and microprocessor more efficient.
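The compiler's job of choosing instructions that can run side by side can be sketched with a simple greedy packer. This is a minimal illustration, not a real scheduling algorithm: it keeps instructions in program order and starts a new long word whenever the next instruction conflicts with a register already used in the current word. The function name bundle, the register names, and the (destination, source, source) tuple format are all hypothetical.

```python
def bundle(instructions, width=8):
    """Group (dest, src1, src2) tuples into long words of independent instructions."""
    words = []
    current, written, read = [], set(), set()
    for dest, *sources in instructions:
        conflict = (
            any(src in written for src in sources)  # needs a result from this word
            or dest in written                      # two writes to one register
            or dest in read                         # would overwrite a value in use
        )
        if conflict or len(current) == width:
            # Close the current long word and start a new one.
            words.append(current)
            current, written, read = [], set(), set()
        current.append((dest, *sources))
        written.add(dest)
        read.update(sources)
    if current:
        words.append(current)
    return words

program = [
    ("r1", "r8", "r9"),
    ("r2", "r8", "r10"),
    ("r3", "r1", "r2"),   # depends on r1 and r2, so it starts a new word
    ("r4", "r9", "r10"),
    ("r5", "r3", "r4"),   # depends on r3 and r4, so it starts another
]
long_words = bundle(program)
```

Five instructions fit into three long words in this example, so a processor with enough parallel units could finish them in three steps instead of five. A production compiler would also reorder instructions to fill more slots, which this sketch does not attempt.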

VLIW technology also takes advantage of the wider bus connections of the latest generation of microprocessors. Existing chips link to their support circuitry with 64-bit buses. Many have 128-bit internal buses. The 256-bit very long instruction words push a little further yet enable a microprocessor to load several cycles of work in a single memory cycle. Transmeta’s Crusoe processor uses VLIW technology.