Next-generation CPUs for the Ubiquitous Information Society

Yasushi Akao
Tadashi Saito
Tsuneo Sato

OVERVIEW: The ubiquitous information society is one in which data networks, as exemplified by the Internet, can be accessed by anyone, anytime, anywhere, ushering in a new “WWW (Whenever, Wherever, Whoever)” era. Because the fusion of digital data is accelerating, the opportunities for applying microprocessors (CPUs) that process digital data are expanding without limit. Hitachi is responding to the needs of such an era by developing a next-generation CPU core that is equipped with a variety of capabilities. We will contribute to the realization of the ubiquitous information society with these CPUs.

INTRODUCTION
The terminals used for access in the ubiquitous information society will include not only personal computers and cell phones but also a wide variety of products such as television sets, refrigerators, microwave ovens and other household appliances, as well as automobiles, vending machines and so on, all of which will be connected to the Internet in the future. In that way, information will permeate all aspects of our lives (see Fig. 1).

When taking a bird’s-eye view of the ubiquitous information society in its entirety in this way, we can recognize the following requirements for the CPU.
(1) Optimal multimedia processing of still images, moving pictures and music for the human ear and eye.
(2) Low power consumption for battery operation.
(3) Ease of SoC (System on Chip) development, so that an SoC with a built-in CPU can be developed in time to meet “Time to Market” (good market introduction timing).

Here, we describe what a CPU suited to the ubiquitous information society should be like and how Hitachi is dealing with that challenge.

Fig. 1: The Ubiquitous Information Society Concept.
A society in which various kinds of devices can be connected to a network and information is accessible to anyone, anytime and anywhere.

WHAT IS REQUIRED OF THE CPU

For the terminals used in the ubiquitous information society, the importance of the user interface (UI) in conveying information in a way that is easily understood by people is steadily increasing. Appealing to the senses of hearing and vision is, in particular, a key to achieving such an interface. Seen from the CPU side, this means that efficient multimedia processing for pictures, video, music, voice and so on is important.

The next point is that battery operation will be standard in the ubiquitous information society, where people will be using wireless terminals at any location. Accordingly, low power consumption in required processing will be particularly important.

Also, in this kind of society, data content businesses will become one engine for industrial growth. The factors for deciding on the success of content as a product, “enjoyable,” “fast,” and “beautiful,” will be a driving force for short-term changes in product specifications. Accordingly, an LSI for use in a ubiquitous terminal that implements these functions must be easy to develop in a short period of time. That is to say, short development time for SoCs that have an incorporated CPU is a necessity.

Although there are also other requirements, we describe (1) multimedia performance (performance of middleware for multimedia processing), (2) low power consumption, and (3) ease of SoC development, all of which are requirements that have a high degree of universality.

IMPLEMENTATION TECHNOLOGY

Multimedia Performance

When multimedia processing is done by the CPU, it is generally accomplished by middleware (software) or by a combination of software and hardware such as an accelerator. From the software point of view, some processing is done with a fixed-point algorithm (audio and image processing for example) and other processing is done with a floating-point algorithm (3-D graphics processing for example). To achieve multimedia processing efficiently, at low cost and with low power consumption, there are advantages in using a general-purpose CPU that has been given DSP or FPU functions rather than a CPU together with a separate, dedicated digital signal processor (DSP) or floating-point unit (FPU).

As a means of improving the performance of a general-purpose CPU while considering low power consumption, it is necessary to improve the performance per clock cycle by introducing parallel processing rather than simply increasing the operating frequency.

Also, depending on the application product, it may be necessary to introduce an accelerator to optimize performance and power consumption. For that purpose, in addition to using a standard bus configuration, which facilitates the reuse of hardware IP (intellectual property: semiconductor design assets shared across products), it is also necessary to implement high-speed data transfer performance, which enables the transfer of large quantities of data.

Low Power Consumption

As semiconductor integration scale advances, the MOS threshold voltage is lowered, and the leak current that results is becoming too large to ignore, even in CMOS (complementary metal oxide semiconductor) circuits. It must therefore be considered in addition to the current dissipated by capacitor charging and discharging during operation. Effective countermeasures include (1) leak current reduction achieved by controlling the threshold voltage by means of substrate voltage control (when the clock is stopped) and (2) power supply cut-off to circuit blocks that are not required to operate (during operation or when the clock is stopped). These functions must be incorporated in the CPU at the design stage.

Ease of SoC Development

The requirements on a CPU for ease of SoC development include the items listed below.

(1) Software core: Design assets are in the form of RTL (register transfer level) description, which is independent of process technology. A single-phase clock and edge-triggered flip-flop are used. Synchronous standard SRAM (static random access memory) is used for the memory.

(2) Scalability of core specifications in line with SoC needs: Basic core specifications, which include attachment or removal of a DSP/FPU arithmetic unit and capacity of the cache or other internal memory, are selectable in a scalable manner.

(3) Standard bus configuration

(4) On-chip debugging function: The JTAG (Joint Test Action Group) interface and break, trace and other debugging functions are supported.

(5) Good testability: Stand-alone testing by means of a scan circuit and LBIST (logic built-in self-test) is possible.

NEXT-GENERATION CPU CORE

Improvement of Middleware Performance

A new CPU core that integrates the earlier SH3-DSP and SH-4 is now in development. An overview of that device is shown in Fig. 2. This new core makes it possible to select either the SH3-DSP series CPU, which incorporates a DSP, or the SH-4 series CPU, which incorporates an FPU, according to middleware needs. By increasing the number of pipeline stages from five to seven with the aim of raising the operating frequency by a factor of 1.5, the upper limit of performance can be pushed up, even with the same process technology.

Furthermore, adopting for the common core the superscalar architecture that was employed in the earlier SH-4 can greatly improve the middleware performance of the SH3-DSP.

For the bus system, the “SuperHyway” bus is used as the standard bus. Doing so makes it possible to secure for accelerators the bus bandwidth needed for high-speed data transfer without sacrificing CPU performance. In addition, connection with existing IP is facilitated by a bus bridge configuration (see Fig. 3).

Fig. 2: Functions and Configuration Example of Hitachi’s Next-generation CPU Core. The next-generation CPU core integrates an SH3-DSP core and an SH-4 core, and DSP, FPU, cache, and MMU (memory management unit) modules can be selectively added or removed. The 7-stage pipeline is used for frequency improvement (for the SH-4) and the superscalar architecture is used for improved performance (for the SH3-DSP).

Fig. 3: New CPU Core Bus Method. The “SuperHyway” bus serves as the standard bus. To maintain high-speed bus bandwidth, the existing IP connection is done via a bus bridge.

Improvements for Low Power Consumption

The means of achieving low power consumption in the next-generation CPU core include the three conventional methods listed in items (1) through (3) below, and one new method listed as item (4).

(1) Gated clock: An “enable signal” for the clock is added to stop the flip-flop at times when data updating is not necessary.

(2) Module stop mode: The provision of the clock to a selected module (DSP, FPU, etc.) is stopped.

(3) Sleep mode: The provision of the clock to the CPU is stopped to reduce power consumption.

(4) Power cut-off mode: The power supply is separated and the power to individual circuit blocks whose operation is not required is cut off by means of a new power switch. This mode is capable of controlling power for each individual function module, and is particularly effective for cell phones and other products that have particularly strong requirements for low power consumption.
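
The relative effect of the four methods can be illustrated with a first-order power model. The following is a minimal sketch, not Hitachi's design data: it assumes the textbook estimate of dynamic power (activity × switched capacitance × supply voltage squared × frequency) plus leakage power (supply voltage × leak current). The `Block` type, the `block_power_mw` function and every numeric value are hypothetical, and power cut-off is idealized as drawing no power at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One circuit block; all values used with it here are illustrative assumptions."""
    name: str
    switched_cap_nf: float  # effective switched capacitance in nanofarads
    leak_ma: float          # leak current at the supply voltage in milliamps


def block_power_mw(block, vdd, freq_mhz, activity, clock_on=True, power_on=True):
    """Estimate block power in mW as dynamic (switching) power plus leakage power."""
    if not power_on:
        # Power cut-off: idealized as removing both dynamic and leakage power.
        return 0.0
    dynamic = 0.0
    if clock_on:
        # activity * C * Vdd^2 * f; nF * V^2 * MHz comes out in mW.
        dynamic = activity * block.switched_cap_nf * vdd ** 2 * freq_mhz
    leakage = vdd * block.leak_ma  # V * mA comes out in mW
    return dynamic + leakage


# Hypothetical DSP module and operating point.
dsp = Block("DSP", switched_cap_nf=0.5, leak_ma=2.0)
VDD, FREQ_MHZ = 1.5, 200.0

running = block_power_mw(dsp, VDD, FREQ_MHZ, activity=0.2)
# Gated clock: fewer flip-flops toggle, modeled as a lower activity factor.
gated_clock = block_power_mw(dsp, VDD, FREQ_MHZ, activity=0.1)
# Module stop or sleep: the clock is stopped, but leakage remains.
module_stop = block_power_mw(dsp, VDD, FREQ_MHZ, activity=0.2, clock_on=False)
# Power cut-off: the block is disconnected from the supply.
power_cutoff = block_power_mw(dsp, VDD, FREQ_MHZ, activity=0.2, power_on=False)

for label, value in [("running", running), ("gated clock", gated_clock),
                     ("module stop", module_stop), ("power cut-off", power_cutoff)]:
    print(f"{label:>13}: {value:.1f} mW")
```

In this model the clock-based methods, items (1) through (3), remove only the switching term, so the leakage term persists whenever the block is powered. Only the power cut-off of item (4) removes it, which is why that mode matters as leak current grows.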

Method of Implementing SoC Development

SoC development is made easy in the following ways. The peripheral circuits of memory, which were previously configured as hardware macros, now have their controller and data path described in RTL. The address decoder and memory cells are configured from standard compiled RAM, and the differences among RAM interfaces are concealed by wrapper circuits (see Fig. 4). Memory capacity can be changed by parameter settings: multiple memory capacity descriptions are enumerated in the RTL description, and parameter specification descriptions select among them. The FPU, DSP and similar units are connected through an interface that facilitates adding and removing them, which simplifies connection design for the DSP function block when the core is used as the SH3-DSP and for the FPU function block when it is used as the SH-4.
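
The wrapper and parameter-selection ideas can be sketched in software terms. The following Python model is only a behavioral analogy, not RTL and not the actual core description. The two RAM macro classes, their differing interfaces, the wrapper classes, the `make_cache_ram` function and the enumerated capacities are all hypothetical names and values chosen for illustration.

```python
# Enumerated capacities: cache size in KB -> number of 32-bit words (assumed values).
CACHE_WORDS = {8: 2048, 16: 4096, 32: 8192}


class ActiveLowRam:
    """Hypothetical compiled RAM with one port and an active-low write enable."""
    def __init__(self, words):
        self.cells = [0] * words

    def access(self, addr, wen_n, din=0):
        if wen_n == 0:          # write when the enable is low
            self.cells[addr] = din
        return self.cells[addr]


class SplitPortRam:
    """Hypothetical compiled RAM with separate read and write ports."""
    def __init__(self, words):
        self.cells = [0] * words

    def write_port(self, addr, data):
        self.cells[addr] = data

    def read_port(self, addr):
        return self.cells[addr]


class ActiveLowWrapper:
    """Presents the common read/write interface on top of ActiveLowRam."""
    def __init__(self, words):
        self.words = words
        self._ram = ActiveLowRam(words)

    def write(self, addr, data):
        self._ram.access(addr, wen_n=0, din=data)

    def read(self, addr):
        return self._ram.access(addr, wen_n=1)


class SplitPortWrapper:
    """Presents the same common interface on top of SplitPortRam."""
    def __init__(self, words):
        self.words = words
        self._ram = SplitPortRam(words)

    def write(self, addr, data):
        self._ram.write_port(addr, data)

    def read(self, addr):
        return self._ram.read_port(addr)


WRAPPERS = {"active_low": ActiveLowWrapper, "split_port": SplitPortWrapper}


def make_cache_ram(size_kb, macro="active_low"):
    """Select one of the enumerated capacities by parameter and wrap the chosen macro."""
    if size_kb not in CACHE_WORDS:
        raise ValueError(f"unsupported cache size: {size_kb} KB")
    return WRAPPERS[macro](CACHE_WORDS[size_kb])


cache = make_cache_ram(16, macro="split_port")
cache.write(5, 0x1234)
print(hex(cache.read(5)), cache.words)
```

The surrounding logic calls only `read` and `write`, so swapping the RAM macro or the capacity changes a parameter rather than the design that uses the memory.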

On-chip Debugging Functions

For conventional on-chip debugging functions, all of the functions are incorporated in the core (see Fig. 5). In the new CPU core, the break, trace, performance count and other such functions for CPU operation are built into the core as a CPU analyzer, but the on-chip break and trace functions for on-chip bus operation are implemented outside the core as a bus analyzer. This configuration allows easy expansion of the debugging functions.

Testability

The cost of testing is reduced by moving away from the conventional MUX (multiplexer) scan method toward the LBIST method, which shortens the testing period by automatically generating random number patterns within the chip (see Fig. 6). Furthermore, because the LBIST method performs the testing at a speed that matches the operating frequency of the product, an increase in product quality can be expected.
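
How on-chip pattern generation can work is sketched below. The source does not say which generator the core uses. This example assumes the common arrangement of a linear feedback shift register (LFSR) as the pattern source and a multiple-input signature register (MISR) that compresses the responses into one signature. The 16-bit width, the tap positions, the seed and the `good_logic` stand-in for the logic under test are all illustrative assumptions.

```python
MASK16 = 0xFFFF


def lfsr_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one clock."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << 15)) & MASK16


def patterns(seed, count):
    """Yield `count` pseudo-random 16-bit test patterns, starting with the seed."""
    state = seed & MASK16
    if state == 0:
        raise ValueError("an all-zero seed locks up the LFSR")
    for _ in range(count):
        yield state
        state = lfsr_step(state)


def signature(logic, seed=0xACE1, count=1000):
    """Compress the responses of `logic` to all patterns into one MISR signature."""
    sig = 0
    for pattern in patterns(seed, count):
        # MISR update: shift the running signature, then fold in the new response.
        sig = lfsr_step(sig) ^ (logic(pattern) & MASK16)
    return sig


def good_logic(x):
    """Hypothetical stand-in for the fault-free logic under test."""
    return ((x & (x >> 1)) ^ (x << 3)) & MASK16


def stuck_at_zero(x):
    """The same logic with output bit 4 stuck at 0 (an injected fault)."""
    return good_logic(x) & ~(1 << 4) & MASK16


golden = signature(good_logic)          # reference signature from fault-free logic
observed = signature(stuck_at_zero)     # signature from the faulty logic
print("fault flagged:", observed != golden)
```

Only the seed and the final signature need to cross the chip boundary, which is what shortens the test. Because the signature is a compression of all the responses, a mismatch indicates a fault, whereas a match does not strictly prove that none exists.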

CONCLUSIONS

We have described the new type of CPU required by the ubiquitous information society. The requirements for such CPUs are not special; they are shared by a wide range of application fields. Hitachi aims at continuous improvement. Concerning low power consumption, we anticipate a shift away from guaranteeing the maximum current, as with previous devices, and toward dynamic optimum control that matches usage conditions. The next issue to address is how to incorporate such power control functions into operating systems (OS) and other system software.