What is Hyper-Threading? The Basics Explained

What is Hyper-Threading?

The idea behind Hyper-Threading actually comes from the server world, where systems with more than one processor are common. With two processors it is possible, for example, to process two threads per clock cycle, since each CPU can work on its tasks independently. If the programmer has divided an application into several threads, performance can increase significantly, in some cases nearly doubling. With web server applications in particular, one can almost speak of a doubling of performance.
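As a minimal sketch of this idea, the following toy example divides a workload into two threads, each handling an independent share of the requests. The function and request names are purely illustrative; whether such a split actually runs faster depends on the hardware offering two execution contexts (two CPUs, or one Hyper-Threading CPU) and on the runtime:

```python
# Divide an application's work into two threads, one chunk each.
from concurrent.futures import ThreadPoolExecutor

def handle_requests(requests):
    # Stand-in for independent per-request work (e.g. a web server worker).
    return [r.upper() for r in requests]

requests = ["get /index", "get /news", "get /forum", "get /faq"]
halves = [requests[:2], requests[2:]]   # split the workload in half

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(handle_requests, halves))

responses = results[0] + results[1]
print(responses)
```

Because the two chunks share no data, the threads never have to wait on each other, which is exactly the situation in which a second (real or logical) processor pays off.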

Hyper-Threading Technology is nothing more than the technique that enables this simultaneous multi-threading within a single physical processor (with some restrictions). The one processor is split into two logical processors, with the two logical/virtual processors sharing part of the physical execution resources, while the architecture state is present individually for each logical processor and thus exists twice. The aim is to use the available resources more effectively.
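The split can be pictured with a small toy model (illustrative only; the class names and resource counts are invented, not Intel's): the architecture state exists once per logical processor, while the execution resources exist once and are shared by both.

```python
# Toy model: one physical core, two logical processors.
from dataclasses import dataclass, field

@dataclass
class ArchState:             # duplicated: one per logical processor
    registers: dict = field(default_factory=dict)
    instruction_pointer: int = 0

@dataclass
class ExecutionResources:    # present once, shared by both
    alus: int = 4            # invented figure for illustration
    caches: tuple = ("trace cache/L1", "L2")

shared = ExecutionResources()
logical_cpus = [
    {"arch_state": ArchState(), "execution": shared},
    {"arch_state": ArchState(), "execution": shared},
]

# Each logical processor has its own state but the very same hardware.
assert logical_cpus[0]["arch_state"] is not logical_cpus[1]["arch_state"]
assert logical_cpus[0]["execution"] is logical_cpus[1]["execution"]
```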

What's new at the core?

In theory, Hyper-Threading actually sounds quite simple. But what did Intel really have to add to the previous Pentium 4 processor core in order to turn one physical processor into two logical ones? As surprising as the answer may be, "add" is actually not the right word in this context. It would be better to ask which resources, present all along, can finally be activated. Every delivered Pentium 4 with a Northwood core (and probably also with a Willamette core) has all the transistors needed for Hyper-Threading. According to Intel, 5 percent, or 2 to 3 million transistors, of the respective CPU belong to Hyper-Threading. In terms of die size and transistor count, nothing changes compared to the previous Pentium 4.

An overview of the Pentium 4 processor core (Northwood) reveals the regions that Hyper-Threading now dusts off, finally letting them show what they are capable of.

What has been added?

In addition, each virtual processor has its own interrupt controller (APIC), while resources such as the trace cache (TC)/L1 cache, the L2 cache, the queues and key buffers are divided between the two.

With the trace cache, which contains decoded instructions (micro-operations, or µops), if both virtual processors want access to the cache, each processor is granted access alternately with each clock. If a virtual processor is waiting for data from main memory due to a cache miss and is thus effectively 'blocked' (stalled), the entire trace cache is available to the other processor. The internal caches of the processor behave similarly. If one virtual processor is not used at all, dividing these two resources causes no disadvantage.
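The alternating access can be simulated in a few lines. The following is a simplified sketch, not Intel's implementation: each clock the cache serves the other logical processor, unless one of them is stalled, in which case the other one gets every clock.

```python
def trace_cache_arbiter(clocks, stalled):
    """Yield which logical processor (0 or 1) gets the cache each clock.
    `stalled` maps a processor id to True while it waits on memory."""
    turn = 0
    grants = []
    for _ in range(clocks):
        if stalled.get(turn, False):
            turn ^= 1           # skip the stalled processor entirely
        grants.append(turn)
        turn ^= 1               # alternate on the next clock
    return grants

# Both active: strict alternation.
print(trace_cache_arbiter(4, {}))            # [0, 1, 0, 1]
# Processor 1 stalled on a cache miss: processor 0 gets every clock.
print(trace_cache_arbiter(4, {1: True}))     # [0, 0, 0, 0]
```

This also shows why an unused or stalled partner costs nothing: the arbitration simply collapses to single-processor behavior.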

However, it looks different in the out-of-order execution engine, especially with the allocator. The out-of-order execution engine has several buffers for reordering, tracing and sequencing operations. The task of the allocator is to fill these buffers. The Pentium 4 Northwood core has 126 re-order buffer entries, 128 integer and 128 floating-point physical registers, as well as 48 load and 24 store buffer entries. Part of these buffers is divided when Hyper-Threading is active, so that each virtual processor can fill exactly half of the buffer. This concerns the re-order, load and store buffers: each virtual processor can occupy a maximum of 63, 24 or 12 entries, respectively. If the limit is reached, the allocator assigns the resources to the other virtual processor. If the task pot (µop queue) contains decoded instructions for only one processor, the allocator will try to allocate resources for that processor every cycle, but the partitioning remains in place.
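A sketch of this partitioning, using the buffer figures above (the class and method names are invented for illustration): with Hyper-Threading active, each logical processor may occupy at most half of the re-order, load and store buffers, and a request beyond its half is refused regardless of how empty the other half is.

```python
BUFFER_LIMITS = {               # per logical processor: half of the total
    "reorder": 126 // 2,        # 63
    "load": 48 // 2,            # 24
    "store": 24 // 2,           # 12
}

class Allocator:
    def __init__(self):
        # usage[cpu][buffer] = entries currently occupied
        self.usage = {0: dict.fromkeys(BUFFER_LIMITS, 0),
                      1: dict.fromkeys(BUFFER_LIMITS, 0)}

    def allocate(self, cpu, buffer):
        """Try to give `cpu` one entry in `buffer`; refuse at the limit."""
        if self.usage[cpu][buffer] >= BUFFER_LIMITS[buffer]:
            return False        # limit reached; the other half stays reserved
        self.usage[cpu][buffer] += 1
        return True

alloc = Allocator()
# Fill logical processor 0's store-buffer half (12 entries)...
for _ in range(12):
    assert alloc.allocate(0, "store")
# ...the 13th request is refused, but processor 1 still has its half.
print(alloc.allocate(0, "store"))   # False
print(alloc.allocate(1, "store"))   # True
```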

Execution Pipeline

When the allocator has finished its work, the µops end up in two further queues. These are also partitioned so that each processor can claim a maximum of half of all entries. From these two queues (the Memory Operation Queue and the General Instruction Queue), the five µop schedulers take µops alternately from the pot of each virtual processor and assign them to the execution units for processing.
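The alternating draw from the two pots can be sketched as follows. This is a simplified, not cycle-accurate model, and the µop names are invented: the scheduler interleaves µops from each virtual processor's queue share and falls back to whichever pot still has work when the other runs dry.

```python
from collections import deque

def schedule(queue_cpu0, queue_cpu1):
    """Interleave µops from both virtual processors' queue halves."""
    q0, q1 = deque(queue_cpu0), deque(queue_cpu1)
    issued = []
    turn = 0
    while q0 or q1:
        # Take from the processor whose turn it is, if it has µops left;
        # otherwise take from the other one.
        q = q0 if (turn == 0 and q0) or not q1 else q1
        issued.append(q.popleft())
        turn ^= 1
    return issued

print(schedule(["add", "mul"], ["load", "store"]))
# ['add', 'load', 'mul', 'store']
```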

On the next page: How does it work?