Introducing the GeForce4: nVidia strikes back
Light-Speed Memory Architecture II
The problem with many of today's graphics cards, or rather the chips on them, is that graphics chips have developed significantly faster than graphics memory, and the gap between the two is widening. For two years now, graphics chips have been available with fill rates in the gigatexel region (one billion textured pixels per second), but they are increasingly held back by the limited growth in memory bandwidth. Making matters worse, the increasing complexity of current games condemns more and more pixels to invisibility: once the scene has been fully calculated, they turn out to be covered by other objects. This phenomenon is called overdraw.
For some time now there have been techniques to reduce overdraw and to use the wasted bandwidth more effectively. The undisputed leader in this area is the Kyro series from ImgTech/PowerVR, which uses what is known as tile-based rendering: only the pixels that are actually visible on the screen are fully calculated. ATI has been using a combination of techniques under the marketing term 'HyperZ' since the introduction of the original Radeon chip. When the GeForce3 appeared almost a year ago, nVidia finally equipped its chips with bandwidth-saving features as well, giving them the collective name 'Light Speed Memory Architecture'.
The LMA-II now consists of six individual components, each of which can in itself help minimize excessive bandwidth usage by texture operations. In the days of the GeForce3 there were only three such elements, all of which reappear in the LMA-II in 'improved' form: the Cross-Bar Memory Controller, Z-Occlusion Culling and Lossless Z-Compression. New additions are a cache collection called QuadCache, an Auto PreCharge for the local graphics RAM and a Fast Z-Clear modeled on ATI's HyperZ. Below are a few words about each component.
Cross-Bar Memory Controller (CBMC): Almost every conventional graphics card (including even the Radeon 8500) has a single memory controller that performs the various necessary memory accesses, usually 128 bits wide (DDR). In and of itself that is a good thing, but there is a catch: every time an access transfers less than the full 128 bits (DDR), the excess capacity of the controller is simply wasted.
The nV25 (like the nV20 before it) instead uses a quadruple memory controller whose four units each work independently with 32 bits (DDR). These units, which can also be combined into a full 128 bits (DDR) when required, allow a much more efficient use of the available memory bandwidth, since the full 128 bits (DDR) are often not needed for an access. Ideally (accesses of less than 33 bits (DDR)), up to 75% of the bandwidth can be saved and made available for other tasks. The nV17 also contains this technology, but only in duplicate (2x 64 bits (DDR)), so the maximum benefit from the CBMC there is 50% saved bandwidth.
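The arithmetic behind those percentages can be made concrete with a small model. This is purely illustrative (the access mix is invented, and it says nothing about nVidia's actual controller logic): it just counts how many transferred bits go unused when requests can only be served in whole controller-width slots.

```python
# Illustrative model (not nVidia's controller logic): how much of each memory
# transaction is wasted by one 128-bit controller versus four independent
# 32-bit controllers, for a hypothetical mix of access sizes in bits.

def wasted_bits(access_bits, granularity):
    """Bits transferred but not needed when the controller can only
    move whole multiples of `granularity` bits."""
    slots = -(-access_bits // granularity)  # ceiling division
    return slots * granularity - access_bits

accesses = [32, 64, 96, 128, 32, 32]  # invented mix of request sizes

waste_monolithic = sum(wasted_bits(a, 128) for a in accesses)
waste_crossbar = sum(wasted_bits(a, 32) for a in accesses)

print(f"128-bit controller wastes {waste_monolithic} bits on this mix")
print(f"4x32-bit crossbar wastes  {waste_crossbar} bits on this mix")
# A 32-bit request on the monolithic controller wastes 96 of 128 bits,
# i.e. the 75% best-case saving quoted for the CBMC.
```

For this mix the monolithic controller wastes 384 bits while the crossbar wastes none, which is exactly why accesses below 33 bits are the best case.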
QuadCache: For primitives (basic geometric elements), vertex data, textures and pixels that are currently being processed, separate caches optimized in size and structure are used. These hold frequently required data that would otherwise have to be laboriously loaded from the graphics memory. The effect is comparable to the first- and second-level caches without which even a modern processor beyond 1.5 GHz would be slowed to the performance level of a Pentium 133.
According to nVidia, these caches are optimally adapted to their respective requirement profiles, which is also necessary because they are integrated into the chip, where every unnecessary transistor costs real money. We have not yet found out more about this, but on the GeForce3, for example, the vertex cache comprised 24 entries, while the other caches did not yet exist (or at least were not advertised).
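Why even a tiny on-chip cache saves so much bandwidth can be shown with a toy model. This is a generic LRU cache, not the actual QuadCache organisation (which nVidia has not disclosed); the point is only that neighbouring pixels tend to sample the same texels, so the access stream has strong locality and even a small cache hits often.

```python
# Toy illustration of why on-chip caches save memory bandwidth: every hit
# in the small cache is a fetch that never touches graphics RAM.
# Generic LRU model, NOT the actual (undisclosed) QuadCache design.
from collections import OrderedDict

class TinyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def fetch(self, address):
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)     # mark as recently used
        else:
            self.misses += 1                      # would cost a RAM access
            self.entries[address] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used

cache = TinyCache(capacity=16)
for pixel in range(1000):
    cache.fetch(pixel // 4)  # assume four adjacent pixels share a texel
print(f"hits: {cache.hits}, misses: {cache.misses}")
```

With four pixels per texel, three of every four fetches are served on chip: 750 hits against 250 misses, so only a quarter of the accesses reach the graphics RAM at all.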
As already mentioned above, the superimposition of several objects can hardly be avoided in current 3D games. The more complex a game's graphics, the higher this overdraw becomes, and the more performance it costs.
The effect can be seen very nicely in the VillageMark benchmark, in which a KyroII card, whose raw specifications look almost ridiculous on paper, leaves the current high-end chips from ATI and nVidia far behind. The benchmark has a deliberately created, particularly high overdraw factor of up to 10, meaning that up to ten times the number of visible pixels have to be calculated even though they will never be seen. It is precisely this problem that Z-Occlusion Culling (the removal of hidden objects) attempts to deal with.
How can you recognize in advance that certain pixels, or ideally entire areas, will not be visible later? The solution is actually quite obvious. Since you do not see a jumble of overlapping objects on the screen, there must be some value that determines how far away from the viewer a certain object in the field of view is. This value, the depth information, is called the Z value, and it is, obviously, stored in the Z-buffer. If you check before rendering whether a certain pixel is already covered by another, you can save yourself the further calculation, which in turn greatly benefits the available bandwidth.
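The principle can be sketched in a few lines of software. This is only a minimal model of the idea (the real chip does the check in dedicated hardware, ideally for whole regions at once): compare each incoming fragment's depth against the Z-buffer *before* doing the expensive texturing and shading work, and skip hidden fragments.

```python
# Minimal software sketch of Z-occlusion culling: test depth before shading,
# so fragments known to be covered never cost texture bandwidth.

def render_fragment(zbuffer, x, y, depth, shade):
    """Process one fragment at (x, y); smaller depth = closer to the viewer."""
    if depth >= zbuffer[y][x]:
        return False          # already covered: skip shading, save bandwidth
    zbuffer[y][x] = depth
    shade(x, y)               # only visible fragments pay the full cost
    return True

# Two overlapping fragments at the same pixel: only the nearer one is shaded.
INF = float("inf")
zbuf = [[INF]]                # 1x1 Z-buffer, initialised to 'infinitely far'
shaded = []
render_fragment(zbuf, 0, 0, depth=0.3, shade=lambda x, y: shaded.append((x, y)))
render_fragment(zbuf, 0, 0, depth=0.7, shade=lambda x, y: shaded.append((x, y)))
print(shaded)                 # the farther fragment was culled
```

Note that the saving depends on draw order: had the far fragment arrived first, both would have been shaded, which is why hardware schemes that reject whole hidden regions early are so valuable.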
A sub-feature is the application of this procedure to a complete region of the screen, the so-called occlusion query. At present, however, this still has to be requested in software, i.e. the respective program must ask the graphics chip to check a region. In principle this method is far more efficient, but it requires specially designed software.
Lossless Z-Compression: By this, nVidia means compression of the depth information written to the Z-buffer, with a claimed compression ratio of 4:1, as on the GeForce3. Unfortunately, we do not yet have more detailed information on this feature, so nVidia's claim that traffic to and from the Z-buffer can be reduced by a factor of 4 in this way should be viewed critically.
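nVidia gives no details of its scheme, but the underlying idea that depth data compresses well losslessly is easy to demonstrate. Across a flat surface the interpolated Z values change linearly, so even a general-purpose lossless compressor (zlib here, purely for illustration; the hardware certainly uses something simpler and faster) shrinks them considerably without losing a single bit.

```python
# Illustration only: depth values over flat surfaces are highly regular,
# so lossless compression can shrink Z traffic a lot. zlib stands in for
# whatever (undisclosed) scheme the hardware actually uses.
import struct
import zlib

# One scanline of 32-bit Z values across a flat, slightly tilted polygon:
# a linear ramp, as produced by interpolating depth over the surface.
zs = [1000 + 3 * x for x in range(1024)]
raw = struct.pack("<1024I", *zs)

compressed = zlib.compress(raw, level=9)
assert zlib.decompress(compressed) == raw   # lossless: round-trips exactly

print(f"{len(raw)} bytes -> {len(compressed)} bytes "
      f"({len(raw) / len(compressed):.1f}:1)")
```

The round-trip check is the crucial property: unlike texture compression, Z compression must be exact, because any error in a depth value would produce visible pixel faults.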
Auto PreCharge: This feature is meant to make the RAM ready to accept new data as quickly as possible when access moves from one memory area to another. The main memory in a PC, for example, also has to wait a certain number of clock cycles before new data can be sent, and the higher the clock frequency of the RAM, the higher this number of cycles generally is. nVidia speaks of up to 10 clock cycles that can be needed until one memory area is closed and the next is opened and ready for data transfer. Auto PreCharge uses a forecast calculated in the chip to activate areas that may become the target of a data transfer in the near future (in the microsecond range). These are 'pre-charged' on suspicion, so the waiting time can be reduced to the 'normal' latency of 2-3 cycles.
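A back-of-the-envelope model shows what such a predictor is worth. The only numbers taken from nVidia are the 10-cycle worst case and the 2-3 cycle normal latency; the access counts and the prediction hit rate are our assumptions.

```python
# Rough model of Auto PreCharge's benefit. Only the cycle counts come from
# nVidia's figures above; the workload and hit rate are assumptions.

ROW_SWITCH_ON_DEMAND = 10   # close old area, open new one, then transfer
ROW_SWITCH_PRECHARGED = 3   # target area was already prepared in advance

def stall_cycles(row_switches, hit_rate):
    """Total stall cycles for `row_switches` area changes when the
    predictor pre-charges the right area with probability `hit_rate`."""
    predicted = row_switches * hit_rate
    missed = row_switches - predicted
    return predicted * ROW_SWITCH_PRECHARGED + missed * ROW_SWITCH_ON_DEMAND

print(stall_cycles(1000, 0.0))   # no prediction at all
print(stall_cycles(1000, 0.8))   # assumed 80% correct predictions
```

With an assumed 80% hit rate, 1000 area changes stall for 4400 cycles instead of 10000, i.e. more than half the switching latency disappears.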
Fast Z-Clear: We already mentioned the Z-buffer earlier. So that there is no chaos on the screen, its values must of course be exact: leftovers from a pixel of the last fully rendered frame could falsify the result, and pixel errors would follow. For this reason, every value in the Z-buffer must be reset before each new image is calculated. In principle, this costs as much bandwidth as any other access. For a single frame at 1024x768 in 32 bit, 3.1 MB of bandwidth is lost to this Z-buffer reset alone (multiplied by the frame rate, at 30 frames/s that is already 95 MB/s), bandwidth that could just as well be used for other things.
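The article's arithmetic is easy to reproduce: a 1024x768 Z-buffer with 32-bit entries, rewritten in full once per frame.

```python
# Reproducing the bandwidth cost of a brute-force Z-buffer clear:
# every 32-bit depth value rewritten once per frame.

width, height = 1024, 768
bytes_per_z = 4                      # 32-bit depth values

per_frame = width * height * bytes_per_z
per_second = per_frame * 30          # 30 frames/s, as in the text

print(f"per frame:  {per_frame / 1e6:.1f} MB")    # ~3.1 MB
print(f"per second: {per_second / 1e6:.1f} MB/s") # ~94 MB/s, roughly the 95 MB/s quoted
```

At higher resolutions or frame rates the cost scales linearly, which is why skipping this write entirely is worthwhile.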
Just like the corresponding part of HyperZ in ATI's Radeon series, which has been doing the same since mid-2000, Fast Z-Clear does not overwrite the Z values individually with zero but resets them in one go, not unlike pressing a reset button, except that only the area of the graphics memory holding the Z-buffer is cleared.
In this context it is interesting that the GeForce3 had yet another feature aimed at reducing bandwidth usage, in this case for geometry data.
We are talking about the so-called higher-order surfaces (HOS). Many may have come across them in the form of ATI's TruForm. In essence, nVidia's approach to HOS was no longer to push complex curved surfaces across the AGP bus as a vast number of triangles, but to describe them with a compact mathematical formula; necessary changes were then made by altering the formula's variables. Unfortunately, this feature remained a phantom throughout its life. It appears in the official product description of the GeForce3, but since version 20.80 of the Detonator driver it has no longer been available, and now nVidia seems to have removed it from the feature list entirely. Who knows, maybe it will be marketed again as a new feature of the nV30?
On the next page: Accuview