The problems inherent in the PCI bus first came to light with the rapid increase in graphics processing capabilities. As GPUs' demands for fast connections to the CPU and memory grew, the AGP bus was created. AGP is a direct derivative of PCI, but clocked higher and free of shared-bus contention problems.
Present-day high-end networks reach speeds of up to 10 Gigabit/s, much higher than what the PCI bus can deliver. Networking is therefore beginning to experience the same problems as graphics processing. For high-end links, special-purpose hardware can be used to circumvent the PCI bottleneck. With the advent of Gigabit Ethernet on the desktop, however, PCI's limitations are becoming ever more critical.
To address the inherent limitations of PCI, a number of competing technologies have been devised. The PCI-X bus, frequently used in midrange to high-end servers, is an extended and backward-compatible version of PCI that allows the clock rate to be increased to 66 and even 133 MHz while adding support for a wider 64-bit data bus. With PCI-X a theoretical bandwidth of roughly 1 GB/s is obtainable. PCI-X 2.0 adds even higher effective clock rates of up to 533 MHz, leading to a 4 GB/s peak rate. PCI-X's main drawback for widespread adoption is the extra cost incurred by the more complicated architecture and its stringent timing demands.
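These peak rates follow directly from multiplying the bus width by the (effective) clock rate:
\[
8\,\mathrm{B} \times 133 \times 10^{6}\,\mathrm{s}^{-1} \approx 1.06\,\mathrm{GB/s},
\qquad
8\,\mathrm{B} \times 533 \times 10^{6}\,\mathrm{s}^{-1} \approx 4.3\,\mathrm{GB/s}.
\]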
A wholly different architecture that is currently receiving a lot of attention, partly due to its cost-effectiveness compared to PCI-X, is PCI Express. Marketed as the natural successor to PCI, PCI Express departs completely from the shared-bus architecture and replaces it with a point-to-point interface. PCI Express uses serial `links' as the main transport medium, where a single lane has a theoretical peak bandwidth of 2.5 Gb/s in each direction. A 20 percent channel overhead, however, reduces the usable throughput to "just" 2 Gb/s. Multiple lanes can be aggregated into links of 2, 4, 8 and currently up to 16 times the individual lane's transfer speed.
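The 20 percent overhead stems from the 8b/10b line encoding, which transmits 10 line bits for every 8 data bits. Per direction this gives, for a single lane and for a full x16 link:
\[
2.5\,\mathrm{Gb/s} \times \tfrac{8}{10} = 2\,\mathrm{Gb/s},
\qquad
16 \times 2\,\mathrm{Gb/s} = 32\,\mathrm{Gb/s} \approx 4\,\mathrm{GB/s}.
\]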
Apart from these general-purpose system bus architectures, specialized interfaces have also been devised. Intel's Communications Streaming Architecture (CSA), for instance, creates a direct interface between a NIC and main memory, tailored specifically to copying data packets without interference from other devices or burdening the CPU. As CSA is targeted specifically at networking, it is unlikely to see widespread adoption on desktop motherboards.
General-purpose processors are powerful devices, but lack network-oriented optimizations and are therefore unable to perform advanced packet processing at high line speeds. For example, a single 1 GHz processor can theoretically dedicate a lowly 1000 cycles per packet at 10^6 PPS rates (a hypothetical 10 Gb link with an average packet size of 1 KB). Once the overhead of kernel processing and especially memory latency is added, the cycles actually available for per-packet processing shrink even further.
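To illustrate the arithmetic: at 10 Gb/s with 1 KB packets, the packet rate is
\[
\frac{10 \times 10^{9}\,\mathrm{b/s}}{8 \times 1024\,\mathrm{b/packet}} \approx 1.2 \times 10^{6}\,\mathrm{PPS},
\]
leaving a 1 GHz processor at most $10^{9} / 10^{6} \approx 10^{3}$ cycles per packet before it falls behind.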
Memory limitations are of particular concern. The theoretical memory bandwidth of present-day technologies lies in the GB/s range. Due to latency, however, sustained throughput is often much lower: the STREAM benchmark reports sustained throughput for a copy operation of only hundreds of MB/s. This problem becomes especially apparent when relatively small memory areas are accessed per request, as is the case in network packet processing. For such operations latency hiding is difficult and prefetching large blocks pointless.
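The order of magnitude is easily verified with a simple copy loop in the spirit of STREAM's Copy kernel (the sketch below is not the official benchmark; array size and timing method are arbitrary choices):

\begin{verbatim}
/* Minimal copy-bandwidth sketch in the spirit of STREAM's Copy kernel.
 * The arrays are chosen much larger than the caches, so the measured
 * rate reflects sustained DRAM throughput rather than cache throughput. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N        (4 * 1024 * 1024)   /* 4M doubles = 32 MB per array */
#define REPEATS  10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    if (!a || !b)
        return 1;
    memset(a, 0, N * sizeof(double));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPEATS; r++)
        for (size_t i = 0; i < N; i++)
            b[i] = a[i];              /* the Copy kernel: read a, write b */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 2.0 * N * sizeof(double) * REPEATS;   /* read + write */
    printf("sustained copy: %.1f MB/s\n", bytes / secs / 1e6);

    free(a);
    free(b);
    return 0;
}
\end{verbatim}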
Moreover, latency-hiding techniques tend to increase the bandwidth requirements. As memory access times improve by less than 10 percent per year, very little indeed compared to Moore's law, the "memory gap" will get worse rather than better. Latency is expected to become a major bottleneck.
For backbone operations network architects have therefore often resorted to costly dedicated embedded solutions equipped with special-purpose hardware, such as ASIC processing units, reduced-latency DRAM and high-speed uncontested data links. Such devices, however, are too expensive to redesign for every new advanced task.
The jury is still out on bus and CPU speeds. PCI Express may push bus speeds up to (close to) network speed in the near future (although this remains to be seen), and CPUs may gain special hardware for dealing with network rates (possibly in combination with processor-level parallelism). There is no doubt, however, that DRAM will become a bottleneck. In tomorrow's networks, fast packet processing may well require avoiding all accesses to slow DRAM whenever possible, storing (fewer) packets in smaller but faster on-chip memory instead. Another strategy is to distribute the memory load. For instance, using general-purpose main memory both for all high-speed links and for all 'normal' applications may be a bad idea. First, it concentrates all the load on a single resource; distributing the load over memory chips present on the network cards may alleviate this problem. Second, memory on the network cards may implement specialised optimisations that would not be suitable for general-purpose memory. An example can be found in network cards based on the Intel IXP1200 network processor, which offers multiple memory access queues with different priorities.
Network processors generally come as a peripheral device, e.g. a PCI card. By handling the most elementary tasks on the network processor, the CPU, main memory and PCI bus are all offloaded. Due to their programmability and close connection to the CPU, however, network processors keep open the option of implementing advanced processing capabilities, either on-board or in the host.
While network processors are not as suited to high-speed switching as dedicated hardware, they are ideal candidates for more specialized functions, e.g. advanced traffic monitoring, for which commodity hardware lacks the processing speed and embedded devices are too costly to design.
An objection to zero-copy is that transferring data over low-bandwidth links, such as the PCI bus or main memory as compared to dedicated RAM, is actually more costly than executing multiple operations locally. For FFPF, the main problem is the PCI bus. Reading over PCI has a maximum theoretical bandwidth of 133 MB/s; even with a 66 MHz bus speed, the measured sustained transfer speed on our test system was only 80 MB/s. The local SDRAM in our test system, however, could copy data (read + write) at 240 MB/s and up, while read speeds were found to lie around 2.4 GB/s. Similarly, by virtue of their highly parallel design, the microengines of the IXP1200 can read at line speed (1 Gb/s).
A single copy over the PCI bus from IXP1200 SDRAM to host SDRAM will therefore often result in higher actual processing speed for processing-intensive tasks, as both the NPU and the CPU can then saturate their respective memory buses instead of having to wait on the much slower PCI interconnect. Moreover, programmed I/O is much slower for bulk transfers than DMA, which is supported by a number of modern network cards.
The only situation in which zero-copy will benefit packet processing is when the total number of bytes read by the host is much smaller than the complete packet, as then the penalty for copying the full packet outweighs the advantage of faster read speeds. This is true, for instance, for a fairly large number of network monitoring applications, which often look at only a small number of fields in the IP/TCP headers. In this case, copying the entire packet (even with fast DMA) may be overkill.
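This trade-off can be captured in a simple, admittedly idealised cost model (the symbols are illustrative): let $b$ be the number of bytes the host actually inspects, $s$ the full packet size, and $B_{\mathrm{PCI}}$, $B_{\mathrm{DMA}}$ and $B_{\mathrm{host}}$ the sustained bandwidths of programmed reads over PCI, of bulk DMA transfer and of host memory reads, respectively. Zero-copy then wins whenever
\[
\frac{b}{B_{\mathrm{PCI}}} \;<\; \frac{s}{B_{\mathrm{DMA}}} + \frac{b}{B_{\mathrm{host}}}.
\]
Since the effective $B_{\mathrm{PCI}}$ for small, scattered reads is far lower than the burst figures quoted above, in practice the inequality only holds when $b$ is a small fraction of $s$.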
An additional factor speaking against zero-copy is that for each byte requested, in reality a much larger cache line is transferred. By copying larger buffers at once, copy-once makes better use of these block transfers and of the memory subsystem's prefetching capabilities, avoiding repeated RAS-to-CAS latencies. Hasan et al. demonstrate how throughput can be increased even further by paying close attention to RAM latency.
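The cache-line effect itself is easy to observe with a small sketch (assuming a 64-byte cache line; adjust for the actual hardware): touching just one byte per line still fetches the full line from DRAM, so the useful bandwidth seen by the program is only a fraction of the memory traffic it generates.

\begin{verbatim}
/* Sketch: a single byte read still costs a full cache-line fill.
 * Assumes a 64-byte cache line and a buffer far larger than the caches. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF   (64 * 1024 * 1024)   /* 64 MB buffer */
#define LINE  64                   /* assumed cache-line size in bytes */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    unsigned char *buf = malloc(BUF);
    volatile unsigned long sink = 0;   /* keeps the loops from being optimised away */
    if (!buf)
        return 1;
    memset(buf, 1, BUF);

    double t0 = now();
    for (size_t i = 0; i < BUF; i += LINE)   /* one byte per cache line */
        sink += buf[i];
    double t1 = now();
    for (size_t i = 0; i < BUF; i++)         /* every byte */
        sink += buf[i];
    double t2 = now();

    /* the sparse pass uses only 1/LINE of the data it forces over the bus */
    printf("one byte per line: %.1f MB/s useful data\n",
           (double)(BUF / LINE) / (t1 - t0) / 1e6);
    printf("every byte:        %.1f MB/s useful data\n",
           (double)BUF / (t2 - t1) / 1e6);

    free(buf);
    return (int)(sink & 1);
}
\end{verbatim}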