Network Processing Bottlenecks

Dealing with bottlenecks

Network speeds have increased rapidly over the years. Not only has the rise in network throughput outpaced that of CPUs, it has also outstripped the capabilities of many system links, such as RAM and peripheral buses.

The worst offender: PCI

The current bottleneck in most commodity computing platforms (e.g. PCs) for dedicated processing is the PCI bus. PCI was introduced about a decade ago, and its standard throughput of 133 MB/s at 33 MHz on a 32-bit wide bus was at the time fast enough to handle all traffic between the CPU and all peripheral devices.

The problems inherent to the PCI bus came to light with the rapid increase in graphical processing capabilities. With the GPU's quickly increasing demands for fast connections to the CPU and memory, the AGP bus was created. AGP is a direct derivative of PCI, but clocked higher and without shared-bus contention problems.

Present-day high-end networks reach speeds of up to 10 Gb/s, much higher than what the PCI bus can deliver. Networking is therefore beginning to experience the same problems as graphics processing. For high-end links, special purpose hardware can be used to circumvent the PCI bottleneck. With the advent of Gigabit on the desktop, however, PCI woes are becoming ever more critical.

To address the inherent limitations of PCI, a number of competing technologies have been devised. The PCI-X bus, used frequently in midrange to high-end servers, is an extended and backward compatible version of PCI that allows the clock rate to be increased to 66 and even 133 MHz and simultaneously adds support for a wider 64-bit data bus. With PCI-X a theoretical bandwidth of about 1 GB/s is obtainable. PCI-X 2.0 adds even higher clock rates of up to 533 MHz, leading to a peak rate of roughly 4 GB/s. PCI-X's main drawback for widespread implementation is the extra cost incurred by the more complicated architecture and tight timing requirements.
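The peak figures quoted for PCI and PCI-X follow directly from clock rate times bus width. The sketch below reproduces that arithmetic; the function name and the table of buses are illustrative, and the clock values are the nominal ones (33.33 MHz and its multiples), which is an assumption on top of the rounded numbers in the text.

```python
def peak_bandwidth_mb_s(clock_mhz: float, width_bits: int) -> float:
    """Theoretical peak of a parallel bus: one width_bits transfer per clock.

    MHz times bytes per transfer gives decimal megabytes per second.
    """
    return clock_mhz * width_bits / 8


# (clock in MHz, bus width in bits); nominal clocks are an assumption.
BUSES = {
    "PCI 32-bit / 33 MHz": (33.33, 32),
    "PCI-X 64-bit / 66 MHz": (66.66, 64),
    "PCI-X 64-bit / 133 MHz": (133.33, 64),
    "PCI-X 2.0 64-bit / 533 MHz": (533.33, 64),
}

for name, (clock, width) in BUSES.items():
    print(f"{name}: {peak_bandwidth_mb_s(clock, width):.0f} MB/s")
```

These are upper bounds only: on a shared bus, arbitration and contention between devices keep sustained rates well below them.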

A wholly different architecture that is currently receiving a lot of attention, partly due to its cost-effectiveness compared to PCI-X, is PCI Express. Marketed as the natural successor to PCI, PCI Express departs completely from the shared-bus architecture and replaces it with a point-to-point interface. PCI Express uses serial `lanes' as the main transport medium, whereby a single lane has a theoretical peak bandwidth of 2.5 Gb/s each way. However, a 20 percent channel overhead, caused by the 8b/10b line encoding, reduces the throughput to "just" 2 Gb/s. Multiple lanes can be aggregated into one link to reach 2, 4, 8 and currently up to 16 times the transfer speed of an individual lane.
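As a rough illustration of the PCI Express figures, the sketch below computes the usable rate per direction for each aggregation width and picks the narrowest width that covers a given network link. The constants come from the text; the function names are hypothetical, and protocol overhead above the line encoding is deliberately ignored.

```python
from typing import Optional

RAW_MBIT_PER_LANE = 2500       # raw signalling rate per direction (from the text)
WIDTHS = (1, 2, 4, 8, 16)      # aggregation widths mentioned in the text


def usable_mbit_s(lanes: int) -> int:
    """Rate left after the 20 percent encoding overhead, in Mb/s per direction."""
    return lanes * RAW_MBIT_PER_LANE * 8 // 10


def lanes_needed(target_mbit_s: int) -> Optional[int]:
    """Narrowest width whose usable rate covers the target, or None if none does."""
    for width in WIDTHS:
        if usable_mbit_s(width) >= target_mbit_s:
            return width
    return None


for width in WIDTHS:
    print(f"x{width}: {usable_mbit_s(width)} Mb/s each way")
```

Under these assumptions a 10 Gb/s network link needs an eight-wide connection, since four lanes deliver only 8 Gb/s.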

Apart from these general purpose system bus architectures, specialized interfaces have also been devised. Intel's Communications Streaming Architecture (CSA), for instance, creates a direct interface between a NIC and main memory, tailored specifically for copying data packets without interference from other devices and without burdening the CPU. CSA is targeted specifically at networking and is therefore unlikely to see widespread adoption in desktop motherboards.

Additional worries: CPU and RAM

The technologies mentioned above have in common that they break the PCI bottleneck by increasing the throughput of the transfer medium. However, for (multi)gigabit networking this will only solve part of the problem. The remaining issues are the relatively small amount of per-packet processing that a CPU can deliver and the insufficient bandwidth of standard memory.

General purpose processors are powerful devices, but lack network-oriented optimizations. They are therefore unable to undertake advanced packet processing at high line speeds. For example, a single 1 GHz processor can theoretically dedicate a mere 1000 cycles or so to each packet at a rate of 10^6 packets per second (a hypothetical 10 Gb/s link where the average packet size is 1 KB). With the overhead of kernel processing and especially memory latency added, the actual per-packet processing budget decreases even more.
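The per-packet cycle budget can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch with hypothetical function names; it assumes the link is fully loaded and ignores framing overhead. Working the quoted figures through gives a budget of roughly 800 cycles per packet (1000 if the rate is rounded down to 10^6 packets per second), and far less for small packets.

```python
def packets_per_second(link_bit_s: float, packet_bytes: int) -> float:
    """Packet rate on a saturated link, ignoring preamble and inter-frame gaps."""
    return link_bit_s / (packet_bytes * 8)


def cycles_per_packet(cpu_hz: float, link_bit_s: float, packet_bytes: int) -> float:
    """CPU cycles available per packet if the processor does nothing else."""
    return cpu_hz / packets_per_second(link_bit_s, packet_bytes)


CPU_HZ = 1e9        # 1 GHz processor, as in the text
LINK_BIT_S = 10e9   # 10 Gb/s link, as in the text

for size in (1000, 500, 64):  # packet sizes in bytes; 500 and 64 are illustrative
    budget = cycles_per_packet(CPU_HZ, LINK_BIT_S, size)
    print(f"{size:5d}-byte packets: {budget:.1f} cycles per packet")
```

A single main-memory access can cost on the order of a hundred cycles, so even the most generous of these budgets leaves room for only a handful of cache misses per packet.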

Memory limitations are especially of concern. The theoretical memory bandwidth of present-day technologies lies in the GB/s range. However, due to latency, sustained throughput is often much lower. The STREAM benchmark gives sustained throughput under a copy operation in the hundreds of MB/s. This problem becomes especially apparent when accessing relatively small memory areas per request, as is the case with network packet processing. For such operations latency hiding is difficult and prefetching large blocks pointless.

Moreover, latency hiding techniques tend to increase the bandwidth requirements. As the access time of memory improves at a rate of less than 10 percent per year, very little indeed compared to Moore's law, the "memory gap" is set to get worse rather than better. Latency is expected to become a major bottleneck.

For backbone operations, network architects have therefore often resorted to costly dedicated embedded solutions equipped with special purpose hardware, such as ASIC processing units, reduced latency DRAM and high-speed uncontested data links. These devices are, however, too expensive to reconfigure for each of the varying advanced tasks.

The bottom line: present and future bottlenecks

The jury is still out on bus and CPU speeds. PCI Express may crank bus speeds up to (close to) network speed in the near future (although this remains to be seen), and CPUs may be improved with special hardware for dealing with network rates (possibly in combination with processor-level parallelism). There is no doubt, however, that DRAM will become a bottleneck. In tomorrow's networks, fast packet processing may well require avoiding accesses to slow DRAM whenever possible, and storing (fewer) packets in smaller, yet faster, on-chip memory.

Another strategy is to distribute the memory load. For instance, accessing general purpose main memory for all high-speed links and for all 'normal' applications as well may be a bad idea. First, it means that all the load is concentrated on a single resource. Distributing the load over multiple memory chips present on the network cards may alleviate this problem. Second, memory on the network cards may implement all sorts of specialised optimisations that would not be suitable for general purpose memory. An example can be found in network cards based on Intel IXP1200 network processors, which have multiple different priority queues to access memory.

Network Processors

A solution that shares some of the advantages of dedicated devices with the cost-effectiveness of commodity hardware is the network processor (NPU). Several NPU platforms have been created, such as IXP by Intel, PowerNP by IBM and C-Port by Motorola. Network processors are dedicated pieces of equipment consisting of high-speed memory and fast interconnects. Unlike high-end solutions, however, they do not contain preprogrammed ASICs, but use reprogrammable `micro' processors. The Intel IXP1200, for instance, consists of an ARM CPU for control tasks coupled with 6 `microengines', simple processors that do not need an operating system to function. Also, by enabling multithreading at the hardware level, microengines can hide DRAM latency by switching tasks during read delays. The result is a higher sustained throughput than is achievable with commodity hardware.
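Why hardware multithreading hides memory latency can be seen from a simple idealized model: each thread computes for a while and then stalls on a memory read, and the engine switches to another thread at no cost. The sketch below implements that model; it is not a description of the IXP1200's actual scheduling, and the cycle counts are hypothetical.

```python
def engine_utilization(threads: int, compute_cycles: int, memory_wait_cycles: int) -> float:
    """Fraction of time the engine does useful work in an idealized model.

    Each thread alternates compute_cycles of work with memory_wait_cycles of
    stall. With free context switches, other threads fill the stall time until
    the engine is fully busy.
    """
    busy = threads * compute_cycles
    period = compute_cycles + memory_wait_cycles
    return min(1.0, busy / period)


# Hypothetical workload: 20 cycles of work per 60-cycle memory read.
for n in (1, 2, 4, 8):
    print(f"{n} thread(s): {engine_utilization(n, 20, 60):.0%} busy")
```

In this model the engine saturates once the number of threads reaches 1 + wait/compute, which is why a few hardware contexts suffice when the per-packet work between memory accesses is short.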

Network processors generally come as a peripheral device, e.g. a PCI card. By handling most elementary tasks on the network processor, load is taken off the CPU, memory and PCI bus. However, due to their programmability and close connection to the CPU, network processors keep open the option of implementing advanced processing capabilities, either on-board or in the host.

While network processors are not as suited to high-speed switching as dedicated hardware, they can be seen as an ideal candidate for more specialized functions, e.g. advanced traffic monitoring, for which commodity hardware lacks the processing speed and embedded devices are too costly to design.

Optimizing throughput: the zero-copy myth

Depending on the operation, quite a lot of data may still have to be transferred between the NPU and the CPU. This is especially the case in one-way tasks, such as network monitoring. To maximize flexibility, the implementation of the inter-processor communication should be optimized as much as possible. An often-heard solution for increasing overall system throughput is the so-called zero-copy method, whereby network packets are written only once, into device memory, by the NIC. After that, all actors in the processing queue access the same piece of memory.

An objection to using zero-copy is that transferring data over low bandwidth links, such as the PCI bus or main memory as compared to dedicated RAM, is actually more costly than executing multiple operations locally. For FFPF, the main problem is the PCI bus. Reading over standard PCI has a maximum theoretical bandwidth of 133 MB/s. Even with a 66 MHz bus speed, the measured sustained transfer speed on our test system was only 80 MB/s. We found that the local SDRAM in our test system could, however, copy data (read+write) at 240 MB/s and up. Read speeds were found to lie around 2.4 GB/s. Similarly, by virtue of their highly parallel designs, the microengines of the IXP1200 can read at line speed (1 Gb/s).

A single copy over the PCI bus from IXP1200 SDRAM to host SDRAM will therefore often result in higher actual processing speed for processing-intensive tasks, as both the NPU and the CPU can saturate their respective memory buses, instead of having to wait on the much slower PCI interconnect. Moreover, programmed I/O is much slower for bulk transfer than DMA access, which is supported by a number of modern network cards.

The only situation in which zero-copy will benefit packet processing is when the total number of bytes read by the host is much smaller than the complete packet, as then the penalty for copying the full packet outweighs the advantage of having faster read speeds. This is true, for instance, for a fairly large number of network monitoring applications, which often look at only a very small number of fields in the IP and TCP headers. In this case, copying the entire packet (even with fast DMA) may be overkill.
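The trade-off between zero-copy and copy-once can be made concrete with a simple cost model. The sketch below is an illustration under stated assumptions, not a measurement: the DMA and local read rates are the figures quoted above (80 MB/s and 2.4 GB/s), while the programmed I/O read rate of 20 MB/s is a purely hypothetical value chosen to reflect that programmed I/O is slower than DMA. "Touched bytes" counts every byte the host reads, including repeated reads of the same data.

```python
def zero_copy_time(touched_bytes: float, pio_read_bw: float) -> float:
    """Seconds spent if the host reads everything it needs across the bus."""
    return touched_bytes / pio_read_bw


def copy_once_time(packet_bytes: float, touched_bytes: float,
                   dma_bw: float, local_read_bw: float) -> float:
    """Seconds spent if the packet is copied once by DMA and then read locally."""
    return packet_bytes / dma_bw + touched_bytes / local_read_bw


def break_even_bytes(packet_bytes: float, pio_read_bw: float,
                     dma_bw: float, local_read_bw: float) -> float:
    """Touched-byte count above which copy-once becomes the cheaper option."""
    if pio_read_bw >= local_read_bw:
        raise ValueError("model assumes remote reads are slower than local reads")
    return (packet_bytes / dma_bw) / (1 / pio_read_bw - 1 / local_read_bw)


DMA_BW = 80e6          # bytes/s, sustained PCI transfer measured in the text
LOCAL_READ_BW = 2.4e9  # bytes/s, local read speed measured in the text
PIO_READ_BW = 20e6     # bytes/s, hypothetical programmed I/O read rate

packet = 1500
for touched in (40, 1500, 6000):
    z = zero_copy_time(touched, PIO_READ_BW)
    c = copy_once_time(packet, touched, DMA_BW, LOCAL_READ_BW)
    winner = "zero-copy" if z < c else "copy-once"
    print(f"{touched:5d} bytes touched: {winner} is cheaper")
print(f"break-even: {break_even_bytes(packet, PIO_READ_BW, DMA_BW, LOCAL_READ_BW):.0f} bytes")
```

With these numbers the break-even point for a 1500-byte packet lies near a quarter of the packet: a monitor that inspects a few header fields is better off with zero-copy, while anything that reads most of the payload, or reads it more than once, favours copy-once.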

An additional factor speaking against zero-copy is that for each byte requested, a much larger cache line is in reality transferred. By copying larger buffers at once, copy-once makes better use of these block transfers and of the subsystem's prefetching capabilities, circumventing RAS-to-CAS latencies. Hasan et al. demonstrate how throughput can be increased even more by paying close attention to RAM latency.
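The cache line effect is easy to quantify. The sketch below counts how many bytes actually move when a few small fields are read; the function name, the 64-byte line size and the field offsets (an Ethernet frame carrying IPv4 with a 20-byte header, followed by TCP) are assumptions made for the example.

```python
def bytes_transferred(accesses, line_size=64):
    """Bytes moved when each access pulls in every cache line it overlaps.

    accesses is an iterable of (offset, length) pairs; each distinct line is
    counted once, as if it stayed cached after the first access.
    """
    lines = set()
    for offset, length in accesses:
        first = offset // line_size
        last = (offset + length - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines) * line_size


# Assumed field offsets within the frame: IP protocol, source and destination
# address, TCP source and destination port.
FIELDS = [(23, 1), (26, 4), (30, 4), (34, 2), (36, 2)]

requested = sum(length for _, length in FIELDS)
moved = bytes_transferred(FIELDS)
print(f"requested {requested} bytes, transferred {moved} bytes")
```

The ratio between requested and transferred bytes is what makes many small remote reads expensive: the bus carries whole lines regardless of how little of each line is used.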

Implementation in FFPF

In FFPF we have opted for a dual approach. Most of the time, copy-once will result in the best performance, especially when client applications request full packets, so that all data has to be transferred anyway. When zero-copy is expected to perform better, however, the device driver can be told to leave data at the remote location. An extension to this scheme might be to copy only those portions of a packet that will probably be needed, such as the packet header, and leave the rest at the remote location until it is requested. Taken to the extreme, this would result in using main memory as a cache for PCI-addressable memory.
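The proposed extension, copying the header eagerly and fetching the rest on demand, can be sketched as a small host-side cache. This is a toy simulation only: the class and its members are hypothetical and are not FFPF's interface, and an ordinary bytes object stands in for PCI-addressable device memory.

```python
class LazyPacket:
    """Host-side view of a packet whose body stays in (simulated) device memory."""

    def __init__(self, remote: bytes, eager_bytes: int):
        self._remote = remote                           # stand-in for device memory
        self._local = bytearray(remote[:eager_bytes])   # header copied up front
        self.remote_fetches = 0                         # bus transfers after the initial copy

    def read(self, offset: int, length: int) -> bytes:
        """Return packet bytes, pulling any missing part across the bus first."""
        end = min(offset + length, len(self._remote))
        if end > len(self._local):
            # Cache miss: extend the local copy up to the requested end.
            self._local += self._remote[len(self._local):end]
            self.remote_fetches += 1
        return bytes(self._local[offset:end])


pkt = LazyPacket(bytes(range(100)), eager_bytes=40)
pkt.read(0, 20)     # served from the eagerly copied header
pkt.read(30, 30)    # crosses into the body: one remote fetch
pkt.read(50, 10)    # already cached locally
print(pkt.remote_fetches)
```

A real driver would fetch in bus-friendly block sizes rather than byte-exact ranges; the point of the sketch is only that reads of the header never touch the slow interconnect, while payload reads pay for it once.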

Last edited March 17, 2005 4:06 pm by Testeditor