100g Network Adapter Tuning (DRAFT out for comments, send email to preese @ stanford.edu)
This note details suggestions for taking a default CentOS 7.3 system to a tuned, 100g-enabled system. It currently documents the use of:
Mellanox 100g NIC, Mellanox ConnectX®-4 VPI-- MCX455A-ECAT (1 port) or MCX456A-ECAT (2 port)
Mellanox ConnectX®-4 EN-- MCX415A-CCAT (1 port) or MCX416A-CCAT (2 port)
QLogic Corp. FastLinQ QL45000 Series 100GbE Controller, QL45611HLCU-CK (1 port)
I want to extend this to cover additional NICs as I can get my hands on them.
This will also cover the installation of the Internet2/ESnet install bits to use the perfSONAR tools on an ordinary host. In particular, iperf, iperf3 and nuttcp will be installed so they are available. An additional tool from Caltech, fdt.jar, will also be installed. Finally, for easy visibility into how the CPU cores/threads are used, the htop utility will be installed, as well as the screen app.
- Preparing the system and loading needed drivers and repos.
- Setting firewall rules, just to be safe.
- System tweaks, 'performance' and system speed.
- Testing scenarios, suggestions and results for my configuration.
Install the OS in your favorite way. At the end, be sure things are up to date:
yum update -y
Load the EPEL repository:
yum install epel-release
Next get the needed bits from Internet2, do this:
rpm --import http://software.internet2.edu/rpms/RPM-GPG-KEY-Internet2
yum localinstall http://software.internet2.edu/rpms/el7/x86_64/main/RPMS/Internet2-repo-0.7-1.noarch.rpm
yum install perfsonar-tools
You can look at other perfSONAR bundles available by issuing:
yum search perfSONAR
These commands add a new repo to CentOS, Internet2, which carries most of the software that Internet2 offers, and then install one particular group of tools, the perfSONAR tools bundle. This preps your system for future testing and enables the future option to let others test against your system without them having to bug you for access first!
Also grab the fdt.jar Java application from Caltech/CERN's site:
Download it to a place you'll remember. A 'feature' of Java is that apps like to be launched from the directory the app is in; at least this one does. Thus, to use this tool you'll need to be in the directory with the fdt.jar file.
Two other utilities to install: htop, a variant of the popular 'top' utility, and the screen program. 'htop' gives a graphical view of how each of the cores/threads is being used. Knowing this will help confirm you have the affinity settings correct for your CPU and NIC. 'screen' is a tool that lets you multiplex several terminal sessions within one ssh login and keep them running after you disconnect.
yum install htop screen
Run 'htop' in a different terminal window and you'll see the load on cores/threads grow. This is especially helpful when doing the testing which will follow.
htop has an oddity: it numbers the cores/threads from 1 instead of 0. To fix this, run the program, press F2 for Setup, arrow down to Display options, arrow right to the selection list, arrow down to 'Count CPUs from 0 instead of 1', press space, then press F10 to return to the htop interface.
If you're not familiar with the 'screen' app, I suggest you take a look at:
There are many 'how to' sites, so look around.
The perfSONAR tools use a lot of ports. You can selectively find the ports to open or just trust me and jam these in:
firewall-cmd --zone=public --add-port=61617/tcp --permanent
firewall-cmd --zone=public --add-port=8090/tcp --permanent
firewall-cmd --zone=public --add-port=8096/tcp --permanent
firewall-cmd --zone=public --add-port=4823/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/tcp --permanent
firewall-cmd --zone=public --add-port=861/tcp --permanent
firewall-cmd --zone=public --add-port=8760-9960/udp --permanent
firewall-cmd --zone=public --add-port=33434-33634/udp --permanent
This allows the fdt.jar tool to use its default port
firewall-cmd --zone=public --add-port=54321/tcp --permanent
The --permanent flag only saves these rules; reload the rule database so they take effect.
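A minimal sketch of the reload step, assuming firewalld's stock firewall-cmd is what manages the rules (as in the lines above). The helper name reload_firewall is made up for illustration; it reloads the rules and then lists the ports open in the public zone so you can eyeball the result.

```shell
# Hypothetical helper: reload firewalld and show what is now open.
reload_firewall() {
    if ! command -v firewall-cmd >/dev/null 2>&1; then
        echo "firewall-cmd not found; is firewalld installed?" >&2
        return 127
    fi
    # --reload makes the saved (--permanent) rules the running rules.
    firewall-cmd --reload || return 1
    # List the ports now open in the public zone.
    firewall-cmd --zone=public --list-ports
}
```

Run reload_firewall as root after the firewall-cmd lines above.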
Your system should now have a new net-device, go ahead and give it an appropriate IP, netmask and gateway. While in there, you'll probably want to set the MTU higher, 9000 is a good place to start.
Bring up the interface, is it operational? (You are on your own debugging these issues.)
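One quick check that jumbo frames really pass end to end is a ping with fragmentation forbidden. The sketch below assumes IPv4 and the iputils ping shipped with CentOS; the helper names jumbo_payload and jumbo_ping are hypothetical. The payload is the MTU minus 28 bytes (20 for the IPv4 header, 8 for the ICMP header), so 9000 gives 8972.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IP header minus 8 bytes of ICMP header.
jumbo_payload() {
    echo $(( $1 - 28 ))
}

# Hypothetical helper: ping <host> with "don't fragment" set (-M do),
# so the ping fails if any hop cannot carry the full MTU.
jumbo_ping() {
    host="$1"
    mtu="${2:-9000}"   # assumes the 9000 byte MTU suggested above
    ping -c 3 -M do -s "$(jumbo_payload "$mtu")" "$host"
}
```

For example, jumbo_ping system2 sends three 8972 byte pings; if they fail while a plain ping works, something in the path is still at a smaller MTU.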
Let's find the CPU that the 100g NIC is associated with.
The usual response is either '0' or '1', meaning the NIC is associated with either CPU '0' or CPU '1'. (If it comes back with '-1', it probably means this is a single-CPU system.) Let's assume it returned a '1'.
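One common way to ask the kernel this question is to read the NIC's numa_node entry in sysfs. This is a sketch under that assumption, not necessarily the exact command used for this note; the helper name nic_numa_node and the optional second argument (an alternate sysfs root, handy for testing) are made up for illustration.

```shell
# Hypothetical helper: print the NUMA node a network device hangs off.
# $1 = net-device name, $2 = optional sysfs root (default /sys/class/net)
nic_numa_node() {
    dev="$1"
    root="${2:-/sys/class/net}"
    f="$root/$dev/device/numa_node"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "no numa_node entry for $dev" >&2
        return 1
    fi
}
```

Call it as nic_numa_node <net-device>, using the name of your 100g interface.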
Knowing that, run this command:
lscpu
Quite a long list of interesting bits will scroll onto the screen, at the very end you'll likely see something like this:
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
If the NUMA query from the earlier command came back as '1', we now know that the 100g NIC is associated with Node1, and that the odd-numbered cores/threads belong to Node1's CPU.
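If you only want the core/thread list for one node, without the rest of the scroll, the kernel also publishes it per node in sysfs. A small sketch, assuming the standard /sys/devices/system/node layout; the helper name node_cpus is hypothetical.

```shell
# Hypothetical helper: print the cores/threads that belong to a NUMA node.
# $1 = node number, $2 = optional sysfs root (default /sys/devices/system/node)
node_cpus() {
    node="$1"
    root="${2:-/sys/devices/system/node}"
    cat "$root/node${node}/cpulist"
}
```

On the example system above, node_cpus 1 would report the odd-numbered cores/threads, the same set shown on the NUMA node1 line.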
You'll want to use the CPU closest to the NIC to ensure the highest performance. If you ignore this, your testing may use a core/thread from the other CPU, and the traffic then has to traverse the QPI bus, with its notoriously slow transfer speeds, greatly impacting your speed testing. Affinity of NIC to CPU is very important! See Brian Tierney's paper.
With this affinity info it is now possible to use some of the newer, lower-level Linux commands to associate running apps with specific cores/threads. In particular you may see 'numactl' used with lots of odd options for this purpose.
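For reference, here is roughly what the numactl approach looks like. This is a sketch, assuming the numactl package is installed (it is not part of the steps above) and that the NIC sits on node 1 as in the running example; the wrapper name on_node is made up.

```shell
# Hypothetical wrapper: run a command with its CPUs and memory pinned to one
# NUMA node. --cpunodebind limits which cores/threads the process may run on;
# --membind makes it allocate memory from that same node.
on_node() {
    node="$1"
    shift
    numactl --cpunodebind="$node" --membind="$node" "$@"
}
```

For example, on_node 1 iperf3 -c system2 keeps the iperf3 client on the node the NIC is attached to.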
Thankfully, Mellanox (and QLogic's beta offering) provide other tools to ensure that proper affinity is used for commands directed at the 100g NIC. More on these tools in a moment.
Let me do a little summary here, just to be sure we're all on the same page.
We've done the steps necessary to get the 100g NIC recognized and operational on a 100g enabled network (even if it is one server back to back with another server, my current test configuration).
We know the NUMA number for the CPU associated with the 100g NIC and thus we know what cores/threads to use to assure close affinity for optimal speed.
We could and should do some testing at this point just to establish some baselines for performance at this level and as we proceed with additional tuning steps, which are coming next. (Look below and do an iperf3 memory to memory test NOW, honest.)
Ok, all onboard? Let's keep going.
Most modern CPUs can run at different clock frequencies and often do so to save energy. In our case we want to run the CPU as fast as possible. First let's see what speed each CPU core is running at and what the maximum speed could be. Just run this funky command:
grep -E '^model name|^cpu MHz' /proc/cpuinfo
You'll probably see that the cores aren't running near their spec speed. Most often the frequency governor is set to a level called 'powersave'.
This simple command sets all the cores to 'performance' instead:
sudo cpupower frequency-set --governor performance
Rerun the first command to see what speed the cores are now running at, fast I'll predict!
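To confirm the governor itself, not just the clock speed, you can read it back from sysfs. A minimal sketch, assuming the cpufreq driver exposes scaling_governor for each core/thread; the helper name governor_summary is hypothetical.

```shell
# Hypothetical helper: count how many cores/threads use each governor.
# $1 = optional sysfs root (default /sys/devices/system/cpu)
governor_summary() {
    root="${1:-/sys/devices/system/cpu}"
    cat "$root"/cpu*/cpufreq/scaling_governor 2>/dev/null \
        | sort | uniq -c \
        | awk '{ print $2 ": " $1 " cpus" }'
}
```

After the cpupower command above, governor_summary should print a single 'performance' line covering every core/thread; any leftover 'powersave' line means the change did not take everywhere.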
Sort of goes without saying, but be sure to do this on both systems involved in the testing!
This setting does make a difference in my testing.
You could just start launching some tests using the tools already loaded, iperf, iperf3, nuttcp, and fdt, but you really want to see which cores are being used to ensure you are getting the optimal IRQ use between the NIC and CPU.
Thus I start by opening another terminal for each system and run 'htop' in that window. I also start daemons on both systems, for iperf3 it is just 'iperf3 -sD'
Start by running a simple iperf3 test between the systems and watch htop's output.
iperf3 -c system2
From the second system, test back to the first:
iperf3 -c system1
These are single core tests so you should have seen both htop displays show high use for a single core. The default iperf3 test is for 10sec, to test longer use the -t(#sec) flag.
Let's try to get some more bandwidth going:
iperf3 -c system2 -P8
iperf3 -c system1 -P8
Are the numbers reasonably close to each other? Hope so.
Try different numbers after the P, higher and lower, what are your results?
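Trying several stream counts by hand gets tedious, so here is a small sketch that loops over a few values of -P and prints only the summed receiver rate. It assumes iperf3's usual text output, where multi-stream runs end with [SUM] lines marked sender and receiver; the helper names summary_rate and sweep_streams, and the particular stream counts, are made up for illustration.

```shell
# Pull the bitrate out of the "[SUM] ... receiver" line of iperf3 output
# read on stdin (only present when -P is 2 or more).
summary_rate() {
    awk '/SUM/ && /receiver/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /bits\/sec$/) print $(i - 1), $i
    }'
}

# Hypothetical helper: run iperf3 against <host> with several stream counts.
sweep_streams() {
    host="$1"
    for p in 2 4 8 16; do
        rate=$(iperf3 -c "$host" -P "$p" | summary_rate)
        echo "P=$p: ${rate:-no result}"
    done
}
```

Run sweep_streams system2 with htop open in the other window and watch where the rate stops improving.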
Also notice any numbers in the Retr column, which counts TCP retransmissions, and the Cwnd column, which is the congestion window: how much data TCP will keep in flight before it waits for acknowledgments.
For completeness, try testing the reverse direction:
Running iperf3 -c system2 sends data from system1 to system2. Sometimes it is interesting to run in the reverse direction, system2 to system1. We've done that by going to the other system and testing back, but you can also do it from one system using the -R flag.
iperf3 -c system2 -R
Do the results align?
What is the optimal number of cores, what is the max speed we can get out of the link? Lots of testing can get to these answers. One of the other tools we loaded will greatly help determine the max bandwidth likely between the two systems in their current configuration.
Steps for using the fdt.jar tool:
Start the daemon on one system, be sure to start in the same dir as the fdt.jar file:
java -jar fdt.jar -nettest
Lots of possibly interesting text will scroll by.
Go to the other system, be in the fdt.jar directory:
java -jar fdt.jar -c system1 -nettest
More text will scroll on both screens, eventually getting to bandwidth test results, that will go on as long as you let them, ^C out when ready.
Watching the htop screens during this test should prove quite interesting! Can you suggest how many threads the FDT process uses by default? Which side of the link maxes out?
Stop both sides and start up again but add '-P8' or larger, to the command line. Again watch the htop screen.
First, are the correct cores being used? Next are there as many threads actually being used as requested? Is the bandwidth any higher?
My experience suggests this test is among the best barometers of the speed between two end points.
More suggestions and things to test and try.
I typically set up a number of daemons to be running, for reasons I'll go into shortly. I use the following on both systems:
iperf3 -sD -p5202
iperf3 -sD -p5203
iperf3 -sD -p5204
iperf3 -sD -p5205
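The same four daemons can be started with a short loop, which is easier to extend when you want more ports. A sketch only; the helper name start_iperf3_daemons is hypothetical, and the ports are the ones listed above.

```shell
# Hypothetical helper: start one iperf3 daemon per port given.
start_iperf3_daemons() {
    for port in "$@"; do
        # -s = server mode, -D = run as a daemon, -p = listen port
        iperf3 -sD -p "$port" || return 1
    done
}
```

For example: start_iperf3_daemons 5202 5203 5204 5205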
Since fdt.jar is pretty output intensive, I generally open up a screen session, then start the daemon in it, then return to my main terminal.
Why run multiple iperf3 daemons on each system? One other variable you'll want to track is the symmetry of the connection between the two systems when they do simultaneous transfers.
Thus from system1 I test to system2 on the default port, but from system2 I test back to system1, at the same time, using a different port. Tests of this type lead to some interesting results.
iperf3 -c system2
On the other system:
iperf3 -c system1 -p 5202
The following lines are a little controversial. Current Linux versions autotune the network stack on their own, and the autotuning is pretty good. I like to think of these lines in the following way:
Linux autotuning uses the sysctl settings as the range it is allowed to work within. Changing the values broadens the range that the autotuner can use, and this added range allows better tuning for high speed NICs. It also means that when doing testing, some of the values seen may be different from those set with the following lines.
First make a copy of the current state of the sysctl values:
sysctl -a > ~/orig-sysctl-values
Now edit the file:
vi /etc/sysctl.d/<file>
where <file> on my system is '99-sysctl.conf'.
Add these lines to the end:
(No controversy to add/change these values for high speed nics)
# increase TCP max buffer size settable using setsockopt()
# allow testing with 256MB buffers
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# allow auto-tuning up to 128MB buffers
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# recommended to increase this for CentOS6 with 10G NICS or higher
net.core.netdev_max_backlog = 250000
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# Explicitly set htcp as the congestion control: cubic buggy in older 2.6 kernels
net.ipv4.tcp_congestion_control = htcp
# If you are using Jumbo Frames, also set this
net.ipv4.tcp_mtu_probing = 1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq
These can be activated by:
sysctl -p /etc/sysctl.d/<file>
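Similarly, you can fall back to the original values with the command below. Expect some error messages, since the saved dump also includes read-only keys that sysctl cannot write back: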
sysctl -p ~/orig-sysctl-values
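To judge whether the buffer limits above are big enough for your path, compare them to the bandwidth-delay product (BDP): link speed times round-trip time, which is how much data must be in flight to keep the link full. A small sketch of the arithmetic; the helper name bdp_bytes is made up, and the 10 ms round-trip time is only an example value.

```shell
# Bandwidth-delay product in bytes.
# $1 = link speed in Gbit/s, $2 = round-trip time in milliseconds
# bytes = (Gbit/s * 1e9 / 8) * (ms / 1000) = Gbit/s * ms * 125000
bdp_bytes() {
    echo $(( $1 * $2 * 125000 ))
}

bdp_bytes 100 10   # 100g link, 10 ms RTT -> 125000000
```

That 125000000 bytes fits just under the 134217728 byte (128MB) autotuning maximum set in tcp_rmem and tcp_wmem; a longer path would need larger limits, while back-to-back servers need far less.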
There are three other lines to issue. These only persist until the system restarts, so they'll need to be reissued after each reboot. (There are ways to make these permanent, but I'll leave that for another write-up.)
ethtool -K <net-device> lro on
ifconfig <net-device> txqueuelen 20000
systemctl stop irqbalance   # this is an important service to stop!
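Since these three settings vanish at reboot, it is worth checking that they are actually in place before a test run. A sketch for the LRO part, assuming ethtool -k prints a large-receive-offload line as it normally does; the helper name lro_state is hypothetical. The commented lines show related checks using standard commands.

```shell
# Read `ethtool -k <net-device>` output on stdin and print the LRO setting.
lro_state() {
    awk -F': *' '/large-receive-offload/ { print $2 }'
}

# Typical use (substitute your own interface name for <net-device>):
#   ethtool -k <net-device> | lro_state      # want: on
#   ip link show dev <net-device>            # look for "qlen 20000"
#   systemctl is-active irqbalance           # want: inactive
```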