100g Network Adapter Tuning (DRAFT out for comments, send email to preese @ stanford.edu)

 
This note details suggestions for taking a default Centos 7.3 system to a tuned, 100g enabled system. Currently this documents the use of:
Mellanox 100g NIC, Mellanox ConnectX®-4 VPI-- MCX455A-ECAT (1 port) or MCX456A-ECAT (2 port)
Mellanox ConnectX®-4 EN-- MCX415A-CCAT (1 port) or MCX416A-CCAT (2 port)
QLogic Corp. FastLinQ QL45000 Series 100GbE Controller, QL45611HLCU-CK (1 port)
 
I want to extend this to cover additional NICs as I can get my hands on them.
 
This will also cover the installation of the Internet2/ESnet bits needed to use the perfSONAR tools on an ordinary host. In particular, the iperf, iperf3 and nuttcp tools will be installed so they are available. An additional tool from CalTech, fdt.jar, will also be installed. Finally, for easy visibility of the use of the CPU cores/threads, the htop utility will be installed, as well as the screen app.
 
  1. Preparing the system and loading needed drivers and repo's.
  2. Setting firewall rules, just to be safe.
  3. System tweaks, 'performance' and system speed.
  4. Testing scenarios, suggestions and results for my configuration.

1. Preparing the system and loading needed drivers and repo's

 
Install the OS in your favorite way. At the end, be sure things are up to date:
yum update -y
 
Load the EPEL repository:
yum install epel-release
 
Next, get the needed bits from Internet2:
yum install perfsonar-tools
 
You can look at other perfSONAR bundles available by issuing:
yum search perfSONAR
 
These commands install a new repo in Centos, Internet2, which has most of the available software that Internet2 offers, and then install a particular group of tools, the perfSONAR tools bundle. This preps your system for future testing and enables the future option to let others test against your system without them having to bug you for access first!
 
Also grab the fdt.jar JAVA application from CalTech/CERN's site:
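The direct download URL used later in this note works here too, for example:
wget http://monalisa.cern.ch/FDT/lib/fdt.jar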
 
Download it to a place you'll remember. A 'feature' of Java apps like this one is that they like to be launched from the directory the .jar file is in. Thus, to use this tool you'll need to be in the directory with the fdt.jar file.
 
Two other utilities to install: a variant of the popular 'top' utility and the screen program. 'htop' gives a graphical view of how each of the cores/threads is being used. Knowing this will help confirm you have the affinity settings correct for your CPU and NIC. 'screen' is a tool which allows you to multiplex different ssh sessions.
 
yum install htop screen
 
Run 'htop' in a different terminal window and you'll see the load on cores/threads grow. This is especially helpful when doing the testing which will follow.
 
The tool has an oddity where the cores/threads are numbered starting at 1 instead of 0. To fix this, run the program, hit F2 to take you to Setup, down arrow to 'Display options', right arrow to the selection list, down arrow to 'Count CPUs from 0 instead of 1', hit space, then F10 to return to the htop interface.
 
If you're not familiar with the 'screen' app, I suggest you take a look at one of the many 'how to' sites out there.
 

2. Setting firewall rules, just to be safe

 
The perfSONAR tools use a lot of ports. You can selectively find the ports to open or just trust me and jam these in:
 
firewall-cmd --zone=public --add-port=61617/tcp --permanent
firewall-cmd --zone=public --add-port=8090/tcp --permanent
firewall-cmd --zone=public --add-port=8096/tcp --permanent
firewall-cmd --zone=public --add-port=4823/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/tcp --permanent
firewall-cmd --zone=public --add-port=6001-6200/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/udp --permanent
firewall-cmd --zone=public --add-port=5001-5900/tcp --permanent
firewall-cmd --zone=public --add-port=861/tcp --permanent
firewall-cmd --zone=public --add-port=8760-9960/udp --permanent
firewall-cmd --zone=public --add-port=33434-33634/udp --permanent
 
This allows the fdt.jar tool to use its default port:
firewall-cmd --zone=public --add-port=54321/tcp --permanent
 
Make these permanent and reload the rule database:
firewall-cmd --reload
 

3. System tweaks, 'performance' and system speed

 
Your system should now have a new net-device; go ahead and give it an appropriate IP, netmask and gateway. While you're in there, you'll probably want to set the MTU higher; 9000 is a good place to start.
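If you want a quick, non-persistent way to do this from the command line, a sketch looks like this (the device name and address are placeholders, substitute your own):
ip addr add 192.168.100.1/24 dev <100g-NIC-name>    # placeholder address
ip link set dev <100g-NIC-name> mtu 9000
ip link set dev <100g-NIC-name> up
For a persistent setup, put the equivalent IPADDR, PREFIX (or NETMASK), GATEWAY and MTU=9000 entries in the matching /etc/sysconfig/network-scripts/ifcfg-<100g-NIC-name> file instead.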
 
Bring up the interface, is it operational? (You are on your own debugging these issues.)
 
Let's find the CPU that the 100g NIC is associated with.
cat /sys/class/net/<100g-NIC-name>/device/numa_node
 
The usual response is either '0' or '1', meaning the NIC is associated with either the '0' CPU or the '1' CPU. (If it comes back with a '-1' it probably suggests it is a single CPU system.) Let's assume it returned a '1'.
 
Knowing that, run this command:
lscpu
 
Quite a long list of interesting bits will scroll onto the screen; at the very end you'll likely see something like this:
 
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
 
If the NUMA node from the earlier command came back as a '1', we now know that the 100g NIC is associated with node1 and should use the odd-numbered cores/threads associated with node1's CPU.
 
You'll want to use the CPU closest to the NIC to ensure the highest performance. Ignoring this means your testing may use a core/thread from the other CPU and therefore have to traverse the QPI bus, with its notoriously slow transfer speeds, greatly impacting your speed testing. Affinity of NIC to CPU is very important! See Brian Tierney's paper.
 
With this affinity info it is now possible to use some of the newer lower-level Linux commands to associate running apps with specific cores/threads. In particular you may see 'numactl' with lots of odd options used for this purpose.
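As a quick sketch (assuming the NIC reported NUMA node 1, as in the example above), numactl can pin a test client to that node's cores and memory:
numactl --cpunodebind=1 --membind=1 iperf3 -c system2 -P8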
 
Thankfully, Mellanox (and QLogic's beta offering) provide other tools to ensure that proper affinity is used when running commands directed at the 100g NIC. More on these tools in a moment.
 
Let me do a little summary here, just to be sure we're all on the same page.
 
We've done the steps necessary to get the 100g NIC recognized and operational on a 100g enabled network (even if it is one server back to back with another server, my current test configuration).
 
We know the NUMA number for the CPU associated with the 100g NIC and thus we know what cores/threads to use to assure close affinity for optimal speed.
 
We could and should do some testing at this point just to establish some baselines for performance at this level and as we proceed with additional tuning steps, which are coming next. (Look below and do an iperf3 memory to memory test NOW, honest.)
 
OK, all on board? Let's keep going.
 
Most modern CPUs can run at different clock frequencies and often do so to save energy. In our case we want to run the CPU as fast as possible. First let's see what speed each CPU core is running at and what the maximum speed could be. Just run this funky command:
grep -E '^model name|^cpu MHz' /proc/cpuinfo
 
You'll probably see that the cores aren't running near their spec speed, most often because they are under the 'powersave' governor.
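One quick way to check the current governor on every core (a sketch; this sysfs path is standard on Centos 7):
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c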
 
This simple command sets all the cores to 'performance' instead:
sudo cpupower frequency-set --governor performance
 
Rerun the first command to see what speed the cores are now running at, fast I'll predict!
 
Sort of goes without saying, but be sure to do this on both systems involved in the testing!
 
This setting does make a difference in my testing.
 

4. Testing scenarios, suggestions and results for my configuration

 
You could just start launching some tests using the tools already loaded, iperf, iperf3, nuttcp, and fdt, but you really want to see which cores are being used to ensure you are getting the optimal IRQ use between the NIC and CPU.
 
Thus I start by opening another terminal for each system and running 'htop' in that window. I also start daemons on both systems; for iperf3 it is just 'iperf3 -sD'.
 
Start by running a simple iperf3 test between the systems and watch htop's output.
iperf3 -c system2
 
From the second system, test back to the first:
iperf3 -c system1
 
These are single core tests so you should have seen both htop displays show high use for a single core. The default iperf3 test runs for 10 seconds; to test longer, use the -t <seconds> flag.
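For example, a 60 second single-stream run would look like:
iperf3 -c system2 -t 60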
 
Let's try to get some more bandwidth going:
iperf3 -c system2 -P8
then
iperf3 -c system1 -P8
 
Are the numbers reasonably close to each other? Hope so.
 
Try different numbers after the P, higher and lower, what are your results?
 
Also notice any numbers in the Retr column; these indicate TCP retransmissions. Also watch the Cwnd column, which shows the congestion window, i.e. how much data can be in flight at once.
 
For completeness, try testing the reverse direction:
Running 'iperf3 -c system2' sends data from system1 to system2. Sometimes it is interesting to run in the reverse direction, system2 to system1; we've done that by going to the other system and testing back. You can also do that from only one system using the -R flag.
iperf3 -c system2 -R
 
Do the results align?
 
What is the optimal number of cores, what is the max speed we can get out of the link? Lots of testing can get to these answers. One of the other tools we loaded will greatly help determine the max bandwidth likely between the two systems in their current configuration.
 
Steps for using the fdt.jar tool:
Start the daemon on one system, be sure to start in the same dir as the fdt.jar file:
java -jar fdt.jar -nettest
 
Lots of possibly interesting text will scroll by.
 
Go to the other system, be in the fdt.jar directory:
java -jar fdt.jar -c system1 -nettest
 
More text will scroll on both screens, eventually getting to bandwidth test results; these will continue as long as you let them, so ^C out when ready.
 
Watching the htop screens during this test should prove quite interesting! Can you suggest how many threads the FDT process uses by default? Which side of the link maxes out?
 
Stop both sides and start up again, but add '-P8' or larger to the command line. Again watch the htop screen.
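For example, on the client side (the exact flag spelling follows the FDT docs, so double check there):
java -jar fdt.jar -c system1 -nettest -P 8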
 
First, are the correct cores being used? Next are there as many threads actually being used as requested? Is the bandwidth any higher?
 
My experience suggests this test is among the best barometers of the speed between two end points.
 
------
More suggestions and things to test and try.
 
I typically set up a number of daemons to be running, for reasons I'll go into shortly. I use the following on both systems:
iperf -sD
iperf3 -sD
iperf3 -sD -p5202
iperf3 -sD -p5203
iperf3 -sD -p5204
iperf3 -sD -p5205
 
Since fdt.jar is pretty output intensive, I generally open up a screen session, then start the daemon in it, then return to my main terminal.
 
Why run multiple iperf3 daemons on each system? One other variable you'll want to track is the symmetry of the connection when the two systems are doing simultaneous transfers.
 
Thus from system1 I test to system2 on the default port, but from system2 I test back to system1, at the same time, using a different port. Tests of this type lead to some interesting results.
 
From system1: iperf3 -c system2
From system2 (at the same time): iperf3 -c system1 -p 5202
 
------
 
 
The following lines are a little controversial. Current Linux versions include network autotuning, and it is pretty good. I like to think of these lines in the following way:
Linux autotuning does use the sysctl settings to set the range of options it uses. Changing the values broadens the range that the auto tuner can use. This added range allows better tuning for high speed NICs. It also means that when doing testing, some of the values seen may be different than those set with the following lines.
 
First make a copy of the current state of the sysctl values:
 
sysctl -a > ~/orig-sysctl-values
 
Now edit the file:
 
vi /etc/sysctl.d/<file> Where <file> on my system is '99-sysctl.conf'
 
Add these lines to the end:
(No controversy in adding/changing these values for high speed NICs)
# increase TCP max buffer size setable using setsockopt()
# allow testing with 256MB buffers
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# allow auto-tuning up to 128MB buffers
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
# recommended to increase this for CentOS6 with 10G NICS or higher
net.core.netdev_max_backlog = 250000
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# Explicitly set htcp as the congestion control: cubic buggy in older 2.6 kernels
net.ipv4.tcp_congestion_control = htcp
# If you are using Jumbo Frames, also set this
net.ipv4.tcp_mtu_probing = 1
# recommended for CentOS7/Debian8 hosts
net.core.default_qdisc = fq
 
These can be activated by:
sysctl -p /etc/sysctl.d/<file>
 
Similarly, you can fall back to the original options with:
sysctl -p ~/orig-sysctl-values
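To spot-check that individual values took effect, you can query them directly, for example:
sysctl net.ipv4.tcp_congestion_control net.core.rmem_max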
 
There are three other lines to issue, these will only persist till the system restarts. They'll need to be reissued each reboot. (there are ways to make these permanent but I'll leave that for another write up)
 
ethtool -K <net-device> lro on
ifconfig <net-device> txqueuelen 20000
systemctl stop irqbalance <<--- this is an important service to stop!
 

------

Network and Linux system testing at 100g

Many research universities are moving to a 100g connection to either Internet2 and/or the commodity Internet. So far, most of these very large ‘pipes’ have been deployed because the campus’s previous link had occasionally bumped into the 10g capacity of that existing connection. As economies of scale go, it ends up more cost effective to just jump to 100g rather than deploy multiple 10g connections with all the associated routers, switches and configuration complexity that introduces.

All that is just another way of saying that there currently is a large amount of bandwidth deployed and operational, but unused.

In my role in Research Computing, I know that before long, researchers will start to have a need for ever more bandwidth capacity for the movement of data. In an attempt to stay ahead of the researchers, a challenging task, I’ve worked to establish a temporary local test bed of 100g connected systems and switches. This quickly became less interesting, as simply moving traffic, at 100g, from one system to another on the same switch, was pretty easy and didn’t seem to stress the servers, switches or me.

How to do this on a larger scale?

ESnet has a nice 100g testbed available for researchers to do testing. See http://es.net/network-r-and-d/experimental-network-testbeds/100g-sdn-testbed/ Tending toward the lazy dimension, I didn’t work through the simple process of applying for access. Instead I’d been attending SC conferences for a number of years. Each year there is a bandwidth challenge and capacity escalation, and many 100g connections to most I2 or other regional POPs.

For two years, I’d been too busy running our booth to do much with networking testing. However at SC17, I heard about a new shiny object from Google, the BBR (bottleneck bandwidth and round-trip propagation time) congestion control tool.

With help from many, particularly SCinet, John Graham at UCSD, and Jim Chen and Fei Yeh from Northwestern and Starlight, a quick and dirty test bed was deployed. The few days of testing done at SC did show that the BBR congestion algorithm boosted the download transfer speeds; rates went from ~60Gbps to ~75Gbps, a significant increase.

After the conference, Jim and Fei and others at Starlight and CENIC agreed to keep the link between Chicago and Sunnyvale configured and usable from campus. Starlight also agreed to let me have an account on a 100g system in Chicago.

Over the next year, I would take time every few weeks and test the link to see what variables impacted the transfer speeds.

I felt sure there would be significant differences in transfer speeds based on the use of different ‘sysctl’ values. However, I couldn’t develop a definitive test suite to show that. Frustrated, I kept going and tried many different options.

Over the winter break of 2018, and just before I retire in January of 2019, I searched all of Google (yea, right) and talked with many system admins here to find the specific variables that truly impact speeds on 100g links.

Going back to my main driver, working with research computing faculty to help them transfer data as efficiently as possible, I wanted to not just have them be able to use the default configs on their desktops or servers but to use more optimal values. Similar to what ESnet has done here: https://fasterdata.es.net/host-tuning/linux/ Following the simple steps outlined on that page has greatly helped move data around campus at the 10g level, and lowered faculty frustration with moving data. I wanted a similar outcome when running at 100g!

My conclusions and details follow in the rest of this article. Suffice to say, there are many intricately linked variables that need to be just right for the average server to move data efficiently on 100g links.

-----

While this first suggestion sounds so basic as to be ignored, please don’t ignore it!

100g NICs, of all brands, require a PCIe 3 x16 slot for full speed operation. Sounds simple, all servers have that now, right? Yes, most do have that now, but my testbed was composed of any systems I could find. Many were from the transition time between PCIe 2 and 3. Thus I’ve a couple of systems whose motherboard clearly has an x16 label next to the PCI slot BUT the slot is actually only a PCIe 2 x16 slot, which tops out at a theoretical ~64Gbps! Can’t tell you how long this system was in the testing mix before I realized this issue and the speed limitation to my testing!

How to know? For Linux hosts this seems to be the most straight forward:

root # dmidecode -t 9

System Slot Information
        Designation: PCI2
        Type: x16 PCI Express 3    <-- KEY!
        Current Usage: In Use
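Another way to check, if you know the NIC's PCI address from 'lspci' (the address below is a placeholder), is to look at the link capability and status lines:
lspci -vv -s <pci-address-of-NIC> | grep -E 'LnkCap|LnkSta'
A true PCIe 3 x16 slot should report 'Speed 8GT/s, Width x16'; a PCIe 2 slot will show 5GT/s.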

Now that you have confirmation of a full speed host, next up is to check the upstream 100g switch. The simplest path here is to have a second 100g host on a different 100g switch. An iperf3 test back and forth gives insight into the speeds possible. Probably not very high speed, right? But is it symmetrical? That is, host a → host b gives X Gbps and host b → host a gives a similar Gbps? If there is symmetry, even at a low speed, that is a reasonable test of the local switch fabric.

Testing the WAN speed is more difficult and that confirmation needs to be left to the testing to be done later. Though, of course, your WAN/Backbone networking person or team can give you a good idea whether or not they have confidence in the 100g link to off campus.

-----

Next up is the configuration of the servers and NIC. I’ll provide a brief running list of all the steps I suggest to go from bare metal to an operational system. Also, I’m assuming Centos for the OS version. Ubuntu Linux is fine too; the concepts carry over, but some individual commands will be different.

yum -y install epel-release

yum -y update

yum -y install http://software.internet2.edu/rpms/el7/x86_64/main/RPMS/perfSONAR-repo-…

To update kernel to latest version: (can always fall back to stock Centos):

yum -y install http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm

yum install -y --enablerepo=elrepo-kernel kernel-ml

awk -F\' '/^menuentry/ {print $2}' /boot/grub2/grub.cfg

grub2-set-default 0

grub2-mkconfig -o /boot/grub2/grub.cfg

grub2-editenv list

reboot

Packages to install:

yum -y install psmisc iperf3 iperf htop screen net-tools glances lshw-gui pciutils traceroute ntp mlocate wget lsof fail2ban fail2ban-systemd yum-cron java-1.8.0-openjdk.x86_64

Enable yum-cron (keeps the OS fully patched):

systemctl enable yum-cron.service

systemctl start yum-cron.service

vi /etc/yum/yum-cron.conf # make the first three settings all 'yes'
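On my Centos 7 hosts those three settings are the following (treat this as a sketch; option names can shift between yum-cron versions):
update_messages = yes
download_updates = yes
apply_updates = yes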

Start the iperf3 daemons:

cat << 'EOF' > /usr/local/bin/restart_iperf3.sh
#!/bin/bash
/bin/sleep 10
/usr/bin/killall iperf3
/bin/sleep 0.1
/usr/bin/killall -9 iperf3
/bin/sleep 0.1
# only start fresh daemons if no iperf3 processes are left
if [ `ps -C iperf3 | wc -l` = "1" ]
then
    /usr/bin/sudo -u nobody /usr/bin/iperf3 -s -p 5201 -D >/dev/null 2>&1
    /usr/bin/sudo -u nobody /usr/bin/iperf3 -s -p 5202 -D >/dev/null 2>&1
fi
EOF

(Note the quoted 'EOF', which keeps the backticks from being expanded when the file is created rather than when the script runs.)

 

cat <<EOF > /etc/rc.local
/usr/local/bin/restart_iperf3.sh
EOF

chmod +x /etc/rc.d/rc.local

chmod +x /usr/local/bin/restart_iperf3.sh

(crontab -l ; echo "59 * * * * /usr/local/bin/restart_iperf3.sh >/dev/null 2>&1") | crontab -

/usr/local/bin/restart_iperf3.sh # start it for the first round

Get FDT

wget http://monalisa.cern.ch/FDT/lib/fdt.jar

Check the speed of the CPU(s) and tune them for network throughput:

cat /proc/cpuinfo | grep Hz

systemctl start tuned # seems to be active/enabled already YMMV

systemctl enable tuned

tuned-adm active

tuned-adm profile network-throughput

cat /proc/cpuinfo | grep Hz

Firewall rules, allowing iperf, iperf3 and fdt:

firewall-cmd --zone=public --add-port=5201-5210/tcp --permanent

firewall-cmd --zone=public --add-port=5000-5010/tcp --permanent

firewall-cmd --zone=public --add-port=5101-5105/tcp --permanent

firewall-cmd --zone=public --add-port=54321/tcp --permanent

firewall-cmd --reload

Stop and disable some services:

systemctl stop irqbalance.service

systemctl disable irqbalance.service

systemctl stop wpa_supplicant

systemctl disable wpa_supplicant

If using a Mellanox NIC these two tools are important:

1. mlxup shows firmware version and allows updates

Get current version at: http://www.mellanox.com/page/mlxup_firmware_tool

Run it as root to see what version of FW the NIC has

./mlxup

If needed, find the updated firmware, download it, and reissue the mlxup command:

IB card FW links: http://www.mellanox.com/page/infiniband_cards_overview

Enet card FW links: http://www.mellanox.com/page/ethernet_cards_overview

This is the general syntax for updating firmware: (as root)

./mlxup -u -D <dir where the .bin file is> -i fw-ConnectX4-rel-xx_xxx_xx-MCX455A-ECA_Ax-UEFI-14.16.17-FlexBoot-3.5.504.bin

2. Mellanox provides several Linux 'tuning tools', the download is here:

http://www.mellanox.com/downloads/tools/mlnx_tuning_scripts.tar.gz
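Since the tarball does not create its own directory (see the warning below), something like this keeps things tidy (the directory name is just my choice):
wget http://www.mellanox.com/downloads/tools/mlnx_tuning_scripts.tar.gz
mkdir mlnx_tuning_scripts
tar -xzf mlnx_tuning_scripts.tar.gz -C mlnx_tuning_scripts
cd mlnx_tuning_scripts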

Full tuning article here:

https://community.mellanox.com/s/article/howto-tune-your-linux-server-f…

This can be a powerful tool but does a lot without telling you what is happening.

I prefer to use it as a diagnostic tool and then tweak things as needed.

Untar the mlnx_tuning_scripts bundle (it doesn't make its own directory, you've been warned), then issue this as root:

./mlnx_tune -r

This will dump a lot to your screen, starting with a bunch of warnings as you probably won't have all the MLNX software loaded. The last set of lines gives you a detailed summary of your server’s config; check it over to confirm things are as you expect.

 

Lots of details, which many readers can probably ignore if they are familiar with building a system up from bare metal.

I’ve left the most controversial tuning part till last, the sysctl variable settings.

Here are three links to differing discussions on sysctl settings.

20/40/100G Host tuning for improved network performance

https://www.serveradminz.com/blog/20-40-100g-host-tuning/

Performance Troubleshooting across Networks, Joe Breen, U of Utah

https://slideplayer.com/slide/12223618/

Recent Linux TCP Updates, and how to tune your 100G host

https://fasterdata.es.net/assets/Papers-and-Publications/100G-Tuning-TechEx2016.tierney.pdf

I’ve tried all of them and a good many others. My conclusion is that the Nate Hanford and Brian Tierney (ESnet) suggestions have worked best for MY network situation; you have to test in YOUR network situation to come to your own conclusions.

I’ve found that adding only these lines to the ‘stock’ Centos sysctl setup is all that is needed for tuning the host for very good performance.

net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = htcp # or set to BBR if using newer kernel
# add to /etc/sysctl.conf
# allow testing with 2GB buffers
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
# allow auto-tuning up to 2GB buffers
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647

The simple way to handle this is to dump these lines to a file, say 100g-tuning.conf, important that it ends in ‘.conf’, then move this file into the /etc/sysctl.d/ directory. The next reboot will load these lines and activate them.
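As a sketch, using the same heredoc style as the install steps above (run as root):
cat << 'EOF' > /etc/sysctl.d/100g-tuning.conf
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = htcp
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
EOF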

If you don’t want to reboot you can activate the lines by this command, as root:

sysctl -p /etc/sysctl.d/100g-tuning.conf

Finally, you’ve a souped up server fully ready to use your 100g network in an efficient way!

How to test it?

The iperf, iperf3 and nuttcp tools give good ‘instantaneous’ speed estimates. This URL provides a very nice methodology for speed testing your network and your attached file system.

https://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf/disk-testing-using-iperf/

The best tool I’ve found for ‘filling’ the link is the fdt.jar program from CERN/CalTech. Here is a link to the documentation on using the tool:

http://monalisa.cern.ch/FDT/documentation_fdt.html

The use case is pretty simple. On one end you start the ‘server’:

java -jar fdt.jar -nettest

At the other system start the transfer:

java -jar fdt.jar -c (server host FQDN or IP) -nettest

A daunting amount of text is generated when each end is started, but eventually you’ll see the testing start up and report transfer speed every few seconds.

To really fill the link, you can add -P X (where X is the number of parallel threads to use) and/or issue the same command on another test system.

For those looking for even more info about file transfer options, this is among the best sites I’ve been able to find:

http://moo.nac.uci.edu//~hjm/HOWTO_move_data.html