100g Network Adapter Tuning (DRAFT out for comments, send email to preese @ stanford.edu)
1. Preparing the system and loading needed drivers and repos
2. Setting firewall rules, just to be safe
3. System tweaks, 'performance' and system speed
4. Testing scenarios, suggestions and results for my configuration
------
Network and Linux system testing at 100g
Many research universities are moving to a 100g connection to either Internet2 and/or the commodity Internet. So far, most of these very large ‘pipes’ have been deployed because the campus’s previous link had occasionally bumped into the 10g capacity of the existing connection. As economies of scale go, it ends up more cost-effective to just jump to 100g rather than add multiple 10g connections, with all the associated routers, switches and configuration complexity that introduces.
All that is just another way of saying that there is currently a large amount of bandwidth that has been deployed and is operational but mostly unused.
In my role in Research Computing, I know that before long, researchers will start to have a need for ever more bandwidth capacity for the movement of data. In an attempt to stay ahead of the researchers, a challenging task, I’ve worked to establish a temporary local test bed of 100g connected systems and switches. This quickly became less interesting, as simply moving traffic, at 100g, from one system to another on the same switch, was pretty easy and didn’t seem to stress the servers, switches or me.
How to do this on a larger scale?
ESnet has a nice 100g testbed available for researchers to do testing. See http://es.net/network-r-and-d/experimental-network-testbeds/100g-sdn-testbed/ Tending toward the lazy dimension, I didn’t work through the simple process of applying for access. Instead, I leaned on the SC conferences I’d been attending for a number of years; each year there is a bandwidth challenge and capacity escalation, with many 100g connections to most I2 and other regional POPs.
For two years, I’d been too busy running our booth to do much networking testing. However, at SC17 I heard about a new shiny object from Google, the BBR (bottleneck bandwidth and round-trip propagation time) congestion control algorithm.
With help from many, particularly SCinet, John Graham at UCSD, and Jim Chen and Fei Yeh from Northwestern and Starlight, a quick and dirty test bed was deployed. The few days of testing done at SC showed that the BBR congestion algorithm did boost download transfer speeds; rates went from ~60Gbps to ~75Gbps, a significant increase.
After the conference, Jim and Fei and others at Starlight and CENIC agreed to keep the link between Chicago and Sunnyvale configured and usable from campus. Starlight also agreed to let me have an account on a 100g system in Chicago.
Over the next year, I would take time every few weeks and test the link to see what variables impacted the transfer speeds.
I felt sure there would be significant differences in transfer speeds based on the use of different ‘sysctl’ values. However, I couldn’t develop a definitive test suite to show that. Frustrated, I kept going and tried many different options.
Over the winter break of 2018, and just before I retired in January of 2019, I searched all of Google (yeah, right) and talked with many system admins here to find the specific variables that truly impact speeds on 100g links.
Going back to my main driver, working with research computing faculty to help them transfer data as efficiently as possible, I wanted them not just to use the default configs on their desktops or servers but to use more optimal values, similar to what ESnet has done here: https://fasterdata.es.net/host-tuning/linux/ Following the simple steps outlined on that page has greatly helped move data around campus at the 10g level and lowered faculty frustration with moving data. I wanted a similar outcome when running at 100g!
My conclusions and details follow in the rest of this article. Suffice it to say, there are many intricately linked variables that need to be just right for the average server to move data efficiently on 100g links.
-----
While this first suggestion sounds so basic as to be ignored, please don’t ignore it!
100g NICs, of all brands, require a PCIe 3 x16 slot for full speed operation. Sounds simple, all servers have that now, right? Yes, most do, but my testbed was composed of any systems I could find, and many were from the transition time between PCIe 2 and 3. Thus I have a couple of systems whose motherboard clearly has an x16 label next to the PCI slot BUT the slot is actually only a PCIe 2 x16 slot, which tops out at a theoretical ~64Gbps! I can’t tell you how long one of these systems was in the testing mix before I realized this issue and the speed limit it placed on my testing!
How to know? For Linux hosts, this seems to be the most straightforward way:
root # dmidecode -t 9
…
System Slot Information
Designation: PCI2
Type: x16 PCI Express 3 ←- KEY!
Current Usage: In Use
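As an optional cross-check (an extra step I'd suggest, not part of the dmidecode output above), you can also ask the PCI subsystem what link the NIC actually negotiated; LnkCap is what the card supports and LnkSta is what it actually got. The 81:00.0 address below is a hypothetical example, substitute your NIC's address from the first command:
lspci | grep -i -e ethernet -e mellanox
lspci -s 81:00.0 -vv | grep -E 'LnkCap|LnkSta'
You want LnkSta to show Speed 8GT/s (PCIe 3) and Width x16.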
Now that you have confirmation of a full speed host, next up is to check the upstream 100g switch. The simplest path here is to have a second 100g host on a different 100g switch. An iperf3 test back and forth gives insight into the speeds possible. Probably not very high speed, right? But is it symmetrical? That is, does host a → host b run at X Gbps and host b → host a at a similar rate? If there is symmetry, even at a low speed, that is a reasonable test of the local switch fabric.
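A minimal sketch of that symmetry check (host names are placeholders; iperf3 needs to be installed on both ends): start a server on one host, then from the other host test both directions, using -R to reverse the flow so you don't have to swap ends:
iperf3 -s -p 5201                      # run on host-b
iperf3 -c host-b -p 5201 -t 30         # run on host-a: host-a -> host-b
iperf3 -c host-b -p 5201 -t 30 -R      # run on host-a: host-b -> host-a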
Testing the WAN speed is more difficult; that confirmation has to wait for the testing described later. Though, of course, your WAN/backbone networking person or team can give you a good idea of whether or not they have confidence in the 100g link to off campus.
-----
Next up is the configuration of the servers and NIC. I’ll provide a brief running list of all the steps I suggest to go from bare metal to an operational system. Also, I’m assuming CentOS for the OS. Ubuntu Linux is fine too; the concepts carry over, but the individual commands will differ.
yum -y install epel-release
yum -y update
yum -y install http://software.internet2.edu/rpms/el7/x86_64/main/RPMS/perfSONAR-repo-…
To update the kernel to the latest version (you can always fall back to the stock CentOS kernel):
yum -y install http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum install -y --enablerepo=elrepo-kernel kernel-ml
awk -F\' '/^menuentry/ {print $2}' /boot/grub2/grub.cfg
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
grub2-editenv list
reboot
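After the reboot, a quick sanity check that the elrepo kernel-ml kernel is the one actually running (the stock CentOS 7 kernel reports a 3.10.0 version):
uname -r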
Pkgs to install
yum -y install psmisc iperf3 iperf htop screen net-tools glances lshw-gui pciutils traceroute ntp mlocate wget lsof fail2ban fail2ban-systemd yum-cron java-1.8.0-openjdk.x86_64
Enable yum-cron (keeps the OS fully patched):
systemctl enable yum-cron.service
systemctl start yum-cron.service
vi /etc/yum/yum-cron.conf # set the first three yes/no update settings to yes (see below)
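For reference, these are the settings I mean; the option names below are from the stock CentOS 7 yum-cron package, so double-check them against your copy of /etc/yum/yum-cron.conf:
update_messages = yes
download_updates = yes
apply_updates = yes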
start iperf3 daemons
cat << 'EOF' > /usr/local/bin/restart_iperf3.sh
#!/bin/bash
/bin/sleep 10
/usr/bin/killall iperf3
/bin/sleep 0.1
/usr/bin/killall -9 iperf3
/bin/sleep 0.1
# ps -C prints a header line, so a count of 1 means no iperf3 is left running
if [ `ps -C iperf3 | wc -l` = "1" ]
then
/usr/bin/sudo -u nobody /usr/bin/iperf3 -s -p 5201 -D >/dev/null 2>&1
/usr/bin/sudo -u nobody /usr/bin/iperf3 -s -p 5202 -D >/dev/null 2>&1
fi
EOF
cat << 'EOF' > /etc/rc.local
#!/bin/bash
/usr/local/bin/restart_iperf3.sh
EOF
chmod +x /etc/rc.d/rc.local
chmod +x /usr/local/bin/restart_iperf3.sh
(crontab -l ; echo "59 * * * * /usr/local/bin/restart_iperf3.sh >/dev/null 2>&1") | crontab -
/usr/local/bin/restart_iperf3.sh # start it for the first round
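A quick way to confirm the two daemons came up (ss is part of iproute, which is on a stock CentOS 7 install); you should see listeners on ports 5201 and 5202:
ss -tlnp | grep iperf3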
Get FDT
wget http://monalisa.cern.ch/FDT/lib/fdt.jar
Check the speed of the CPU(s) and tune them for network throughput:
cat /proc/cpuinfo | grep Hz
systemctl start tuned # seems to be active/enabled already YMMV
systemctl enable tuned
tuned-adm active
tuned-adm profile network-throughput
cat /proc/cpuinfo | grep Hz
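If you want to see what the profile did to CPU frequency scaling, this sysfs path exists on most hosts with a cpufreq driver (skip it if your system has none); with the network-throughput profile I'd expect it to report 'performance':
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor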
Firewall rules, to allow iperf, iperf3 and fdt:
firewall-cmd --zone=public --add-port=5201-5210/tcp --permanent
firewall-cmd --zone=public --add-port=5000-5010/tcp --permanent
firewall-cmd --zone=public --add-port=5101-5105/tcp --permanent
firewall-cmd --zone=public --add-port=54321/tcp --permanent
firewall-cmd --reload
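To double-check that the rules took after the reload:
firewall-cmd --zone=public --list-ports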
Stop and disable some services
systemctl stop irqbalance.service
systemctl disable irqbalance.service
systemctl stop wpa_supplicant
systemctl disable wpa_supplicant
If using a Mellanox NIC these two tools are important:
1. mlxup shows firmware version and allows updates
Get current version at: http://www.mellanox.com/page/mlxup_firmware_tool
Run it as root to see what version of FW the NIC has
./mlxup
If needed, find the updated firmware, download it, and reissue the mlxup command:
IB card FW links: http://www.mellanox.com/page/infiniband_cards_overview
Enet card FW links: http://www.mellanox.com/page/ethernet_cards_overview
This is the general syntax for updating firmware: (as root)
./mlxup -u -D /path/where/the/bin/file/is -i fw-ConnectX4-rel-xx_xxx_xx-MCX455A-ECA_Ax-UEFI-14.16.17-FlexBoot-3.5.504.bin
2. Mellanox provides several Linux 'tuning tools', the download is here:
http://www.mellanox.com/downloads/tools/mlnx_tuning_scripts.tar.gz
Full tuning article here:
https://community.mellanox.com/s/article/howto-tune-your-linux-server-f…
This can be a powerful tool but does a lot without telling you what is happening.
I prefer to use it as a diagnostic tool and then tweak things as needed.
Untar the mlnx_tuning_scripts bundle (it doesn't make its own directory, you've been warned), then issue this as root:
./mlnx_tune -r
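Concretely, since the tarball unpacks into the current directory, the whole sequence might look like this (a sketch, run as root from wherever you downloaded the tarball):
mkdir mlnx_tuning_scripts
tar -xzf mlnx_tuning_scripts.tar.gz -C mlnx_tuning_scripts
cd mlnx_tuning_scripts
./mlnx_tune -r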
This will dump a lot to your screen, starting with a bunch of warnings, as you probably won't have all the MLNX software loaded. The last set of lines gives you a detailed summary of your server’s config; check it over to confirm things are as you expect.
That’s a lot of detail, much of which many readers can probably skip, as you’re likely familiar with building a system up from bare metal.
I’ve left the most controversial tuning part till last, the sysctl variable settings.
Here are three links to differing discussions on sysctl settings.
20/40/100G Host tuning for improved network performance
https://www.serveradminz.com/blog/20-40-100g-host-tuning/
Performance Troubleshooting across Networks, Joe Breen, U of Utah
https://slideplayer.com/slide/12223618/
Recent Linux TCP Updates, and how to tune your 100G host
https://fasterdata.es.net/assets/Papers-and-Publications/100G-Tuning-TechEx2016.tierney.pdf
I’ve tried all of them and a good many others. My conclusion is that the suggestions from Nate Hanford and Brian Tierney of ESnet have worked best for MY network situation; you have to test in YOUR network situation to come to your own conclusions.
I’ve found that adding only these lines to the ‘stock’ CentOS sysctl setup is all that is needed to tune the host for very good performance.
# add to /etc/sysctl.d/100g-tuning.conf (or /etc/sysctl.conf), as described below
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = htcp   # or set this to bbr if using a newer (4.9+) kernel
# allow testing with 2GB buffers
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
# allow auto-tuning up to 2GB buffers
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
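To see which congestion control algorithms your running kernel actually offers (bbr will only appear on newer kernels where the module is available):
sysctl net.ipv4.tcp_available_congestion_control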
The simple way to handle this is to dump these lines into a file, say 100g-tuning.conf (it’s important that the name ends in ‘.conf’), and then move that file into the /etc/sysctl.d/ directory. The next reboot will load these lines and activate them.
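As a sketch, using a here-document (the filename is just the suggestion above):
cat << 'EOF' > /etc/sysctl.d/100g-tuning.conf
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = htcp
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.ipv4.tcp_rmem = 4096 87380 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
EOF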
If you don’t want to reboot you can activate the lines by this command, as root:
sysctl -p /etc/sysctl.d/100g-tuning.conf
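Either way, a quick check that the new values are live:
sysctl net.ipv4.tcp_congestion_control net.core.rmem_max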
Finally, you’ve a souped up server fully ready to use your 100g network in an efficient way!
How to test it?
The iperf, iperf3 and nuttcp tools give good ‘instantaneous’ speed estimates. This URL provides a very nice methodology for speed testing your network and your attached file system.
The best tool I’ve found for ‘filling’ the link is the fdt.jar program from CERN/Caltech. Here is a link to the documentation on using the tool:
http://monalisa.cern.ch/FDT/documentation_fdt.html
The use case is pretty simple. On one end you start the ‘server’:
java -jar fdt.jar -nettest
At the other system start the transfer:
java -jar fdt.jar -c (server host FQDN or IP) -nettest
A daunting amount of text is generated when each end is started, but eventually you’ll see the testing start up and report the transfer speed every few seconds.
To really fill the link, you can add -P X (where X is the number of parallel threads to use) and/or issue the same command on another test system.
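For example (the hostname is a placeholder), four parallel streams from the client side:
java -jar fdt.jar -c server.example.edu -nettest -P 4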
For those looking for even more info about file transfer options, this is among the best sites I’ve been able to find:
http://moo.nac.uci.edu//~hjm/HOWTO_move_data.html