SRN Client Network Architecture Guide
When you get (or rent) space in the SRCF (SRCF1 or SRCF2) or RCF—as described in the Data Center Network Connectivity page—you are responsible for purchasing, configuring, and maintaining a network switch in your rack. In effect, you are responsible for everything related to the network in your rack space. Setting up a network architecture for your environment is something that has to be done, but often can be done once. This technical guide is for you—or your School/Department/Institution support, or whichever group you involve for help—to use as a starting point in designing your rack's network architecture.
If you do not want to do this yourself, you have a few options. If you are part of the School of Medicine, they may be able to assist. Send a request to the TDS Service Desk to begin that conversation. Otherwise, your LNA may be able to assist. If you do not know who your LNA is, see the end of the Building Networking guide. Otherwise, the Stanford Research Computing Center and the Technology Consulting Group may be able to provide one-time architecture services at the appropriate Time and Materials rate. Day-to-day administration and troubleshooting services may also be available, though that typically requires a proper agreement with monthly time commitment.
This guide is divided into several parts. First, the network switch requirements are described. Following that is a section on how you might think about planning server connectivity, and a guide to selecting an appropriate switch. After that is a discussion of VLANs and IP addresses, and an example network for a compute environment.
Your Uplink & Switch Requirements
The Data Center Network Connectivity page describes the uplinks you can expect to have to the SRN. Depending on the bandwidth level you select, you will get a pair of singlemode or multimode fiber cables with duplex LC connectors. It is up to you to provide a switch with the required features to connect to the SRN:
- LLDP for port information.
- LACP (802.3ad) port aggregation.
- 802.1q VLAN tagging/trunking.
- Fiber transceiver diagnostics.
- No Forward Error Correction (FEC).
Each of the items above has caused trouble for SRN clients in the past. These items will now be discussed in more detail, in the order in which you would normally configure them on a switch.
Forward Error Correction (FEC)
As speeds and fiber lengths increase, the chances of link errors also increase. The encoding scheme for the technology (40GBASE-LR4, 100GBASE-LR4, etc.) sometimes catches problems, but does not support correction. Forward Error Correction (FEC) acts below the Ethernet layer, adding parity data to the stream, which the receiving switch can use to detect and correct errors. It is useful with longer fiber lengths, but is not as useful within a data center.
FEC is configured at the physical-port level. Some switches have FEC enabled by default, some do not enable it, and some enable it only at certain link speeds. You must ensure that FEC is disabled on both core uplink ports.
LLDP for port information
This image from the lldpd web site explains the reason for using LLDP:
LLDP (Link Layer Discovery Protocol) is an Ethernet link-local protocol that two devices can use to exchange information about each other. The protocol is link-local, meaning that the connection is only between the two devices at either end of the network connection; and Ethernet-based, so it works even when no IP addresses are assigned. LLDP lets you find out which core switches (and core switch ports) you are connected to, and it allows UIT Networking to see what their core switches are connected to. When making configuration changes to an active port, LLDP helps us ensure we are making changes to the correct port.
LLDP is configured at the physical-port level. You must ensure that LLDP is enabled at least for the two core uplink ports. We also suggest that you enable LLDP on all ports, and that you install LLDP software (lldpd or openlldp/lldpad) on your systems, so that you have the same real-time visibility into your downstream network connections.
In some cases, you need to explicitly specify what information ("TLVs") you provide over LLDP. You should at least enable the Chassis ID, Port ID, Port Description, and System Name TLVs. You should also set your switch hostname to be something useful.
LACP for port aggregation
LACP (Link Aggregation Control Protocol, also known as 802.3ad, the name of the IEEE standard) is another Ethernet link-local protocol. When two switches would like to balance traffic across multiple physical links, LACP is one way to configure that balancing. When a physical link comes up and LACP is enabled, LACP ensures that both switches are ready to send data, the ports in the aggregate are the same speed, and the ports are in the same aggregate.
This page uses the word "aggregate" to refer to the virtual interface formed by combining the multiple physical interfaces. Your switch might use the term "bond", "channel group", "link aggregation group" (or "LAG"), or "port channel" (or even "port-channel"). Some switches also use the word "dynamic" to refer to LACP, but check your switch's documentation to be sure.
LACP is configured at the physical-port level. You must enable LACP on both core uplink ports, placing them (and only them) in the same aggregate. Other forms of aggregation (such as static mode) are not supported. If you have the ability to configure LACP active or passive mode, you must select active mode. Finally, if you have a choice between LACP slow or fast speed, select fast. These settings help to ensure that, when the physical link comes up, it is added to the aggregate as soon as possible.
802.1q for VLAN tagging
VLAN allocation on the SRN is discussed later in this document. Suffice it to say, you will likely be dealing with multiple VLANs. Even if you are not, it is normal practice to have VLAN tagging/trunking enabled on the uplink to the SRN. That way, if VLANs need to be added in the future, they can be added nondisruptively.
802.1q (another IEEE standard) is configured at the aggregate level, though some switches also require identical configuration on the physical ports. You must configure VLAN tagging/trunking on the SRN aggregate, using the VLANs allocated to you, with no native VLAN.
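The uplink requirements described in this section can be collected into a simple checklist. The following is a minimal Python sketch, not vendor configuration: the `UplinkPort` and `Aggregate` classes, their field names, and the VLAN numbers in the usage lines are all hypothetical, invented here for illustration. You would fill them in by hand from your own switch's configuration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UplinkPort:
    """One physical core uplink port (hypothetical model, not a vendor API)."""
    name: str
    fec_enabled: bool
    lldp_enabled: bool
    lacp_mode: str   # "active", "passive", or "static"
    lacp_rate: str   # "fast" or "slow"
    aggregate: str   # name of the aggregate this port belongs to


@dataclass
class Aggregate:
    """The virtual interface formed from the uplink ports (hypothetical model)."""
    name: str
    tagged_vlans: List[int] = field(default_factory=list)
    native_vlan: Optional[int] = None


def check_uplink(ports: List[UplinkPort], aggregate: Aggregate) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(ports) != 2:
        problems.append(f"expected 2 core uplink ports, found {len(ports)}")
    for port in ports:
        if port.fec_enabled:
            problems.append(f"{port.name}: FEC must be disabled")
        if not port.lldp_enabled:
            problems.append(f"{port.name}: LLDP must be enabled")
        if port.lacp_mode != "active":
            problems.append(f"{port.name}: LACP must be in active mode")
        if port.lacp_rate != "fast":
            problems.append(f"{port.name}: LACP rate should be fast")
        if port.aggregate != aggregate.name:
            problems.append(f"{port.name}: not in aggregate {aggregate.name}")
    if not aggregate.tagged_vlans:
        problems.append(f"{aggregate.name}: no tagged VLANs configured")
    if aggregate.native_vlan is not None:
        problems.append(f"{aggregate.name}: native VLAN must not be set")
    return problems


# Usage with placeholder values (the VLAN IDs are not real allocations).
example_ports = [
    UplinkPort("uplink-a", False, True, "active", "fast", "srn"),
    UplinkPort("uplink-b", True, True, "passive", "fast", "srn"),
]
for problem in check_uplink(example_ports, Aggregate("srn", [100, 101])):
    print(problem)
```

The checks deliberately mirror the order of the subsections above (FEC, LLDP, LACP, 802.1q), so a failed check points back to the paragraph that explains it.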
Layer-1 Architecture & Switch Selection
Using the requirements listed above, it should be clear that a cheap, unmanaged switch is not appropriate for your top-of-rack switch. Besides, such switches are often not able to handle the amount of traffic that is typical of a rack in a research data center. The need for large amounts of storage means that, even if there is not much network traffic leaving the rack, there are large flows of data within the rack, between storage and servers.
Since everything exists within a single rack, it is not necessary to use fiber. Instead, either twisted-pair (for 1 & 10 Gigabit) or DAC (Direct Attach Copper, for 10 Gigabit and above) cables can be used for connections within a rack. Although there is a limitation on cable length (typically 5 meters for 25-Gigabit and above), the cost is lower when compared to fiber.
Here are three example Layer-1 architectures suitable for connecting to the SRN, which take the above into account, and meet many use cases in our research data centers:
The architecture on the left uses servers that have 25-Gigabit SFP28 ports, with SFP28 ports on the switch and SFP28 DAC cables in between. Many server manufacturers can provide SFP28 ports as standard, if you ask. For example, most Dell and Supermicro-based servers place their built-in network connections on a daughtercard: Dell calls them "Network Daughter Cards" or "OCP Mezzanine cards", and Supermicro calls them "Networking AOC (Add-On Cards)". Current-generation servers which use these cards can support multiple 10- or 25-Gigabit network ports. In some cases, it is even possible to enable VLAN trunking on the ports, to carry both user and out-of-band management (IPMI, BMC, etc.) traffic, removing the need for a separate physical connection.
Most switches with SFP28 ports also include a few QSFP28 ports for 40- or 100-Gigabit connectivity, so the architecture on the left uses two of those ports for 100-Gigabit connectivity to the SRN. The remaining QSFP28 ports are dedicated to 40- or 100-Gigabit connectivity to storage, VM hosts, or data-transfer nodes (DTNs).
For machines which need a separate 1-Gigabit port for IPMI, that can be provided one of two ways. If only a few ports are needed, 10-Gigabit twisted-pair SFP+ modules can be installed. SFP28 ports support SFP+ ‘optics’, and twisted-pair ports typically support 1-Gigabit connections, though you should verify your switch supports this. The other option is to purchase a switch specifically for IPMI connectivity. This could be a cheaper switch, potentially without any special configuration. The IPMI switch's uplink would connect to a port on the rack core switch, and then Category-5e (or better) cables would connect to IPMI ports.
The architecture in the middle uses a 10-Gigabit switch, with twisted-pair ports. This allows you to use Category-6 or better network cables, which are much cheaper than DAC or fiber cables. The tradeoff is that your servers will be limited to 10-Gigabit connectivity. Port aggregation can be used for higher bandwidth, but this can cause problems with functions like PXE and IPMI.
Most 10-Gigabit switches with twisted-pair ports also include a few SFP+ or QSFP+ ports for 10- or 40-Gigabit connectivity, so the architecture in the middle uses two of those ports for 40-Gigabit connectivity to the SRN. The remaining QSFP+ ports are used for storage, VM hosts, etc. Since twisted-pair ports automatically support 1-Gigabit connectivity, the same switch can be used for both front-end and IPMI connections.
If you have multiple racks in the same data center, or you have dense racks requiring many Ethernet ports, the architecture on the right is ideal. It is similar to the architecture on the left, except all the ports are 100-Gigabit QSFP28. All three architectures can connect to other top-of-rack switches, but increased packet-processing performance of a 32 QSFP28-port switch is better able to handle traffic coming from multiple racks.
Just like the architecture on the left, the QSFP28 ports are used for the 100-Gigabit connection to the core, and for 100-Gigabit or 40-Gigabit connectivity to storage, VM hosts, etc. QSFP28 ports are also used for connectivity to top-of-rack switches for your other racks in the same data center, which is important if you have private VLANs in your environment. Finally, servers in the rack can connect using DAC or AOC splitter cables, which split one QSFP28 port into four 25-Gigabit SFP28 ports.
To put the above architectures into real-money terms, let's look at some equipment which would be used to implement them.
We will assume a need to connect forty-eight servers at 10 or 25 Gigabits, using 1-, 2-, 3-, or 5-meter cables (twelve of each length), plus four storage/VM servers at 100 Gigabits using 3-meter DAC cables.
For a high-speed network switch (with 100G QSFP28 ports), we will use the Dell S5232F-ON, which includes thirty-two 100-Gigabit QSFP28 ports. The lowest GSA Advantage price of the 210-APHL is $26,583.21.
Two 100GBASE-LR4 QSFP28 transceiver modules are needed. From FS, for Dell switches, two of part #48857 cost $998. Twelve QSFP28-to-four-SFP28 breakout cables (3 each at 1, 2, 3, and 5 meters) cost $1,251. Four 3-meter 100-Gigabit DAC cables cost $216.
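As a quick sanity check on the high-speed option, the port budget can be worked out as follows. This is a small Python sketch; the function name and its parameters are made up for illustration, and the inputs are the figures stated above (thirty-two QSFP28 ports, two uplinks, four 100-Gigabit servers, forty-eight servers on four-way breakout cables).

```python
import math


def qsfp28_port_budget(total_ports, uplinks, direct_100g, servers,
                       servers_per_breakout=4):
    """Count QSFP28 ports used when servers share ports via breakout cables."""
    # Each breakout cable occupies one QSFP28 port and serves up to 4 servers.
    breakout_ports = math.ceil(servers / servers_per_breakout)
    used = uplinks + direct_100g + breakout_ports
    return {
        "breakout_ports": breakout_ports,
        "used": used,
        "free": total_ports - used,
    }


budget = qsfp28_port_budget(total_ports=32, uplinks=2, direct_100g=4, servers=48)
print(budget)  # {'breakout_ports': 12, 'used': 18, 'free': 14}
```

With these assumptions, eighteen of the thirty-two ports are in use, which leaves room for links to other top-of-rack switches.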
For a medium-speed network switch (with 25G SFP28 ports), we will use the Mellanox Ethernet SN2410, which includes forty-eight 25-Gigabit SFP28 ports and eight 100-Gigabit QSFP28 ports. The Colfax Direct public price for the MSN2410-CB2R is $13,706.
Two 100GBASE-LR4 QSFP28 transceiver modules are needed. From FS, for Mellanox switches, two of part #71010 cost $998. Forty-eight 25-Gigabit DAC cables (twelve each at 1, 2, 3, and 5 meters) cost $1,908. Four 3-meter 100-Gigabit DAC cables cost $216.
For the lower-speed option, we will use the Dell S4048T-ON, which includes forty-eight 10-Gigabit twisted-pair ports and six 40-Gigabit QSFP+ ports. The lowest GSA Advantage price of the 210-AHMY is $12,099.99.
Two 40GBASE-LR4 QSFP+ transceiver modules are needed. From FS, for Dell switches, two of part #36696 cost $618. Forty-eight shielded Cat6a cables (twelve each at 3, 6, 10, and 16 feet) cost $218.
To summarize, in February 2022, using publicly-available pricing and solid network equipment…
- The high-speed example architecture would cost around $29k to implement, broken down into approximately $27k capital and $2k expense.
- The medium-speed example architecture would cost around $17k to implement, broken down into approximately $14k capital and $3k expense.
- The lower-speed architecture would cost around $13k to implement, broken down into approximately $12k capital and $1k expense.
These prices do not include an extended warranty (for which pricing is not publicly available), nor do they include setup and administration costs. Still, they are a good starting point when budgeting for networking in your new rack.
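The summary figures can be reproduced from the individual prices quoted above. The short Python sketch below assumes that the switch is the capital cost and that transceivers and cables are the expense; that split is an inference from the rounded totals, not something the price lists state. The dictionary layout and function names are illustrative only.

```python
# Prices in US dollars, copied from the paragraphs above (February 2022).
architectures = {
    "high-speed": {"switch": 26583.21, "parts": [998, 1251, 216]},
    "medium-speed": {"switch": 13706.00, "parts": [998, 1908, 216]},
    "lower-speed": {"switch": 12099.99, "parts": [618, 218]},
}


def summarize(arch):
    """Return (capital, expense, total), treating the switch as capital."""
    capital = arch["switch"]
    expense = sum(arch["parts"])  # transceivers and cables
    return capital, expense, capital + expense


def to_k(amount):
    """Round a dollar amount to the nearest thousand."""
    return round(amount / 1000)


for name, arch in architectures.items():
    capital, expense, total = summarize(arch)
    print(f"{name}: ~${to_k(total)}k total "
          f"(~${to_k(capital)}k capital, ~${to_k(expense)}k expense)")
```

Swapping in current quotes for your own vendor is a matter of editing the `architectures` dictionary.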
Layer-2 and -3 Architecture
Within the SRN, there are three guiding principles for Layer 2 and 3:
- Do not mix VLANs between the Campus and Research areas.
- You don't need to stuff multiple subnets into one VLAN anymore (unless it's IPv4 and IPv6).
- Prefer non-firewalled and non-NAT whenever possible.
The SRN Architecture's use of VXLAN (extending the number of available VLANs) and EVPN (joining together VLANs spread across multiple sites) means we are no longer facing practical limits on the number of VLANs that exist. UIT Network Engineering implements VXLAN by splitting the space into multiple VLAN Areas. The Research VLAN Area is new, and is used for VLANs in the SRN. The original VLAN space is now part of the Campus VLAN Area. Refer to NetDB for the full list of VLAN Areas. While the Campus VLAN Area is over 60% allocated, the Research Area (as of November, 2021) has more than 99% of the VLAN space available.
This amount of VLAN space gives you flexibility in deciding how many VLANs to use. It is no longer necessary to put all of your subnets in a single VLAN. Indeed, the only time you should really have multiple subnets on a single VLAN is when you have an IPv6 subnet to match an IPv4 subnet.
The features of the new SRN do mean that you should avoid using the Campus VLAN Area when you are on the SRN. That means your networks should be…
- Routed by the SRN router (for non-firewalled subnets), or
- Routed by the RC firewall (for firewalled subnets); and
- Not present both on campus and in the SRCF/RCF.
For networks which are unable to meet the three requirements above, UIT Network Engineering has a separate circuit connecting the SRN to the MOA zone of campus, through which your VLAN can be trunked. Using this path means your network will lose most of the performance benefits of the new SRN architecture. Talk to us before going with this option.
If you will be using private VLANs on your switches, Network Engineering has reserved a range of VLAN numbers, and has guaranteed that they will never be used on SUNet. For the Research VLAN Area, they are 2000-2040 (inclusive). For the Campus VLAN Area, they are 2000-2009 and 2020-2039 (both inclusive). Using VLANs in this range for your private networks will ensure you never get them as a SUNet VLAN assignment.
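The reserved ranges are easy to get wrong at the edges (for example, VLAN 2040 is reserved in the Research Area but not in the Campus Area). The following Python sketch encodes the ranges stated above; the dictionary and function names are illustrative and not part of any Stanford tool.

```python
# Reserved private VLAN ranges, inclusive, as stated in this guide.
# Python's range() excludes its end value, hence the +1 on each upper bound.
PRIVATE_VLAN_RANGES = {
    "research": [range(2000, 2040 + 1)],
    "campus": [range(2000, 2009 + 1), range(2020, 2039 + 1)],
}


def is_reserved_private_vlan(vlan_id, area):
    """True if vlan_id falls in the reserved private range for the given area."""
    return any(vlan_id in r for r in PRIVATE_VLAN_RANGES[area])


def is_reserved_in_all_areas(vlan_id):
    """True if vlan_id is reserved for private use in every listed area."""
    return all(is_reserved_private_vlan(vlan_id, area)
               for area in PRIVATE_VLAN_RANGES)


for vlan in (2000, 2010, 2039, 2040):
    print(vlan, is_reserved_private_vlan(vlan, "research"),
          is_reserved_private_vlan(vlan, "campus"))
```

If a private VLAN might ever need to span both areas, picking one for which `is_reserved_in_all_areas` returns true avoids a later renumbering.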
Although the SRN has its own VLAN Area, we still share the same IP space as the rest of Campus. That means the types of networks and request processes are the same.
The SRN supports public, shadow (RFC1918 addresses, no Internet access), and shady (RFC1918, NATed to Internet) networks; both IPv4 and IPv6. When requesting a network, use the same processes you do today for campus, with one note:
- Non-firewalled networks: Specify in your request that the network should be on the SRN.
- Firewalled networks: Select the rc-srtr firewall as the owning firewall.
We suggest login nodes, data-transfer nodes, and the like use public subnets. We also suggest avoiding the use of the network firewall for those cases, relying instead on host-based firewalls (which, per MinSec, need to be used anyway). The goal of the SRN is to provide efficient data transfer. Using a firewalled network for data transfer adds latency to each connection, and places a hard cap on bandwidth; the firewalls use 40-Gigabit links, while the rest of the SRN uses multiple 100-Gigabit links.
We suggest administrative systems use shady nets, optionally behind the network firewall. Placing administrative systems into shady nets provides a benefit by removing direct access from—and exposure to—the Internet. The cost of this protection is that all Internet-bound traffic must return to campus to be NATed, as there are no NAT gateways on the SRN.
Ancillary devices like IPMI, PDUs, storage management, switch management, etc. should either use a shadow network, or a private network. This provides an additional level of isolation from the devices which can be most vulnerable, as they are harder to update, and often vendors do not provide updates. Note however that the trend is to move many Campus services to the cloud, and this might cause issues if you are trying to use such a service from a shadow net (for example, a storage server trying to use Central simple-bind LDAP). In those cases, you can either use a shady net, or set up NAT on your private network.
If you need a private IPv4 range for your environments, which will never be routed on campus, we recommend using the 10.203.0.0/19 address space. As with the private VLAN range, Network Engineering has guaranteed that this subnet will never be routed on campus. For IPv6, you should generate a Unique Local Address (ULA) subnet within the fd00::/8 space. You can find RFC4193 IPv6 generators on the Internet. Note that the ULA subnet fd03:dbaa:5533::/48 has been chosen by Network Engineering to host shadow nets.
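If you would rather not rely on an online generator, a ULA prefix can be produced locally. The Python sketch below is a simplification: RFC 4193 suggests deriving the 40-bit Global ID from a SHA-1 hash of a timestamp and an EUI-64, whereas this version simply draws 40 random bits with the standard `secrets` module. The function and constant names are illustrative. The sketch also shows how many /24 subnets the recommended private IPv4 space provides.

```python
import ipaddress
import secrets

ULA_SPACE = ipaddress.ip_network("fd00::/8")
PRIVATE_V4_POOL = ipaddress.ip_network("10.203.0.0/19")


def random_ula_prefix():
    """Return a random /48 inside fd00::/8 (simplified, not the RFC's SHA-1 method)."""
    global_id = secrets.randbits(40)
    # Layout of the 128-bit address: 8-bit prefix 0xfd, 40-bit Global ID,
    # then 80 zero bits (16-bit subnet ID plus 64-bit interface ID).
    prefix_int = (0xFD << 120) | (global_id << 80)
    return ipaddress.IPv6Network((prefix_int, 48))


ula = random_ula_prefix()
print(ula, ula.subnet_of(ULA_SPACE))

# The /19 private IPv4 space divides into thirty-two /24 subnets.
v4_subnets = list(PRIVATE_V4_POOL.subnets(new_prefix=24))
print(len(v4_subnets), v4_subnets[0], v4_subnets[-1])
```

Generate the prefix once and record it; the value is random, so running the function again yields a different /48.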
Layer-2 and -3 Example
Here is an example network architecture for a Moderate-Risk cluster in the SRN, which applies the advice from this page.
- A small public net in the Research VLAN Area, with IPv4 and IPv6 addresses, not network-firewalled, for login nodes and data-transfer nodes. Host firewalls are set to default-deny, allowing only necessary services. Fail2ban is used to monitor open ports and temporarily block abusive IPs. If security requirements demand the use of the network firewall, put things behind the RC firewall, but consider advocating to leave the DTNs in a non-firewalled subnet.
- A small shady (NATed) net in the Research VLAN Area, optionally network-firewalled, for management nodes.
- A private subnet, on VLAN 2000, for back-end traffic, including node provisioning and compute node traffic. Assuming no more than around 200 hosts, 10.203.0.0/24 would be large enough.
- A private subnet, on VLAN 2001, for IPMI. Assuming no more than around 200 hosts, 10.203.1.0/24 would be large enough.
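The two private subnets in this example can be checked against the recommended private address space with Python's standard `ipaddress` module. This is a minimal sketch: the `plan` dictionary, the `check_plan` function, and the 200-host figure (taken from the "around 200 hosts" assumption above) are illustrative only.

```python
import ipaddress

PRIVATE_POOL = ipaddress.ip_network("10.203.0.0/19")
HOSTS_NEEDED = 200

plan = {
    "back-end (VLAN 2000)": ipaddress.ip_network("10.203.0.0/24"),
    "IPMI (VLAN 2001)": ipaddress.ip_network("10.203.1.0/24"),
}


def usable_hosts(net):
    """Usable IPv4 host addresses: all addresses minus network and broadcast."""
    return net.num_addresses - 2


def check_plan(plan, pool, hosts_needed):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    nets = list(plan.items())
    for name, net in nets:
        if not net.subnet_of(pool):
            problems.append(f"{name}: {net} is outside {pool}")
        if usable_hosts(net) < hosts_needed:
            problems.append(f"{name}: {net} holds fewer than {hosts_needed} hosts")
    # Compare every pair of subnets once to catch overlaps.
    for i, (name_a, net_a) in enumerate(nets):
        for name_b, net_b in nets[i + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} overlaps {name_b}")
    return problems


print(check_plan(plan, PRIVATE_POOL, HOSTS_NEEDED))  # []
```

A /24 offers 254 usable addresses, which is why it comfortably covers the roughly 200 hosts assumed in each bullet.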