2024 SRN/Campus Stretch Migration
Executive Summary
When the Research data centers were created, they existed separately from the "Any VLAN anywhere" network of building & data center VLANs. Most VLANs remained separate—either living inside the Research space, or in the campus space—but some clients wanted to have VLANs that existed in both spaces. A method was created to "stretch" a VLAN from the Campus area to the Research area.
With the deployment of EVPN/VxLAN—the model upon which the SRN is based—to campus, these "stretched" VLANs will be migrated to the new, "interconnected" way of stretching VLANs.
To implement this migration, each affected VLAN will require a short downtime. During a VLAN's migration, devices on the Research side of the interconnection will temporarily lose connectivity to everything except other devices on the Research side. Devices on the Campus side of the interconnection will not be affected, though they will temporarily lose connectivity to devices on the Research side. We expect 5-15 minutes of downtime to migrate a VLAN.
VLANs that are not "stretched" will not be affected by this work.
What is the current architecture?
Why does it need to change?
Today, the SRN architecture looks like this:
Each top-of-rack switch connects to a pair of leaf switches (either leaf01/02 or leaf03/04). Traffic enters the SRN via a leaf-switch pair.
In normal cases, traffic within the same VLAN takes the shortest path to its destination. Traffic bound for elsewhere takes the shortest path to the router or firewall, and thence to another SRN rack or towards the aggregation switches. In all cases, there is a redundant path: If a piece of network equipment has an issue, or an entire research data center goes down, there is still a path for your traffic to take.
If the traffic is coming from a "stretched" VLAN, the path it takes is different. Traffic leaving a stretched VLAN, even if it is bound for elsewhere in a Research data center, must first travel to SRCF1 leaf switch 1 or 2, cross a 2x100GbE connection to campus, and then continue to its destination as if it had originated on campus.
This special path for stretched VLANs has two issues. First, traffic may have to traverse research data centers to reach SRCF1, only to turn around and come back from campus. Second, if SRCF1 is experiencing issues or undergoing maintenance, the dedicated connection might be unavailable, cutting off the stretched VLANs from everything else.
What will the new architecture look like?
The new architecture looks like this:
The boxes have been reorganized, but the paths are mostly the same. Most traffic—traffic that is not on a stretched VLAN—takes the same paths that it did before.
Traffic on a stretched VLAN will take a separate path, through one of the new "Interconnect Gateways" (IGWs). Each VLAN Area has a geographically-distributed pair of IGWs, connecting to other IGWs across SUNet. These paths are not dependent on any one facility, so they are resilient against individual equipment or facility outages.
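The resilience difference between the two designs can be illustrated with a small check: does at least one gateway for the stretched VLAN survive if any one facility goes down? This is only a sketch. The function name is hypothetical, and the facility names other than SRCF1 are placeholders, since this document does not say where the IGWs are located.

```python
def survives_any_single_facility_outage(gateway_facilities):
    """Return True if a gateway remains after any one facility fails.

    gateway_facilities: list of facility names, one entry per gateway
    that can carry the stretched VLAN's traffic.
    """
    if not gateway_facilities:
        return False
    # Fail each facility in turn and check that some gateway is left.
    return all(
        any(facility != failed for facility in gateway_facilities)
        for failed in set(gateway_facilities)
    )


# Current design: both stretch leaf switches (leaf 1 and leaf 2) are in SRCF1.
current_design = ["SRCF1", "SRCF1"]

# New design: an IGW pair in two different facilities (placeholder names).
new_design = ["facility-A", "facility-B"]

print(survives_any_single_facility_outage(current_design))  # False
print(survives_any_single_facility_outage(new_design))      # True
```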
How will the migration take place?
How long will it take?
To migrate a VLAN, the Backbone team will first remove it from the special SRCF1-to-MOA connection. At that point, the downtime for the VLAN begins.
When we migrate a VLAN, what your devices experience will depend on which "side" of the stretch your devices are on. Most stretched VLANs are owned by routers & firewalls on campus; for those VLANs, the devices on the Campus side of the stretch will not be affected, except that they will be unable to talk to devices on the Research side of the stretch. Devices on the Research side of the stretch will lose connectivity to everything, except for devices on the same side (the Research side) of the stretch.
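The rules above can be summarized as a small lookup. The sketch below assumes the common case described here, where the VLAN is owned by routers and firewalls on campus. The names `Location` and `reachable_during_migration` are made up for illustration.

```python
from enum import Enum


class Location(Enum):
    CAMPUS_SIDE = "campus side of the stretched VLAN"
    RESEARCH_SIDE = "research side of the stretched VLAN"
    ELSEWHERE = "anything outside the stretched VLAN"


def reachable_during_migration(src, dst):
    """Can a device at `src` reach `dst` while the VLAN is mid-migration?

    Assumes the VLAN's router or firewall lives on campus.
    `src` must be a device on the stretched VLAN itself.
    """
    if src is Location.ELSEWHERE:
        raise ValueError("src must be a device on the stretched VLAN")
    if src is dst:
        # Devices on the same side of the stretch can still talk to each other.
        return True
    # Only the Campus side keeps its path to the rest of the network.
    return src is Location.CAMPUS_SIDE and dst is Location.ELSEWHERE


print(reachable_during_migration(Location.CAMPUS_SIDE, Location.ELSEWHERE))      # True
print(reachable_during_migration(Location.CAMPUS_SIDE, Location.RESEARCH_SIDE))  # False
print(reachable_during_migration(Location.RESEARCH_SIDE, Location.ELSEWHERE))    # False
```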
After confirming the VLAN is removed from the SRCF1-to-MOA connection, the interconnection will be created: The IGWs for the Campus and Research areas will be updated to recognize—and relay traffic for—the migrating VLAN.
It will take 5-15 minutes to implement the migration. Once the migration is complete, devices that are 'chatty' (like anything making outbound connections) will start working within seconds; devices that are normally quiet (like remote management interfaces) will take longer to start working (usually within a few minutes, rarely up to half an hour).
Migrations will happen one VLAN at a time. We will schedule long (30- or 60-minute) downtime windows. Each VLAN that is assigned to a downtime window will migrate during that window, in a set order.
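To get a rough sense of when a VLAN might go down within its window, the sketch below gives each VLAN a sequential slot sized at the 15-minute upper estimate. This is a planning assumption for illustration only, not the Backbone team's actual scheduling method. The VLAN numbers and the `plan_window` name are hypothetical.

```python
def plan_window(vlans, window_minutes, worst_case_minutes=15):
    """Assign sequential worst-case slots to VLANs within one downtime window.

    Returns a list of (vlan, start_offset, end_offset) tuples, with offsets
    in minutes from the start of the window. Raises ValueError if the VLANs
    do not fit in the window at the worst-case duration.
    """
    needed = len(vlans) * worst_case_minutes
    if needed > window_minutes:
        raise ValueError(
            f"{len(vlans)} VLANs need {needed} minutes, "
            f"but the window is only {window_minutes} minutes"
        )
    # One VLAN at a time, in the order given.
    return [
        (vlan, i * worst_case_minutes, (i + 1) * worst_case_minutes)
        for i, vlan in enumerate(vlans)
    ]


# Two hypothetical VLANs in a 30-minute window.
print(plan_window([101, 102], window_minutes=30))
# [(101, 0, 15), (102, 15, 30)]
```

Because actual migrations may finish in as little as 5 minutes, a VLAN later in the order could begin earlier than its worst-case slot suggests.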
What VLANs are affected?
Check out this Google Sheet to see if your VLAN is one of the affected ones. You will need to log in to your Stanford Google account to view the document.
If your VLAN is not listed in the document, then you are not affected by this work.
When will my VLAN be migrated?
Where can I send additional questions?
You should have received (or will soon receive) a notification from Stanford Research Computing, UIT Networking, or someone in your School or Department. That email thread will be used to set the migration schedule for your VLAN. You can use the same email thread to submit questions.