Monday, December 28, 2015

VMware NSX useful resources

I'm taking a deep dive into VMware Network Virtualization (NSX) and have decided to collect all the useful resources I find along the way.

Fundamentals
NSX Design
NSX Operations
NSX in Home Lab
NSX Advanced
NSX Security
Other list of resources
  • Rene Van Den Bedem (@vcdx133) :  NSX Link-O-Rama - great list of resources gathered by Rene
Tools
  • ARKIN - network visibility and analytics
This list will be continuously updated.
If you know about any other useful NSX resource, don't hesitate to write a comment with a link.

Thursday, December 17, 2015

End to End QoS solution for VMware vSphere with NSX on top of Cisco UCS

I'm engaged on a private cloud project where end-to-end network QoS is required to provide guarantees for particular network traffic classes. These traffic classes are:
  • FCoE Storage
  • vSphere Management
  • vSphere vMotion
  • VM production
  • VM guest OS agent-based backup <== this is the most complex requirement in the context of QoS
The compute and network infrastructure is based on:
  • CISCO UCS
  • CISCO Nexus 7k and 
  • VMware NSX. 
More specifically, the following hardware components have to be used:
  • CISCO UCS Mini Chassis 5108 with Fabric Interconnects 6324 
  • CISCO UCS servers B200 M4 with virtual interface card VIC1340 (2x10Gb ports - each port connected to a different fabric interconnect)
  • CISCO Nexus 7k
The customer is also planning to use NSX security (micro-segmentation) and vRealize Automation for automated VM provisioning.

So now we understand the overall concept and we can consider how to achieve end-to-end network QoS to differentiate the required traffic classes.

Generally we can leverage Layer 2 QoS 802.1p (aka CoS - Class of Service) or Layer 3 QoS (aka DSCP - Differentiated Services Code Point). However, to achieve end-to-end QoS on Cisco UCS we have to use CoS, because it is the only QoS method available inside the Cisco UCS blade system to guarantee bandwidth on shared 10Gb NIC (better to say CNA) ports.
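For illustration, a UCS QoS system class mapping for these traffic classes could look something like the list below. Note this is only a hypothetical sketch: the Fibre Channel class with CoS 3 is the usual UCS default for FCoE, backup (CoS 4) and VM production (CoS 0) follow the marking used later in this post, and the remaining CoS values and bandwidth weights are placeholders to be replaced by the values agreed for the project.
  • FCoE storage - Fibre Channel class, CoS 3, no-drop, guaranteed bandwidth share
  • vSphere Management - e.g. CoS 6, small guaranteed share
  • vSphere vMotion - e.g. CoS 5, medium guaranteed share
  • VM guest OS agent-based backup - CoS 4, guaranteed/capped share
  • VM production - CoS 0 (best effort), remaining bandwidth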

The most important design decision point is where we will do the CoS marking to differentiate the required traffic classes. The following two options are generally available:
  1. CoS marking only in Cisco UCS (hardware-based marking)
  2. CoS marking on vSphere DVS portgroups (software-based marking)
Let's deep dive into the available options.

OPTION 1 - UCS Hardware CoS marking

Option 1 is depicted in the figure below. Please click on the figure to see where CoS marking and bandwidth management is done.


The following bullets describe the key ideas of Option 1:
  • Management and vMotion traffic is consolidated on the same pair of 10Gb adapter ports (NIC-A1 and NIC-B1) together with FCoE traffic. Active/Standby teaming is used for the vSphere Management and vMotion portgroups, where each traffic type is by default active on a different UCS fabric.
  • VTEP and backup traffic is consolidated on the same pair of 10Gb adapter ports (NIC-A2 and NIC-B2). Active/Standby teaming is used for the NSX VTEP and backup portgroups. Each traffic type is by default active on a different UCS fabric.
  • Multiple VTEPs and Active/Active teaming for the backup portgroup can be considered for higher network performance if necessary.
  • All vmkernel interfaces should be configured consistently across all ESXi hosts to use the same Ethernet fabric in the non-degraded state and achieve optimal east-west traffic.
Option 1 implications:
  • Two vNICs per virtual machine have to be used: one for production traffic (VXLAN) and a second one for backup traffic (VLAN-backed portgroup with CoS marking).
OPTION 2 - vSphere DVS CoS marking:

Option 2 is depicted in the figure below. Please click on the figure to see where CoS marking and bandwidth management is done.

The following bullets describe the key ideas of Option 2 and the differences from Option 1:
  • Management and vMotion traffic is again consolidated on the same pair of 10Gb adapter ports (NIC-A1 and NIC-B1) together with FCoE traffic. The difference from Option 1 is the use of a single Cisco vNIC and CoS marking in the DVS portgroups. Active/Standby teaming is used for the vSphere Management and vMotion portgroups, where each traffic type is by default active on a different UCS fabric.
  • VTEP and backup traffic is consolidated on the same pair of 10Gb adapter ports (NIC-A2 and NIC-B2). Active/Standby teaming is used for the NSX VTEP and backup portgroups. The difference from Option 1 is the use of a single Cisco vNIC and CoS marking in the DVS portgroups. Each traffic type (VXLAN, backup) is by default active on a different UCS fabric.
  • Multiple VTEPs and Active/Active teaming for the backup portgroup can be considered for higher network performance if necessary.
  • All vmkernel interfaces should be configured consistently across all ESXi hosts to use the same Ethernet fabric in the non-degraded state and achieve optimal east-west traffic.
  • FCoE traffic is marked automatically by UCS for any vHBA. This is the only hardware-based CoS marking.
Option 2 implications:
  • Two vNICs per virtual machine have to be used: one for production traffic (VXLAN) and a second one for backup traffic (VLAN-backed portgroup with CoS marking).

Both options above require two vNICs per VM, which introduces several problems, some of them listed below:
  1. The default IP gateway is used for the production network, therefore the backup network cannot be routed without static routes inside the guest OS.
  2. NSX security tags cannot be applied per VM but must be applied per vNIC, which is additional complexity.
Do we have any possibility to achieve QoS differentiation between VM traffic and in-guest backup traffic with a single vNIC per VM?

We can enhance Option 2 a little bit. The enhanced solution is depicted in the figure below. Please click on the figure to see where CoS marking and bandwidth management is done.

So where is the enhancement? We can leverage conditional CoS marking in each DVS portgroup used as an NSX virtual wire (aka NSX Logical Switch or VXLAN). If IP traffic is destined for the backup server we mark it as CoS 4 (backup), otherwise we mark it as CoS 0 (VM traffic).

You can argue that VXLAN is L2 over L3, so the L2 traffic where we did the conditional CoS marking will be encapsulated into L3 traffic (UDP) at the VTEP interface and the CoS tag will be lost. However, that's not the case. The VXLAN encapsulation is designed to keep the L2 CoS tag by copying the inner Ethernet priority into the outer Ethernet header. Therefore the virtual overlay CoS tag is preserved in the physical network underlay, and it can be leveraged by Cisco UCS bandwidth management (aka DCB ETS - Enhanced Transmission Selection) to guarantee bandwidth for particular CoS classes.

The enhanced Option 2 seems to me to be the best design decision for my particular use case and specific requirements.

I hope it makes sense and someone else finds it useful. I'm sharing it with the IT community for broader review. Any comments or thoughts are very welcome.

Monday, November 16, 2015

Creating a Capacity & Performance Management Dashboard in vRealize Operations 6.x

I'm a long-time proponent of performance SLAs in modern virtual datacenters. A performance SLA is nothing else than a mutual agreement between a service provider and a service consumer. The agreement describes what performance of a particular resource the consumer can expect and the provider should guarantee. Performance SLAs are important mainly for shared resources; on dedicated resources the consumer knows exactly what to expect from a performance point of view. In modern virtualized datacenters almost everything is shared, therefore performance SLAs are a must and all service consumers should require them.

The most important shared resources in virtualized infrastructures with a significant impact on application performance are CPU and disk. The remaining infrastructure resources - memory and network - are important as well, but CPU and disk performance were the typical final root causes of the performance troubleshooting I have done over several years. In VMware vSphere we can typically identify CPU contention by the CPU %RDY metric and storage contention based on the disk response time of normalized I/Os. We can identify such issues during troubleshooting when infrastructure consumers are complaining about application performance. We call that the reactive approach. A more mature approach is to identify potential performance issues before the application is affected. We call that the proactive approach. And that's where performance SLAs and threshold monitoring come into play.
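A quick note on the %RDY numbers: vCenter exposes CPU Ready as a summation value in milliseconds, and the commonly used conversion to a percentage is (Ready summation in ms / sample interval in ms) * 100. Assuming the 20-second real-time sample interval, 600 ms of Ready time equals 600 / 20,000 * 100 = 3% RDY per vCPU, which is exactly the SLA limit used below.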

An infrastructure performance SLA can look like this:

  • CPU RDY is below 3% (notification threshold 2%)
  • If # of vDisk IOPS < 1000 then vDISK Response Time is below 10ms (notification threshold 7ms)

Simple, right? The two bullets above should be clearly articulated, explained and agreed between the infrastructure service provider and the infrastructure consumers who build and provide application services on top of the infrastructure.

So now, how do we monitor these performance metrics? I have just found a blog post by Sunny Dua and Iwan Rahabok covering this topic with a step-by-step solution in vRealize Operations 6.x. Sunny and Iwan prepared and shared with the community customized vROps super metrics, views and dashboards for performance capacity planning. To be honest I do not have much experience with vROps so far, but it seems to be a very helpful resource for anybody using vRealize Operations as a monitoring platform.

Let's try to build and provide mature IT with clearly articulated SLAs and agreed expectations between service providers and service consumers.

Friday, November 06, 2015

VMware Tools 10 and "shared productLocker"

VMware Tools (aka VM tools, vmtools) were always distributed together with the ESXi image; however, this changed with VMware Tools 10. VMware now ships VMware Tools outside of vSphere releases as well. For more information look at this blog post.

Where can I get VMware Tools?

Option 1/ VMware Tools 10 can be downloaded from my.vmware.com, more specifically from this direct URL. Please be aware that you must be logged in to my.vmware.com before the direct link works.

Option 2/ VMware Tools can also be downloaded from here without any login credentials required. The latest version (10.0.6 at the moment of writing this blog post) is available here. The benefit of option (2) is that it provides binaries which can be used directly within guest operating systems.

Option 3/ open-vm-tools. This option is available for Linux-based and FreeBSD operating systems. You can read more about it here, on SourceForge or on GitHub. Optimally, open-vm-tools should be available through the standard package manager of your Unix-like operating system.

It is worth mentioning that
  • VMware Tools 10 are backward compatible and independent of the ESXi version.
  • You can share VMware Tools 10 packages with application/OS owners so they can update VMware Tools themselves during their OS update procedure. But even if your OS owners update VMware Tools themselves, it is still worth having VMware Tools available on ESXi so that VMware Tools versions can be compared from the vSphere point of view.
VMtools versions

VMware Tools reports its version as a number. For example, version number 9536 is version 9.10.0. You can map a VMware Tools version number to a human-readable version by leveraging this file.
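If I recall the encoding correctly, the number can also be decoded by hand as major * 1024 + minor * 32 + patch, so 9.10.0 gives 9 * 1024 + 10 * 32 + 0 = 9536. Treat the mapping file above as the authoritative source, though.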

Updates

OK, so what? If you update your VMware Tools the old way (together with the ESXi image), you - or VMware Update Manager - have to upload VMware Tools to each ESXi host, with the following impacts:
  1. It takes significantly more time, especially in bigger environments.
  2. You can potentially end up with different VMware Tools versions on different ESXi hosts in your datacenter. This can be reported as outdated VMware Tools after a vMotion of a particular VM across different ESXi hosts.
The thing is that VMware Tools 10 and above don't need to be updated together with the ESXi update on each ESXi host. You can update ESXi hosts without VMware Tools and later update the VMware Tools bundle in just a single shared place - the shared productLocker location.

So what actually is the "productLocker"? The "productLocker" is essentially the VMware Tools directory. This directory exists on each ESXi host by default; however, it can be reconfigured and pointed to a directory on a shared datastore. Such a configuration is called a "shared productLocker" and it enables us to do a centralized update of VMware Tools. It is worth mentioning that all your ESXi hosts must be reconfigured to use this shared location.

Reconfiguration is done via the ESXi host advanced configuration option UserVars.ProductLockerLocation. It can be changed manually on each host, automatically via a custom PowerCLI script, or, if you have the Enterprise Plus edition, you can leverage Host Profiles. The last option works perfectly for me.

Below is a screenshot showing the /productLocker directory structure and content on an ESXi 6 host ...

/productLocker directory structure and content 
If you use a central location for VMware Tools then you don't need to update ESXi hosts with the full ESXi image, only with the ESXi image without VMware Tools. See the example of the different profiles in the ESXi 6 Update 2 image below.
[root@esx01:~] esxcli software sources profile list -d /vmfs/volumes/NFS-SYNOLOGY-SATA/ISO/update-from-esxi6.0-6.0_update02.zip
Name                              Vendor        Acceptance Level
--------------------------------  ------------  ----------------
ESXi-6.0.0-20160301001s-no-tools  VMware, Inc.  PartnerSupported
ESXi-6.0.0-20160302001-standard   VMware, Inc.  PartnerSupported
ESXi-6.0.0-20160301001s-standard  VMware, Inc.  PartnerSupported
ESXi-6.0.0-20160302001-no-tools   VMware, Inc.  PartnerSupported
Profile names with the no-tools suffix can be used to update ESXi without pushing VMware Tools to each ESXi host. For further details on how to update ESXi hosts with a particular profile see my other post - How to update ESXi via CLI.

The current ESXi host product locker location can be displayed with the following esxcli command
esxcli system settings advanced list -o /UserVars/ProductLockerLocation 
and the output should look like ...
[root@esx01:~] esxcli system settings advanced list -o /UserVars/ProductLockerLocation   
Path: /UserVars/ProductLockerLocation   
   Type: string 
   Int Value: 0
   Default Int Value: 0
   Min Value: 0   
   Max Value: 0   
   String Value: /locker/packages/6.0.0   
   Default String Value: /locker/packages/6.0.0   
   Valid Characters: *   
   Description: Path to VMware Tools repository
To change the location you can use the following esxcli command
esxcli system settings advanced set -o /UserVars/ProductLockerLocation --string-value "/vmfs/volumes/NFS-SYNOLOGY-SATA/VMtools/latest"
And you can verify that the setting was changed ...
[root@esx02:~] esxcli system settings advanced list -o /UserVars/ProductLockerLocation
   Path: /UserVars/ProductLockerLocation
   Type: string
   Int Value: 0
   Default Int Value: 0
   Min Value: 0
   Max Value: 0
   String Value: /vmfs/volumes/NFS-SYNOLOGY-SATA/VMtools/latest   
   Default String Value: /locker/packages/6.0.0   
   Valid Characters: *   
   Description: Path to VMware Tools repository
The ESXi host has to be rebooted to activate the new product locker location.

Hope this helps other folks in the VMware community simplify operations with VMware Tools.



Wednesday, October 07, 2015

How to restore deleted vmdk from VMFS5

Yesterday I got an e-mail from somebody asking me how to restore a deleted vmdk from VMFS5. They deleted a VM but then realised it contained very important data.

The typical answer would be - "Restore from backup" - however they wrote that they don't have a backup.

Fortunately, I have never had a need to restore a deleted vmdk myself, so I started with some quick research (aka googling :-) ).

I found VMware KB 1015413 with the following statement ...
"VMware does not offer data recovery services. This article provides contact information for data recovery service providers.
Note: VMware does not endorse or recommend any particular third-party utility, nor is this list meant to be exhaustive."
I was sitting in the VMware office, so I asked colleagues if they had any practical experience with undeleting a vmdk from VMFS. One colleague of mine suggested the utility "VMFS Recovery" from DiskInternals.com. He had a positive experience with this particular tool. His suggestion was to use the trial version, which should help to identify whether recovery is possible, and then buy the full version for the actual recovery.

Warning: Use any third-party recovery tool at your own risk.

I absolutely understand that if you lose important data you want to try anything to get it back; however, here are my general thoughts:

  • Clone or snapshot your raw disk (LUN) with VMFS before any recovery (use storage array capabilities, a third-party imaging tool, or the *nix dd command - see the sketch after this list)
  • If your data is very valuable to you, consider engaging professional data recovery services
  • If you do the recovery yourself, I wish you good luck.
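For the dd approach, a minimal sketch could look like the one below. It assumes the affected LUN is presented (ideally read-only) to a Linux recovery host as /dev/sdX (a placeholder) and that there is enough space on the target path; double-check the device name before running anything, because swapping if= and of= would destroy the very data you are trying to save.
# image the whole LUN to a file before attempting any recovery
dd if=/dev/sdX of=/mnt/backup/vmfs-lun.img bs=1M conv=noerror,sync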
If you have any other experience with this topic, please share it with the community in the comments.

Thursday, October 01, 2015

VMware VM Template deployment and MS Windows product license activation

In the past I have been informed by some of my customers that the MS Windows Server license was not properly applied and activated during VMware VM template deployment, even though the product key was properly entered in the "Customization Specification".

I don't know if this issue still exists in the latest vSphere version; however, there has always been a pretty easy workaround which my customer has been using since then.

You can use Run Once commands in the "Customization Specification". Below, in the vSphere Web Client screenshot, you can see just an example where two MS-DOS commands (dir and echo) are used.


To apply and activate the MS Windows license key, the slmgr.vbs tool ("Windows Software Licensing Management Tool") is leveraged.

The exact Run Once commands are:
  • C:\Windows\System32\slmgr.vbs -ipk H7Y93-12345-54321-ABCDE (just an example, use your own product key)
  • C:\Windows\System32\slmgr.vbs -ato

The first command applies the product key and the second one activates the license.

Wednesday, September 02, 2015

Storage related problems with LSI 12Gb SAS HBA card

Our Dell field engineer experienced strange storage problems with SAS storage connected to ESXi hosts with LSI 12Gb SAS HBAs: datastores were inaccessible after an ESXi reboot, paths were temporarily unavailable, etc. In this particular case it was DELL Compellent storage with SAS front-end ports, but the problem was not related to that particular storage and a similar issue can be experienced on other SAS storage systems like DELL PowerVault MD, etc.

He found (thanks, Google) Christopher Glemot's blog post describing other issues with the LSI 12Gb SAS HBA. The key point, however, was that the mpt3sas driver should be used instead of the msgpt3 driver which was installed on these particular ESXi hosts.

The solution is relatively easy:

  • Uninstall the msgpt3 driver - esxcli software vib remove -n lsi-msgpt3
  • Download the latest mpt3sas driver (currently mpt3sas-10.00.00.00-6.0-2803883.zip)
  • Unpack and upload the mpt3sas driver to the ESXi host
  • Install the mpt3sas driver on the ESXi host - esxcli software vib install -d /path/mpt3sas-10.00.00.00-6.0-offline_bundle-2803883.zip
  • Restart the ESXi host and verify the driver (see the sketch after this list)
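A quick way to double-check which VIB and module ended up on the host after the reboot might be something like the commands below (assuming the standard esxcli output and the busybox grep available in the ESXi shell):
esxcli software vib list | grep mpt
esxcli system module list | grep mpt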
Hope this helps.

Wednesday, August 05, 2015

DELL Force10 : BPDU Guard vs BPDU Filtering

Spanning tree should be enabled on any enterprise switch during the initial switch configuration. That's the reason I mentioned spanning tree configuration in the blog post "Initial switch configuration". The latest FTOS version supports the following spanning tree protocols:

  • STP (Spanning Tree Protocol)
  • RSTP (Rapid Spanning Tree Protocol)
  • MSTP (Multiple Spanning Tree Protocol)
  • PVST+ (Per-VLAN Spanning Tree Plus)

I assume the reader is familiar with the various spanning tree protocols and with the general difference between BPDU Guard and BPDU Filter. Here is just a quick recap of the relevant terminology:
  • BPDU Guard and BPDU Filtering are Spanning Tree Protocol security mechanisms.
  • BPDU Guard is typically configured on a switch edge port. It detects BPDU frames and, because BPDU frames are not expected on an edge port, it disables the port temporarily or permanently.
  • BPDU Filter is also typically configured on a switch edge port and detects BPDU frames; however, it does not disable the switch port but instead filters the BPDU frames to mitigate the impact on the spanning tree protocol, because BPDU frames can initiate a topology change and a new STP root election.
Generally it is not recommended to filter BPDUs, but there are use cases where a BPDU filter is beneficial. Here are at least two such use cases I have in mind:
Use Case 1/ Datacenter interconnect (aka DCI), where you are absolutely sure there cannot be a loop, you want to have two independent spanning tree regions, and you really do want to filter BPDUs
Use Case 2/ Edge ports to ESXi hosts
  • A rogue VM can send BPDUs into the network, therefore some protection is needed, especially in non-trusted environments like IaaS cloud providers, so you have to choose between BPDU Guard and BPDU Filter + broadcast storm control
  • BPDU Guard can cause a DoS when you don't have control over the ESXi configuration. See http://blog.igics.com/2015/01/bpdu-filter-and-forged-transmit-on.html for further details
  • BPDU Filter can help to mitigate topology changes when a rogue VM periodically tries to become (and stop being) the STP root switch and initiates unwanted network topology changes

The main objective of this blog post is to explain the specific BPDU Guard and BPDU Filtering implementation on DELL S-Series (formerly Force10) switches, so here it is.

Force10 BPDU Guard
  • Software-based implementation - BPDUs are received on an interface and passed to the CPU for analysis/action (logs will reflect dropped BPDUs)
  • Occurs when interfaces are configured for portfast/edge-port with bpduguard and a BPDU is received
Force10 BPDU Filtering
  • Hardware-based implementation - BPDUs are dropped on ingress to the interface
  • CPU does not receive BPDU, leaving CPU resources available for other tasks
  • Logs will not reflect dropped BPDUs, because logging is possible only when BPDUs are sent to the CPU
  • Occurs when STP is disabled globally or per-interface
Spanning Tree examples:
All examples below are for Rapid Spanning Tree Protocol (RSTP) but the same syntax works for other spanning tree protocol variants.

STP Portfast equivalent
conf
interface gigabitethernet 0/1
  spanning-tree rstp edge-port
BPDU Guard - set the port state to Error Disabled and drop traffic if a BPDU is received on the interface - the interface state remains up
conf
interface gigabitethernet 0/1
  spanning-tree rstp edge-port bpduguard
BPDU Guard - shut down the interface if a BPDU is received on the interface
conf
interface gigabitethernet 0/1
  spanning-tree rstp edge-port bpduguard shutdown-on-violation
BPDU Filter on particular interface
conf
interface gigabitethernet 0/1
  no spanning-tree

Tuesday, July 28, 2015

How you understand documenting Conceptual, Logic, Physical?

I have just read following question in Google+ "VCDX Study Group 2015"
As a fellow writer (we architects are not readers, but writers! :) ) wanted to ask you how you understand documenting Conceptual, Logic, Physical.
Can you add all these in a single Architecture design document with all 3 parts as 3 sections or you are better off creating 3 separate documents for each type of design?
I hear similar questions about how to write good design documentation very often. So my answer was the following ...

As a writer you have to decide what is the best for your readers :-)

When I'm engaged to write an architecture document I use different approaches for different design engagements. It really depends on the project size, scope, audience, architecture team, etc. For example, right now I'm working on a project where six architects are working on a single High Level Design covering the big picture, each also preparing a Low Level Design. At the end there is a single HLD document and five separate LLD documents covering
  • Compute, 
  • Storage, 
  • Networking,
  • vSphere and
  • Backup.
I have had other projects where the whole architecture was in a single document and each section was targeted at a different audience. That was the case with my VCDX design documentation.

Generally I believe a High Level Design (HLD) is for a broader technical audience but also for business owners. Therefore physical design is not required at this stage, and only the conceptual and a brief logical design for each area should be in the HLD. A Low Level Design (LLD) is for technical implementers and technical operational personnel, therefore less writing creativity and deeper technical language for the specific area should be used there, with references to the HLD. I recommend reading Greg Ferro's "Eleven Rules of Design Documentation", which IMHO applies very well to LLDs.

The HLD conceptual design should include business and technical requirements, constraints, assumptions, key design decisions, the overall high-level concept and a risk analysis.

The HLD logical design should include the basic logical constructs for the different design areas together with capacity planning.

The LLD should include the conceptual, logical and physical design for the specific area(s) or designed system/subsystem. In the LLD conceptual design there should be a subset of the HLD technical requirements, constraints and assumptions, and maybe some other specific requirements irrelevant to the HLD. These can even be discovered after the HLD and LLD design reviews and additional technical workshops. The logical design can be the same as in the HLD, or you can go into a deeper level but still stay in the logical layer without product physical specifications, cabling port IDs, VLAN IDs, IP addressing, etc. These physical details should be in the physical design and, if needed, referenced in attachments, Excel workbooks, or similar implementation/configuration management documents.

The LLD physical design is usually leveraged by the implementer to prepare the as-built documentation.

That's just my $0.02 and your mileage may vary.

At the end I have to repeat ... you, as a writer (architect), have to decide on the appropriate documentation format for your target audience.

Don't hesitate to share your thoughts in comments.

Tuesday, July 07, 2015

DELL Force10 : Interface Configuration and VLANs

Physical interface configuration

Physical switch interface configuration is a basic operation on any switch device, and a DELL Force10 switch is no exception. However, one thing is very unique to Force10 switches: everything, including physical interfaces, is disabled by default, therefore interfaces are in the down state and must be configured before any use. Some say this is strange behavior, but in my opinion it is pretty good behavior because it is a much more secure approach. You will not disrupt the whole network by connecting and cabling a new switch into your enterprise network until you configure something. If you apply a bad configuration, then it is your fault and not the device's fault.

OK, so when you want to use a switch interface you have to enable the interface explicitly. Before that, you should be absolutely sure your new Force10 switch is ready to be connected to the network - think, for example, about the spanning tree protocol configuration. Let's assume you know what you are doing and you want to enable a particular physical interface. It is easy; I think the example below is self-explanatory.
conf
  interface tengigabit 0/1
  no shutdown
So your interface is up, but another important note is that all physical interfaces are Layer 3 by default. You can assign an IP address to a Layer 3 (routed) interface and your L3 switch acts as a router. IP address assignment is shown below.
conf
  interface tengigabit 0/1
  ip address 192.168.1.11/24
  no shutdown
Cool, but there is a chance you want to configure a Layer 2 interface to work as a switch port and not a routed port. It is pretty easy: you have to tell the interface not to have an IP address and to behave as a switch port.
conf
  interface tengigabit 0/1
  no ip address
  switchport
  no shutdown

Physical Interface Numbering

So far we have used interface identification similar to the following
interface tengigabit 0/13

The general interface identification convention has the following format
interface <Interface Type> <Stack Unit Number>/<Interface Number>
where
  • Interface Type - can have the values gigabit (gi), tengigabit (te), fortygigabit (fo)
  • Stack Unit Number - is the stack ID number if classic stacking is configured, otherwise it is 0 as on a single-unit switch
  • Interface Number - is the sequential port number on the particular stack unit

Interface ranges

You can leverage interface ranges to simplify interface and VLAN configurations. 
conf
interface range  te 0/1-3, te 0/5-7
interface range vlan 100-110
I think you can see the benefit: the configuration is applied to all interfaces in the range, as shown in the sketch below.
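For example, reusing the L2 settings from the earlier per-interface example, something like the following should switch the whole range to L2 mode and enable it in one shot (a sketch only; the port numbers are arbitrary):
conf
interface range te 0/1-3 , te 0/5-7
  no ip address
  switchport
  no shutdown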

VLANs

In the configurations above everything happens in the default VLAN, which is by default VLAN 1, i.e. a single broadcast domain. In computer networking, a single layer-2 network may be partitioned to create multiple distinct broadcast domains, which are mutually isolated so that packets can only pass between them via one or more routers; such a domain is referred to as a virtual local area network, virtual LAN or VLAN (source: Wikipedia). Nowadays VLANs are used very often for network separation (security) and to split broadcast domains (availability, performance). If you are familiar with Cisco VLAN configuration, you create a VLAN ID in the VLAN database and then assign the VLAN or VLANs to a particular interface. In Force10 it is a little bit different. You still have to create the VLAN ID, that's the same; however, you do not assign VLANs per interface but assign interfaces to the VLAN. See the example below.
conf
  interface vlan 100
  description "VLAN for mission critical servers"
  untagged TenGigabitEthernet 0/1-3
  tagged TenGigabitEthernet 0/0
In the example above we created VLAN 100 for three mission-critical servers. The servers are connected to ports Te 0/1, Te 0/2 and Te 0/3 without VLAN tagging (aka access mode). Switch interface Te 0/0 is the uplink to the rest of the network; more VLANs have to be carried on this particular port, therefore it is configured as a trunk port with multiple tagged VLANs.
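To verify the result, VLAN membership can be displayed per VLAN; assuming the usual FTOS show syntax, something like this should list the tagged and untagged members of VLAN 100:
show vlan id 100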

Switch uplinks are usually configured redundantly in a high-availability mode, therefore there is a big chance you would like to use a port-channel (aka LAG) as a switch uplink. LAGs are explained in the next section. A port-channel is nothing more than a special virtual interface, therefore port-channel VLAN configuration is very similar to that of physical interfaces.
conf
interface vlan 100
  description "VLAN for mission critical servers"
  untagged TenGigabitEthernet 0/1-3
  tagged Port-channel 1

LAGs - Link Aggregates 

Link aggregation is a general term for channeling multiple links into a single virtual aggregate, also known as a port channel. There are two types of port channels: static and dynamic (aka LACP). For more general information about link aggregation look here.

Now let's see how you can configure port channels.

Static Port Channel
Below is an example of a static port channel bundling two interfaces (te 0/1 and te 0/2)
interface port-channel 1
  description "Static Port-Channel"
  channel-member tengigabit 0/1
  channel-member tengigabit 0/2
  no ip address
  switchport
  no shutdown
Dynamic Port Channel
Below is an example of a dynamic port channel bundling two interfaces (te 0/1 and te 0/2)
interface port-channel 1
  description "Dynamic Port-Channel (LACP)"
  no ip address
  switchport
  no shutdown
 
interface tengigabit 0/1
  port-channel-protocol lacp
    port-channel 1 mode active
  no shutdown
 
interface tengigabit 0/2
  port-channel-protocol lacp
    port-channel 1 mode active
  no shutdown

VLT (Virtual Link Trunking) is actually a virtual port channel spanned across multiple chassis (aka multi-chassis LAG). A VLT can be a static or a dynamic port-channel. When two Force10 switches are configured in a single VLT domain, you create the VLT port-channel independently on each VLT node. You can read more about VLT here.

You configure a VLT port-channel on each node in exactly the same way as classic port-channels. The only difference is that you tell FTOS that this particular port-channel is a VLT, and you can define a peer port-channel ID which can be different than on the other node. However, best practice is to use the same port-channel IDs on both VLT nodes just to keep the configuration simple and more readable.

The directive that tells FTOS a port-channel is a VLT is vlt-peer-lag.

So if the port-channel examples above were VLTs, the configuration would be the same with only one additional option. See the examples below.

Static VLT Port Channel
interface port-channel 1
  description "Static Port-Channel"
  channel-member tengigabit 0/1
  channel-member tengigabit 0/2
  vlt-peer-lag port-channel 1
  no ip address
  switchport
  no shutdown

Dynamic VLT Port Channel
interface port-channel 1
  description "Dynamic Port-Channel (LACP)"
  no ip address
  vlt-peer-lag port-channel 1
  switchport
  no shutdown
 
interface tengigabit 0/1
  port-channel-protocol lacp
    port-channel 1 mode active
  no shutdown
 
interface tengigabit 0/2
  port-channel-protocol lacp
    port-channel 1 mode active
  no shutdown

Conclusion

Interface and VLAN configuration is a basic network operation. If you are familiar with interface configuration from any other switch vendor, I think Force10 interface configuration will be simple for you. The only different approach is the VLAN configuration, but that is just a matter of habit.

Hope you found this blog post useful and, as always, any comments and feedback are highly appreciated.

Monday, June 29, 2015

DELL Force10 : Virtual Routing and Forwarding (VRF)

VRF Overview

Virtual Routing and Forwarding (VRF) allows a physical router to partition itself into multiple Virtual Routers (VRs). The control and data plane are isolated in each VR so that traffic does NOT flow across VRs. Virtual Routing and Forwarding (VRF) allows multiple instances of a routing table to co-exist within the same router at the same time.

DELL OS 9.7 supports up to 64 VRF instances. The number of instances may increase in future versions, therefore check the current documentation for the authoritative number.

VRF Use Cases

VRF improves functionality by allowing network paths to be segmented without using multiple devices. Using VRF also increases network security and can eliminate the need for encryption and authentication due to traffic segmentation. 

Internet service providers (ISPs) often take advantage of VRF to create separate virtual private networks (VPNs) for customers; VRF is also referred to as VPN routing and forwarding.

VRF acts like a logical router; while a physical router may include many routing tables, a VRF instance uses only a single routing table. VRF uses a forwarding table that designates the next hop for each data packet, a list of devices that may be called upon to forward the packet, and a set of rules and routing protocols that govern how the packet is forwarded. These VRF forwarding tables prevent traffic from being forwarded outside a specific VRF path and also keep out traffic that should remain outside the VRF path.

VRF uses interfaces to distinguish routes for different VRF instances. Interfaces in a VRF can be either physical (Ethernet port or port channel) or logical (VLANs). You can configure identical or overlapping IP subnets on different interfaces if each interface belongs to a different VRF instance.

VRF Configuration

First of all, you have to enable the VRF feature.
conf
feature vrf
The next step is to create an additional VRF instance
ip vrf tenant-1
The vrf-id is assigned automatically; however, if you want to configure the vrf-id explicitly, you can do so with an additional parameter. In the example below we use vrf-id 1.
ip vrf tenant-1 1
We are almost done. The last step is to assign interfaces to the particular VRF. You can assign the following interface types:
  • Physical Ethernet interfaces (in L3 mode)
  • Port-Channel interfaces (static and dynamic/lacp)
  • VLAN interfaces
  • Loopback interfaces 
Below is an example of how to assign VLAN 100 to the VRF instance tenant-1.
interface vlan 100
  ip vrf forwarding tenant-1
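If the VLAN interface should also route within the tenant, an IP address is applied after the VRF assignment. A sketch only - the address is a placeholder, and on many platforms moving an interface into a VRF clears any previously configured IP address, so the address is (re)applied afterwards:
interface vlan 100
  ip vrf forwarding tenant-1
  ip address 192.168.100.1/24
  no shutdown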
Configuration is pretty easy, right?

Working in particular VRF instance
When you want to configure, show or troubleshoot something in a particular VRF instance, you have to explicitly specify which VRF you want to work in.

So, for example, when you want to ping from the tenant-1 VRF instance you have to use the following command
ping vrf tenant-1 192.168.1.1
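Show commands are scoped the same way; assuming the usual FTOS syntax, the routing table of the tenant VRF would be inspected with something like:
show ip route vrf tenant-1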
Conclusion

VRF is a great technology for L3 multi-tenancy. DELL Network Operating System 9 supports VRF, therefore you can design interesting network solutions with it.

Saturday, June 20, 2015

DELL Compellent Best Practices for Virtualization

All DELL Compellent best practices documents have been moved here.

The most interesting best practice document for me is "Dell Storage Center Best Practices with VMware vSphere 6.x".

VMware HA Error During VLT Failure

I received the following message in my mailbox ...
Hi.
I have a customer that has been testing Force10 VLT with peer routing and VMWare and has encountered the warning message on all hosts during failover of the switches (S4810’s) only when the primary VLT node is failed
“vSphere HA Agent, on this host couldn’t not reach isolation address 10.100.0.1”
Does this impact HA at all?  Is there a solution?
Thanks
Paul 

Force10 is the legacy product name of DELL S-Series datacenter networking. Force10 S4810s are datacenter L3 switches. If you don't know what Force10 VLT is, look here. Generally it is something like Cisco virtual Port Channel (vPC), Juniper MC-LAG, Arista MLAG, etc.

I think my answer can be valuable for broader networking and virtualization community so here it is ...

First of all let’s make some assumptions:
  • Force10 VLT is used for multi chassis LAG capability
  • Force10 VLT peer routing is enabled in VLT domain to achieve L3 routing redundancy
  • 10.100.0.1 is the IP address of the VLAN interface on the Force10 S4810 (primary VLT node) and this particular VLAN is used for vSphere management.
  • 10.100.0.2 is the IP address on the Force10 S4810 secondary VLT node.
  • vSphere 5.x and above is used.

Root cause with explanation:
When the primary Force10 VLT node is down, a ping to 10.100.0.1 doesn't work, because peer routing is an ARP proxy at L2. The secondary node will forward traffic on behalf of the primary node, but 10.100.0.1 doesn't answer at L3, therefore ICMP doesn't work.

A VMware HA cluster (vSphere 5 and above) uses network and storage heartbeat mechanisms. The network mechanism uses the two probes listed below.
  1. ESXi hosts in the cluster send heartbeats to each other. This should work fine during a primary VLT node failure.
  2. ESXi hosts also ping the HA isolation addresses (the default HA isolation address is the default gateway, therefore 10.100.0.1 in your particular case). This doesn't work during a primary VLT node failure.

That's the reason the VMware HA cluster logs a warning about this situation.

Is there any impact?
There is no impact on the HA cluster because
  • It is just an informative message: probe (1) works correctly and there is still network visibility among the ESXi hosts in the cluster.
  • From vSphere 5 and above there is also a storage heartbeat mechanism which can compensate for lost network visibility among the ESXi hosts in the cluster.

Are there any potential improvements?
Yes, there are. You can configure multiple HA isolation addresses to mitigate default gateway unavailability. In your particular case I would recommend using two IP addresses (10.100.0.1 and 10.100.0.2), because at least one VLT node will always be available.
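In practice this is done with the HA cluster advanced options (the das.isolationaddress settings described in the KB linked below); a minimal sketch for this particular case would be:
das.isolationaddress0 = 10.100.0.1
das.isolationaddress1 = 10.100.0.2
das.usedefaultisolationaddress = false (optional here, since the default gateway 10.100.0.1 is already listed explicitly)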


For more information on how to configure multiple HA isolation addresses, look at http://kb.vmware.com/kb/1002117

Monday, June 15, 2015

No data visibility for vSphere Admin

Recently I did a very quick (time-constrained) conceptual/logical design exercise for a customer who had a virtualization-first strategy and was willing to virtualize his Tier 1 business-critical applications. One of his requirements was to preclude data visibility for VMware vSphere admins.

I quickly thought about how to fulfill this particular requirement and my first general answer was ENCRYPTION. The customer asked me to tell him more about the encryption possibilities, so I listed the following options.

Option 1/ Encryption in the Guest OS 

Product examples: Microsoft BitLocker, HyTrust, SafeNet, etc.
A very nice comparison of disk encryption software is here.

Option 2/ Application level encryption

Product examples: Database Encryption in SQL Server 2008 and higher, Oracle Transparent Data Encryption, etc.

Option 3/ Encryption in the Fibre Channel SAN

Examples are the Brocade SAN Encryption Solution or the Cisco MDS 9000 Family Storage Media Encryption.

Option 4/ Encryption in the Disk Array

Data encryption behind the storage controllers, usually leveraging Self-Encrypting Drives (aka SED).

The next logical question was ... what is the performance impact?
My quick answer was that there is definitely a performance overhead with software encryption, but no performance overhead with hardware encryption, as it is offloaded to special ASICs.

Hmm... Right, the most appropriate answer would be that hardware solutions are designed to have no or negligible performance impact. I always recommend testing before any real use in production, but that's what hardware vendors claim, at least in their white papers. Specifically, in option (3) storage IO has to be redirected to the encryption module/appliance in the SAN, which should add a delay an order of magnitude smaller than the typical IO response time, therefore the impact on storage latency should theoretically be none or negligible.

However, the problem with my recommended options is not the performance claim.
The problem is that only options 1 and 2 are applicable to fulfill the customer's requirement, because options 3 and 4 do encryption and decryption at lower levels and the data is decrypted and visible at the vSphere layer. Therefore a vSphere admin would still have visibility into the data.

Options 1 and 2 definitely have some performance overhead, nowadays generally somewhere between 20% and 100% depending on the software solution, CPU family, encryption algorithm strength, encryption key length, etc.

For completeness, let's say that options 3 and 4 are good for different use cases.

  • Option 3 can help you secure data from a storage admin who does not have access to the SAN network, or from someone with physical access to the disks.
  • Option 4 can help you secure the data on the disks against theft of the physical storage or disks.

It is worth saying that security is always a trade-off.

Software-based solutions have some negative impact on performance, a medium negative impact on price and also a negative impact on manageability. The performance of software-based solutions can be significantly improved by leveraging AES hardware offload in modern Intel processors, and the performance overhead will be mitigated year by year.

Pure hardware-based solutions are not applicable options for this specific requirement; and even if they were applicable, and even though they have no or negligible impact on performance, there are drawbacks like a huge impact on cost and also some impact on scalability and manageability.

Conclusion
I was too quick during my consulting and I didn't realize which options really fulfill the customer's specific requirement. I often say that I don't trust anybody, not even myself. This was exactly such a case - unfortunately :-(

Time-constrained consulting usually doesn't offer the best results. Good architecture needs some time for review and a better comparison of options :-)

Thursday, May 28, 2015

How large is my ESXi core dump partition?

Today I was asked to check the core dump size on an ESXi 5.1 host, because this particular ESXi host experienced a PSOD (Purple Screen of Death) with a message that the core dump was not saved completely because it ran out of space.

To be honest, it took me some time to find out how to determine the core dump partition size, therefore I have documented it here.

All commands and outputs are from my home lab, where I have ESXi 6 booted from USB, but the principle should be the same.

To run these commands you have to log in to the ESXi shell, for example over SSH or via the ESXi troubleshooting console.

The first step is to find out which disk partition is used for the core dump.
[root@esx01:~] esxcli system coredump partition get
   Active: mpx.vmhba32:C0:T0:L0:9
   Configured: mpx.vmhba32:C0:T0:L0:9
Now we know that the core dump is configured on disk mpx.vmhba32:C0:T0:L0, partition 9.

The second step is to list the disks and disk partitions together with their sizes.
[root@esx01:~] ls -lh /dev/disks/
total 241892188
-rw-------    1 root     root        3.7G May 28 11:25 mpx.vmhba32:C0:T0:L0
-rw-------    1 root     root        4.0M May 28 11:25 mpx.vmhba32:C0:T0:L0:1
-rw-------    1 root     root      250.0M May 28 11:25 mpx.vmhba32:C0:T0:L0:5
-rw-------    1 root     root      250.0M May 28 11:25 mpx.vmhba32:C0:T0:L0:6
-rw-------    1 root     root      110.0M May 28 11:25 mpx.vmhba32:C0:T0:L0:7
-rw-------    1 root     root      286.0M May 28 11:25 mpx.vmhba32:C0:T0:L0:8
-rw-------    1 root     root        2.5G May 28 11:25 mpx.vmhba32:C0:T0:L0:9

You can get the same information with partedUtil.
[root@esx01:~] partedUtil get /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:9
326 255 63 5242880
Here you can see the partition has 5,242,880 sectors, where each sector is 512 bytes. That means 5,242,880 * 512 / 1024 / 1024 / 1024 = 2.5 GB.

Note: It is 2.5 GB because ESXi is installed on a 4 GB USB stick. If you have a regular hard drive, the core dump partition should be 4 GB.

BUT all of the above information is not valid if you have changed your scratch location (here is the VMware KB on how to do it). If your scratch location has been changed, you can display the current scratch location, which is stored in /etc/vmware/locker.conf

[root@esx01:~] cat /etc/vmware/locker.conf
/vmfs/volumes/02c3c6c5-53c72a35/scratch/esx01.home.uw.cz 0

and you can list the subdirectories in your custom scratch location
[root@esx01:~] ls -la /vmfs/volumes/02c3c6c5-53c72a35/scratch/esx01.home.uw.cz
total 28
d---------    7 root     root          4096 May 12 21:45 .
d---------    4 root     root          4096 May  3 20:47 ..
d---------    2 root     root          4096 May  3 21:17 core
d---------    2 root     root          4096 May  3 21:17 downloads
d---------    2 root     root          4096 May 28 09:30 log
d---------    3 root     root          4096 May  3 21:17 var
d---------    2 root     root          4096 May 12 21:45 vsantraces
Please note that the new scratch location contains a core dump subdirectory (core) and also a log subdirectory (log).

Other considerations
I usually change the ESXi core dump and log directory locations to a shared datastore. This is done with the following ESXi host advanced settings, fully described in this VMware KB:
  • Core dump (scratch) location: ScratchConfig.ConfiguredScratchLocation
  • Log location: Syslog.global.logDir and optionally Syslog.global.logDirUnique if you want to redirect all ESXi hosts to the same directory
I also recommend sending logs to a remote syslog server over the network, which is done with the advanced setting
  • Remote syslog server(s): Syslog.global.logHost
ESXi core dumps can also be transferred over the network to a central core dump collector. This has to be configured with the following esxcli commands.
esxcli system coredump network set --interface-name vmk0 --server-ipv4 [Core_Dump_Server_IP] --server-port 6500
esxcli system coredump network set --enable true
esxcli system coredump network check
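To review the resulting network core dump configuration later, the corresponding get subcommand in the same esxcli namespace prints the configured collector and whether it is enabled:
esxcli system coredump network get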

Wednesday, May 06, 2015

DELL Force10 VLT and vSphere Networking

DELL Force10 VLT is a multi-chassis LAG technology. I have written several blog posts about VLT, so for a VLT introduction look at http://blog.igics.com/2014/05/dell-force10-vlt-virtual-link-trunking.html. All Force10-related posts are listed here. By the way, DELL Force10 S-Series switches have been renamed to DELL S-Series switches with DNOS 9 (DNOS stands for DELL Network Operating System); however, I'll keep using Force10 and FTOS in this series to keep it uniform.

In this blog post I would like to discuss a Force10 VLT specific failure scenario - what happens when the VLTi fails.

A VLT domain is actually a cluster of two VLT nodes (peers). One node is configured as primary and the second node as secondary. The VLTi is the peer link between the two VLT nodes. The main role of the VLTi peer link is to synchronize MAC address and interface assignments, which is used for optimal traffic forwarding over the VLT port-channels. In other words, if everything is up and running, data traffic over the VLT port-channels (virtual LAGs) is optimized and the optimal link is chosen to eliminate traffic across the VLTi. The VLTi is used for data traffic only in case of a VLT link failure on one node while the corresponding VLT link is still available on the other node.

Now you can ask what happens in case of a VLTi failure. In this situation the backup link kicks in and acts as a backup communication link for the VLT domain cluster. This situation is called the split-brain scenario, and the exact behavior is nicely described in the VLT Reference Guide:
The backup heartbeat messages are exchanged between the VLT peers through the backup links of the OOB Management network. When the VLTI link (port-channel) fails, the MAC/ARP entries cannot be synchronized between the VLT peers through the failed VLTI link, hence the Secondary VLT Peer shuts the VLT port-channel forcing the traffic from the ToR switches to flow only through the primary VLT peer to avoid traffic black-hole. Similarly the return traffic on layer-3 also reaches the primary VLT node. This is Split-brain scenario and when the VLTI link is restored, the secondary VLT peer waits for the pre-configured time (delay-restore) for the MAC/ARP tables to synchronize before passing the traffic. In case of both VLTi and backup link failure, both the VLT nodes take primary role and continue to pass the traffic, if the system mac is configured on both the VLT peers. However there would not be MAC/ARP synchronization.
With all that being said, let's look at some typical VLT topologies with a VMware ESXi host. The Force10 S4810 is an L3 switch, therefore the VLT domain can provide both switching and routing services. The upstream router is a single router for external connectivity. The ESXi host has two physical NIC interfaces.

First topology

The first topology uses VMware switch-independent connectivity. This is a very common and favorite ESXi network connectivity option because of its simplicity for the vSphere administrator.




The problem with this topology arises when the VLTi peer-link fails (red cross in the drawing). We already know that in this scenario the backup link kicks in and the VLT links from the secondary node are intentionally disabled (black cross in the drawing). However, our ESXi host is not connected via VLT, therefore the server-facing port stays up. The VLT domain doesn't know anything about the VMware vSwitch topology, therefore it must keep the port up, which results in a black-hole scenario (black circle in the drawing) for virtual machines pinned to VMware vSwitch Uplink 2.
I hear you. You ask what the solution to this problem is. I think there are two solutions. The first, out-of-the-box solution is to use VLT down to the ESXi host, which is depicted in the second topology later in this post. The second solution could be to leverage UFD (Uplink Failure Detection) and track the VLT ports together with the server-facing ports (a rough sketch follows). I did not test this scenario, but I think it should work and there is a big probability I'll have to test it soon.
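For completeness, a UFD configuration sketch might look like the snippet below. This is an untested assumption on my side: verify the uplink-state-group syntax against the FTOS configuration guide, and note that the port-channel and port numbers are placeholders only.
conf
uplink-state-group 1
  description Disable ESXi-facing ports when VLT uplinks go down
  upstream Port-channel 100
  downstream TenGigabitEthernet 0/10-20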

Second topology

The second topology leverages VMware LACP. LACP connectivity is obviously more VLT-friendly, because the VLT is established all the way down to the server and the downlink to the ESXi host is correctly disabled. Virtual machines are not pinned directly to the VMware vSwitch uplinks but are connected through the LACP virtual interface. That's the reason you will not experience a black-hole scenario for some of the virtual machines.







Conclusion

Server virtualization is nowadays present in every modern datacenter. That's the reason why virtual networking has to be taken into account in any datacenter network design. VMware switch-independent NIC teaming is simple for the vSphere administrator, but it can negatively impact network availability in some scenarios. Unfortunately the VMware standard virtual switch doesn't support dynamic port-channels (LACP), only static port-channels. A static port-channel should work correctly with VLT, but LACP is recommended because of the LACP keep-alive mechanism. LACP is available only with the VMware distributed virtual switch, which requires the highest VMware license (vSphere Enterprise Plus edition). The VMware distributed virtual switch with an LACP uplink is the best solution for Force10 VLT. In case of a budget or technical constraint, you have to design an alternative solution leveraging either a static port-channel (VMware calls it "IP hash load balancing") or FTOS UFD (Uplink Failure Detection) to mitigate the risk of the black-hole scenario.

Update 2015-05-13:
I have just realized that NPAR is actually a technical constraint preventing the use of port-channel technology on the ESXi host virtual switch. NPAR technology allows switch-independent network partitioning of physical NIC ports into multiple logical NICs. However, a port-channel cannot be configured on NPAR-enabled NICs, therefore UFD is probably the only solution to avoid the black-hole scenario when the VLT peer-link fails.

CISCO UCS Product Poster

Here is a nice poster depicting Cisco Unified Computing System components.

Wednesday, April 08, 2015

Force10 Link Dampening

First of all, let's explain why we should use link dampening.

Interface state changes occur when interfaces are administratively brought up or down or when an interface state changes on its own. Every time an interface changes state or flaps, routing protocols are notified of the status of the routes that are affected by the change in state. These protocols go through the momentous task of re-converging. Flapping therefore puts the status of the entire network at risk of transient loops and black holes. Link dampening minimizes the risk created by flapping by imposing a penalty for each interface flap and decaying the penalty exponentially. After the penalty exceeds a certain threshold, the interface is put into an Error-Disabled state and, for all practical purposes of routing, the interface is deemed to be "down". After the interface becomes stable and the penalty decays below a certain threshold, the interface comes up again and the routing protocols re-converge.

Dampening parameters:
Syntax: dampening [[[[half-life] [reuse-threshold]] [suppress-threshold]] [max-suppress-time]]
  • half-life - The number of seconds after which the penalty is decreased. The penalty decreases by half after the half-life period expires. The range is from 1 to 30 seconds. The default is 5 seconds.
  • reuse-threshold - The penalty value below which the interface state is changed to "up". The range is from 1 to 20000. The default is 750.
  • suppress-threshold - The penalty value above which the interface state is changed to "error disabled". The range is from 1 to 20000. The default is 2500.
  • max-suppress-time - The maximum time for which an interface can be suppressed. The range is from 1 to 86400 seconds. The default is four times the half-life value (20 seconds with the default half-life).

Dampening algorithm:
With each flap, Dell Networking OS penalizes the interface by assigning a penalty (1024) that decays exponentially depending on the configured half-life. After the accumulated penalty exceeds the suppress-threshold value, the interface moves to the Error-Disabled state. This interface state is deemed as "down" by all static and dynamic Layer 2 and Layer 3 protocols. The penalty is exponentially decayed based on the half-life timer. After the penalty decays below the reuse threshold, the interface is re-enabled.
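Dampening is configured per interface. A minimal sketch using the parameter values from the timing example below (the interface name is just a placeholder) could look like this:
conf
  interface tengigabit 0/1
  dampening 10 100 1000 60
  no shutdown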

Dampening settings timing example:
Let's say we have dampening 10 100 1000 60, i.e.
  • half-life = 10 seconds
  • reuse-threshold = 100
  • suppress-threshold = 1000
  • max-suppress-time = 60 seconds
Time after flap | Penalty | Port state | Comment
0s              | 1024    | Down       | Penalty set to 1024; penalty (1024) > suppress-threshold (1000), so the port state goes down.
10s             | 512     | Down       | Penalty decayed to 1024 / 2 = 512; penalty (512) > reuse-threshold (100), so the port stays down.
20s             | 256     | Down       | Penalty decayed to 512 / 2 = 256; penalty (256) > reuse-threshold (100), so the port stays down.
30s             | 128     | Down       | Penalty decayed to 256 / 2 = 128; penalty (128) > reuse-threshold (100), so the port stays down.
40s             | 64      | Up         | Penalty decayed to 128 / 2 = 64; penalty (64) < reuse-threshold (100), so the port state changes to up.