Wednesday, December 31, 2014

esxcli formatters

I have just read great blog post "Hidden esxcli Command Output Formats You Probably Don’t Know" where the author (Steve Jin) exposed undocumented esxcli options to choose different formatters of esxcli output. Following esxcli formatters are available:

  • xml
  • csv
  • keyvalue

Here is example of one particular esxcli command without formatter.
~ # esxcli system version get
   Product: VMware ESXi
   Version: 5.5.0
   Build: Releasebuild-1746018
   Update: 1
Here is example with csv formatter
~ # esxcli --formatter=csv system version get
Build,Product,Update,Version,
Releasebuild-1746018,VMware ESXi,1,5.5.0,

Here is example with keyvalue formatter
~ # esxcli --formatter=keyvalue system version get
VersionGet.Build.string=Releasebuild-1746018
VersionGet.Product.string=VMware ESXi
VersionGet.Update.string=1
VersionGet.Version.string=5.5.0

It is quite obvious how xml formatter output looks like, right? All these formatters can help with output parsing which can be very helpful for automation scripts. Even more formatters are available when debug option is used. Following esxcli debug formatters are available:

  • python
  • json
  • html
  • table
  • tree
  • simple


Here is example with json formatter
~ # esxcli --debug --formatter=json system version get
{"Build": "Releasebuild-1746018", "Product": "VMware ESXi", "Update": "1", "Version": "5.5.0"}

Conclusion

esxcli formatters might be very handy for those who do automation and need easier parsing of esxcli output. You probably know that esxcli can be also used from PowerCLI so formatters can significantly simplified the script code.   

Friday, December 19, 2014

Galera Cluster for MySQL vs MySQL (NDB) Cluster

In the past, I used to use MySQL database for lot of projects so I was really interested what is the progress in MySQL clustering technologies and what is possible today. Therefore I have attended very interesting webinar  about MySQL clustering possibilities. Official webinar name is "Galera Cluster for MySQL vs MySQL (NDB) Cluster. A High Level Comparison" and Webinar Replay & Slides are available online  here.

I was very impressed what is possible today and even I moved from software architecture and development to datacenter infrastructure architecture and consulting it is good to know what database clustering is available nowadays.

Monday, December 15, 2014

Disk queue depth in an ESXi environment

In the internet, there are lot of information and documents about ESXi and Disk Queue Depth but I didn't find any single document explained all details I would like to know and in the format for easy consumption. Different vendors have their specific recommendations and best practices but without deeper explanation and holistic view. Some documents are incomplete and some other have, in my opinion, wrong or at least strange information. That's the reason I have decided to write this blog post mainly for my own but I hope it helps to other people from the community.

When I did research I found Frank Denneman blog post from March 2009 where almost everything is very well and accurately explained but technology little bit changed over the time. Another great document is here on VMware Community Docs.

Disclaimer: I work for DELL Services therefore I work on lot of projects with DELL Compellent Storage Systems, DELL Servers and QLogic HBA's. Therefore this blog post is based on "DELL Compellent Best Practices with VMware vSphere 5.x" and "Qlogic Best Practices for VMware vSphere 5.x". I have all this equipment in the lab so I was able to test it and verify what's going on. However although this blog has information about specific hardware general principles are the same for any other HBAs and storage systems.

Some blog sections are just copy and paste from public documents mentioned above so all credits go there. Some other sections are based on my lab tests, experiments and thoughts.

What is queue depth?

Queue depth is defined as the number of disk transactions that are allowed to be “in flight” between an initiator and a target, where the initiator is typically an ESXi host HBA port/iSCSI initiator and the target is typically the Storage Center front-end port. Since any given target can have multiple initiators sending it data, the initiator queue depth is generally used to throttle the number of transactions being sent to a target to keep it from becoming “flooded”. When this happens, the transactions start to pile up causing higher latencies and degraded performance. That being said, while increasing the queue depth can sometimes increase performance, if it is set too high, there is an increased risk of over-driving the storage array. As data travels between the application and the storage array, there are several places that the queue depth can be set to throttle the number of concurrent disk transactions. The most common places queue depth can be modified are:
  • The application itself (Default=dependent on application)
  • The virtual SCSI card driver in the guest (Default=32)
  • The VMFS layer (DSNRO) (Default=32)
  • The HBA VMkernel Module driver (Default=64)
  • The HBA BIOS (Default=Varies)

HBA Queue Depth = AQLEN 

(Default Varies but in my QLogic card it is 2176)

Here are Compellent recommended BIOS settings for QLogic Fibre Channel card:
  1. The “connection options” field should be set to 1 for point to point only
  2. The “login retry count” field should be set to 60 attempts
  3. The “port down retry” count field should be set to 60 attempts
  4. The “link down timeout” field should be set to 30 seconds
  5. The “queue depth” (or “Execution Throttle”) field should be set to 255. This queue depth can be set to 255 because the ESXi VMkernel driver module and DSNRO can more conveniently control the queue depth
I don't think  point (5) is correct and up to date in Compellent Best Practices. Based on QLogic document HBA "Queue Depth" doesn't exist in QLogic BIOS and "Execution Throttle" is not used any more. QLogic Adapter Queue Depth is only managed by driver and default value is 64 for vSphere 5.x. For more info look at here.

I believe QLogic Host Bus Adapter has Queue Depth 2176 because this number is visible in esxtop as AQLEN (Adapter Queue Length) for each particular HBA (vmhba).

HBA LUN Queue Depth 

(Default=64)
HBA LUN Queue Depth is by default 64 but can be changed via VMkernel Module driver. Compellent recommends to set HBA max queue depth to 255 but this queue depth can be used just for single VM running on single LUN (disk). If more VMs are running on LUN than DSNRO has precedence. However, when SIOC (Storage I/O Control) is used then DSNRO is not used at all and HBA VMkernel Module Driver Queue Depth value is used for each device queue depth  (DQLEN). I think SIOC is beneficial in this situation because you can have deeper queues with dynamic queue management across all ESXi hosts connected to particular LUN. Please note, that special attention has to be taken for correct SIOC latency threshold especially on LUNs with sub-lun tiering.

If you want to check your Disk (aka LUN) Queue Depth value it is nicely visible in esxtop as DQLEN.

So here is the procedure how to change HBA VMkernel Module Driver Queue Depth.

Find the appropriate driver name for the module that is loaded for QLogic HBA:
~ # esxcli system module list | grep ql
qlnativefc                          true        true
In our case, the name of module for QLogic driver is qlnativefc.

Set the QLogic driver queue depth and timeouts using the esxcli command:
esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60"
You have to reboot ESXi host to apply module changes.

Below is description of QLogic Parameters we have just change
  • ql2xmaxqdepth (int) - Maximum queue depth to report for target devices. 
  • ql2xloginretrycount (int) - Specify an alternate value for the NVRAM login retry count. 
  • qlport_down_retry (int) - Maximum number of command retries to a port that returns a PORT-DOWN status.
You can verify QLogic driver options by following command:
esxcfg-module --get-options qlnativefc 
And affected change (after ESXi reboot) is visible on esxtop on disk devices as DQLEN.

VMFS Layer DSNRO 

(Default=32)
When two or more virtual machines share a LUN (logical unit number), this parameter controls the total number of outstanding commands permitted from all virtual machines collectively on the host to that LUN (this setting is not per virtual machine). For more information see KB

The HBA LUN queue depth (DQLEN) determines how many commands the HBA is willing to accept and process per LUN and target. If a single VM virtual machine is issuing IO, the HBA LUN Queue Depth (DQLEN) setting is applicable. When multiple VM’s are simultaneously issuing IO’s to the LUN, the ESX parameter Disk.SchedNumReqOutstanding (DSNRO) value becomes the leading parameter, and HBA LUN Queue Depth is ignored. Above statement is only true when SIOC is not enabled on the LUN (datastore).

Parameter can be changed to the value between 1 and 256 so when 
esxcli storage core device set -d -O
No of outstanding IOs with competing worlds (DSNRO) can be listed for each device
~ # esxcli storage core device list -d naa.6000d3100025e7000000000000000098
naa.6000d3100025e7000000000000000098
   Display Name: COMPELNT Fibre Channel Disk (naa.6000d3100025e7000000000000000098)
   Has Settable Display Name: true
   Size: 1048576
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.6000d3100025e7000000000000000098
   Vendor: COMPELNT
   Model: Compellent Vol
   Revision: 0504
   SCSI Level: 5
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: supported
   Other UIDs: vml.02000200006000d3100025e7000000000000000098436f6d70656c
   Is Local SAS Device: false
   Is Boot USB Device: false
   No of outstanding IOs with competing worlds: 32
The issue with DSNRO is that it must be set per devices (LUNs) on all ESXi hosts so it has negative impact on implementation and overall manageability. It is usually leaved on default value anyway because 32 outstanding (async) IOs is good enough on low latency SANs and properly designed storage devices with good response times. Bigger queues are beneficial on higher latency SANs and/or storage systems which need more parallelism (more parallel threads) to draw out more I/O's storage is physically capable of. It doesn't help in situations when storage doesn't have enough back end performance (usually spindles). Bigger queues can generate more outstanding IOs to storage system which can overload the storage and everything will be even worse. That's the reason why I think it is good idea to leave DSNRO on default value.

And once again. DSNRO is not used when SIOC is enabled because in that case "HBA LUN queue depth" is the leading parameter for disk queue depth (DQLEN).

Storage Target Port Queue Depth

Queues are on lot of places and you have to understand the path from end to end. A queue exist on the storage array controller port as well, this is called the “Target Port Queue Depth“. Modern midrange storage arrays, like most EMC- and HP arrays can handle around 2048 outstanding IO’s.
2048 IO’s sounds a lot, but most of the time multiple servers communicate with the storage controller at the same time. Because a port can only service one request at a time, additional requests are placed in queue and when the storage controller port receives more than 2048 IO requests, the queue gets flooded. When the queue depth is reached, this status is called (QFULL), the storage controller issues an IO throttling command to the host to suspend further requests until space in the queue becomes available. The ESXi host accepts the IO throttling command and decreases the LUN queue depth to the minimum value, which is 1! The QFULL condition is handled by the qlogic driver itself. The QFULL status is not returned to the OS. But some storage devices return BUSY rather than QFULL. BUSY errors are logged in the /var/log/vmkernel. Not every busy error is a Qfull error! Check the scsi sense codes indicated in the vmkernel message to determine what type of error it is. The VMkernel will check every 2 seconds to check if the QFULL condition is resolved. If it is resolved, theVMkernel will slowly increase the LUN queue depth to its normal value, usually this can take up to 60 seconds.

Compellent SC8000 front-end ports act as targets and have Target Port Queue Depth 1900+. What 1900+ means? Compellent is excellent storage operating system on top of commodity servers, IO cards, disk enclosures and disks. By the way, is it software-designed-storage or not? But back to queues, 1900+ means that Queue Depth depends on IO card used for front-end port but 1900 is the guaranteed minimum. So for our case let's assume we have Storage Target Port Queue Depth 2048.

All together will give as the holistic view
Here are end to end disk queue parameters:
  • DSNRO is 32 - this is default ESXi value.
  • HBA LUN Queue Depth is 64 by default in QLogic HBAs. Compellent recommends to change it  to 256 by parameter of HBA VMkernel Module Driver.
  • HBA Queue Depth is 2176.
  • Compellent Target Ports Queue Depths are 2048.
Everything is depicted in the figure below.
Disk Queue Depth - Holistic View

What is the optimal HBA LUN Queue Depth?

We already know 64 is the default value for QLogic and you can change it via VMkernel Module driver. So what is the optimal HBA LUN Queue Depth?

To prevent flooding the target port queue depth, the result of the combination of number of host paths + HBA LUN Queue Depth+ number of presented LUNs through the host port must be less than the target port queue depth. In short T >= P * Q * L

T = Target Port Queue Depth (2048)
P = Paths connected to the target port (1)
Q = HBA LUN Queue Depth (?)
L = number of LUN presented to the host through this port (20)

So when I have 20 LUNs exposed from Compellent storage system HBA LUN Queue Depth should be maximally Q = T / P /L = 2048 / 1 / 20 = 102.4 for single ESXi host. When I have 16 hosts HBA LUN Queue Depth must be divided by 16 so the value for this particular scenario is  6.4.
This number is without any overbooking which is not the real case. Current QLogic HBA LUN Queue Depth default value (64) introduces fan in fan out ratio 10:1 which is probably based on long term experience with virtualized workloads. In the past Qlogic has lower default values 16 in ESX 3.5 and 32 in ESX 4.x which had lower fan in fan out ratios (2.5:1 and 5:1).

With Compellent recommended QLogic HBA LUN Queue Depth 255 we will have, in this particular case, fan in fan out ratio 40 : 1. But only when SIOC is used. And when SIOC is used and normalized datastore latency is higher then SIOC threshold dynamic queue management kick in and disk queues are automatically throttled. So all is good.

What is my real Disk Queue Depth?

VMkernel Disk Queue Depth (DQLEN) is the number which matters. However, this number is dynamic and it depends on several factors. Here are scenarios you can have:
Scenario 1 - SIOC enabledWhen you have datastore with SIOC enabled then your DQLEN will be set to HBA LUN Queue Depth. It is by default 64 but Compellent recommends to set it to 255. But Compellent doesn't recommend to use SIOC.
Scenario 2 - SIOC disabled and you have only one VM on datastoreWhen you have just one VM on datastore your DQLEN will be set o HBA LUN Queue Depth. But how often do you have just one VM on datastore?
Scenario 3 - SIOC disabled and you have two or more VMs on datastoreIn this scenario your DQLEN will be set to DSNRO which is by default 32 and it has precedence over HBA LUN Queue Depth.

ESXi Disk Queue Management - None, Adaptive Queuing or Storage I/O Control?

By default ESXi doesn't use any Disk Queue Length (DQLEN) throttling mechanism and your DQLEN is set to DSNRO (32) when two or more VM (vDisks) are on the datastore. When only one VM disk is on the datastore your DQLEN will be set to HBA LUN Queue Depth (default=64, Compellent recommends 255).

When you have enabled Adaptive Queuing or Storage I/O Control your Disk Queue Length (DQLEN) will be throttled automatically when I/O congestion occurs. The difference between Adaptive Queuing and SIOC is how I/O congestion is detected.

Adaptive queuing is waiting for storage SCSI Sense Code of BUSY or QUEUE FULL status on the I/O path. For adaptive queueing, there are some advanced parameters which must be set on a per host basis. From the Configuration tab, you have to select Software Advanced Settings. Navigate to Disk and the two parameters you need are Disk.QFullSampleSize and Disk.QFullThreshold. By default, QFullSampleSize is set to 0, meaning that it is disabled. When this value is set, adaptive queueing will kick in and half the queue depth when this number of queue full conditions is reported by the array. The QFullThreshold is the number of good status to receive before incrementing the queue once again.

Storage I/O Control uses the concept of a congestion threshold, which is based on latency. But not normal latency of datastore on only one particular ESXi host but there is sophisticated algorithm preparing normalized datastore latency of all datastore latencies across ESXi hosts using one particular datastore. On top of datastore wise normalization there is another type of normalization which takes in to account just "normal" I/O sizes. What is normal IO size? I cannot find this detail but I think I read somewhere that normal I/O size is between 2KB and 16KB.  Update 2016-10-07: Now when I work for VMware I have access to ESXi source codes and if I read code correctly it seems to me that IO size normalization works little bit different. I cannot publish exact formula as it is VMware intelectual property and implementation details but the whole idea is that for example 1MB I/Os would significantly skewed latency against normally accepted latency values in the storage industry therefore latency for bigger I/O's are somehow adjusted based on I/O size.

Information in above section are based on Cormag Hogan blog post.

Conclusion

Based on information above I think that default ESXi and QLogic values are the best for general workloads and tuning queues is not something I would recommend to do. It is important to know that queues tuning does not help with performance in most cases. Queues are used for latency improvements during transient I/O burst. If you have storage performance issues usually it is because of storage system performance and not about queues in the network path. Bigger Queue Depth can help with more asynchronous I/Os to storage subsystem which can be beneficial for storage performance in environment with just few VMs but when you have hundreds of VMs lot of I/Os are flying to the storage system so deeper queues will not help you anyway.

Question I still need answer to my self is why Compellent doesn't recommend VMware Storage I/O Control and Adaptive Queuing for dynamic queue management which is in my opinion very good thing.

UPDATE 2014-12-16: I have downloaded the latest Compellent Best Practices and there is new statement about SIOC. SIOC can be enabled but you must know what's happening and if it is beneficial for you. Here is the snip from the document
SIOC is a feature that was introduced in ESX/ESXi 4.1 to help VMware administrators regulate storage performance and provide fairness across hosts sharing a LUN. Due to factors such as Data Progression and the fact that Storage Center uses a shared pool of disk spindles, it is recommended that caution is exercised when using this feature. Due to how Data Progression migrates portions of volumes into different storage tiers and RAID levels at the block level, this could ultimately affect the latency of the volume, and trigger the resource scheduler at inappropriate times. Practically speaking, it may not make sense to use SIOC unless pinning particular volumes into specific tiers of disk.
I still believe SIOC is the way to go and special attention has to be paid to SIOC latency threshold. Compellent recommends to keep it on default value 30 milliseconds which makes perfect sense. Storage System will do all hard work for you but when there is huge congestion and normalized latency is too high dynamic ESXi disk queue management can kick in. It make sense for me.

I also believe that Adaptive Queuing is really good and practical safety mechanism when your storage array has full queues in storage front-end ports or LUNs. If Adaptive Queueing is not used, even SIOC cannot help you with LUNs/Datastores issues (Datastore Disconnections) because SIOC algorithm is based on device response time but not on queue full storage response. Therefore, I strongly recommend to enable SIOC together with Adaptive Queuing unless your storage vendor has really good justification to not do so. SIOC will help you with storage traffic throttling during high device response times and Adaptive Queuing when storage array queues are full and device cannot accept new I/O's. However, Adaptive Queuing should be configured in concert with your storage vendor. For more information how to enable Adaptive Queuing read VMware KB 1008113. Please note, that SIOC and Adaptive Queuing are just safety mechanisms how to mitigate impacts of storage issues but the root cause is the Capacity Planning on storage array.

If any of you with a deep understanding of vSphere and storage architectures see an error in my analysis, please let me know so that I can correct appropriately.

Useful links:

Wednesday, December 03, 2014

Force10: How to prepare logs and configs for DELL tech support

The command show tech-support will show you all configurations and logs required for troubleshooting on the console. It is usually not  what you want because you have to transfer support file somewhere. Therefore you can simply save it to internal flash device as an file and transfer it via ftp, tftp or scp to some computer.
F10-S4810-A#show tech-support | save flash://tech-supp.2014-12-03
Start saving show command report .......
The file tech-supp.2014-12-03 is created in the flash device and you can list it.
F10-S4810-A#dir
Directory of flash:
  1  drwx       4096   Jan 01 1980 00:00:00 +00:00 .
  2  drwx       3072   Dec 04 2014 03:49:32 +00:00 ..
  3  drwx       4096   Mar 01 2004 21:28:04 +00:00 TRACE_LOG_DIR
  4  drwx       4096   Mar 01 2004 21:28:04 +00:00 CORE_DUMP_DIR
  5  d---       4096   Mar 01 2004 21:28:04 +00:00 ADMIN_DIR
  6  drwx       4096   Mar 01 2004 21:28:06 +00:00 RUNTIME_PATCH_DIR
  7  drwx       4096   Nov 04 2014 01:09:36 +00:00 CONFIG_TEMPLATE
  8  -rwx       6731   Dec 04 2014 01:46:00 +00:00 startup-config
  9  -rwx       6285   Nov 07 2013 04:41:32 +00:00 startup-config.bak
 10  -rwx   25614609   Feb 21 2013 17:24:22 +00:00 FTOS-SE-8.3.12.1.bin
 11  -rwx     524528   Feb 21 2013 17:40:42 +00:00 U-boot.1.2.0.2.bin
 12  -rwx       7202   Nov 07 2013 05:35:14 +00:00 david.pasek-config
 13  drwx       4096   May 07 2014 04:06:42 +00:00 CONFD_LOG_DIR
 14  -rwx       4094   Jul 18 2014 02:25:28 +00:00 inetd.conf
 15  -rwx          0   Jul 19 2014 00:14:56 +00:00 pdtrc.lo0
 16  -rwx       6125   Jul 15 2014 02:05:52 +00:00 backup-config.dp
 17  -rwx         80   Jul 18 2014 21:45:22 +00:00 memtrc.lo0
 18  -rwx     199770   Dec 04 2014 01:46:08 +00:00 confd_cdb.tar.gz
 19  -rwx          0   Jul 18 2014 21:44:52 +00:00 pdtrc3.lo0
 20  drwx       4096   Jul 19 2014 00:14:56 +00:00 bgp-trc
 21  -rwx       6058   Oct 25 2014 05:13:10 +00:00 config-vrf-no-vrrp
 22  -rwx       6110   Oct 25 2014 05:05:08 +00:00 config-vrf-vrrp
 23  -rwx     135698   Dec 04 2014 03:50:16 +00:00 tech-supp.2014-12-03

Now it is very easy to transfer this single file somewhere via ssh(scp), ftp or tftp. You can use command similar to 
copy tech-supp.2014-12-03 scp://
I don't anybody wish troubles but hope this helps with troubleshooting when needed ...
 



Force10 switch port iSCSI configuration

Here is snippet of Force10 switch port configuration of port facing storage front-end port or host NIC port dedicated just for iSCSI. In other words this is non-DCB switch port configuration.
interface TenGigabitEthernet 0/12
  no ip address
  mtu 12000
  switchport
  flowcontrol rx on tx off
  spanning-tree rstp edge-port
  spanning-tree rstp edge-port bpduguard shutdown-on-violation
  storm-control broadcast 100 in
  storm-control unknown-unicast 50 in
  storm-control multicast 50 in
  no shutdown
MTU is set to 12000 which is the supported maximum and allowing to use Jumbo Frames. I know 9216 is enough MTU for iSCSI however modern switches is capable for larger MTU without performance overhead so why not use it?

Flow-control is enabled for receive traffic so switch can sen PAUSE frame to initiator or target to wait a while in case switch port buffers are full. Transience flow-control is not enabled because switch port buffers are not deep enough to help.

Switch port is set to edge-port to eliminate spanning-tree algorithm because connected devices are edge devices - server (iSCSI initiator) or storage (iSCSI target).

[Updated at 2015-01-19 based on Martin's comment]
Bpduguard can be optionally enabled on switch port as prevention against cabling mistake (also known as an human factor) because iSCSI storage should not send any BPDU. 
[/Updated]

Storm control of broadcasts, multicasts and unknown unicasts is enabled to eliminate unwanted storms. Please note that unicast storm may not be enabled because iSCSI generates lot of traffic which can be possible identified as unicast storm.

That's it for iSCSSI network configuration of dedicated ports. When you use shared ports for iSCSI and LAN traffic you should use DCB. DCB configuration is similar to configuration above but replacing flow-control with priority-flow-control (PFC) and leveraging link QoS technology (ETS) based on 802.1p CoS. You can check DCB design justification and sample configuration for iSCSI here

And as always, any comment is really appreciated. 

Saturday, November 29, 2014

The ZALMAN ZM VE200 SATA hard disk caddy with DVD/HDD/FDD emulation

I have just bought external USB drive with DVD emulation from ISO file. That's should be pretty handy for OS installs. I'm looking forward for first ESXi installation directly from ISO file.

Here is nice and useful tutorial how to use it.

Friday, November 21, 2014

Announcing the VMware Learning Zone

As a VMware vExpert I had a chance to use beta access to VMware Learning Zone. I blogged about my experience here. VMware Learning Zone has been officially announced today.

VMware Learning Zone is a new subscription-based service that gives you a full year of unlimited, 24/7 access to official VMware video-based training. Top VMware experts and instructors discuss solutions, provide tips and give advice on a variety of advanced topics. Your VMware Learning Zone subscription gives you:

  • Easy to consume training on the latest products and technologies
  • Powerful search functionality to find the answers you need fast
  • Content that delivers exactly the knowledge you need
  • Mobile access for on the go viewing
  • Much more

Learn more here.

ISCSI Best Practices

General ISCSI Best Practices
  • Separate VLAN for iSCSI traffic. 
  • Two separate networks or VLANs for multipathed iSCSI. 
  • Two separate IP subnets for the separate networks or VLANs in multipathed iSCSI. 
  • Gigabit Full Duplex connectivity between Storage Center and all local Servers 
  • Auto-Negotiate for all switches that will correctly negotiate at Gigabit Full Duplex 
  • Gigabit Full Duplex hard set for all iSCSI ports, for both Storage Center and Servers for switches that do not correctly negotiate 
  • Bi-Directional Flow Control enabled for all Switch Ports that servers or controllers are using for iSCSI traffic. 
  • Bi-Directional Flow Control enabled for all Server Ports used for iSCSI (Storage Center and QLogic HBA's automatically enable it). 
  • Bi-Directional Flow Control enabled for all ports that handle iSCSI traffic. This includes all devices in between two sites that are used for replication. 
  • Unicast storm control disabled on every switch that handles iSCSI traffic. 
  • Multicast disabled at the switch level for any iSCSI VLANs. 
  • Multicast storm control enabled (if available) when multicast cannot disabled. 
  • Broadcast disabled at the switch level for any iSCSI VLANs. 
  • Broadcast storm control enabled (if available) when broadcast cannot disabled. 
  • Routing disabled between regular network and iSCSI VLANs. 
  • Do not use Spanning Tree (STP or RSTP) on ports which connect directly to end nodes (the server or Compellent controllers iSCSI ports.) If you must use it, enable the Cisco PortFast option or equivalent on these ports so that they are configured as edge ports. 
  • Ensure that any switches used for iSCSI are of a non-blocking design. 
  • When deciding which switches to use, remember that you are running SCSI traffic over it. Be sure to use a quality managed enterprise class networking equipment. It is not recommended to use SBHO (small business/home office) class equipment outside of lab/test environments. 
For Jumbo Frame Support
  • Some switches have limited buffer sizes and can only support Flow Control or Jumbo Frames, but not both at the same time. Compellent strongly recommends choosing Flow Control. 
  • All devices connected through iSCSI need to support 9k jumbo frames. 
  • All devices used to connect iSCSI devices need to support it. 
  • This means every switch, router, WAN Accelerator and any other network device that will handle iSCSI traffic needs to support 9k Jumbo Frames. 
  • If is not 100% positive that every device in iSCSI network supports 9k Jumbo Frames, then NOT turn on Jumbo Frames. 
  • Because devices on both sides (server and SAN) need Jumbo Frames enabled, change disable to enable Jumbo Frames is recommended during a maintenance window. If servers have it enabled first, the Storage System will not understand their packets. If Storage System enables it first, servers will not understand its packets.

VMware ESXi iSCSI tunning

  • Disabling "TCP Delayed ACK" (esxcli iscsi adapter param set -A vmhba33 -k DelayedAck -v 0 - command not tested)
  • Adjust iSCSI Login Timeout (esxcli iscsi adapter param set -A vmhba33 -k LoginTimeout -v 60)
  • Disable large receive offload (LRO) (esxcli system settings advanced set -o /Net/TcpipDefLROEnabled 0 or esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled)
  • Set up Jumbo Frames is configured end to end (esxcli network vswitch standard set -m 9000 -v vSwitch2 and  esxcli network ip interface set -m 9000 -i vmk1)
  • Set up appropriate multi pathing based on iSCSI storage system
  • FlowControl is enabled on ESXi by default. To display FlowControl settings use ethtool --show-pause vmnic0 or esxcli system module parameters list --module e1000 | grep "FlowControl"
If you know about some other best practice, tuning setting or recommendation don't hesitate to leave comment below this blog post. 

Saturday, November 15, 2014

How to quickly get changed ESXi advanced settings?

Below is esxcli command to list ESXi Advanced Settings that have changed from the system defaults:
esxcli system settings advanced list -d
Here is real example form my ESXi host in lab ...
~ # esxcli system settings advanced list -d
   Path: /UserVars/SuppressShellWarning
   Type: integer
   Int Value: 1
   Default Int Value: 0
   Min Value: 0
   Max Value: 1
   String Value:
   Default String Value:
   Valid Characters:
   Description: Don't show warning for enabled local and remote shell access
You can see that I'm suppressing Shell Warning because I really want to have SSH enabled and running on my lab ESXi all the time.

If you want list kernel settings there is another command
esxcli system settings kernel list
and you can also used option -d to get just changed settings from default.

Friday, November 14, 2014

Virtualisation Design & Project Framework

Gareth Hogarth wrote excellent high level plan (aka methodology, framework) how to properly deliver virtualization project as a turn key solution. I used very similar approach and not only for virtualization project but to any IT project where I have a role of Leading Architect. I have never written a blog post about this particular topic because it is usually internal intellectual property  of any consulting organization. So if you have never seen any similar methodology look at Gareth's post to get an idea of project phases and overall project process. It is good to note that all these methodologies are just frameworks and frameworks are usually good starting points which doesn't stop you to improve it to fulfill all specific project requirements and make your project successful.

Friday, November 07, 2014

40Gb over existing LC fiber optics

Do you know DELL has QSFP+ LM4 transciever allowing 40Gb traffic up to 160m on LC OM4 MMF (multi mode fiber) or up to 2km on LC SMF (single mode fiber)?


Use Case:  

This optic has an LC connection and is ideal for customers who want to use existing LC fiber.  It can be used for 40GbE traffic up to 160m on MultiMode Fiber OR 2km on Single Mode fiber.

Specification

Periferal Type: DELL QSFP+ LM4
Connection: LC Connection, Dulplex Multi-Mode Fiber or Dulpex Single-Mode Fiber
Max Distance: 140m OM3 or 160m OM4 MMF, 2km SMF
Transmitter Output Wavelength (nm): 1270 to 1330
Transmit Output Power (dBm): -7.0 to 3.5 [avg power per lane]
Receive Input Power (dBm): -10.0 to 3.5 [avg power per lane]
Temperature: 0 to 70C
Power:  3.5W max

Based on wavelength range 1270 to 1330 I assume 40Gb is achieved as 4 x 10Gb leveraging wavelength-division multiplexing (CWDM) on following wave lengths:

  • 1270 nm
  • 1290 nm
  • 1310 nm
  • 1330 nm


Thursday, November 06, 2014

ESXi Network Troubleshooting

Introduction

As VMware Expert I had a chance and privilege to use VMware Learning Zone. There are excellent training videos. Today I would like to blog about useful commands trained on video training “Network Troubleshooting at the ESXi Command Line”.  If you ask me I have to say that Vmware Learning Zone has very valuable content and it come really handy during real troubleshooting. 

NIC Adapters Information

To see Network Interface Cards Information you can run following command
~ # /usr/lib/vmware/vm-support/bin/nicinfo.sh | more
Network Interface Cards Information.

Name    PCI Device     Driver  Link  Speed  Duplex  MAC Address        MTU   Description
----------------------------------------------------------------------------------------
vmnic0  0000:001:00.0  bnx2    Up     1000  Full    14:fe:b5:7d:8d:05  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic1  0000:001:00.1  bnx2    Up     1000  Full    14:fe:b5:7d:8d:07  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic2  0000:002:00.0  bnx2    Up     1000  Full    14:fe:b5:7d:8d:6d  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic3  0000:002:00.1  bnx2    Up     1000  Full    14:fe:b5:7d:8d:6f  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX

NIC:  vmnic0

NICInfo:
   Advertised Auto Negotiation: true
   Advertised Link Modes: 1000baseT/Full, 2500baseT/Full
   Auto Negotiation: true
   Cable Type: FIBRE
   Current Message Level: -1
   Driver Info:
      NICDriverInfo:
         Bus Info: 0000:01:00.0
         Driver: bnx2
         Firmware Version: 7.8.53 bc 7.4.0 NCSI 2.0.13
         Version: 2.2.3t.v55.7
   Link Detected: true
   Link Status: Up
   Name: vmnic0
   PHY Address: 2
   Pause Autonegotiate: false
   Pause RX: true
   Pause TX: true
   Supported Ports: TP, FIBRE
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: true
   Transceiver: internal
   Wakeon: MagicPacket(tm)
Ring parameters for vmnic0:
Pre-set maximums:
RX:             4080
RX Mini:        0
RX Jumbo:       16320
TX:             255
Current hardware settings:
RX:             255
RX Mini:        0
RX Jumbo:       0
TX:             255

Output above is sniped just for vmnic0. You can see useful information like PCI Device ID, Driver, Link Status, Speed, Duplex and MTU for each vmnic.
It also shows detail driver information, FlowControl (Pause Frame) status, cable type. etc.
To find particular vmnic PCI Vendor ID's use command vmkchdev
~ # vmkchdev -l | grep vmnic0
0000:01:00.0 14e4:163a 1028:02dc vmkernel vmnic0

PCI Slot: 0000:01:00.0
VID (Vendor ID): 14e4
DID (Device ID): 163a
SVID (Sub-Vendor ID): 1028
SSID (Sub-Device ID): 02dc

You can use PCI devices Vendor ID’s  to find the latest drivers at VMware Compatibility Guide (http://www.vmware.com/go/hcl/).


Below is another command how to find full details of all PCI devices.
esxcli hardware pci list
If you are interested just for particular vmnic PCI details command below can be used.
~ # esxcli hardware pci list | grep -B 6 -A 29 vmnic0
000:001:00.0
   Address: 000:001:00.0
   Segment: 0x0000
   Bus: 0x01
   Slot: 0x00
   Function: 0x00
   VMkernel Name: vmnic0
   Vendor Name: Broadcom Corporation
   Device Name: Broadcom NetXtreme II BCM5709S 1000Base-SX
   Configured Owner: Unknown
   Current Owner: VMkernel
   Vendor ID: 0x14e4
   Device ID: 0x163a
   SubVendor ID: 0x1028
   SubDevice ID: 0x02dc
   Device Class: 0x0200
   Device Class Name: Ethernet controller
   Programming Interface: 0x00
   Revision ID: 0x20
   Interrupt Line: 0x0f
   IRQ: 15
   Interrupt Vector: 0x2b
   PCI Pin: 0x75
   Spawned Bus: 0x00
   Flags: 0x0201
   Module ID: 4125
   Module Name: bnx2
   Chassis: 0
   Physical Slot: 0
   Slot Description: Embedded NIC 1
   Passthru Capable: true
   Parent Device: PCI 0:0:1:0
   Dependent Device: PCI 0:0:1:0
   Reset Method: Link reset
   FPT Sharable: true

Note: same command can be used for HBA cards by substituting vmnic0 by vmhba0

VLAN Sniffing

The commands below enable VLAN statistics collection on particular vmnic which can be shown and used for troubleshooting.  
esxcli network nic vlan stats set --enabled=true -n vmnic0
~ # esxcli network nic vlan stats get -n vmnic0
VLAN 0
   Packets received: 22
   Packets sent: 0

VLAN 22
   Packets received: 21
   Packets sent: 10

VLAN 201
   Packets received: 28
   Packets sent: 0

VLAN 202
   Packets received: 28
   Packets sent: 0

VLAN 204
   Packets received: 5
   Packets sent: 0

VLAN 205
   Packets received: 5
   Packets sent: 0

Don’t forget to disable VLAN statistics after troubleshooting.

esxcli network nic vlan stats set --enabled=false -n vmnic0


VMkernel Arp Cache

To work with ESXi ARP cache you can use command
esxcli network ip neighbor  
Below is example how to list ARP entries …
~ # esxcli network ip neighbor list
Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    933 sec         Unknown

You can see there just default gateway 10.2.22.1
Let’s ping some other device in the same broadcast domain and look at ARP entries again.
~ # ping 10.2.22.51
PING 10.2.22.51 (10.2.22.51): 56 data bytes
64 bytes from 10.2.22.51: icmp_seq=0 ttl=128 time=0.802 ms

~ # esxcli network ip neighbor list
Neighbor    Mac Address        Vmknic    Expiry  State  Type
----------  -----------------  ------  --------  -----  -------
10.2.22.51  00:0c:29:4a:5b:ba  vmk0    1195 sec         Unknown
10.2.22.1   5c:26:0a:ae:5a:c6  vmk0     878 sec         Unknown

Now you can see  entry for device 10.2.22.51 in ARP table as well. Below is another command to remove ARP entry from ARP table.
~ # esxcli network ip neighbor remove -v 4 -a 10.2.22.51
… and let’s check if ARP entry has been removed.
~ # esxcli network ip neighbor list
Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    817 sec         Unknown

Note: ESXi ARP timeout is 1200 second therefore remove command can be handy in some situations.

VMkernel Routing

Since vSphere 5.1 it is possible to have more than one networking stack. Normally you work with default networking stack.
To show ESXi routing table you can use command
esxcli network ip route ipv4 list  
~ # esxcli network ip route ipv4 list
Network    Netmask        Gateway    Interface  Source
---------  -------------  ---------  ---------  ------
default    0.0.0.0        10.2.22.1  vmk0       MANUAL
10.2.22.0  255.255.255.0  0.0.0.0    vmk0       MANUAL

You can see default gateway 10.2.22.1 used for default networking stack.
Command esxcli network ip connection list shows all IP network connections from and to ESXi host.
~ # esxcli network ip connection list
Proto  Recv Q  Send Q  Local Address                    Foreign Address     State        World ID  CC Algo  World Name
-----  ------  ------  -------------------------------  ------------------  -----------  --------  -------  ---------------
tcp         0       0  127.0.0.1:8307                   127.0.0.1:54854     ESTABLISHED     34376  newreno  hostd-worker
tcp         0       0  127.0.0.1:54854                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:443                    127.0.0.1:54632     ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:54632                  127.0.0.1:443       ESTABLISHED   1495503  newreno  python
tcp         0       0  127.0.0.1:8307                   127.0.0.1:61173     ESTABLISHED     34806  newreno  hostd-worker
tcp         0       0  127.0.0.1:61173                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:80                     127.0.0.1:60974     ESTABLISHED     34267  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:60974                  127.0.0.1:80        ESTABLISHED     35402  newreno  sfcb-vmware_bas
tcp         0       0  10.2.22.101:80                   10.44.44.110:50351  TIME_WAIT           0
tcp         0       0  127.0.0.1:5988                   127.0.0.1:14341     FIN_WAIT_2      35127  newreno  sfcb-HTTP-Daemo
tcp         0       0  127.0.0.1:14341                  127.0.0.1:5988      CLOSE_WAIT    1473527  newreno  hostd-worker
tcp         0       0  127.0.0.1:8307                   127.0.0.1:45011     ESTABLISHED     34806  newreno  hostd-worker
tcp         0       0  127.0.0.1:45011                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work

NetCat

Netcat program (nc) is available on ESXi and it can test TCP connectivity to some IP target.
~ # nc -v 10.2.22.100 80
Connection to 10.2.22.100 80 port [tcp/http] succeeded!

TraceNet

Tracenet is very handy program available in ESXi to identify also latencies inside vmkernel IP stack.
~ # tracenet 10.2.22.51
Using interface vmk0 ...
Time         0.068 0.023 0.019 ms
Location:    ESXi-Firewall
Time         0.070 0.025 0.020 ms
Location:    VLAN_InputProcessor@#
Time         0.073 0.027 0.022 ms
Location:    vSwitch0: port 0x2000004
Time         0.089 0.030 0.024 ms
Location:    VLAN_OutputProcessor@#
Time         0.090 0.031 0.025 ms
Location:    DC01
Endpoint:       10.2.22.51
Roundtrip Time: 0.417 0.195 0.196 ms


Dropped packets

In this section are commands to verify dropped packets on different places of VMkernel Ip stack.
Command net-stats –l list all devices (Clients – nic-ports,vmk-ports, vm-ports) connected to VMware switch. You can simply identify to which vSwitch port number (PortNum) is device connected.
~ # net-stats -l
PortNum          Type SubType SwitchName       MACAddress         ClientName
33554434            4       0 vSwitch0         14:fe:b5:7d:8d:05  vmnic0
33554436            3       0 vSwitch0         14:fe:b5:7d:8d:05  vmk0
33554437            5       9 vSwitch0         00:0c:29:4a:5b:ba  DC01
33554438            5       9 vSwitch0         00:0c:29:f0:df:4c  VC01

Note: SubType is VM Hardware Version
vSwitch port numbers are important for following commands.
Command esxcli network port stats get –p shows statistics for particular vSwitch port.
~ # esxcli network port stats get -p 33554434
Packet statistics for port 33554434
   Packets received: 2346445
   Packets sent: 5853
   Bytes received: 295800113
   Bytes sent: 1225842
   Broadcast packets received: 1440669
   Broadcast packets sent: 336
   Multicast packets received: 896958
   Multicast packets sent: 120
   Unicast packets received: 8818
   Unicast packets sent: 5397
   Receive packets dropped: 0
   Transmit packets dropped: 0

You can also show filter statistics for ESXi firewall by command esxcli network port filter stats get –p 33554436
~ # esxcli network port filter stats get -p 33554436
Filter statistics for ESXi-Firewall
   Filter direction: Receive
   Packets in: 5801
   Packets out: 5660
   Packets dropped: 141
   Packets filtered: 150
   Packets faulted: 0
   Packets queued: 0
   Packets injected: 0
   Packet errors: 0

Filter statistics for ESXi-Firewall
   Filter direction: Transmit
   Packets in: 4893
   Packets out: 4887
   Packets dropped: 6
   Packets filtered: 6
   Packets faulted: 0
   Packets queued: 0
   Packets injected: 0
   Packet errors: 0

To show physical NIC statistics you have to use command esxcli network nic stats get –n vmnic0
~ # esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
   Packets received: 2350559
   Packets sent: 8083
   Bytes received: 312690659
   Bytes sent: 5791889
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0

Packet Capture

If you want to do deeper network troubleshooting you can do packet capturing on ESXi host. You have two tools available for packet capturing

  •  tcpdump-uw (example: tcpdump-uw –I vmk0 –s0 –C100M –W 10 –w /var/tmp/test.pcap)
  • pktcap-uw
pktcap Examples:
  • pktcap-uw –uplink vmnicX –capture UplinkRcv
  • pktcap-uw –uplink vmnicX –capture UplinkSnd
  •    you can filter for icmp –proto 0x01 or beacon probes –ethtype 0x8922