Sunday, July 24, 2016

ESXi PSOD and HeartbeatPanicTimeout

A Purple Screen of Death (PSOD) is a diagnostic screen with white type on a purple background that is displayed when the VMkernel of an ESX/ESXi host experiences a critical error, becomes inoperative and terminates any virtual machines that are running.  For more info look here.

Nobody is happy to see a PSOD on an ESXi host, but it is important to say that it is just another safety mechanism to protect your server workloads, because a PSOD is intentionally initiated by the ESXi VMkernel when something really bad happens at a low level. It is usually related to a hardware, firmware or driver issue. You can find further information in the VMware KB article - Interpreting an ESX/ESXi host purple diagnostic screen (1004250).

The main purpose of this blog post is to explain the timing of a PSOD for just a single type of error message - "Lost heartbeat". If no heartbeat is received within a certain time interval, the PSOD looks like the screenshot below.

no heartbeat
There is no doubt that something serious has to have happened in the ESXi VMkernel. However, regardless of what exactly happened, the following two vSphere advanced settings control the time interval within which a heartbeat must be received, otherwise a PSOD is triggered.
  • ESXi - Misc.HeartbeatPanicTimeout
  • VPXD (aka vCenter) - vpxd.das.heartbeatPanicMaxTimeout
Let's start with the ESXi advanced setting Misc.HeartbeatPanicTimeout. It defines the interval in seconds after which the VMkernel panics if no heartbeat is received. Please don't mix this "Panic Heartbeat" up with the "HA network heartbeat". These two heartbeats are very different. The "HA network heartbeat" is the heartbeating mechanism between HA cluster members (master <-> slaves) over the ethernet network, whereas the "Panic Heartbeat" is a heartbeat inside a single ESXi host between the VMkernel and the host's management software components. You can see the "Panic Heartbeat" setting by issuing the following esxcli command
esxcli system settings advanced list | grep -A10 /Misc/HeartbeatPanicTimeout
 [root@esx01:~] esxcli system settings advanced list | grep -A10 /Misc/HeartbeatPanicTimeout  
   Path: /Misc/HeartbeatPanicTimeout  
   Type: integer  
   Int Value: 14  
   Default Int Value: 14  
   Min Value: 1  
   Max Value: 86400  
   String Value:  
   Default String Value:  
   Valid Characters:  
   Description: Interval in seconds after which to panic if no heartbeats received  

I have tested that Misc.HeartbeatPanicTimeout has different values in different situations. The default value is always 14 seconds, but:
  1. if you have a single standalone ESXi host not connected to an HA cluster, the effective value is 900 seconds
  2. if the ESXi host is a member of a vSphere HA cluster, then the value is 14 seconds
So now we know that the value on an ESXi host with HA enabled is 14 seconds (panicTimeoutMS = 14000) and it usually works without any problem. However, if you decide, for whatever reason, to change this value, it is worth knowing that on an HA-enabled ESXi host the HA code hardcodes a cap of 60 seconds on this value. It is a cap, so it does not change the value if it is already less than 60 seconds. However, if you use, for example, the value 900, it will be capped to 60 seconds anyway. I did a test in vSphere 6/ESXi 6 and it works exactly like that, and I assume it works the same way in vSphere 5/ESXi 5.
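If you really need to change the value, it can be done with esxcli. Below is a minimal sketch assuming you want to align the timeout with the 60-second HA cap; the value 60 is just an example for illustration, not a recommendation.

 # set the panic heartbeat timeout to 60 seconds (example value only)
 esxcli system settings advanced set -o /Misc/HeartbeatPanicTimeout -i 60
 # verify the change
 esxcli system settings advanced list -o /Misc/HeartbeatPanicTimeout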

Side note: It was very different in vSphere 4/ESXi 4 because the HA cluster code was rewritten from scratch in vSphere 5, but that is already history and I hope nobody uses vSphere 4 anymore.

Behavior justification:
The behavior described in the paragraph above makes perfect sense if you ask me. If you have a standalone ESXi host and you are experiencing some hardware issue, it is better to wait 900 seconds (15 minutes) before the ESXi host goes into the PSOD state, because the virtual machines running on top of this ESXi host cannot be automatically restarted on other ESXi hosts anyway. And guess what, if an ESXi host has a significant hardware failure, it most probably has a negative impact on the virtual machines running on top of this particular ESXi host, right? Unfortunately, if you have just a single ESXi host, vSphere cannot do anything for you.

On the other hand, if the affected ESXi host is a member of a vSphere HA cluster, then it is better to wait only 14 seconds (by default), or at most 60 seconds, and put the ESXi host into PSOD quicker, because the HA cluster will restart the affected virtual machines automatically, which helps mitigate the risk of unavailable virtual machines and, with them, the application services running inside these virtual machines.

So that's the explanation of how the ESXi setting /Misc/HeartbeatPanicTimeout behaves. Now we can look at what the vpxd.das.heartbeatPanicMaxTimeout setting is. My understanding is that vpxd.das.heartbeatPanicMaxTimeout is the vCenter (VPXD) global configuration for the ESXi advanced setting Misc.HeartbeatPanicTimeout. But don't forget that the HA cluster caps the Misc.HeartbeatPanicTimeout value on ESXi hosts as described above.

You can read further details about vpxd.das.heartbeatPanicMaxTimeout in VMware KB 2033250, but I think the following description is a little bit misleading.
"This option impacts how long it takes for a host impacted by a PSOD to release file locks and hence allow HA to restart virtual machines that were running on it. If not specified, 60s is used. HA sets the host Misc.HeartbeatPanicTimeout advanced option to the value of this HA option. The HA option is in seconds."
My understanding is that the description should be reworded to something like ...
"This option is in seconds and impacts how long it takes for ESXi host experiencing some critical issue to go into a PSOD. Setting vpxd.das.heartbeatPanicMaxTimeout is a global setting used for vCenter managed ESXi advanced option Misc.HeartbeatPanicTimeout however Misc.HeartbeatPanicTimeout is adjusted automatically in certain situations. 
In standalone ESXi host 900s is used. In vSphere HA Cluster ESXi host it is automatically changed to 14s and capped to maximum of 60s. This setting have indirect impact on time when file locks are released and hence allow HA cluster to restart virtual machines that were running on affected ESXi host."
Potential side effects and impacts
  • ESXi HA Cluster restart of virtual machines - if your Misc.HeartbeatPanicTimeout is set to 60 seconds, then the HA cluster will most probably try to restart the VMs on other ESXi hosts because the network heartbeat (also 14 seconds) will not be received. However, because the host is not in PSOD yet, the file locks still exist and the VM restarts will be unsuccessful.
  • ESXi Host Profiles - if you use the same host profile for HA-protected and non-protected ESXi hosts, then it can report a Misc.HeartbeatPanicTimeout difference during a compliance check.

Friday, July 15, 2016

DELL Force10 : DNS, Time and Syslog server configuration

It is generally good practice to have time synchronized on all network devices and to configure remote logging (syslog) to a centralized syslog server for proper troubleshooting and problem management. Force10 switches are no exception, therefore let's configure time synchronization and remote logging to my central syslog server - VMware Log Insight in my case.

I would like to use hostnames instead of IP addresses so let's start with DNS resolution, continue with time settings and finalize the mission with remote syslog configuration.

Below are my environment details:

  • My DNS server is 192.168.4.21
  • DNS domain name is home.uw.cz
  • I will use the following three internet NTP servers/pools - ntp.cesnet.cz, ntp.gts.cz and cz.pool.ntp.org
  • My syslog server is at syslog.home.uw.cz 

Step 1/ DNS resolution configuration
f10-s60#conf
f10-s60(conf)#ip name-server 192.168.4.21
f10-s60(conf)#ip domain-name home.uw.cz
f10-s60(conf)#ip domain-lookup
f10-s60(conf)#exit
Don't forget to configure "ip domain-lookup" because it is the command which enables domain name resolution.

Now let's test name resolution by ping www.google.com
f10-s60#ping www.google.com
Translating "www.google.com"
...domain server (192.168.4.21) [OK]
Type Ctrl-C to abort.
Sending 5, 100-byte ICMP Echos to 172.217.16.164, timeout is 2 seconds:
!!!!!
Success rate is 100.0 percent (5/5), round-trip min/avg/max = 40/44/60 (ms)
We should also test some local hostname in long format
f10-s60#ping esx01.home.uw.cz        
Translating "esx01.home.uw.cz"
...domain server (192.168.4.21) [OK]
Type Ctrl-C to abort.
Sending 5, 100-byte ICMP Echos to 192.168.4.101, timeout is 2 seconds:
!!!!!
Success rate is 100.0 percent (5/5), round-trip min/avg/max = 0/0/0 (ms)
and short format
f10-s60#ping esx01
Translating "esx01"
...domain server (192.168.4.21) [OK]
Type Ctrl-C to abort.
Sending 5, 100-byte ICMP Echos to 192.168.4.101, timeout is 2 seconds:
!!!!!
Success rate is 100.0 percent (5/5), round-trip min/avg/max = 0/0/0 (ms)
Step 2/ Set current date, time and NTP synchronization

You have to decide if you want to use GMT or local time. The hardware time should always be set to GMT, and you can configure a timezone and summer-time if you wish. So let's configure GMT time in the first place.
f10-s60#calendar set 15:12:46 july 15 2016
and test it
f10-s60#sho calendar
15:12:39  Fri Jul 15 2016
Ok, so hardware time is set correctly to GMT.

If you really want to play with timezone and summer-time you can do it in conf mode with following commands.

f10-s60(conf)#clock ?
summer-time         Configure summer (daylight savings) time
timezone             Configure time zone   
I prefer to keep GMT time everywhere because it, in my opinion, simplifies troubleshooting, problem management and capacity planning.
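The title of this step also promises NTP synchronization, so here is a minimal sketch of the NTP part using the three NTP servers/pools from my environment details. I am assuming the standard FTOS ntp server command; verify the exact syntax and show commands on your FTOS version.

f10-s60#conf
f10-s60(conf)#ntp server ntp.cesnet.cz
f10-s60(conf)#ntp server ntp.gts.cz
f10-s60(conf)#ntp server cz.pool.ntp.org
f10-s60(conf)#exit
f10-s60#show ntp associations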

Step 3/ Configuration of remote logging

FTOS by default doesn't use date and time for log messages. It uses uptime (time since the last boot), therefore you can only see when something happened relative to the last system boot. However, because we now have time configured properly, it is a good idea to change this default behavior to use date and time.
f10-s60(conf)#service timestamps log datetime
To be honest, you generally don't need date and time on log messages because the remote syslog server will add its own date and time to the messages, but I prefer to have both times - the time from the device and the time when the message arrived at the syslog server. If you want to disable time stamping on syslog messages, use no service timestamps [log | debug].

And now, finally, let's configure remote syslog server by single configuration command
f10-s60(conf)#logging syslog.home.uw.cz
Translating "syslog.home.uw.cz"
Translating "syslog.home.uw.cz"
...domain server (192.168.4.21) [OK]
And we are done. Now you can see incoming log messages in your syslog server. See screenshot of my VMware Log Insight syslog server.

VMware Log Insight with Force10 log messages.
Hope you find it useful and as always - any comment is very appreciated.

Monday, June 27, 2016

ESXi boot mode - UEFI or BIOS

Legacy BIOS bootstrapping along with a master boot record (MBR) has been used with x86-compatible systems for ages. The concept of MBRs was publicly introduced in 1983 with PC DOS 2.0. It is unbelievable that we are still using the same concept after more than 30 years.

However, there must be some limitations in a 30-year-old technology, mustn't there?

BIOS limitations (such as 16-bit processor mode, 1 MB addressable space and PC AT hardware) had become too restrictive for the larger server platforms. The effort to address these concerns began in 1998 and was initially called Intel Boot Initiative later renamed to EFI. In July 2005, Intel ceased its development of the EFI specification at version 1.10, and contributed it to the Unified EFI Forum, which has evolved the specification as the Unified Extensible Firmware Interface (UEFI).

What is EFI (or UEFI) firmware?

UEFI replaces the old Basic Input/Output System (BIOS). UEFI can be used on

  • physical server booting ESXi hypervisor or 
  • Virtual Machine running on top of ESXi hypervisor. 
This blog post is about the ESXi boot mode; however, just for completeness, I would like to mention that a VMware virtual machine with at least hardware version 7 supports UEFI as well. For further information about VM UEFI, look at [1].

Originally called Extensible Firmware Interface (EFI), the more recent specification is known as Unified Extensible Firmware Interface (UEFI), and the two names are used interchangeably.

EFI (Extensible Firmware Interface) is a specification for a new generation of system firmware. An implementation of EFI, stored in ROM or Flash RAM, provides the first instructions used by the CPU to initialize hardware and pass control to an operating system or bootloader. It is intended as an extensible successor to the PC BIOS, which has been extended and enhanced in a relatively unstructured way since its introduction. The EFI specification is portable, and implementations may be capable of running on platforms other than PCs.
 
For more information, see  the Wikipedia page for Unified Extensible Firmware Interface.

General UEFI Advantages

UEFI firmware provides several technical advantages over a traditional BIOS system:

  • Ability to boot from large disks (over 2 TB) with a GUID Partition Table (GPT)
  • CPU-independent architecture
  • CPU-independent drivers
  • Flexible pre-OS environment, including network capability
  • Modular design
  • Since UEFI is platform independent, it may be able to enhance the boot time and speed of the computer. This is especially the case when large hard drives are in use. 
  • UEFI can perform better while initializing the hardware devices.
  • UEFI can work alongside BIOS. It can sit on top of BIOS and work independently.
  • It supports MBR and GPT partition types.

Note: Modern systems are only emulating the legacy BIOS. They are EFI native.

UEFI on ESXi

vSphere 5.0 and above supports booting ESXi hosts from the Unified Extensible Firmware Interface (UEFI). With UEFI, you can boot systems from hard drives, CD-ROM drives, USB media, or network.
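Side note: if you are not sure in which mode a running ESXi host was actually booted, one trick I have seen referenced is the vsish node below. Treat it as an assumption rather than a supported interface - vsish is an undocumented shell and the exact node name and output format may differ between ESXi versions.

 # run from the ESXi shell; should print the firmware type the host booted with (BIOS vs. EFI)
 vsish -e get /hardware/firmwareType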

UEFI benefits

  • ESXi can boot from a disk larger than 2 TB provided that the system firmware and the firmware on any add-in card that you are using supports it. 

UEFI drawbacks

  • Provisioning with VMware Auto Deploy requires the legacy BIOS firmware and is not available with UEFI BIOS configurations. I hope that this limitation will be lifted soon.

Notes: Changing the host boot type between legacy BIOS and UEFI is not supported after you install ESXi 6.0. Changing the boot type from legacy BIOS to UEFI after you install ESXi 6.0 might cause the host to fail to boot.

Conclusion

UEFI is meant to completely replace BIOS in the future and brings many new features and enhancements that can’t be implemented through BIOS. BIOS can still be used in servers that do not require large storage for boot. To be honest, even though you can, it is not very common to use boot disks greater than 2 TB for ESXi hosts, so you may be using BIOS at the moment, but I would recommend shifting to UEFI, as it is the future, while BIOS will slowly fade away.

The ESXi hypervisor supports both boot modes, therefore if you have modern server hardware and don't use VMware Auto Deploy, UEFI should be your preferred choice.

References:
[1] VMware : Using EFI/UEFI firmware in a VMware Virtual Machine 
[2] VMware : Best practices for installing ESXi 5.0 (VMware KB 2005099)
[3] VMware : Best practices to install or upgrade to VMware ESXi 6.0 (VMware KB 2109712)
[4] Usman Khurshid : [MTE Explains] Differences Between UEFI and BIOS
[5] Wikipedia : Unified Extensible Firmware Interface

Thursday, June 16, 2016

Role and responsibility of IT Infrastructure Technical Architect

In this article, I would like to describe the infrastructure architect role and his responsibilities.

Any architect generally leads the design process with the goal of building the product. The product can be anything the investor would like to build and use. The architect is responsible for gathering all of the investor's goals, requirements and constraints, and for trying to understand all use cases of the final product.

The product of an IT infrastructure technical architect is an IT infrastructure system, also known as a computer system, running IT applications that support business services. That's a very important statement. The designed IT infrastructure system is usually not built just for the sake of the infrastructure itself but to support business services.

There is no doubt that a technical architect must be a subject matter expert in several technical areas, including compute, storage, network, operating systems and applications, but that is just the technical foundation required to fulfill all technical requirements. However, systems are shaped not just by technology but also by other non-technical factors like business requirements, operational requirements and human factors. It is obvious that the architect's main responsibility is to fulfill all these requirements of the final product - an IT infrastructure system in this particular case. However, the last mentioned factor, the human factor, usually has the biggest impact on any system design, because we usually build systems for human usage and these systems have to be maintained and operated by other humans as well.

Now that we know what an IT Infrastructure Technical Architect does, let's describe his typical tasks and activities.

The architect has to communicate with the investor's stakeholders to gather all design factors, including requirements, constraints and use cases. Unfortunately, there are usually also some design factors for which nobody has a specific requirement. These factors have to be documented as assumptions. When all relevant design factors are collected and revalidated with the requestors and investor authorities, the architect starts the design analysis and prepares the conceptual design. The conceptual design is a high-level design which helps to understand the overall concept of the proposed product. Such a conceptual design has to be reviewed by all design stakeholders, and when everybody feels comfortable with the concept, the architect can start the low-level design.

The low-level design is usually prepared as a decomposition of the conceptual design. It should be decomposed into several design areas, because it is almost always beneficial to divide a complex system into sub-systems until these become simple enough to be solved directly. This decomposition approach is also known as the "divide and conquer" method. The main purpose of the low-level design is to document all details important for successful implementation and operation of the product. Therefore it must be reviewed and validated by the relevant subject matter experts - other architects, operators and implementers - for each particular area. The low-level design is usually divided into a logical and a physical design. The logical design is a detailed technical design, but general logical components are used without referring to a particular supplier's physical product models, materials, configuration details or other physical specifications. The purpose of the logical design is to document the general principles of the overall design or of a particular decomposed, and thus simplified, design area. The logical design is also used for proper product sizing and capacity planning. The physical design, on the other hand, is a detailed technical design with specific products, materials and implementation details. The physical design is primarily intended for product builders and implementers, because the product is built or implemented based on the physical design.

It is good to mention that there is no product or system without risk. That's another responsibility of the architect. He should identify and document all risks and design limitations associated with the proposed product. The biggest threats are not risks in general but unknown risks. Therefore, documenting potential risks and risk mitigation options is a very important responsibility of the architect. A risk mitigation plan, or at least a contingency plan, should be part of the product design.

At the end of the day, the design should be implemented, therefore the implementation plan is just another activity and document the architect must prepare to make the product real, even though the implementation itself is usually out of the architect's scope.

It is worth mentioning that there is no proven design without design tests. Therefore the architect should also prepare and perform a test plan. The test plan has to include a validation and a verification part. The validation part validates the design requirements after the product is built or implemented. Only after validation can the architect honestly prove that the product really fulfills all requirements holistically. The verification part verifies that everything was implemented as designed and that the operational personnel know how to operate and maintain the system.

There is no perfect design or product, therefore the architect should continually improve even an already built product by communicating with end users, operators and other investor stakeholders and taking their feedback into account for future improvements. After some period of time, the architect should initiate a design review and incorporate all gathered feedback into the next design version.

Now that we know what the architect is responsible for, let's summarize what skills are important for any good architect. The architect must have decent skills and expertise in the following areas:

  • communication skills
  • presentation skills
  • consulting skills
  • cross check validation skills
  • documentation skills
  • systematic, analytical, logical and critical thinking
  • technical expertise
  • ability to think and work in different levels of detail
  • ability to see a big picture but also have attention for detail because the devil is in the details
Even if you have read this article up to this point, you may still ask what the architect's main responsibility is. That's a fair question. Here is the short answer.

The architect's main responsibility is the happiness of all users of the designed product during the whole lifecycle of the product.




Wednesday, June 01, 2016

Force10 Operating System 9.10 changes maximum MTU size

The Force10 operating system (aka FTOS, DNOS) has always had a maximum configurable MTU size of 12000 bytes per port. I have just been informed by a former colleague of mine that this is no longer the case in FTOS 9.10 and above. Since FTOS 9.10 the maximum MTU size per switch port is 9261. If you used MTU 12000, then after the upgrade to firmware 9.10 the MTU should be adjusted automatically. But I have been told that it is automatically adjusted to the standard MTU of 1500, therefore if you use jumbo frames (9000 bytes payload) it is necessary to change the configuration from 12000 to 9261 before the upgrade.
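A minimal sketch of that pre-upgrade change is below, assuming jumbo frames on a 10 GbE port. The interface name is just an example, and 9216 is the hardware maximum reported in Martin's comment below - verify the exact maximum with "mtu ?" on your platform and firmware version.

f10-s60#conf
f10-s60(conf)#interface TenGigabitEthernet 0/1
f10-s60(conf-if-te-0/1)#mtu 9216
f10-s60(conf-if-te-0/1)#exit
f10-s60(conf)#exit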

Disclaimer: I had no chance to test it, so I cannot guarantee that all information in this post is correct.

UPDATE: Please read the comments below this article for further information and Martin's great explanation of the real MTU behavior. Thanks Martin and Kaloyan for your comments.

Martin's comment:
MTU 12000 in configuration was not reflecting real hardware MTU of underlaying chipset, after upgrade to 9.10 it's just adjusted to reflect real hardware MTU. Tested on S4048 9.10(0.1). When you boot into 9.10 you can see log messages saying that configuration is adjusted to reflect real maximum hardware MTU.
Also in configuration
 ethswitch1(conf-if-te-1/47)#mtu ?  
 <594-12000> Interface MTU (default = 1554, hardware supported maximum = 9216)  
 ethswitch1(conf-if-te-1/47)#mtu   


Wednesday, May 25, 2016

ESXi : How to mask storage device causing some issues

I have heard about an issue with ESXi 6 Update 2 and HP 3PAR storage where VVOLs are enabled. I have been told that the issue is caused by issuing an unsupported SCSI command to the PE LUN (256). PE stands for Protocol Endpoint, and it is the VVOL technical LUN used for the data path between ESXi and the remote storage system.

Observed symptoms:
  • ESX 6 Update 2 – issues (ESXi disconnects from vCenter, console is very slow)
  • Hosts may take a long time to reconnect to vCenter after reboot or hosts may enter a "Not Responding" state in vCenter Server
  • Storage-related tasks such as HBA rescan may take a very long time to complete
  • I have been told that ESX 6 Update 1 doesn't experience such issues (there are entries in the log file but no other symptoms occur)
Below is a snippet from a log file ..

 2016-05-18T11:31:27.319Z cpu1:242967)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.2ff70002ac0150c3" - issuing command 0x43a657470fc0  
 2016-05-18T11:31:27.320Z cpu31:33602)WARNING: NMP: nmpCompleteRetryForPath:352: Retry cmd 0x28 (0x43a657470fc0) to dev "naa.2ff70002ac0150c3" failed on path "vmhba0:C0:T2:L256" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.  
 2016-05-18T11:31:27.320Z cpu31:33602)WARNING: NMP: nmp_PathDetermineFailure:2973: Cmd (0x28) PDL error (0x5/0x25/0x0) - path vmhba0:C0:T2:L256 device naa.2ff70002ac0150c3 - triggering path failover  
 2016-05-18T11:31:27.320Z cpu31:33602)WARNING: NMP: nmpCompleteRetryForPath:382: Logical device "naa.2ff70002ac0150c3": awaiting fast path state update before retrying failed command again.  

Possible workarounds

  • An ESXi hostd restart helps, therefore SSH to the ESXi hosts was enabled for quick resolution in case of a problem
  • LUN masking of LUN 256

Final solution

  • Application of an HP 3PAR firmware patch (unfortunately the patch is not available for the current firmware, thus a firmware upgrade has to be planned and executed)
  • Investigation of the root cause why ESXi 6 Update 2 is more sensitive than ESXi 6 Update 1

Immediate steps

  • Application of workarounds mentioned above  

Now the question is how to mask a particular LUN on an ESXi host.
Here is the quick solution for impatient readers.

Let's say we have the following device with the following path.
  • Device: naa.6589cfc000000bf5e731ffc99ec35186
  • Path: vmhba36:C0:T0:L1
LUN Masking
esxcli storage core claimrule add -P MASK_PATH -r 500 -t location -A vmhba36 -C 0 -T 0 -L 1
esxcli storage core claimrule load
esxcli storage core claiming reclaim -d naa.6589cfc000000bf5e731ffc99ec35186

LUN Unmasking
esxcli storage core claimrule remove --rule 500
esxcli storage core claimrule load
esxcli storage core claiming unclaim --type=path --path=vmhba36:C0:T0:L1
esxcli storage core claimrule run

... continue reading for details.

LUN MASKING
Exact LUN masking procedure is documented in vSphere 6 Documentation here. It is also documented in these KB articles 1009449 and 1014953.

List storage devices

 [root@esx02:~] esxcli storage core device list  
 naa.6589cfc000000bf5e731ffc99ec35186  
   Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000bf5e731ffc99ec35186)  
   Has Settable Display Name: true  
   Size: 10240  
   Device Type: Direct-Access  
   Multipath Plugin: NMP  
   Devfs Path: /vmfs/devices/disks/naa.6589cfc000000bf5e731ffc99ec35186  
   Vendor: FreeNAS  
   Model: iSCSI Disk  
   Revision: 0123  
   SCSI Level: 6  
   Is Pseudo: false  
   Status: degraded  
   Is RDM Capable: true  
   Is Local: false  
   Is Removable: false  
   Is SSD: true  
   Is VVOL PE: false  
   Is Offline: false  
   Is Perennially Reserved: false  
   Queue Full Sample Size: 0  
   Queue Full Threshold: 0  
   Thin Provisioning Status: yes  
   Attached Filters:  
   VAAI Status: supported  
   Other UIDs: vml.010001000030303530353661386131633830300000695343534920  
   Is Shared Clusterwide: true  
   Is Local SAS Device: false  
   Is SAS: false  
   Is USB: false  
   Is Boot USB Device: false  
   Is Boot Device: false  
   Device Max Queue Depth: 128  
   No of outstanding IOs with competing worlds: 32  
   Drive Type: unknown  
   RAID Level: unknown  
   Number of Physical Drives: unknown  
   Protection Enabled: false  
   PI Activated: false  
   PI Type: 0  
   PI Protection Mask: NO PROTECTION  
   Supported Guard Types: NO GUARD SUPPORT  
   DIX Enabled: false  
   DIX Guard Type: NO GUARD SUPPORT  
   Emulated DIX/DIF Enabled: false
  
 naa.6589cfc000000ac12355fe604028bf21  
   Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000ac12355fe604028bf21)  
   Has Settable Display Name: true  
   Size: 10240  
   Device Type: Direct-Access  
   Multipath Plugin: NMP  
   Devfs Path: /vmfs/devices/disks/naa.6589cfc000000ac12355fe604028bf21  
   Vendor: FreeNAS  
   Model: iSCSI Disk  
   Revision: 0123  
   SCSI Level: 6  
   Is Pseudo: false  
   Status: degraded  
   Is RDM Capable: true  
   Is Local: false  
   Is Removable: false  
   Is SSD: true  
   Is VVOL PE: false  
   Is Offline: false  
   Is Perennially Reserved: false  
   Queue Full Sample Size: 0  
   Queue Full Threshold: 0  
   Thin Provisioning Status: yes  
   Attached Filters:  
   VAAI Status: supported  
   Other UIDs: vml.010002000030303530353661386131633830310000695343534920  
   Is Shared Clusterwide: true  
   Is Local SAS Device: false  
   Is SAS: false  
   Is USB: false  
   Is Boot USB Device: false  
   Is Boot Device: false  
   Device Max Queue Depth: 128  
   No of outstanding IOs with competing worlds: 32  
   Drive Type: unknown  
   RAID Level: unknown  
   Number of Physical Drives: unknown  
   Protection Enabled: false  
   PI Activated: false  
   PI Type: 0  
   PI Protection Mask: NO PROTECTION  
   Supported Guard Types: NO GUARD SUPPORT  
   DIX Enabled: false  
   DIX Guard Type: NO GUARD SUPPORT  
   Emulated DIX/DIF Enabled: false  

So we have two devices with the following NAA IDs
  • naa.6589cfc000000bf5e731ffc99ec35186
  • naa.6589cfc000000ac12355fe604028bf21
Now let's list paths of both of my iSCSI devices

[root@esx02:~] esxcli storage nmp path list
iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000bf5e731ffc99ec35186
   Runtime Name: vmhba36:C0:T0:L1
   Device: naa.6589cfc000000bf5e731ffc99ec35186
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000bf5e731ffc99ec35186)
   Group State: active
   Array Priority: 0
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}
   Path Selection Policy Path Config: {current path; rank: 0}

iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000ac12355fe604028bf21
   Runtime Name: vmhba36:C0:T0:L2
   Device: naa.6589cfc000000ac12355fe604028bf21
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000ac12355fe604028bf21)
   Group State: active
   Array Priority: 0
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}
   Path Selection Policy Path Config: {current path; rank: 0}

Let's mask the iSCSI device exposed as LUN 1.
So the path we want to mask is vmhba36:C0:T0:L1 and the device UID is naa.6589cfc000000bf5e731ffc99ec35186

So let's create a masking rule for the path above. In this particular case we have just a single path to the device. In a real environment we usually have multiple paths and all paths should be masked.
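In such a multipath case the masking could look like the hypothetical sketch below - the adapter names, targets and LUN number are placeholders and each path gets its own claim rule ID:

 esxcli storage core claimrule add -P MASK_PATH -r 500 -t location -A vmhba2 -C 0 -T 0 -L 256
 esxcli storage core claimrule add -P MASK_PATH -r 501 -t location -A vmhba2 -C 0 -T 1 -L 256
 esxcli storage core claimrule add -P MASK_PATH -r 502 -t location -A vmhba3 -C 0 -T 0 -L 256
 esxcli storage core claimrule add -P MASK_PATH -r 503 -t location -A vmhba3 -C 0 -T 1 -L 256
 esxcli storage core claimrule load

For my single-path lab device the actual commands are: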

 esxcli storage core claimrule add -P MASK_PATH -r 500 -t location -A vmhba36 -C 0 -T 0 -L 1
 esxcli storage core claimrule load  

We can list our claim rules to see the result

 [root@esx02:~] esxcli storage core claimrule list  
 Rule Class  Rule Class  Type    Plugin   Matches                  XCOPY Use Array Reported Values XCOPY Use Multiple Segments XCOPY Max Transfer Size  
 ---------- ----- ------- --------- --------- ---------------------------------------- ------------------------------- --------------------------- -----------------------  
 MP       0 runtime transport NMP    transport=usb                            false            false            0  
 MP       1 runtime transport NMP    transport=sata                           false            false            0  
 MP       2 runtime transport NMP    transport=ide                            false            false            0  
 MP       3 runtime transport NMP    transport=block                           false            false            0  
 MP       4 runtime transport NMP    transport=unknown                          false            false            0  
 MP      101 runtime vendor   MASK_PATH vendor=DELL model=Universal Xport                  false            false            0  
 MP      101 file   vendor   MASK_PATH vendor=DELL model=Universal Xport                  false            false            0  
 MP      500 runtime location  MASK_PATH adapter=vmhba36 channel=0 target=0 lun=1              false            false            0  
 MP      500 file   location  MASK_PATH adapter=vmhba36 channel=0 target=0 lun=1              false            false            0  
 MP     65535 runtime vendor   NMP    vendor=* model=*                          false            false            0  

We can see that the new claim rule (500) is in the configuration file (/etc/vmware/esx.conf) and also loaded in the runtime.

However, to really mask our particular device without an ESXi host reboot, we have to reclaim the device

 [root@esx02:~] esxcli storage core claiming reclaim -d naa.6589cfc000000bf5e731ffc99ec35186  

The device disappears from the ESXi host immediately. An ESXi host reboot is not needed.
So we are done. The device is not visible to the ESXi host anymore.
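If you want to double-check, you can query the device directly. The sketch below assumes the NAA ID used in this lab; after masking, the command should no longer return the device (esxcli will report that it cannot find it).

 esxcli storage core device list -d naa.6589cfc000000bf5e731ffc99ec35186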

Note: I was unsuccessful when I tested LUN masking with a local device. Therefore I assume that LUN masking works only with remote disks (iSCSI, Fibre Channel).

LUN UNMASKING
Just in case you would like to unmask the device and use it again, here is the procedure.

Let's start by removing the claim rule for our previously masked path.

 [root@esx02:~] esxcli storage core claimrule remove --rule 500  
 [root@esx02:~] esxcli storage core claimrule list  
 Rule Class  Rule Class  Type    Plugin   Matches                  XCOPY Use Array Reported Values XCOPY Use Multiple Segments XCOPY Max Transfer Size  
 ---------- ----- ------- --------- --------- ---------------------------------------- ------------------------------- --------------------------- -----------------------  
 MP       0 runtime transport NMP    transport=usb                            false            false            0  
 MP       1 runtime transport NMP    transport=sata                           false            false            0  
 MP       2 runtime transport NMP    transport=ide                            false            false            0  
 MP       3 runtime transport NMP    transport=block                           false            false            0  
 MP       4 runtime transport NMP    transport=unknown                          false            false            0  
 MP      101 runtime vendor   MASK_PATH vendor=DELL model=Universal Xport                  false            false            0  
 MP      101 file   vendor   MASK_PATH vendor=DELL model=Universal Xport                  false            false            0  
 MP      500 runtime location  MASK_PATH adapter=vmhba36 channel=0 target=0 lun=1              false            false            0  
 MP     65535 runtime vendor   NMP    vendor=* model=*                          false            false            0  
 [root@esx02:~]   

You can see that the rule has been removed from the file configuration but it is still present in the runtime. We have to re-load the claim rules from the file into the runtime.

 [root@esx02:~] esxcli storage core claimrule load  
 [root@esx02:~] esxcli storage core claimrule list  
 Rule Class  Rule Class  Type    Plugin   Matches              XCOPY Use Array Reported Values XCOPY Use Multiple Segments XCOPY Max Transfer Size  
 ---------- ----- ------- --------- --------- --------------------------------- ------------------------------- --------------------------- -----------------------  
 MP       0 runtime transport NMP    transport=usb                        false            false            0  
 MP       1 runtime transport NMP    transport=sata                        false            false            0  
 MP       2 runtime transport NMP    transport=ide                        false            false            0  
 MP       3 runtime transport NMP    transport=block                       false            false            0  
 MP       4 runtime transport NMP    transport=unknown                      false            false            0  
 MP      101 runtime vendor   MASK_PATH vendor=DELL model=Universal Xport              false            false            0  
 MP      101 file   vendor   MASK_PATH vendor=DELL model=Universal Xport              false            false            0  
 MP     65535 runtime vendor   NMP    vendor=* model=*                       false            false            0  
 [root@esx02:~]   

Here we go. Now there is no rule with id 500.

But the device is still not visible and we cannot execute the command
esxcli storage core claiming reclaim -d naa.6589cfc000000bf5e731ffc99ec35186
because such a device is not visible to the ESXi host. We masked it, right? So this is exactly how it should behave.

An ESXi host reboot would probably help, but can we do it without rebooting the ESXi host?
The answer is yes, we can.
We have to unclaim the path to our device and re-run the claim rules.

 esxcli storage core claiming unclaim --type=path --path=vmhba36:C0:T0:L1  
 esxcli storage core claimrule run  

and now we can see both paths to iSCSI LUNs again.

 [root@esx02:~] esxcli storage nmp path list  
 iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000bf5e731ffc99ec35186  
   Runtime Name: vmhba36:C0:T0:L1  
   Device: naa.6589cfc000000bf5e731ffc99ec35186  
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000bf5e731ffc99ec35186)  
   Group State: active  
   Array Priority: 0  
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}  
   Path Selection Policy Path Config: {current path; rank: 0}  
 iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000ac12355fe604028bf21  
   Runtime Name: vmhba36:C0:T0:L2  
   Device: naa.6589cfc000000ac12355fe604028bf21  
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000ac12355fe604028bf21)  
   Group State: active  
   Array Priority: 0  
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}  
   Path Selection Policy Path Config: {current path; rank: 0}  

Hope this helps other VMware users who have a need for LUN masking / unmasking.

Monday, May 23, 2016

Storage DRS Design Considerations

This blog post follows the blog post "VMware vSphere SDRS - test plan of SDRS initial placement" and summarizes several facts that have an impact on SDRS design decisions. If you want to see the results of several SDRS tests I did in my home lab, read my previous blog post.

SDRS design considerations:
  • The SDRS initial placement algorithm does NOT take VM swap file capacity into account. However, subsequent rebalancing calculations are based on the free space of particular datastores, therefore if virtual machines are powered on and swap files exist, they are taken into account, because VM swap file usage reduces the datastore's total free space and thus affects the space load. The space load formula is [space load] = [total consumed space on the datastore] / [datastore capacity]; for example, a 2 TB datastore with 1.5 TB consumed has a space load of 0.75 (75%).
  • The storage space threshold is just a threshold (soft limit) used by SDRS for balancing and defragmentation. It is not a hard limit. SDRS tries to keep free space on datastores based on the space threshold, but SDRS does not guarantee you will always have some amount of free space on the datastores. See Test-3 in my SDRS test plan here [6].
  • SDRS defragmentation works, but there can be cases when initial placement fails even though storage was freed up and there would be free contiguous space on some datastore after defragmentation. See Test-2 in my SDRS test plan here [6]. That's up to the component which does the provisioning. It is important to understand how provisioning to a datastore cluster really works. A lot of people think that a datastore cluster behaves like one giant datastore. That is true from a high-level (abstracted) view, but in reality a datastore cluster is nothing more than a group of single datastores, where SDRS is "just" a scheduler on top of the datastore cluster. You can think of this scheduler as a placement engine which prepares placement recommendations for initial placement and continuous balancing. That means other software components (C# Client, Web Client, PowerCLI, vRealize Automation, vCloud Director, etc.) are responsible for the initial placement provisioning, and SDRS gives them recommendations on where the best place is to put new storage objects (VMDK files or VM config files). Here [8] is the proof of my statement.
  • When SDRS is configured to consider I/O metrics for load balancing, they are also considered during initial placement. Please do not mix up SDRS I/O metrics and SIOC. These are two different things, even though SDRS I/O metrics leverage the normalized latency calculated by SIOC.
  • Q: Do I need to use SDRS I/O metrics for load balancing? A: It depends on your physical storage system (disk array). If you have a disk array with a modern storage architecture, then you will most probably have all datastores (aka LUNs, volumes) on a single physical disk pool. In that case it doesn't make sense to load balance (do a Storage vMotion in case of I/O contention) between datastores, because it will always end up on the same physical spindles anyway and, on top of that, it will generate additional storage workload. The same is true for initial placement. If you have your datastores on different physical spindles, then it can help. This is typically useful on storage systems using RAID groups, which is not very common nowadays.
  • The SDRS calculation is done on VMDKs, not the whole virtual machine. But affinity rules (keep together) tend to keep the VMDKs together, making the behavior similar to treating the whole VM as one unit. By default, virtual machine files are kept together in the working directory of the virtual machine. If the virtual machine needs to be migrated, all files inside the virtual machine's working directory are moved. However, if the default affinity rule is disabled (see Screenshot 1), Storage DRS will move the working directory and virtual disks separately, allowing Storage DRS to distribute the virtual disk files at a more granular level.
  • Q: When and how often does SDRS rebalancing kick in? A: Rebalancing happens 1) at a regular interval (default 8 hours - it can be changed, see Screenshot 2); 2) when a threshold violation is detected as described above; 3) when a user requests a configuration change; 4) on an API call, such as clicking "Run Storage DRS" in the client.
  • Provisioning of multiple VMs can behave differently and less deterministically because of other SDRS calculation factors (I/O load, capacity usage trend) and also because of the particular provisioning workflow and the exact timing of when the SDRS recommendation is requested and when the datastore space is really consumed. Recall that the datastore's reported free capacity is one of the main factors for the next SDRS recommendations.
  • Since vSphere 6.0, SDRS is integrated with other relevant technologies: SDRS VASA awareness (array-based thin provisioning, deduplication, auto-tiering, snapshots, replication), Site Recovery Manager (consistency group and protection group considerations), vSphere Replication (replica placement recommendations), and Storage Policy Based Management (moves only between datastores compliant with the same VM storage policy). For further details read reference [9].
The general solution to overcome the challenges highlighted above (architects call it risk mitigation) is to have "enough" free space on each datastore. Enough free storage space per datastore gives some flexibility to the SDRS algorithm. What is "enough" and how to achieve it is out of scope of this blog post, but think about the following hints:
  • Thin provisioning on the physical storage. See Frank Denneman's note here [2] about the thin provisioning alarm, which should be considered by the SDRS algorithm. I write "should" because I have not had a chance to test it.
  • VVOLs, VSAN - I believe that one big object-oriented datastore (storage container) eliminates the need for datastore free space considerations related to SDRS initial placement.
But don't forget that there are always other considerations with potential impacts on your specific design, so think holistically and use critical thinking during all design decisions.

Last but not least - I highly encourage you to carefully study the book "VMware vSphere Clustering Deepdive 5.1" to become familiar with the basic SDRS algorithm and terminology.

Screenshots:
Screenshot 1: SDRS - Default VM affinity 
Screenshot 2: SDRS - Default imbalance check is 8 hours. It can be changed to X minutes, hours, days 
Videos:
Video 1: VMware vSphere SDRS VM provisioning process 

Relevant resources:
  1. Frank Denneman : Storage DRS initial placement and datastore cluster defragmentation
  2. Frank Denneman : Storage DRS Initial placement workflow
  3. Frank Denneman : Impact of Intra VM affinity rules on Storage DRS
  4. Frank Denneman : SDRS out of space avoidance
  5. Duncan Epping, Frank Denneman : VMware vSphere Clustering Deepdive 5.1
  6. David Pasek : VMware vSphere SDRS - test plan of SDRS initial placement
  7. VMware : Ignore Storage DRS Affinity Rules for Maintenance Mode
  8. David Pasek : VMware vSphere SDRS VM provisioning process 
  9. Duncan Epping : What is new for Storage DRS in vSphere 6.0?