Tuesday, December 31, 2013

Storage Array Power Consumption Calculation

Although some mid-range storage arrays have custom ASICs, they are usually built from commodity enterprise components. The real know-how and differentiators are in the storage array software (aka firmware or operating system). Thanks to the simple hardware architecture, we can calculate the power consumption of a storage array relatively easily.

Storage controllers are usually rack-mount servers consuming around 200W each.
A typical mid-range storage array has two controllers, but some arrays have more. Below the storage controllers are disk enclosures, which typically consume 150-200W each. Disk enclosures are populated with disks. Typical power consumption figures for modern disks are listed below.

300GB 15K SFF HDD     6.2W     8W
450GB 10K SFF HDD     3.7W     6.3W
600GB 10K SFF HDD     4.1W     6.3W
900GB 10K SFF HDD     4.8W     6.3W
1TB 7.2K SFF HDD      2.95W    3.84W
2TB 7.2K LFF HDD      7.5W     10.6W
3TB 7.2K LFF HDD      8.5W     11.8W

SFF = Small Form Factor; 2.5"
LFF = Large Form Factor; 3.5"

So here is an example calculation for an HP 3PAR 7400 storage array with two storage controllers and seven disk enclosures.

Storage Controllers = 2x 200W = 400W
Disk Enclosures = 7x 150W = 1,050W

Disks = 8x 400GB MLC SSD, 128x 300GB 15K HDD and 40x 900GB 10K HDD = 8 x 3.7W + 128 x 8W + 40 x 6.3W = 29.6 + 1024 + 252 = 1,305.6W

The total power consumption of this storage system configuration is 400W + 1,050W + 1,305.6W = 2,755.6W, or about 2.76 kW.
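The same arithmetic can be kept as a small script so that other configurations are easy to try. This is only a sketch: the function name is made up for illustration, and the wattage figures are the ones used in the example above (200W per controller, 150W per enclosure, 3.7W per SSD as used in the sum).

```python
# Per-component wattage taken from the example in this post.
CONTROLLER_W = 200
ENCLOSURE_W = 150


def array_power_watts(controllers, enclosures, disks):
    """Estimate array power draw in watts.

    disks is a list of (count, watts_per_disk) tuples.
    """
    disk_w = sum(count * watts for count, watts in disks)
    return controllers * CONTROLLER_W + enclosures * ENCLOSURE_W + disk_w


# 8x SSD at 3.7W, 128x 300GB 15K at 8W, 40x 900GB 10K at 6.3W
disks = [(8, 3.7), (128, 8.0), (40, 6.3)]
total = array_power_watts(2, 7, disks)
print(f"{total:.1f} W = {total / 1000:.2f} kW")  # 2755.6 W = 2.76 kW
```

Keep in mind this covers only the components listed here; it is a sizing estimate, not a measurement.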

Monday, December 23, 2013

FreeBSD running from read-only compact flash disk and accessible over serial console (COM1)

I very often use FreeBSD for automation tasks or as a network appliance. I like hardware such as SOEKRIS, ALIX and similar low-power platforms without moving parts. On such platforms I run FreeBSD from a Compact Flash card, and we all know about the limited write cycles of CF, don't we? So let's prepare a FreeBSD system to run on top of a read-only disk and prolong the life of the compact flash.

After a normal FreeBSD installation, edit /etc/rc.conf and add the following lines
This instructs FreeBSD to use memory file systems (aka RAM disks) for /tmp and /var instead of normal disk mount points (see the tmpmfs and varmfs variables described in rc.conf(5)). In conjunction with the read-only disk this significantly reduces writes to the flash disk, while the /tmp and /var mount points stay writable, which is important for a lot of applications.

Now we can set up the boot disk to be read-only. I do it simply by editing /etc/fstab and changing Options from rw to ro for the boot disk. I also change Dump from 1 to 0.
The Dump parameter (dump-freq) controls the archiving schedule for the partition (used by dump).
/etc/fstab should look like the example below:
# Device        Mountpoint      FStype  Options Dump    Pass#
/dev/ada0p2     /               ufs     ro      0       1
So now my FreeBSD system is ready to run from a Compact Flash card in read-only mode. This eliminates the flash write issue, and the system can run significantly longer than on a read-write disk. Of course read-only mode has its limitations, but that's OK for a lot of automation tasks and network appliances. When I need a data disk I usually use another disk (or CF) just for data.

After a FreeBSD reboot your mount points should look like the output below
root@example:~ # mount
/dev/ada0p2 on / (ufs, local, read-only)
devfs on /dev (devfs, local, multilabel)
/dev/md0 on /var (ufs, local)
/dev/md1 on /tmp (ufs, local)
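If you build more than one such appliance, the check above can be automated. The following is a minimal sketch that parses text in the format shown by mount and verifies that / is read-only while /var and /tmp are writable md (memory disk) devices. The function names are hypothetical, and the parser assumes the one-line-per-mount format shown above.

```python
import re

# Sample taken from the mount output shown in this post.
SAMPLE = """\
/dev/ada0p2 on / (ufs, local, read-only)
devfs on /dev (devfs, local, multilabel)
/dev/md0 on /var (ufs, local)
/dev/md1 on /tmp (ufs, local)
"""

MOUNT_RE = re.compile(r"^(\S+) on (\S+) \(([^)]*)\)$")


def parse_mount(output):
    """Return {mount_point: (device, set_of_options)} from mount output."""
    mounts = {}
    for line in output.splitlines():
        match = MOUNT_RE.match(line.strip())
        if match:
            device, mount_point, opts = match.groups()
            mounts[mount_point] = (device, {o.strip() for o in opts.split(",")})
    return mounts


def is_flash_friendly(mounts):
    """True if / is read-only and /var, /tmp are writable memory disks."""
    if "/" not in mounts or "read-only" not in mounts["/"][1]:
        return False
    for mount_point in ("/var", "/tmp"):
        device, opts = mounts.get(mount_point, ("", set()))
        if not device.startswith("/dev/md") or "read-only" in opts:
            return False
    return True


print(is_flash_friendly(parse_mount(SAMPLE)))
```

On a live system the text could come from running mount through subprocess instead of the embedded sample.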

Because I am configuring a hardware appliance, I would like to be able to control the system without a monitor and keyboard. Unix systems have always been ready for serial terminal consoles, so we can simply redirect the console to the RS-232 port and use it for system administration.

Here is how to do it.
Add the following setting to /boot/loader.conf. You can do it simply by running the following command
echo 'console="comconsole"' >> /boot/loader.conf
which redirects all the boot messages to the serial console.

Edit /etc/ttys and change off to on and dialup to xterm for the ttyu0 entry. Otherwise, a password will not be required to connect via the serial console, resulting in a potential security hole.

The line in /etc/ttys should look like below
ttyu0   "/usr/libexec/getty std.9600"   xterm   on secure
Update 2016-06-26: This is not needed any more for FreeBSD 9.3 and later because a new flag, "onifconsole" has been added to /etc/ttys. This allows the system to provide a login prompt via serial console if the device is an active kernel console, otherwise it is equivalent to off. 

Before editing the file I have to switch my disk from read-only to read-write mode, otherwise I will not be able to save the file. I can switch from read-only to read-write mode with the command below:
mount -u /
If I want to change back to read-only mode, here is how I do it
mount -a
This command remounts all mounts with the options from /etc/fstab, so my disk is read-only again.

I leave the disk in read-write mode for now because I have to make the last configuration change: instruct the system to use the COM port for the console.

I run the command
echo '-P'  >> /boot.config
to add the -P option to the /boot.config file. The advantage of this (-P) configuration is its flexibility. If the keyboard is not present, console messages are written to

  • serial and internal during boot phase
  • serial during boot loader phase
  • serial when system is running (in kernel phase)

If a keyboard is present in the system, the monitor and keyboard are used as usual.
If the keyboard is absent, the console is accessible over the COM port.

An important note for systems without a graphics card, like SOEKRIS: the other virtual terminal entries in /etc/ttys should be commented out, otherwise you can see errors like

Dec 22 20:25:38 PRTG-watchdog getty[1469]: open /dev/ttyv0: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1470]: open /dev/ttyv1: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1471]: open /dev/ttyv2: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1472]: open /dev/ttyv3: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1473]: open /dev/ttyv4: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1474]: open /dev/ttyv5: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1475]: open /dev/ttyv6: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1476]: open /dev/ttyv7: No such file or directory
Dec 22 20:25:38 PRTG-watchdog getty[1477]: open /dev/ttyu0: Interrupted system call

I usually leave ttyv0 enabled, otherwise the normal console (monitor + keyboard) cannot be used on systems where VGA and a keyboard exist.

So here is an excerpt from a typical /etc/ttys
ttyv0   "/usr/libexec/getty Pc"         xterm   on  secure
# Virtual terminals
#ttyv1  "/usr/libexec/getty Pc"         xterm   on  secure
#ttyv2  "/usr/libexec/getty Pc"         xterm   on  secure
#ttyv3  "/usr/libexec/getty Pc"         xterm   on  secure
#ttyv4  "/usr/libexec/getty Pc"         xterm   on  secure  
#ttyv5  "/usr/libexec/getty Pc"         xterm   on  secure
#ttyv6  "/usr/libexec/getty Pc"         xterm   on  secure    
#ttyv7  "/usr/libexec/getty Pc"         xterm   on  secure
#ttyv8  "/usr/local/bin/xdm -nodaemon"  xterm   off secure  
# Serial terminals
# The 'dialup' keyword identifies dialin lines to login, fingerd etc.
ttyu0   "/usr/libexec/getty std.9600"   xterm   on  secure

At the end don't forget to reboot the system to see whether the changes took effect and everything works.

I'm writing this blog post primarily for me as a personal run-book but I believe it can be useful for some other FreeBSD hackers ;-)

Tuesday, December 17, 2013

SSL Certificate filename extensions

Original resource is here.

SSL has been around for long enough you'd think that there would be agreed upon container formats. And you're right, there are. Too many standards as it happens. So this is what I know, and I'm sure others will chime in.
  • .csr This is a Certificate Signing Request. Some applications can generate these for submission to certificate-authorities. It includes some/all of the key details of the requested certificate such as subject, organization, state, whatnot, as well as the public key of the certificate to get signed. These get signed by the CA and a certificate is returned. The returned certificate is the public certificate, which itself can be in a couple of formats.
  • .pem Defined in RFC's 1421 through 1424, this is a container format that may include just the public certificate (such as with Apache installs, and CA certificate files /etc/ssl/certs), or may include an entire certificate chain including public key, private key, and root certificates. The name is from Privacy Enhanced Email, a failed method for secure email but the container format it used lives on.
  • .key This is a PEM formatted file containing just the private-key of a specific certificate. In Apache installs, this frequently resides in /etc/ssl/private. The rights on this directory and the certificates is very important, and some programs will refuse to load these certificates if they are set wrong.
  • .pkcs12 .pfx .p12 Originally defined by RSA in the Public-Key Cryptography Standards, the "12" variant was enhanced by Microsoft. This is a passworded container format that contains both public and private certificate pairs. Unlike .pem files, this container is fully encrypted. Every time I get one I have to google to remember the openssl-fu required to break it into .key and .pem files.
A few other formats that show up from time to time:
  • .der A way to encode ASN.1 syntax, a .pem file is just a Base64 encoded .der file. OpenSSL can convert these to .pem. Windows sees these as Certificate files. I've only ever run into them in the wild with Novell's eDirectory certificate authority.
  • .cert .cer A .pem formatted file with a different extension, one that is recognized by Windows Explorer as a certificate, which .pem is not.
  • .crl A certificate revocation list. Certificate Authorities produce these as a way to de-authorize certificates before expiration.

In summary, there are three different ways to present certificates and their components:
  • PEM Governed by RFCs, it's used preferentially by open-source software. It can have a variety of extensions (.pem, .key, .cer, .cert, more)
  • PKCS12 A private standard that provides enhanced security versus the plain-text PEM format. It's used preferentially by Windows systems, and can be freely converted to PEM format through use of openssl.
  • DER The parent format of PEM. It's useful to think of it as a binary version of the base64-encoded PEM file. Not routinely used by anything in common usage.
More about certificates and cryptography can be found on wikipedia.
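The statement that a .pem file is just a Base64-encoded .der file is easy to demonstrate. The sketch below wraps arbitrary DER bytes into a PEM certificate block and unwraps them again using only the Python standard library. The function names are made up for illustration, and the payload in the usage line is a placeholder byte string, not a real certificate.

```python
import base64

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def der_to_pem(der: bytes) -> str:
    """Base64-encode DER bytes and wrap them in PEM header/footer lines."""
    b64 = base64.b64encode(der).decode("ascii")
    # PEM bodies are conventionally wrapped at 64 characters per line.
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return "\n".join([PEM_HEADER, *lines, PEM_FOOTER]) + "\n"


def pem_to_der(pem: str) -> bytes:
    """Strip the PEM header/footer and decode the Base64 body."""
    lines = [line.strip() for line in pem.strip().splitlines()]
    if len(lines) < 2 or lines[0] != PEM_HEADER or lines[-1] != PEM_FOOTER:
        raise ValueError("not a PEM certificate block")
    return base64.b64decode("".join(lines[1:-1]))


payload = bytes(range(256))  # placeholder bytes standing in for DER data
assert pem_to_der(der_to_pem(payload)) == payload
```

For real certificates the standard library offers the same conversion as ssl.DER_cert_to_PEM_cert and ssl.PEM_cert_to_DER_cert.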

Public/private cloud - pure reality without marketing bla...bla...bla

We all know the datacenter cloud concept - consuming datacenter resources in a standard and predictable way - is inevitable. However, technology is not 100% ready to satisfy all cloud requirements, at least not efficiently and painlessly. I hear the same opinion from other professionals. I really like the following statement from Scott Lowe's interview with Jesse Proudman ...
Our customers and prospects are all evolving their cloud strategies in real time, and are looking for solutions that satisfy these requirements:
  1. Ease of use - new solutions should be intuitively simple. Engineers should be able to use existing tooling, and ops staff shouldn't have to go learn an entirely new operational environment.
  2. Deliver IaaS and PaaS - IaaS has become a ubiquitous requirement, but we repeatedly heard requests for an environment that would also support PaaS deployments.
  3. Elastic capabilities - the desire for the ability to grow and contract private environments much in the same way they could in a public cloud.
  4. Integration with existing IT infrastructure - businesses have significant investments in existing data center infrastructure: load balancers, IDS/IPS, SAN, database infrastructure, etc. From our conversations, integration of those devices into a hosted cloud environment brought significant value to their cloud strategy.
  5. Security policy control - greater compliance pressures mean a physical "air gap" around their cloud infrastructure can help ensure compliance and ease peace of mind.
  6. Cost predictability and control - Customers didn't want to need a PhD to understand how much they'll owe at the end of the month. Budgets are projected a year in advance, and they needed to know they could project their budgeted dollars into specific capacity.
This is a very nice summary of customers' cloud requirements.

Sunday, December 15, 2013

Redirect DELL PowerEdge server serial port to iDRAC

Let's assume you use the COM2 serial port for console access to your operating system. This is common on Linux, FreeBSD and other *nix-like systems. The administrator can then use a serial terminal to work with the OS. However, this is useful only for local access. What if you want to access the terminal console remotely? If you have a DELL PowerEdge server with iDRAC 7 you can redirect serial communication to your iDRAC. You probably know you can SSH into iDRAC for remote server operations. Once you are in the iDRAC you can use the "connect" command, which connects you to your serial terminal.

To get it working, a few steps have to be taken on the PowerEdge server.

1/ Configure iDRAC
  • Go to Network & Serial
  • Set IPMI’s Baud Rate, for example to 9.6 kbps (Serial Port Baud Rate)
  • Apply Settings

2/ During boot enter the Server’s BIOS
  • Go to “Serial Communication”
  • Switch from “Off” to “On without Redirection”
  • Switch Port Configuration from “Serial Device1=COM1;Serial Device2=COM2” to “Serial Device1=COM2;Serial Device2=COM1”
  • Save Settings and Reboot Controller

After these steps the Server’s serial console is available via iDRAC:

Login to iDRAC using SSH and type “connect” at the prompt. After that the SSH session shows the serial console as if directly connected to the system’s serial port.

Sunday, December 08, 2013

Virtual SAN Hardware Guidance Part 1 – Solid State Drives

Here is a very good read for understanding the different SSD types.

Force10 doesn't keep configuration after reload

I had a call from a customer who was really unhappy because the configuration of his Force10 S4810 switch disappeared after every switch reload or reboot.

In the end we realized that his switch was configured for exactly this behavior.

Force10 FTOS supports two reload types

  • reload-type jump-start
  • reload-type normal-reload

If jump-start mode is used, the configuration is cleared after each reload. This reload type is useful for product demonstrations, technology introductions or proofs of concept, but it can be very frustrating for someone who wants to use the switch in production.

The solution is very simple. You just need to change the reload type with a single command: "reload-type normal-reload"

I hope this saves someone some time.

Wednesday, December 04, 2013

Local and shared storage performance comparison

I have just answered a typical question received by email. Here is the email I got ...
Working with a customer to validate and tune their environment. We're running IOMETER and pointing at an R720 with local storage as well as an MD3200i. Local storage is 6 15k disks in RAID10. MD has 2 disk groups w/ 6 drives in each group w/ 6 15k drives. ISCSI running through a pair of Dell PC Switches that look to be properly configured. Tried MRU and RR PSP. The local disks are absolutely blowing away the MD3200i, IOPS, MB/s and Latency in a variety of specifications.

I haven't had the chance to play w/ such a well provisioned local array lately, but am surprised by the numbers, like w/ a 512k 50%/50% spec we're seeing 22,000 iops local and 5000 iops on the MD....

Maybe I will repeat information you already know, but I believe it is useful to give the full context.

6x15k physical disks can give you physically around 6x180 IOPS = 1080.

But ...

1/ each I/O is different - IOPS depend on block size and other access specifications such as sequential/random access, outstanding I/Os (asynchronous I/O not waiting for queue ack), etc.

2/ each architecture is different:
  • local virtual disk (LUN) is connected via PERC having cache
  • SAN virtual disk (LUN) is connected over SAN which brings another complexity & latency (NIC/HBA queues, switches, storage controller queues or LUN queues, …)   
3/ Each storage controller is different
  • Local RAID controller is designed for a single-server workload => a single thread can get the full performance of the disks, and if more threads are used the performance drops
  • Shared RAID controller is designed for multiple workloads (servers/threads) => each thread gets only a portion of the full storage performance, but every other thread gets the same performance. This is a fair policy/algorithm for a shared environment.

The cache and controller-specific I/O optimizations can give you significantly better IOPS, which is why you get 5,000 from the MD and 22,000 from the local disks/PERC. But 22,000 is too high a number to believe it comes directly from the disks, so there is definitely cache magic involved.

Here are widely used average IOPS figures for different types of disks:
  • 15k disk = 180 IOPS
  • 10k disk = 150 IOPS
  • 7k disk = 80 IOPS
  • SSD/MLC = 2500 IOPS
  • SSD/SLC = 5000 IOPS

Please note that
  • these are average numbers used for sizing. I have seen SATA/7k disks in a Compellent handling over 200 IOPS, but that was sequential access and the disks were quite overloaded because latency was very high!!!
  • SSD numbers differ significantly among manufacturers
All these calculations give you the available IOPS for reads or writes to a non-redundant virtual disk (LUN/volume), meaning a single disk or RAID 0. If you use redundant RAID you have to factor in the RAID write penalty
  • RAID 10 = 2
  • RAID 5 = 4
  • RAID 6 = 6
So you can see this is quite a complex topic, and if you really want to show the customer the truth (who knows what the pure truth is? :-) ) then you have to consider all the statements above.
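To make the write penalty concrete, here is a small sizing sketch. It uses a common rule of thumb that is not spelled out in the email thread: front-end IOPS = raw disk IOPS / (read ratio + write ratio x penalty). The function name is hypothetical; the per-disk figure and the penalties are the ones listed above.

```python
# RAID write penalties as listed in this post (RAID 0 has none).
RAID_WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4, "RAID6": 6}


def frontend_iops(disks, iops_per_disk, raid, write_ratio):
    """Estimate host-visible IOPS for a disk group, ignoring any cache.

    write_ratio is the fraction of writes in the workload (0.0 to 1.0).
    """
    raw = disks * iops_per_disk
    penalty = RAID_WRITE_PENALTY[raid]
    return raw / ((1.0 - write_ratio) + write_ratio * penalty)


# 6x 15k disks (180 IOPS each), 50% reads / 50% writes
for raid in ("RAID0", "RAID10", "RAID5", "RAID6"):
    print(raid, round(frontend_iops(6, 180, raid, 0.5)))
```

For the 6-disk RAID 10 group from the email this gives roughly 720 IOPS from the spindles alone, which again suggests that much higher measured numbers come from cache.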

Typical issues when measuring with IOmeter without enough experience:
  • Small target disk file (entered in blocks of 512B). The target file must be bigger than the cache. I usually use a file between 20000000 blocks (approx. 10GB) and 80000000 blocks (approx. 40GB).
  • Small number of threads (workers in IOmeter terminology)
  • Workload generated from a single server. Did you know you can run dynamo on another computer and connect it to IOmeter over the network? Then you will see more managers (servers) and you can define workers and access specifications from a single GUI.
I hope this helps at least someone, and I would appreciate a deeper discussion on this topic.
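Because the IOmeter target size is entered in 512-byte blocks, it is easy to misjudge the resulting file size. A tiny helper (the name is made up for illustration) converts a block count into decimal gigabytes:

```python
SECTOR_BYTES = 512  # IOmeter disk size is entered in 512-byte blocks


def sectors_to_gb(sectors):
    """Convert a count of 512-byte blocks to decimal gigabytes."""
    return sectors * SECTOR_BYTES / 1e9


for sectors in (20_000_000, 80_000_000):
    print(f"{sectors} blocks = {sectors_to_gb(sectors):.2f} GB")
```

The result should be compared with the combined cache size of the RAID controller or storage array under test.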

Saturday, November 23, 2013

Site Recovery Manager and vSphere Replication Network Ports Required - Simplified!

Sunny Dua published a very useful blog post describing the network ports used among the different SRM software components. When you need to know what ports are required for SRM, look at http://vxpresss.blogspot.cz/2013/11/site-recovery-manager-and-vsphere.html

Thursday, November 21, 2013

Qlogic HBA adapter Queue depth in windows 2008 r2

Here is an interesting discussion about the topic ... below are the most valuable statements from the thread:

By default a QLogic HBA Execution Throttle is set to 16.  This setting specifies the maximum number of outstanding (SCSI / Fiber Channel) commands that can execute on any single Target port (WWPN).  When a Target port’s Execution Throttle is reached, the host computer will not issue any new commands until one of the current commands finishes executing.

To increase the number of outstanding (SCSI / Fiber Channel) commands to the Target port, increase the Execution Throttle.  However, increasing the Execution Throttle beyond what the Target port may handle will cause SCSI timeouts to occur.

According to QLogic it is considered a Best Practice to optimize performance for all devices that may communicate with the Target port. To do this QLogic recommends that you divide the Target port's Maximum Outstanding Commands (or Queue) by the number of host ports (Initiators) that access the Target port.  This type of computation is used to determine the Target port fan-in ratio as well.

(Maximum Storage Port Command Queue) / (Host Ports) = Execution Throttle


Ideally the storage team should specify the queue depth (QD) to set on the HBA, based on how many LUNs are published on a particular port.

Queue depth is the number of I/O operations that can run in parallel on a device.

Calculating the queue depth / Execution Throttle value

To prevent flooding the target port queue, the product of the number of host paths, the execution throttle (queue depth) value and the number of LUNs presented through the host port must not exceed the target port queue depth. In short, T >= P * Q * L

T = Target Port Queue Depth
P = Paths connected to the target port
Q = Queue depth
L = number of LUN presented to the host through this port
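Both rules of thumb from the thread fit in a few lines. The sketch below is only an illustration: the function names are made up, and the target port queue of 2048 in the usage lines is a hypothetical value, so use the figure published by your storage vendor.

```python
def execution_throttle(target_port_queue, host_ports):
    """QLogic rule: target port command queue divided by host ports."""
    return target_port_queue // host_ports


def max_queue_depth(target_queue_depth, paths, luns):
    """Largest Q that still satisfies T >= P * Q * L."""
    return target_queue_depth // (paths * luns)


# Hypothetical target port queue of 2048 outstanding commands
print(execution_throttle(2048, host_ports=32))     # 64
print(max_queue_depth(2048, paths=4, luns=16))     # 32
```

Integer division rounds down on purpose, so the result never oversubscribes the target port.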


Wednesday, November 13, 2013

VMware ESXi vim-cmd Command: A Quick Tutorial

Very nice blog post on www.doublecloud.org ...
Command lines are very important for system administrators when it comes to automation. Although GUIs are more likely (not always as I’ve seen too many bad ones) to be more intuitive and easier to get started with, sooner or later administrators will use command lines more for better productivity. There are a few command line options in VMware ESXi, among which is the vim-cmd ... READ MORE.

Tuesday, November 12, 2013

Introduction to the Dell S6000 Layer 2 Gateway for VMware NSX

In the following video you can see DELL Force10 S6000 integration with VMware NSX. That's the beginning of real and usable software defined networking (SDN), or network virtualization if you wish.

I'm looking forward to hands-on experience in the future.

Monday, October 28, 2013

VMware vSphere: Script to change Default PSP to Round Robin

An automated way to set the default PSP for a particular SATP.
vCLI example:
esxcli --server myESXi --username user1 --password 'my_password' storage nmp satp set --default-psp=VMW_PSP_RR --satp=VMW_SATP_DEFAULT_AA

PowerCLI example:
# myESXi is the same placeholder host name as in the vCLI example
$esxcli = Get-EsxCli -VMHost myESXi
$esxcli.storage.nmp.satp.set($null,"VMW_PSP_RR","VMW_SATP_DEFAULT_AA")

Please note that in both examples the ESX server needs a reboot for the change to take effect.

Saturday, October 26, 2013

VMware VCDX by the numbers

Brian Suhr had a great idea to summarise and publicly share the available information about VMware's top certified experts, known as VCDX (VMware Certified Design Experts).

It is real motivation for others preparing for VCDX.

Write-up is available here http://www.virtualizetips.com/2013/09/27/vmware-vcdx-numbers/

Friday, October 25, 2013

DELL EqualLogic valuable resources on the web

I got an email from a DELL EqualLogic expert whose mail signature contains links to very valuable DELL EqualLogic web resources. Here they are:
Also see my other blog post DELL Storage useful links.

I'm sharing these links here because I'm sure they are interesting for other people.

Saturday, October 19, 2013

DELL is able to build CDN (Content Delivery Network) for telco providers

Are you surprised DELL is able to build CDN? Yes, that's true ... Dell, EdgeCast Shake Up Content Delivery Networks ...
"Every single teleco service provider globally is trying to build some kind of content delivery network," said Segil. The rapid expansion of the use of video, pictures, and multimedia text and graphics is putting a strain on network operators' capacity that would be relieved by effective use of a content delivery network. A film that is suddenly in demand from many parts of the world, for example, would be more effectively streamed from servers close to requestors than struggling to scale from one point.
... I know some people who cannot imagine that DELL can help customers with a CDN (Content Delivery Network). That's probably because DELL is well known as a PC and laptop manufacturer. However, that is no longer the right image of modern DELL. DELL has been manufacturing and delivering enterprise gear (servers, storage and network) for almost 8 years, and DELL GICS (Global Infrastructure Consulting Services) provides infrastructure consulting services. Today DELL has all the hardware components needed to build a CDN. A CDN is usually described as a special virtual network (aka VPN, tunnel, overlay) on top of the internet, optimized to deliver digital content (aka digital objects). To be more specific, DELL has a partnership with EdgeCast, which has a complete software solution leveraging commodity x86 hardware. Dell is producing a content delivery platform based on its PowerEdge servers and software from number-three content delivery network service supplier EdgeCast Networks. More information about the DELL and EdgeCast CDN solution is here, here, here and here.

However, it is worth mentioning that before anybody builds their own CDN it is very important to gather the business requirements, target users and content types for delivery. The conceptual and logical architecture has to be prepared based on specific requirements and constraints. Different CDNs can be built for different purposes. And last but not least, the technical architecture must be fully aligned with the business model, and the investor must fully believe that the business forecast is achievable.

Wednesday, October 16, 2013

Out-of-band BIOS settings management

Today I did some troubleshooting with a customer. We needed to verify which NUMA type is set in the server's BIOS. In the past I posted more info about BIOS NUMA settings here. The customer sighed that he could not restart the server just to look at the BIOS screen. My answer was ...

... it is not necessary to reboot the server because you have modern gear which allows you to read BIOS settings via the out-of-band management card.

In our case we had a DELL PowerEdge R620 rack server with an iDRAC 7 management card. BIOS settings are not visible in the iDRAC web interface, so you have to use the CLI (aka racadm). There are several ways to use the racadm CLI, but IMHO the simplest is to SSH to the iDRAC IP address and execute the command:

 racadm get bios.MemSettings 
You should get a result similar to the screenshot below.

For more information look at DELL Tech Center.

Tuesday, October 15, 2013

iSCSI NetGear datastore issues

Yesterday I had a phone call from my neighbor, who works as a vSphere admin for a local system integrator. He was in the middle of an upgrade from vSphere 4.1 to vSphere 5.5 and had run into trouble.

He decided to move to vSphere 5.5 not by an in-place upgrade but by running two environments side by side: the legacy one (vSphere 4.1) and the new one (vSphere 5.5). Each environment had its own vCenter, and he used one iSCSI datastore connected to both environments as a transfer datastore. He called me because he had trouble powering on a particular VM stored on the transfer datastore and registered on an ESXi 5.5 host managed by vCenter 5.5. When the VM power-on was initiated it took some time and then the task failed at (he told me) 25%.

I remember we discussed some time ago whether it is better to use vSphere 5.1 or go directly to the very new vSphere 5.5. My answer was "it depends", but in the end we agreed that in a small environment it is possible to go directly to vSphere 5.5 and accept some risk. That's the reason why I felt a little bit guilty.

As we are neighbors he came to my garden. He smoked several cigarettes, probably to organize his thoughts, and we discussed the potential root cause and other best practices including migration possibilities. All those general ideas and recommendations were just best practices and hypotheses. In the end we agreed that we had to look at the log files to understand what was really happening and what issue he was experiencing.

I have to say I like troubleshooting ... the first log file to check in such situations is obviously /var/log/vmkernel.log

As he is more Microsoft (GUI) than *nix (CLI) oriented, I guided him over the phone through enabling SSH, logging in to ESXi and checking the log file.

When we ran the command
tail -f /var/log/vmkernel.log 
the troubleshooting was almost done. A lot of SCSI errors were being logged continuously into vmkernel.log. The SCSI errors included the following useful information
H:0x0 D:0x2 P:0x0  SCSI sense keys: 0x0B 0x24 0x00
Let's translate the log information into human language ... the device returns "aborted command" (sense key 0x0B). The additional sense code 0x24 with qualifier 0x00 is defined by the SCSI standard as "invalid field in CDB", but it normally accompanies sense key 0x05 (illegal request), so this combination is probably device specific.
However the root cause was obvious ... it is a storage related issue. We tried to create a directory on the affected datastore and it took almost 30 seconds, which confirmed our assumption of a storage issue. The problematic datastore was backed by iSCSI NetGear storage. The same operation on another datastore, backed by different storage connected directly by SAS, was of course immediate.
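When there are thousands of such lines, decoding them by hand gets tedious. The following is a minimal sketch of a decoder for the H:/D:/P: status codes and the three sense bytes in the format shown above. The regular expression and function name are my own assumptions about the line layout, and the lookup tables cover only the standard SCSI device status and sense key values, not the additional sense codes.

```python
import re

# Standard SCSI sense keys
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC", 0xA: "COPY ABORTED", 0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW", 0xE: "MISCOMPARE",
}

# Standard SCSI device status codes (the D: field)
DEVICE_STATUS = {
    0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
    0x18: "RESERVATION CONFLICT", 0x28: "TASK SET FULL",
}

HEX = r"(0x[0-9a-fA-F]+)"
LINE_RE = re.compile(
    rf"H:{HEX} D:{HEX} P:{HEX}.*?sense (?:keys|data): {HEX} {HEX} {HEX}"
)


def decode_scsi_status(line):
    """Return a dict with decoded status fields, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(x, 16) for x in match.groups())
    return {
        "host_status": host,
        "device_status": DEVICE_STATUS.get(device, hex(device)),
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc": asc,
        "ascq": ascq,
    }


print(decode_scsi_status("H:0x0 D:0x2 P:0x0  SCSI sense keys: 0x0B 0x24 0x00"))
```

The ASC/ASCQ pair is returned as raw numbers so it can be looked up in the SCSI specification or the vendor documentation.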

So I asked him again (we had talked about the HCL during the initial general discussion) whether he had checked the HCL. He confirmed he had, but said he would double check. An hour later he sent me a message that the storage model is supported but the firmware must be upgraded to work correctly with ESX 5.5.

All my "ad-hoc consulting" was just quick help for a friend of mine, so I don't even know which NetGear iSCSI storage my neighbor has, but I will ask him and update this post because it can help other people.

Update 10/16/2013:
I have been informed that the exact NetGear iSCSI storage model is "NetGear Ready NAS 3100". I checked the VMware HCL myself and at the moment it is supported only for ESX 5.1 with firmware RAIDiator-x86 4.2.21. So I warned my neighbor that even if it works with the new firmware, this configuration will be unsupported. Another lesson from this - don't trust anybody and validate everything yourself :-)

So what is the conclusion of this story? Plan, plan and plan again before any vSphere upgrade. Don't check just the hardware models on the HCL but also the firmware versions. Modern hardware and operating systems (including hypervisors) are very software dependent, so firmware versions matter.

Wednesday, October 09, 2013

Two (2) or four (4) socket servers for vSphere infrastructure?

Last week I had an interesting discussion with a customer's subject matter experts and VMware PSO experts about using 2-socket versus 4-socket servers for VMware vSphere infrastructure in an IaaS cloud environment. I was surprised how difficult it is to persuade infrastructure professionals of the benefits of 4-socket servers in some cases.

Although it seems like a pretty easy question, it is actually more complex when we analyze it in depth. It is a common question from many of my customers, and because the typical answer is "it depends" I've decided to blog about it.

Let's start with some general statements:
  • 2-socket servers are designed and used for general business workloads
  • 2-socket servers are less expensive
  • 4-socket servers are very expensive
  • 4-socket servers are designed and used for high performance and mission-critical workloads
  • failure of single 4-socket server node in vSphere cluster has bigger impact on capacity 
All these general statements are relative, so what is really better for a particular environment depends on the customer's requirements, expected workload size and chosen hardware platform.

It is important to note that at the time of writing there are two types of 4-socket Intel servers on the market: servers with the Intel E7 CPU family and servers with the Intel E5-4600 family. Comparing the Intel E7-4870 (10 core, 2.4GHz) with an Intel E5-4650 (8 core, 2.7 GHz), you’ll find that the E5 server outperforms the E7 server in the following benchmarks:
  •  CAE
  •  SPECfp*_rate_base2006
  •  Numerical Weather
  •  Financial Services
  •  Life Sciences
  •  Linpack AVX
  •  SPECint*_rate_base2006
The E7 server outperforms the E5 server in the following benchmarks:
  •  Java* Middleware
  •  OLTP Database
  •  Enterprise Resource Planning
The CPU family comparison is taken from here.

Intel E7 CPUs are designed for mission-critical workloads and the E5-4600 family for general workloads with high CPU performance requirements. Therefore E7 processors are "very" (I would say more) expensive, while the price difference between E5-4600 (4-socket) and E5-2600 (2-socket) servers is usually no more than 10 to 20 percent, although it varies among hardware vendors.

Server consolidation is the most common use case of server virtualization. Before any server consolidation it is highly recommended to do "ASIS" capacity monitoring and "TOBE" capacity planning with consolidation scenarios. There are plenty of tools for such an exercise, for example VMware Capacity Planner, Platespin Recon, CIRBA, etc. However, if we design a green field environment and there is no legacy environment to monitor, we have to define the expected average and maximum VM ourselves. So let's define the average and maximum workload we plan to virtualize in a single VM.

Let's assume our typical VM is configured as
  • 1 vCPU consuming 333 MHz CPU
  • 1 vCPU consuming 1/3 of one CPU Thread
  • 4GB RAM
and maximal VM (aka monster VM) is configured as
  • 8 vCPU
  • 128 GB RAM
So which physical servers should we use for virtualization in such an environment? E7 CPUs are significantly more expensive, so let's compare a 2-socket server with E5-2665 (2.4GHz) against a 4-socket server with E5-4640 (2.4GHz). Here are our server options in detail.

4S-SERVER: A single 4-socket server with E5-4640 (8 cores per socket) has 32 cores and 64 CPU threads (logical CPUs) when hyper-threading is enabled. Total CPU capacity is 76.8 GHz (32 cores x 2.4 GHz). In terms of RAM it can accommodate 48 DIMMs (4 sockets x 4 channels x 3 DIMMs).

2S-SERVER: A single 2-socket server with E5-2665 (8 cores per socket) has 16 cores and 32 CPU threads (logical CPUs) when hyper-threading is enabled. Total CPU capacity is 38.4 GHz (16 cores x 2.4 GHz). In terms of RAM it can accommodate 24 DIMMs (2 sockets x 4 channels x 3 DIMMs).

So at first look, 8 x 4-socket servers have the same compute capacity and performance as 16 x 2-socket servers, right? A 4-socket server can accommodate twice as many DIMMs, so the total RAM capacity of 8 x 4-socket servers and 16 x 2-socket servers is also the same. One 4-socket server (or two 2-socket servers) holds 768GB RAM with 16GB DIMMs or 1536GB RAM with 32GB DIMMs.
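The raw capacity figures above can be recomputed with a short sketch. It uses only the numbers quoted in this post; the class and function names are made up for illustration, and the "typical VM" limits are naive upper bounds that ignore hypervisor overhead and any memory or CPU overcommitment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Raw capacity of one server, using the figures quoted in the text."""
    name: str
    sockets: int
    cores_per_socket: int = 8      # E5-4640 and E5-2665 are both 8-core
    mhz_per_core: int = 2400       # 2.4 GHz
    channels_per_socket: int = 4
    dimms_per_channel: int = 3

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def threads(self) -> int:
        # Two logical CPUs per core when hyper-threading is enabled.
        return self.cores * 2

    @property
    def cpu_mhz(self) -> int:
        return self.cores * self.mhz_per_core

    @property
    def dimm_slots(self) -> int:
        return self.sockets * self.channels_per_socket * self.dimms_per_channel

    def ram_gb(self, dimm_gb: int) -> int:
        return self.dimm_slots * dimm_gb


def typical_vm_limits(host, dimm_gb=16, vm_mhz=333, vm_ram_gb=4):
    """Naive upper bounds on typical VMs per host: (by CPU MHz, by RAM)."""
    return host.cpu_mhz // vm_mhz, host.ram_gb(dimm_gb) // vm_ram_gb


S4 = Host("4S-SERVER", sockets=4)
S2 = Host("2S-SERVER", sockets=2)

for host, count in ((S4, 8), (S2, 16)):
    cpu_limit, ram_limit = typical_vm_limits(host)
    print(
        f"{count:>2} x {host.name}: {host.cores} cores, {host.threads} threads, "
        f"{host.cpu_mhz / 1000} GHz, {host.ram_gb(16)} GB per host, "
        f"{count * host.ram_gb(16)} GB per cluster, "
        f"typical VMs per host <= {cpu_limit} (CPU) / {ram_limit} (RAM)"
    )
```

With 16GB DIMMs both cluster options end up with the same total RAM, which is the point of the comparison: the raw numbers alone do not separate the two designs.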

If we use a vSphere cluster with 8 x 4S-SERVER or 16 x 2S-SERVER, we have the same total raw capacity and performance, but 16 x 2S-SERVER will beat 8 x 4S-SERVER in really available capacity, because when a single server fails we lose just 1/16 of capacity and performance instead of 1/8.
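The single-host failure argument is simple arithmetic, sketched below. The function names are illustrative, and the sketch assumes identical hosts and ignores any admission-control reservations.

```python
def surviving_fraction(hosts: int, failed: int = 1) -> float:
    """Fraction of cluster capacity left after `failed` identical hosts fail."""
    if not 0 <= failed <= hosts:
        raise ValueError("failed must be between 0 and the number of hosts")
    return (hosts - failed) / hosts


def surviving_mhz(hosts: int, mhz_per_host: int, failed: int = 1) -> int:
    """Raw CPU capacity (MHz) left after `failed` identical hosts fail."""
    if not 0 <= failed <= hosts:
        raise ValueError("failed must be between 0 and the number of hosts")
    return (hosts - failed) * mhz_per_host


# 8 x 4S-SERVER (76.8 GHz each) versus 16 x 2S-SERVER (38.4 GHz each)
for label, hosts, mhz in (("8 x 4S", 8, 76800), ("16 x 2S", 16, 38400)):
    print(
        f"{label}: {surviving_fraction(hosts):.4f} of capacity survives, "
        f"{surviving_mhz(hosts, mhz) / 1000} GHz left"
    )
```

After one failure the 2-socket cluster keeps 15/16 of its capacity and the 4-socket cluster 7/8, so the 2-socket design retains 38.4 GHz more raw CPU capacity.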

Is it true or not?

Yes, from a memory perspective.
Yes and sometimes no, from a CPU performance perspective.

Let's concentrate on CPU performance and compare the DELL 2-socket server M620 (E5-2665/2.4GHz) with the DELL 4-socket server M820 (E5-4640/2.4GHz). We all know that 1MHz on two different systems doesn't represent comparable performance, so the question is how to compare CPU performance. The answer is CPU normalization. Good, good ... but wait, how can we normalize CPU performance? The answer is a CPU benchmark. Good ... but which benchmark?

Below are different benchmark results for a single host, so that we can discuss in depth which system is better for a particular environment. Please note that some benchmark results are not available or published, so I use results from similar systems. I believe this is accurate enough for our comparison.

2S-SERVER: M620 (E5-2665/2.4GHz)
  • SPECint_rate2006: 611
  • SPECfp_rate2006: 467
  • VMmark: 5.1 (calculation: 2x M620 (E5-2680) VMmark result is 10.20 @ 10 tiles; 10.20 / 2 = 5.1)
  • SPECvirt_sc2013: 236.15 (calculation: 1x HP DL380p G8 (E5-2690) SPECvirt_sc2013 result is 472.3 @ 27; 472.3 / 2 = 236.15)
4S-SERVER: M820 (E5-4640/2.4GHz)
  • SPECint_rate2006: 1080
  • SPECfp_rate2006: 811
  • VMmark: 10.175 (calculation: 2x HP DL560 (E5-4650) VMmark result is 20.35 @ 18 tiles; 20.35 / 2 = 10.175)
  • SPECvirt_sc2013: 454.05 (calculation: 1x HP DL560 (E5-4650) SPECvirt_sc2013 result is 908.1 @ 53; 908.1 / 2 = 454.05)
Note 1: DELL 4S-SERVER VMware results are not published, so I use results for HP DL560 servers.
Note 2: Some SPECvirt_sc2013 results are not available for VMware vSphere, so I use results for Red Hat KVM.

Based on the results above I prepared a performance benchmark comparison table:

[Table: benchmark results with columns "2x 2S", "1x 4S" and "4S against 2S"]

So what does it mean? My interpretation is that 2-socket servers are better for raw mathematical operations (integer and floating point), but for more real-life workloads 4-socket servers have generally the same performance as 2-socket servers, with more cores/threads per single system.

BTW: It seems to me that CPU performance normalization based on SPECint and/or SPECfp is not fair to 4-socket servers. That's what Platespin Recon uses for CPU normalization.
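The comparison of two 2-socket hosts against one 4-socket host can be recomputed from the per-host results listed earlier. This is only a sketch of the arithmetic: the dictionaries simply restate the numbers from this post, and the function name is made up for illustration.

```python
# Per-host results as listed in the post (2S = M620, 4S = M820 or stand-ins).
TWO_SOCKET = {
    "SPECint_rate2006": 611,
    "SPECfp_rate2006": 467,
    "VMmark": 5.1,
    "SPECvirt_sc2013": 236.15,
}
FOUR_SOCKET = {
    "SPECint_rate2006": 1080,
    "SPECfp_rate2006": 811,
    "VMmark": 10.175,
    "SPECvirt_sc2013": 454.05,
}


def compare(two_s, four_s):
    """Return rows of (benchmark, 2x 2S score, 1x 4S score, 4S / (2x 2S))."""
    rows = []
    for name, score in two_s.items():
        pair = 2 * score               # two 2-socket hosts together
        single = four_s[name]          # one 4-socket host
        rows.append((name, pair, single, single / pair))
    return rows


print(f"{'benchmark':<18}{'2x 2S':>10}{'1x 4S':>10}{'4S vs 2S':>10}")
for name, pair, single, ratio in compare(TWO_SOCKET, FOUR_SOCKET):
    print(f"{name:<18}{pair:>10.2f}{single:>10.2f}{ratio:>10.1%}")
```

On these numbers one 4-socket host reaches roughly 88% and 87% of two 2-socket hosts on SPECint and SPECfp, but about 100% and 96% on VMmark and SPECvirt.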

We can say that there is practically no difference in per-MHz performance between our 2S-SERVER and 4S-SERVER. So what is the advantage of 4-socket servers based on E5-4600 CPUs? CPU performance is not only about MHz but also about CPU scheduling (aka multi-threading). The advantage of the 4S-SERVER is its higher count of logical CPUs, which has a positive impact on co-scheduling vCPUs of vSMP virtual machines. Although vCPU co-scheduling has been dramatically improved in ESX 3.0, some co-scheduling is required anyway. Co-scheduling executes a set of threads or processes at the same time to achieve high performance. Because multiple cooperating threads or processes frequently synchronize with each other, not executing them concurrently would only increase the latency of synchronization. For more information about co-scheduling look at https://communities.vmware.com/docs/DOC-4960

In our example we are planning to have monster VMs with 8 vCPUs, so the 64 logical CPUs in a 4S-SERVER potentially offer more scheduling opportunities than the 32 logical CPUs in a 2S-SERVER. As far as I know, the VMs in virtualization benchmark tiles (a tile is a group of VMs) usually have up to 2 vCPUs, so I think the co-scheduling issue is not covered by these benchmarks.
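As a rough illustration of "scheduling opportunities", the sketch below counts how many 8-vCPU VMs could have all their vCPUs placed on distinct logical CPUs at the same instant. This is a deliberately naive model, not how the ESX scheduler actually works: it ignores relaxed co-scheduling, other VMs on the host, and hypervisor overhead. The function name is made up for illustration.

```python
def simultaneous_vm_slots(logical_cpus: int, vcpus_per_vm: int) -> int:
    """How many VMs of the given size fit on the logical CPUs at one instant."""
    if logical_cpus < 0 or vcpus_per_vm <= 0:
        raise ValueError("logical_cpus must be >= 0 and vcpus_per_vm > 0")
    return logical_cpus // vcpus_per_vm


MONSTER_VCPUS = 8
for name, logical_cpus in (("4S-SERVER", 64), ("2S-SERVER", 32)):
    slots = simultaneous_vm_slots(logical_cpus, MONSTER_VCPUS)
    print(f"{name}: room for {slots} monster VMs at the same instant")
```

Across a whole cluster the totals are again equal (8 hosts x 8 slots versus 16 hosts x 4 slots); the difference is the number of placement options available inside a single host.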

So the final decision depends on the expected number of monster VMs, because that can affect the real performance of workloads inside these monster VMs. CPU overloading can be monitored with the ESX metric %RDY (vCPU is ready to run but no pCPU is available) and co-scheduling execution delays with the metric %CSTP (vCPU stopped because of co-scheduling). Recommended thresholds are discussed here, but every environment has different quality requirements, so your thresholds can be different. It depends on the quality SLA you want to offer and the type of applications you want to run on top of the virtual infrastructure.

Anyway, co-scheduling of monster VMs is a serious issue for IaaS cloud providers because it is really hard to explain to customers that fewer vCPUs can paradoxically give them better CPU performance. I call this phenomenon the "VIRTUALIZATION PARADOX".

The final hardware selection or recommendation always depends on the vSphere Architect, who has to carefully analyze the specific requirements and constraints of the particular environment and reasonably justify the decision. We should remember there can be other requirements favoring a specific platform. An example of such an "other" requirement (sometimes a constraint) is the situation when blade servers are to be used. With 2-socket blade servers it is usually very difficult and sometimes even impossible to avoid a single point of failure in the NIC/HBA/CNA adapter. 4-socket servers are usually full height (dual slot) and therefore I/O cards are doubled ... but that's another topic.
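Checking collected %RDY and %CSTP values against your own thresholds can be sketched as below. Everything here is hypothetical: the sample data, the field names `rdy_pct` and `cstp_pct`, and the limits passed in the example are placeholders, not recommended values. The sketch assumes the percentages are already expressed per vCPU.

```python
def flag_vms(samples, rdy_limit, cstp_limit):
    """Return {vm_name: [metrics over their limit]} for VMs that breach a limit.

    samples: {vm_name: {"rdy_pct": float, "cstp_pct": float}} (hypothetical shape)
    rdy_limit, cstp_limit: thresholds in percent, chosen per environment/SLA
    """
    flagged = {}
    for vm_name, metrics in samples.items():
        reasons = []
        if metrics["rdy_pct"] > rdy_limit:
            reasons.append("%RDY")
        if metrics["cstp_pct"] > cstp_limit:
            reasons.append("%CSTP")
        if reasons:
            flagged[vm_name] = reasons
    return flagged


# Made-up sample values and limits, for illustration only.
samples = {
    "typical-vm-01": {"rdy_pct": 1.2, "cstp_pct": 0.0},
    "monster-vm-01": {"rdy_pct": 7.5, "cstp_pct": 4.1},
}
print(flag_vms(samples, rdy_limit=5.0, cstp_limit=3.0))
```

The limits are required arguments on purpose, so that each environment has to state its own thresholds instead of inheriting a default.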