A Downside to Great Storage Performance

There is a downside to great storage performance!

One might think that great storage performance is great performance! Given the infamous storage performance gap, what could possibly be wrong with great performing storage hardware like SSD?

Let me recount an actual case of a storage performance issue, one that did not involve SSDs, but which causes me to anticipate a spate of issues in the near future.

Let’s call this the case of Customer C. Over a weekend, critical production jobs began to run extremely slowly. Jobs which normally finished in a couple of hours were running many, many hours and not meeting the required deadlines! Issues seemed to be isolated to a couple of database servers, and those server stats showed pretty high read and write response times.

Obviously a STORAGE PERFORMANCE PROBLEM! What’s gone wrong with that storage?

Fortunately, there was a considerable amount of storage performance information available for the storage subsystem. The configuration consisted of about 80 RAID 5 arrays. One array was running consistently at 99+% utilization, while all the others were running at modest utilization (see graph). Read and write response times were pretty high too! Customer C quickly confirmed that the critical databases were indeed on that particular array. What more evidence of a storage problem could be required? Customer C demanded an “action plan”, storage remediation – namely, install SSDs! They are so fast that these response times would never be a problem.

Notwithstanding the speed of SSDs, it is prudent to ask where all that activity on that busy array was coming from. Were the production databases suddenly saturating the array? No, in fact they had rather modest IO rates, albeit with high response times.

But there were several other extremely busy volumes allocated to that array. They had names like rootvgxxx. Hmm…. The xxx part of the volume names pointed to a couple of unimportant servers. Still a storage problem as far as Customer C was concerned. Nevertheless, someone took a look at those other servers and discovered they were in a tight loop of core dumps, writing constantly to dump logs on the suspect array, hour after hour, for several days. A reset on those (unimportant) systems solved the problem immediately.
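The question asked above, which volumes were actually generating the load on the busy array, can be sketched in a few lines of Python. This is only an illustration: the sample records, the array and volume names, and the `top_contributors` helper are all invented here, and real numbers would come from whatever per-volume statistics your storage monitoring tool exports.

```python
from collections import defaultdict


def top_contributors(samples, array_id, limit=3):
    """Rank the volumes on one array by their share of that array's total IO."""
    totals = defaultdict(float)
    for sample in samples:
        if sample["array"] == array_id:
            totals[sample["volume"]] += sample["io_per_sec"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(volume, rate / grand_total) for volume, rate in ranked[:limit]]


# Invented sample data (hypothetical names and rates, not Customer C's numbers).
samples = [
    {"array": "A17", "volume": "prod_db_1", "io_per_sec": 300.0},
    {"array": "A17", "volume": "prod_db_2", "io_per_sec": 200.0},
    {"array": "A17", "volume": "noisy_vol_1", "io_per_sec": 2500.0},
    {"array": "A17", "volume": "noisy_vol_2", "io_per_sec": 2000.0},
    {"array": "A03", "volume": "other_vol_1", "io_per_sec": 400.0},
]

for volume, share in top_contributors(samples, "A17"):
    print(f"{volume}: {share:.0%} of array IO")
```

With these made-up numbers the two noisy volumes account for 90% of the array's IO, which is the kind of picture that points away from the production databases and toward their neighbors.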

You might ask how constant core dumps from multiple systems could go unnoticed. Well, there were over 300 systems attached to that storage subsystem. Some were considered important and others not. The production systems were closely monitored, but those “unimportant” systems did not receive close monitoring. Obviously, a rigorous process of performance monitoring should have caught this before it became a “storage problem”. I am not so interested in criticizing the monitoring processes as in speculating what the impact of better storage allocation might have been, and how SSDs might have behaved in this set of circumstances.

Actually, it has not been best practice to allocate volumes to one particular RAID array for quite some time. Individual servers are powerful enough to overload a single array of 7 or 8 or 10 disks. It has (for some time) been better practice to use various techniques to spread activity across multiple RAID arrays. In this case, let us imagine how this situation might have played out. The rogue systems doing all that writing would spread their activity across many more arrays and many more individual disks. The write response times for this activity might still be elevated, but not absurd. And the production systems would still have been impacted, but not quite so severely, not quite so suddenly and dramatically. It would look much more like a gradual degradation in storage performance.
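A rough back-of-the-envelope version of that argument, as a Python sketch. Every number here is an assumption chosen for illustration: the host write rate, the 150 IO/sec per disk, and the RAID 5 small-write penalty of 4 back-end IOs per host write. Large sequential dump writes may well behave better than this simple model suggests.

```python
def added_utilization(write_rate, arrays, disks_per_array,
                      iops_per_disk, write_penalty=4):
    """Estimate the extra utilization each array sees from a write workload.

    write_rate      host writes per second from the rogue systems (assumed)
    arrays          number of arrays the volumes are spread across
    disks_per_array disks in each array
    iops_per_disk   back-end IOs per second one disk can sustain (assumed)
    write_penalty   back-end IOs per host write (4 assumed for RAID 5 small writes)
    """
    backend_ios_per_array = write_rate * write_penalty / arrays
    array_capacity = disks_per_array * iops_per_disk
    return backend_ios_per_array / array_capacity


# Same hypothetical workload, concentrated on one array versus spread over 16.
concentrated = added_utilization(300, arrays=1, disks_per_array=8, iops_per_disk=150)
spread = added_utilization(300, arrays=16, disks_per_array=8, iops_per_disk=150)

print(f"one array:  {concentrated:.0%} added utilization")
print(f"16 arrays:  {spread:.1%} added utilization per array")
```

Under these assumptions the same workload that saturates a single 8-disk array adds only a few percent to each of 16 arrays: everyone is slowed a little, and nobody is slowed dramatically.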

And what if SSDs had been in the mix? SSDs are quick! Even writes, which are the weaker part of SSD performance, are quick. This problem might have gone undetected for days and days. So in the end I guess I do have to fault the monitoring process – or the lack thereof.

What about “automatic tiering” and dynamic storage optimization algorithms? They could make it even more difficult to see through this particular set of circumstances. But this was a real set of circumstances, not hypothetical at all. Fancy algorithms and super-speed hardware are no doubt great for optimizing performance in normal circumstances. But they might just make it harder to detect abnormal circumstances.

Storage performance monitoring is a crucial business process, even with SSDs and clever storage management algorithms. The downside of great performance is that it can mask the weakness in your performance monitoring and performance management processes.
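One way to keep fast hardware from hiding a runaway workload is to watch the workload itself rather than only the response times. The Python sketch below compares each volume's current write rate with its own history and flags large jumps, whether or not anything looks slow yet. The function name, the thresholds (`factor`, `floor`), and the sample data are all hypothetical; treat this as a starting point, not a finished monitoring process.

```python
from statistics import mean


def flag_runaway_writers(history, current, factor=5.0, floor=50.0):
    """Return volumes whose current write rate far exceeds their own baseline.

    history maps volume -> list of past write rates (writes/sec)
    current maps volume -> latest write rate (writes/sec)
    factor  how many times the baseline counts as abnormal (assumed value)
    floor   ignore volumes writing less than this (assumed value)
    """
    flagged = []
    for volume, rate in current.items():
        past = history.get(volume, [])
        baseline = mean(past) if past else 0.0
        if rate >= floor and rate > factor * baseline:
            flagged.append(volume)
    return sorted(flagged)


# Invented data: one busy but steady production volume, two "unimportant" hosts.
history = {
    "prod_db_1": [300.0, 320.0, 310.0],
    "test_host_1": [5.0, 8.0, 6.0],
    "test_host_2": [10.0, 12.0, 9.0],
}
current = {"prod_db_1": 330.0, "test_host_1": 2400.0, "test_host_2": 11.0}

print(flag_runaway_writers(history, current))
```

The check deliberately covers every volume, not just the ones attached to systems someone has labeled important, since in the case above the trouble came from the systems nobody was watching.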


IBM 2013 Storage Performance Monitoring

For a quick look at the latest storage performance management techniques from IBM Tivoli Storage Productivity Center, see Brian Smith’s webinar presentation. The webinar storyboard is available as a PDF.


Storage and More – CHM

Revolution: The First 2000 Years of Computing is the Computer History Museum’s latest exhibit. And what a show! You can see models of early computational machines, short videos explaining how things work, and room after room of computer artifacts. Storage technology has its own place, with tubes, transistors, delay lines, rotating disks and drums, bubbles, and other exotic media.

Beyond storage, the exhibit chronicles software, hardware, games, primary uses for computers throughout the ages, and much more. You’ll be amazed at how well the show is put together. If you’re at all fascinated by computers, this is a must-see!


How Is Your Storage Performance?

Have you ever wondered how the performance of your storage subsystem compares to that of others? You may know how many IOs per second your subsystem does, or how many megabytes per second of data it delivers, but how does that compare to other subsystems in other IT shops?

Scatter Plot of IO Rate and Response Time
The IO Rate and Response Time scatter plot shows IO rates from about 2,000 IO/sec to 200,000 IO/sec, and response times ranging from a fraction of a millisecond to about 15 msec. Where might your subsystem fall in this space? Let’s look at datarates next.

Scatter Plot of Datarate and Response Time
This scatter plot shows datarates from less than 100 MB/sec to well over 2000 MB/sec, and response times from a fraction of a millisecond to about 60 msec per IO. There is a trend in this data that correlates higher datarates with higher response times.

 


Storage Virtualization

Thinking about the performance of traditional disk storage, things have really changed, and rather suddenly.

  • Storage capacity of disks has exploded.
  • Solid State Devices are finally making real inroads.
  • To make it all simple: virtualization of everything, from servers to storage to the networks in between!

Performance Matters – Capacity is Incidental

Computer disk storage has recently crossed into a new pricing realm. Storage capacity has become inexpensive to the point that access to storage (performance) becomes the major cost consideration.
