Douglas Soltesz, vice president and CIO at Budd Van Lines, is facing a common problem: a seemingly endless flood of data.
"If you gave me an infinite amount of storage, I could fill it," he says. The most recent four months of high-definition surveillance video from the company's offices and warehouses now consumes 60TB on his Nexenta Stor NAS and SAN platforms. That video is one reason his storage needs are growing 50% to 80% per year.
If he had twice as much capacity, he says, his users would just ask to keep their video twice as long.
With existing hard drive technologies ending their decade-long run of ever-increasing densities, IT shops are waiting for new technologies such as shingled magnetic recording (SMR) and phase-change memory (PCM) to boost storage densities. In the meantime, they are holding down costs -- and boosting data access -- with software that virtualizes, deduplicates and caches data on commodity disk drives, solid-state drives (SSDs) and server-side flash memory.
Disk Density Gets Higher Still
After about 10 years of steadily increasing densities, disks that use perpendicular magnetic recording (PMR) are topping out at about 1Tbit per square inch, says Mark Re, a senior vice president at storage vendor Seagate Technology.
In the second half of this year, Seagate will begin shipping drives that use SMR to squeeze more data onto disks by overlapping the data tracks on them like shingles on a roof, says Fang Zhang, a storage analyst at IHS iSuppli. That should eventually boost drive densities to 1.3Tbits to 1.4Tbits per square inch, says Re, who adds that Seagate's SMR drives will start with desktop form factors and spread to other platforms such as storage arrays next year.
The next advance, which will take disk drives to 5Tbits per square inch, is heat-assisted magnetic recording (HAMR), which uses a small laser to change the magnetic properties of the disk, says Re. Seagate's first HAMR drives are expected in 2015 or 2016.
In the fourth quarter of this year, Seagate rival Western Digital is expected to release disk drives filled with helium, which provides less resistance than air and thus allows the addition of another storage platter or two to a drive. Those extra platters could lift the maximum capacity of PMR drives from today's 4TB to 5TB or 6TB, says Zhang. Western Digital says it also plans to release SMR and HAMR drives within about two years, and by the end of the decade it hopes to double hard drive density through the use of self-assembling molecules and nanoimprinting.
On the flash memory front, vendors are working to increase not only the density, but also the useful capacity and life span of flash memory used in server-based flash storage and SSDs.
The NAND flash on which most flash and SSD drives are based will begin to be replaced by a new form of nonvolatile memory called phase-change memory by around 2016, says Milan Shetti, CTO at HP Storage. Unlike magnetic recording, which stores data by changing the magnetic orientation of a region of the media, PCM applies heat to change the electrical conductivity of the media. PCM drives are not only faster than NAND flash, but their memory cells can also withstand two to three times as many read/write cycles as NAND flash, says Haris Pozidis, manager of memory and probe technologies at IBM's Zurich research lab. That's important for applications such as caching, where data is constantly being read and written.
Shetti predicts initial drive capacities of about 200 to 250GB, with drive sizes at least doubling by 2018. He stresses that this will all be usable capacity, which is not the case in current SSDs, where 15% to 20% of raw capacity is set aside to replace cells that may wear out. Shetti says he expects prices per gigabyte to be comparable to those of current flash drives. That equates to a 15% to 20% price cut, since all of the raw capacity will actually be usable.
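The arithmetic behind that price cut can be sketched in a few lines of Python. This is an illustration only: the $1.00 price per raw gigabyte and the function names are assumptions, not figures from Shetti, and the raw price cancels out of the result in any case.

```python
RAW_PRICE = 1.00  # hypothetical dollars per raw gigabyte, assumed equal for both media


def price_per_usable_gb(raw_price, reserved_fraction):
    """Price of each usable gigabyte when part of the raw capacity is held in reserve."""
    return raw_price / (1.0 - reserved_fraction)


def price_cut(reserved_fraction, raw_price=RAW_PRICE):
    """Fractional saving per usable gigabyte when no capacity has to be reserved."""
    ssd = price_per_usable_gb(raw_price, reserved_fraction)  # today's SSD
    pcm = price_per_usable_gb(raw_price, 0.0)                # all raw capacity usable
    return (ssd - pcm) / ssd


for reserved in (0.15, 0.20):
    print(f"{reserved:.0%} reserved -> {price_cut(reserved):.0%} lower price per usable GB")
```

With the same raw price, the saving per usable gigabyte equals the fraction that no longer has to be set aside, which is where the 15% to 20% figure comes from.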
Dedupe: A Must-Have Feature
Over the past 10 years, deduplication -- the process of eliminating duplicate copies of data -- has moved from game-changing novelty to must-have feature.
Observers say not to expect any breakthrough increases in the amounts of data that deduplication can remove from hard drives. Currently, deduplication typically reduces data by a factor of seven to 10. Future improvements will come from increases in the speed at which data is deduplicated and from the use of standard deduplication systems across an enterprise.
Speeds will improve as a result of deduplication being performed in hardware rather than software, and in nonvolatile memory such as PCM, which is faster than today's NAND flash, observers say. Predicting that "every [nonvolatile memory] controller is going to have [deduplication] built in," Shetti also points out that, unlike on disk drives, deduplication doesn't cause fragmentation on nonvolatile memory drives.
In-line deduplication, in which data is deduped before it is ever stored, reduces storage requirements everywhere from primary storage to backup and replicated copies. Pure Storage says its in-line data deduplication allows its flash arrays to store five to 10 times as much data as their designated size.
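For readers who want to see the idea in miniature, the following Python sketch deduplicates data in-line by hashing fixed-size chunks and storing each unique chunk once. It is a simplified illustration, not how Pure Storage or any other vendor implements deduplication; the DedupStore class, the fixed chunk size and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-line deduplicating store (hypothetical, for illustration only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}        # digest -> chunk bytes; each unique chunk is kept once
        self.logical_bytes = 0  # total bytes callers have asked us to store

    def write(self, data):
        """Store data and return the list of digests needed to rebuild it."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # In-line: a chunk we have already seen is never written a second time.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe):
        """Reassemble the original bytes from a recipe returned by write()."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def physical_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())

    def ratio(self):
        """Logical bytes divided by physical bytes (0.0 for an empty store)."""
        physical = self.physical_bytes()
        return self.logical_bytes / physical if physical else 0.0


store = DedupStore(chunk_size=4)
recipe = store.write(b"AAAABBBBAAAA")  # the third chunk repeats the first
print(store.physical_bytes(), store.ratio())
```

Production systems typically add variable-size chunking, persistent indexes and reference counting for deletes, none of which is shown here.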
Observers also expect to see deduplication spread from its traditional use in backup to other applications and to more computing and storage devices. Dell says it plans to incorporate the deduplication technology it gained through its purchase of Ocarina into its EqualLogic and Compellent product lines, "first with compression primarily for... data like snapshots," and later for more frequently accessed data and files, says Travis Vigil, executive director for product marketing at Dell Storage.
Sean Kinney, director of product marketing at HP Storage, predicts the rise of unified deduplication platforms that organizations can use for all of their applications and storage. That, he says, will reduce licensing, training and management costs as well as the amount of storage an organization must buy.
Performance Meets Speed
Some users aren't upgrading their storage systems solely because they need help managing large volumes of data; they're also driven by the need to access data quickly.
Case Western Reserve University is moving 100TB of research file data from an EMC Celerra NS480 to a Panasas ActiveStor 8 for rapid analysis, and another 65TB of structured administrative data to a Nexsan NST 5310. Besides higher performance, users wanted to create single name spaces as big as 600TB -- far above the 64TB limit of both EMC and NetApp offerings, says Brian Christian, design senior technical lead at the school.
"Our first, small, high-performance cluster" used a traditional NAS device acting as Network File Server, "and we overloaded it. After talking with our peers, we saw that to grow as we needed, we needed a parallel NAS. That's when we acquired Panasas," says Christian.
To boost performance, many customers are using flash memory within servers, as well as solid-state drives in storage arrays, to cache speed-sensitive data before writing it to slower, but less expensive and higher-capacity hard drives.
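The caching pattern described above, in which writes land on fast media first and move to slower disk later, can be sketched as a small write-back cache. The Python below is a minimal illustration under stated assumptions: the WriteBackCache class is hypothetical, a plain dictionary stands in for the hard drives, and eviction is oldest-first, which is not necessarily what any product mentioned here does.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy write-back cache: fast tier in front of a slow backing store."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity      # number of entries the fast tier can hold
        self.backing = backing_store  # dict standing in for slower hard drives
        self.cache = OrderedDict()    # key -> value, least recently used first

    def write(self, key, value):
        """Accept the write in the fast tier; destage old entries if it is full."""
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            old_key, old_value = self.cache.popitem(last=False)
            self.backing[old_key] = old_value  # written to slow disk only now

    def read(self, key):
        """Serve from the fast tier when possible, else from the backing store."""
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        return self.backing[key]

    def flush(self):
        """Destage everything still held in the fast tier."""
        while self.cache:
            key, value = self.cache.popitem(last=False)
            self.backing[key] = value


disk = {}
cache = WriteBackCache(capacity=2, backing_store=disk)
cache.write("a", 1)
cache.write("b", 2)   # both writes are absorbed by the fast tier
cache.write("c", 3)   # fast tier is full, so "a" is destaged to disk
```

The trade-off is the usual one for write-back designs: writes complete quickly, but data that has not yet been destaged lives only in the fast tier, which is why real products pair this with mirroring or nonvolatile media.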
Three years ago, slowing application performance and increasing upgrade costs spurred David Abbott, manager of IT infrastructure engineering at TripPak Services and ACS Advertising, both Xerox companies, to look for new platforms that could handle his expected 10TB of new storage per year "without management having a heart attack" over cost.
To the Rescue: Old, Slow Disk and Tape
Even as researchers fiddle with material science and software developers fine-tune clustered file systems, two old standbys -- slow, cheap, spinning disk and even older tape drives -- are playing a crucial role in managing the storage flood.
For customers using Cleversafe storage appliances, for example, "5,400 rpm and 7,200 rpm drives are the way to go" to achieve the lowest cost per gigabyte in dollars, power, cooling and space, says Chris Gladwin, Cleversafe's president and CEO. And Seagate senior vice president Mark Re points out that not only would it be costly to replace every hard drive with flash, but it would also be impossible to manufacture that much flash.
Tape is even slower and less expensive than disk, and it's often derided as clumsy, hard to use and prone to failure. Nonetheless, "tape enjoys a significant space efficiency advantage over disk storage by virtue of its ability to pack more recording surface area into a given physical volume," according to a blog post by Eric Slack, an analyst for Storage Switzerland, an IT analyst firm based in Fort Worth, Texas.
According to an April 2012 presentation by IBM Systems Technology Group, while NAND flash and hard disk drive densities will grow 20% to 30% by 2014, tape densities could grow by 40% to 80%.
Therefore, Slack argues that tape will continue to be a good option for handling big data, which will consist of "file-based reference data that's stored for long periods but must still be available in a relatively short time frame."
- Robert L. Scheier
Abbott's operation, a software-as-a-service provider for the transportation industry, is now using three network-attached storage (NAS) units from Starboard Storage Systems for storing 80TB of image files and 45TB of performance-sensitive data for 500 virtual machine images and more than 200 virtual desktops on a Pure Storage flash array.
Before moving to the Nexenta NAS/SAN platform, Budd Van Lines had relied on a Compellent SAN. While it wasn't full, "it was running out of IOPS" to handle a growing number of queries among applications for work such as month-end accounting, Soltesz says. To provide that performance, the NexentaStor platform caches data in solid-state drives for faster access before writing that data to 7,200 rpm serial-attached SCSI (SAS) drives for long-term storage.
NAS vendor NetApp also entered the flash array market with its EF540, the first in a line of arrays it says will combine consistent, low-latency performance, high availability and integrated data protection with enterprise storage efficiency features such as in-line deduplication and compression.
Software Plus Commodity Disk
Online marketing SaaS provider Constant Contact is among those turning away from proprietary hardware and software to commodity disk managed by software.
"When I joined three and half years ago, our primary way of scaling was to buy more storage, faster storage, and bigger and faster database servers," says CTO Stefan Piesche. To reduce costs even while his storage needs grow 15% to 25% per year, he is switching from IBM's DB2 database running on 3Par SANs to the open-source MySQL and Cassandra NoSQL databases running on Dell servers, commodity disk and Fusion-io flash cards.
This new platform, he says, is not only an "order of magnitude faster" than its older storage but delivers high performance, availability and disaster recovery without the need for extensive management. The performance gain achieved by writing data to six storage nodes without transferring it over the network means storing multiple copies of the same data. However, says Piesche, the low price of commodity disk and servers make the trade-off worthwhile.
He also notes his customers won't suffer if the marketing data stored in one of those copies is a few milliseconds out of date -- although that wouldn't be true for a financial trading system where prices constantly change.
"Sharding," or splitting databases also helps Constant Contact scale easily, he says. "We can put a set of customers on Databases A, B and C, [which are] usually multiple instances of the same database with the same schema. We want them to be identical and on commodity hardware, to keep our operational costs low, so it's a non-event to roll out a new one. For 50,000 customers, we add two commodity database servers running MySQL," with no performance hit on other users, says Piesche.
Another vendor in this space is CommVault, which says its Simpana software platform cuts storage costs by up to 50%, administrative overhead by up to 80% and annual support costs by up to 35% by reducing the number of copies of data stored as well as the number of storage-related applications to buy and maintain.
Sanbolic claims its Melio5 data management platform provides high availability, application scale-out using shared-data server clusters and fast access to files of any size in a variety of workloads, and that it scales to more than 2,000 physical or virtual nodes and up to 65,000 storage devices. Its Latency Targeted Allocator allows the Melio platform to share server-side flash and SSDs within storage arrays, as well as conventional hard drives, across nodes. This eliminates single points of failure and hard-to-access data and application silos, says CEO and co-founder Momchil Michailov.
Some newer vendors package their software in the form of physical hardware with disks and processors. Gridstore's storage appliances virtualize storage controllers as well as data to eliminate single points of failure and provide faster, parallel data access from many servers. This allows the number of controllers to grow, tapping unused computing power to scale performance as well as capacity. However, it currently supports only Windows and file-based storage.
Another software-based approach to scalability is distributing "slices" of data over many physical databases. Cleversafe's dsNet technology, also sold as appliances, works best with more than a petabyte of storage, made up of objects larger than 50KB to 100KB. This is ideal, says President and CEO Chris Gladwin, for applications such as photo sharing over the Web.
As hard drives get bigger and faster, flash gets bigger and more reliable, and open-source storage stacks mature, some industry watchers see fundamental changes in how organizations cope with the data flood.
With the adoption of new nonvolatile memory technologies, the need for tiering data between solid state and spinning disk will diminish as new technologies become cost-competitive with higher-end Fibre Channel and SAS disks, predicts Shetti. Higher-capacity, lower-cost SATA disks will still have a role, but he says the complexity of packaging and different software interfaces will discourage users from mixing nonvolatile memory and SATA in the same system.
Within three to five years, the price of flash drives will be roughly the same as that of high-performance disk, says Hu Yoshida, CTO at Hitachi Data Systems. The two are already at parity, he says, when the capacity of the hard drives is reduced by short-stroking (using only part of the disk capacity to speed performance by reducing the distance the read/write heads must travel to reach the data) and by writing data across multiple disks in RAID data protection configurations.
Even commodity hard drives, however, will gain speed as vendors add more cache to them. Seagate expects such "hybrid" drives to make up most of its product line by the middle of the decade.
Cloud storage services will provide slow but extremely low-cost archiving services to reduce the in-house storage load. Amazon Glacier, for example, costs as little as 1 cent per gigabyte per month. While "it could take three to five hours to retrieve that data," that might be no longer than it would take to restore data from tape stored offsite -- and Glacier would be cost-competitive with tape, says Greg Schulz, founder of consultancy StorageIO.
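To put the 1-cent figure in perspective, the short Python sketch below estimates the monthly and yearly storage bill for an archive. It is a rough illustration: the 60TB archive size is borrowed from the surveillance video example earlier in the article, decimal units (1TB = 1,000GB) are assumed, and retrieval and transfer fees are ignored.

```python
def monthly_archive_cost(terabytes, dollars_per_gb_month=0.01, gb_per_tb=1000):
    """Storage-only cost per month; retrieval and transfer fees are not included."""
    return terabytes * gb_per_tb * dollars_per_gb_month


monthly = monthly_archive_cost(60)  # hypothetical 60TB archive
yearly = monthly * 12
print(f"${monthly:,.0f} per month, ${yearly:,.0f} per year")
```

At that rate, storing 60TB costs about $600 a month, before any charges for getting the data back out.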
"Object stores can reduce storage costs and complexity by eliminating the need for hierarchical file systems," says Gladwin. "In a very large data storage system, running a file system [requires] additional racks of servers" that consume power, take up space and cost money. With an object store, he says, an application such as a social media website lets a user search for friends without using a file system.
Meanwhile, IT shops continue to be drawn to the cloud's combination of cost efficiencies, low-cost hardware and low-cost, open-source software.
Constant Contact, for example, is considering "private storage clouds," possibly using open-source software, on the system of a provider such as Amazon S3, for the low costs and "almost unlimited horizontal scale" they can deliver, says Piesche. Using Cassandra, for example, he says he would like to scatter storage clusters among distributed data centers for disaster recovery "without any licensing costs, without any complicated setup and without any manual intervention."
The replication capabilities he needs aren't available yet. But he has to keep looking because, as Schulz says, "For the vast majority of people there's no such thing as a data recession."
Scheier is a veteran technology writer. He can be reached at firstname.lastname@example.org.