The taming of big data

Big data has taken root faster than anyone could ever have expected. But thanks to an explosion of management solutions, it’s becoming manageable, says BigInsights’ research director, Shayum Rahim


Over the last year, we’ve witnessed a change in perceptions around big data. Until recently, big data was thought of as ‘hype’, or a buzzword created by IT vendors to sell more products.

The interesting thing is that big data was not created by IT vendors at all. Big data is a by-product of consumer demand to do things digitally. Businesses began to fulfil that demand by moving more and more of their business functions online, then to mobile devices, sensors and machines. Almost everything we do today is digitised, and that has triggered a deluge of data that traditional IT hasn’t been able to deal with.

IT vendors, and in particular RDBMS (relational database management systems) vendors, have scrambled to straddle the market and capitalise on the massive opportunity presented by the data explosion. Of course, big data was not necessarily a new concept to them.

Many of their large enterprise customers already had hundreds of terabytes of data. Volume was not a problem either for state-of-the-art data warehousing capabilities offering zero failover, high-availability backup and more options than a stock market trading floor. This was business as usual.

What these vendors had a hard time factoring in were the other characteristics of big data, namely velocity and variety. More types of data were being created than ever previously imagined, from more sources than ever before. Data warehousing would be far too expensive a way to solve this problem. The industry needed something that would be cost effective and provide enough capacity to process and manage large volumes and varieties of data quickly.

Unstructured data rises

Enter unstructured data platforms. The two most notable and common are NoSQL databases and the Hadoop Distributed File System.

Without launching into a full history, Hadoop was the brainchild of Doug Cutting, who evolved it at Yahoo from an earlier prototype designed as an open source search engine. Yahoo released it as an open source project with the Apache Software Foundation and spawned a whole new sub-industry.

The appeal of Hadoop is that it can scale from a single server to thousands of machines. Hadoop became a sandbox for the early data scientists at Yahoo, but because of its open source nature it also had a wider appeal and began attracting more experiments, becoming ‘enterprise-grade’ in 2008.

Yahoo eventually spun out Hortonworks, an enterprise Hadoop management software company, in much the same way that Red Hat or SUSE are management layers for the open source Linux platform. It joined rivals Cloudera and MapR, which were already in the market.

The open source approach meant the game was well and truly on. The open source community has been very active around Hadoop, and has produced many management tools that make the platform much easier to use. These include YARN, a Hadoop cluster scheduling module; Hive, a data warehouse structure that allows unstructured data querying; and Mahout, a machine learning and data mining engine. Many more tools are in development.

The other big data platform to come to the fore is NoSQL (Not only SQL). The name refers to the structured query language (SQL) which, until now, was the default means of querying data. NoSQL departs from the tabular schema typical of relational database management systems to look at data through different structures, such as key-value, graph and document models, and allows for horizontal scaling in the open source environment.
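To make the difference between these structures concrete, here is a minimal sketch in plain Python. It is not tied to any particular NoSQL product: the customer record, the field names and the `bought_by` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical customer record, used purely for illustration.

# Relational (tabular) view: fixed columns, one row per record.
columns = ("customer_id", "name", "city")
relational_row = ("c42", "Ada", "Sydney")

# Key-value view: an opaque value looked up by a single key.
key_value_store = {"customer:c42": "Ada|Sydney"}

# Document view: a self-describing, nested record whose fields
# can differ from one document to the next.
document = {
    "_id": "c42",
    "name": "Ada",
    "city": "Sydney",
    "orders": [{"sku": "book-1", "qty": 2}],
}

# Graph view: nodes plus labelled edges that describe relationships.
nodes = {"c42": {"name": "Ada"}, "book-1": {"title": "Notes"}}
edges = [("c42", "BOUGHT", "book-1")]


def bought_by(customer_id):
    """Return the ids of products linked to a customer by a BOUGHT edge."""
    return [dst for src, label, dst in edges
            if src == customer_id and label == "BOUGHT"]
```

The same customer appears in all four shapes. What changes is how much structure the store itself understands, and therefore which questions it can answer cheaply.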

The key feature of the NoSQL concept is the ability to scale with different data values without sacrificing performance. As data begins to grow, you just need to add more hardware. This has triggered the advent of specialised NoSQL database companies that are now challenging enterprise IT vendors in a race that is wide open to everyone.
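The idea of adding hardware as data grows can be sketched as follows. This is a simplified illustration, not how any specific NoSQL database is implemented: the node names, the key format and the `node_for` and `distribute` helpers are assumptions made for the example.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node for a key by hashing it (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


# Hypothetical workload: 1,000 keys spread over three nodes, then four.
keys = [f"user:{i}" for i in range(1000)]
three = distribute(keys, ["node-a", "node-b", "node-c"])
four = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

for name, placement in (("3 nodes", three), ("4 nodes", four)):
    print(name, {node: len(stored) for node, stored in placement.items()})
```

With an extra node, each machine holds a smaller share of the keys. Note that simple modulo placement moves many keys whenever the node count changes; production systems commonly use techniques such as consistent hashing to limit that reshuffling.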

Startup proliferation

Since the big data phenomenon erupted on the market, we have seen many pure big data plays emerging to meet the challenge. Two big factors have contributed to this.

The first is the open source environment in which Hadoop and NoSQL technologies are available. The market is no longer constrained by proprietary technologies. This has allowed the brightest minds to experiment, contribute and collaborate on projects. The result is very sophisticated systems and solutions.

Secondly, private investors and venture capital firms have recognised the enormous opportunity available to those taking these sophisticated solutions to market, and they’re quite ready to back the bright minds coming out of universities globally. The young ‘geeks’ turned entrepreneurs seem to have little trouble finding the capital to give their ideas wings, and for many investors it has certainly paid off.

What will be even more interesting is how these startups fare over the years. The proliferation of startups would imply competition could become very tight, and we should expect to see quite a bit of vendor consolidation.

However, contrary to the competitive nature of the proprietary vendors of the past, a different trend is emerging. The new breed of big data vendors are more agile and nimble than their traditional counterparts, and are headed by a new generation with a fresh view of how business should be done. These organisations are taking a more collaborative approach, likely fostered by their open source project experiments. This seems to have carried over to their corporate manifestations, and we see several of these organisations partnering in order to bring solutions to the market.

While traditional vendors dreamt of building an empire, developing solutions and making acquisitions in their attempt to be the one-stop shop, the new breed of vendors thus far have been content to stick to what they are good at. They then collaborate with those with more expertise in their particular niche in big data to address the problems their own solutions can’t. As a result, the Hadoop and NoSQL platforms are becoming more complementary, thanks to the solutions offered by vendors on either platform.

Further down the track, there is bound to be some consolidation, but the ‘free for all’ development environment offered by open source platforms will spur new innovation.

Skillset shortage

One of the biggest concerns around the rise and growth of big data has been the shortage of big data skillsets. Big data was never deemed to be something traditional database skillsets could tackle, and upskilling for it is not easy.

However, innovation taking place in the current environment suggests plenty of these processes can be automated and made friendlier to existing IT skills. Several pure-play big data organisations have new capabilities that allow SQL databases to run natively on their unstructured platform, whether proprietary, Hadoop or NoSQL based. This is good news for organisations struggling to find relevant skillsets for their big data initiatives.

In addition, we’re seeing a rise in cloud-based companies offering ‘everything-as-a-service’. Whether it is software, infrastructure or even a platform being offered as a service, there is now a lower cost of entry. This will be a game-changer for those who previously thought big data technologies were out of their reach. There is no longer the need for a heavy upfront investment in software and infrastructure.

For 30 per cent of organisations polled by BigInsights on this topic last year, the cost of technology infrastructure was inhibiting big data investments. Cloud computing has already shaken investment in on-premise solutions to the core, and now that it extends to delivering big data solutions and capabilities, the value they bring is fast becoming available to organisations across markets.

Resistance is futile

Big data took root in the market faster than many organisations could fully understand and accept it, and it was initially met with a certain degree of scepticism, some apprehension and even fear. The benefits, however, overshadowed any misgivings, and a whole ecosystem of technologies and services has flourished that will allow all organisations to reap the rich insights their data can deliver. We are living in momentous times and should expect much more, for the renaissance has just begun.