It's no secret that data volumes are growing exponentially. What's a bit more mysterious is figuring out how to unlock the value of all of that data. A big part of the problem is that traditional databases weren't designed for big data-scale volumes, nor were they designed to incorporate different types of data (structured and unstructured) from different apps.
Lately, Apache Hadoop, an open-source framework that enables the processing of large data sets in a distributed environment, has become almost synonymous with big data. With Hadoop, end users can run applications on systems composed of thousands of nodes that pull in thousands of terabytes of data.
According to Gartner estimates, the current Hadoop ecosystem market is worth roughly $77 million. The research firm expects that figure to balloon to $813 million by 2016.
Here are 10 startups hoping to grab a piece of that nearly $1 billion pie. These startups were chosen and ranked based on a combination of funding, named customers, competitive positioning, the track record of their executives, and the ability to articulate a real-world problem and explain why the startup's solution is an ideal one to solve it.
(Please note that this lineup favors newer startups. As a result, some big, well-funded names have been left off, such as Cloudera, Datameer, DataStax, and MapR Technologies, simply because they've been around longer than most in this new market sector.)
Platfora
What They Do: Provide a big data analytics solution that transforms raw data in Hadoop into interactive, in-memory business intelligence.
Headquarters: San Mateo, Calif.
CEO: Ben Werther, who formerly served as vice president of products at DataStax.
Funding: $65 million to date. The latest round ($38 million Series C) was locked down in March. Tenaya Capital led the round, while Citi Ventures, Cisco, Allegis Capital, Andreessen Horowitz, Battery Ventures, Sutter Hill Ventures, and In-Q-Tel all participated.
Why They're on This List: As with many startups on this list, Platfora was founded in order to simplify Hadoop. While businesses have been rapidly adopting Apache Hadoop as a scalable and inexpensive solution to store massive amounts of data, they struggle to extract meaningful value from that data. The Platfora solution masks the complexity of Hadoop, which makes it easier for business analysts to leverage their organization's myriad data.
Platfora tries to simplify the data collection and analysis process, automatically transforming raw data in Hadoop into interactive, in-memory business intelligence, with no ETL or data warehousing required. Platfora provides an exploratory BI and analytics platform designed for business analysts. Platfora gives business analysts visual, self-service analytical tools that help them navigate from events, actions, and behaviors to business facts.
Customers include Comcast, Disney, Edmunds.com and the Washington Post.
Competitive Landscape: Platfora competes with the likes of Datameer, Tableau, IBM, SAP, SAS, Alpine Data, and Rapid-I.
Key Differentiator: Platfora claims to have the first scale-out in-memory Big Data Analytics platform for Hadoop. Platfora's focus on simplifying Hadoop and Big Data analysis is becoming a more common goal of late, but they are an early mover in this respect.
Alpine Data Labs
What They Do: Provide a Hadoop-based data analysis platform.
Headquarters: San Francisco, Calif.
CEO: Joe Otto, formerly senior vice president of sales and service at Greenplum.
Funding: $23.5 million in total funding, including $16 million in Series B funding, from Sierra Ventures, Mission Ventures, UMC Capital and Robert Bosch Venture Capital.
Why They're on This List: Most executives and managers don't have the time or skills to code in order to glean data insights, nor do they have the time to learn about complex new infrastructures like Hadoop. Rather, they want to see the big picture. The trouble is that complex advanced analytics and machine learning typically require scripting and coding expertise, which can restrict access to data scientists alone. Alpine Data mitigates this issue by making predictive analytics accessible via SaaS.
Alpine Data provides a visual drag-and-drop approach that allows data analysts (or any designated user) throughout an organization to work with large data sets, develop and refine models, and collaborate at scale without having to code. Data is analyzed in the live environment, without migrating or sampling, via a Web app that can be locally hosted.
Alpine Data leverages the parallel processing power of Hadoop and MPP databases and implements data mining algorithms in MapReduce and SQL. Users interact with their data directly where it already sits. Then, they can design analytics workflows without worrying about data movement. All this is done in a Web browser, and Alpine Data then translates these visual workflows into a sequence of in-database or MapReduce tasks.
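To make the idea of translating a visual workflow into MapReduce tasks more concrete, here is a minimal sketch in plain Python. It is not Alpine Data's implementation or API: the workflow format, `compile_workflow`, and `run_mapreduce` are all hypothetical names invented for illustration, and the "cluster" is simulated in a single process.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate a MapReduce job locally: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical description of a two-step workflow built in a drag-and-drop UI.
workflow = [
    {"step": "filter", "field": "country", "equals": "US"},
    {"step": "group_sum", "key": "product", "value": "amount"},
]

def compile_workflow(steps):
    """Turn the filter + group_sum steps into one mapper/reducer pair."""
    filt, agg = steps

    def mapper(record):
        # The filter step runs inside the map phase, next to the data.
        if record[filt["field"]] == filt["equals"]:
            yield record[agg["key"]], record[agg["value"]]

    def reducer(key, values):
        # The aggregation step runs in the reduce phase.
        return sum(values)

    return mapper, reducer

records = [
    {"country": "US", "product": "tv", "amount": 100},
    {"country": "US", "product": "tv", "amount": 200},
    {"country": "DE", "product": "tv", "amount": 999},
    {"country": "US", "product": "radio", "amount": 50},
]

mapper, reducer = compile_workflow(workflow)
totals = run_mapreduce(records, mapper, reducer)
```

In this sketch the user never writes the mapper or reducer by hand; they are generated from the workflow description, which is the general pattern the paragraph above describes.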
Customers include Sony, Havas Media, Scala, Visa, Xactly, NBC, Avast, BlackBerry, and Morgan Stanley.
Competitive Landscape: Alpine will compete both with large incumbents (SAS, IBM, SPSS, and SAP) and such startups as Nuevora, Platfora, Skytree, Revolution Analytics, and Rapid-I.
Key Differentiator: Alpine Data Labs argues that most competing solutions are either desktop-based or point solutions without any collaborative capability. In contrast, Alpine Data's platform has a "SharePoint-like" feel. On top of collaboration and search, it also provides modeling and machine learning under the same roof. Alpine is also part of the No-Data-Movement camp. Regardless of whether a company's data is in Hadoop or an MPP database, Alpine sends out instructions, via its In-Cluster Analytics, without ever moving data.
Altiscale
What They Do: Provide Hadoop-as-a-Service (HaaS).
Headquarters: Palo Alto, Calif.
CEO: Raymie Stata, who was previously CTO of Yahoo.
Founded: March 2012
Funding: Altiscale is backed by $12 million in Series A funding from General Catalyst and Sequoia Capital, along with investments from individual backers.
Why They're on This List: Hadoop has become almost synonymous with Big Data, yet the number of Hadoop experts available in the wild cannot hope to keep up with demand. Thus, the market for HaaS should rise in step with big data. In fact, according to TechNavio, the HaaS market will top $19 billion by 2016.
Altiscale's service is intended to abstract the complexity of Hadoop. Altiscale's engineers set up, run, and manage Hadoop environments for their customers, allowing customers to focus on their data and applications. When customers' needs change, services are scaled to fit -- one of the core advantages of a cloud-based service.
Customers include MarketShare and Internet Archive.
Competitive Landscape: The HaaS space is heating up. Competition comes from incumbents, such as Amazon Elastic MapReduce (EMR), Microsoft's Hadoop on Azure, and Rackspace's service based on Hortonworks' distribution. Altiscale will also compete directly with Hortonworks and with such startups as Cloudera, Mortar Data, Qubole, and Xplenty.
Key Differentiator: Altiscale argues that they are "the only firm to actually provide a soup-to-nuts Hadoop deployment. By comparison, AWS forces companies to acquire, install, deploy, and manage a Hadoop implementation -- something that takes a lot of time."
Trifacta
What They Do: Provide a platform that enables users to transform raw, complex data into clean and structured formats for analysis.
Headquarters: San Francisco, Calif.
CEO: Joe Hellerstein, who in addition to serving as Trifacta's CEO is also a professor of Computer Science at Berkeley. In 2010, Fortune included him in its list of the 50 smartest people in technology, and MIT Technology Review included his Bloom language for cloud computing on its TR10 list of the 10 technologies "most likely to change our world."
Funding: Trifacta is backed by $16.3 million in funding raised in two rounds from Accel Partners, XSeed Capital, Data Collective, Greylock Partners, and individual investors.
Why They're on This List: According to Trifacta, there is a bottleneck in the data chain between the technology platforms for Big Data and the tools used to analyze data. Business analysts, data scientists, and IT programmers spend an inordinate amount of time transforming data. Data scientists, for example, spend as much as 60 to 80 percent of their time transforming data. At the same time, business data analysts don't have the technical ability to work with new data sets on their own.
To solve this problem, Trifacta uses "Predictive Interaction" technology to elevate data manipulation into a visual experience, allowing users to quickly and easily identify features of interest or concern. As analysts highlight visual features, Trifacta's predictive algorithms observe both user behavior and properties of the data to anticipate the user's intent and make suggestions without the need for user specification. As a result, the cumbersome task of data transformation becomes a lightweight experience that is far more agile and efficient than traditional approaches. Lockheed Martin and Accretive Health are early customers.
Competitive Landscape: Trifacta will compete with Paxata, Informatica and CirroHow.
Key Differentiator: Trifacta argues that the problem of data transformation requires a radically new interaction model -- one that couples human business insight with machine intelligence. Trifacta's platform combines visual interaction with intelligent inference and "Predictive Interaction" technology to close the gap between people and data.
Splice Machine
What They Do: Provide a Hadoop-based, SQL-compliant database designed for big data applications.
Headquarters: San Francisco, Calif.
CEO: Monte Zweben, who previously worked at the NASA Ames Research Center where he served as the Deputy Branch Chief of the Artificial Intelligence Branch. He later founded and served as CEO of Blue Martini Software.
Funding: Splice Machine is backed by $19 million in funding from InterWest Partners and Mohr Davidow Ventures.
Why They're on This List: Application and Web developers have been moving away from traditional relational databases due to rapidly growing data volumes and evolving data types. New solutions are needed to solve scaling and schema issues. Splice Machine argues that even a few short months ago Hadoop, while viewed as a great place to store massive amounts of data, wasn't ready to power applications.
Now, with emerging database solutions, features that made RDBMS so popular for so long, such as ACID compliance, transactional integrity, and standard SQL, are available on top of the cost-effective and scalable Hadoop platform. Splice Machine believes that this enables developers to get the best of both worlds in one general-purpose database platform.
Splice Machine provides all the benefits of NoSQL databases, such as auto-sharding, scalability, fault tolerance, and high availability, while retaining SQL, which is still the industry standard. Splice Machine optimizes complex queries to power real-time OLTP and OLAP applications at scale without rewriting existing SQL-based apps and BI tool integrations. By leveraging distributed computing, Splice Machine can scale from terabytes to petabytes by simply adding more commodity servers. Splice Machine is able to provide this scalability without sacrificing the SQL functionality or the ACID compliance that are cornerstones of an RDBMS.
Competitive Landscape: Competitors include Cloudera, MemSQL, NuoDB, DataStax, and VoltDB.
Key Differentiator: Splice Machine claims to have the only transactional SQL-on-Hadoop database that powers real-time big data applications.
DataTorrent
What They Do: Provide a real-time stream processing platform built on Hadoop.
Headquarters: Santa Clara, Calif.
CEO: Phu Hoang, who was previously a founding member of the engineering team at Yahoo, where he served as executive vice president of engineering.
Funding: The company closed an $8 million Series A round in June 2013. August Capital led the round and was joined by AME Cloud Ventures. The company previously secured $750K in seed funding from Morado Ventures and Farzad Nazem.
Why They're on This List: DataTorrent argues that we'll soon start thinking about latency issues when we think about Big Data solutions. DataTorrent points out that "data is happening now, streaming-in from various sources -- in real-time, all the time." Many organizations struggle to process, analyze, and act on this never-ending and ever-growing stream of information at all.
For some insights, by the time data is stored to disk, analyzed, and responded to, it's already too late. For instance, if a hacker compromises a credit card account and manages to make a few purchases, plenty of damage has already been done, even if that account is cut off within minutes. DataTorrent contends that an organization's ability to recognize and react to events instantaneously isn't just a business advantage. In today's world, it is a necessity.
Unlike traditional batch processing that can take hours, DataTorrent claims to be able to process hundreds of millions of data items per second. This enables organizations to process, monitor, and make decisions based on their data in real time.
Competitive Landscape: DataTorrent's main competitors come from IBM (InfoSphere Streams) and the Storm Open Source Project.
Key Differentiator: DataTorrent points to performance as a key differentiator, claiming their platform is 100-1,000 times faster than Storm.
Qubole
What They Do: Offer Big Data-as-a-Service with a "true auto-scaling Hadoop cluster."
Headquarters: Mountain View, Calif.
CEO: Ashish Thusoo, who ran Facebook's data infrastructure team before co-founding Qubole. He also co-founded Apache Hive.
Funding: The company is backed by $7 million in Series A funding from Lightspeed Ventures and Charles River Ventures.
Why They're on This List: Since Hadoop is a relatively new technology, finding someone with the expertise necessary to run and maintain it can be a tall order. By providing a managed solution, Qubole hopes to make Hadoop an easy-to-use technology.
Qubole handles the initial setup and then maintains the clusters. Qubole's auto-scaling feature automatically spins up users' clusters when a job is started and automatically expands or contracts them based on workload, cutting back on costs and management requirements.
An intuitive UI expands the reach of this service beyond data analysts to entire lines of business. Qubole contends that some customers have more than 60 percent of their employees using Qubole.
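As a rough illustration of what workload-based auto-scaling means in practice, the sketch below sizes a cluster from the number of pending tasks. It is an assumption-laden toy, not Qubole's proprietary algorithm: the function name `target_nodes`, the `tasks_per_node` capacity figure, and the node limits are all hypothetical.

```python
import math

def target_nodes(pending_tasks, tasks_per_node=10, min_nodes=0, max_nodes=100):
    """Return the node count a simple workload-based policy would request.

    All parameters are illustrative assumptions:
    - pending_tasks: tasks currently queued or running
    - tasks_per_node: how many tasks one node is assumed to handle
    - min_nodes / max_nodes: floor and ceiling on cluster size
    """
    if pending_tasks <= 0:
        # No work: shrink to the floor (zero means the cluster is shut down).
        return min_nodes
    needed = math.ceil(pending_tasks / tasks_per_node)
    # Clamp the request between the configured floor and ceiling.
    return max(min_nodes, min(needed, max_nodes))

idle = target_nodes(0)        # no jobs queued
small = target_nodes(25)      # a modest job arrives
burst = target_nodes(5000)    # a burst that exceeds the ceiling
```

With these assumed defaults, an idle cluster shrinks to zero nodes, a 25-task job requests three nodes, and a very large burst is capped at the 100-node ceiling, which is the cost-control behavior the paragraph above describes.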
Customers include Pinterest, MediaMath, Nextdoor and Saavn.
Competitive Landscape: Qubole will compete with Altiscale, Amazon EMR, Treasure Data, and others.
Key Differentiator: Qubole points to its proprietary technology that provides true auto-scaling and storage optimization.
Continuuity
What They Do: Provide a Hadoop-based big data application hosting platform.
Headquarters: Palo Alto, Calif.
CEO: Jonathan Gray, who was previously an HBase software engineer at Facebook.
Funding: $12.5 million from Battery Ventures, Ignition Partners, Andreessen Horowitz, Data Collective and Amplify Partners.
Why They're on This List: Continuuity has come up with a clever way to get around the dearth of Hadoop experts: they offer an application developer platform targeted at Java developers. The lower-level infrastructure is all abstracted away by the Continuuity platform.
The company's flagship product, Reactor, is a Java-based integrated data and application framework that layers on top of Apache Hadoop, HBase, and other Hadoop ecosystem components. It surfaces capabilities of the infrastructure through simple Java and REST APIs, shielding end users from unnecessary complexity.
In late March, Continuuity released its latest service, Loom, a cluster management solution. Clusters created with Continuuity Loom utilize templates of any hardware and software stack, from simple standalone LAMP-stack servers and traditional application servers like JBoss to full Apache Hadoop clusters comprised of thousands of nodes. Clusters can be deployed across many cloud providers (Rackspace, Joyent, OpenStack) while utilizing common SCM tools (Chef and scripts).
One thing to keep an eye on is the CEO situation. Founding CEO Todd Papaioannou, who was previously vice president and chief cloud architect at Yahoo, left the company this past summer. Co-founder and previous CTO Jonathan Gray has taken over the CEO role. This is Gray's first role as a business leader.
Competitive Landscape: As of now, Continuuity is uniquely positioned. Indirect competitors come from the HaaS camp (AWS EMR, Altiscale, Infochimps, Mortar Data, etc.).
Key Differentiator: Continuuity is targeted at Java developers, which is a unique approach.
Xplenty
What They Do: Provide HaaS.
Headquarters: Tel Aviv, Israel
CEO: Yaniv Mor, who previously managed the NSW SQL Services practice at Red Rock Consulting.
Funding: An undisclosed amount of seed funding from Magma Venture Capital.
Why They're on This List: Although Hadoop is being hyped like crazy these days, it has nonetheless become the de facto infrastructure technology for big data. The trouble is that the development, implementation, and maintenance of Hadoop require a very specialized skill set.
Xplenty technology provides Hadoop processing on the cloud via a coding-free design environment, so businesses can quickly and easily benefit from the opportunities offered by Big Data without having to invest in hardware, software, or highly specialized personnel.
A drag-and-drop interface eliminates the need to write complex scripts or code of any kind. With its automatic server configuration feature, users can simply point to a data source, configure the data transformation tasks, and tell the platform where to write the results to. Xplenty's platform uses SQL terminology. Thus, for data analysts, the learning curve should be minimal.
Customers include DealPly Technologies, Fiverr, Iron Source, and WalkMe.
Competitive Landscape: The main competition comes from Amazon's EMR. Other HaaS competitors include Altiscale, Mortar Data, Qubole, and recently Microsoft with Hadoop on Azure. Rackspace is about to launch its own HaaS offering based on Hortonworks' distribution.
Key Differentiator: According to Xplenty, competing services still target developers, whereas Xplenty targets the data and Business Intelligence (BI) users who do not know how to write code, but who need to move data to a big data platform.
Nuevora
What They Do: Provide Big Data analytics applications.
Headquarters: San Ramon, Calif.
CEO: Phani Nagarjuna, who most recently served as executive vice president of products and business development for OneCommand, which provides a SaaS-based CRM and Loyalty Automation Platform for the auto retail industry.
Funding: $3 million in early funding from Fortisure Ventures.
Why They're on This List: Nuevora has set its sights on one of big data's early growth areas: marketing and customer engagement. Nuevora's nBAAP (Big Data Analytics & Apps) Platform features purpose-built analytics apps based on best-practices-driven predictive algorithms. nBAAP is based on three key big data technologies: Hadoop (data processing), R (predictive analytics), and Tableau (visualizations).
On top of all of this, Nuevora's algorithms work on disparate sources of data (transactional, social media, mobile, campaigns) to quickly identify patterns and predictors in order to tie specific goals to individual marketing tactics.
The platform includes pre-built apps for the customer marketing business process -- acquisition, retention, up-sell, cross-sell, profitability, and customer lifetime value (LTV). With only "last-mile" configurations required for individual customer situations, Nuevora's apps empower organizations to anticipate their customers' behaviors.
Competitive Landscape: When Nuevora assesses the competitive landscape, it zeroes in on big consulting firms, such as Accenture, and other predictive analytics companies, such as Alpine Data Labs.
However, since pretty much every marketing platform under the sun now includes some sort of analytics engine, I also expect them to compete with the major marketing automation providers, such as ExactTarget (which uses Pentaho for its big data analytics).
Key Differentiator: Nuevora gives end users the ability to continually recalibrate their predictions using a "closed-loop recalibration engine," which helps organizations keep up with only the most pertinent insights based on the latest data.
Jeff Vance is a freelance writer based in Santa Monica, Calif. Connect with him on Twitter @JWVance or by email at email@example.com.