CIO

Native Data Analysis Comes to MongoDB

Seeking to make it easier for you to apply analytics to your big data stores, Pentaho today announced the general availability of the latest version of its business analytics and data integration platform.

The Pentaho 5.1 release is intended to bridge the "data-to-analytics divide" for the whole spectrum of Pentaho users, from developers to data scientists to business analysts. Pentaho 5.1 adds the capability to run code-free analytics directly on MongoDB data stores, incorporates a new data science pack that acts as a data science "personal assistant," and adds full support for the Apache Hadoop 2.0 YARN architecture for resource management.

"The new capabilities in Pentaho 5.1 support our ongoing strategy to make the hardest aspects of big data analytics faster, easier and more accessible to all," says Christopher Dziekan, executive vice president and chief product officer at Pentaho. "With the launch of 5.1, Pentaho continues to power big analytics at scale, responding not only to the demands of the big data-driven enterprise but also provides companies big and small a more level playing field so emerging companies without large, specialist development teams can also enter the big data arena."

Data Integration Platform Enables Native Analysis of MongoDB Data

Previous versions of the Pentaho platform could integrate with MongoDB as a data source and report on MongoDB data. Pentaho 5.1 goes a step further, enabling native analysis of data in MongoDB without an ETL process and without hand coding. MongoDB collections can be analyzed directly at the source, reducing time-to-insight as well as the need for specialist skills.
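Analyzing a collection "at the source" generally means expressing the query as a MongoDB aggregation pipeline that runs inside the database, rather than extracting rows into a separate store first. As a minimal sketch of the idea (the collection and field names here are hypothetical, not taken from Pentaho's implementation):

```python
# Hedged sketch: the kind of aggregation a BI tool can push down to
# MongoDB instead of extracting the data through an ETL step first.
# The collection and field names ("claims", "provider", "amount") are
# hypothetical. With pymongo, the pipeline would be executed via
# db.claims.aggregate(pipeline).

pipeline = [
    # Filter at the source -- only matching documents are processed.
    {"$match": {"year": 2014}},
    # Group ("slice") by provider, aggregating counts and totals.
    {"$group": {
        "_id": "$provider",
        "claim_count": {"$sum": 1},
        "total_billed": {"$sum": "$amount"},
    }},
    # Rank providers by total billed amount, descending.
    {"$sort": {"total_billed": -1}},
]
```

Because the `$match` and `$group` stages execute inside MongoDB, only the aggregated result crosses the wire, which is what removes the need for a separate extract-and-transform pass.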

Dziekan points to healthcare cost solutions provider MultiPlan, which has nearly 900,000 healthcare providers under contract and processes more than 40 million claims every year. Dziekan says MultiPlan takes the JSON source files from its portal and stores them in MongoDB, then uses the Pentaho Analyzer plugin, a drag-and-drop OLAP viewer, on top of MongoDB to slice and dice the data, creating dashboards and reports.

"Traditional RDBMS analytics can get very complicated and, quite frankly, ugly, when working with semi or unstructured data," says Chris Palm, lead software architecture engineer at MultiPlan. "The Pentaho 5.1 platform is meeting market needs, allowing users to directly analyze data in MongoDB. We have seen more accurate results with new analyses and are no longer constrained by having to pull only part of our data. We can now look across a more full set of data and govern our system of record to gain greater insights."

Data Scientists Get Personal Assistant

Pentaho has also added a new Data Science Pack to Pentaho 5.1 with an eye to making it simpler for data analysts and data scientists to rapidly build a 360-degree customer view by blending data sources such as social media and MongoDB. The pack adds an R script executor for Pentaho Data Integration (PDI) that allows an R script to be run as a step in a PDI transformation, easing the burden of data preparation. It also adds a Weka scoring tool that lets users apply classification, clustering and regression models built in Weka, and Weka forecasting to help users leverage forecasting models created in Weka's time series analysis and forecasting environment.

"The data scientist just got a personal assistant," Dziekan says. "This Data Science Pack features tools data scientists are familiar with already and we're now operationalizing them."

The Pentaho 5.1 platform also adds full YARN integration, making it much simpler for developers working with Pentaho Data Integration to exploit the computational power of Hadoop without having to write complex MapReduce code. Dziekan says the YARN support allows PDI jobs to make elastic use of Hadoop resources, expanding and contracting as data volumes and processing requirements change. He notes that YARN's advanced resource management capabilities support mixed workload scenarios where continuous data transformation and analysis are required.
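For context, the resource management YARN brings to Hadoop 2.0 is governed by cluster settings in yarn-site.xml, such as the two shown below. These are standard YARN properties, but the values are purely illustrative, not Pentaho recommendations:

```xml
<!-- yarn-site.xml: illustrative resource-management settings -->
<configuration>
  <property>
    <!-- Memory each NodeManager offers to containers on its node -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <property>
    <!-- Upper bound on any single container's memory request -->
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>
```

It is allocation limits like these that let a scheduler grow and shrink a job's container footprint as workloads change, which is the elasticity described above.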