In another project, conducted over the last two years, machine learning is being used to extract data from PDF financial reports and documents.
“Sometimes they're in a structured format like XML or XBRL but often they're in PDF and they have a huge amount of data. To extract the data, in the past we've had armies of data analysts typing stuff in looking into these reports. It’s expensive, it's slow and we often don't have the recall that we want,” Mann says. “So we've been mapping a fairly involved research effort to extract data from those documents.”
As a next step to that work, the company is now researching how a machine can identify graphs and scatterplots to extract the numbers.
"It firstly looks at the scatterplot, then it identifies the axes of the scatterplot and the ticks on the scatterplot and then it registers each data point so that it can recover all of the data that was used to make up that scatterplot,” Mann says. “All of this is an effort to give structure to all of the unstructured data.”
An open source ethic has underpinned many of Bloomberg’s recent builds. Mann says there has been a sea change in the company’s attitude towards open source.
“When the company started in 1981 there really wasn't a whole lot of open source. And so there was a mentality of if it's not invented here we're not interested,” Mann says.
Indeed, Bloomberg once built networking gear for its clients, and had its own networking protocol. The company even produced its own keyboards before they became standardised.
“There's always this thread of ‘well we'll build it on our own when it's not widely available and then when it becomes a commodity then we'll adopt’. And I think the same thing was true for open source,” Mann says.
The organisation took some convincing, but, championed by the CTO, there has been a “huge culture change” towards open source.
“There are two groups you got to convince: you’ve got to convince management that using open source is going to be safe and lead to better software, and then you also have to convince engineers that using open source is going to increase their skillset, will lead to software that’s easier to maintain and is less buggy and it's going to be a more beautiful system. Once you can kind of convince those two then you're set,” Mann says.
The company is an active contributor to projects including Solr, Hadoop, Apache Spark and OpenStack.
“I don't think you can be a leading edge technology company these days without being heavily invested in open source and without being – especially in the machine learning space – heavily invested in the academic community and in publishing.”
Although there is certainly a lot of buzz around machine learning, Mann believes it is well founded.
“It’s funny. There’s certainly a lot of hype around machine learning and data science right now. As a cautious person my inclination would be to be cynical, but my cynicism is tempered by the fact that when I look at what we know in the state of the art in academia, and what is in practice – there’s a phenomenally huge gap,” he says.
“I feel like if we didn’t learn anything new it would take five, 10 years for all of that learning to get integrated. This actually makes me very optimistic about the prognosis for machine learning, that it will really have a huge effect.”