Making deep learning models robust for object recognition

Data used to train object recognition models often does not reflect the real world. Professor Wolfram Burgard, head of the Autonomous Intelligent Systems lab at the University of Freiburg, discusses a way around this.

Robots have a wide range of applications, from assisting humans in the factory, the home, the office, the field and beyond. But if we are to rely on them for assistive tasks, their perception algorithms need to be robust.

This is what Professor Wolfram Burgard, head of the Autonomous Intelligent Systems research lab at the University of Freiburg, discussed at the 28th Australasian Joint Conference on Artificial Intelligence in Canberra this week.

Burgard focused on object recognition, one of the fundamental capabilities robots need across many applications.

“If we want to build robots that can actually act in the real world, then we need to have robust object detection,” he pointed out.

“Robots need to be able to learn what we mean when we say 'tomato juice'. When a robot looks at a scene, we would like it to be able to label the individual objects and assign semantics to them: words that describe what type of object each one is.”

The problem with many deep learning models for object recognition is that the data used to train them often does not reflect the real world. Datasets of RGB images, for example, are typically recorded in controlled settings where a camera captures objects on a turntable, under perfect lighting conditions and with clear visibility.

“That is not the case when you are in the real world, where you have poor lighting conditions, shadowing, and so on,” Burgard said.

Robots also need to perform well in cluttered surroundings and dynamically changing scenes, he said. The resulting noise particularly affects depth sensors, and with them the ability to recognise objects.

By injecting noise, in the form of missing-value patterns, into the training depth data, deep learning models in robots can be made better at dealing with the visually messy real world, he said.

Burgard and his team at the University of Freiburg recently looked at fusing RGB and depth data to build more robust deep learning classification models. RGB data captures the appearance and texture of an object, while depth data captures its shape and is invariant to lighting.

The team built a deep neural network that fuses two convolutional neural network streams, one for colour and one for depth information.

“We have two channels – one for the RGB data, one for the depth data – the networks are more or less identical, and then at the end we have like a fusion layer that combines the two outputs and generates the joined [network],” Burgard explained.
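As a rough illustration of that two-stream design, here is a minimal sketch. The framework (PyTorch), the layer sizes and the names make_stream and FusionNet are all assumptions for illustration; the talk does not specify the team's exact architecture.

```python
import torch
import torch.nn as nn

def make_stream() -> nn.Sequential:
    # One convolutional stream; the RGB and depth streams share this layout,
    # which is why the depth images are first colourised to three channels.
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4),
        nn.Flatten(),
    )

class FusionNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.rgb_stream = make_stream()    # processes colour images
        self.depth_stream = make_stream()  # processes colourised depth images
        # Fusion layer: concatenate the two stream outputs, then classify.
        self.fusion = nn.Sequential(
            nn.Linear(2 * 64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.fusion(joint)
```

A forward pass takes a colour image and a colourised depth image of the same size, and the fusion layer produces class scores from the combined features.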

They then encoded the depth images by normalising the depth values to lie between 0 and 255, and applied a jet colourmap to the result to “colourise the depth”.

“So the idea is to convert the depth domain in a way we can use the same [convolutional neural] network that has been successfully applied to the RGB domain in order to also deal with depth data,” he said.
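That encoding step might look something like the sketch below, assuming NumPy and OpenCV. The function name colourise_depth and the guard against a constant-depth image are illustrative additions, and OpenCV's COLORMAP_JET stands in for the jet colourmap.

```python
import cv2
import numpy as np

def colourise_depth(depth: np.ndarray) -> np.ndarray:
    """Scale a raw depth image to 0-255 and apply a jet colourmap."""
    d_min, d_max = float(depth.min()), float(depth.max())
    # Normalise depth values to the 0-255 range (guard against a flat image).
    scaled = np.uint8(255 * (depth - d_min) / max(d_max - d_min, 1e-6))
    # The jet colourmap turns the single channel into a 3-channel image
    # (returned in BGR order), so the RGB network architecture can consume it.
    return cv2.applyColorMap(scaled, cv2.COLORMAP_JET)
```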

The next step is to add noise to the training data so that it contains the missing-data patterns found in real-world scenes.

“We see a Coke can here,” he said, pointing to an example. “This is basically the real image that you get, where the real image has substantially more noise. The idea here is to basically add noise to the training data, so augment the training data with additional noise.

“So we created 50,000 noise samples. For every training batch, we sample from these noise samples and we also shuffle them and generate noisy input for the [deep neural] network.

“This makes the network in the end more robust to the specific noise that you see in the particular image there.”
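A minimal sketch of that sampling scheme, assuming NumPy: the mask resolution is a toy value, and uniform random dropout is a stand-in for the actual missing-value patterns the team generated, which the talk does not specify beyond saying they mimic real sensor noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pool of pre-generated noise patterns: binary masks where False marks a
# pixel whose depth reading is "missing". Uniform random dropout is a
# placeholder for the team's real sensor-derived noise patterns.
NUM_NOISE_SAMPLES = 50_000
HEIGHT = WIDTH = 32  # toy resolution, to keep the pool small
noise_pool = rng.random((NUM_NOISE_SAMPLES, HEIGHT, WIDTH)) > 0.1

def augment_batch(depth_batch: np.ndarray) -> np.ndarray:
    """Corrupt a batch of depth images with randomly drawn noise masks."""
    idx = rng.choice(NUM_NOISE_SAMPLES, size=len(depth_batch), replace=False)
    rng.shuffle(idx)  # shuffle the drawn patterns, as described in the talk
    return depth_batch * noise_pool[idx]  # zero out the "missing" pixels
```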

One issue, however, is that even with these robustness techniques, the lack of labelled image data remains a challenge for using deep learning in object recognition, Burgard said. Deep learning models need to be fed huge amounts of data in order to work, and getting access to that data can be extremely difficult, he said.

“When you put objects on a turntable and rotate them in front of a camera, you have to manually assign a label to them. You can hire students to do so, but students are expensive, they get tired over time, and they also want to do something else.

“So it’s really hard to get labelled data, especially the large amount of labelled data that we would need. And in the RGB-D domain there are very few datasets, and those that exist are relatively small,” he said.
