Google improves real-time visual translation app with neural net

Builds small neural net so app can work effectively on smartphones with limited computing power

Google has built a small neural net for its real-time visual translation app so that it works effectively on smartphones, which lack the intensive computing power data centres rely on to carry out image recognition and translation.

The app enables users to point their camera at an object that contains words so they can translate things like menus and signs. The search giant also added 20 languages to its app.

“We want to be able to recognise a letter with a small amount of rotation, but not too much. If we overdo the rotation, the neural network will use too much of its information density on unimportant things. So we put effort into making tools that would give us a fast iteration time and good visualisations,” Otavio Good, software engineer for Google Translate, wrote in a blog post.

“Inside of a few minutes, we can change the algorithms for generating training data, generate it, retrain, and visualise.

“To achieve real-time, we also heavily optimized and hand-tuned the math operations. That meant using the mobile processor’s SIMD instructions and tuning things like matrix multiplies to fit processing into all levels of cache memory.”
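Google has not released that code, but the cache-tiling idea the quote refers to can be sketched in a few lines. The Python below is purely illustrative (a production version would be hand-tuned native code using SIMD intrinsics); the 64-element tile size is an assumption:

```python
import numpy as np

def blocked_matmul(a, b, tile=64):
    """Toy cache-blocked matrix multiply.

    Computing the product one tile x tile block at a time keeps the
    working set small enough to stay in the processor's caches, which
    is the idea behind the tuning the engineers describe.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # NumPy slicing clamps at the edges, so partial
                # tiles on the borders are handled automatically.
                c[i0:i0+tile, j0:j0+tile] += (
                    a[i0:i0+tile, k0:k0+tile] @ b[k0:k0+tile, j0:j0+tile]
                )
    return c
```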

The app filters out background objects, such as people, trees and cars, when reading letters in images. By looking for "blobs of pixels" that share a similar colour and sit close to one another, it recognises them as a continuous line of text to read.
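The exact grouping heuristic has not been published, but a minimal sketch of the idea might look like the following, assuming SciPy's connected-component labelling and a made-up vertical tolerance for chaining blobs into a line:

```python
import numpy as np
from scipy import ndimage

def find_text_lines(mask, y_tol=5):
    """Group letter-like blobs into candidate text lines.

    `mask` is a boolean image where True marks pixels of roughly
    uniform colour (the "blobs of pixels" described above). Blobs
    whose vertical centres are close are treated as one line of
    text; the tolerance is an illustrative guess.
    """
    labels, n = ndimage.label(mask)  # connected blobs
    centers = ndimage.center_of_mass(mask, labels, range(1, n + 1))
    # Walk blobs left to right, chaining those with similar y-centres.
    blobs = sorted(zip(centers, range(1, n + 1)), key=lambda b: b[0][1])
    lines = []
    for (cy, cx), lab in blobs:
        for line in lines:
            if abs(line["y"] - cy) < y_tol:
                line["blobs"].append(lab)
                break
        else:
            lines.append({"y": cy, "blobs": [lab]})
    return lines
```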

A convolutional neural network was trained so the app learns what letters look like across different languages and can differentiate letters from non-letters.
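The article does not describe the network's architecture beyond saying it is small. Purely as a rough illustration, a compact letter classifier in PyTorch might look like this; the layer sizes, input resolution and framework are all assumptions:

```python
import torch.nn as nn

class LetterNet(nn.Module):
    """A deliberately small convolutional classifier, in the spirit of
    the compact network the article describes. Few parameters means
    inference can fit a phone's compute budget.
    """
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # n_classes = letters in the alphabet + one "not a letter"
        # class, mirroring the letter/non-letter distinction above.
        self.classifier = nn.Linear(16 * 8 * 8, n_classes)

    def forward(self, x):  # x: (batch, 1, 32, 32) grayscale crops
        x = self.features(x)
        return self.classifier(x.flatten(1))
```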

A letter generator was also built to add noise, such as smudges and rotation, to the letters and characters used in training, so that the app does not always need clear, well-presented text in order to work.
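Google's generator itself is not public; a simplified sketch of that kind of training-data degradation, with guessed transform ranges, could look like this:

```python
import numpy as np
from scipy import ndimage

def degrade(letter_img, rng=None):
    """Roughly degrade a clean rendered letter, in the spirit of the
    letter generator described above. The transforms and their
    ranges are illustrative guesses, not Google's recipe.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Small rotations only: as the engineers note, overdoing rotation
    # wastes the network's capacity on unimportant variation.
    angle = rng.uniform(-10, 10)
    img = ndimage.rotate(letter_img, angle, reshape=False, mode="nearest")
    # "Smudge" by blurring, then add sensor-style noise.
    img = ndimage.gaussian_filter(img, sigma=rng.uniform(0.5, 1.5))
    img = img + rng.normal(0, 0.05, img.shape)
    return np.clip(img, 0.0, 1.0)
```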

Once the letters are recognised, the app uses dictionary lookups for the different languages, and it can still recognise a word even if one letter is misread as a number. For example, if it reads the 'S' in 'super' as '5', producing '5uper', the lookup can still recover the word 'super' from the remaining letters.
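One simple way to implement that tolerance, shown purely as an illustration, is to retry the lookup after substituting commonly confused characters; the confusion table here is a guess:

```python
# Common OCR confusions: characters the recogniser may swap.
CONFUSIONS = {"5": "s", "0": "o", "1": "l", "8": "b"}

def lookup(word, dictionary):
    """Dictionary lookup that tolerates one misread character, as in
    the '5uper' -> 'super' example above. The single-substitution
    strategy is an illustrative simplification.
    """
    word = word.lower()
    if word in dictionary:
        return word
    # Try substituting each confusable character one at a time.
    for i, ch in enumerate(word):
        if ch in CONFUSIONS:
            candidate = word[:i] + CONFUSIONS[ch] + word[i + 1:]
            if candidate in dictionary:
                return candidate
    return None

print(lookup("5uper", {"super", "sign", "menu"}))  # -> "super"
```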

The translation is then rendered on top of the original words.

“We can do this because we’ve already found and read the letters in the image, so we know exactly where they are. We can look at the colours surrounding the letters and use that to erase the original letters. And then we can draw the translation on top using the original foreground colour.”
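A crude approximation of the erasing step described in the quote: sample the colours in a ring just outside the letter pixels and paint the letters over with their average. This sketch assumes NumPy/SciPy and omits the step of redrawing the translated text in the foreground colour:

```python
import numpy as np
from scipy import ndimage

def erase_letters(img, letter_mask):
    """Erase recognised letters by painting them with the average of
    the surrounding background colour, a rough stand-in for the
    colour-sampling step the engineers describe. `img` is an HxWx3
    float array; `letter_mask` marks letter pixels.
    """
    # Pixels just outside the letters: dilate the mask, subtract it.
    ring = ndimage.binary_dilation(letter_mask, iterations=3) & ~letter_mask
    background = img[ring].mean(axis=0)  # average surrounding colour
    out = img.copy()
    out[letter_mask] = background        # paint over the letters
    return out
```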
