Predicting people’s daily activities with deep learning

Researchers train a convolutional neural network with a ‘late fusion ensemble’ on 40,103 photos of participants’ daily activities over six months

A new study has found that applying deep learning to photos taken continuously by a wearable camera can predict human behaviours, and may even help in the fight against chronic diseases.

A team of researchers from Georgia Institute of Technology trained a convolutional neural network combined with a random decision forest in a ‘late fusion ensemble’ (a classification method) on 40,103 photos of people’s daily activities over a six-month period to predict human behaviours.

They published their results in a new paper, Predicting Daily Activities From Egocentric Images Using Deep Learning.

The researchers say they expect this to have applications in the healthcare industry as daily activities and lifestyle behaviours have strong links to the development of chronic diseases.

Data was labelled into 19 categories such as chores, driving, cooking, exercising, reading, eating and working. Participants used an annotation tool that allowed them to easily and rapidly label the photos automatically taken throughout the day, as well as remove any that were privacy-sensitive. Participants removed, on average, 20 per cent of the data per day.

The wearable camera was a smartphone worn in portrait orientation on the participant's chest, supported by a neck identity-card holder. It automatically took photos throughout the day.

As the dataset collected from participants was relatively small for deep learning, the researchers started from a model pre-trained on the ImageNet image database and fine-tuned it on their own photos. They used Caffe, a deep learning framework developed by the Berkeley Vision and Learning Center, which they said “has achieved good results in the past and has a large open-source community”.
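The paper does not publish its training code, but the fine-tuning step described follows a standard Caffe workflow. Below is a minimal sketch of that workflow in Caffe's Python interface; the solver and weights file names are hypothetical placeholders, not the authors' actual files.

    import caffe

    caffe.set_mode_gpu()

    # Hypothetical solver definition: a low base learning rate is the usual
    # choice when fine-tuning, so the pretrained weights are adjusted rather
    # than erased.
    solver = caffe.SGDSolver('solver_egocentric.prototxt')

    # Initialise from ImageNet-pretrained weights. Layers whose names match
    # the pretrained network are copied; a replaced 19-way classification
    # layer would start from random initialisation.
    solver.net.copy_from('bvlc_reference_caffenet.caffemodel')

    # Run fine-tuning on the egocentric photo dataset defined in the prototxt.
    solver.step(10000)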

To prevent overfitting - where a model fits its training data so closely that it performs badly on new data - and to incorporate contextual metadata and global image features, the researchers added a late fusion ensemble technique.
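As a rough illustration of the idea, the sketch below fuses a CNN's per-class probabilities with encoded datetime features and lets a random forest make the final decision. The feature encoding and forest settings are illustrative assumptions, not the paper's exact pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def time_features(hours, weekdays):
        # Encode datetime cyclically so 23:00 and 01:00 end up close together.
        return np.column_stack([
            np.sin(2 * np.pi * hours / 24), np.cos(2 * np.pi * hours / 24),
            np.sin(2 * np.pi * weekdays / 7), np.cos(2 * np.pi * weekdays / 7),
        ])

    def fit_late_fusion(cnn_probs, hours, weekdays, labels):
        # Late fusion: concatenate the CNN's 19 softmax probabilities per
        # image with the contextual features, then train a random forest
        # on top to make the final activity prediction.
        fused = np.hstack([cnn_probs, time_features(hours, weekdays)])
        forest = RandomForestClassifier(n_estimators=200, random_state=0)
        return forest.fit(fused, labels)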

“Using this we outperform the classic ensemble and the normal CNN [convolutional neural net] model by approximately 5 per cent,” they said in the paper.

The model outperformed a traditional convolutional neural net and other methods used for similar prediction tasks, such as k-nearest neighbours, reaching an overall accuracy of 83.07 per cent and an average class accuracy of 65.87 per cent.

“Given the egocentric image and the contextual datetime information, our method achieves an overall accuracy of 83.07 per cent at determining which one of these 19 activities the user is performing at any moment.”
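The two figures measure different things: overall accuracy counts every photo equally, while average class accuracy weights each of the 19 activities equally, so a rare activity counts as much as a frequent one. A minimal sketch, assuming these standard definitions:

    import numpy as np

    def overall_accuracy(y_true, y_pred):
        # Fraction of all photos classified correctly.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.mean(y_true == y_pred)

    def average_class_accuracy(y_true, y_pred):
        # Mean per-class recall: each activity contributes equally,
        # regardless of how many photos it has.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        per_class = [np.mean(y_pred[y_true == c] == c)
                     for c in np.unique(y_true)]
        return np.mean(per_class)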

The researchers noted that other research on using cameras to detect people’s activity relies mostly on video and hand-crafted features, whereas their work goes beyond traditional classification and machine learning approaches by combining image pixel data, contextual metadata (time) and global image features.

“Convolutional Neural Networks have recently been used with success on single image classification with a vast number of classes and have been effective at learning hierarchies of features. However, little work has been done on classifying activities on single images from an egocentric device over extended periods of time. This work aims to explore that area,” the researchers said.
