{"id":2683,"date":"2019-10-11T19:40:01","date_gmt":"2019-10-11T19:40:01","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/11\/faster-video-recognition-for-the-smartphone-era\/"},"modified":"2019-10-11T19:40:01","modified_gmt":"2019-10-11T19:40:01","slug":"faster-video-recognition-for-the-smartphone-era","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/11\/faster-video-recognition-for-the-smartphone-era\/","title":{"rendered":"Faster video recognition for the smartphone era"},"content":{"rendered":"<p>Author: Kim Martineau | MIT Quest for Intelligence<\/p>\n<div>\n<p>A branch of machine learning called deep learning has helped computers surpass humans at well-defined visual tasks like reading medical scans, but as the technology expands into interpreting videos and real-world events, the models are getting larger and more computationally intensive.\u00a0<\/p>\n<p>By\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1910.00932.pdf\">one estimate<\/a>, training a video-recognition model can take up to 50 times more data and eight times more processing power than training an image-classification model. That\u2019s a problem as demand for processing power to train deep learning models continues to\u00a0<a href=\"https:\/\/openai.com\/blog\/ai-and-compute\/\">rise exponentially<\/a>\u00a0and <a href=\"https:\/\/www.technologyreview.com\/s\/613630\/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes\/\">concerns<\/a>\u00a0about AI\u2019s massive carbon footprint grow. 
Running large video-recognition models on low-power mobile devices, where many AI applications are heading, also remains a challenge.\u00a0<\/p>\n<p><a href=\"https:\/\/songhan.mit.edu\/\">Song Han<\/a>, an assistant professor at MIT\u2019s\u00a0<a href=\"https:\/\/www.eecs.mit.edu\/\">Department of Electrical Engineering and Computer Science<\/a> (EECS), is tackling the problem by designing more efficient deep learning models. In a <a href=\"https:\/\/arxiv.org\/pdf\/1811.08383.pdf\">paper<\/a>\u00a0at the\u00a0<a href=\"http:\/\/iccv2019.thecvf.com\/\">International Conference on Computer Vision<\/a>, Han, MIT graduate student\u00a0<a href=\"http:\/\/linji.me\/\">Ji Lin<\/a>,\u00a0and\u00a0<a href=\"https:\/\/mitibmwatsonailab.mit.edu\/\">MIT-IBM Watson AI Lab<\/a>\u00a0researcher\u00a0<a href=\"https:\/\/scholar.google.com\/citations?user=PTeSCbIAAAAJ&#038;hl=en\">Chuang Gan<\/a> outline a method for shrinking video-recognition models to speed up training and improve runtime performance on smartphones and other mobile devices. Their method shrinks a state-of-the-art model to one-sixth its size, reducing its 150 million parameters to 25 million.\u00a0<\/p>\n<div class=\"cms-placeholder-content-video\"><\/div>\n<p>\u201cOur goal is to make AI accessible to anyone with a low-power device,\u201d says Han. \u201cTo do that, we need to design efficient AI models that use less energy and can run smoothly on edge devices, where so much of AI is moving.\u201d\u00a0<\/p>\n<p>The falling cost of cameras and video-editing software and the rise of new video-streaming platforms have flooded the internet with new content. Each hour, <a href=\"http:\/\/www.statista.com\/statistics\/259477\/hours-of-video-uploaded-to-youtube-every-minute\/\" target=\"_blank\" rel=\"noopener noreferrer\">30,000 hours<\/a> of new video are\u00a0uploaded\u00a0to YouTube alone. 
Tools to catalog that content more efficiently would help viewers and advertisers locate videos faster, the researchers say. Such tools would also help institutions like hospitals and nursing homes run AI applications locally, rather than in the cloud, to keep sensitive data private and secure.\u00a0<\/p>\n<p>Underlying image and video-recognition models are neural networks, which are loosely modeled on how the brain processes information. Whether it\u2019s a digital photo or a sequence of video images, neural nets look for patterns in the pixels and build an increasingly abstract representation of what they see. With enough examples, neural nets \u201clearn\u201d to recognize people, objects, and how they relate.\u00a0<\/p>\n<p>Top video-recognition models currently use three-dimensional convolutions to encode the passage of time in a sequence of images, which creates bigger, more computationally intensive models. To reduce the calculations involved, Han and his colleagues designed an operation they call a\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1811.08383.pdf\">temporal shift module<\/a>,\u00a0which shifts the feature maps of a selected video frame to its neighboring frames. By mingling spatial representations of the past, present, and future, the model gets a sense of time passing without explicitly representing it.<\/p>\n<p>The result: a model that outperformed its peers at recognizing actions in the\u00a0<a href=\"https:\/\/20bn.com\/datasets\/something-something\">Something-Something<\/a>\u00a0video dataset, earning first place in <a href=\"https:\/\/20bn.com\/datasets\/something-something\/v1\">version 1<\/a> and <a href=\"https:\/\/20bn.com\/datasets\/something-something\/v2\">version 2<\/a> in recent public rankings. An online version of the shift module is also nimble enough to read movements in real time. 
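The shift operation described above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: it moves one slice of channels one frame forward in time and another slice one frame backward, zero-filling at the clip boundaries, so temporal mixing costs no extra parameters and no multiply-adds (the `fold_div` fraction is a choice borrowed from the paper's setup).

```python
def temporal_shift(clip, fold_div=8):
    """Shift channel slices of a clip across neighboring frames.

    clip: list of T frames, each a list of C channel activations.
    Returns a new clip where the first C//fold_div channels are pulled
    from the next frame, the second C//fold_div from the previous frame,
    and the rest are left in place. Boundary frames receive zeros.
    """
    T = len(clip)
    C = len(clip[0])
    fold = C // fold_div
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:            # shift backward in time: take from the next frame
                out[t][c] = clip[t + 1][c] if t + 1 < T else 0.0
            elif c < 2 * fold:      # shift forward in time: take from the previous frame
                out[t][c] = clip[t - 1][c] if t - 1 >= 0 else 0.0
            else:                   # leave the remaining channels untouched
                out[t][c] = clip[t][c]
    return out

# Example: 3 frames of 8 channels, so one channel shifts each way.
clip = [[float(t * 10 + c) for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip)
```

Sandwiched between ordinary 2-D convolutions, this shift lets each frame's convolution see a sliver of its neighbors' features, which is how the model gains a sense of time without 3-D convolutions. The online variant mentioned above can only shift features from past frames, since future frames have not arrived yet.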
In\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=0T6u7S_gq-4\">a recent demo<\/a>, Lin, a PhD student in EECS, showed how a single-board computer rigged to a video camera could instantly classify hand gestures with no more energy than it takes to power a bike light.\u00a0<\/p>\n<p>Normally it would take about two days to train such a powerful model on a machine with just one graphics processor. But the researchers managed to borrow time on the U.S. Department of Energy\u2019s\u00a0<a href=\"https:\/\/www.olcf.ornl.gov\/summit\/\">Summit<\/a>\u00a0supercomputer, currently ranked the fastest on Earth. With Summit\u2019s extra firepower, the researchers showed that with 1,536 graphics processors the model could be trained in just 14 minutes, near its theoretical limit. That\u2019s up to three times faster than state-of-the-art 3-D models, they say.<\/p>\n<p>Dario Gil, director of IBM Research, highlighted the work in his recent\u00a0<a href=\"https:\/\/youtu.be\/2RBbw6uG94w\">opening remarks<\/a>\u00a0at\u00a0<a href=\"https:\/\/www.research.ibm.com\/artificial-intelligence\/ai-research-week\/\">AI Research Week<\/a>\u00a0hosted by the MIT-IBM Watson AI Lab.<\/p>\n<p>\u201cCompute requirements for large AI training jobs is doubling every 3.5 months,\u201d he said later. 
\u201cOur ability to continue pushing the limits of the technology will depend on strategies like this that match hyper-efficient algorithms with powerful machines.\u201d\u00a0<\/p>\n<\/div>\n<p><a href=\"http:\/\/news.mit.edu\/2019\/faster-video-recognition-smartphone-era-1011\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Kim Martineau | MIT Quest for Intelligence A branch of machine learning called deep learning has helped computers surpass humans at well-defined visual tasks [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/10\/11\/faster-video-recognition-for-the-smartphone-era\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":461,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2683"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2683"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2683\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/464"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2683
"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2683"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2683"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}