{"id":2610,"date":"2019-09-24T06:34:50","date_gmt":"2019-09-24T06:34:50","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2019\/09\/24\/we-see-in-3d-so-should-our-cnn-models\/"},"modified":"2019-09-24T06:34:50","modified_gmt":"2019-09-24T06:34:50","slug":"we-see-in-3d-so-should-our-cnn-models","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2019\/09\/24\/we-see-in-3d-so-should-our-cnn-models\/","title":{"rendered":"We See in 3D \u2013 So Should Our CNN Models"},"content":{"rendered":"<p>Author: William Vorhies<\/p>\n<div>\n<p><strong><em>Summary:<\/em><\/strong> <em>\u00a0Autonomous vehicles (AVs) and many other systems that need to accurately perceive the world around them will be much better off when image classification moves from 2D to 3D.\u00a0 Here we examine the two leading approaches to 3D classification, Point Clouds and Voxel Grids.<\/em><\/p>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3608908328?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3608908328?profile=RESIZE_710x\" width=\"350\" class=\"align-right\"><\/a>One of the well-known problems in CNN image classification is that because the classifier sees only a 2D image of an object, it won\u2019t recognize that same object when it is rotated.\u00a0 The solution thus far has been to train on many different orthogonal views of the same object, which vastly expands the training data required and the training time.<\/p>\n<p>In the AV world, simultaneous localization and mapping (SLAM) is the technical term for how the vehicle maintains awareness of its surroundings, both static (traffic lights) and moving (other cars and pedestrians).\u00a0 If the car could perceive its surroundings in 3D space instead of as a series of 2D snapshots, performance would be significantly better.<\/p>\n<p>\u00a0<\/p>\n<p><span 
style=\"font-size: 12pt;\"><strong>Sensors Already See in 3D<\/strong><\/span><\/p>\n<p>The two primary sensors on AVs for this dynamic awareness are LIDARs and RGB-D cameras.\u00a0 RGB-D cameras, also known as depth cameras, capture not only the 2D RGB data but also depth based on \u2018time of flight\u2019, the time it takes light to travel to the object and back to the sensor.\u00a0 Not too many years ago this was enormously expensive and complex, but sensor technology has made it reasonable to put these on lots of devices.\u00a0 Think Microsoft Kinect from 2010.<\/p>\n<p>So by putting two RGB-D cameras on a car, you have introduced stereoscopic vision and the ability to capture a full 3D dataset of all the objects surrounding the car.<\/p>\n<p>But curiously, the 3D data from both LIDAR and RGB-D cameras has so far been analyzed only using our existing 2D CNN algorithms.\u00a0 Basically, we\u2019ve been throwing all that valuable 3D data away.<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Storing the Data for Analysis<\/strong><\/span><\/p>\n<p>Deep learning is finally catching up with techniques for 3D CNNs.\u00a0 The techniques are relatively new but are rapidly approaching commercialization.<\/p>\n<p>There are fundamentally two ways to store 3D image data for 3D CNN image classification.<\/p>\n<p><strong><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3608912024?profile=original\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3608912024?profile=RESIZE_710x\" width=\"300\" class=\"align-right\"><\/a>Point Clouds<\/strong> (a) are simply collections of 3D points in space, as might be collected from the rapidly rotating LIDAR beam on our AV.\u00a0 Each point has an \u2018xyz\u2019 address in space and may also carry \u2018rgb\u2019 data to better differentiate the object.\u00a0 This per-pixel data is converted to point clouds for 
processing.<\/p>\n<p><strong>Voxel grids<\/strong> (b) are 3D versions of pixels (\u2018voxel\u2019 is a mash-up of \u2018volume\u2019 and \u2018pixel\u2019).\u00a0 In our 2D CNN world we analyze only one \u2018slice\u2019 of pixels.\u00a0 In the voxel world there are many different 2D slices that add up to the full 3D object.\u00a0 Resolution depends on the size of the pixel as well as the depth of the \u2018slice\u2019.<\/p>\n<p>Point clouds can literally contain an unbounded number of points in space with coordinates that \u2018float\u2019, whereas in a voxel grid each voxel has a discrete coordinate within a predefined space.<\/p>\n<p>Point clouds are by definition unordered, while voxel grids are ordered data.<\/p>\n<p>Both versions appear to face the same computational obstacle.\u00a0 Given the common 256 x 256 image size for 2D CNN classification, it would seem that a 3D volume would require 256^3 voxels, which would impose very high computational and memory costs.\u00a0 In fact, experiments show that voxel grids of 32^3 or 64^3 yield accuracy similar to the larger 2D image.\u00a0 A 64^3 volume has essentially the same number of elements as a 512^2 image (both come to 262,144).<\/p>\n<p>\u00a0<\/p>\n<p><span style=\"font-size: 12pt;\"><strong>Performance Comparisons<\/strong><\/span><\/p>\n<p>Research is increasingly aided by a growing number of open source databases for training.\u00a0 Notably:<\/p>\n<ul>\n<li>MODELNET10: 10 categories, 4,930 objects<\/li>\n<li>MODELNET40: 40 categories, 12,431 objects<\/li>\n<li>SHAPENET CORE V2: 55 categories, 51,191 objects.<\/li>\n<\/ul>\n<p>The leading model packages so far are:<\/p>\n<p>For point-based: PointNet and PointNet++<\/p>\n<p>For voxel-based: Voxel ResNet and Voxel CNN<\/p>\n<p>Although research continues along both fronts, a recent study published by Nvidia comparing traditional, point-based, and voxel-based systems shows the voxel approach in the lead.\u00a0<a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3608916588?profile=original\" 
target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/3608916588?profile=RESIZE_710x\" width=\"600\" class=\"align-center\"><\/a><\/p>\n<p>Point cloud data could be used, but accuracy was lower and computational cost higher than with the voxel approaches.<\/p>\n<p>Voxelization converged faster and was also more accurate, making it preferable for real-time object classification.<\/p>\n<p>For more background, try these papers <em><u><a href=\"https:\/\/www.ri.cmu.edu\/pub_files\/2015\/9\/voxnet_maturana_scherer_iros15.pdf\">here<\/a><\/u><\/em> and <em><u><a href=\"https:\/\/uwaterloo.ca\/mobile-sensing\/sites\/ca.mobile-sensing\/files\/uploads\/files\/2019-wang-cheng-etal-li-neucom.pdf\">here<\/a><\/u><\/em>, or the original NVIDIA study <em><u><a href=\"http:\/\/on-demand.gputechconf.com\/gtc\/2018\/presentation\/s8453-point-cloud-deep-learning.pdf\">here<\/a><\/u><\/em>.<\/p>\n<p>\u00a0\u00a0<\/p>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/profiles\/blog\/list?user=0h5qapp2gbuf8\"><em><u>Other articles by Bill Vorhies<\/u><\/em><\/a><\/p>\n<p>\u00a0<\/p>\n<p>About the author:\u00a0 Bill is Contributing Editor for Data Science Central.\u00a0 Bill is also President &#038; Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001.\u00a0 His articles have been read more than 2 million times.<\/p>\n<p>He can be reached at:<\/p>\n<p><a href=\"mailto:Bill@DataScienceCentral.com\">Bill@DataScienceCentral.com<\/a> <span>or<\/span> <a href=\"mailto:Bill@Data-Magnum.com\">Bill@Data-Magnum.com<\/a><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:890748\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: William Vorhies Summary: \u00a0Autonomous vehicles (AVs) and many other systems that need to accurately perceive the world around them will be much better off [&hellip;] <span 
class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2019\/09\/24\/we-see-in-3d-so-should-our-cnn-models\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":459,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2610"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=2610"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/2610\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/462"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=2610"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=2610"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=2610"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}