{"id":5594,"date":"2022-04-29T18:00:00","date_gmt":"2022-04-29T18:00:00","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2022\/04\/29\/a-one-up-on-motion-capture\/"},"modified":"2022-04-29T18:00:00","modified_gmt":"2022-04-29T18:00:00","slug":"a-one-up-on-motion-capture","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2022\/04\/29\/a-one-up-on-motion-capture\/","title":{"rendered":"A one-up on motion capture"},"content":{"rendered":"<p>Author: Lauren Hinkel | MIT-IBM Watson AI Lab<\/p>\n<div>\n<p>From \u201cStar Wars\u201d to \u201cHappy Feet,\u201d many beloved films contain scenes that were made possible by motion capture technology, which records movement of objects or people through video. Further, applications for this tracking, which involve complicated interactions between physics, geometry, and perception, extend beyond Hollywood to the military, sports training, medical fields, and computer vision and robotics, allowing engineers to understand and simulate action happening within real-world environments.<\/p>\n<p>As this can be a complex and costly process \u2014 often requiring markers placed on objects or people and recording the action sequence \u2014 researchers are working to shift the burden to neural networks, which could acquire this data from a simple video and reproduce it in a model. Work in physics simulations and rendering shows promise to make this more widely used, since it can characterize realistic, continuous, dynamic motion from images and transform back and forth between a 2D render and 3D scene in the world. 
However, to do so, current techniques require precise knowledge of the environmental conditions where the action is taking place, and the choice of renderer, both of which are often unavailable.<\/p>\n<p>Now, a team of researchers from MIT and IBM has developed a trained neural network pipeline that avoids this issue, with the ability to infer the state of the environment and the actions happening, the physical characteristics of the object or person of interest (system), and its control parameters. When tested, the technique can outperform other methods in simulations of four physical systems of rigid and deformable bodies, which illustrate different types of dynamics and interactions, under various environmental conditions. Further, the methodology allows for imitation learning \u2014 predicting and reproducing the trajectory of a real-world, flying quadrotor from a video.<\/p>\n<p>\u201cThe high-level research problem this paper deals with is how to reconstruct a digital twin from a video of a dynamic system,\u201d says Tao Du PhD &#8217;21, a postdoc in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a member of the research team. In order to do this, Du says, \u201cwe need to ignore the rendering variances from the video clips and try to grasp the core information about the dynamic system or the dynamic motion.\u201d<\/p>\n<p>Du\u2019s co-authors include lead author Pingchuan Ma, a graduate student in EECS and a member of CSAIL; Josh Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a member of CSAIL; Wojciech Matusik, professor of electrical engineering and computer science and CSAIL member; and MIT-IBM Watson AI Lab principal research staff member Chuang Gan. 
This work was presented this week at the\u00a0International Conference on Learning Representations.<\/p>\n<p>While capturing videos of characters, robots, or dynamic systems to infer dynamic movement makes this information more accessible, it also brings a new challenge. \u201cThe images or videos [and how they are rendered] depend largely on the lighting conditions, on the background info, on the texture information, on the material information of your environment, and these are not necessarily measurable in a real-world scenario,\u201d says Du. Without this rendering configuration information or knowledge of which renderer is used, it\u2019s presently difficult to glean dynamic information and predict behavior of the subject of the video. Even if the renderer is known, current neural network approaches still require large sets of training data. However, with their new approach, this can become a moot point. \u201cIf you take a video of a leopard running in the morning and in the evening, of course, you&#8217;ll get visually different video clips because the lighting conditions are quite different. But what you really care about is the dynamic motion: the joint angles of the leopard \u2014 not if they look light or dark,\u201d Du says.<\/p>\n<p>In order to take rendering domains and image differences out of the issue, the team developed a pipeline system containing a neural network, dubbed the \u201crendering invariant state-prediction (RISP)\u201d network. RISP transforms differences in images (pixels) to differences in states of the system \u2014 i.e., the environment of action \u2014 making their method generalizable and agnostic to rendering configurations. RISP is trained using random rendering parameters and states, which are fed into a differentiable renderer, a type of renderer that measures the sensitivity of pixels with respect to rendering configurations, e.g., lighting or material colors. 
This generates a set of varied images and video from known ground-truth parameters, which will later allow RISP to reverse that process, predicting the environment state from the input video. The team additionally minimized RISP\u2019s rendering gradients, so that its predictions were less sensitive to changes in rendering configurations, allowing it to learn to forget about visual appearances and focus on learning dynamical states. This is made possible by a differentiable renderer.<\/p>\n<p>The method then uses two similar pipelines, run in parallel. One is for the source domain, with known variables. Here, system parameters and actions are entered into a differentiable simulation. The generated simulation\u2019s states are combined with different rendering configurations into a differentiable renderer to generate images, which are fed into RISP. RISP then outputs predictions about the environmental states. At the same time, a similar target domain pipeline is run with unknown variables. RISP in this pipeline is fed these output images, generating a predicted state. When the predicted states from the source and target domains are compared, a new loss is produced; this difference is used to adjust and optimize some of the parameters in the source domain pipeline. This process can then be iterated on, further reducing the loss between the pipelines.<\/p>\n<p>To determine the success of their method, the team tested it in four simulated systems: a quadrotor (a flying rigid body that doesn\u2019t have any physical contact), a cube (a rigid body that interacts with its environment, like a die), an articulated hand, and a rod (deformable body that can move like a snake). The tasks included estimating the state of a system from an image, identifying the system parameters and action control signals from a video, and discovering the control signals from a target image that direct the system to the desired state. 
Additionally, they created baselines and an oracle, comparing the novel RISP process in these systems to similar methods that, for example, lack the rendering gradient loss, don\u2019t train a neural network with any loss, or lack the RISP neural network altogether. The team also looked at how the gradient loss impacted the state prediction model\u2019s performance over time. Finally, the researchers deployed their RISP system to infer the motion of a real-world quadrotor, which has complex dynamics, from video. They compared the performance to other techniques that lacked a loss function and used pixel differences, or one that included manual tuning of a renderer\u2019s configuration.<\/p>\n<p>In nearly all of the experiments, the RISP procedure outperformed similar or state-of-the-art methods, imitating or reproducing the desired parameters or motion, and proving to be a data-efficient and generalizable competitor to current motion capture approaches.<\/p>\n<p>For this work, the researchers made two important assumptions: that information about the camera, such as its position and settings, is known, and that the geometry and physics governing the object or person being tracked are known. Future work is planned to address these assumptions.<\/p>\n<p>\u201cI think the biggest problem we&#8217;re solving here is to reconstruct the information in one domain to another, without very expensive equipment,\u201d says Ma. Such an approach should be \u201cuseful for [applications such as the] metaverse, which aims to reconstruct\u00a0the physical world in a virtual\u00a0environment,\u201d adds Gan. 
\u201cIt is basically an everyday, available solution, that\u2019s neat and simple, to cross domain reconstruction or the inverse dynamics problem,\u201d says Ma.<\/p>\n<p>This research was supported, in part, by the MIT-IBM Watson AI Lab, Nexplore, DARPA Machine Common Sense program, Office of Naval Research (ONR), ONR MURI, and Mitsubishi Electric.<\/p>\n<\/div>\n<p><a href=\"https:\/\/news.mit.edu\/2022\/one-motion-capture-neural-network-0429\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Lauren Hinkel | MIT-IBM Watson AI Lab From \u201cStar Wars\u201d to \u201cHappy Feet,\u201d many beloved films contain scenes that were made possible by motion [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2022\/04\/29\/a-one-up-on-motion-capture\/\">Read More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":457,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[24],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5594"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=5594"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/5594\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aipro
blog.com\/index.php\/wp-json\/wp\/v2\/media\/464"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=5594"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=5594"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=5594"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}