Articulated body pose estimation

{{Short description|Field of study in computer vision}}

In computer vision, articulated body pose estimation is the task of algorithmically determining the pose of a body composed of connected parts (joints and rigid parts) from image or video data.

This challenging problem, central to enabling robots and other systems to understand human actions and interactions, has been a long-standing research area due to the complexity of modeling the relationship between visual observations and pose, as well as the wide range of applications.{{Cite journal |last1=Moeslund |first1=Thomas B. |last2=Granum |first2=Erik |date=2001-03-01 |title=A Survey of Computer Vision-Based Human Motion Capture |url=https://dl.acm.org/doi/abs/10.1006/cviu.2000.0897 |journal=Computer Vision and Image Understanding |volume=81 |issue=3 |pages=231–268 |doi=10.1006/cviu.2000.0897 |issn=1077-3142|url-access=subscription }}{{Cite web |title=Survey of Advances in Computer Vision-based Human Motion Capture (2006) |url=http://www.sciencedirect.com/science/article/B6WCX-4M1DB7H-1/2/8da6f6e7a8c8e07d9331bc7738c6d499 |url-status=dead |archive-url=https://web.archive.org/web/20080302215124/http://www.sciencedirect.com/science/article/B6WCX-4M1DB7H-1/2/8da6f6e7a8c8e07d9331bc7738c6d499 |archive-date=2008-03-02 |access-date=2007-09-15}}

==Description==

Enabling robots to perceive humans in their environment is crucial for effective interaction. For example, interpreting pointing gestures requires the ability to recognize and understand human body pose. This makes pose estimation a significant and challenging problem in computer vision, driving extensive research and development of numerous algorithms over the past two decades. Many successful approaches rely on training complex models with large datasets.

Articulated pose estimation is particularly difficult due to the high dimensionality of human movement. The human body has 244 degrees of freedom and 230 joints. While not all joint movements are readily apparent, even a simplified representation of the body with 10 major parts and 20 degrees of freedom presents considerable challenges. Algorithms must account for substantial appearance variations caused by clothing, body shape, size, and hairstyles. Ambiguities arise from occlusions, both self-occlusions (e.g., a hand obscuring the face) and occlusions from external objects. Furthermore, most algorithms operate on monocular (2D) images, which lack inherent 3D information, exacerbating the ambiguity. Varying lighting conditions and camera viewpoints further complicate the task. These difficulties are amplified when real-time performance or other constraints are imposed. Recent research explores the use of RGB-D cameras, which capture both color and depth information, to address the limitations of monocular approaches.Droeschel, David, and Sven Behnke. "[https://www.researchgate.net/profile/Sven_Behnke/publication/221105480_3D_Body_Pose_Estimation_Using_an_Adaptive_Person_Model_for_Articulated_ICP/links/0912f5012c810acc14000000.pdf 3D body pose estimation using an adaptive person model for articulated ICP]." Intelligent Robotics and Applications. Springer Berlin Heidelberg, 2011. 157–167.

==Sensors==

The typical articulated body pose estimation system involves a model-based approach, in which the pose estimation is achieved by maximizing the similarity (or minimizing the dissimilarity) between an observation (input) and a template model. Different kinds of sensors have been explored for use in making the observation, including the following:

  • Visible wavelength imagery,
  • Long-wave thermal infrared imagery,{{cite book| author=Han, J.| author2=Gaszczak, A.| author3=Maciol, R.| author4=Barnes, S.E.| author5=Breckon, T.P.| chapter=Human Pose Classification within the Context of Near-IR Imagery Tracking| title=Proc. SPIE Optics and Photonics for Counterterrorism, Crime Fighting and Defence|date=September 2013| volume=8901| number=E| pages=89010E| publisher=SPIE| doi=10.1117/12.2028375| chapter-url=http://www.durham.ac.uk/toby.breckon/publications/papers/han13humanpose.pdf| access-date=5 November 2013| series=Optics and Photonics for Counterterrorism, Crime Fighting and Defence IX; and Optical Materials and Biomaterials in Security and Defence Systems Technology X| citeseerx=10.1.1.391.380| s2cid=17034080| editor1-last=Zamboni| editor1-first=Roberto| editor2-last=Kajzar| editor2-first=Francois| editor3-last=Szep| editor3-first=Attila A.| editor4-last=Burgess| editor4-first=Douglas| editor5-last=Owen| editor5-first=Gari}}
  • Time-of-flight imagery, and
  • Laser range scanner imagery.

These sensors produce intermediate representations that are directly used by the model. The representations include the following:

  • Image appearance,
  • Voxel (volume element) reconstruction,
  • 3D point clouds,
  • Sum of Gaussian kernels,M. Ding and G. Fan, [https://ieeexplore.ieee.org/document/7045868/ "Generalized Sum of Gaussians for Real-Time Human Pose Tracking from a Single Depth Sensor"] 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2015 and
  • 3D surface meshes.

==Classical models==

===Part models===

The basic idea of part-based models can be traced to the human skeleton. Any object with the property of articulation can be broken down into smaller parts, each of which can take on different orientations, producing different articulations of the same object. Different scales and orientations of the main object correspond to scales and orientations of its parts. To express the model in mathematical terms, the parts are connected to one another by springs; the model is therefore also known as a spring model. The degree of closeness between parts is captured by the compression and extension of the springs, and geometric constraints limit the orientations of the springs. For example, the limbs of the legs cannot rotate through 360 degrees, so the corresponding parts cannot take on such extreme orientations. This reduces the space of possible configurations.Fischler, Martin A., and Robert A. Elschlager. "[https://pdfs.semanticscholar.org/719d/a2a0ddd38e78151e1cb2db31703ea8b2e490.pdf The representation and matching of pictorial structures]." IEEE Transactions on Computers C-22, no. 1 (1973): 67–92.

The spring model forms a graph G(V,E), where the nodes V correspond to the parts and the edges E represent the springs connecting neighboring parts. Each location in the image can be addressed by the x and y coordinates of a pixel. Let <math>\mathbf{p}_{i}(x, \, y)</math> be the point at the <math>i^{th}</math> location. Then the cost associated with joining the spring between the <math>i^{th}</math> and the <math>j^{th}</math> points is <math>S(\mathbf{p}_{i},\,\mathbf{p}_{j}) = S(\mathbf{p}_{i} - \mathbf{p}_{j})</math>. Hence the total cost of placing <math>l</math> components at locations <math>\mathbf{P}_{l}</math> is given by

:<math>S(\mathbf{P}_{l}) = \sum_{i=1}^{l} \sum_{j=1}^{i} s_{ij}(\mathbf{p}_{i},\,\mathbf{p}_{j})</math>

The above equation simply represents the spring model used to describe body pose. To estimate pose from an image, a cost or energy function must be minimized. This energy function consists of two terms: the first reflects how well each component matches the image data, and the second reflects how well the oriented (deformed) parts match one another, thus accounting for articulation as well as object detection.Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. "[https://www.cse.unr.edu/~bebis/CS773C/ObjectRecognition/Papers/Felzenswalb05.pdf Pictorial structures for object recognition]." International Journal of Computer Vision 61.1 (2005): 55–79.
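
The cited methods minimize this kind of energy efficiently on tree-structured models using distance transforms. The following is a minimal illustrative sketch, not taken from those papers: it minimizes a simplified two-term energy over a chain of parts by brute-force dynamic programming, with hypothetical candidate locations, match costs, and spring parameters.

<syntaxhighlight lang="python">
import numpy as np

# Sketch of pictorial-structure energy minimization on a chain of parts.
# Each part i has candidate pixel locations with a unary "match" cost (how well
# the part's appearance fits the image there), and consecutive parts are linked
# by a quadratic "spring" cost around a preferred offset.

def spring_cost(p, q, rest_offset, stiffness=0.01):
    """Deformation cost s_ij(p_i, p_j) for a spring with a preferred offset."""
    d = (q - p) - rest_offset
    return stiffness * float(d @ d)

def min_energy_chain(candidates, match_costs, rest_offsets):
    """Dynamic programming over a chain: parts 0..n-1, springs between i and i+1.

    candidates[i]  : (k_i, 2) array of candidate (x, y) locations for part i
    match_costs[i] : (k_i,) array of unary appearance costs
    rest_offsets[i]: preferred offset from part i to part i+1
    Returns the minimal total energy and the chosen location of every part.
    """
    n = len(candidates)
    best = [match_costs[0].copy()]  # best[i][k]: best cost ending at candidate k
    back = []                       # backpointers for recovering the argmin
    for i in range(1, n):
        costs = np.empty(len(candidates[i]))
        ptrs = np.empty(len(candidates[i]), dtype=int)
        for k, q in enumerate(candidates[i]):
            pair = [best[i - 1][m] + spring_cost(p, q, rest_offsets[i - 1])
                    for m, p in enumerate(candidates[i - 1])]
            ptrs[k] = int(np.argmin(pair))
            costs[k] = pair[ptrs[k]] + match_costs[i][k]
        best.append(costs)
        back.append(ptrs)
    # Trace back the optimal configuration.
    idx = [int(np.argmin(best[-1]))]
    for ptrs in reversed(back):
        idx.append(int(ptrs[idx[-1]]))
    idx.reverse()
    return float(best[-1][idx[-1]]), [candidates[i][idx[i]] for i in range(n)]

# Toy example: 3 parts with a few candidate locations each (values hypothetical).
rng = np.random.default_rng(0)
cands = [rng.uniform(0, 100, size=(5, 2)) for _ in range(3)]
unary = [rng.uniform(0, 1, size=5) for _ in range(3)]
offsets = [np.array([0.0, 20.0]), np.array([0.0, 20.0])]  # parts stacked vertically
energy, placement = min_energy_chain(cands, unary, offsets)
print(energy, placement)
</syntaxhighlight>

For a chain of n parts with k candidates each, this brute-force version runs in O(nk²) time; the generalized distance transform used in the cited work reduces the pairwise minimization step to linear time in k.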

Part models, also known as pictorial structures, are one of the basic models on which other, more efficient models are built through slight modification. One such example is the flexible mixture model, which reduces the database of hundreds or thousands of deformed parts by exploiting the notion of local rigidity.Yang, Yi, and Deva Ramanan. "[https://vision.ics.uci.edu/papers/YangR_CVPR_2011/YangR_CVPR_2011.pdf Articulated pose estimation with flexible mixtures-of-parts]." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.

===Articulated model with quaternions===

The kinematic skeleton is constructed as a tree-structured chain.M. Ding and G. Fan, [https://ieeexplore.ieee.org/document/7350118/ "Articulated and Generalized Gaussian Kernel Correlation for Human Pose Estimation"] IEEE Transactions on Image Processing, Vol. 25, No. 2, Feb 2016 Each rigid body segment has a local coordinate system that can be transformed to the world coordinate system via a 4×4 transformation matrix <math>T_l</math>,

:<math>T_{l} = T_{\operatorname{par}(l)} R_{l},</math>

where <math>R_l</math> denotes the local transformation from body segment <math>S_l</math> to its parent <math>\operatorname{par}(S_l)</math>. Each joint in the body has three degrees of freedom (DoF) of rotation. Given a transformation matrix <math>T_l</math>, the joint positions in the T-pose can be transferred to their corresponding positions in the world coordinate system. In many works, the 3D joint rotation is expressed as a normalized quaternion <math>[x, y, z, w]</math> because of its continuity, which facilitates gradient-based optimization in the parameter estimation.
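
As an illustration of this kinematic chain, the sketch below (using a hypothetical three-joint skeleton) converts per-joint quaternions to rotation matrices, composes the 4×4 transforms <math>T_l = T_{\operatorname{par}(l)} R_l</math> down the tree, and reads off world-space joint positions.

<syntaxhighlight lang="python">
import numpy as np

def quat_to_matrix(q):
    """Convert a normalized quaternion [x, y, z, w] to a 3x3 rotation matrix."""
    x, y, z, w = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])

def local_transform(q, offset):
    """4x4 local transform R_l: rotation about the joint plus the fixed bone offset."""
    T = np.eye(4)
    T[:3, :3] = quat_to_matrix(q)
    T[:3, 3] = offset
    return T

def forward_kinematics(parents, offsets, quats):
    """Compose T_l = T_par(l) @ R_l down a tree-structured kinematic chain.

    parents[l] : index of the parent of segment l (-1 for the root)
    offsets[l] : T-pose offset of joint l from its parent, in the parent's frame
    quats[l]   : per-joint rotation as a quaternion [x, y, z, w]
    Returns the world-space position of every joint.
    """
    n = len(parents)
    world = [None] * n
    for l in range(n):  # assumes parents precede children in the ordering
        R_l = local_transform(quats[l], offsets[l])
        world[l] = R_l if parents[l] < 0 else world[parents[l]] @ R_l
    return np.array([T[:3, 3] for T in world])

# Toy 3-joint chain (hypothetical skeleton): root -> spine -> head.
parents = [-1, 0, 1]
offsets = np.array([[0, 0, 0], [0, 0.5, 0], [0, 0.3, 0]], dtype=float)
identity = np.array([0.0, 0.0, 0.0, 1.0])
bend = np.array([np.sin(np.pi/8), 0.0, 0.0, np.cos(np.pi/8)])  # 45° about x
print(forward_kinematics(parents, offsets, [identity, bend, identity]))
</syntaxhighlight>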

==Deep learning based models==

Since about 2016, deep learning has emerged as the dominant method for performing accurate articulated body pose estimation. Rather than building an explicit model of the parts as above, the appearances of the joints and the relationships between the joints of the body are learned from large training sets. Models generally focus on extracting the 2D positions of joints (keypoints), the 3D positions of joints, or the 3D shape of the body from either a single image or multiple images.

=== Supervised ===

==== 2D joint positions ====

The first deep learning models that emerged focused on extracting the 2D positions of human joints in an image. Such models take in an image and pass it through a convolutional neural network to obtain a series of heatmaps (one for each joint) which take on high values where joints are detected.{{Citation|last1=Insafutdinov|first1=Eldar|title=DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model|date=2016|url=http://dx.doi.org/10.1007/978-3-319-46466-4_3|work=Computer Vision – ECCV 2016|pages=34–50|place=Cham|publisher=Springer International Publishing|isbn=978-3-319-46465-7|access-date=2021-06-30|last2=Pishchulin|first2=Leonid|last3=Andres|first3=Bjoern|last4=Andriluka|first4=Mykhaylo|last5=Schiele|first5=Bernt|series=Lecture Notes in Computer Science |volume=9910 |doi=10.1007/978-3-319-46466-4_3|arxiv=1605.03170|s2cid=6736694}}{{Cite book|last1=Cao|first1=Zhe|last2=Simon|first2=Tomas|last3=Wei|first3=Shih-En|last4=Sheikh|first4=Yaser|title=2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |chapter=Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields |date=July 2017|chapter-url=http://dx.doi.org/10.1109/cvpr.2017.143|pages=1302–1310|publisher=IEEE|doi=10.1109/cvpr.2017.143|arxiv=1611.08050|isbn=978-1-5386-0457-1|s2cid=16224674}}
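
A minimal sketch of the decoding step is shown below: given a stack of per-joint heatmaps (synthetic here; in a real system they would come from a trained convolutional network), the 2D joint positions are read off as the per-channel maxima, rescaled to image coordinates, and thresholded on confidence.

<syntaxhighlight lang="python">
import numpy as np

def keypoints_from_heatmaps(heatmaps, image_size, threshold=0.1):
    """Read one ((x, y), confidence) pair per joint from a stack of heatmaps.

    heatmaps   : (num_joints, H, W) array, one channel per joint
    image_size : (width, height) of the original image, for rescaling
    Joints whose peak falls below `threshold` are reported as missing (None).
    """
    num_joints, h, w = heatmaps.shape
    sx, sy = image_size[0] / w, image_size[1] / h
    results = []
    for j in range(num_joints):
        flat = int(np.argmax(heatmaps[j]))
        y, x = divmod(flat, w)  # row-major flat index -> (row, col)
        conf = float(heatmaps[j, y, x])
        results.append(((x * sx, y * sy), conf) if conf >= threshold else None)
    return results

# Synthetic example: 2 joints on a 64x64 heatmap grid for a 256x256 image.
hm = np.zeros((2, 64, 64))
hm[0, 10, 20] = 0.9   # joint 0 detected near (80, 40) in image coordinates
hm[1, 40, 32] = 0.05  # joint 1 below threshold -> treated as not detected
print(keypoints_from_heatmaps(hm, (256, 256)))
</syntaxhighlight>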

When there are multiple people per image, two main techniques have emerged for grouping joints within each person. In the first, "bottom-up" approach, the neural network is also trained to generate "part affinity fields", which indicate the locations of limbs. Using these fields, joints can be grouped limb by limb by solving a series of assignment problems, as illustrated in the sketch below. In the second, "top-down" approach, an additional network is used to first detect people in the image, and the pose estimation network is then applied to each detected person.{{Cite book|last1=Fang|first1=Hao-Shu|last2=Xie|first2=Shuqin|last3=Tai|first3=Yu-Wing|last4=Lu|first4=Cewu|title=2017 IEEE International Conference on Computer Vision (ICCV) |chapter=RMPE: Regional Multi-person Pose Estimation |date=October 2017|chapter-url=http://dx.doi.org/10.1109/iccv.2017.256|pages=2353–2362|publisher=IEEE|doi=10.1109/iccv.2017.256|arxiv=1612.00137|isbn=978-1-5386-1032-9|s2cid=6529517}}
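
The per-limb grouping step in the bottom-up approach can be posed as a linear assignment problem. The sketch below illustrates this with made-up pairing scores; in a PAF-based system each score would be the part affinity field integrated along the line segment joining the two candidate joints.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_limbs(scores):
    """Pair candidate limb endpoints (e.g. shoulders with elbows) for one limb type.

    scores[a, b] is a pairing score for joint candidate a and joint candidate b;
    higher is better. Solved as a linear assignment problem.
    """
    rows, cols = linear_sum_assignment(-scores)  # negate to maximize total score
    return [(a, b, scores[a, b]) for a, b in zip(rows, cols)]

# Hypothetical scores: 3 shoulder candidates vs 3 elbow candidates in one image.
scores = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.2, 0.7],
])
print(match_limbs(scores))  # pairs (0,0), (1,1), (2,2)
</syntaxhighlight>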

==== 3D joint positions ====

With the advent of multiple datasets with human pose annotated in multiple views,{{Cite journal|last1=Ionescu|first1=Catalin|last2=Papava|first2=Dragos|last3=Olaru|first3=Vlad|last4=Sminchisescu|first4=Cristian|date=July 2014|title=Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments|url=http://dx.doi.org/10.1109/tpami.2013.248|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=36|issue=7|pages=1325–1339|doi=10.1109/tpami.2013.248|pmid=26353306|s2cid=4244548|issn=0162-8828|url-access=subscription}}{{Cite journal|last1=Sigal|first1=Leonid|last2=Balan|first2=Alexandru O.|last3=Black|first3=Michael J.|date=2009-08-05|title=HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion|url=http://dx.doi.org/10.1007/s11263-009-0273-6|journal=International Journal of Computer Vision|volume=87|issue=1–2|pages=4–27|doi=10.1007/s11263-009-0273-6|s2cid=11279201|issn=0920-5691|url-access=subscription}} models which detect 3D joint positions became more popular. These again fell into two categories. In the first, a neural network is used to detect 2D joint positions from each view, and these detections are then triangulated to obtain 3D joint positions.{{Cite journal|last1=Nath|first1=Tanmay|last2=Mathis|first2=Alexander|last3=Chen|first3=An Chi|last4=Patel|first4=Amir|last5=Bethge|first5=Matthias|last6=Mathis|first6=Mackenzie Weygandt|date=2018-11-24|title=Using DeepLabCut for 3D markerless pose estimation across species and behaviors|url=http://dx.doi.org/10.1101/476531|access-date=2021-06-30|journal=bioRxiv|page=476531|doi=10.1101/476531|s2cid=92206469}} The 2D network may be refined to produce better detections based on the 3D data.{{Cite book|last1=Iskakov|first1=Karim|last2=Burkov|first2=Egor|last3=Lempitsky|first3=Victor|last4=Malkov|first4=Yury|title=2019 IEEE/CVF International Conference on Computer Vision (ICCV) |chapter=Learnable Triangulation of Human Pose |date=October 2019|chapter-url=http://dx.doi.org/10.1109/iccv.2019.00781|pages=7717–7726|publisher=IEEE|doi=10.1109/iccv.2019.00781|arxiv=1905.05754|isbn=978-1-7281-4803-8|s2cid=153312868}} Furthermore, such approaches often have filters in both 2D and 3D to refine the detected points.{{Cite journal|last1=Karashchuk|first1=Pierre|last2=Rupp|first2=Katie L.|last3=Dickinson|first3=Evyn S.|last4=Sanders|first4=Elischa|last5=Azim|first5=Eiman|last6=Brunton|first6=Bingni W.|last7=Tuthill|first7=John C.|date=2020-05-29|title=Anipose: a toolkit for robust markerless 3D pose estimation|journal=bioRxiv|volume=36 |issue=13 |doi=10.1101/2020.05.26.117325|pmid=34592148 |s2cid=219167984|doi-access=free|pmc=8498918}}{{Cite journal|last1=Günel|first1=Semih|last2=Rhodin|first2=Helge|last3=Morales|first3=Daniel|last4=Campagnolo|first4=João|last5=Ramdya|first5=Pavan|last6=Fua|first6=Pascal|date=2019-10-04|editor-last=O'Leary|editor-first=Timothy|editor2-last=Calabrese|editor2-first=Ronald L|editor3-last=Shaevitz|editor3-first=Josh W|title=DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila|journal=eLife|volume=8|pages=e48571|doi=10.7554/eLife.48571|pmid=31584428|pmc=6828327|issn=2050-084X |doi-access=free }} In the second, a neural network is trained end-to-end to predict 3D joint positions directly from a set of images, without intermediate 2D joint detections.
Such approaches often project image features into a cube and then use a 3D convolutional neural network to predict a 3D heatmap for each joint.{{Cite journal|last1=Dunn|first1=Timothy W.|last2=Marshall|first2=Jesse D.|last3=Severson|first3=Kyle S.|last4=Aldarondo|first4=Diego E.|last5=Hildebrand|first5=David G. C.|last6=Chettih|first6=Selmaan N.|last7=Wang|first7=William L.|last8=Gellis|first8=Amanda J.|last9=Carlson|first9=David E.|last10=Aronov|first10=Dmitriy|last11=Freiwald|first11=Winrich A.|date=2021-04-19|title=Geometric deep learning enables 3D kinematic profiling across species and environments|url=http://dx.doi.org/10.1038/s41592-021-01106-6|journal=Nature Methods|volume=18|issue=5|pages=564–573|doi=10.1038/s41592-021-01106-6 |pmc=8530226|pmid=33875887|s2cid=233310558|issn=1548-7091}}{{Cite journal|last1=Zimmermann|first1=Christian|last2=Schneider|first2=Artur|last3=Alyahyay|first3=Mansour|last4=Brox|first4=Thomas|last5=Diester|first5=Ilka|date=2020-02-27|title=FreiPose: A Deep Learning Framework for Precise Animal Motion Capture in 3D Spaces|url=http://dx.doi.org/10.1101/2020.02.27.967620|access-date=2021-06-30|journal=bioRxiv|doi=10.1101/2020.02.27.967620|s2cid=213583372}}
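
The triangulation step in the first family of approaches can be implemented with the standard direct linear transform (DLT). The sketch below is illustrative, with hypothetical camera matrices rather than ones from a calibrated rig.

<syntaxhighlight lang="python">
import numpy as np

def triangulate(projections, points_2d):
    """Linear (DLT) triangulation of one joint seen in several calibrated views.

    projections : list of 3x4 camera projection matrices P
    points_2d   : list of (x, y) detections of the same joint, one per view
    Each view contributes two rows, x*P[2]-P[0] and y*P[2]-P[1], to a
    homogeneous system A X = 0, which is solved by SVD.
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Two illustrative cameras observing the point (0.1, 0.2, 3.0).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])              # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # shifted 1 unit along x
X_true = np.array([0.1, 0.2, 3.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate([P1, P2], [x1, x2]))  # ~ [0.1, 0.2, 3.0]
</syntaxhighlight>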

==== 3D shape ====

Concurrently with the work above, scientists have been working on estimating the full 3D shape of a human or animal from a set of images. Most of the work is based on estimating the appropriate pose of the skinned multi-person linear (SMPL) model{{Cite journal|last1=Loper|first1=Matthew|last2=Mahmood|first2=Naureen|last3=Romero|first3=Javier|last4=Pons-Moll|first4=Gerard|last5=Black|first5=Michael J.|date=2015-11-04|title=SMPL|url=http://dx.doi.org/10.1145/2816795.2818013|journal=ACM Transactions on Graphics|volume=34|issue=6|pages=1–16|doi=10.1145/2816795.2818013|s2cid=229365481 |issn=0730-0301|url-access=subscription}} within an image. Variants of the SMPL model for other animals have also been developed.{{Citation|last1=Badger|first1=Marc|title=3D Bird Reconstruction: A Dataset, Model, and Shape Recovery from a Single View|date=2020|url=http://dx.doi.org/10.1007/978-3-030-58523-5_1|work=Computer Vision – ECCV 2020|pages=1–17|place=Cham|publisher=Springer International Publishing|isbn=978-3-030-58522-8|access-date=2021-06-30|last2=Wang|first2=Yufu|last3=Modh|first3=Adarsh|last4=Perkes|first4=Ammon|last5=Kolotouros|first5=Nikos|last6=Pfrommer|first6=Bernd G.|last7=Schmidt|first7=Marc F.|last8=Daniilidis|first8=Kostas|series=Lecture Notes in Computer Science |volume=12363 |doi=10.1007/978-3-030-58523-5_1|pmid=35822859 |pmc=9273110 |arxiv=2008.06133|s2cid=221135758}}{{Cite book|last1=Zuffi|first1=Silvia|last2=Kanazawa|first2=Angjoo|last3=Black|first3=Michael J.|title=2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition |chapter=Lions and Tigers and Bears: Capturing Non-rigid, 3D, Articulated Shape from Images |date=June 2018|chapter-url=http://dx.doi.org/10.1109/cvpr.2018.00416|pages=3955–3963|publisher=IEEE|doi=10.1109/cvpr.2018.00416|isbn=978-1-5386-6420-9|s2cid=46907802}}{{Citation|last1=Biggs|first1=Benjamin|title=Creatures Great and SMAL: Recovering the Shape and Motion of Animals from Video|date=2019|url=http://dx.doi.org/10.1007/978-3-030-20873-8_1|work=Computer Vision – ACCV 2018|pages=3–19|place=Cham|publisher=Springer International Publishing|isbn=978-3-030-20872-1|access-date=2021-06-30|last2=Roddick|first2=Thomas|last3=Fitzgibbon|first3=Andrew|last4=Cipolla|first4=Roberto|series=Lecture Notes in Computer Science |volume=11365 |doi=10.1007/978-3-030-20873-8_1|arxiv=1811.05804|s2cid=53305772}} Generally, some keypoints and a silhouette are detected for each animal within the image, and then the parameters of the 3D shape model are fitted to match the positions of the keypoints and the silhouette.
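
A minimal sketch of the generic fitting loop is given below. The body_model and project functions are placeholders standing in for a real parametric model such as SMPL and a calibrated camera; the point is only to show the reprojection-error minimization.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

# Placeholder "body model": maps parameters to 3D keypoints. In practice this
# would be SMPL (pose and shape parameters -> mesh and joint regressor); here a
# trivial hypothetical model with 3 keypoints keeps the sketch runnable.
def body_model(params):
    tx, ty, s = params  # translation and scale stand in for pose/shape
    base = np.array([[0.0, 0.0, 3.0], [0.2, 0.0, 3.0], [0.0, 0.5, 3.0]])
    return base * s + np.array([tx, ty, 0.0])

def project(points_3d):
    """Pinhole projection with unit focal length (illustrative camera)."""
    return points_3d[:, :2] / points_3d[:, 2:3]

def reprojection_loss(params, keypoints_2d):
    """Sum of squared distances between projected model joints and detections."""
    residual = project(body_model(params)) - keypoints_2d
    return float(np.sum(residual ** 2))

# Detected 2D keypoints (here generated from known parameters for illustration).
true_params = np.array([0.3, -0.1, 1.2])
detections = project(body_model(true_params))

fit = minimize(reprojection_loss, x0=np.array([0.0, 0.0, 1.0]),
               args=(detections,), method="Nelder-Mead")
print(fit.x)  # recovers approximately [0.3, -0.1, 1.2]
</syntaxhighlight>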

=== Unsupervised ===

The above algorithms all rely on annotated images, which can be time-consuming to produce. To address this issue, computer vision researchers have developed new algorithms that can learn 3D keypoints given only annotated 2D images from a single view, or identify keypoints in videos without any annotations.

==Applications==

=== Assisted living ===

Personal care robots may be deployed in future assisted living homes. For these robots, high-accuracy human detection and pose estimation are necessary to perform a variety of tasks, such as fall detection. Additionally, this application imposes a number of performance constraints. {{citation needed | date = May 2017}}

===Character animation===

Traditionally, character animation has been a manual process. However, poses can now be driven directly by a real-life actor through specialized pose estimation systems. Older systems relied on markers or specialized suits, while recent advances in pose estimation and motion capture have enabled markerless applications, sometimes in real time.{{cite web|last1=Dent|first1=Steven|title=What you need to know about 3D motion capture|url=https://www.engadget.com/2014/07/14/motion-capture-explainer/|website=Engadget|date=14 July 2014 |publisher=AOL Inc|access-date=31 May 2017}}

===Intelligent driver assistance systems===

Car accidents account for about two percent of deaths globally each year. As such, an intelligent system tracking driver pose may be useful for emergency alerts {{dubious|date=May 2017}}. Along the same lines, pedestrian detection algorithms have been used successfully in autonomous cars, enabling the car to make smarter decisions. {{citation needed | date = May 2017}}

===Video games===

Commercially, pose estimation has been used in the context of video games, popularized with the Microsoft Kinect sensor (a depth camera). These systems track the user to render their avatar in-game, in addition to performing tasks like gesture recognition to enable the user to interact with the game. As such, this application has a strict real-time requirement.{{cite web|last1=Kohli|first1=Pushmeet|last2=Shotton|first2=Jamie|title=Key Developments in Human Pose Estimation for Kinect|url=https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ks_book_2012.pdf|publisher=Microsoft|access-date=31 May 2017}}

===Medical applications===

Pose estimation has been used to detect postural issues such as scoliosis by analyzing abnormalities in a patient's posture,Aroeira, Rozilene Maria C., Estevam B. de Las Casas, Antônio Eustáquio M. Pertence, Marcelo Greco, and João Manuel R.S. Tavares. "Non-Invasive Methods of Computer Vision in the Posture Evaluation of Adolescent Idiopathic Scoliosis." Journal of Bodywork and Movement Therapies 20, no. 4 (October 2016): 832–43. https://doi.org/10.1016/j.jbmt.2016.02.004. to assist in physical therapy, and to study the cognitive brain development of young children by monitoring motor functionality.Khan, Muhammad Hassan, Julien Helsper, Muhammad Shahid Farid, and Marcin Grzegorzek. "A Computer Vision-Based System for Monitoring Vojta Therapy." International Journal of Medical Informatics 113 (May 2018): 85–95. https://doi.org/10.1016/j.ijmedinf.2018.02.010.

===Other applications===

Other applications include video surveillance, animal tracking and behavior understanding, sign language detection, advanced human–computer interaction, and markerless motion capturing.

==Related technology==

A commercially successful but specialized computer vision-based articulated body pose estimation technique is optical motion capture. This approach involves placing markers on the individual at strategic locations to capture the six degrees of freedom of each body part.

==Research groups==

A number of groups and companies are researching pose estimation, including groups at Brown University, Carnegie Mellon University, MPI Saarbruecken, Stanford University, the University of California, San Diego, the University of Toronto, the École Centrale Paris, ETH Zurich, National University of Sciences and Technology (NUST),{{Cite web | url=http://rise.smme.nust.edu.pk/ |title = NUST-SMME RISE Research Center}} the University of California, Irvine and Polytechnic University of Catalonia.

==Companies==

At present, several companies are working on articulated body pose estimation.

  • Bodylabs: Bodylabs is a Manhattan-based software provider of human-aware artificial intelligence.


==References==

{{Reflist}}