Object co-segmentation

{{Short description|Type of image segmentation, jointly segmenting semantically similar objects in multiple images}}

File:Samples of object co-segmentation.jpg

In computer vision, object co-segmentation is a special case of image segmentation in which semantically similar objects are segmented jointly across multiple images or video frames.

Challenges

It is often challenging to extract segmentation masks of a target object from a noisy collection of images or video frames, because the task couples object discovery with segmentation. A noisy collection is one in which the target object appears only sporadically in a set of images, or disappears intermittently throughout the video of interest. Early methods typically rely on mid-level representations such as object proposals.

Dynamic Markov networks-based methods

File:The inference process of the two coupled dynamic Markov networks to obtain the joint video object discovery and segmentation.jpg

File:Framework of Joint Video Object Discovery and Segmentation.png

A joint object discovery and co-segmentation method based on coupled dynamic Markov networks was proposed in 2018; its authors report significant improvements in robustness against irrelevant or noisy video frames.

Unlike previous efforts, which conveniently assume the consistent presence of the target objects throughout the input video, this algorithm based on coupled dual dynamic Markov networks carries out the detection and segmentation tasks simultaneously, with two respective Markov networks jointly updated via belief propagation.

Specifically, the Markov network responsible for segmentation is initialized with superpixels and provides information for its Markov counterpart responsible for the object detection task. Conversely, the Markov network responsible for detection builds the object proposal graph with inputs including the spatio-temporal segmentation tubes.
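As a rough illustration of the message passing involved, the following minimal sketch runs sum-product belief propagation on a single chain-structured Markov network whose nodes are video frames with two states (object absent, object present). It is not the coupled two-network algorithm described above: the function name `chain_belief_propagation`, the potentials, and the numbers are assumptions chosen for illustration only.

```python
def chain_belief_propagation(unary, pairwise):
    """Sum-product belief propagation on a chain-structured Markov network.

    unary[t][s]    : local evidence for state s at frame t (hypothetical scores)
    pairwise[a][b] : compatibility of state a at frame t with state b at frame t + 1
    Returns the per-frame marginal distribution over states.
    """
    n, k = len(unary), len(unary[0])

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    fwd = [[1.0] * k for _ in range(n)]  # message reaching frame t from frame t - 1
    bwd = [[1.0] * k for _ in range(n)]  # message reaching frame t from frame t + 1

    for t in range(1, n):
        fwd[t] = normalize([
            sum(unary[t - 1][a] * fwd[t - 1][a] * pairwise[a][b] for a in range(k))
            for b in range(k)
        ])
    for t in range(n - 2, -1, -1):
        bwd[t] = normalize([
            sum(unary[t + 1][b] * bwd[t + 1][b] * pairwise[a][b] for b in range(k))
            for a in range(k)
        ])

    # The belief at each frame combines local evidence with both incoming messages.
    return [
        normalize([unary[t][s] * fwd[t][s] * bwd[t][s] for s in range(k)])
        for t in range(n)
    ]


# State 0 = object absent, state 1 = object present (illustrative values).
unary = [[0.1, 0.9], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
pairwise = [[0.8, 0.2], [0.2, 0.8]]  # neighboring frames tend to agree
marginals = chain_belief_propagation(unary, pairwise)
print([round(m[1], 3) for m in marginals])
```

In this toy input the third frame has weak local evidence against the object, but the messages from its neighbors raise its belief that the object is present above one half, which is the kind of temporal smoothing that message passing provides.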

Graph cut-based methods

Graph cut optimization is a popular tool in computer vision, especially in earlier image segmentation applications. As an extension of regular graph cuts, a multi-level hypergraph cut has been proposed to account for more complex high-order correspondences among video groups, beyond typical pairwise correlations.
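A regular two-label graph cut can be sketched on a toy one-dimensional "image" as follows. Each pixel is linked to a source (foreground) terminal and a sink (background) terminal with capacities derived from its foreground score, neighboring pixels are linked by a smoothness capacity, and the minimum cut is found with the Edmonds-Karp maximum-flow algorithm. The function name `min_cut_segment`, the scores in [0, 1], and the smoothness value are hypothetical choices for illustration and are not taken from any of the cited methods.

```python
from collections import deque


def min_cut_segment(scores, smoothness=0.3):
    """Label each pixel of a 1-D signal as foreground (True) or background (False)."""
    n = len(scores)
    source, sink = n, n + 1
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)  # reverse edge for the residual graph

    for i, score in enumerate(scores):
        add_edge(source, i, score)      # cut (paid) if pixel i ends up background
        add_edge(i, sink, 1.0 - score)  # cut (paid) if pixel i ends up foreground
        if i + 1 < n:                   # penalty for a label change between neighbors
            add_edge(i, i + 1, smoothness)
            add_edge(i + 1, i, smoothness)

    while True:
        # Breadth-first search for a shortest augmenting path (Edmonds-Karp).
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path left: the flow is maximal

        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

    # Pixels still reachable from the source lie on the foreground side of the cut.
    return [i in parent for i in range(n)]


print(min_cut_segment([0.9, 0.8, 0.4, 0.9, 0.1, 0.2, 0.1]))
# [True, True, True, True, False, False, False]
```

The third pixel has a score below one half, yet it is labeled foreground because cutting both of its smoothness links would cost more than its unary penalty.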

With such a hypergraph extension, multiple modalities of correspondence, including low-level appearance, saliency, coherent motion, and high-level features such as object regions, can be incorporated in the hyperedge computation. In addition, as a core advantage over co-occurrence-based approaches, a hypergraph implicitly retains more complex correspondences among its vertices, with the hyperedge weights computed by eigenvalue decomposition of Laplacian matrices.
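The role of the Laplacian spectrum can be illustrated with a generic spectral two-way cut of a small hypergraph. The sketch below builds the widely used normalized operator Θ = D_v^(-1/2) H W D_e^(-1) H^T D_v^(-1/2) (the Laplacian is I − Θ) and splits the vertices by the sign of the eigenvector belonging to the second-smallest Laplacian eigenvalue. To stay dependency-free it obtains that eigenvector by power iteration with deflation rather than a full eigenvalue decomposition. The function names, the toy hyperedges, and the weights are assumptions for illustration; this is not the multi-level formulation of the cited work.

```python
import math


def normalized_hypergraph_operator(n_vertices, hyperedges, weights):
    """Return Theta = Dv^-1/2 H W De^-1 H^T Dv^-1/2 and the vertex degrees.

    Assumes every vertex belongs to at least one hyperedge and that each
    hyperedge lists distinct vertices.
    """
    degree = [0.0] * n_vertices
    for edge, w in zip(hyperedges, weights):
        for v in edge:
            degree[v] += w
    theta = [[0.0] * n_vertices for _ in range(n_vertices)]
    for edge, w in zip(hyperedges, weights):
        share = w / len(edge)  # hyperedge weight divided by hyperedge size
        for u in edge:
            for v in edge:
                theta[u][v] += share / math.sqrt(degree[u] * degree[v])
    return theta, degree


def spectral_bipartition(n_vertices, hyperedges, weights, iters=200):
    """Split vertices by the sign of the second eigenvector of Theta."""
    theta, degree = normalized_hypergraph_operator(n_vertices, hyperedges, weights)
    # The leading eigenvector of Theta is sqrt(degree) with eigenvalue 1.
    norm = math.sqrt(sum(degree))
    top = [math.sqrt(d) / norm for d in degree]
    x = [float(i + 1) for i in range(n_vertices)]  # fixed start, for reproducibility
    for _ in range(iters):
        # Deflate: remove the component along the leading eigenvector.
        dot = sum(a * b for a, b in zip(x, top))
        x = [a - dot * b for a, b in zip(x, top)]
        # Power step followed by renormalization.
        x = [sum(row[j] * x[j] for j in range(n_vertices)) for row in theta]
        scale = math.sqrt(sum(a * a for a in x))
        x = [a / scale for a in x]
    if x[0] < 0:  # fix the arbitrary sign so that vertex 0 is labeled True
        x = [-a for a in x]
    return [a > 0 for a in x]


# Two tight groups of vertices joined by one weak hyperedge (toy data).
hyperedges = [(0, 1, 2), (0, 1), (3, 4, 5), (4, 5), (2, 3)]
weights = [1.0, 1.0, 1.0, 1.0, 0.1]
print(spectral_bipartition(6, hyperedges, weights))
# [True, True, True, False, False, False]
```

Because a hyperedge can join more than two vertices, a single weight here expresses a group-level correspondence that a pairwise graph would have to approximate with several separate edges.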

CNN/LSTM-based methods

File:Overview of the coarse-to-fine temporal action localization.png

File:Flowchart of the spatio-temporal action localization detector Segment-tube.png

In action localization applications, object co-segmentation is also implemented as the Segment-tube spatio-temporal detector. Inspired by spatio-temporal action localization efforts based on tubelets (sequences of bounding boxes), Wang et al. present a spatio-temporal action localization detector, Segment-tube, which consists of sequences of per-frame segmentation masks. The Segment-tube detector can temporally pinpoint the starting and ending frames of each action category in the presence of preceding or subsequent interference actions in untrimmed videos. At the same time, it produces per-frame segmentation masks instead of bounding boxes, offering spatial accuracy superior to that of tubelets. This is achieved by alternating iterative optimization between temporal action localization and spatial action segmentation.

The proposed Segment-tube detector is illustrated in the flowchart on the right. The sample input is an untrimmed pair figure skating video, in which only a portion of the frames belong to a relevant category (e.g., the DeathSpirals). Initialized with saliency-based image segmentation on individual frames, the method first performs a temporal action localization step with a cascaded 3D CNN and LSTM, and pinpoints the starting frame and the ending frame of a target action with a coarse-to-fine strategy. Subsequently, the Segment-tube detector refines the per-frame spatial segmentation with graph cut, focusing on the relevant frames identified by the temporal action localization step. The optimization alternates between temporal action localization and spatial action segmentation in an iterative manner. Upon practical convergence, the final spatio-temporal action localization results are obtained as a sequence of per-frame segmentation masks (bottom row in the flowchart) with precise starting and ending frames.
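The alternating structure described above can be sketched with deliberately simple stand-ins. In the toy below, each frame is a short list of per-pixel scores, the temporal step is replaced by a maximum-sum contiguous interval search (Kadane's algorithm) instead of a 3D CNN and LSTM, and the spatial step is replaced by thresholding at the mean score of the selected frames instead of graph cut. All function names, thresholds, and data are hypothetical; only the loop structure (initialize, localize in time, refine in space, repeat until nothing changes) reflects the description.

```python
def localize(frame_scores, tau=0.4):
    """Stand-in temporal step: inclusive (start, end) of the contiguous run
    of frames maximizing sum(score - tau), or None if no run is positive."""
    best, best_span = 0.0, None
    run, run_start = 0.0, 0
    for t, score in enumerate(frame_scores):
        if run <= 0.0:
            run, run_start = 0.0, t  # start a new candidate run at frame t
        run += score - tau
        if run > best:
            best, best_span = run, (run_start, t)
    return best_span


def segment(frames, span):
    """Stand-in spatial step: threshold the frames inside the span at their
    pooled mean score and clear the masks of all other frames."""
    masks = [[False] * len(f) for f in frames]
    if span is None:
        return masks
    start, end = span
    pixels = [p for f in frames[start:end + 1] for p in f]
    threshold = sum(pixels) / len(pixels)
    for t in range(start, end + 1):
        masks[t] = [p > threshold for p in frames[t]]
    return masks


def alternate(frames, max_iters=10):
    """Alternate the temporal and spatial steps until neither output changes."""
    masks = [[p > 0.5 for p in f] for f in frames]  # stand-in for saliency-based initialization
    span = None
    for _ in range(max_iters):
        # Score each frame by the masked response averaged over the whole frame.
        scores = [sum(p for p, m in zip(f, mk) if m) / len(f)
                  for f, mk in zip(frames, masks)]
        new_span = localize(scores)
        new_masks = segment(frames, new_span)
        if new_span == span and new_masks == masks:
            break  # practical convergence
        span, masks = new_span, new_masks
    return span, masks


frames = [
    [0.1, 0.2, 0.6, 0.1],  # interference: only a small spurious response
    [0.2, 0.8, 0.9, 0.7],
    [0.1, 0.9, 0.8, 0.9],
    [0.3, 0.7, 0.9, 0.8],
    [0.1, 0.1, 0.2, 0.6],  # interference
]
span, masks = alternate(frames)
print(span)   # (1, 3)
print(masks)
```

The output pairs a start and end frame with one mask per frame, mirroring the format of a sequence of per-frame segmentation masks with explicit temporal boundaries.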

References

{{reflist|refs=

{{cite conference | last1=Vicente | first1=Sara | last2=Rother | first2=Carsten | last3=Kolmogorov | first3=Vladimir | title=CVPR 2011 | chapter=Object cosegmentation | publisher=IEEE | year=2011 | pages=2217–2224 | isbn=978-1-4577-0394-2 | doi=10.1109/cvpr.2011.5995530 }}

{{cite conference | last1=Chen | first1=Ding-Jie | last2=Chen | first2=Hwann-Tzong | last3=Chang | first3=Long-Wen | title=Proceedings of the 20th ACM international conference on Multimedia - MM '12 | chapter=Video object cosegmentation | publisher=ACM Press | location=New York, New York, USA | year=2012 | page=805 | isbn=978-1-4503-1089-5 | doi=10.1145/2393347.2396317 }}

{{cite journal | last1=Liu | first1=Ziyi | last2=Wang | first2=Le | last3=Hua | first3=Gang | last4=Zhang | first4=Qilin | last5=Niu | first5=Zhenxing | last6=Wu | first6=Ying | last7=Zheng | first7=Nanning | title=Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks | journal=IEEE Transactions on Image Processing | volume=27 | issue=12 | year=2018 | issn=1057-7149 | doi=10.1109/tip.2018.2859622 | pages=5840–5853 | url=https://qilin-zhang.github.io/_pages/pdfs/Joint_Video_Object_Discovery_and_Segmentation_by_Coupled_Dynamic_Markov_Networks.pdf | pmid=30059300| bibcode=2018ITIP...27.5840L | s2cid=51867241 | doi-access=free }}

{{cite conference | last1=Lee | first1=Yong Jae | last2=Kim | first2=Jaechul | last3=Grauman | first3=Kristen | title=2011 International Conference on Computer Vision | chapter=Key-segments for video object segmentation | publisher=IEEE | year=2011 | pages=1995–2002 | isbn=978-1-4577-1102-2 | doi=10.1109/iccv.2011.6126471 | citeseerx=10.1.1.269.2727 }}

{{cite conference | last1=Ma | first1=Tianyang | last2=Latecki | first2=Longin Jan |title=Maximum weight cliques with mutex constraints for video object segmentation | website=IEEE CVPR 2012 | year=2012 | pages=670–677 | doi=10.1109/CVPR.2012.6247735 | isbn=978-1-4673-1228-8 }}

{{cite journal | last1=Wang | first1=Le | last2=Lv | first2=Xin | last3=Zhang | first3=Qilin | last4=Niu | first4=Zhenxing | last5=Zheng | first5=Nanning | last6=Hua | first6=Gang | title=Object Cosegmentation in Noisy Videos with Multilevel Hypergraph | journal=IEEE Transactions on Multimedia | publisher=IEEE | year=2020 | volume=23 | page=1 | issn=1520-9210 | doi=10.1109/tmm.2020.2995266 | s2cid=219410031 | url=https://qilin-zhang.github.io/_pages/pdfs/Object_Cosegmentation_in_Noisy_Videos.pdf}}

{{cite journal | last1=Wang | first1=Le | last2=Duan | first2=Xuhuan | last3=Zhang | first3=Qilin | last4=Niu | first4=Zhenxing | last5=Hua | first5=Gang | last6=Zheng | first6=Nanning | title=Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation | journal=Sensors | publisher=MDPI AG | volume=18 | issue=5 | date=2018-05-22 | issn=1424-8220 | doi=10.3390/s18051657 | page=1657 | url=https://qilin-zhang.github.io/_pages/pdfs/Segment-Tube_Spatio-Temporal_Action_Localization_in_Untrimmed_Videos_with_Per-Frame_Segmentation.pdf | pmid=29789447 | pmc=5982167| bibcode=2018Senso..18.1657W | doi-access=free }} Material was copied from this source, which is available under a [https://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International License].

}}

Category:Image segmentation

Category:Computer vision

Category:Applications of computer vision

Category:Image processing

Category:Machine vision

Category:Film and video technology

Category:Applied machine learning

Category:Cognition

Category:Motion in computer vision