Papers
(*: Equal contribution)
2023
-
AnyDoor: Zero-shot Object-level Image Customization
arXiv
Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, Hengshuang Zhao
pdf/
page/
abstract
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving.
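As a concrete illustration of the identity-plus-detail conditioning described above, here is a minimal, hedged sketch (not the released AnyDoor code): a global identity embedding is paired with a high-frequency map of the object that preserves texture cues while leaving pose and lighting free to vary. The high-pass extractor and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def high_frequency_map(obj: torch.Tensor) -> torch.Tensor:
    """Cheap high-pass filter: image minus its blurred version (an assumption)."""
    blur = F.avg_pool2d(obj, kernel_size=5, stride=1, padding=2)
    return (obj - blur).abs()

def build_condition(identity_feat: torch.Tensor, obj_rgb: torch.Tensor):
    # identity_feat: (B, D) global embedding from some image encoder
    # obj_rgb:       (B, 3, H, W) cropped target object
    detail = high_frequency_map(obj_rgb)   # (B, 3, H, W) texture cues
    return identity_feat, detail           # both fed to the generator as conditions

ident = torch.randn(1, 1024)
obj = torch.rand(1, 3, 256, 256)
ident, detail = build_condition(ident, obj)
print(detail.shape)  # torch.Size([1, 3, 256, 256])
```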
-
Open-vocabulary Panoptic Segmentation with Embedding Modulation
ICCV 2023
Xi Chen, Shuang Li, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao
pdf/
page/
abstract
Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world. Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results, i.e., notable performance reduction on the closed vocabulary and massive demand for extra data. To this end, we propose OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation. Specifically, the exquisitely designed Embedding Modulation module, together with several meticulous components, enables adequate embedding enhancement and information exchange between the segmentation model and the visual-linguistic well-aligned CLIP encoder, resulting in superior segmentation performance under both open- and closed-vocabulary settings with much less need for additional data. Extensive experimental evaluations are conducted across multiple datasets (e.g., COCO, ADE20K, Cityscapes, and PascalContext) under various circumstances, where the proposed OPSNet achieves state-of-the-art results, demonstrating the effectiveness and generality of the proposed approach. The code and trained models will be made publicly available.
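A hedged sketch of the embedding-modulation idea as the abstract states it: per-proposal embeddings from the segmentation model are fused with embeddings from a frozen CLIP encoder, then matched against CLIP text embeddings of category names. The blending rule and temperature below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def modulate(mask_emb, clip_emb, alpha: float = 0.5):
    # mask_emb, clip_emb: (N, D) per-proposal embeddings from the two branches
    fused = alpha * mask_emb + (1.0 - alpha) * clip_emb  # assumed blending rule
    return F.normalize(fused, dim=-1)

def classify(fused, text_emb, tau: float = 0.07):
    # text_emb: (C, D) normalized CLIP text embeddings of category names
    return (fused @ text_emb.t() / tau).softmax(dim=-1)  # (N, C) class scores

mask_emb = F.normalize(torch.randn(10, 512), dim=-1)
clip_emb = F.normalize(torch.randn(10, 512), dim=-1)
text_emb = F.normalize(torch.randn(80, 512), dim=-1)
print(classify(modulate(mask_emb, clip_emb), text_emb).shape)  # (10, 80)
```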
-
Detecting Everything in the Open World: Towards Universal Object Detection
CVPR 2023
Zhenyu Wang, Yali Li, Xi Chen, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao, Shengjin Wang
pdf/
code/
abstract
In this paper, we formally address universal object detection, which aims to detect every scene and predict every category. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of traditional detectors. We propose UniDetector, a universal object detector that has the ability to recognize enormous categories in the open world. The critical points for the universality of UniDetector are: 1) it leverages images of multiple sources and heterogeneous label spaces for training through the alignment of image and text spaces, which guarantees sufficient information for universal representations; 2) it generalizes to the open world easily while keeping the balance between seen and unseen classes, thanks to abundant information from both vision and language modalities; 3) it further promotes the generalization ability to novel categories through our proposed decoupled training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. Our UniDetector exhibits strong zero-shot generalization ability on large-vocabulary datasets like LVIS, ImageNetBoxes, and VisualGenome: it surpasses the traditional supervised baselines by more than 4% on average without seeing any corresponding images. On 13 public detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.
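A small sketch of the probability-calibration idea from the abstract: down-weight categories the detector is biased toward (e.g., frequent base classes) by dividing each class score by a prior raised to a power. The exact calibration in the paper may differ; gamma and the prior source here are illustrative assumptions.

```python
import numpy as np

def calibrate(scores: np.ndarray, prior: np.ndarray, gamma: float = 0.6):
    # scores: (N, C) raw per-class probabilities for N detections
    # prior:  (C,) estimated category prior, e.g., class frequency over the
    #         detector's own predictions on unlabeled images (an assumption)
    calibrated = scores / np.power(prior + 1e-12, gamma)
    return calibrated / calibrated.sum(axis=1, keepdims=True)

scores = np.array([[0.7, 0.2, 0.1]])
prior = np.array([0.8, 0.15, 0.05])       # class 0 is over-represented
print(calibrate(scores, prior).round(3))  # probability mass shifts to rare classes
```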
-
ScribbleSeg: Scribble-based Interactive Image Segmentation
Manuscript 2023
Xi Chen, Yau Shing Jonathan Cheung, Ser-Nam Lim, Hengshuang Zhao
pdf/
abstract
Interactive segmentation enables users to extract masks by providing simple annotations, such as boxes, clicks, or scribbles, to indicate the target. Among these interaction formats, scribbles are the most flexible, as they can be of arbitrary shapes and sizes, allowing them to convey more information about the target object. However, previous works mainly focus on the click-based configuration, and the scribble-based setting is rarely explored. In this work, we attempt to formulate a standard protocol for scribble-based interactive segmentation: we design diversified strategies to simulate scribbles for training, propose a deterministic scribble generator for evaluation, and construct a challenging benchmark. Besides, we build a strong framework, ScribbleSeg, consisting of a Prototype Adaption Module (PAM) and a Corrective Refine Module (CRM), for the task. Extensive experiments show that ScribbleSeg performs notably better than previous click-based methods. We hope this could serve as a more powerful and general solution for interactive segmentation. Our code will be made available.
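A hedged sketch of one plausible scribble-simulation strategy of the kind the abstract mentions: sample a random curve through points inside the ground-truth mask and rasterize it as a positive scribble. The quadratic Bezier construction is an assumption; the paper describes several diversified strategies.

```python
import numpy as np

def simulate_scribble(mask: np.ndarray, n_steps: int = 200) -> np.ndarray:
    ys, xs = np.nonzero(mask)                   # pixels inside the target
    idx = np.random.choice(len(ys), 3)          # three random control points
    ctrl = np.stack([ys[idx], xs[idx]], axis=1).astype(float)
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    # quadratic Bezier curve through the control points (stays in their hull)
    pts = ((1 - t) ** 2) * ctrl[0] + 2 * (1 - t) * t * ctrl[1] + (t ** 2) * ctrl[2]
    scribble = np.zeros_like(mask)
    for y, x in pts.astype(int):
        if mask[y, x]:                          # keep the scribble on-target
            scribble[y, x] = 1
    return scribble

mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1
print(simulate_scribble(mask).sum())            # number of scribble pixels
```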
2022
-
FocalClick: Towards Practical Interactive Image Segmentation
CVPR 2022
Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, Hengshuang Zhao
pdf/
code/
abstract
Interactive segmentation allows users to extract target masks by making positive/negative clicks. Although explored by many previous works, there is still a gap between academic approaches and industrial needs: first, existing models are not efficient enough to work on low-power devices; second, they perform poorly when used to refine preexisting masks, as they cannot avoid destroying the correct parts. FocalClick solves both issues at once by predicting and updating the mask in localized areas. For higher efficiency, we decompose the slow prediction on the entire image into two fast inferences on small crops: a coarse segmentation on the Target Crop and a local refinement on the Focus Crop. To make the model work with preexisting masks, we formulate a sub-task termed Interactive Mask Correction and propose Progressive Merge as the solution. Progressive Merge exploits morphological information to decide where to preserve and where to update, enabling users to refine any preexisting mask effectively. FocalClick achieves competitive results against SOTA methods with significantly smaller FLOPs, and shows significant superiority when making corrections on preexisting masks. Code and data will be released at ClickSEG.
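A minimal sketch of the Progressive Merge idea as the abstract describes it: only update the mask near the user's latest interaction and preserve the preexisting mask elsewhere. The disk-shaped update region is an assumption standing in for the paper's morphology-based rule.

```python
import numpy as np

def progressive_merge(prev_mask, new_pred, click_yx, radius=40):
    # prev_mask, new_pred: (H, W) binary masks; click_yx: (y, x) latest click
    h, w = prev_mask.shape
    yy, xx = np.ogrid[:h, :w]
    update = (yy - click_yx[0]) ** 2 + (xx - click_yx[1]) ** 2 <= radius ** 2
    return np.where(update, new_pred, prev_mask)   # refine locally, keep the rest

prev = np.zeros((100, 100), np.uint8); prev[20:80, 20:80] = 1
pred = np.zeros_like(prev); pred[20:80, 20:60] = 1  # model shrinks the region
merged = progressive_merge(prev, pred, click_yx=(50, 70))
print(prev.sum(), merged.sum())                     # only the clicked area changed
```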
-
iNL: Implicit Non-local Network
Neurocomputing
Yifeng Han, Xi Chen, Songjie Zhang, Donglian Qi
pdf/
abstract
The attention mechanism in computer vision, represented by the non-local network (Wang et al., 2018), improves the performance of numerous vision tasks while bringing a computational burden at deployment. In this work, we explore reducing the inference computation of the non-local network by decoupling the training and inference procedures. Specifically, we propose the implicit non-local network (iNL). During training, iNL models the dependency between features across long-range affinities like the original non-local blocks; during inference, iNL can be reformulated as only two convolution layers while still rivaling the non-local network. In this way, the computational complexity and memory costs are reduced. In addition, we take a further step and extend iNL into a more generalized form, which covers attentions of different orders in computer vision tasks. iNL brings steady improvements on multiple benchmarks of different vision tasks, including classification, detection, and instance segmentation. In the meantime, it provides a brand-new perspective for understanding the attention mechanism in deep neural networks.
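The abstract's key claim is a train/inference decoupling: non-local-style attention during training, plain convolutions at inference. A faithful derivation needs the paper; the sketch below only illustrates the general reparameterization pattern (switching to a cheap, pre-merged path in eval mode). All module names and the merge itself are assumptions.

```python
import torch
import torch.nn as nn

class DecoupledBlock(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.theta = nn.Conv2d(c, c, 1)   # train-time attention projections
        self.phi = nn.Conv2d(c, c, 1)
        self.merged = nn.Conv2d(c, c, 1)  # inference-time cheap path (assumed)

    def forward(self, x):
        if self.training:
            b, c, h, w = x.shape
            q = self.theta(x).flatten(2)                  # (B, C, HW)
            k = self.phi(x).flatten(2)
            attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)
            out = (x.flatten(2) @ attn).view(b, c, h, w)  # non-local aggregation
            return x + out
        return x + self.merged(x)                         # collapsed at inference

block = DecoupledBlock(16).eval()
print(block(torch.randn(1, 16, 8, 8)).shape)  # torch.Size([1, 16, 8, 8])
```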
2021
-
Conditional Diffusion for Interactive Segmentation
ICCV 2021
Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, Manni Duan
pdf/
code/
abstract
In click-based interactive segmentation, the mask extraction process is dictated by positive/negative user clicks; however, most existing methods do not fully exploit the user cues, requiring excessive numbers of clicks for satisfactory results. We propose the Conditional Diffusion Network (CDNet), which propagates labeled representations from clicks to conditioned destinations with two levels of affinities: the Feature Diffusion Module (FDM) spreads features from clicks to potential target regions with global similarity, and the Pixel Diffusion Module (PDM) diffuses the predicted logits of clicks within locally connected regions. Thus, the information inferred from user clicks can be generalized to proper destinations. In addition, we put forward Diversified Training (DT), which reduces the optimization ambiguity caused by click simulation. With FDM, PDM, and DT, CDNet can better understand users' intentions and make better predictions with limited interactions. CDNet achieves state-of-the-art performance on several benchmarks.
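A hedged sketch of the feature-diffusion idea from the abstract: propagate the feature at a click to similar-looking positions using a global affinity (here, cosine similarity plus a softmax). Details such as normalization, temperature, and how multiple clicks combine are assumptions.

```python
import torch
import torch.nn.functional as F

def diffuse_from_click(feat: torch.Tensor, click_yx):
    # feat: (C, H, W) feature map; click_yx: (y, x) location of a user click
    c, h, w = feat.shape
    flat = F.normalize(feat.flatten(1), dim=0)        # (C, HW), unit features
    click_f = flat[:, click_yx[0] * w + click_yx[1]]  # (C,) feature at the click
    affinity = torch.softmax(flat.t() @ click_f / 0.07, dim=0)  # (HW,)
    return affinity.view(h, w)      # soft map of likely target regions

feat = torch.randn(64, 32, 32)
heat = diffuse_from_click(feat, (10, 12))
print(heat.shape, float(heat.sum()))  # torch.Size([32, 32]), sums to 1.0
```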
2020
-
State-Aware Tracker for Real-Time Video Object Segmentation
CVPR 2020
Xi Chen, Zuoxin Li, Ye Yuan, Gang Yu, Jianxin Shen, Donglian Qi
pdf/
code/
abstract
In this work, we address the task of semi-supervised video object segmentation (VOS) and explore how to make efficient use of video properties to tackle the challenge of semi-supervision. We propose a novel pipeline called State-Aware Tracker (SAT), which can produce accurate segmentation results at real-time speed. For higher efficiency, SAT takes advantage of inter-frame consistency and deals with each target object as a tracklet. For more stable and robust performance over video sequences, SAT gains awareness of each state and makes self-adaptation via two feedback loops. One loop assists SAT in generating more stable tracklets. The other loop helps to construct a more robust and holistic target representation. SAT achieves a promising result of 72.3% J&F mean at 39 FPS on the DAVIS2017-Val dataset, which shows a decent trade-off between efficiency and accuracy.
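A small sketch of the "state awareness" idea in the abstract: score how reliable the current segmentation is and adapt the next-frame behavior accordingly. The confidence measure, threshold, and fallback rule are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def state_score(prob_mask: np.ndarray) -> float:
    # Mean foreground probability inside the predicted region:
    # low values suggest an unreliable, fragmented mask.
    fg = prob_mask > 0.5
    return float(prob_mask[fg].mean()) if fg.any() else 0.0

def next_frame_box(prob_mask, tracker_box, thresh=0.8):
    if state_score(prob_mask) >= thresh:
        ys, xs = np.nonzero(prob_mask > 0.5)   # trust the mask: box from mask
        return (ys.min(), xs.min(), ys.max(), xs.max())
    return tracker_box                         # unreliable state: fall back

mask = np.zeros((64, 64)); mask[20:40, 20:40] = 0.95
print(next_frame_box(mask, tracker_box=(0, 0, 63, 63)))  # (20, 20, 39, 39)
```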
-
A Unified Algorithm for Object Tracking and Segmentation and its Application on Intelligent Video Surveillance for Transformer Substation
Proceedings of the CSEE
Xi Chen, Yifeng Han, Yunfeng Yan, Donglian Qi, Jianxin Shen
pdf/
abstract
Considering the diversified requirements of intelligent video surveillance for transformer substations, a unified algorithm for object tracking and segmentation was proposed, which is able to track and segment humans, vehicles, and many other foreign objects at real-time speed. Based on SiamRPN, an efficient segmentation branch was designed to produce a high-quality mask for the target object. In addition, to enhance tracking accuracy with the segmentation result, a mask quality scoring method was proposed. The proposed method also adopts a template updating strategy, which makes it more robust on long sequences. This algorithm not only reaches high accuracy and robustness in the practical task of video surveillance for transformer substations, but also performs well on the VOT2018 benchmark.
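A hedged sketch of how the mask quality score and template update could interact, as the abstract suggests: only replace the tracking template when the current frame's segmentation scores high enough, so bad frames do not corrupt it. The threshold and score source are assumptions for illustration.

```python
import numpy as np

def maybe_update_template(template, frame_crop, mask_quality, thresh=0.9):
    # template, frame_crop: (H, W, 3) patches; mask_quality: score in [0, 1]
    if mask_quality >= thresh:
        return frame_crop      # confident frame becomes the new template
    return template            # otherwise keep the old, trusted template

tmpl = np.zeros((127, 127, 3), np.uint8)
crop = np.ones((127, 127, 3), np.uint8)
print(maybe_update_template(tmpl, crop, mask_quality=0.95).mean())  # 1.0
```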
2019
-
Real-time Human Segmentation using Pose Skeleton Map
CCC 2019
Xi Chen, Ziqiang Zhou, Ying Ying, Donglian Qi
pdf/
abstract
In this paper, an algorithm for real-time human segmentation is proposed. The algorithm uses the connection relations of human joints provided by pose estimation as prior knowledge, which brings a striking improvement in the accuracy of human segmentation. High-quality segmentation results can be produced at real-time speed across a large range of human poses in complicated scenes.
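A minimal sketch of how a pose-skeleton prior could be fed to a segmentation network, in the spirit of the abstract: rasterize joint connections into an extra input channel alongside the RGB image. The channel layout and stem are assumptions.

```python
import torch
import torch.nn as nn

rgb = torch.rand(1, 3, 256, 256)        # input image
skeleton = torch.zeros(1, 1, 256, 256)  # rendered joint-connection map
skeleton[:, :, 100:150, 128] = 1.0      # e.g., one rasterized limb segment

x = torch.cat([rgb, skeleton], dim=1)   # 4-channel network input
stem = nn.Conv2d(4, 64, kernel_size=3, padding=1)
print(stem(x).shape)                    # torch.Size([1, 64, 256, 256])
```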
-
Boundary-Aware Network for Fast and High-Accuracy Portrait Segmentation
Manuscript 2019
Xi Chen, Donglian Qi, Jianxin Shen
pdf/
abstract
Compared with other semantic segmentation tasks, portrait segmentation requires both higher precision and faster inference speed. However, this problem has not been well studied in previous works. In this paper, we propose a lightweight network architecture, called Boundary-Aware Network (BANet), which selectively extracts detail information in the boundary area to produce high-quality segmentation output at real-time (>25 FPS) speed. In addition, we design a new loss function, called refine loss, which supervises the network with image-level gradient information. Our model is able to produce finer segmentation results that have richer details than the annotations.
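A hedged sketch of the refine-loss idea described above: supervise the predicted mask with image-level gradient information so its edges align with real image boundaries. The exact formulation is an assumption (here, an L1 distance between Sobel gradient magnitudes of the mask and the image).

```python
import torch
import torch.nn.functional as F

def grad_mag(x: torch.Tensor) -> torch.Tensor:
    # Sobel gradient magnitude of a single-channel image batch
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def refine_loss(pred_mask, gray_image):
    # pred_mask, gray_image: (B, 1, H, W), both in [0, 1]
    return F.l1_loss(grad_mag(pred_mask), grad_mag(gray_image))

pred = torch.rand(1, 1, 64, 64, requires_grad=True)
img = torch.rand(1, 1, 64, 64)
print(float(refine_loss(pred, img)))
```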