Data-driven Hierarchical Structure Kernel for Multiscale Part Based Object Recognition
Botao Wang, Hongkai Xiong, Xiaoqian Jiang, and Yuan F. Zheng
Abstract: Detecting generic object categories in images and videos is a fundamental problem in computer vision. It is challenged by inter- and intra-class diversity, as well as by distortions caused by viewpoint, pose, and deformation. To cope with these variations, this paper constructs a structure kernel and proposes a multi-scale part-based model that incorporates the discriminative power of kernels. The structure kernel measures the resemblance of part-based objects in three respects: (1) a global similarity term, which measures the resemblance of the global visual appearance of the objects; (2) a part similarity term, which measures the resemblance of the visual appearance of distinctive parts; and (3) a spatial similarity term, which measures the resemblance of the spatial layout of parts. In essence, the structure kernel penalizes the deformation of parts in a multi-scale space with respect to horizontal displacement, vertical displacement, and scale difference. Part similarities are combined with different weights, which are optimized efficiently by a normalized stochastic gradient ascent algorithm to maximize intra-class similarities and minimize inter-class similarities. Moreover, the parameters of the structure kernel are learned during training from the distribution of the data, which makes the kernel more discriminative. Because parts are flexible in scale and displacement, the model is more robust to intra-class variation and to changes in pose and viewpoint. Theoretical analysis and experimental evaluations demonstrate that the proposed multi-scale part-based representation model with the structure kernel is accurate and robust, and that it outperforms state-of-the-art object classification approaches.
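To make the three similarity terms concrete, the following is a minimal sketch of how a global term, a part appearance term, and a spatial layout term could be combined into a single similarity score. It is an illustration only and not the formulation used in the paper: the Gaussian (RBF) form of each term, the additive combination, and the names `rbf`, `spatial_similarity`, `structure_similarity`, `gamma`, and `sigma` are all assumptions introduced here. Part positions are assumed to be (x, y, scale) triples, and parts are assumed to be already matched by index.

```python
import math


def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) similarity between two equal-length feature vectors.

    Assumption: the paper's appearance terms may use a different base kernel.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def spatial_similarity(p, q, sigma=(1.0, 1.0, 1.0)):
    """Penalize differences in horizontal position, vertical position, and scale.

    p and q are hypothetical (x, y, scale) triples; sigma sets how quickly
    the similarity decays along each of the three axes.
    """
    penalty = sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, sigma))
    return math.exp(-penalty)


def structure_similarity(obj_a, obj_b, part_weights):
    """Global term plus a weighted sum of per-part appearance * layout terms.

    Assumption: parts of the two objects correspond by index, and the
    weights are non-negative.
    """
    total = rbf(obj_a["global"], obj_b["global"])
    for w, part_a, part_b in zip(part_weights, obj_a["parts"], obj_b["parts"]):
        appearance = rbf(part_a["feat"], part_b["feat"])
        layout = spatial_similarity(part_a["pos"], part_b["pos"])
        total += w * appearance * layout
    return total
```

In this sketch, two objects with identical global features and identical parts score 1 plus the sum of the part weights, and the score drops as a part is displaced or rescaled. In the paper the part weights and kernel parameters are learned from data; here they are simply passed in by the caller.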
Citation: Botao Wang, Hongkai Xiong, Xiaoqian Jiang, and Yuan F. Zheng, "Data-driven Hierarchical Structure Kernel for Multiscale Part Based Object Recognition", IEEE Transactions on Image Processing (TIP), vol. 23, no. 4, pp. 1765-1778, April 2014.
Institute of Media, Information, and Network (MIN Lab)