Signature-based videos’ visual similarity detection and measurement

Bekhet, Saddam (2016) Signature-based videos’ visual similarity detection and measurement. PhD thesis, University of Lincoln.

23701 PhD Final Version.pdf - Whole Document (PDF)
Restricted to Repository staff only
Available under License Creative Commons Attribution.

Item Type: Thesis (PhD)
Item Status: Live Archive


The quantity of digital videos is huge, due to technological advances in video capture,
storage and compression. However, the usefulness of these enormous volumes is limited
by the effectiveness of content-based video retrieval (CBVR) systems, which still
require time-consuming annotating/tagging to feed text-based search. Visual similarity
is the core of these CBVR systems, where videos are matched based on their respective
visual features and the evolution of those features across video frames. It also acts as
an essential foundational layer for inferring semantic similarity at a later stage, in
collaboration with metadata. Furthermore, handling such amounts of video data, especially
in the compressed domain, poses certain challenges for CBVR systems: speed, scalability
and genericness. The situation is even more challenging given the availability of
non-pixelated features produced by compression, e.g. DC/AC coefficients and motion
vectors, which require sophisticated processing. Thus, careful feature selection is
important to realize visual-similarity-based matching within the boundaries of the
aforementioned challenges. Matching speed is crucial, because most current research is
biased towards accuracy and leaves speed lagging behind, which in many cases limits
practical use. Scalability is the key to benefiting from these enormous available video
volumes. Genericness is essential for developing systems that are applicable to both
compressed and uncompressed videos.
This thesis presents a signature-based framework for efficient visual-similarity-based
video matching. The proposed framework represents a vital component for search and
retrieval systems, where it could be used in three different ways: (1) Directly, for
CBVR systems where a user submits a query video and the system retrieves a ranked list
of visually similar ones. (2) For text-based video retrieval systems, e.g. YouTube,
where a user submits a textual description and the system retrieves a ranked list of
relevant videos. Retrieval in this case works by finding videos that were manually
assigned similar textual descriptions (annotations). For this scenario, the framework
could be used to enhance the annotation process by suggesting an annotation set for
newly uploaded videos, derived from other visually similar videos retrieved by the
proposed framework. In this way, the framework could make annotations more relevant to
the video contents (compared to purely manual annotation), which improves overall CBVR
performance as well. (3) The top-N matched list obtained by the framework could be used
as input to higher layers, e.g. semantic analysis, where it is easier to perform
complex processing on this limited set of videos.
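The first usage scenario above amounts to ranking stored video signatures against a query signature. A minimal sketch of that step is below; the function name, the use of cosine similarity, and the dictionary database are illustrative assumptions, not the thesis's actual matching function.

```python
import numpy as np

def rank_by_similarity(query_sig, database_sigs, top_n=10):
    """Return the top_n video ids ranked by similarity to the query.

    database_sigs maps video id -> fixed-length signature vector.
    Cosine similarity is an illustrative choice; the thesis's actual
    matching measure may differ.
    """
    q = query_sig / (np.linalg.norm(query_sig) or 1.0)
    scores = []
    for vid_id, sig in database_sigs.items():
        s = sig / (np.linalg.norm(sig) or 1.0)
        scores.append((vid_id, float(q @ s)))
    # Highest similarity first, truncated to the requested list size.
    scores.sort(key=lambda kv: kv[1], reverse=True)
    return scores[:top_n]
```

Because every signature has the same fixed length, this scan is a sequence of identical dot products, which is also what makes the signatures amenable to standard vector indexing.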
The proposed framework addresses the aforementioned challenges, i.e. speed, scalability
and genericness, by encoding a given video shot into a single compact fixed-length
signature. This signature robustly encodes the shot contents for later speedy matching
and retrieval tasks. This is in contrast with the current research trend of using
exhaustive, complex features/descriptors, e.g. dense trajectories. Moreover, to further
increase matching speed, the framework operates over a sequence of tiny images
(DC-images) rather than full-size frames. This limits the need to fully decompress
compressed videos, as the DC-images are extracted directly from the compressed stream.
The DC-image is highly useful for complex processing, due to its small size compared to
the full-size frame. In addition, it can be generated from uncompressed videos as well,
with the proposed framework still applicable in the same manner (the genericness
aspect). Furthermore, to robustly capture visual similarity, scene and motion
information are extracted independently, to better address their different
characteristics. Scene information is captured using a statistical representation of
the scene's key-colour profiles, while motion information is captured using a
graph-based structure. Both scene and motion information are then fused to generate an
overall video signature. The signature's compact fixed-length nature contributes to
scalability, because compact fixed-length signatures are highly indexable entities,
which facilitates retrieval over large-scale video data.
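The pipeline above — separate scene and motion descriptors over DC-images, fused into one fixed-length vector — can be sketched as follows. This is a simplified stand-in under stated assumptions: the colour histogram approximates the thesis's statistical key-colour profiles, and a frame-difference histogram stands in for its graph-based motion structure.

```python
import numpy as np

def scene_descriptor(dc_images, bins=8):
    """Per-channel colour histogram over all DC-images (a hypothetical
    stand-in for the thesis's key-colour profile statistics)."""
    # dc_images: list of small H x W x 3 uint8 arrays.
    hist = np.zeros(bins * 3)
    for img in dc_images:
        for c in range(3):
            h, _ = np.histogram(img[..., c], bins=bins, range=(0, 256))
            hist[c * bins:(c + 1) * bins] += h
    total = hist.sum()
    return hist / total if total else hist

def motion_descriptor(dc_images, bins=8):
    """Coarse motion proxy: histogram of frame-to-frame intensity
    differences (a simplification of the graph-based structure)."""
    hist = np.zeros(bins)
    for prev, cur in zip(dc_images, dc_images[1:]):
        diff = np.abs(cur.astype(int) - prev.astype(int)).mean(axis=2)
        h, _ = np.histogram(diff, bins=bins, range=(0, 256))
        hist += h
    total = hist.sum()
    return hist / total if total else hist

def shot_signature(dc_images):
    """Fuse scene and motion parts into one fixed-length vector.

    The length is independent of the shot's frame count, which is what
    makes the signature compact and indexable.
    """
    return np.concatenate([scene_descriptor(dc_images),
                           motion_descriptor(dc_images)])
```

Note how a 5-frame shot and a 20-frame shot both yield the same signature length; only the histogram contents differ.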
The proposed framework is adaptive and provides two different fixed-length video
signatures. Both work in a speedy and accurate manner, but with different degrees of
matching speed and retrieval accuracy. Such granularity of the signatures is useful to
accommodate different applications' trade-offs between speed and accuracy. The proposed
framework was extensively evaluated using black-box tests for the overall fused
signatures and white-box tests for its individual components. The evaluation was
performed on multiple challenging large-scale datasets against a diverse set of
state-of-the-art baselines. The quantitative results demonstrated the promise of the
proposed framework for supporting real-time applications.
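One common way to exploit two signature granularities is a cascade: the cheaper coarse signature shortlists candidates, and the more accurate fine signature re-ranks the survivors. This is an illustrative usage pattern, not necessarily the thesis's own evaluation protocol; the function names and the cosine measure are assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with a guard against zero vectors."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def two_stage_match(query, db, shortlist=100, top_n=10):
    """query is a (coarse_sig, fine_sig) pair; db maps video id to a
    (coarse_sig, fine_sig) pair. Coarse scores prune cheaply, fine
    scores re-rank the shortlist (hypothetical cascade)."""
    q_coarse, q_fine = query
    by_coarse = sorted(db, key=lambda v: cosine(q_coarse, db[v][0]),
                       reverse=True)
    survivors = by_coarse[:shortlist]
    by_fine = sorted(survivors, key=lambda v: cosine(q_fine, db[v][1]),
                     reverse=True)
    return by_fine[:top_n]
```

The `shortlist` size is the speed/accuracy dial: a larger shortlist costs more fine-signature comparisons but risks pruning fewer true matches.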

Keywords: Computer science, Video
Subjects: G Mathematical and Computer Sciences > G400 Computer Science
Divisions: College of Science > School of Computer Science
ID Code: 23701
Deposited On: 04 Aug 2016 12:36
