Aishy Amer, Object and Event Extraction for Video Processing and Representation in On-Line Video Applications, Ph.D. thesis, INRS-Télécommunications, Université du Québec, December 2001.

As video becomes increasingly popular and widespread through, for instance, broadcast services, the Internet, and security-related applications, fast, automated, and effective techniques for representing video based on its content, such as objects and semantics, are an important topic of research. In a surveillance application, for instance, object extraction is necessary to detect and classify object behavior; with video databases, effective retrieval must be based on high-level features and semantics. Automated content representation would significantly facilitate video retrieval and surveillance by humans while reducing their costs.

Most video representation systems are based on low-level quantitative features or focus on narrow domains. Few representation schemes are based on semantics; most of these are context-dependent, focus on the constraints of a narrow application, and therefore lack generality and flexibility. Most systems also assume simple environments, for example, environments without object occlusion or noise.

The goal of this thesis is to provide a stable content-based video representation that is rich in generic semantic features and moving objects. Objects are represented using quantitative and qualitative low-level features; generic semantic features are represented using events and other high-level motion features. To achieve broad applicability, content is extracted independently of the type and context of the input video. The proposed system aims at three goals: flexible content representation; reliable, stable processing that does not depend on high precision; and low computational cost. The system targets video of real environments, such as those with object occlusions and artifacts.

To achieve these goals, three processing levels are proposed: video enhancement to estimate and reduce noise, video analysis to extract meaningful objects and their spatio-temporal features, and video interpretation to extract context-independent semantics such as events. The system is modular and layered from low level to middle level to high level, with levels exchanging information: results from a lower level are integrated to support higher levels, and higher levels support lower levels through memory-based feedback loops.

The reliability of the proposed system is demonstrated by extensive experimentation on various indoor and outdoor video shots. Reliability stems from noise adaptation and from the correction or compensation of estimation errors at one step by processing at subsequent steps, where higher-level information is available. The proposed system provides real-time response for applications at rates of up to 10 frames per second on a shared computing machine. This response is achieved by dividing each processing level into simple but effective tasks and by avoiding complex operations.
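
To make the layered, feedback-driven architecture concrete, the following Python sketch shows how such a three-level pipeline could be wired together. Only the level names (enhancement, analysis, interpretation) and the feedback idea come from the abstract; the class names, the dictionary-based frame representation, and the feedback payloads are illustrative assumptions, not the thesis implementation.

# Minimal sketch of a three-level video pipeline with memory-based feedback.
# Class names and placeholder logic are assumptions for illustration only.

class EnhancementLevel:
    """Low level: estimate and reduce noise, adapting via feedback."""
    def __init__(self):
        self.noise_estimate = 0.0  # running estimate, refined by feedback

    def process(self, frame):
        # Placeholder: a real enhancer would filter pixel data here.
        frame["noise"] = self.noise_estimate
        return frame

    def feedback(self, info):
        # Memory-based feedback loop: blend in a hint from higher levels.
        hint = info.get("noise_hint", 0.0)
        self.noise_estimate = 0.9 * self.noise_estimate + 0.1 * hint

class AnalysisLevel:
    """Middle level: extract objects and spatio-temporal features."""
    def process(self, frame):
        # Placeholder: segment moving objects from the enhanced frame.
        frame["objects"] = [{"id": 0, "motion": "unknown"}]
        return frame

class InterpretationLevel:
    """High level: derive context-independent semantics such as events."""
    def process(self, frame):
        # Placeholder: classify events from extracted objects.
        frame["events"] = ["object_present"] if frame["objects"] else []
        return frame

def run_pipeline(frames):
    enhance = EnhancementLevel()
    analyze = AnalysisLevel()
    interpret = InterpretationLevel()
    results = []
    for frame in frames:
        # Lower-level results flow upward to support higher levels.
        out = interpret.process(analyze.process(enhance.process(frame)))
        # Higher-level results flow back down through feedback.
        enhance.feedback({"noise_hint": 0.1 if out["events"] else 0.0})
        results.append(out)
    return results

if __name__ == "__main__":
    shots = [{"index": i} for i in range(3)]  # stand-in for decoded frames
    for result in run_pipeline(shots):
        print(result["index"], result["events"])

Each level here is a simple, self-contained task, mirroring the abstract's point that real-time response is achieved by dividing the processing levels into simple but effective steps rather than complex operations.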