Aishy Amer, Object and Event Extraction for Video Processing and Representation in On-Line Video Applications, Ph.D. thesis, INRS-Télécommunications, Université du Québec, December 2001.

As video becomes increasingly popular and widespread through, for instance, broadcast services, the Internet, and security-related applications, fast, automated, and effective techniques for representing video based on its content, such as objects and semantics, are an important topic of research. In a surveillance application, for instance, object extraction is necessary to detect and classify object behavior; with video databases, effective retrieval must be based on high-level features and semantics. Automated content representation would significantly facilitate video retrieval and surveillance by humans while reducing their costs.

Most video representation systems are based on low-level quantitative features or focus on narrow domains. Few representation schemes are based on semantics; most of these are context-dependent, focus on the constraints of a narrow application, and therefore lack generality and flexibility. Most systems also assume simple environments, for example, environments without object occlusion or noise.

The goal of this thesis is to provide a stable content-based video representation that is rich in generic semantic features and moving objects. Objects are represented using quantitative and qualitative low-level features; generic semantic features are represented using events and other high-level motion features. To achieve broad applicability, content is extracted independently of the type and context of the input video. The proposed system aims at three goals: flexible content representation; reliable, stable processing that does not depend on high precision; and low computational cost. The system targets video of real environments, such as those with object occlusions and artifacts.

To achieve these goals, three processing levels are proposed: video enhancement to estimate and reduce noise, video analysis to extract meaningful objects and their spatio-temporal features, and video interpretation to extract context-independent semantics such as events. The system is modular and layered from low level to middle level to high level, with levels exchanging information: results from a lower level are integrated to support higher levels, and higher levels support lower levels through memory-based feedback loops.

The reliability of the proposed system is demonstrated by extensive experimentation on various indoor and outdoor video shots. Reliability stems from noise adaptation and from the correction or compensation of estimation errors at one step by processing at subsequent steps, where higher-level information is available. The proposed system provides real-time response for applications at rates of up to 10 frames per second on a shared computing machine. This response is achieved by dividing each processing level into simple but effective tasks and by avoiding complex operations.
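
To make the layered, feedback-driven architecture concrete, the following Python sketch shows how such a three-level pipeline could be wired together. Only the level names (enhancement, analysis, interpretation) and the feedback idea come from the abstract; the class names, the dictionary-based frame representation, and the feedback payloads are illustrative assumptions, not the thesis implementation.

# Minimal sketch of a three-level video pipeline with memory-based feedback.
# Class names and placeholder logic are assumptions for illustration only.

class EnhancementLevel:
    """Low level: estimate and reduce noise, adapting via feedback."""
    def __init__(self):
        self.noise_estimate = 0.0  # running estimate, refined by feedback

    def process(self, frame):
        # Placeholder: a real enhancer would filter pixel data here.
        frame["noise"] = self.noise_estimate
        return frame

    def feedback(self, info):
        # Memory-based feedback loop: blend in a hint from higher levels.
        hint = info.get("noise_hint", 0.0)
        self.noise_estimate = 0.9 * self.noise_estimate + 0.1 * hint

class AnalysisLevel:
    """Middle level: extract objects and spatio-temporal features."""
    def process(self, frame):
        # Placeholder: segment moving objects from the enhanced frame.
        frame["objects"] = [{"id": 0, "motion": "unknown"}]
        return frame

class InterpretationLevel:
    """High level: derive context-independent semantics such as events."""
    def process(self, frame):
        # Placeholder: classify events from extracted objects.
        frame["events"] = ["object_present"] if frame["objects"] else []
        return frame

def run_pipeline(frames):
    enhance = EnhancementLevel()
    analyze = AnalysisLevel()
    interpret = InterpretationLevel()
    results = []
    for frame in frames:
        # Lower-level results flow upward to support higher levels.
        out = interpret.process(analyze.process(enhance.process(frame)))
        # Higher-level results flow back down through feedback.
        enhance.feedback({"noise_hint": 0.1 if out["events"] else 0.0})
        results.append(out)
    return results

if __name__ == "__main__":
    shots = [{"index": i} for i in range(3)]  # stand-in for decoded frames
    for result in run_pipeline(shots):
        print(result["index"], result["events"])

Each level here is a simple, self-contained task, mirroring the abstract's point that real-time response is achieved by dividing the processing levels into simple but effective steps rather than complex operations.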