Application of Semi-Automated Video Scene Coding for Reducing Manual ROI Marking in Eye-Tracking-Based ASD Studies

Friday, May 12, 2017: 10:00 AM-1:40 PM
Golden Gate Ballroom (Marriott Marquis Hotel)
A. Atyabi1, A. Naples2 and F. Shic3, (1)Seattle Children’s Research Institute, University of Washington, Seattle, WA, (2)Child Study Center, Yale University School of Medicine, New Haven, CT, (3)Center for Child Health, Behavior and Development, Seattle Children's Research Institute, Seattle, WA
Background:  Defining dynamic regions of interest (ROIs) is an essential component of eye-tracking-based studies of autism that use dynamic videos. In such videos, the content and the ROIs change moment by moment. To track high-level constructs such as moving faces or people, ROIs must be adjusted in a time-dependent manner. This labor-intensive and time-consuming task is usually done manually.

Objectives:  To develop a semi-automated approach capable of identifying and marking any type of ROI in any video sequence automatically, with minimal involvement from a human operator.

Methods: The proposed mechanism for ROI tracking utilizes Speeded-Up Robust Features (SURF), a scale- and rotation-invariant feature detector/descriptor method. The procedure has two steps: 1) the algorithm detects the ROIs' new locations/orientations in the frame and defines new ROIs (objects not captured by existing ROIs); 2) the operator reviews the results and re-evaluates weakly detected ROIs using new manual markups. Nine video files, each containing 9 to 16 ROIs and used in an activity monitoring eye-tracking paradigm for toddlers with ASD (Shic et al., 2013), were considered. The videos showed two people (P1 and P2), surrounded by multiple objects, conducting a controlled activity. The ROIs included 1) dynamic: head (H), body (B), and upper/lower-face (UF and LF) regions; and 2) static: images on the wall (I) and toys on the ground (T) in the scene.
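
The two-step procedure above can be illustrated with a minimal sketch. It does not run SURF itself: it assumes that matched keypoint pairs between the reference frame and the current frame are already available from some feature matcher. The function name `track_roi`, the `min_matches` threshold, and the translation-only update (no scale or rotation) are simplifying assumptions made for this illustration, not details taken from the study.

```python
from statistics import median


def track_roi(roi, matches, min_matches=4):
    """Shift an ROI box by the median displacement of its matched keypoints.

    roi:     (x, y, w, h) box in the reference frame.
    matches: list of ((x0, y0), (x1, y1)) keypoint pairs,
             reference frame -> current frame (assumed precomputed).
    Returns (new_roi, is_weak); is_weak marks the ROI for operator review.
    """
    x, y, w, h = roi
    # Keep only matches whose reference point lies inside the ROI.
    inside = [
        (p, q) for p, q in matches
        if x <= p[0] <= x + w and y <= p[1] <= y + h
    ]
    # Too few supporting features: keep the old box and flag it as weak
    # (hypothetical criterion; the threshold is an assumption).
    if len(inside) < min_matches:
        return roi, True
    # The median displacement is robust to a minority of bad matches.
    dx = median(q[0] - p[0] for p, q in inside)
    dy = median(q[1] - p[1] for p, q in inside)
    return (x + dx, y + dy, w, h), False


# Example: four consistent matches plus one outlier inside the ROI.
example_matches = [
    ((12, 12), (17, 10)),
    ((15, 20), (20, 18)),
    ((25, 25), (30, 23)),
    ((28, 11), (33, 9)),
    ((20, 20), (60, 60)),  # outlier match
]
new_roi, weak = track_roi((10, 10, 20, 20), example_matches)
```

ROIs returned with `is_weak` set would be the ones handed to the operator for a new manual markup in step 2.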


Results: The ROIs flagged as having potentially faulty detections, and the percentage of frames in which those detections occurred, are presented in the following table. These results represent only the first pass of the algorithm; additional runs with a corrected base ROI can help to re-detect or re-track the ROIs in question in the affected frames. Moreover, although the listed ROIs are automatically marked as less successfully detected, closer investigation indicated that a fair percentage of the potentially faulty detections are likely to be acceptable without any re-tracking.
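
As an illustration of how such per-ROI percentages could be computed, the sketch below counts the frames in which each ROI was flagged and divides by the total number of frames. The function name, the input layout, and the ROI labels in the example are hypothetical and contain no data from the study.

```python
from collections import Counter


def faulty_frame_percentages(flags, n_frames):
    """Percentage of frames in which each ROI was flagged as potentially faulty.

    flags:    iterable of (frame_index, roi_name) pairs, one per flagged
              detection; duplicate pairs are counted once.
    n_frames: total number of frames in the video.
    """
    counts = Counter(name for _, name in set(flags))
    return {name: 100.0 * c / n_frames for name, c in counts.items()}


# Made-up example: a 4-frame clip with two flagged ROIs.
example_flags = [(0, "P1-UF"), (1, "P1-UF"), (1, "P1-UF"), (3, "P2-LF")]
percentages = faulty_frame_percentages(example_flags, n_frames=4)
```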

It is noteworthy that the majority of inaccurately tracked/detected ROIs are located within facial regions. This is likely due to 1) the smaller size of these regions, which makes finding robust representative features more difficult, and 2) differences in viewing angle and head orientation between the first frame (the basis of the ROI detection) and later frames in the video. Data from thirty-six 4- to 8-year-olds, ASD n=14 (11 males, MDQ=88.36, SD=19.84) and non-ASD n=22 (13 males, MDQ=108.91, SD=12.38), show similarities between eye-tracking result variables computed with ROIs from this semi-automated process and those computed with standard ROI specification techniques.

Conclusions:  We have designed a computational method for ROI tracking that dramatically lowers the time burden of defining ROIs for complex, dynamic videos. Although this approach does not produce ROIs identical to those from manual coding, it nonetheless shows similar performance on certain outcome measures. Further work will aim to expand the scope and flexibility of our region tracker and to explore the use of this technique on fully unconstrained video inputs, such as those acquired from head-mounted eye trackers in natural interactions.