A large number of digital humanities projects focuses on text. This medial limitation may be attributed to the abundance of well-established quantitative methods applicable to text. Cultural Studies, however, analyse cultural expressions in a broad sense, including different non-textual media, physical artefacts, and performative actions. It is, to a certain extent, possible to transcribe these multi-medial phenomena in textual form; however, this transcription is difficult to automate and some information may be lost. Thus, quantitative approaches which directly access media-specific information are a desideratum for Cultural Studies. Visual media constitute a significant part of cultural production. In our paper, we propose Deep Watching as a way to analyze visual media (films, photographs, and video clips) using cutting-edge machine learning and computer vision algorithms. Unlike previous approaches, which were based on generic information such as frame differences (Howanitz 2015), color distribution (Burghardt/Wolff 2016) or used manual annotation altogether (Dunst/Hartel 2016), Deep Watching allows to automatically identify visual information (symbols, objects, persons, body language, visual configuration of the scene) in large image and video corpora. To a certain extent, Tilton and Arnold’s Distant-Viewing Toolkit uses a comparable approach (Tilton/Arnold 2018). However, by means of our customized training of state-of-the-art convolutional neural networks for object detection and face recognition we can, in comparison to this toolkit, automatically extract more information about individual frames and their contexts.
Bernhard Bermeitinger, Sebastian Gassner, Siegfried Handschuh, Gernot Howanitz, Erik Radisch, Malte Rehbein
14 Oct 2019