Unpacking Video To Text for Analysis

Video recording is wonderful for capturing transient information that no one can fully appreciate all at once. But an analytical approach to appreciating and understanding the contents of video material requires segregating incidents and labeling components for subsequent reference. Until now, unpacking the contents of the Infant Peggy Study videos has been too difficult for me. With the current wave of language-model-based analysis and interpretation of digitized speech, the “impossibly difficult tasks,” though still “quite demanding,” may be possible.

The goal of our IPS25 initiative is to render accessible for analysis the contents of approximately 1000 video clips of one child’s development, primarily within family settings, captured weekly over the six years from her 18th week. (The first three years of recording were regular; the latter three were less so. Those video clips are accessible here, now, as segregated “video panes” collected within “panels of panes,” grouped by their capture in the same day’s videotaping session.) The focus of the study was on the child’s developing knowledge of the physical world and on her communication, as witnessed in both language and action.

IPS25 adds two representations to the original video records: first, transcription of speech; second, description of action. Can AI do that? Surely not for a pre-verbal infant! Exactly so, which is why the task is now “quite demanding, but may be possible.” Let me sketch our current process, then conclude with a few caveats.

After reviewing current commercial video transcription services, we’ve settled on the HappyScribe.com facilities as most apt for our materials. Following a period of testing, in March we began operating under a yearly subscription. Here’s the process for each clip:
1. We upload a copy of the clip to the HappyScribe account.
2. HappyScribe generates an automatic transcription upon acquisition of the file.
a. This includes discrimination between two “speakers” in the generated text.
b. The text is separated into paragraphs by speaker turns, with video time codes available.
3. We export and save a .docx copy of the initial automatic transcription with time codes (the Auto copy).
4. We fix the Auto transcription at HappyScribe, correcting and expanding the speaker list and correcting the assignment of generated text to speakers.
5. We also fix the verbal misinterpretations made by the transcriber programs and augment the automatic transcription by adding our own transcription of infant verbalizations. There are cases where no one can tell what an utterance was.
6. While fixing the automatic transcription, we add action descriptions based on observation of the video.
7. After exporting this edited transcription (the Fixed copy), we submit it for generation of subtitles.
8. HappyScribe returns the text as subtitle lines in an audio-video editor, formatted according to choices we’ve made:
a. speaker identifications are stripped out of the text lines, and
b. correspondences between the text and the audio waveform appear beneath the video window.
9. We restore speaker identification to the text lines, which is essential for clarity and for presenting the subtitles as an on-screen script.
10. We edit the subtitles to purge any errors not previously found.
11. We re-edit the correspondences between the text display and the audio-video waveform.
12. We export the transcript/subtitles in two .docx formats, one with time codes and one without.
13. We export a re-rendered copy of the video with the subtitle text superposed.
14. We exit HappyScribe processing for this video clip.
15. On the local computer, we edit the exported transcript into text segments we judge to be micro-episodes within the action of the video, labeling them in alphabetic sequence.
16. We copy the time code data from the other exported transcript format and attach it to the alphabetically labeled leading line of the appropriate episode (a sketch of this segmentation follows the list).
17. We access our website at the hosting facility and upload the re-rendered video into the LC3 IPS collection, with a modified title reflecting its subtitled content, so that it is added to the collection and does not replace the original video file.
18. We access the website in our browser in administrative mode and add a web page in tabular format; into its previously prepared WordPress code we insert the text of the transcript episodes, adding the time codes as references to points in the video stream.
19. We fill in indicative information (agents in the action, person on camera, others offstage) and specify the link to the subtitled video file at the host location. This page is the ClipNotes page.
20. We link the ClipNotes page to the Panel for that videotape session. From that Panel’s pane for the original of the transcribed video, we link to its ClipNotes page. Finally, we capture the ClipNotes page’s address and edit it into the local Index of links for the entire collection.
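The bookkeeping in steps 15, 16, and 18 lends itself to simple scripting. The sketch below is not the project’s actual tooling; it is a minimal illustration under stated assumptions: a hypothetical plain-text export in which each speaker turn begins with a time code like [00:01:23], and episode boundaries the analyst has marked with a line containing only “---”. It labels the micro-episodes A, B, C, ..., attaches each episode’s leading time code, and emits rows of the kind a tabular ClipNotes page might hold.

```python
"""
Sketch (not the project's actual tooling) of steps 15-16 and 18: split an
exported, time-coded transcript into micro-episodes, label them A, B, C, ...,
attach the leading time code of each episode, and emit ClipNotes-style rows.

Assumptions (hypothetical, not taken from the HappyScribe export spec):
  - the transcript is a plain-text file, one speaker turn per line,
  - each turn begins with a time code of the form [HH:MM:SS],
  - the analyst has marked micro-episode boundaries with a line of "---".
"""

import re
import string
from pathlib import Path

TIMECODE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s*(.*)$")


def split_into_episodes(transcript_path: str):
    """Return a list of (label, leading_timecode, lines_of_text) tuples."""
    episodes, current = [], []
    for raw in Path(transcript_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line == "---":          # analyst-inserted episode boundary
            if current:
                episodes.append(current)
            current = []
        elif line:
            current.append(line)
    if current:
        episodes.append(current)

    labeled = []
    for label, lines in zip(string.ascii_uppercase, episodes):
        match = TIMECODE.match(lines[0])
        timecode = match.group(1) if match else "??:??:??"
        labeled.append((label, timecode, lines))
    return labeled


def as_clipnotes_rows(labeled_episodes):
    """Emit simple HTML table rows of the kind a ClipNotes page might hold."""
    rows = []
    for label, timecode, lines in labeled_episodes:
        # strip the per-line time codes; keep only the episode's leading one
        text = " ".join(TIMECODE.sub(r"\2", ln) for ln in lines)
        rows.append(f"<tr><td>{label} ({timecode})</td><td>{text}</td></tr>")
    return "\n".join(rows)


if __name__ == "__main__":
    episodes = split_into_episodes("fixed_transcript.txt")  # hypothetical file name
    print(as_clipnotes_rows(episodes))
```

In practice these steps are carried out by hand in a word processor and in the WordPress editor; the sketch is only meant to show that the alphabetic labels and leading time codes carry all the information the ClipNotes table needs.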

Generally, editing and processing average about an hour per minute of video. The two primary reasons are the difficulty of interpreting the utterances of the infant Peggy at her age and our addition of action/behavior descriptions to the transcriptions. What, then, are the values of the HappyScribe facility for my project? There are a number. The automatic transcriptions of adult and sibling actors are quite good. This permits quick scoping of the contents into useful text-based structures. The two-phase editing is important. Processing in the transcription editor permits construction of a sound scheme for the finished product. Working through secondary edits in the audio-video editor gives appropriate control for the best interpretation of the micro-context. To me, the results of this process may not compel conviction of correctness, but they ARE the best I can do given the tools I have to work with.

There is another outcome that I prize. The outcomes, the process, even specific interpretations of utterances and actions are subject to direct criticism, refutation, and the advancement of alternatives. My intention is to open the web posts to comments at an as-yet undetermined time.

Is the effort worthwhile? Obviously I believe it is. You can judge for yourself. Look at the original video clips of the two threads “withScurry” and “withRobby.” Compare what you understand of their content with what you understand after viewing the subtitled clips and ClipNotes. Your decision. Let me know what you think. Thanks.

Analyst

Caveats:
In Shakespeare’s Tempest, when Miranda saw the first young man of her experience, she said:
Miranda: “O brave new world, that has such people in ’t!”
Her Dad: “ ’Tis new to thee.”
I’ve been engaged in “Experimental Epistemology” (AI’s progenitor) for fifty years.

Today’s Large Language Models, fed by the contents of the World Wide Internet and operating at multi-megaflop speeds, produce amazing outputs.
A first caveat: the transcripts of HappyScribe are made-up stories, based on how the programs map input streams of audio into comprehension structures, guided by statistical frequencies of words and phrases in succession and by proprietary structures of interpretation. Fine product. Useful output.
I understand the game, in general. It’s what I also do as an analyst, though details of my commitments and procedures differ.
A second caveat: the criteria employed in such programs are designed to work for most cases. HappyScribe recognizes that machine-produced automatic transcripts are often accurate but not entirely so; the firm offers backup by human transcription where accuracy is essential. As Analyst, I am providing that human backup for transcript productions made truly difficult by these factors: varied mastery of language production among speakers; the determination that action descriptions must be included to make sense of what transpires; and the recognition that extra-corpus information and knowledge about the specific participants are often required for interpretation.
A third caveat for this particular case study: the specific agenda of this effort is to make sense of how particular experiences bring about changes in the knowledge and capabilities of an individual. That is, we are not applying rules we know; we are searching for information to help us find what is yet unknown.