I am thrilled to be a part of the awesome Red Hen Lab community! Thank you for selecting me and giving me a chance to contribute to the Red Hen codebase.

This post describes my journey after being selected as a Google Summer of Code (GSoC) student associated with Red Hen Lab and FrameNet Brasil. I plan to summarize my progress at the end of every week until the end of the summer.

Here’s the abstract of my project:

The project aims to develop a prototype capable of meaning construction using multi-modal channels of communication. Specifically, for a co-speech gesture dataset, we use manually obtained annotations coupled with metadata obtained through algorithms to devise a mechanism that disambiguates meaning by considering the influence of all the different modalities involved in a particular frame. Since only a handful of annotated datasets have been made available by Red Hen, we leverage semi-supervised learning techniques to annotate additional unlabeled data. Furthermore, since each frame could have multiple possible interpretations, we use human annotators to annotate a subset of our validation set and report our performance on that set.

My mentors are: Suzie Xi{:target="_blank"}, Mark Turner{:target="_blank"}, Javier Valenzuela{:target="_blank"}, Anna Wilson{:target="_blank"}, Maria Hedblom{:target="_blank"}, Francis Steen{:target="_blank"}, Tiago Torrent{:target="_blank"} (primary), Frederico Belcavello{:target="_blank"}, Inés Olza{:target="_blank"}.

Stay tuned!

---
title: "Week 10 (July 20 - July 26)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-10/
excerpt: ""
modified:
last_modified_at: 2021-07-12
---

We decide to use “from-to”, with a proximity window of 4 word tokens between “from” and “to”, as the initial lexical trigger template to map to the construal dimension Prominence. We also identify “first-second”, “firstly-secondly”, and “here-then”, but could not find many relevant hand gestures for these in the PATS dataset.
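
As a rough illustration, the trigger matching can be done with a simple token-window check. This is a sketch under assumed whitespace tokenization; the function name and example sentence are illustrative, not the project's exact code:

```python
# Minimal sketch: detect the "from-to" lexical trigger with at most
# `max_gap` word tokens between "from" and "to" in a transcript.
def has_from_to_trigger(transcript: str, max_gap: int = 4) -> bool:
    tokens = transcript.lower().split()
    for i, token in enumerate(tokens):
        if token == "from":
            # "to" must occur within the next max_gap + 1 tokens,
            # i.e. with at most max_gap tokens in between
            if "to" in tokens[i + 1 : i + max_gap + 2]:
                return True
    return False

# Matches the Jimmy Fallon transcript below ("travel from the u.s. to Mexico")
print(has_from_to_trigger("players can either travel from the u.s. to Mexico by plane"))  # True
```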

Consider a sample video of a talk show host, Jimmy Fallon, taken from the PATS dataset with a pre-defined start and end time:

{% include video id="JY-Nhs__4xk?start=145&end=160" provider="youtube" %}

Transcript: “with Mexico that players can either travel from the u.s. to Mexico by plane or just walked past the wall that still won’t be built it’s up to you you can choose”

The video frames corresponding to the “from-to” lexical trigger are shown below:

<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 1</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
<iframe src="https://streamable.com/e/3qs7r5?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
<table>
<thead>
  <tr>
    <th>Handedness</th>
    <th>Axis</th>
    <th>Shape</th>
    <th>Direction</th>
    <th>Gesture</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Both Hands</td>
    <td>Horizontal</td>
    <td>Straight</td>
    <td>Diagonal right up</td>
    <td>Yes</td>
  </tr>
</tbody>
</table>

Lexical prompt: “travel from the”

<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 2</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
<iframe src="https://streamable.com/e/1hy1sl?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
<table>
<thead>
  <tr>
    <th>Handedness</th>
    <th>Axis</th>
    <th>Shape</th>
    <th>Direction</th>
    <th>Gesture</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Both Hands</td>
    <td>Horizontal</td>
    <td>Straight</td>
    <td>Leftward</td>
    <td>Yes</td>
  </tr>
</tbody>
</table>

Lexical prompt: “u.s. to”

<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 3</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
<iframe src="https://streamable.com/e/ako3bg?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
<table>
<thead>
  <tr>
    <th>Handedness</th>
    <th>Axis</th>
    <th>Shape</th>
    <th>Direction</th>
    <th>Gesture</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Both Hands</td>
    <td>Horizontal</td>
    <td>Straight</td>
    <td>Diagonal left down</td>
    <td>Yes</td>
  </tr>
</tbody>
</table>

Lexical prompt: “Mexico by”

<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 4</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
  <iframe src="https://streamable.com/e/u1c6wn?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
<table>
<thead>
  <tr>
    <th>Handedness</th>
    <th>Axis</th>
    <th>Shape</th>
    <th>Direction</th>
    <th>Gesture</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Both Hands</td>
    <td>Horizontal</td>
    <td>Straight</td>
    <td>Rightward</td>
    <td>Yes</td>
  </tr>
</tbody>
</table>

Lexical prompt: “plane”

The problem statement can be framed as:

Given a video input, identify whether a hand gesture is present that corresponds to the hand gestures portrayed by the speaker during the enunciation of the “from-to” lexical trigger in the training video frames. If a gesture is present, classify the video frame along the different gesture types (handedness, axis, shape, direction); if not, classify the video frame as “No Gesture”.

We segment a video into video frames of equal duration (500 ms each). To create a set of True Positive (TP) instances, we extract the video frames corresponding to the start and end portions of the lexical trigger. To create a set of True Negative (TN) instances, we extract video frames and annotate the ones containing hand gestures unrelated to those in the TP set. We perform the annotations using the Red Hen Rapid Annotator tool.
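
Here is a minimal sketch of the 500 ms segmentation step, assuming ffmpeg is available on the path (it is installed in the container described below); the file names are placeholders:

```python
import os
import subprocess

def split_into_clips(video_path, out_pattern="clips/clip_%04d.mp4", seconds=0.5):
    """Re-encode and split a video into fixed-length 500 ms clips."""
    os.makedirs(os.path.dirname(out_pattern) or ".", exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-f", "segment", "-segment_time", str(seconds),
         "-reset_timestamps", "1",
         "-c:v", "libx264", "-an",          # re-encode video, drop audio
         out_pattern],
        check=True,
    )

split_into_clips("input/sample_fallon_clip.mp4")  # hypothetical input file
```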

The classification model comprises mainly three units: a positional embedding that gives the model access to pixel order information, a Transformer encoder that processes the source sequence, and a max-pooling layer that keeps the most important feature.

The Model architecture is shown below:

Model Architecture

Since a video frame can be accompanied by multiple hand gesture types, it makes sense to treat this as a multi-label classification problem, as the labels are not mutually exclusive.
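
A minimal Keras sketch of these three units with a sigmoid multi-label head is shown below. This is an illustrative reconstruction rather than the exact project model; the sequence length, feature dimension, and label count are assumed values:

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, FEAT_DIM, NUM_LABELS = 16, 128, 18   # assumed shapes, for illustration

class PositionalEmbedding(layers.Layer):
    """Adds a learned positional embedding so the model sees order information."""
    def __init__(self, seq_len, dim, **kwargs):
        super().__init__(**kwargs)
        self.pos_emb = layers.Embedding(input_dim=seq_len, output_dim=dim)
        self.seq_len = seq_len

    def call(self, x):
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        return x + self.pos_emb(positions)

inputs = layers.Input(shape=(SEQ_LEN, FEAT_DIM))
x = PositionalEmbedding(SEQ_LEN, FEAT_DIM)(inputs)

# One Transformer encoder block: self-attention + feed-forward, with residuals
attn = layers.MultiHeadAttention(num_heads=4, key_dim=FEAT_DIM // 4)(x, x)
x = layers.LayerNormalization()(x + attn)
ff = layers.Dense(FEAT_DIM, activation="relu")(x)
x = layers.LayerNormalization()(x + ff)

# Max pooling keeps the most salient feature along the sequence axis
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(x)   # multi-label head

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```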

I spend the last week of the GSoC period packaging the final product into a Singularity container (a requirement set by Red Hen). To build a container, we first need a definition file:

Bootstrap: docker
From: tensorflow/tensorflow:latest-gpu

%labels
    AUTHOR nickilmaveli@gmail.com

%post
    apt-get update && apt-get install -y git ffmpeg libsm6 libxext6
    cd / && git clone https://github.com/Nickil21/joint-meaning-construal.git
    pip3 install pandas opencv-python numpy tables joblib imageio openpyxl flask jinja2 git+https://github.com/tensorflow/docs

%runscript
    cd /joint-meaning-construal/ && python3 detect_gesture.py

We can then build the image using the Sylabs Cloud Builder by uploading the definition file. The build takes about 15-20 minutes to complete. Once the image is built, the steps to run the project inside the Singularity container are as follows:

  1. Log on to the CWRU HPC OnDemand Web Portal.
  2. Click on Clusters and choose “rider Shell Access” from the drop-down menu. This should redirect you to a Terminal console window of the HPC server.
  3. Enter the gallina home directory:

    [nxm526@hpc4 ~]$ cd /mnt/rds/redhen/gallina/home/nxm526/
  4. Load the Singularity module:

    [nxm526@hpc4 nxm526]$ module load singularity
  5. Pull with Singularity:

    [nxm526@hpc4 nxm526]$ singularity pull library://nickil21/default/image:latest
  6. Move the Singularity image inside the project folder:

    [nxm526@hpc4 nxm526]$ mv image_latest.sif joint-meaning-construal/singularity/
  7. Enter the project root folder:

    [nxm526@hpc4 nxm526]$ cd joint-meaning-construal/
  8. Execute the command within a container:

    [nxm526@hpc4 nxm526]$ singularity exec ./singularity/image_latest.sif python detect_gesture.py <path_to_video_file>
  9. The output files will be present inside the static/uploads/ folder. Retrieve them using the CWRU HPC OnDemand Web Portal.

---
layout: single
classes: wide
title: "Week 1 (May 18 - May 24)"
permalink: /blog/gsoc-2021/report/week-1/
excerpt: ""
last_modified_at: 2021-05-30
---

This week, Mark creates CWRU (Case Western Reserve University) username accounts for all the students and mentors. Mine is nxm526. To receive all official HPC messages, Mark provides us with a case.edu{:target="_blank"} account.

To see if we can log in to the CWRU HPC server after connecting through the CWRU VPN:

$ ssh nxm526@rider.case.edu
Warning: Permanently added the ECDSA host key for IP address '129.22.100.157' to the list of known hosts.
nxm526@rider.case.edu's password:

To access the HOME directory:

[nxm526@hpc3 home]$ cd /home/nxm526/
[nxm526@hpc3 ~]$ ll
total 0

After exchanging a couple of Slack messages, Tiago, my primary mentor, schedules a brief discussion on the project with me and the other mentors on May 27th. The meeting takes place via a Zoom call. During the session, we introduce ourselves. The mentors highlight their respective areas of expertise and how I can seek their advice depending on the domain of the problem at hand. There is also a separate Slack channel, project_construal_2021{:target="_blank"}, to update all the mentors at once at every stage of the project.

Following are some of the key takeaways from the meeting:

The next day, we have our first introductory meeting with all 12 GSoC students selected under Red Hen and the mentors. The meeting agenda is to introduce the mentors and mentees and get to know the cohort better. Each student speaks about their project and their assigned mentors, and raises questions that could benefit others in navigating their projects more smoothly.

---
layout: single
classes: wide
title: "Week 3 (June 1 - June 7)"
permalink: /blog/gsoc-2021/report/week-3/
excerpt: ""
modified:
last_modified_at: 2021-06-07
---

In order to comply with Section 108 of the U.S. Copyright Act{:target="_blank"}, it is necessary to email access@redhenlab.org{:target="_blank"} requesting access to the Red Hen data and tools. I get access upon submitting the research proposal as well as the contribution proposal.

Due to storage constraints in the default HOME directory, it is not advisable to keep files there. To store large files, gallina home, a directory on gallina (a Red Hen server), needs to be set up.

To check if gallina home is properly set up in CWRU HPC:

[nxm526@hpc4 home]$ pwd
/mnt/rds/redhen/gallina/home
[nxm526@hpc4 home]$ ls -al nxm526
total 20
drwxrwsr-x  2 nxm526 mbt8  2 Jun  7 18:30 .
drwxrwsr-x 91 mbt8   mbt8 91 Jun  7 18:30 ..

---
layout: single
classes: wide
title: "Week 4 (June 8 - June 14)"
permalink: /blog/gsoc-2021/report/week-4/
excerpt: ""
modified:
last_modified_at: 2021-06-15
---

Since this is the start of the coding period, I get in touch with my primary mentor, Tiago. We mutually agree that querying the Red Hen dataset for entries that have a particular gesture type can be a good way to investigate construal meaning relationships between the different linguistic elements. To narrow down the numerous possibilities of gesture types, we only consider hand gestures for our ablation.

We choose the following parameters:

| Gesture Type | Values |
|---|---|
| Body part | left hand, right hand, both hands |
| Axis | vertical, horizontal/lateral |
| Direction | upward, downward, leftward, rightward, diagonal right up, diagonal left up, diagonal right down, diagonal left down |
| Shape | straight, arced |
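
For reference, here is a small, hedged sketch of this parameter set as a Python lookup used to keep only annotation rows whose values fall inside it (the field names mirror the table and are assumed to match the annotation columns):

```python
# Hedged sketch: the chosen parameter set as a lookup used to keep only
# annotation rows whose values fall inside it (field names are assumed).
GESTURE_PARAMS = {
    "Body part": {"left hand", "right hand", "both hands"},
    "Axis": {"vertical", "horizontal/lateral"},
    "Direction": {"upward", "downward", "leftward", "rightward",
                  "diagonal right up", "diagonal left up",
                  "diagonal right down", "diagonal left down"},
    "Shape": {"straight", "arced"},
}

def is_chosen_gesture(row: dict) -> bool:
    """True if every annotated value of the row is within the chosen parameters."""
    return all(str(row.get(col, "")).lower() in values
               for col, values in GESTURE_PARAMS.items())

print(is_chosen_gesture({"Body part": "both hands", "Axis": "horizontal/lateral",
                         "Direction": "leftward", "Shape": "straight"}))  # True
```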

To begin with, we want the algorithm to understand only the Prominence dimension, which is relevant in the case of gestures and more widespread than the other dimensions.

---
title: "Week 5 (June 15 - June 21)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-5/
excerpt: ""
modified:
last_modified_at: 2021-06-22
---

This week, I schedule a Zoom meeting with Tiago on June 16th to discuss the next steps. Here's a gist of the discussion that took place:

Objective

We want to map the Frames onto a particular construal dimension. Even though the Terminal nodes may be somewhat random in meaning, using FrameNet's rich network-based parsing mechanisms we can perhaps leverage the upper levels of the graph to map a Frame to a specific construal dimension.

Method

  1. For the Red Hen dataset, we need to retrieve Frames that have the gesture types we want to query for. To obtain these Frames, we can use a timestamp interval of +/- 3 seconds from the annotated gesture types. Each Frame is made up of two components: an audio transcription in the form of textual output, and Frame metadata.
  2. The FrameNet API can be used to model the Frame Metadata and carry out the Frame-based Semantic parsing. For instance, we can leverage the rich APIs available to derive relationships between all the FrameNet frames in the network. FrameNet has top-level concepts as Event, Relation, State, Entity, Locale, and Process. Most of the individual Frames in the network inherit these top-level concepts and are related to one another using a relationship such as Using, Precedes, Subframe, etc.
  3. Once we have some form of a relationship among Frames, we need to create a representation of these Frames.
  4. After getting the representations, we can cluster them using a simple K-Means Clustering algorithm (see the sketch after this list).
  5. We can use these clusters to segment the Frames into Lexical units.
  6. The Lexical units can be mapped to a representation using a contextual word embeddings method.
  7. These representations can be fed as input to the Generator, which outputs a sequence. Finally, we feed this to the Discriminator.
  8. Next, we can feed the textual output (audio transcription) to the Discriminator.
  9. In the last step, adversarial training with a GAN determines whether the final outcome is a real or a fake transcription. The Discriminator is trained to classify whether samples come from the Generator or from the real data distribution, while the Generator's objective is to produce samples that the Discriminator cannot distinguish from real ones.
  10. Through the Human-In-Loop intervention, we can create synthetic datasets to evaluate whether the correct construal dimension is associated with a Frame or not.
  11. The idea is to start with maybe two types of construal dimension – Prominence and Configuration and analyze if we are doing better than a random guess. Later, we can move on to cover the other dimensions.
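
As referenced in step 4, here is a minimal sketch of steps 2-4 using NLTK's FrameNet interface, with TF-IDF over frame definitions as a simple stand-in representation. This is illustrative only; the contextual-embedding and GAN components described above are not covered here:

```python
import nltk
nltk.download("framenet_v17", quiet=True)
from nltk.corpus import framenet as fn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Step 2: inspect frame-to-frame relations (Inheritance, Using, Precedes, ...)
motion = fn.frame("Motion")
for rel in motion.frameRelations[:5]:
    print(rel)

# Steps 3-4: build a simple representation for a subset of frames and cluster it
frames = fn.frames()[:300]                            # small subset for illustration
texts = [f.name + " " + f.definition for f in frames]
vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(vectors)
print(list(zip([f.name for f in frames[:10]], labels[:10])))
```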

Resources

Timeline

For the first evaluation period, we hope to have the following components ready:

For the final evaluation period, we hope to complete:

The following script summarizes how to interact with the ELAN files to obtain the annotations by Tier Type/Name/ID. Finally, we save the gesture types between the start and end times containing a clause (the transcribed text).

import pympi
import pandas as pd

# Tier names we want to query for containing the gesture types
tier_names = ['clauses', 'Handshape', 'Movement direction', 'Handedness', 'Axis']

list_of_dfs = []
# Initialize the elan file
eafob = pympi.Elan.Eaf("input/sample.eaf")
# Loop over all the defined tiers containing the clause and gesture annotations
for tier in tier_names:
    # If the tier is not present in the elan file, print a warning and
    # continue. This is done to avoid possible KeyErrors
    if tier not in eafob.get_tier_names():
        print('WARNING!!!')
        print('One of the annotation tiers is not present in the elan file,')
        print('namely: {}. Skipping this one...'.format(tier))
    # If the tier is present we can loop through the annotation data
    else:
        lst = []
        for annotation in eafob.get_annotation_data_for_tier(tier):
            d = {}
            d['start_time'] = annotation[0]
            d['end_time'] = annotation[1]
            d[tier] = annotation[2]
            # Some tiers yield 4-tuples; fall back to None when only 3 values are returned
            d['gesture_phases'] = annotation[3] if len(annotation) > 3 else None
            lst.append(d)
        df = pd.DataFrame(lst)
        list_of_dfs.append(df)

data = pd.concat([d.set_index(['start_time', 'end_time', 'gesture_phases']) for d in list_of_dfs],
                  axis=1)
data.reset_index(inplace=True)
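# Add an empty placeholder 'axis' column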
data['axis'] = None
data.sort_values(['start_time', 'end_time'], inplace=True)
data.drop_duplicates(inplace=True)
data.to_csv("output.tsv", index=False, sep="\t")

One thing to note is that there could be a mismatch between a Frame and its transcription, since the tagging is performed at a granular timestamp level and a single annotator (presumably) did all the tagging. Anyway, here is how the top 10 rows of output.tsv look:

| # | start_time | end_time | gesture_phases | clauses | Handshape | Movement direction | Handedness | axis |
|---|---|---|---|---|---|---|---|---|
| 1 | 10872.0 | 11259.0 | str | I had– | flat | LAB | Left | |
| 2 | 13387.0 | 13733.0 | str | She had been to Disneyland here. | 1-2 stretched | down | Left | |
| 3 | 14259.0 | 14716.0 | str | And I had an appearance | 1-2 stretched, 3-5 bent | PT | Left | |
| 4 | 15254.0 | 15515.0 | str | same as before (appearance) | 1-2 stretched, 3-5 bent | left | Left | |
| 5 | 18061.0 | 18167.0 | str | which was weird | 1-4 touching, 5 stretched | down | Left | |
| 6 | 18309.0 | 18494.0 | str | going down to | 1-4 touching, 5 stretched | down | Left | |
| 7 | 25329.0 | 25797.0 | str | this was very pricess, Tinkle Bell, Snow White | flat | up | Both | |
| 8 | 26211.0 | 26979.0 | str | SA(Snow White) | flat | up | Both | |
| 9 | 27854.0 | | str | We did the whole thing | 1-2 connected | down | Left | |
| 10 | 28854.0 | 29892.0 | str | the r–, the lunch in the princess castle (illustration) | flat | LAB | Left | |

The only issue we face here is that only a handful of ELAN annotations are available inside the Red Hen repository. The ones with relevant hand gesture tagging are even fewer, so we have no option but to look for alternative sources that could serve our purposes.

---
title: "Week 7 (June 29 - July 5)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-7/
excerpt: ""
modified:
last_modified_at: 2021-07-06
---

As a result of not having a relevant hand gesture dataset to start the experimentation phase, I reach out to the mailing list of the International Society of Gesture Studies and manage to get a few responses.

So far, I collect the following datasets:

Tiago and I have a chat on Zoom. In the meeting, we discuss utilizing the existing annotation tool, the Red Hen Rapid Annotator, to quickly manually tag the pre-defined gesture parameters from segmented videos of the Ellen interview dataset. We choose about 30 video segments to begin with. The idea is to later ask the annotators a template of boolean (yes/no) questions corresponding to a particular construal dimension, e.g., “does the highlighted text invoke ordering of items in a sequence?”, “does the highlighted text depict a time lapse?”, “does the highlighted text signify levels of importance?”, and so on.

---
title: "Week 8 (July 6 - July 12)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-8/
excerpt: ""
modified:
last_modified_at: 2021-07-12
---