I am thrilled to be a part of the awesome Red Hen Lab community! Thank you for selecting me and giving me a chance to contribute to the Red Hen codebase.
This post describes my journey after being selected as a Google Summer of Code (GSoC) student associated with Red Hen Lab and FrameNet Brasil. I plan to summarize my progress at the end of every week until the end of the summer.
Here’s the abstract of my project:
The project aims to develop a prototype capable of meaning construction using multi-modal channels of communication. Specifically, for a co-speech gesture dataset, we combine manual annotations with metadata obtained through algorithms to devise a mechanism that disambiguates meaning while considering the influence of all the modalities involved in a particular frame. Since only a handful of annotated datasets have been made available by Red Hen, we leverage semi-supervised learning techniques to annotate additional unlabeled data. Furthermore, since each frame admits multiple possible interpretations, we use human annotators to label a subset of our validation set and report our performance on that set.
My mentors are: Suzie Xi{:target="_blank"}, Mark Turner{:target="_blank"}, Javier Valenzuela{:target="_blank"}, Anna Wilson{:target="_blank"}, Maria Hedblom{:target="_blank"}, Francis Steen{:target="_blank"}, Tiago Torrent{:target="_blank"} (primary), Frederico Belcavello{:target="_blank"}, Inés Olza{:target="_blank"}.
---
title: "Week 10 (July 20 - July 26)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-10/
excerpt: ""
modified:
last_modified_at: 2021-07-12
---
We choose "from-to", with a proximity window of four word tokens between "from" and "to", as the initial lexical-trigger template to map to the construal dimension Prominence. We also identify "first-second", "firstly-secondly", and "here-then", but could not find many relevant hand gestures for them in the PATS dataset.
Consider a sample video of a talk show host, Jimmy Fallon, taken from the PATS dataset with a pre-defined start and end time:
{% include video id="JY-Nhs__4xk?start=145&end=160" provider="youtube" %}
Transcript: “with Mexico that players can either travel from the u.s. to Mexico by plane or just walked past the wall that still won’t be built it’s up to you you can choose”
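The "from-to" trigger search with a proximity window can be sketched as a simple token-window scan over the transcript. This is a minimal illustration of the idea; `find_trigger` and the whitespace tokenization are my own, not the project's actual code:

```python
def find_trigger(tokens, first="from", second="to", window=4):
    """Return (i, j) index pairs where `second` follows `first`
    with at most `window` intervening tokens."""
    matches = []
    for i, tok in enumerate(tokens):
        if tok == first:
            # j - i - 1 intervening tokens, so j may reach i + window + 1
            for j in range(i + 1, min(i + 2 + window, len(tokens))):
                if tokens[j] == second:
                    matches.append((i, j))
                    break
    return matches

transcript = ("with Mexico that players can either travel from the u.s. "
              "to Mexico by plane or just walked past the wall that still "
              "won't be built it's up to you you can choose")
tokens = transcript.lower().split()
print(find_trigger(tokens))
```

On this transcript the only match is "from the u.s. to", where two tokens intervene; the later "up to you" has no preceding "from" inside the window.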
The video frames corresponding to the “from-to” lexical trigger are shown below:
<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 1</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
<iframe src="https://streamable.com/e/3qs7r5?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
<table>
<thead>
<tr>
<th>Handedness</th>
<th>Axis</th>
<th>Shape</th>
<th>Direction</th>
<th>Gesture</th>
</tr>
</thead>
</table>
<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 2</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
<iframe src="https://streamable.com/e/1hy1sl?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
<table>
<thead>
<tr>
<th>Handedness</th>
<th>Axis</th>
<th>Shape</th>
<th>Direction</th>
<th>Gesture</th>
</tr>
</thead>
</table>
<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 3</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
<iframe src="https://streamable.com/e/ako3bg?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
<table>
<thead>
<tr>
<th>Handedness</th>
<th>Axis</th>
<th>Shape</th>
<th>Direction</th>
<th>Gesture</th>
</tr>
</thead>
</table>
<h3 style="border-bottom: 1px solid; margin: 0 0 8px 0;">Frame 4</h3>
<div style="position: relative; width: 100%; padding-top: 56.25%;">
<iframe src="https://streamable.com/e/u1c6wn?loop=0" frameborder="0" width="100%" height="100%" allowfullscreen style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"></iframe>
</div>
<p style="margin: 10px 0 0 0;"></p>
Handedness | Axis | Shape | Direction | Gesture |
---|---|---|---|---|
Both Hands | Horizontal | Straight | Rightward | Yes |
The problem statement can be framed as:
Given a video input, identify whether a hand gesture is present that corresponds to the hand gestures portrayed by the speaker during the enunciation of the "from-to" lexical trigger in the training video frames. If one is present, classify the video frame along the different gesture types (handedness, axis, shape, direction); if not, classify it as "No Gesture".
We segment each video into frames of equal 500 ms duration. To create the set of True Positive (TP) instances, we extract the video frames corresponding to the start and end of the lexical trigger. To create the set of True Negative (TN) instances, we extract video frames and annotate those containing hand gestures unrelated to the ones in the TP set. We perform the annotations using the Red Hen Rapid Annotator tool.
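The windowing step can be sketched as follows. This is a hedged sketch: `segment_spans`, `overlaps`, and the example trigger interval are illustrative helpers of my own, not the project's actual code:

```python
def segment_spans(duration_ms, win_ms=500):
    """Split [0, duration_ms) into consecutive windows of win_ms."""
    return [(t, min(t + win_ms, duration_ms))
            for t in range(0, duration_ms, win_ms)]

def overlaps(span, trigger):
    """True if a window overlaps the trigger's (start, end) interval."""
    return span[0] < trigger[1] and trigger[0] < span[1]

spans = segment_spans(3000)                       # six 500 ms windows
tp = [s for s in spans if overlaps(s, (600, 1400))]
print(tp)  # the windows covering the (hypothetical) trigger interval
```

Windows overlapping the trigger interval become TP candidates; the remaining windows are screened manually for unrelated gestures to form the TN set.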
The classification model comprises three main units: a positional embedding that gives the model access to pixel-order information, a Transformer encoder that processes the source sequence, and a max-pooling layer that keeps the most salient features.
The model architecture is shown below:
Since a video frame can be accompanied by multiple hand gesture types, it makes sense to treat this as a multi-label classification problem: the labels are not mutually exclusive.
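In a multi-label setup, each label gets its own independent sigmoid that is thresholded separately, instead of one softmax over mutually exclusive classes. A minimal sketch (the label names and logits here are invented for illustration):

```python
import math

# Hypothetical label space for one video frame
LABELS = ["handedness:both", "axis:horizontal", "shape:straight",
          "direction:rightward", "no_gesture"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Multi-label decision: each label fires independently of the others."""
    return [lab for lab, z in zip(LABELS, logits)
            if sigmoid(z) >= threshold]

print(predict_labels([2.1, 1.3, 0.4, -0.2, -3.0]))
```

With a softmax, exactly one label would win; with independent sigmoids, any subset of labels (including none) can be active for a frame.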
I spend the last week of the GSoC period packaging the final product into a Singularity container (a requirement set by Red Hen). To build a container, we first need a definition file:
```
Bootstrap: docker
From: tensorflow/tensorflow:latest-gpu

%labels
    AUTHOR nickilmaveli@gmail.com

%post
    apt-get update && apt-get install -y git ffmpeg libsm6 libxext6
    cd / && git clone https://github.com/Nickil21/joint-meaning-construal.git
    pip3 install pandas opencv-python numpy tables joblib imageio openpyxl flask jinja2 git+https://github.com/tensorflow/docs

%runscript
    cd /joint-meaning-construal/ && python3 detect_gesture.py
```
We can then build the image with the Sylabs Cloud Builder by uploading the definition file. The build takes about 15-20 minutes to complete. Once the image is built, the steps to run inside the Singularity container are as follows:
Enter the gallina home directory:
[nxm526@hpc4 ~]$ cd /mnt/rds/redhen/gallina/home/nxm526/
Load the Singularity module:
[nxm526@hpc4 nxm526]$ module load singularity
Pull with Singularity:
[nxm526@hpc4 nxm526]$ singularity pull library://nickil21/default/image:latest
Move the Singularity image inside the project folder:
[nxm526@hpc4 nxm526]$ mv image_latest.sif joint-meaning-construal/singularity/
Enter the project root folder:
[nxm526@hpc4 nxm526]$ cd joint-meaning-construal/
Execute the command within a container:
[nxm526@hpc4 nxm526]$ singularity exec ./singularity/image_latest.sif python detect_gesture.py <path_to_video_file>
The output files are written to the `static/uploads/` folder. Retrieve the output files using the CWRU HPC OnDemand Web Portal.

---
layout: single
classes: wide
title: "Week 1 (May 18 - May 24)"
permalink: /blog/gsoc-2021/report/week-1/
excerpt: ""
last_modified_at: 2021-05-30
---
This week, Mark creates CWRU (Case Western Reserve University) username accounts for all the students and mentors; mine is nxm526. To receive all official HPC messages, Mark also provides us with a case.edu{:target="_blank"} account.
To see if we can log in to the CWRU HPC server after connecting through the CWRU VPN:
$ ssh nxm526@rider.case.edu
Warning: Permanently added the ECDSA host key for IP address '129.22.100.157' to the list of known hosts.
nxm526@rider.case.edu's password:
To access the HOME directory:
[nxm526@hpc3 home]$ cd /home/nxm526/
[nxm526@hpc3 ~]$ ll
total 0
After exchanging a couple of Slack messages, Tiago, my primary mentor, schedules a brief discussion of the project with me and the other mentors on May 27th. The meeting takes place via a Zoom call. During the session, we introduce ourselves. The mentors highlight the areas where their strengths lie and how I could seek their advice depending on the domain of the problem, to leverage their expertise. There is also a separate Slack channel, project_construal_2021{:target="_blank"}, for updating all the mentors at once at every stage of the project.
Following are some of the key takeaways from the meeting:
Aim to make small advancements, even ones that amount to only a minuscule improvement over existing systems. Since detecting meaning from multimodal input has not been explored extensively, owing to the difficulty of handling the several possible interpretations, even a small contribution in the right direction is challenging.
Even creating an algorithm for captioning hand gestures in a 2D coordinate space is tricky, because annotators/end-users can interpret the same scene in multiple ways. As a result, the Inter-Annotator Agreement tends to be quite low in this scenario.
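Inter-Annotator Agreement is commonly quantified with a chance-corrected statistic such as Cohen's kappa. A minimal two-annotator implementation (my own sketch, not part of the project code; the degenerate case where expected agreement equals 1 is not handled):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items labelled identically
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under chance, from each annotator's label frequencies
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["up", "down", "up", "up"],
                   ["up", "down", "down", "up"]))  # 0.5
```

A kappa near 0 means agreement is no better than chance, which is the risk the mentors flag for free-form gesture captioning.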
---
layout: single
classes: wide
title: "Week 3 (June 1 - June 7)"
permalink: /blog/gsoc-2021/report/week-3/
excerpt: ""
modified:
last_modified_at: 2021-06-07
---

In order to comply with Section 108 of the U.S. Copyright Act{:target="_blank"}, we must email access@redhenlab.org{:target="_blank"} requesting access to the Red Hen data and tools. I get access upon submitting the research proposal as well as the contribution proposal.
Due to storage constraints in the default HOME directory, it is not advisable to keep files there. To store large files, gallina home, a directory on gallina (a Red Hen server), needs to be set up.
To check if gallina home is properly set in CWRU HPC:
[nxm526@hpc4 home]$ pwd
/mnt/rds/redhen/gallina/home
[nxm526@hpc4 home]$ ls -al nxm526
total 20
drwxrwsr-x  2 nxm526 mbt8  2 Jun  7 18:30 .
drwxrwsr-x 91 mbt8   mbt8 91 Jun  7 18:30 ..

---
layout: single
classes: wide
title: "Week 4 (June 8 - June 14)"
permalink: /blog/gsoc-2021/report/week-4/
excerpt: ""
modified:
last_modified_at: 2021-06-15
---

Since this is the start of the coding period, I get in touch with my primary mentor, Tiago. We mutually agree that querying the Red Hen dataset for instances of a particular gesture type is a good way to investigate construal meaning relationships between the different linguistic elements. To narrow down the numerous possibilities of gesture types, we consider only hand gestures for our ablation.
We choose the following parameters:
Gesture Type | Values |
---|---|
Body part | left hand, right hand, both hands |
Axis | vertical, horizontal/lateral |
Direction | upward, downward, leftward, rightward, diagonal right up, diagonal left up, diagonal right down, diagonal left down |
Shape | straight, arced |
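These parameters can be flattened into a single multi-hot label space for a multi-label classifier. The encoding below is my own illustration of that idea, not the project's actual code:

```python
# Parameter table from above, as a dict of possible values
GESTURE_PARAMS = {
    "body_part": ["left hand", "right hand", "both hands"],
    "axis": ["vertical", "horizontal/lateral"],
    "direction": ["upward", "downward", "leftward", "rightward",
                  "diagonal right up", "diagonal left up",
                  "diagonal right down", "diagonal left down"],
    "shape": ["straight", "arced"],
}

# Flatten the table into one label space for multi-hot encoding
LABELS = [f"{p}:{v}" for p, vals in GESTURE_PARAMS.items() for v in vals]

def encode(annotation):
    """Multi-hot vector for a dict like {'axis': 'horizontal/lateral', ...}."""
    active = {f"{p}:{v}" for p, v in annotation.items()}
    return [1 if lab in active else 0 for lab in LABELS]

vec = encode({"body_part": "both hands", "axis": "horizontal/lateral",
              "shape": "straight", "direction": "rightward"})
print(sum(vec))  # 4 active labels out of 15
```

With 3 + 2 + 8 + 2 values, the full label space has 15 entries, and any annotated frame activates at most one value per parameter.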
---
title: "Week 5 (June 15 - June 21)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-5/
excerpt: ""
modified:
last_modified_at: 2021-06-22
---

This week, I schedule a Zoom meeting with Tiago on June 16th to discuss the next steps. Here's a gist of the discussion:
We want to map the Frames onto a particular construal dimension. Even though the terminal nodes may be somewhat random in meaning, using FrameNet's rich network-based parsing mechanisms we can perhaps leverage the upper levels of the graph to map a frame to a specific construal dimension.
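The idea of climbing FrameNet's inheritance graph until we reach a frame mapped to a construal dimension can be sketched as follows. Both the graph fragment and the dimension mapping here are hypothetical, for illustration only:

```python
# Hypothetical fragment of FrameNet's inheritance graph (child -> parent);
# frame names and the dimension mapping are illustrative, not real FrameNet data.
PARENT = {
    "Travel": "Motion",
    "Motion": "Event",
}
DIMENSION = {"Motion": "Prominence"}

def construal_dimension(frame):
    """Walk up the inheritance chain until a mapped ancestor is found."""
    while frame is not None:
        if frame in DIMENSION:
            return DIMENSION[frame]
        frame = PARENT.get(frame)
    return None

print(construal_dimension("Travel"))  # "Prominence", via the Motion ancestor
```

A terminal frame thus inherits its construal dimension from whichever ancestor in the upper levels carries the mapping.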
For the first evaluation period, we hope to have the following components ready:
For the final evaluation period, we hope to complete:
The following script summarizes how to interact with ELAN files to obtain the annotations according to the Tier Type/Name/ID. Finally, we save the gesture types between the start and end times, along with the clause (the transcribed text) they contain.
```python
import pympi
import pandas as pd

# Tier names we want to query for the gesture types
tier_names = ['clauses', 'Handshape', 'Movement direction', 'Handedness', 'Axis']

list_of_dfs = []

# Initialize the ELAN file
eafob = pympi.Elan.Eaf("input/sample.eaf")

# Loop over all the defined tiers
for tier in tier_names:
    # If the tier is not present in the ELAN file, warn and skip it.
    # This is done to avoid possible KeyErrors.
    if tier not in eafob.get_tier_names():
        print('WARNING!!!')
        print('One of the tiers is not present in the elan file')
        print('namely: {}. skipping this one...'.format(tier))
    # If the tier is present we can loop through the annotation data
    else:
        lst = []
        for annotation in eafob.get_annotation_data_for_tier(tier):
            d = {}
            d['start_time'] = annotation[0]
            d['end_time'] = annotation[1]
            d[tier] = annotation[2]
            d['gesture_phases'] = annotation[3]
            lst.append(d)
        df = pd.DataFrame(lst)
        list_of_dfs.append(df)

data = pd.concat([d.set_index(['start_time', 'end_time', 'gesture_phases'])
                  for d in list_of_dfs], axis=1)
data.reset_index(inplace=True)
data['axis'] = None
data.sort_values(['start_time', 'end_time'], inplace=True)
data.drop_duplicates(inplace=True)
data.to_csv("output.tsv", index=False, sep="\t")
```
One thing to note: there could be a mismatch between a Frame and its transcription, since the tagging is performed at a granular timestamp level and (presumably) a single annotator did all the tagging. Anyway, here is how the top 10 rows of output.tsv look:
# | start_time | end_time | gesture_phases | clauses | Handshape | Movement direction | Handedness | axis |
---|---|---|---|---|---|---|---|---|
1 | 10872.0 | 11259.0 | str | I had– | flat | LAB | Left | |
2 | 13387.0 | 13733.0 | str | She had been to Disneyland here. | 1-2 stretched | down | Left | |
3 | 14259.0 | 14716.0 | str | And I had an appearance | 1-2 stretched, 3-5 bent | PT | Left | |
4 | 15254.0 | 15515.0 | str | same as before (appearance) | 1-2 stretched, 3-5 bent | left | Left | |
5 | 18061.0 | 18167.0 | str | which was weird | 1-4 touching, 5 stretched | down | Left | |
6 | 18309.0 | 18494.0 | str | going down to | 1-4 touching, 5 stretched | down | Left | |
7 | 25329.0 | 25797.0 | str | this was very pricess, Tinkle Bell, Snow White | flat | up | Both | |
8 | 26211.0 | 26979.0 | str | SA(Snow White) | flat | up | Both | |
9 | 27854.0 | | str | We did the whole thing | 1-2 connected | down | Left | |
10 | 28854.0 | 29892.0 | str | the r–, the lunch in the princess castle (illustration) | flat | LAB | Left |
---
title: "Week 7 (June 29 - July 5)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-7/
excerpt: ""
modified:
last_modified_at: 2021-07-06
---
Not having a relevant hand-gesture dataset to start the experimentation phase, I reach out to the mailing list of the International Society of Gesture Studies and manage to get a few responses.
So far, I collect the following datasets:
---
title: "Week 8 (July 6 - July 12)"
layout: single
classes: wide
permalink: /blog/gsoc-2021/report/week-8/
excerpt: ""
modified:
last_modified_at: 2021-07-12
---