We decide to rely on external datasets due to the paucity of annotated data in-house. We settle on the PATS dataset, which consists of aligned pose, audio, and transcripts and is useful for our purpose. There are a total of 25 speakers: 15 talk show hosts, 5 lecturers, 3 YouTubers, and 2 televangelists. For our experiments, we use a subset consisting of 7 speakers.
Here’s a small demo of the PATS dataset: