Engagement Recognition in Spoken Dialogue via Neural Network by Aggregating Different Annotators' Models
K. Inoue, Divesh Lala, K. Takanashi, Tatsuya Kawahara
Published 2018 in Interspeech
ABSTRACT
This paper addresses engagement recognition based on four multimodal listener behaviors: backchannels, laughing, eye-gaze, and head nodding. Engagement is an indicator of how much a user is interested in the current dialogue. Multiple third-party annotators give ground-truth labels of engagement in a human-robot interaction corpus. Since perception of engagement is subjective, the annotations sometimes differ between individual annotators. Conventional methods directly use integrated labels, such as those generated through simple majority voting, and do not consider each annotator's individual perception. We propose a two-step engagement recognition in which each annotator's recognition is modeled and the different annotators' models are aggregated to recognize the integrated label. The proposed neural network consists of two parts. The first part corresponds to each annotator's model, which is trained independently with the corresponding labels. The second part aggregates the different annotators' models to obtain one integrated label. After each part is pre-trained, the whole network is fine-tuned through back-propagation of prediction errors. Experimental results show that the proposed network outperforms baseline models that directly recognize the integrated label without considering differing annotations.
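The two-step architecture described in the abstract can be sketched in outline. The code below is a minimal, hypothetical illustration (not the authors' implementation): one logistic model per annotator over shared behavior features (backchannels, laughing, eye-gaze, head nodding), followed by an aggregation layer that combines the annotator outputs into a single integrated engagement label. All class and variable names are assumptions for illustration; in the paper, each part is pre-trained separately before the whole network is fine-tuned end to end.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    """Logistic activation used by both stages."""
    return 1.0 / (1.0 + np.exp(-z))


class TwoStepEngagementModel:
    """Hypothetical sketch of the two-step idea:
    step 1 models each annotator independently,
    step 2 aggregates their outputs into one integrated label."""

    def __init__(self, n_features, n_annotators):
        # One linear model per annotator over the shared behavior features.
        self.W = rng.normal(scale=0.1, size=(n_annotators, n_features))
        self.b = np.zeros(n_annotators)
        # Aggregation layer: initialized as a simple average, analogous
        # to majority voting, then refined during joint fine-tuning.
        self.agg_w = np.ones(n_annotators) / n_annotators
        self.agg_b = 0.0

    def annotator_probs(self, x):
        # Step 1: each annotator model predicts engagement independently.
        return sigmoid(self.W @ x + self.b)

    def predict(self, x):
        # Step 2: aggregate the annotator outputs into the integrated label.
        p = self.annotator_probs(x)
        return sigmoid(self.agg_w @ p + self.agg_b)


model = TwoStepEngagementModel(n_features=4, n_annotators=3)
x = np.ones(4)  # toy feature vector for the four listener behaviors
integrated = model.predict(x)
```

Initializing the aggregation weights to a uniform average makes the untrained second stage behave like soft majority voting, so fine-tuning only has to learn how far to deviate from that baseline.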
PUBLICATION RECORD
- Publication year: 2018
- Venue: Interspeech
- Publication date: 2018-09-02
- Fields of study: Computer Science
- Source metadata: Semantic Scholar
REFERENCES
- 44 references
CITED BY
- 11 citing papers