Study on the Personalization of Sign Language
Using Video Data of Japanese Sign Language
Zixuan DAI†, Shinji SAKO††
Graduate School of Engineering, Nagoya Institute of Technology
Gokisocho, Showa-ku, Nagoya, 466-0061 Japan
E-mail: †cnq14068@ict.nitech.ac.jp, ††s.sako@nitech.ac.jp
Abstract In recent years, there have been growing expectations for technology that automatically recognizes sign language and generates sign language CG. Individuality is believed to be expressed in sign languages just as in spoken languages, yet few studies have examined the individuality of signing. In this study, we investigated whether the individuality of signing movements can be analyzed from video data of Japanese Sign Language, referring to previous studies on the individuality and anonymity of signers in motion capture data of French Sign Language. Preliminary results using 2D data demonstrate that kinematic features contain sufficient information to identify individual signers, supporting our hypothesis. In addition, we extended the analysis by incorporating 3D pose estimation for further validation.
Key words Sign Language, Individuality, Normalization
1. Introduction
Sign language is an essential communication tool for the Deaf community, playing a more significant role than just a linguistic function. It is not only a form of interaction but also a clear reflection of cultural identity and personal expression. Unlike spoken language, this visual language conveys information through gestures, body posture, facial expressions, and eye movements, transmitting a broad range of nonverbal information. In addition to facilitating daily communication, sign language fosters social integration, promotes educational equality, and preserves cultural heritage within the Deaf community. These roles connect sign language closely with the identity and social belonging of the Deaf community.
Despite the global recognition and use of sign languages, they exhibit significant variations across regions, reflecting diverse cultural and linguistic backgrounds. Current research mainly focuses on the structural and linguistic aspects of sign language, while relatively little attention has been given to variation between individual signers. However, these differences are a vital component of sign language, as the subtle nuances in each signer's movements provide a distinct layer of personal expression. Each gesture can reveal unique personal information, such as identity or gender. Studies have shown that skeletal movements alone, without additional context, can be sufficient to identify a person. Furthermore, gender characteristics are also reported to be expressed in signing movements, and altering these attributes can cause discomfort. This underscores the importance of understanding both individual and gender-specific characteristics in sign language gestures, as they play a key role in both communication and personal identity expression.
These individual differences hold academic significance, especially in areas such as anonymization, the development of personalized services for automatic sign language recognition, and the customization of educational tools for the Deaf community. Japanese Sign Language (JSL), with its unique structure and expressions, remains underexplored in the context of personalized expression and individual identification through sign language movements, thereby limiting our understanding of its linguistic complexity and potential technological applications.
The Japanese Act on the Protection of Personal Information defines “personal information” as data that can uniquely identify an individual through physical characteristics or other unique identifiers. The unique gesture patterns of individual signers qualify as personally identifiable information (PII), raising important ethical concerns about anonymity and privacy for the Deaf community. Research on the personalization of sign language movements is therefore not only a matter of technological advancement but also a critical step in safeguarding the privacy of signers. In spoken language, the voice carries not only linguistic information but also paralinguistic and non-linguistic information, such as the speaker's intention, attitude, and emotion, which are expressed through prosodic features like intensity, pitch, and rhythm. Research by Hashimoto et al. [1], which investigated personality perception from voice features, highlighted how the fundamental frequency and spectrum influence the perception of personality. Such methodologies prompt a parallel exploration in sign language, where the dynamics of gesture and expression may similarly reveal individual traits.
As a visual language, sign language also contains paralinguistic and non-verbal information that complements predetermined linguistic information, such as changes in gaze, facial direction, and the magnitude and speed of movements. It has been confirmed that people can identify signers not only through static physical attributes but also through dynamic features [2].
Research on individuality in body movement, such as Cutting and Kozlowski's study on gait recognition [3] and Furuichi et al.'s exploration of dance movement identification [4], demonstrates the potential of movement-based identification. Although sign language movement information has not been formally recognized as a personal identifier, these investigations hint at the potential of sign language movements to convey identifiable information.
This study builds on the work of Bigand et al. (2021) [5], which analyzed individuality and anonymity in French Sign Language (LSF) using motion capture data. We aim to explore whether the individuality of sign language movements can be analyzed in Japanese Sign Language (JSL) video data. By comparing the effectiveness of 2D and 3D data for signer identification, this study also evaluates the impact of depth information on the recognition accuracy of individual signers. The depth information contained in the 3D data provides a more detailed description of the signer's movements, potentially revealing finer distinctions between gestures that may be less discernible in 2D data.
2. Related Research
This section reviews Félix Bigand's research on identifying sign language users through machine learning analysis of motion statistics. The novelty of the study lies in its approach to understanding how motion features can reveal the identity of signers, employing a machine learning model trained to distinguish signers based on their unique motion characteristics. Using an LSF motion capture corpus, the study involved six signers who each described 25 images, and motion data were captured via 27 body markers. These markers were then used to derive 19 virtual markers that describe the major joint movements, with the pelvis as the reference point. To ensure that the model was not simply identifying signers from their static physical profile (size, height, proportions of body parts, shape, etc.), the researchers normalized the data with respect to each signer's size, shape, and posture.
Feature extraction was meticulously carried out, emphasizing the position and velocity of markers and calculating the first four statistical moments (mean, standard deviation, skewness, and kurtosis) and the covariance between velocities to capture the statistical distribution and interaction between different body parts.
Principal Component Analysis (PCA) further processed these statistics, allowing a linear classifier to successfully identify signers with high accuracy, demonstrating that beyond physical appearance, individuals' kinematic signatures are unique and identifiable.
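A minimal sketch of this kind of pipeline is shown below: the four statistical moments and the velocity covariances are computed per recording and fed to PCA and a simple linear classifier. The placeholder data, the 38-component velocity layout, and the choice of logistic regression are illustrative assumptions, not details taken from the original study.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def motion_statistics(velocity):
    """velocity: (n_frames, n_components) array of marker velocity components
    for one recording. Returns a single feature vector."""
    feats = [
        velocity.mean(axis=0),       # 1st moment: mean
        velocity.std(axis=0),        # 2nd moment: standard deviation
        skew(velocity, axis=0),      # 3rd moment: skewness
        kurtosis(velocity, axis=0),  # 4th moment: kurtosis
    ]
    # covariance between velocity components (upper triangle, excluding the diagonal)
    cov = np.cov(velocity, rowvar=False)
    feats.append(cov[np.triu_indices_from(cov, k=1)])
    return np.concatenate(feats)

# Placeholder data: 30 recordings of 100 frames with 38 velocity components each;
# real features would be derived from the motion capture corpus.
rng = np.random.default_rng(0)
recordings = [rng.standard_normal((100, 38)) for _ in range(30)]
labels = np.repeat(np.arange(6), 5)  # 6 signers, 5 recordings each

X = np.stack([motion_statistics(v) for v in recordings])
# PCA followed by a simple linear classifier, mirroring the pipeline described above
clf = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
clf.fit(X, labels)
print(clf.score(X, labels))
```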
This study not only showcases the potential of machine learning in signer identification but also challenges our understanding of motion and identity in sign language, opening avenues for further research in motion analysis and signer anonymity.
3. Research Purpose and Methodology
This study aims to build upon the foundational research of Félix Bigand, who used 3D motion capture data to distinguish individual signers in LSF. While his research demonstrated the power of 3D motion data for capturing the individuality of signers, the reliance on such specialized equipment and datasets poses significant challenges in terms of accessibility and widespread applicability, particularly in the context of JSL, for which no comparable 3D motion capture dataset exists. Recognizing the value of 3D data, this study pursues two objectives: first, to evaluate whether 2D datasets [6], which are more readily available, can similarly discern the subtle personalized and identity-related cues present in signer movements; and second, to explore the impact of depth information by using 3D pose estimation tools to extract depth data from existing 2D videos. This approach aims to bridge the gap between the richness of 3D data and the practicality of 2D data, assessing how depth information affects the accuracy of signer identification.
Bigand's study had limitations in sample size and diversity, as it involved only six signers from a specific demographic profile using LSF. This study uses data from 11 signers, each contributing 50 video samples, sourced from a comprehensive JSL dataset provided by SoftBank Corp. This sample was intentionally selected to ensure demographic diversity, covering three distinct age groups (youth, middle-aged, and elderly) while maintaining gender balance. This selection aims to examine kinematic features across a broad spectrum of JSL users, thereby enhancing the study's inclusivity and generalizability.
3.1 2D Key Point Extraction
For 2D key point extraction, this study uses the YOLOv8 (You Only Look Once) pose estimation model to extract 11 key points (Figure 1) from the video data: (1) nose, (2) left eye, (3) right eye, (4) left ear, (5) right ear, (6) left shoulder, (7) right shoulder, (8) left elbow, (9) right elbow, (10) left wrist, and (11) right wrist. Although finger movements are critical for conveying information in sign language, the selected key points cover the joints that play essential roles in sign language gestures and expressions, providing a framework for analyzing the gross movements of signers. YOLO's real-time detection capabilities enable efficient and precise extraction, forming the basis for calculating the essential kinematic features.
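As an illustration, the ultralytics implementation of YOLOv8 provides a pretrained pose model whose COCO-style output contains 17 key points per person, the first 11 of which correspond to the points listed above. A minimal extraction loop might look as follows; the model file and video path are illustrative.

```python
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")  # pretrained pose model (weights downloaded on first use)

def extract_2d_keypoints(video_path):
    """Return an array of shape (n_frames, 11, 2) with (x, y) pixel coordinates
    of the 11 upper-body key points for the first detected person in each frame."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = model(frame, verbose=False)[0]
        if result.keypoints is None or result.keypoints.xy.shape[0] == 0:
            continue  # no person detected in this frame
        kp = result.keypoints.xy[0].cpu().numpy()  # (17, 2) COCO key points
        frames.append(kp[:11])                     # nose ... right wrist
    cap.release()
    return np.stack(frames)

keypoints_2d = extract_2d_keypoints("signer01_sample01.mp4")  # hypothetical file name
```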
By employing more widely available two-dimensional data, this approach aims to democratize motion analysis, enhancing its applicability across a variety of contexts and making signer identification technology more universally accessible. This endeavor not only builds on previous research but also explores new possibilities in data dimensionality and analysis techniques.
3.2 3D Key Point Extraction
For 3D key point extraction, this study uses MediaPipe's Holistic model, which provides comprehensive landmark detection for the entire body, including pose, face, and hands. Specifically, the Holistic model detects key regions as follows:
(1) Pose: 33 key points covering the major joints and skeletal structure of the body.
(2) Face: 468 key points capturing detailed facial features.
(3) Hand: 21 key points for each hand.
This study focuses on a subset of key points relevant to the kinematic analysis of sign language movements. A total of 61 key points (Figure 2) were selected, including:
(1) 19 pose key points, covering critical landmarks such as the eyes, nose, shoulders, elbows, and wrists.
(2) 42 hand key points, with 21 key points for each hand.
Using MediaPipe Holistic, the x, y, and z coordinates for these 61 key points are extracted from the video data. The z-coordinate, which represents depth information, adds a third dimension to the 2D video frames, enabling more precise analysis of the signers' movements in 3D space. It should be noted, however, that while the model provides z-axis data, the depth estimates are not highly accurate at this stage and offer only a rough approximation of depth. Nevertheless, this additional depth information allows for a more detailed understanding of subtle variations in signer movements and can enhance the overall accuracy of the kinematic analysis.
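A possible extraction loop with the MediaPipe Python API is sketched below; the subset of pose indices and the file name are illustrative placeholders rather than the exact configuration used in this study.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_3d_keypoints(video_path, pose_indices):
    """Return one (n_points, 3) array of (x, y, z) coordinates per frame."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks is None:
                continue
            # selected pose landmarks (eyes, nose, shoulders, elbows, wrists, ...)
            pts = [(lm.x, lm.y, lm.z)
                   for i, lm in enumerate(results.pose_landmarks.landmark)
                   if i in pose_indices]
            # 21 landmarks per hand, added when the hand is detected
            for hand in (results.left_hand_landmarks, results.right_hand_landmarks):
                if hand is not None:
                    pts.extend((lm.x, lm.y, lm.z) for lm in hand.landmark)
            frames.append(np.array(pts))
    cap.release()
    return frames

# Hypothetical subset of pose landmark indices; this study selects 19 such points in total.
POSE_SUBSET = {0, 2, 5, 11, 12, 13, 14, 15, 16}
keypoints_3d = extract_3d_keypoints("signer01_sample01.mp4", POSE_SUBSET)
```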
To effectively analyze and compare both 2D and 3D key point data, this study builds on previous methodologies by exploring different dimensionality reduction techniques. In addition to Principal Component Analysis (PCA), used in previous research, we also explore Uniform Manifold Approximation and Projection (UMAP) and t-Distributed Stochastic Neighbor Embedding (t-SNE). These methods reduce the dimensionality of the extracted key point data while preserving the most relevant features for identifying individual signers.
By comparing these techniques, we aim to determine the most effective method for representing the nuanced variations in signer movements across both 2D and 3D data.
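All three techniques can be applied to the same feature matrix in a few lines of code. The example below uses scikit-learn for PCA and t-SNE and the separate umap-learn package for UMAP, with illustrative hyperparameters and placeholder data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

# Placeholder feature matrix: 11 signers x 50 samples, 200 kinematic features each
rng = np.random.default_rng(0)
X = rng.standard_normal((550, 200))

embeddings = {
    "PCA":   PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
    "UMAP":  umap.UMAP(n_components=2, n_neighbors=15, random_state=0).fit_transform(X),
}
# Each entry is a (550, 2) array that can be plotted and colored by signer
# to visually compare how well the methods separate individuals.
```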
3.3 Normalizations
Our dataset consists of two-dimensional, front-facing video captures of subjects performing seated actions, each lasting approximately two seconds. For each signer, the average posture from the first frame is used as their reference posture. The controlled environment and consistent camera perspective result in minimal variability in subject orientation and distance from the camera. This uniformity simplifies the normalization process, focusing on adjustments that enhance comparability without compromising the kinematic features of interest.
3.3.1 Size Normalization
Size normalization is implemented to eliminate the absolute differences in body size among signers, such as height and body shape, ensuring that the analysis focuses on kinematic features rather than static physical dimensions. This study compared two methods for calculating the size factor, each with a distinct approach to scaling the data.
Method 1: Shoulder Width
This method uses the distance between the left and right shoulders in the first frame as the size reference. This shoulder width serves as a stable measure for each signer. By scaling all key points based on this size reference, we normalize the size differences among signers, making their movements more comparable without being influenced by variations in body dimensions.
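A minimal sketch of this scaling, assuming the key points are stored as an (n_frames, n_points, 2) array in which indices 5 and 6 are the left and right shoulders (the COCO ordering used above):

```python
import numpy as np

def normalize_by_shoulder_width(keypoints, left_shoulder=5, right_shoulder=6):
    """Divide all key points by the shoulder distance measured in the first frame."""
    width = np.linalg.norm(keypoints[0, left_shoulder] - keypoints[0, right_shoulder])
    return keypoints / width
```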
Method 2: Linear Regression Based on Reference Postures
The second method uses each signer's individual reference posture and the global reference posture to compute a size factor through linear regression. This size factor is then applied to scale all key points for each signer, effectively normalizing differences in body size. This approach offers a more detailed and context-aware scaling by considering the overall posture, beyond just the shoulder width.
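One way to realize this, assuming each reference posture is a flattened coordinate vector, is an ordinary least-squares fit of a single scale factor mapping the individual posture onto the global one; this is a sketch under that assumption rather than the exact formulation of the original method.

```python
import numpy as np

def size_factor(individual_posture, global_posture):
    """Least-squares scale s minimizing || s * individual - global ||^2."""
    x = np.ravel(individual_posture)
    g = np.ravel(global_posture)
    return float(np.dot(x, g) / np.dot(x, x))

def normalize_size(keypoints, individual_posture, global_posture):
    """Apply the signer-specific size factor to all key points."""
    return keypoints * size_factor(individual_posture, global_posture)
```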
Two types of error analyses were conducted to evaluate each method's effectiveness in removing static features (a small sketch of both computations follows the list):
(1) Inter-Signer Static Posture Difference (Figure 3): We calculated the Euclidean distance between signers’ static postures after normalization to assess how well each method reduced static differences.
(2) Global Posture Error (Figure 4): We calculated the Euclidean distance between each signer's average normalized posture and the global average posture. This provides a direct measure of how well the methods align the signers' postures to the global reference posture.
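Both error measures reduce to Euclidean distances between posture vectors; a small sketch, assuming each signer's normalized static posture is given as a flattened coordinate vector:

```python
import numpy as np

def inter_signer_distances(postures):
    """Pairwise Euclidean distances between signers' normalized static postures.
    postures: (n_signers, n_coords) array. Returns an (n_signers, n_signers) matrix."""
    diff = postures[:, None, :] - postures[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def global_posture_error(postures):
    """Distance from each signer's normalized posture to the global average posture."""
    global_posture = postures.mean(axis=0)
    return np.linalg.norm(postures - global_posture, axis=-1)
```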
The comparative analysis demonstrated that the second method, based on linear regression, performed better in removing static size differences and aligning signers’ postures.
3.3.2 Shape Normalization
Shape normalization is intended to eliminate variations in body proportions between signers, which ensures that differences in movements are not affected by physical attributes. After applying size normalization, shape normalization aligns each signer's posture with the global average posture, while maintaining the relative proportions of different body parts.
The global reference p |