SignTutor: An Interactive System for Sign Language Tutoring
Feature Article

Oya Aran, Ismail Ari, Lale Akarun, and Bülent Sankur, Boğaziçi University
Alexandre Benoit and Alice Caplier, GIPSA-Lab
Pavel Campr, University of West Bohemia in Pilsen
Ana Huerta Carrillo, Technical University of Madrid
François-Xavier Fanard, Université Catholique de Louvain

1070-986X/09/$25.00 © 2009 IEEE

Language learning can only advance with practice and corrective feedback. The interactive system, SignTutor, evaluates users' signing and gives multimodal feedback to help improve signing.

Sign language recognition (SLR) is a multidisciplinary research area involving pattern recognition, computer vision, natural language processing, and linguistics. It presents a multifaceted problem, not only because of the complexity of the visual analysis of hand gestures but also because of the highly multimodal nature of sign languages. Although sign languages are well-structured languages with a phonology, morphology, syntax, and grammar, they are different from spoken languages. A spoken language uses words sequentially, whereas a sign language uses several body movements in parallel. The linguistic characteristics of sign language also differ from those of spoken languages due to the existence of several context-affecting components, such as facial expressions and head movements in addition to hand movements.1,2

Practice can significantly enhance the learning of a language when there is validation and feedback. This is true for both spoken and sign languages. For spoken languages, students can evaluate their own pronunciation and improve to some extent by listening to themselves. Similarly, sign language teachers suggest that their students practice in front of a mirror. An SLR system lets students practice, validate, and evaluate their signing.
Such systems, including our SignTutor, can be instrumental in assisting sign language education, especially for non-native signers. With our SignTutor system, we aim to teach the basics of sign language interactively. SignTutor automatically evaluates the student's signing through visual feedback and information about the success of the performed sign. The SignTutor interactive platform enables users to watch and learn new signs, practice them, and validate their performance. SignTutor communicates with students through various feedback modalities: a text message, a recorded video of the user, a video of the segmented hands, or an avatar animation. A demonstration video of SignTutor can be downloaded at http://www.cmpe.boun.edu.tr/pilab/pilabfiles/demos/signtutor_demo_DIVX.avi.

Published by the IEEE Computer Society

Automatic sign language recognition

A brief survey of sign language grammars illustrates the challenges faced in developing automated learning and tutoring systems. Sign language phonology makes use of hand shape, place of articulation, and movement. The morphology uses directionality, aspect, and numeral incorporation, while the syntax uses spatial localization and agreement as well as facial expressions. The whole message is contained not only in hand gestures and shapes (manual signs) but also in facial expressions and head/shoulder motion (nonmanual signs). As a consequence, the language is intrinsically multimodal. Hence, SLR is a complex task that uses hand-shape recognition, gesture recognition, face-and-body-part detection, and facial-expression recognition as basic building blocks.3

Pioneering research on hand-gesture recognition and on SLR has mainly used instrumented gloves, which provide accurate data for hand position and finger configuration. These systems require users to wear cumbersome devices on
their hands. However, humans would prefer systems that operate in their natural environment. Since the mid-1990s, improvements in camera hardware have enabled real-time, vision-based, hand-gesture recognition,4 which does not use instrumented gloves and instead requires one or more cameras connected to the computer. While vision-based SLR systems provide a more user-friendly environment, they also introduce several new challenges, such as the detection and segmentation of the hand and finger configuration, or handling occlusions.

Because signs consist of dynamic elements, a recognition system that focuses only on the static aspects of the signs has a limited vocabulary. Hence, for recognizing hand gestures and signs, we must use methods capable of modeling the gestures' inherent temporal characteristics. Researchers have used several methods such as neural networks, hidden Markov models (HMMs), dynamic Bayesian networks, finite-state machines, and template matching.5 Among these methods, HMMs have been used the most extensively and have proven successful in several kinds of SLR systems.

Initial studies on vision-based SLR focused on limited-vocabulary systems, which could recognize 40 to 50 signs. These systems were capable of recognizing isolated signs and also continuous sentences, but with constrained sentence structure.6 One study addressed the scalability problem by adopting an approach based on identifying sign phonemes and components rather than the whole sign.7 The advantage of identifying components is the decrease in the number of subunits that the system needs to model; the subunits are then used to represent all the signs in the vocabulary. With component-based systems, large-vocabulary SLR can be achieved to a certain degree. For example, research with an instrumented-glove-based system has recognized a database of 5,119 signs.8

Another important aspect of sign language is the use of nonmanual components concomitant with hand gestures.
Most of the SLR systems in the literature concentrate on hand-gesture analysis only. However, without integrating nonmanual signs, it's not possible to interpret the meaning of all signs. Only a limited number of studies integrate nonmanual and manual cues for SLR.3 Current multimodal SLR systems either integrate lip motion and hand gestures, or classify and incorporate only the facial expression9 or the head motion.10

Recognizing unconstrained, continuous sign sentences is another challenging problem in SLR. The significance of coarticulation effects necessitates the use of language models similar to those used in speech recognition. These complex problems require adequate expertise and technology, such as high-speed cameras that have higher resolution than those commercially available. An ideal automatic SLR system should accurately recognize a large vocabulary set enacted in continuous sentences. Such a system should operate in real time and be able to handle different lighting and environmental conditions. It should exploit both the manual and nonmanual signs and the grammar and syntax of the sign language.

There are many potential application areas for SLR systems that require sign-to-text translation. These applications include human-computer interaction; public information access, such as kiosks; and translation and dialog systems for human-to-human communication. An interesting application area is communicating sign data over a channel. The sign is captured on the sender side, and the SLR output is sent to the receiving side, where it's synthesized and displayed on an avatar. This scheme is more flexible and more bandwidth-efficient than simply sending the captured sign videos.

SignTutor overview

One of the key features of SignTutor is that it integrates hand-motion and hand-shape analysis with head-motion analysis to recognize signs that include both hand gestures and head movements.
This is important because head movements are one aspect of sign language that most students find hard to perform simultaneously with hand gestures. To offer students a learning advantage, in SignTutor we focus mostly on signs that have similar hand gestures and that are mainly differentiated by head movements. In view of the prospective users and use environments of SignTutor, we have opted for a vision-based, user-friendly system compatible with easy-to-obtain equipment, such as webcams. The system operates with a single camera focused on the user's upper body.

Figure 1a shows the graphical user interface of SignTutor. The system follows three steps for teaching a new sign: training, practice, and feedback.

Figure 1. SignTutor GUI: (a) training, practice, information, and synthesis panels; (b) feedback examples.

Training and feedback. For the training phase, SignTutor provides a prerecorded reference video for each sign. The users select a sign from the list of possible signs and watch the prerecorded instruction video of that sign until they are ready to practice. In the practice phase, users are asked to perform the selected sign; their performance is recorded by the webcam. For an accurate analysis, users are expected to face the camera, with the full upper body in the camera's field of view. The recording continues until both hands move beyond the camera's field of view. SignTutor analyzes the hand motion, hand shape, and head motion in the recorded video and compares it with the selected sign. We have integrated several ways for the system to give users feedback regarding the quality of the enacted sign.
The system offers feedback on student success with two components, manual (hand motion, shape, and position) and nonmanual (head and facial-feature motion), along with the sign name (see Figure 1b). Users can watch the video of their original performance. If they performed the sign properly, they can watch an avatar perform an animated version of their performance.

SignTutor modules

SignTutor consists of a face and hand detection stage, followed by an analysis stage and a final sign-classification stage, as Figure 2 shows. The critical part of SignTutor is the analysis and recognition subsystem, which receives the camera input, detects and tracks the hands, extracts features, and classifies the sign.

Figure 2. SignTutor system flow: detection (hand detection and segmentation, face detection), analysis (hand motion: center of mass, velocity; hand shape: area, orientation, and so on; hand position with respect to the face and the other hand; head motion: neutral, no, yes, turn, and so on; eye, eyebrow, and mouth motion: open, close, up, down, and so on), and classification (hidden Markov model classifier, final decision).

Hand detection and segmentation

Although skin-color features can be applied for hand segmentation in controlled illumination, segmentation becomes problematic when skin regions overlap and occlude each other. In sign language, hand positions are often near the face and sometimes have to be in front of the face. Hand detection, segmentation, and occlusion problems are simplified when users wear colored gloves. The use of a simple marker such as a colored glove makes the system adaptable to changing background and illumination conditions. For each glove color, we train its respective histogram of color components using several images. We chose the hue, saturation, and value (HSV) color space due to its robustness to changing illumination conditions.11
The HSV components are divided into bins. At each bin of the histogram, we calculate the number of occurrences of pixels that correspond to that bin, and we normalize the histogram. To detect hand pixels in a scene, we find the histogram bin each pixel belongs to and apply thresholding. To ensure connectivity, we apply double thresholding with a low and a high value: we consider a pixel a hand pixel if its histogram value is higher than the high threshold, and we still label a pixel as a glove (hand) pixel if its value is between the two thresholds, provided one or more of its neighbor pixels is labeled a glove pixel. We take the final hand region to be the largest connected component of the detected pixels.

Analysis of hand motion

The system processes the motion of each hand by tracking its center of mass and estimating, in every frame, the position and velocity of each segmented hand. However, the hand trajectories can be corrupted by segmentation noise. Moreover, the hands might occlude each other, or there might be sudden changes in lighting (for example, a room light turned on or off), which could result in detection and segmentation failures. Thus, we use two independent Kalman filters, one for each hand, to smooth the estimated trajectories. We use a constant-velocity motion model to approximate each hand's motion, neglecting acceleration. When the system detects a hand in the video, it initializes the corresponding Kalman filter. For each subsequent frame, the Kalman filter predicts the new hand position, and the filter parameters are updated with the hand-position measurements found by the hand-segmentation step.
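The segmentation scheme described above, histogram lookup followed by double thresholding and selection of the largest connected component, can be sketched as follows. This is a minimal sketch: the bin count, the threshold values, and the function names are illustrative assumptions, not values from the article.

```python
import numpy as np

# Illustrative parameters (the article does not give exact values):
BINS = 16                     # histogram bins per HSV channel
T_LOW, T_HIGH = 0.02, 0.10    # hysteresis thresholds on normalized histogram values

def train_histogram(hsv_pixels):
    """Build a normalized HSV histogram from glove pixels of several training images.

    hsv_pixels: (N, 3) array with H, S, V scaled to [0, 1).
    """
    idx = (hsv_pixels * BINS).astype(int).clip(0, BINS - 1)
    hist = np.zeros((BINS, BINS, BINS))
    for h, s, v in idx:
        hist[h, s, v] += 1
    return hist / hist.sum()

def segment_hand(hsv_image, hist):
    """Label glove pixels by double thresholding, then keep the largest component."""
    idx = (hsv_image * BINS).astype(int).clip(0, BINS - 1)
    score = hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    strong = score >= T_HIGH
    weak = score >= T_LOW
    # Grow the strong mask into weak pixels that touch an already-labeled
    # pixel (4-connectivity), as in the double-thresholding rule above.
    mask = strong.copy()
    frontier = list(zip(*np.nonzero(strong)))
    while frontier:
        y, x = frontier.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]:
                if weak[ny, nx] and not mask[ny, nx]:
                    mask[ny, nx] = True
                    frontier.append((ny, nx))
    return largest_component(mask)

def largest_component(mask):
    """Return a boolean mask of the largest 4-connected component."""
    labels = np.zeros(mask.shape, dtype=int)
    best, best_size, current = 0, 0, 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue
        current += 1
        stack, size = [(y, x)], 0
        labels[y, x] = current
        while stack:
            cy, cx = stack.pop()
            size += 1
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1] \
                        and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = current
                    stack.append((ny, nx))
        if size > best_size:
            best, best_size = current, size
    return labels == best
```

In practice, OpenCV's histogram back-projection and connected-component routines would replace the explicit loops; the sketch keeps everything in plain NumPy to make the hysteresis rule visible.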
We calculate the hand-motion features from the posterior states of the corresponding Kalman filter, using the x- and y-coordinates of the center of mass and the velocity.12 When there is a detection failure due to occlusion or bad lighting, we use only the Kalman filter prediction, without updating the filter parameters. The system assumes that the hand is out of the camera's view if it detects no hand segment for a number of consecutive frames.

We further normalize the trajectories to obtain translation and scale invariance. To do so, we use a normalization strategy similar to that used in other work.12 We calculate the normalized trajectory coordinates via minimum-maximum normalization. We handle the translation normalization by calculating the midpoints of the range of the x- and y-coordinates, denoted as x_m and y_m. We select the scaling factor, d, as the maximum spread in the x- and y-coordinates, because scaling horizontal and vertical displacements with different factors disturbs the shape. The normalized trajectory coordinates, (⟨x′_1, y′_1⟩, …, ⟨x′_t, y′_t⟩, …, ⟨x′_N, y′_N⟩), with 0 ≤ x′_t, y′_t ≤ 1, are then calculated as follows:

x′_t = 0.5 + 0.5 (x_t − x_m) / d
y′_t = 0.5 + 0.5 (y_t − y_m) / d

Because signs can be two-handed, we must normalize both hand trajectories. However, normalizing the trajectories of the two hands independently might result in data loss. To solve this problem, we jointly calculate the midpoint and the scaling factor of the left- and right-hand trajectories. Following the trajectory normalization, we translate the left- and right-hand trajectories to start at position (0,0).

Extracting features from a 2D hand shape

Hand shape and finger configuration contribute significantly to sign identification because each sign has a specific related movement of the head and hands and specific hand postures. Moreover, there are signs that depend solely on hand shape.
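The joint trajectory normalization described above can be sketched as follows. This is a minimal sketch: the function name and the guard against a zero spread are our additions.

```python
import numpy as np

def normalize_trajectories(left, right):
    """Jointly min-max normalize two hand trajectories into [0, 1] x [0, 1].

    left, right: (T, 2) arrays of (x, y) hand centers of mass per frame.
    The midpoint (x_m, y_m) and the scale d are computed over BOTH hands so
    that their relative geometry is preserved, and a single scale is used for
    x and y so the trajectory shape is not distorted.
    """
    both = np.vstack([left, right])
    lo, hi = both.min(axis=0), both.max(axis=0)
    mid = (lo + hi) / 2.0                        # (x_m, y_m)
    d = max((hi - lo).max(), 1e-9)               # maximum spread over x and y
    norm = lambda t: 0.5 + 0.5 * (t - mid) / d   # x'_t = 0.5 + 0.5 (x_t - x_m)/d
    left_n, right_n = norm(left), norm(right)
    # Finally, translate each trajectory so that it starts at (0, 0).
    return left_n - left_n[0], right_n - right_n[0]
```

With a shared midpoint and scale, a sign whose hands move apart keeps that separation after normalization, which would be lost if each hand were normalized independently.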
We intend our system to work with a single, low-resolution camera whose field of view covers the user's upper body and is not directly focused on the hands. In this setup, we face several difficulties: the low resolution (less than 80 × 80 pixels) of the hand images, segmentation errors due to blurring caused by fast hand movement, and the possibility of obtaining the same binary image, that is, the same silhouette, from more than one hand posture. These problems constrain us to using only low-level features that are resistant to segmentation errors and work well with low-resolution images. To accommodate these requirements, we use simple appearance-based shape features calculated from the hand silhouettes and select the features to reflect differences in hand shape and finger postures. The shape features need to be scale-invariant so that hands with a similar shape but a different size result in the same feature values. However, because our system uses a single camera, we don't have depth information, except for the foreshortening due to perspective. To maintain some information about the z-coordinate (depth), we don't scale-normalize five of the 19 features. Prior to the calculation of the hand-shape features, we take a mirror reflection of the right hand and analyze both hands in the same geometry, with the thumb to the right. Table 1 lists all 19 features. These features are not invariant to viewpoint; users are required to sign facing the camera for an accurate analysis. The classifier is tolerant of the small rotations that can naturally occur while signing.

Seven of the features (1, 2, 4, 5, 6, 7, and 8) are based on the best-fitting ellipse (in the least-squares sense) to the hand silhouette. The inclination angle assumes values in the range [0, 360]. To represent the symmetry of the ellipse, we use sin(2α) and cos(2α) as features, where α is in the range [0, 180]. Features 9 through 16 are based on area filters: we divide the hand's bounding box into eight areas and calculate the percentage of hand pixels in each. We normalize all 19 hand-shape features into values between 0 and 1. We obtain this normalization by dividing the percentage features (9 through 16) by 100, and the remaining features by their range, that is, by using F_n = (F − min)/(max − min), where min is the minimum value of the feature in the training dataset and max is the maximum value. Any value exceeding the [0,1] interval is truncated.

Table 1. Hand-shape features.
• Total area (pixels); bounding-box width and height; best-fitting-ellipse width and height (these five features are not scale-normalized)
• Compactness (perimeter²/area); scale- and rotation-invariant
• sin(2α) and cos(2α), where α is the angle of the ellipse major axis; scale-invariant
• Elongation (ratio of the ellipse major/minor axis lengths); scale- and rotation-invariant
• Percentage of the NW, N, NE, E, SE, S, SW, and W areas of the bounding box filled by hand pixels; scale-invariant
• Ratio of hand pixels outside/inside the ellipse; scale-invariant
• Ratio of hand/background pixels inside the ellipse; scale-invariant

Head-movement analysis

Once we detect the face, we determine rigid head motions, such as head rotations and head nods, by using an algorithm inspired by the human visual system.13 First, we apply a filter following the model of the human retina.14 This filter enhances moving contours with an outer plexiform layer and cancels static ones with an inner plexiform layer. This prefiltering mitigates illumination changes and noise. Second, we compute the fast Fourier transform of the filtered image in the log-polar domain as a model of the primary visual cortex.15 This step extracts two types of features: the quantity of motion and motion event alerts. In parallel, an optic-flow algorithm extracts orientation and velocity information only on the motion event alerts issued by the visual-cortex stage.16 Thus, after each motion alert, we estimate the head velocity at each frame. Figure 3 shows a block diagram of the algorithm.

Figure 3. Algorithm for rigid head-motion data extraction: retina filtering computed on all the frames (outer plexiform layer: contour enhancement; inner plexiform layer: moving-contour extraction), face detection and segmentation, virtual-cortex processing on the face area (fast Fourier transform total-energy computation, motion events), and optic-flow computation (temporal motion, vertical-velocity, and horizontal-velocity evaluation).

After these three steps, the head analyzer outputs three features per frame: the quantity of motion and the vertical and horizontal velocities, as Figure 4 shows. These three features provided by the head-motion analyzer can vary between different performances of the same sign. Moreover, head motion is not directly synchronized with hand motion. To handle the inter- and intrasubject differences, we apply weighted-average smoothing to the head-motion features, with α = 0.5. The smoothed head-motion feature vector at time i, F̃_i, is calculated as F̃_i = αF_i + (1 − α)F̃_{i−1}. This smoothing has the effect of mitigating the noise between different performances of a sign and creating a slightly smoother pattern.

Figure 4. Head-analyzer output for two example videos, (a) with horizontal head motion and (b) with vertical head motion. The first row shows the IPL energy, that is, the quantity of head motion at each frame; frame 68 is annotated as low energy (low/no motion). The second row shows the velocity at each frame, where the red line indicates the horizontal velocity and the blue line indicates the vertical velocity; frame 68 is annotated as a false alarm.

Preprocessing of sign sequences

The recorded video sequences contain frames where the signer is not performing any sign (the beginning and terminating parts) as well as transition frames. We eliminate these frames by using the hand-segmentation step's result as follows:

• Eliminate all frames at the beginning of the sequence until a hand is detected.
• If no hand is detected during the sequence for fewer than N frames, copy the shape information from the last frame where there was still a detection to the current frame.
• If no hand is detected for more than N frames, assume the sign is finished and eliminate the remaining frames, including the last N frames.

After these prunings, we delete transition frames from the start and end of the sequence.

Cluster-based classification

In the classification phase, we use HMMs to model each sign and classify with the maximum-likelihood criterion. HMMs are used in many areas, such as speech recognition and bioinformatics, for modeling variable-length sequences and dealing with temporal variability within similar sequences.17 HMMs are therefore preferred in gesture and sign recognition because they successfully handle changes in the speed of the performed sign as well as slight changes in the spatial domain.

We use a sequential-fusion method for combining the sign's manual and nonmanual parts. The strategy relies on the fact that there might be similar signs that differ only slightly and cannot be classified accurately. These signs form a cluster; their intracluster classification must be handled specially. Thus, we based our sequential-fusion method on two successive classification steps: intercluster and intracluster classification. Because we want our system to be as general as possible and open to new signs, we don't use any prior knowledge about the sign clusters. We let the system discover potential sign clusters that are similar in manual gestures but that differ in nonmanual motions. Instead of using rule-based programming, we extract the cluster information as a part of the recognition system. We make the base decision using a general model, HMM_Manual&Nonmanual (HMM_M&N), which uses both hand and head features in the same feature vector.
However, this approach suffers from high dimensionality and does not effectively use the head information. We therefore use the head information in a dedicated model, HMM_N, which determines the sign likelihoods and gives the final decision.

The system trains and uses the HMM models in two stages. First, the system trains the HMM models, both HMM_M&N and HMM_N. Next, it extracts the cluster information via the joint confusion matrix of HMM_M&N. Summing the validation sets' confusion matrices in a cross-validation stage forms this joint confusion matrix. The system then investigates the misclassifications using the joint confusion matrix. If the system correctly classifies all samples of a sign class, the sign class's cluster contains only itself. Otherwise, for each misclassification, the system marks the confused sign as potentially belonging to the cluster. The marked signs become cluster members when the number of misclassifications exceeds a given threshold. We use 10 percent of the total number of validation examples as the threshold value.

The fusion strategy for a test sample is as follows:

• Calculate the likelihoods of HMM_M&N for each sign class and select the sign class with the maximum likelihood as the base decision.
• Send the selected sign and its cluster information to HMM_N.
• Combine the likelihoods of HMM_M&N and HMM_N with the sum rule and select the sign with the maximum likelihood as the final decision.

The classification module receives all the features calculated in the hand- and head-analysis modules as input. For each hand, there are four hand-motion features (position and velocity in the vertical and horizontal coordinates) and 19 hand-shape features per frame, plus three head-motion features per frame. In addition to these features, we use the hands' relative position with respect to the face's center of mass, yielding two additional features (distance in the vertical and horizontal coordinates) per hand, per frame. We normalize these distances by the face height and width, and we use the normalized and preprocessed sequences to train the HMM models. Each HMM is a continuous, four-state, left-to-right model and is trained for each sign using the Baum-Welch algorithm.

Visual feedback

As one of the feedback modalities, SignTutor provides a simple animated avatar that mimics the user's performance of the selected sign. The synthesis is based on the features extracted in the analysis part. Currently, the avatar mimics only the user's head motion and animates the hand motions from a library. Our head-synthesis system is based on the MPEG-4 Facial Animation standard.18 As input, it receives the motion energy, the vertical and horizontal head-motion velocities, and the target sign. It then filters and normalizes this data to compute the head motion during the considered sequence. The processing result is expressed in terms of facial action points, which the system feeds into the animation player. For the head animation, we use XFace, a 3D-talking-head animation player compliant with MPEG-4.19 We based the hands-and-arms synthesis on OpenGL and prepared an animation explicitly for each sign. We merge the head animation with the hands-and-arms animation to form an avatar with a full upper body.

System accuracy evaluation

We evaluate the recognition performance of our system on a dataset of 19 signs from American Sign Language (ASL). We selected these signs to emphasize nonmanual signs, the effect of small modifications of manual signs, and hand-head coordination. There are eight base signs that represent words, together with their systematic variations in the form of nonmanual signs, or inflections, in the signing of the same manual sign.3 Table 2 lists the signs we used for evaluation. For each sign, we recorded five repetitions from eight subjects. In Table 2, we show the base signs and their variants with
the type of variation, either a hand-motion variation or a nonmanual sign, or both.

Table 2. ASL signs used in SignTutor (each variant differs from its base sign in hand motion, head motion, or both).

Base sign     Variants
Clean         Clean; Very clean
Afraid        Afraid; Very afraid
Fast          Fast; Very fast
Drink         To drink; Drink (noun)
Open (door)   To open door; Open door (noun)
Here          [Somebody] is here; Is [somebody] here?; [Somebody] is not here
Study         Study; Study continuously; Study regularly
Look at       Look at; Look at continuously; Look at regularly

For the rest of this article, we assume that a base sign and its variants form a semantic cluster; we call these clusters base sign clusters.

Classifiers

We compare three different classifiers to show the effect of using the nonmanual information with different fusion techniques on the classification accuracy: classification using only manual information, feature fusion of manual and nonmanual information, and sequential fusion of manual and nonmanual information.

For classification using only manual information, we perform the classification via HMMs that are trained only with the hand-gesture information related to the signs. Because the hands form the basis of the signs, these models need to be very accurate in terms of classification. However, the absence of head-motion information precludes correct classification when signs differ only in head motion. We denote these models as HMM_M.

For feature fusion, we combine the manual and nonmanual information in a single feature vector to jointly model the head and hand information. These models should provide better performance than HMM_M; however, because there is no direct synchronization between hand and head motions, using the head information this way results in only a slight performance increase. We denote these models as HMM_M&N.

With sequential fusion of manual and nonmanual information, our aim, as explained in the previous section, is to apply a two-tier, cluster-based, sequential-fusion strategy.
The first step identifies the performed sign's cluster with the general model, HMM_M&N, and the second step resolves the confusion inside the cluster with a dedicated model, HMM_N, that uses only head information. The head motion is complementary to the sign and thus can't be used alone to classify the signs. However, in the sequential-fusion methodology, denoted HMM_M&N→HMM_N, the head information is used to perform the intracluster classification. We report the results of our sequential-fusion approach with different clusters, first on base sign clusters and then on automatically identified clusters based on the joint confusion matrix.

The merit of a recognition system is its power to generalize a learned concept to new instances. In SLR, there are two kinds of new instances for any given sign: signing by somebody whose videos are in the training set and signing by a new person. For experimental purposes, the former is signer-dependent and the latter is signer-independent. To determine the real performance of an SLR system, we must measure it in a signer-independent experiment.

To report the base sign accuracy, we assume that a classification decision is correct if the classified sign and the correct sign are in the same base sign cluster. The base sign accuracy is important for the success of our sequential-fusion method based on sign clusters. The overall accuracy reports the actual accuracy of the classifier over all the signs in the dataset. These accuracies are reported on two protocols: the signer-independent protocol and the signer-dependent protocol.

Signer-independent protocol

In the signer-independent protocol, we make up the test sets from the performance instances of a group of subjects in the dataset and train the system with sample signs from the rest of the signers. For this purpose, we apply an eight-fold cross-validation, where each fold's test set consists of instances from one of the signers and the training set consists of instances from the other signers. In each fold there are 665 training instances and 95 test instances.

Table 3. Signer-independent test results (accuracy, in percent).

Classifier/accuracy type                    Subj. 1  Subj. 2  Subj. 3  Subj. 4  Subj. 5  Subj. 6  Subj. 7  Subj. 8  Average
HMM_M/base sign                             100.0    100.0    100.0    100.0    96.84    100.0    100.0    100.0    99.61
HMM_M&N/base sign                           100.0    100.0    100.0    100.0    97.89    100.0    100.0    100.0    99.74
HMM_M&N→HMM_N/base sign                     100.0    100.0    100.0    100.0    97.89    100.0    100.0    100.0    99.74
HMM_M/overall                               66.32    77.90    60.00    71.58    57.90    81.05    52.63    70.53    67.24
HMM_M&N/overall                             73.68    91.58    71.58    81.05    62.11    81.05    65.26    77.89    75.53
HMM_M&N→HMM_N/overall, base sign clusters   82.11    72.63    73.68    88.42    56.84    80.00    81.05    71.58    75.79
HMM_M&N→HMM_N/overall, automatic clusters   85.26    76.84    77.89    89.47    63.16    80.00    88.42    75.79    79.61

Table 3 gives the results of the signer-independent protocol. The base sign accuracies of all three classifiers in each fold are 100 percent, except for the fifth signer, where they are slightly lower. This result shows that a sequential classification strategy is appropriate, where specialized models are used in a second step to handle any intracluster classification problem. We can deduce the need for the head-feature data from the high increase in overall accuracy with the contribution of the nonmanual features. With HMM_M&N, the accuracy increases to 75.5 percent, in contrast with the accuracy of HMM_M at 67.2 percent. A further increase in accuracy is obtained by using our sequential-fusion methodology with automatically defined clusters. We also report the accuracy of the sequential fusion with the base sign clusters to show that using those semantic clusters results in an accuracy 4 percent lower than automatic clustering.
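The automatic cluster identification used here, which reads the clusters off the joint confusion matrix with the 10 percent threshold described earlier, can be sketched as follows. Interpreting the threshold per sign class, and the function name, are our assumptions.

```python
import numpy as np

def extract_clusters(joint_confusion, threshold_ratio=0.1):
    """Derive sign clusters from a joint confusion matrix.

    joint_confusion[i, j] counts validation examples of class i classified
    as class j, summed over the cross-validation folds.  A class j joins the
    cluster of class i when the i -> j confusion count exceeds threshold_ratio
    times the number of validation examples of class i (the 10 percent rule
    comes from the text; applying it per class is our interpretation).
    """
    n = joint_confusion.shape[0]
    clusters = []
    for i in range(n):
        members = {i}  # a perfectly classified class clusters only with itself
        total = joint_confusion[i].sum()
        for j in range(n):
            if j != i and joint_confusion[i, j] > threshold_ratio * total:
                members.add(j)
        clusters.append(sorted(members))
    return clusters
```

At test time, the cluster of the base decision from HMM_M&N would then be passed to HMM_N for the intracluster step.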
Signer-dependent protocol
In the signer-dependent protocol, we put examples from each subject in both the test and training sets, although the two sets never share the same sign instantiation. For this purpose, we apply a five-fold cross-validation where, at each fold and for each sign, we place four of each subject's five repetitions into the training set and the remaining one into the test set. In each fold there are 608 training examples and 152 test examples.

Table 4 shows the results of the signer-dependent protocol. The base sign accuracies of the three classifiers are similar to the signer-independent results, but the overall accuracies are much higher. Additionally, the sequential fusion technique no longer contributes significantly, probably as a result of a ceiling effect given the already high accuracies.

Table 4. Signer-dependent test results (accuracy, percent).

Classifier/accuracy type                                        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Average
HMM_M / base sign                                                99.34  100.0    98.68  100.0   100.0   99.61
HMM_M&N / base sign                                             100.0   100.0    98.68  100.0   100.0   99.74
HMM_M&N → HMM_N / base sign                                     100.0   100.0    98.68  100.0   100.0   99.74
HMM_M / overall                                                  92.76   89.47   89.47   94.08   92.76   91.71
HMM_M&N / overall                                                92.76   95.39   93.42   96.05   94.08   94.34
HMM_M&N → HMM_N / overall, base sign clusters                    92.76   95.39   92.76   95.39   94.08   94.08
HMM_M&N → HMM_N / overall, automatically identified clusters     92.76   95.39   92.76   96.05   94.08   94.21

IEEE MultiMedia, January-March 2009

Figure 5. Usability user study questionnaire results: each subject's (S1-S6) and the average score, on a scale from 1 (strongly disagree) to 5 (strongly agree), for computer use level and the statements "I would like to use SignTutor to assist sign language learning," "I found SignTutor easy to use," and "Most people could learn to use SignTutor quickly."

We have conducted a user study to measure SignTutor's real-life performance and to assess the overall system's usability.
Our subjects were volunteer students, two males and four females, taking an introductory Turkish Sign Language course given at Boğaziçi University. Two of the students (one male and one female) were from the computer-engineering department, and the rest were from the foreign language education department. All subjects were highly motivated to take part in the experiment and were excited about SignTutor when they were first told about the system's properties.

We performed the experiments in two sessions. In each session, we asked subjects to practice three signs. We conducted the second session after a time lapse of at least one week following the first session. Before the first session, we presented the SignTutor interface to the subjects in the form of a small demonstration. At the second session, we expected the users to start using SignTutor without any introductory presentation. For each subject, we measured the time spent on each task, with each task defined as the learning, practicing, and evaluating of one of the three signs. At the end of the experiment, we interviewed the subjects and asked them to fill out a short questionnaire. The questionnaire contained three questions, and the subjects scored their responses on five levels, from strongly disagree (1) to strongly agree (5). As Figure 5 shows, the average score for all questions is more than four, indicating the subjects' favorable views of the system's usability.

In session one, we asked the subjects to practice three signs: afraid, very afraid, and drink (as a noun). We selected these three signs because the first two are variants of the same base sign performed with two hands; the third is a completely different sign performed with a single hand. We measured the number of seconds spent on each task; Figure 6a shows the results. The subjects practiced each sign until they received positive feedback.
The average number of attempts was around two. The average time was 85 seconds for the first task and decreased to 60 seconds for the second and third tasks. These results show that the subjects were more comfortable using the system after the first task.

In session two, we asked the subjects to practice another set of signs: clean, very clean, and to drink. As in the first session, the first two signs are variants of the same base sign performed with two hands; the third is a completely different sign performed with a single hand. Of the six subjects who participated in the first session, only five participated in the second. In this session, the subjects started using the system without any help or explanation about usage. Figure 6b shows the results, which reflect that the subjects recalled how to use the system without any difficulty and that the average time spent completing a task decreased with respect to the first session. The average number of attempts was again around two for the first two signs, similar to the first session, which shows that the subjects performed a new sign correctly after roughly two attempts. For the third sign, all the subjects succeeded on their first attempt.

Figure 6. Task analysis for (a) session one and (b) session two. For each session, the number of trials and the average time on a task (in seconds) are plotted for each subject and on average.

During the interviews, the subjects made positive comments about the system in general. They found the interface user friendly and the system easy to use. All of the subjects indicated that using SignTutor during sign language learning helped them learn and recall the signs. The subjects found the practice feature, especially its ability to analyze both hand and head gestures, important. They commented that the possibility of watching the training videos and their own performance together with the animation made it easier to understand and perform the signs. They noted, however, that it would be nice to have more constructive feedback about their own performance, explaining what is wrong and what should be corrected. One of the subjects found the decisions of SignTutor too strict and noted that the parameters for determining a correct sign could be relaxed. The subjects had no difficulty using the glove but commented that it would be better to use the system without any gloves.

One of the subjects suggested adding a picture, possibly a 3D image, of the hand shape used during the sign, in addition to the sign video. In theory, this picture could help students understand the exact hand shape performed during signing.

During the experiments, we were able to observe the system's performance in a real-life setting. An important system requirement is that the camera view should contain the user's upper body. In addition, the distance between the camera and the user should be at least one meter. We performed the experiments in an office environment without using any special lighting device. The hand segmentation requires a reasonably illuminated environment, and the user's clothes should be of a different color than that of the gloves. We found that when these conditions are met, the system works accurately and efficiently.

Conclusion
There are several avenues through which this proof-of-concept tutor system can be improved. Both signer-independent and signer-dependent performance results indicate that the invariance properties of the features must be further investigated. The analysis of nonmanual signals can be improved, and other nonmanual signals, such as facial expressions and body posture, can be incorporated into the system. We can't expect much aid from signer adaptation because students are novices,20 but we can help students in the use of SignTutor by providing them with sign videos from other teachers, text-based descriptions of the sign, pictures of the hand shapes used during the sign, and an interactive 3D animated avatar. The 3D avatar, an advanced version of the one used in this study, lets users watch the sign from different viewpoints. Finally, we are considering giving more instructive feedback to trainees by offering hints, in addition to the present error message, about the correct position of the hands with respect to the body and each other.

Acknowledgments
We developed most of this work in a joint project during the Enterface 2006 Workshop on Multimodal Interfaces, which is supported by the EU FP6 Network of Excellence SIMILAR, the European Taskforce Creating Human-Machine Interfaces Similar to Human-Human Communication. This work has also been supported by Tubitak project 107E021 and Boğaziçi University project BAP-03S106.

References
1. S.K. Liddell, Grammar, Gesture, and Meaning in American Sign Language, Cambridge Univ. Press, 2003.
2. W.C. Stokoe, "Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf," J. Deaf Studies and Deaf Education, vol. 10, no. 1, 2005, pp. 3-37.
3. S.C.W. Ong and S. Ranganath, "Automatic Sign Language Analysis: A Survey and the Future beyond Lexical Meaning," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, 2005, pp. 873-891.
4. F.K.H. Quek, "Toward a Vision-Based Hand Gesture Interface," Proc. Virtual Reality Software and Technology Conf., G. Singh, S.K. Feiner, and D. Thalmann, eds., World Scientific, 1994, pp. 17-31.
5. Y. Wu and T.S. Huang, "Hand Modeling, Analysis, and Recognition for Vision-Based Human Computer Interaction," IEEE Signal Processing Mag., vol. 18, no. 3, 2001, pp. 51-60.
6. T. Starner and A. Pentland, Real-Time American Sign Language Recognition from Video Using Hidden Markov Models, tech. report, MIT Media Laboratory, 1996.
7. C. Vogler and D. Metaxas, "ASL Recognition Based on a Coupling between HMMs and 3D Motion Analysis," Proc. Int'l Conf. Computer Vision (ICCV), IEEE CS Press, 1998, p. 363.
8. G. Fang, W. Gao, and D. Zhao, "Large-Vocabulary Continuous Sign Language Recognition Based on Transition-Movement Models," IEEE Trans. Systems, Man, and Cybernetics, pt. A, vol. 37, no. 1, 2007, pp. 1-9.
9. K.W. Ming and S. Ranganath, "Representations for Facial Expressions," Proc. Int'l Conf. Control, Automation, Robotics, and Vision, vol. 2, IEEE Press, 2002, pp. 716-721.
10. U.M. Erdem and S. Sclaroff, "Automatic Detection of Relevant Head Gestures in American Sign Language Communication," Proc. Int'l Conf. Pattern Recognition, vol. 1, IEEE CS Press, 2002, pp. 460-463.
11. S. Jayaram et al., "Effect of Color Space Transformation, the Illuminance Component, and Color Modeling on Skin Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), IEEE CS Press, 2004, pp. 813-818.
12. O. Aran and L. Akarun, "Recognizing Two Handed Gestures with Generative, Discriminative and Ensemble Methods via Fisher Kernels," Proc. Int'l Workshop Multimedia Content Representation, Classification, and Security, LNCS 4105, Springer-Verlag, 2006, pp. 159-166.
13. I. Fasel et al., Machine Perception Toolbox, API Specification for MPT-1.0, UCSD Machine Perception Lab.; http://mplab.ucsd.edu/grants/project1/free-software/MPTWebSite/API/.
14. W. Beaudot, The Neural Information Processing in the Vertebrate Retina: A Melting Pot of Ideas for Artificial Vision, doctoral dissertation, Dept. Computer Science, INPG, France, 1994.
15. A. Benoit and A. Caplier, "Head Nods Analysis: Interpretation of Non Verbal Communication Gestures," Proc. IEEE Int'l Conf. Image Processing (ICIP), vol. 3, IEEE Press, 2005, pp. 425-428.
16. A.B. Torralba and J. Herault, "An Efficient Neuromorphic Analog Network for Motion Estimation," IEEE Trans. Circuits and Systems I, special issue on bio-inspired processors and CNNs for vision, vol. 46, no. 2, 1999, pp. 269-280.
17. L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, 1989, pp. 257-285.
18. I.S. Pandzic and R. Forchheimer, MPEG-4 Facial Animation: The Standard, Implementation and Applications, Wiley, 2002.
19. K. Balci, "Xface: Open Source Toolkit for Creating 3D Faces of an Embodied Conversational Agent," Proc. 5th Int'l Symp. Smart Graphics, LNCS 3638, Springer-Verlag, 2005, pp. 263-266.
20. S.C. Ong, S. Ranganath, and Y. Venkatesha, "Understanding Gestures with Systematic Variations in Movement Dynamics," Pattern Recognition, vol. 39, no. 9, 2006, pp. 1633-1648.

Oya Aran is a PhD candidate at Boğaziçi University working on dynamic-hand-gesture and sign-language recognition. Her research interests include computer vision, pattern recognition, and machine learning. Aran has an MS in computer engineering from Boğaziçi University, Istanbul. She is a student member of the IEEE. Contact her at [email protected].

Ismail Ari is an MS student in computer engineering at Boğaziçi University working on facial-feature tracking and recognition for sign language. His research interests include computer vision, image processing, and computer graphics. Ari has a BS in computer engineering from Boğaziçi University. Contact him at [email protected].

Lale Akarun is a professor of computer engineering at Boğaziçi University. Her research interests include face recognition, modeling, animation, human activity, and gesture analysis. Akarun has a PhD in electrical engineering from Polytechnic University, New York. She is a senior member of the IEEE. Contact her at [email protected].

Bülent Sankur is a professor in the department of electric and electronic engineering at Boğaziçi University. His research interests include digital signal processing, image and video compression, biometry, cognition, and multimedia systems. Sankur has a PhD in electrical and systems engineering from Rensselaer Polytechnic Institute, New York. He is on the editorial boards of three journals on signal processing and a member of the European Association for Signal Processing's Administrative Committee. Contact him at [email protected].

Alexandre Benoit is a post-doctoral member of the Image Video Communication team at the Institut de Recherche en Communications et en Cybernétique de Nantes (IRCCYN), France, where he works on image quality perception. His research interests include visual perception and image processing, retina and cortex models, and their use in computer vision. Benoit has a PhD from the Images and Signals Department of the Grenoble Image, Parole, Signal Automatique (GIPSA)-Lab, France. Contact him at [email protected].

Alice Caplier is a permanent researcher in the Image and Signal Department of the GIPSA-Lab. Her research interests include human gesture analysis and interpretation based on video data. Caplier has a PhD in image processing from the Institut National Polytechnique de Grenoble. Contact her at [email protected].

Pavel Campr is a PhD candidate and teaching assistant in the Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic. His research interests include sign language recognition and computer vision. Campr has an MS in cybernetics from the University of West Bohemia. Contact him at [email protected].

Ana Huerta Carrillo works as an engineer on mobile systems at Telefónica. Her research interests include computer graphics and animation. Huerta Carrillo has an MS in electrical engineering from the Speech Technology Group at the Technical University of Madrid. Contact her at [email protected].

François-Xavier Fanard is a PhD candidate in the communication and remote sensing laboratory (TELE) at the Université Catholique de Louvain, Louvain-la-Neuve, Belgium. His research interests include facial synthesis using the MPEG-4 facial animation standard and MPEG-4 Advanced Video Coding video compression. Fanard has a BS from the Université Catholique de Louvain. Contact him at [email protected].