The Chinese University of Hong Kong
This paper explores the potential for the use of speech recognition technology to test the oral proficiency of learners of English as a second language. The study described here follows from an earlier study in which subjects read a text aloud into the speech recognition software Dragon NaturallySpeaking and extends that study to include the analysis and scoring of the output of the speech recognition software by means of a computer program. Results indicate that the computer program discriminated well between the native and non-native speakers and also correlated highly with the results of traditional analyses. The paper concludes with a note of circumspection by pointing out that the current implementation of speech recognition technology is still far from being readily usable, primarily because the technology is speaker-dependent and has to be trained to recognize each person's voice.
Speech Recognition, English Language, Assessment, Parsing
The purpose of the current study is to examine the extent to which it is feasible to assess students' oral abilities in English by parsing the output text produced by Speech Recognition (SR) technology, from which scores can then be derived. However, before the parsing of the output text is even considered, the accuracy of texts produced by SR technology needs to be determined. If the output text is unreliable, it follows that any attempt at analysis of that output text will also be unreliable. Manufacturers of SR software make great claims for the accuracy of the software. For example, Dragon Systems' promotional literature claims an accuracy rate of 95% or better.1 To verify these claims and ultimately to ascertain the viability of speech recognition software with second language speakers, the author conducted two experiments, one with native speakers and a subsequent one with non-native speakers (see Coniam, in press a). These experiments showed that the claims of accuracy made by the SR software manufacturers are far from being achieved. The calculation of accuracy appears to be based on how accurately words are transcribed from spoken forms, which is only one benchmark of accuracy. The experiments mentioned above analyzed the accuracy of texts produced by native speakers and non-native speakers not only for word accuracy but also in terms ranging from minor sense groups (e.g., prepositional phrases, noun phrases, and verb phrases) to major sense groups (t-units, i.e., nuclear sentences consisting of subject-verb-[complement]). The output produced by native speakers was significantly more accurate than that produced by very competent non-native speakers. This consistency of results indicates that SR has potential for language assessment but may not yet be a completely viable means of assessment. Because the language output produced during the experiments forms the basis of the linguistic analysis described here, the experiments are described in more detail below.
The computer industry has been exploring SR as a mode of input to the computer in preference to that of the keyboard since IBM's initial experiments in the 1950s. However, it is only in the last decade that SR has been developed to the point at which it is more of a reality than science fiction. The new generation of SR claims to be able to process continuous speech and no longer requires pauses between words. Continuous speech recognition certainly holds greater promise for man-machine interaction than discrete word speech recognition (see, e.g., Kempainen, 1997). For the present, however, as Noyes and Starr (1997) have illustrated by the use of SR with disabled users and users in the avionics industry, the low accuracy produced by current programs means that SR is still a long way from being either practical or acceptable.
One exception to the generally low accuracy rates is the PhonePass system, which appears to have high reliability rates for assessment purposes (Townshend et al., 1998). The system consists of an extensive database of native and non-native spectrographic wave patterns for a pre-established series of questions. Test scores are calculated essentially in terms of correlations between subjects' responses and the sample wave patterns contained in the database. The study described here differs in that subjects' interpreted SR output is analyzed syntactically and scored on the basis of that syntactic analysis.
Part of the attraction of SR software for English language teaching lies in the fact that it does not simply produce a phonemic transcription but attempts to interpret input as recognizable English structures. This interpretative effort is important in the case of second language learners since what might be called standard deviant forms for a particular language group (e.g., /l/ ~ /n/ or /v/ ~ /w/) do not appear in incorrect forms in the output. For example, when one of the subjects was asked to enunciate clearly the phrase Peter was an angly man as an example of /l/ versus /r/ in Chinese, the SR software produced Peter was an angry man. The value of such an interpretation lies in the fact that there is a strong collocational association between angry and man but not at all between angly and man. For second language speakers who share common phonological problems particular to a specific language group, the software's attempt to make collocational interpretations means that non-English word forms do not appear in the output. This advantage underlies much of the software's potential as a teaching and assessment tool.
THE PREVIOUS STUDIES
The software used in previous studies was Dragon Systems' Dragon NaturallySpeaking (DNS) because it was one of the first programs to use continuous voice recognition technology. The program was installed on a 166 MHz PC with 32 MB of RAM. The DNS microphone parameters were set at the same levels for all subjects; Volume was set to normal and Quality to average. Because SR programs are still speaker-dependent, they have to be adapted to a single speaker's voice. The subjects therefore had to undergo a training routine. The training program proposed by Dragon Systems, in which subjects read a text (or texts) of their own selection while the program adapts to their voice, is supposed to take about 30 minutes. One of the training texts supplied for DNS is an excerpt from Arthur C. Clarke's novel 3001 consisting of approximately 3,850 words in 230 sentences. To maximize the program's recognition potential, all subjects were asked to read the whole excerpt from 3001, taking approximately 60 to 90 minutes rather than the standard 30 minutes.
The extract from 3001 was selected as training material because an excerpt from Clarke's 2001 is also included as training text material in the software package, and the use of the 3001 extract permitted a comparison with the similar 2001 passage. Since the 2001 passage is supplied with DNS, a substantial amount of the vocabulary, which might otherwise fall outside the recognition expectations of the program, was already part of the program's lexicon. For example, the lexicon includes infrequent words such as hull, mimic, and heuristically, as well as proper nouns such as Bowman. The passage from 2001 is shorter, consisting of 1,050 words in 44 sentences. Although some words were unfamiliar to the second language speakers (e.g., algorithmic, hull leakage), their unfamiliarity did not represent a problem since the teachers were accustomed to reading-aloud tasks. In addition, because the unfamiliar words were technical and therefore likely to be unknown to the speakers in their native language, there was little probability of biased results in the experiment.
Subjects were first asked to read through the two passages so they would not have to complete the task as an unseen dictation. Subjects were directed to maintain the features of natural speech and asked to read the 1,050-word extract at a speed of approximately 100 words a minute. This speed necessarily incorporates concatenation and weak morphophonemic forms of natural speech in the output text.2 In cases in which subjects stumbled in a sentence, they were asked to read the sentence again.
Twenty subjects took part in the experiment (ten native speakers of English and ten non-native speakers). All were secondary or tertiary teachers of English with a minimum of a master's degree in English. The non-native speakers were all very competent speakers of English and would have scored at least a 7 on the IELTS oral scale.3 The phonological features of the non-native speakers were also examined to determine the extent to which phonological features of a particular language group may account for certain types of errors made by SR software.4
Methods of Analysis in the Previous Studies
Grammatical accuracy was calculated according to four categories of analysis, from the macro level down: t-units, clauses, phrases, and individual words (see Hunt, 1970; Halliday, 1994). Manual analysis of speakers' output texts involved considerable calculation of errors in the approximately 1,600 units of analysis for each text. (For details, see Coniam, in press b.) Neither proper nouns nor punctuation were scored, but genitive constructions (e.g., ship's) were, since these structures constitute an integral part of the meaning of noun phrases. No deductions were made for program overgeneration, i.e., the inclusion of words in subjects' output texts that were not in the target text. Tables 1 and 2 present a summary of the results of the analysis of the subjects' texts.
Table 1 shows substantial differences in the mean scores between the native speaker and non-native speaker groups. T-tests revealed significant differences between the groups for each of the four categories of analysis (p < .01).5 To highlight the differences between the two groups, Table 2 presents two points of comparison: the means of the native speaker group with the maximums of the non-native speaker group and the means of the non-native speaker group with the minimums of the native speaker group. The highest non-native speaker maximum does not reach the mean of the native speaker group on any category of analysis, from t-units to individual words. Three of the four categories show a differential of some 2% between the non-native speaker maximum scores and the mean of the native speakers, and the fourth category (clauses) shows a gap of 7.6%. Conversely, the lowest native speaker scores fall below the mean of the non-native speaker group in only two of the four categories. While the low accuracy figures for the non-native speakers indicate that SR technology needs to be further refined, the consistency of the results of the analyses suggests that it may have potential for use as an assessment tool, at least with non-native speakers at higher levels of proficiency.
THE CURRENT STUDY
Whereas the previous studies involved analysis of speakers' texts by hand, the goal of the current study is to analyze the output texts by a computer program specifically designed for this purpose. Figure 1 presents a screen shot of the program.
Figure 1 Speech Recognition Analysis Program
The screen contains two text fields. The target text is loaded in the upper text field, and the text to be analyzed in the lower field. The pull-down menus offer users various options. For example, the Test menu includes an option to save analyzed texts as database files, and the Options menu allows users to adjust the marking scheme and to set proper nouns to be analyzed or ignored. An option, discussed below, is also available here to help clean up subjects' texts prior to analysis. When users are ready to begin analyzing a text, they click on the “Compare” button.
The Parsing Algorithm
The parsing algorithm used in this study was developed originally for scoring short strings of words for computerized dictation tests (Coniam, in press b). Scoring is calculated on an individual word basis and takes three factors into account:
1. the correct position of words in the input string;
2. the correct inflection of words; and
3. the lexical or grammatical nature of words (i.e., content words versus function words).
The parsing that the program performs is quite minimal and makes use of two components: a lexicon and a simple parsing algorithm.
1. Lexicon

The lexicon consists of three databases. The main database is a reduced version of a word list obtained from analysis of the 1995 Bank of English corpus and consists of the approximately 50,000 most frequent word types in English. The lexicon also contains two smaller databases: a) a numerical database consisting of the main cardinal and ordinal numbers with their corresponding digits and b) an irregular verb database. The irregular verb database enables the parsing algorithm to handle past tense and past participle forms which do not appear in the main word list (e.g., ate, eaten).
2. Simple Parsing Algorithm
Words are analyzed according to their permitted grammatical inflections as listed in Table 3.
The program begins by parsing the words in the original target string. It takes each target word, strips away possible inflections, and searches for the word in the database containing word class and word frequency information. If the word is found in the database, the word's base form is established along with the word's possible word classes. The program then parses the subject's input string in a similar way. It strips possible inflections off each word and searches for a match between the base form of the word in the subject's input and the base form in the target string. Finally, the program calculates a score by comparing the words in the subject's input string against the inflection and position of the words in the target string.
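As a rough sketch of this lookup procedure (the suffix list, lexicon, and irregular verb table below are illustrative stand-ins for the program's actual databases, and niceties such as consonant doubling and final-e restoration are omitted):

```python
# Illustrative miniature lexicon and irregular verb table; the actual
# program draws on some 50,000 frequent word types and a separate
# irregular verb database.
LEXICON = {"need": {"noun", "verb"}, "miss": {"noun", "verb"},
           "type": {"noun", "verb"}, "eat": {"verb"}}
IRREGULAR = {"ate": "eat", "eaten": "eat"}
SUFFIXES = ["ing", "ed", "est", "er", "es", "s"]

def base_forms(word):
    """Strip possible inflections and return the candidate base forms
    that appear in the lexicon (irregular verbs are checked first)."""
    if word in IRREGULAR:
        return [IRREGULAR[word]]
    candidates = [word] + [word[:-len(s)] for s in SUFFIXES
                           if word.endswith(s) and len(word) > len(s)]
    return [c for c in candidates if c in LEXICON]
```

Once a base form such as need is recovered, its permitted word classes in the lexicon can be checked against the inflection actually found on the word, which is how a form like needly can be rejected: the -ly inflection would require an adjectival base.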
A SAMPLE ANALYSIS
Sample analyses from the earlier computer dictation test are described below in order to view the parsing process in its proper perspective and to understand the adaptations that had to be made to the algorithm for the analysis of the SR output files in the current study. Consider the following sentence in which the words in italics form the target string against which student input will be compared.
In this sentence, eight words which need to be typed in are missing.
On the computer dictation test, the total score for a string is calculated by adding up the score for each word and dividing that score by the number of words in the target string. Table 4 illustrates the different weights applied to lexical and grammatical words and to the degrees of correctness for grammatical inflections and position in the string.
Now consider two subjects' input strings and the entries in Table 5 which illustrate the analysis and scores for these two strings.
Target: which need to be typed in are missing
Subject 1: which is needing to typing are missing
Subject 2: what is needly to be typest misses
Subject 1 received 1 point for which (correct form and correct position), 0 for is because the word is not in the target string, 0.4 for needing because the word is incorrectly inflected and not in the correct position, 0.5 for to since the form of the word is correct, but it is a grammatical word in the wrong position, 0.6 for typing since it is in the correct position but incorrectly inflected, 0.5 for are since it is a grammatical word in the wrong position, and, finally, 0.8 for missing since it is in the correct form but in the wrong position. Subject 1's score is therefore a total of 3.8. Dividing this number by 8 (the number of words in the target string) gives a final score of 48%.
For Subject 2, the program extracted need as the base form of needly, but the two entries for need in the lexicon are marked as noun and verb. (See Table 3 above.) The score for needly is therefore zero. Similarly for typest, type does not have an entry in the lexicon marked as an adjective, and hence the -est inflection on type also receives a score of zero. The word misses receives a score of 0.4 because it has the correct base form, but it is wrongly inflected and in the wrong position. Subject 2's final score is therefore 1.4 out of 8, i.e., 18%.
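The weighting just illustrated can be sketched in code. This is a simplified reconstruction from the worked example: the per-word judgments of form and position are supplied by hand here, whereas the actual program derives them from its parse of the two strings, and the weight for a wrongly inflected grammatical word is not given in the example, so the sketch treats any imperfect grammatical word as 0.5.

```python
# (form_ok, position_ok) -> weight for lexical (content) words,
# following the listening-test scheme described above
LEXICAL_WEIGHTS = {(True, True): 1.0, (True, False): 0.8,
                   (False, True): 0.6, (False, False): 0.4}

def score_word(in_target, grammatical, form_ok, position_ok):
    if not in_target:          # word absent from the target string
        return 0.0
    if grammatical:            # function words: full credit or 0.5
        return 1.0 if (form_ok and position_ok) else 0.5
    return LEXICAL_WEIGHTS[(form_ok, position_ok)]

# Subject 1: "which is needing to typing are missing"
# fields: (in_target, grammatical, form_ok, position_ok)
judgments = [
    (True,  True,  True,  True),   # which   -> 1.0
    (False, True,  True,  False),  # is      -> 0.0 (not in target)
    (True,  False, False, False),  # needing -> 0.4
    (True,  True,  True,  False),  # to      -> 0.5
    (True,  False, False, True),   # typing  -> 0.6
    (True,  True,  True,  False),  # are     -> 0.5
    (True,  False, True,  False),  # missing -> 0.8
]
total = sum(score_word(*j) for j in judgments)  # 3.8
final = total / 8                               # eight target words
```

Summing the seven weights reproduces Subject 1's total of 3.8, and dividing by the eight words of the target string gives 47.5%, rounded to 48% above.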
Although the scoring method may seem ad hoc, it does reflect a measure of objectivity in evaluating a subject's performance. The difference between Subject 1's score of 48% and the 18% score of Subject 2's less intelligible answer serves to indicate the difference in the meaning/acceptability of the two sentences. The results also accord moderately well with subjective judgments of acceptability made by a number of native speakers.
The program did not score proper nouns in these analyses since the setting for scoring proper nouns was left at its default value to ignore them. The program contains a Span option, however, so that proper nouns can in fact be scored if required. The Span setting causes the program to look at the words to the left and right of the word currently being processed (default value of 5 words on either side). The program will then score proper nouns if they fall within the span.
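A minimal sketch of such a windowed lookup (the function name and the list-of-words representation are illustrative, not the program's own):

```python
def find_within_span(word, subject_words, center, span=5):
    """Search for `word` in the subject string within `span` words on
    either side of the current position (the default setting is 5).
    Returns the index of the match, or None if none falls in the span."""
    lo = max(0, center - span)
    hi = min(len(subject_words), center + span + 1)
    for i in range(lo, hi):
        if subject_words[i] == word:
            return i
    return None
```

With the default span of 5, a proper noun displaced by up to five positions from its expected place would still be matched and scored; anything further afield is treated as absent.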
As a final note, it is not anticipated that subjects will be able to read long preliminary training texts without making errors. (The test text from 2001 contained 1,050 words.) With texts of such length, subjects will need to reread sentences when they make mistakes. In the study described here, subjects were told that if they made an error and felt that they needed to reread a sentence, they were to say “error,” abandon the reading of the current sentence, and begin it again. Because the study described here is experimental in nature, a certain amount of cleanup of the subjects' texts had to be done manually. However, an editable list of words and phrases that signal that the reading of a sentence has gone wrong, and that the sentence should therefore be excluded from the analysis, has now been incorporated into the program. The inclusion of this list is a first step towards reducing post-editing. If a great deal of post-editing has to be done, computer-based analysis of texts could well be viewed as redundant or at least unnecessarily time consuming.
RESULTS AND DISCUSSION
The analytic procedures of the program are illustrated below in the detailed analysis of one of the sentences taken from a non-native speaker's output text.
It was mounted like a gunsight on the rim of the ship's long-range antenna, and checked that the great parabolic bowl was rigidly locked upon its distant target.
It was mounted lighted coincide on remote issues of long-range antenna and check that the great pair aboard bowl was rigidly lots upon its distant targets.
Table 6 presents the analysis of the sentence.
The analysis of individual words specifies whether the word is completely correct or whether it has the correct base form but is wrongly inflected, and then assigns a score out of 1. As mentioned earlier, the original scoring system for the listening tests involved short strings and deducted points for words that were correct in form but in the wrong position. Positional deductions had to be relaxed in the analysis of the SR output described here because the program was called upon to analyze complete sentences in which many words can be out of position in the subject string.6 Consequently, correctly inflected words in the wrong position receive 0.9 and incorrectly inflected words in the wrong position 0.6, compared to 0.8 and 0.4, respectively, in the earlier listening test. A grammatical word in the wrong position receives a score of 0.5. This slightly lower weighting is an attempt to take into account the high frequency of grammatical words and the fact that they may occur in almost any given sentence.
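In code, the relaxation amounts to swapping in a different weight table for lexical words while leaving the rest of the scoring untouched. The tables below are keyed by whether the word's form and position are correct; values are taken from the comparison above, and the correct-position values are assumed to carry over unchanged from the earlier scheme.

```python
# Lexical-word weights, keyed by (form_ok, position_ok).
LISTENING_TEST_WEIGHTS = {(True, True): 1.0, (True, False): 0.8,
                          (False, True): 0.6, (False, False): 0.4}

# Relaxed scheme for full-sentence SR output: wrong position is
# penalized less (0.9 and 0.6 instead of 0.8 and 0.4); the
# correct-position values are assumed unchanged.
SR_OUTPUT_WEIGHTS = {(True, True): 1.0, (True, False): 0.9,
                     (False, True): 0.6, (False, False): 0.6}

# A grammatical word in the wrong position scores 0.5 in both schemes.
GRAMMATICAL_WRONG_POSITION = 0.5
```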
Overall, the subject scored 56.6%. Even though the subject's sentence did not encapsulate the full meaning of the target sentence, it captured several essential elements: It was mounted ... on ... long-range antenna; check that the great ... bowl ... upon its distant targets. The analysis of the sentence also shows that the positional relativity of words is lost after the third word and is never fully regained. The new relaxed scoring procedures for the position of words are designed to mitigate this effect. It might also be possible to realign the words in the two strings every so often, say, after some number of correct words have been analyzed. (See, for example, the guidelines proposed by Melamed (1998) in the Blinker project for the alignment of words between parallel texts.)
As a point of comparison, it is interesting to consider the computer's score for this sentence in the context of the scores created by manual analysis. The scores of the four syntactic categories by manual analysis and the overall computer score are listed in Table 7.
The score of 56.6% is close to the score of the phrase level analyses. This correspondence is reasonable, given that the program's analytic procedures operate at a level slightly above that of the word. Table 8 presents the means of the native speaker and non-native speaker groups by manual analysis (taken from Table 1 above) and the means of these groups calculated by computer analysis.
As can be seen in Table 8, the results of the computer analysis generally corroborate the results of the manual analyses. The difference between the means of the two groups scored by computer analysis, 7.4%, compares favorably to the differentials at the phrase level and word level of analysis. The fact that a t-test shows that the mean of the native speaker group is significantly higher than that of the non-native speaker group (p < .05) again illustrates the potential of the program.
Table 9 contains the final set of results. The data consist of the individual scores for the subjects in the non-native speaker group. In addition to discriminating between native speakers and non-native speakers, the data in Table 9 show that the software discriminated between subjects in the same group.
The scores based on the computer analysis approximate the phrase level and the word level scores of the manual analyses and are closest to those at the phrase level. As mentioned earlier, this result is understandable since the analysis performed by the computer program does not operate simply at the word level. The fact that words are analyzed for correctness of form and position reflects at least a moderate degree of syntactic analysis.
Subject J has the highest scores in all of the manually analyzed categories and also in the computer analysis. Conversely, Subject A has the lowest scores in all the manually analyzed categories as well as in the computer analysis. Although the sample size is very small, correlations between the five variables were all highly significant (p < .001) and suggest that the computer analysis is as statistically reliable as the manual analysis of subjects' output texts.
This study has illustrated the use of a computer program to analyze and score subjects' output transcribed by means of SR technology. It has shown that a relatively unsophisticated parsing algorithm linked to an extensive lexicon can produce results as reliable as those performed by hand. The computer generated scores correlated highly with all four manually scored categories of analysis with the highest correlation at the phrase level. This finding can be attributed to the fact that the analysis conducted by the computer program includes more than a simple word level analysis. The program not only determines whether a word is correctly inflected and in its correct position in the sentence but also differentiates between lexical and grammatical words.
The computer program's analysis of the texts produced by native speakers and non-native speakers indicated that the native speakers had significantly higher scores than the non-native speakers and corroborated the t-test results of the manual analyses. The significant correlations found between the subjects' computer scores and manual scores emphasize the essential reliability of the computer procedures.
The non-native speakers who participated in this study were all very proficient speakers of English, IELTS Band 7 and above. It remains to replicate the experiment with slightly less proficient subjects (with scores around IELTS Band 5, for example) to determine the extent of the accuracy and reliability of the computer analysis. Band 5 may well be a threshold below which the program's interpretive/collocational algorithms would struggle to produce sensible analyses and scores.
With regard to the SR software itself, the technology has not yet been developed to the point at which it is readily usable. A robust generic speech model is not yet available, which means that the software has to be trained to each individual speaker's voice.7 As SR technology improves, becoming more speaker independent and more accurate in its interpretation of speech (perhaps to accept the phonological features of a particular language group), the potential use of the software as a testing instrument will increase.
Evidence suggests that reading aloud is a good indicator of overall ability in English.8 The consistent figures that emerged with the output from the second language group of subjects in this study suggest that the use of SR technology in computer-based testing has considerable potential, for example, in situations in which oral assessment cannot be conducted in face-to-face interviews.
Design in assessment generally requires substantially greater sophistication and rigor than does design in teaching. If a procedure works for testing purposes, then conditions may be somewhat relaxed for pedagogical implementations (see Coniam, 1997; in press a). The accuracy of the texts produced by the SR procedures described here is therefore of great importance to the viability of SR for both assessment and instructional purposes.
Great claims have been made about the new generation of SR software in terms of accuracy and how users interact with computers. Some magazine writers already claim that it is a great improvement and that they use Dragon NaturallySpeaking to write articles (Manes, 1997; Miller, 1997). However, Microsoft Chief Technology Officer Nathan Myhrvold (1997) observes in an interview with PC Magazine that Microsoft has not been making as much progress with SR as it would have liked, commenting, “I said 5 years ago that good [SR] products were still 5 years away, and today I'll say it again: Good products are still 5 years away.” Because Myhrvold is addressing the use of the technology in the interpretation of the speech of native speakers, implementation with second language learners will take considerably longer. Nonetheless, it is apparent from the results of this small-scale study that the use of speech recognition technology as an assessment instrument with second language learners has substantial potential for future applications.
NOTES

1 See the Dragon Systems Web site at http://www.dragonsys.com/marketing/pcproducts.html
2 Reading in citation forms (the method used in previous generations of SR software) effectively precludes a reading speed of 100 words a minute.
3 The International English Language Testing System (IELTS) is the UK and Australian equivalent of TOEFL. The oral scale ranges from 1 (Non User) to 9 (Expert User); an overall band score of 6 or 6.5 is required for acceptance into an academic program (see UCLES, 1997). For reference purposes, the IELTS bands for 6 and 7 are defined as follows:
6 Competent User
Has generally effective command of the language despite some inaccuracies, inappropriacies and misunderstandings. Can use and understand fairly complex language, particularly in familiar situations.
7 Good User
Has operational command of the language though occasional inaccuracies, inappropriacies and misunderstandings in some situations. Generally handles complex language well and understands detailed reasoning.
4 A brief description of the phonological factors which characterize the spoken English of Hong Kong Cantonese speakers of English can be found in Kenworthy (1987: 128-131) and Foulds (1978).
5 Given the small sample size, the number of subjects is, strictly speaking, below the threshold for t-test reliability (Ramsey, 1980). The t-test analyses are included here for illustrative purposes.
6 In the test passage from 2001, maximum sentence length was 52 words with an average of 22 words per sentence.
7 The fact that future generations of SR software need to be speaker independent can clearly be seen in the move to have voice recognition systems working over the Internet. See, for example, the design goals for the Java Speech API at http://www.javasoft.com/marketing/collateral/speech.html.
8 See Coniam (1990) for a discussion of the correlation between fluency in reading aloud and the written papers in a public examination.
REFERENCES

Clarke, A. C. (1968). 2001: A space odyssey. London: Harper Collins.
Clarke, A. C. (1997). 3001: The final odyssey. London: Harper Collins.
Coniam, D. (1990). Reading aloud speed and language proficiency. Hong Kong Papers in Linguistics and Language Teaching, 13, 47-53.
Coniam, D. (1997). A computerised English language proofing cloze program. Computer Assisted Language Learning, 10 (1), 41-56.
Coniam, D. (in press a). Speech recognition—accuracy in the speech-to-text process. TEXT Technology.
Coniam, D. (in press b). Interactive evaluation of listening comprehension: How the context may help. Computer Assisted Language Learning.
Foulds, D. (1978). The introduction of English pronunciation to Cantonese students. The English Bulletin, Hong Kong, 17, 1.
Halliday, M. A. K. (1994). An introduction to functional grammar (2nd ed.). London: Arnold.
Hunt, K. W. (1970). Syntactic maturity in school children and adults. Monographs of the Society for Research in Child Development, 35 (1, Serial No. 135). Chicago: University of Chicago Press.
Kempainen, S. (1997). Automatic speech recognition lets machines listen and comprehend. EDN, 42 (5), 73-80.
Kenworthy, J. (1987). Teaching English pronunciation. London: Longman.
Manes, S. (1997, October). Speech recognition: Now you're talking. PC World, 67.
Melamed, D. (1998). Annotation style guide for the blinker project, Version 1.0.4 (Department of Computer and Information Science Technical Report). Philadelphia: University of Pennsylvania.
Miller, M. J. (1997, September). I. Want. To. Talk. To. You. PC Magazine, 16, 4.
Noyes, J. & Starr, A. (1997). Use of automatic speech recognition: Current and potential applications. Computing and Control Engineering Journal, 7 (5), 203-208.
Where Microsoft wants to go today: Interview with Microsoft Research and Development leader, Nathan Myhrvold. (1997, September 9). PC Magazine, 30.
Ramsey, P. (1980). Exact type 1 error rates for robust Student's T-test with unequal variances. Journal of Educational Statistics, 5, 337-349.
Townshend, B., Bernstein, J., de Jong, J., & Cascallar, E. (1998, March). Validation of an automatic spoken language test. Paper presented at the Language Testing Research Colloquium. Monterey, CA.
University of Cambridge Local Examinations Syndicate (UCLES), The British Council, IDP Education Australia. (1997). The international English language testing system: The annual review 1996/1997.
I would like to thank COBUILD of the University of Birmingham for access to the Bank of English corpus.
David Coniam is an Associate Professor in the Faculty of Education at The Chinese University of Hong Kong. He is a teacher educator and works with ESL teachers in Hong Kong secondary schools. His main publication and research interests are in computational linguistics, language testing, and language teaching methodology.
Faculty of Education
The Chinese University of Hong Kong
Phone: (852) 2609 6917
Fax: (852) 2818 6591