The influence of examiners' characteristics on IELTS and TOEFL band scores and rating


The influence of rater characteristics and other rater background factors


(Based on research conducted at Teachers College, Columbia University)


Complementing the studies that looked at how raters differ, studies on the effects of rater background factors attempt to explain why raters differ, with increasing attention to the effects of rater language background, rater expertise, and rater training on raters’ cognitive processes and rating behaviors. Findings from both types of studies can be combined to provide a useful frame of reference for conceptualizing rater cognition in future research.


Rater language background (i.e., native/non-native speaking rater comparisons, matches between rater and examinee language backgrounds) has received major attention among researchers in L2 speaking assessment. A representative study that examined the cognitive differences between native and non-native speaking groups of raters was conducted by Zhang & Elder (2011, 2014), who investigated ESL/EFL teachers’ evaluation and interpretation of oral English proficiency in the national College English Test-Spoken English Test (CET-SET) of China. They found that NS raters attended to a wider range of abilities when judging candidates’ oral test performance than NNS raters. NS raters also tended to emphasize features of interaction, while NNS raters were more likely to focus on linguistic resources such as accuracy. Similarly, Gui (2012) investigated whether American and Chinese EFL teachers differed in their evaluations of student oral performance in an undergraduate speech competition in China. He found that the American raters provided more specific and elaborated qualitative comments than the Chinese raters. The raters also differed in their judgments of students’ pronunciation, language usage, and speech delivery. One unique difference was related to raters’ comments on students’ nonverbal communication skills. The Chinese raters provided mostly positive comments about the gestures and other non-verbal demeanors of the students as a group, while the American raters were mostly critical. Both Zhang & Elder’s (2011, 2014) and Gui’s (2012) studies offer interesting revelations as to the differences in the perception of oral English proficiency and the pedagogical priorities between these two groups of raters. However, they seem to focus mainly on the aspects and features of language performance that raters heed, leaving other important aspects of rater cognition, such as raters’ decision-making behaviors and rating approaches, not thoroughly attended to.
Another set of limitations exists with regard to the validity and generalizability of these results. The first lies in the homogeneity of the student samples selected in both studies: Chinese students who share the same L1 and similar educational backgrounds might undermine the generalizability of the results to other test-taker populations. There is also a limitation with regard to the validity of using written comments as the major data for analysis, as such comments might not offer a full account of raters’ in-depth rating behaviors.


The last impediment to the validity of the results from both studies, as discussed in previous studies on the influence of rater language background (Brown, 1995; Kim, 2009), pertains to the possibility that variables other than rater language background, such as raters’ scoring experience or their places of residence, could have caused the variance in ratings instead. Rater language background thus ended up in the original results as a proxy variable. This limitation raises the question of whether language background is “a particularly meaningful category as far as predicting raters’ behavior is concerned” (Zhang & Elder, 2014, p. 320).

Another type of research on rater language background has attempted to find out whether raters tend to be biased in favor of test-takers whose language backgrounds are related to their own. Researchers have looked at the influence of both rater L1 and rater L2 and seem to diverge in their opinions. Winke, Gass, & Myford (2011, 2012) investigated whether raters were influenced by the link between their L2 and test-takers’ L1 when scoring the TOEFL iBT speaking test.

Both statistical results and qualitative data analyses suggested that raters tended to assign scores that were significantly higher than expected to test-takers whose L1 matched their L2 (i.e., heritage status), due to familiarity and positive personal reactions to test-takers’ accents and L1. In contrast, Wei & Llosa (2015) examined the differences between American and Indian raters in their scores and scoring processes while rating Indian test-takers’ responses to the TOEFL iBT speaking tasks. They found no statistically significant differences between Indian and American raters in their use of the scoring criteria, their attitudes toward Indian English, or the internal consistency and severity of the scores. In-depth qualitative analysis revealed that some Indian raters even held negative attitudes toward Indian English, due to factors more complicated than their own language background. For example, the negative judgments one rater had received about his native language caused him to believe that adopting standard American English is important for surviving in the United States. As a result, this rater might not have endorsed test-takers’ shared language background. The findings of this study suggest that sharing a common language background does not guarantee a positive evaluation of test-takers’ L2 speaking performance after all. However, the small and homogeneous sample of Indian raters might undermine the generalizability of the findings, which should be further examined by including raters and test-takers of other language varieties.


So far in L2 speaking assessment, researchers have provided statistical and qualitative support for various hypotheses regarding whether raters are potentially biased toward test-takers from a similar language background. However, they have yet to examine whether deeper, underlying cognitive differences exist in raters’ scoring processes, such as their approaches to rating and their focus and feature attention, while they are evaluating the performance of test-takers with mixed language backgrounds. One of the studies that attempted to tap into those cognitive differences was conducted by Xi & Mollaun (2009, 2011), who investigated the extent to which a special training package could help raters from India score examinees with mixed first language (L1) backgrounds more accurately and consistently. They found that the special training not only improved Indian raters’ consistency in scoring both Indian and non-Indian examinees, but also boosted their confidence in scoring. Those findings led to further discussion of whether raters adopted different styles of rating depending on the match between their own and the examinees’ first languages. For example, after the special training, the raters from India may have employed more analytical approaches to scoring Indian examinees while engaging in more impressionistic, intuitive evaluations of examinees whose L1s were not familiar to them (Xi & Mollaun, 2009), thus balancing out their tendency to favor test-takers of their own language background. However, the researchers could only hypothesize about the change in raters’ cognitive styles due to a lack of direct empirical evidence (e.g., raters’ verbal protocol data), which could have served to corroborate their quantitative findings.


Apart from rater language background, rater experience and rater training are also important factors found to affect raters’ rating styles and behaviors in L2 speaking assessment. Among the series of studies that have explicitly examined the effects of experience on raters’ cognitive processes and rating behaviors in language testing, the vast majority were conducted in writing assessment (Barkaoui, 2010; Cumming, 1990; Delaruelle, 1997; Lim, 2011; Myford, Marr, & Linacre, 1996; Sakyi, 2003; Wolfe, 1997, 2006; Wolfe, Kao, & Ranney, 1998). Research findings in writing assessment generally seem to agree that prior teaching or testing experience influences raters’ decision-making processes (Davis, 2012). Experienced raters are found to score faster (Sakyi, 2003), consider a wider variety of language features (Cumming, 1990; Kim, 2011; Sakyi, 2003), and be more inclined to withhold premature judgments in order to glean more information (Barkaoui, 2010; Wolfe, 1997). In terms of rater training, the majority of studies in both writing and speaking assessment seem to suggest that training does not completely eliminate the variability existing in either rater severity (Brown, 1995; Lumley & McNamara, 1995; Myford & Wolfe, 2000) or raters’ scoring standards and decision-making processes (Meiron, 1998; Orr, 2002; Papajohn, 2002; Winke, Gass, & Myford, 2011).


In contrast to the relatively larger number of studies on rater experience and rater training in L2 writing assessment, researchers in L2 speaking assessment have only recently begun to examine the impacts of those two rater background factors on raters’ scoring processes and behaviors (Davis, 2012, 2015; Isaacs & Thompson, 2013; Kim, 2011, 2015). Kim (2015) compared rater behaviors across three rater groups (novice, developing, and expert) in the evaluation of ESL learners’ oral responses, and examined the development of rating performance within each group over time. The analysis revealed that the three groups of raters demonstrated distinct levels of rating ability and different paces of progress in their rating performance. Based on her findings, she concluded that rater characteristics should be examined extensively to improve the current understanding of raters’ different needs for training and rating. She also discussed her own conceptualization of rater characteristics and relative expertise, drawing on relevant literature in writing assessment (e.g., Cumming, 1990; Delaruelle, 1997; Erdosy, 2004; Lumley, 2005; Sakyi, 2003; Weigle, 1998; Wolfe, 2006), and proposed perhaps the most up-to-date framework of rating L2 speaking performance germane to those rater characteristics.


According to Kim (2011, 2015), rater expertise is composed of four concrete rater background variables (i.e., experience in rating, Teaching English to Speakers of Other Languages [TESOL] experience, rater training, and coursework). The interactions of those rater background variables influence the rating-related knowledge and strategic competence that raters utilize during scoring, also known as their rating ability. Rating performance is then accomplished by raters harnessing their rating ability on an actual rating occasion. Kim’s model is perhaps the most comprehensive framework of rating performance germane to rater background variables to date.


In another representative study on rater expertise in L2 speaking assessment, Davis (2012, 2015) investigated how raters of different rating proficiency scored responses from the TOEFL iBT speaking test differently prior to and following training. Considerable individual variation was seen in the frequency with which the exemplars were used and reviewed by raters, the language features mentioned during rating, and the styles of commenting by each rater (e.g., the array of topics covered and the amount of detailed explanation on specific points). The effects of training were reflected in the ways that raters gave more explicit attention to their scoring processes and made fewer disorganized or unclear comments over time. Both Kim’s (2011, 2015) and Davis’ (2012, 2015) research is comprehensive in terms of the rater background factors (i.e., rater experience interacting with training) they focused on and the research design and methods (i.e., mixed-method research design) they used to tap into the influence of those background factors. However, the data reported in their research primarily address raters’ accuracy in interpreting the rating scales and performance level descriptors (Kim, 2015) and raters’ conscious attention to specific language features while scoring (Davis, 2012), leaving other important aspects of rater cognition, such as the mental actions raters take to reach a scoring decision, not thoroughly attended to.


As a further attempt to investigate the cognitive differences between more and less experienced raters, Isaacs & Thompson (2013) examined the effects of rater experience on raters’ judgments of L2 speech, especially regarding pronunciation. This study uncovered several new cognitive differences between experienced and novice raters, in terms of the (meta)cognitive strategies they use to harness their relative experience with ESL learners, their emotional reactions and attitudes toward their levels of experience, their rating focus and feature attention, their professional knowledge and TESOL vocabulary for describing L2 speech, and the relative lengths and styles of their verbal comments. Evidence from verbal protocols and post-task interviews suggested that experienced and novice raters adopted strategies to either draw on or balance out their perceived experience with L2 speech during scoring. For example, some experienced raters reported that their experience with ESL learners might have affected their comprehension and evaluation of learners’ speech in comparison to non-ESL teachers. To neutralize this influence, some even attempted to envision themselves as non-ESL-trained interlocutors when assigning scores. Conversely, several novice raters expressed feelings of inadequacy as judges due to their insufficient experience in specifying and assessing learner speech. In terms of rating focus and feature attention, experienced raters were more likely to identify specific pronunciation errors through either detailed characterization or imitation/correction of student speech. Compared to their novice counterparts, they also had a more flexible range of professional knowledge of L2 pronunciation and assessment, whereas novice raters were more uniformly lacking in their command of TESOL vocabulary, to the extent that they had to invent more creative terms to describe L2 speech.
Experienced raters were also found to produce longer think-aloud and interview comments, since they almost invariably provided anecdotal descriptions of their teaching or assessment practices. Even though the study attempted to gather evidence that raters diverged cognitively depending on their levels of rating experience, it is still unclear whether the cognitive differences discovered were the essential ones that distinguish experienced raters from novice ones. For example, it has not been verified whether novice raters failed to articulate their perceptions of the speech because of their inadequate access to the vocabulary used by experienced raters, or rather because experienced and novice raters were heeding qualitatively different dimensions of the speech overall, holding different perceptions and interpretations of the construct and the scoring rubric, or following different approaches to rating. Therefore, it is important to examine in greater detail the factors that might have affected those raters’ judgment processes while scoring.


The most commonly studied rater background factors in L2 speaking assessment so far are rater language background, rater experience, and rater training. What remains little known, however, is whether other sources of rater variability, for example, those related to differences in raters’ cognitive abilities, also affect raters’ evaluation of L2 speaking performance. In a pioneering study, Isaacs & Trofimovich (2011) investigated how raters’ judgments of L2 speech were associated with individual differences in their phonological memory, attention control, and musical ability. Results showed that raters who specialized in music assigned significantly lower scores than non-music majors for non-native-like accents, particularly for low-ability L2 speakers. However, the ratings were not significantly influenced by variability in raters’ phonological memory and attention control. Reassuring as it is that phonological memory and attention control were not found to induce bias in raters’ assessments of L2 speech, this study is an initial attempt to tap into raters’ cognitive abilities in relation to L2 speaking assessment, and it calls for further exploration of the nature of the impacts of those abilities. One major caveat that might undermine the validity of the results, as the researchers (Isaacs & Trofimovich, 2011) themselves have pointed out, is that phonological memory and attention control might not be as relevant to raters’ perceptual judgments of L2 speech as alternative measures such as acoustic memory and the scope of attention, which raters might have drawn on more heavily to process and evaluate L2 speech (pp. 132–133).
Apart from that, the cognitive tasks used to measure raters’ phonological memory (i.e., a serial non-word recognition task) and attention control (i.e., the trail-making test) might not be as effective as other tasks (e.g., non-word repetition or recall tasks) at yielding the maximum association between those cognitive capacities and raters’ perceptual judgments of L2 speech (Isaacs & Trofimovich, 2011, p. 132). The trail-making task, for example, was used to measure the attention control of listeners who evaluate language performance. However, since the task is language neutral by nature (p. 122), it does not seem to have much connection with real-life language processing and therefore might not be the optimal measure of attention control in the context of this study. In terms of the methods for data analysis, beyond preliminary statistical analyses of the results of cognitive ability measures, the study could also have benefited from the collection and analysis of qualitative data (e.g., raters’ verbal protocols and interview/questionnaire results) to capture more direct evidence of the effects of raters’ cognitive abilities on their rating process. This study is nonetheless groundbreaking in terms of its implications for investigating rater cognition in relation to the architecture of human information processing and the functionality of the brain in L2 speaking assessment.
However, apart from phonological memory and attention control, the effects of many other cognitive abilities and mechanisms should also have been taken into account, such as raters’ attention and perception, long-term memory (i.e., declarative, procedural, and episodic memory, which might influence raters’ mental representations of both the rubric and the L2 speech, and their rating styles and strategies), or reasoning and decision-making skills, to provide a more comprehensive picture of the important role that each component of the human cognitive architecture plays in the process of rating L2 speech. Musical ability, the factor that appeared to influence raters’ judgments of accentedness in this study, needs to be explored in greater detail to explain more precisely how individual differences in musical expertise may impact rater behavior. Beyond the drawbacks regarding the types of cognitive abilities explored in this study and the tasks used to measure them, how those abilities might affect the evaluation of a more broadly defined construct of speaking ability is also left unexplored (p. 136). For instance, the researchers focused on only three components of the speaking ability construct (i.e., accentedness, comprehensibility, and fluency) without incorporating other elements (e.g., grammar and vocabulary), thereby largely diminishing the generalizability of the results to a wider variety of speaking tasks and oral proficiency constructs. The relatively homogeneous sample of raters recruited (i.e., college students untrained and inexperienced in scoring L2 speech) can also limit the generalizability of the results.


To summarize, by examining the interactions between various rater background factors and raters’ judgment processes, researchers have reached generally similar conclusions about the possible effects of different rater background factors on raters’ cognitive processes and rating behaviors. Rater language background is likely to affect raters’ focus and their perception of oral proficiency, depending on whether they are native or non-native speakers.


Matches in language background between raters and examinees can also influence raters’ comprehension and evaluation of examinees’ interlanguage speech. Rater experience and rater training are likewise found to affect raters’ scoring approaches and styles, their commenting styles, their decision-making behaviors and strategy use, their focus and attention to performance features, and their interpretation and utilization of the scoring criteria. One groundbreaking study (Isaacs & Trofimovich, 2011) attempted to look into the effects of individual differences in raters’ cognitive abilities on their rating patterns and scoring processes, but the results are not as convincing as expected due to a number of limitations. One major limitation shared by most of the studies is that they focused on only one or two isolated aspects (e.g., rater focus and feature attention) while exploring how rater background factors affect rating behaviors and cognitive processes, leaving other aspects not thoroughly explored, especially those directly related to raters’ cognitive processes (e.g., raters’ internal processing of information and their strategy use). Therefore, future research can improve our understanding of how various rater background factors might impact raters’ judgment processes by systematically exploring those influences from a cognitive-processing perspective.



Human raters are usually engaged in judging the interlanguage speech that examinees produce in L2 speaking assessment. As a result, rater cognition has been extensively explored to inform our understanding of the exact nature of rater variability and to help us tackle practical problems regarding score validation and rater training. As the above review has shown, existing studies in L2 speaking assessment that have contributed to the conceptualization of rater cognition can be categorized into two types: studies that examine how raters differ (and sometimes agree) in their cognitive processes and rating behaviors, and studies that explore why they differ. The first type looked at how raters tend to differ or agree in their cognitive processes and rating behaviors, mainly in terms of their focus and feature attention, their approaches to scoring, and their treatment of criteria-relevant and non-criteria-relevant aspects and features of performance. This is also the type of study that most directly describes raters’ mental processes during scoring. The second type attempted to explain why raters differ (and usually they do), through the analysis of the interactions between various rater background factors and raters’ scoring behaviors.

Despite disagreements in their findings, many researchers would probably agree that rater background variables, mainly comprising language background, rating experience, and training experience, can lead to individual variability and/or adjustment over time in raters’ judgment processes when scoring L2 speech.


Reference: Rater Cognition in L2 Speaking Assessment. Teachers College, Columbia University Working Papers in TESOL & Applied Linguistics, Vol. 16, No. 1, pp. 1–24.