This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Relationships among Item Characteristics, Examinee Characteristics, and Response Times on USMLE Step 1
26
Citations
5
Authors
2001
Year
Abstract
In 1999, computer-based test (CBT) administration was introduced for the United States Medical Licensing Examination (USMLE).1 Currently, all three Steps use “fixed forms”: large numbers of tests are constructed to match detailed content and statistical specifications. However, use of sequential (adaptive) testing procedures is under consideration. When these procedures are used, examinees performing well are tested with progressively more difficult items; the reverse is true for examinees performing poorly. To the extent that the time required to respond to an item covaries with item difficulty, more proficient examinees may, as a result, be disadvantaged by adaptive testing procedures, and there is evidence in the literature that this occurs for some kinds of test material.2–5 The present study was undertaken as part of the effort to evaluate the desirability of introducing adaptive testing for USMLE.5,6 Specifically, the study was designed to gain a better understanding of the relationship between response times and item characteristics (word count, presence of pictures, difficulty), and the extent to which these relationships are influenced by an examinee's proficiency and English-language background (native speaker vs. English as a second language).

Method

Context and test-administration conditions. USMLE Step 1, the basic science component of the USMLE, was the context for the study. Computer administration of Step 1 began in 1999; tests are administered throughout the year. Dozens of test forms are used, with examinees randomly assigned to forms. Test sessions are scheduled for eight hours. During this time, examinees complete seven 50-item sections; an hour is allotted for each section. Sections and items within sections are presented in random order. Within but not between sections, examinees may skip and return to items, changing answers as they wish.
In addition to the seven one-hour sections, one hour is allotted for examinees to complete a tutorial, take breaks, and respond to a survey.

Subjects. The study subjects were 1,985 first-time examinees from U.S. and Canadian schools who completed a subset of the test forms developed for 2000–01. Their demographic characteristics and Step 1 performance were very similar to those of examinees nationally: 96% were pursuing MD degrees, with the remainder seeking DO degrees; 91% had entered medical school in 1998; 94% planned to graduate in 2002; 56% were male; 13% spoke English as a second language; and 85% sat for Step 1 in June 2000. The Step 1 mean score for the group was 214 (standard deviation of 23), and 92% passed Step 1 on this attempt.

Test material. The test material was 973 “live” (scored) items; all were in one-best-answer format. Item characteristics were similar to those on other test forms of 2000–01. The mean p value was .76, with a standard deviation (SD) of .15; the corresponding values for Rasch-item difficulties (transformed p values used in scaling and equating test forms) were −.15 and .96. The mean word count (including words in stem and options) was 58 (SD of 26); 15% of the items involved pictorial material. The mean number of examinees' responses per item was 514 (SD of 246). Small numbers of item responses were dropped because (1) the item was not reached during the test or (2) the response time was excessive (more than 600 seconds). In all, 500,121 item responses were used for analysis.

Analysis. The primary analytic method was hierarchical linear regression. This procedure, in effect, fits a separate regression line for each test item, predicting response times from examinee and test characteristics.
The intercepts and slopes from these item-specific regression equations are then treated as dependent variables, with item characteristics used to predict variation in values across items (e.g., mean item response time predicted by item difficulty, number of words in the item, and so on). The first hierarchical linear model (HLM) fit to the data set was a persons-nested-in-items random-effects ANOVA to obtain baseline within- and between-item variation in response times. A series of random-coefficients models was then fitted to develop within-item (level 1) regression equations quantifying effects of test administration (item position) and examinee characteristics (total test scores, English as a second language) on response times. The intercept in each of these equations provided an estimate of the “time requirement” (mean response time) for the associated item. The last set of models (means- and slopes-as-outcomes) used item characteristics (Rasch item difficulty, word count, presence/absence of pictures) in (level 2) equations to predict the intercepts and slopes of the level-1 equations: that is, to predict the impact of item characteristics on response-time requirements, as well as the impact of those characteristics on the relationship between response times and examinee characteristics.

Results

Overview. With the examinee response to an item as the unit of analysis, the mean response time was 63.7 seconds (SD 46.8 seconds), and the response-time distribution was positively skewed. Response times decreased as sections progressed: responses to later items in a section were more rapid. On average, examinees with higher Step 1 scores responded slightly more quickly, although this relationship interacted with item difficulty, with more proficient examinees responding more quickly than less proficient examinees for easy items, but more slowly for hard items.
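The two-level logic described in the Analysis section (a regression fitted within each item, then item characteristics predicting those per-item fits) can be sketched with a small simulation. This is a simplified two-stage analogue of HLM, not a full mixed-model fit (no shrinkage or variance-component estimation), and the 0.5 s/word and −0.035 s/point effects are hypothetical values chosen only to echo the study's estimates:

```python
import random
from statistics import mean

random.seed(0)

def ols(x, y):
    """Simple least squares; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data-generating effects (not the study's fitted model):
# +0.5 s per word at level 2, -0.035 s per Step 1 point at level 1.
n_items, n_examinees = 50, 200
items = [random.randint(20, 120) for _ in range(n_items)]  # item word counts

level1_fits = []
for words in items:
    scores = [random.gauss(0, 25) for _ in range(n_examinees)]  # centered proficiency
    times = [30 + 0.5 * words - 0.035 * s + random.gauss(0, 10) for s in scores]
    # Level 1: regress response time on proficiency within this item
    level1_fits.append(ols(scores, times))

# Level 2: regress the item intercepts (mean response times) on word count
b0, b_words = ols(items, [intercept for intercept, _ in level1_fits])
print(round(b_words, 2))  # recovers roughly the simulated 0.5 s/word
```

The same second-stage regression can be run on the level-1 slopes instead of the intercepts, which is exactly how the means-and-slopes-as-outcomes model asks whether item characteristics moderate the proficiency effect.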
On average, examinees with English as a second language (ESL) required slightly more time to respond, although the amounts varied across items in relation to word count. More difficult items, items with more words, and items with pictures required more time. The HLM analyses described below quantify these relationships.

ANOVA model. The random-effects one-way ANOVA (responses nested in items) produced variance components of 785 for items and 1,406 for persons within items, indicating that 36% of the variation in response times was between items and 64% was within items. The square root of the items variance component (28.0 seconds) provides an estimate of the (true) SD in (mean) item response time; this value reflects the wide range of observed mean item response times (15 to 220 seconds). The square root of the variance component for persons within items (37.5 seconds) provides an estimate of the (pooled) within-item SD.

Random-coefficients models. A series of random-coefficients models was fitted to identify predictors of within-item variation in response times. Based on inspection of plots, the initial model included the following predictors, the coefficients for which were all allowed to vary across items: a dummy code for language status (ESL = 1); a dummy code for first-versus-later sections (first section = 1); item position within a section; Step 1 total score; and two interaction terms. Results indicated that several predictors explained little variance, and the final model included only the language dummy code, item position, and Step 1 total score. Only the first and third predictors were allowed to vary across items; the second was constrained to have the same value for all items. Table 1 summarizes the random-coefficients model that resulted. Though this model explained only 1.2% of within-item variation in response times, its coefficients fit well with expectations.
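As a quick check, the ANOVA variance components reported above reproduce the stated percentages and standard deviations (a back-of-the-envelope computation, not the HLM software output):

```python
import math

# Variance components from the random-effects one-way ANOVA
var_items = 785     # between-item variance
var_within = 1406   # persons-within-items variance
total = var_items + var_within

pct_between = 100 * var_items / total   # share of variation between items
pct_within = 100 * var_within / total   # share of variation within items

sd_items = math.sqrt(var_items)     # true SD of mean item response times
sd_within = math.sqrt(var_within)   # pooled within-item SD

print(round(pct_between), round(pct_within))    # 36 64
print(round(sd_items, 1), round(sd_within, 1))  # 28.0 37.5
```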
The grand mean for response times of native English speakers was 62.56 seconds. On average, having English as a second language increased the mean response time by 1.63 seconds. Items that appeared at the beginning of a section required about 6 seconds (0.1177 × 50) more to complete than items at the end of a section, and examinees with higher Step 1 scores, on average, responded to items a little more quickly (−0.0353 × 100 = 3.53 seconds per item for a 100-point difference in scores).

TABLE 1: Results of Fitting a Random-coefficients Model*

The random-effects portion of the table indicates large item-to-item variability in intercepts and in regression coefficients for ESL and Step 1 scores. The SD for intercepts is 28+ seconds, a measure of variability in mean item response times. The SD for the ESL coefficient is 3.0572 seconds; comparing that value with the mean ESL regression coefficient of 1.6288 indicates that for some items examinees with ESL responded more quickly than native English speakers, while for other items they responded much more slowly. The SD for the Step 1 coefficient is 0.1570; comparing that with the mean Step 1 regression coefficient of −0.0353 indicates that, depending on the item, both shorter and longer response times were associated with higher Step 1 scores.

Means-and-slopes-as-outcomes model. The purpose of fitting a means-and-slopes-as-outcomes model was to identify item characteristics that predict item-to-item variation in mean response time (intercepts) and in the effects of ESL and Step 1 scores on item response time. The analysis treats intercepts and regression coefficients for ESL and Step 1 scores from the level-1 equations as random variables, with item characteristics used to predict variation across items. For example, the rows for the intercept in the fixed-effects portion of the table indicate that the dummy code for presence of a picture, Rasch difficulty, word count, and word count squared are all predictors of mean item response time.
The presence of a picture adds 12+ seconds to response time, and a one-logit change in item difficulty adds 14+ seconds. The relationship of mean item response time to word count is quadratic, but each word adds approximately 0.5 seconds. With regard to slopes, the ESL coefficient increases by 0.05 seconds for each additional word in the item. The regression coefficient for the Step 1 total score increases if a picture is present, with additional words, and (most significantly in relation to adaptive testing) with increases in (Rasch) item difficulty. To determine the percentage of variance in intercepts explained by the model, the random-effects portions of Tables 1 and 2 are used. Table 1 indicated that the variance component for intercepts was 789.5754. Using the predictors shown in Table 2, this variance component dropped to 431.0379, indicating that the predictors explain 45.4% [100% × (789.5754 − 431.0379)/789.5754] of between-item variability in response times. Similar computations for the ESL dummy code and Step 1 scores indicate that the predictors in the fixed-effects portion of the table explain 30.8% and 37.8% of the between-item variability in these coefficients.

TABLE 2: Results of Fitting a Means- and Slopes-as-outcomes Model*

To obtain a practical sense of the effect of item difficulty on response time and the magnitude of the interaction between item difficulty and proficiency, consider two items, both with 60 words and no associated pictures, presented in the middle of a section. One item has a p value of .35, while the other has a p value of .95. For an examinee with a Step 1 score of 150, the model in Table 2 predicts response times of 73 and 42 seconds, with the easier item requiring less time. For an examinee with a Step 1 score of 250, the predicted values are 102 and 30 seconds. Thus, more-proficient examinees appear to respond to easier items more quickly, using the response time “saved” on more difficult items.
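The 45.4% figure is a proportional reduction in variance, computed from the intercept variance components of the two models:

```python
# Between-item variance in intercepts (mean response times)
var_unconditional = 789.5754  # random-coefficients model (Table 1)
var_conditional = 431.0379    # means-and-slopes-as-outcomes model (Table 2)

# Proportional reduction in between-item variance once item
# characteristics (picture, difficulty, word count) are added
pct_explained = 100 * (var_unconditional - var_conditional) / var_unconditional
print(round(pct_explained, 1))  # 45.4
```

The 30.8% and 37.8% figures for the ESL and Step 1 slopes follow from the same computation applied to those coefficients' variance components.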
Discussion and Limitations

The items varied extensively in response-time requirements, with some items, on average, requiring less than 30 seconds to complete and others requiring more than 3 minutes. Roughly 45% of the true variation in mean item response times was predictable from three item characteristics: difficulty, word count, and the presence or absence of pictures. The magnitude of item-to-item variation in response times could induce differential speededness across test forms (particularly for short tests) if time requirements are not considered during test construction. Response-time data should therefore be collected during item pretesting, and tests and sections should be constructed to minimize differences in speededness.

More proficient examinees complete easy items more quickly, making the time saved available for harder items. In adaptive (or sequential) testing, this would not be possible, since highly proficient examinees would tend to receive few easy items. Given the relatively strong relationship between item difficulty and mean response time, coupled with a positive within-item slope in the regression of response time on proficiency, differential speededness could easily occur in adaptively administered tests if the same number of items were used for examinees of different proficiency. Depending on the nature of the test material, time requirements and item difficulty may well covary, as they did in this study. As a consequence, it may be best for tests targeted at different proficiency levels to vary in the number of items (with total testing time constant) in order to maintain comparable levels of speededness.

Because test items from only a single exam program were used in the study, it is unclear whether the major study results generalize to other tests or to adaptively administered tests.
It seems likely that the impact of item and examinee characteristics, and interactions among them, will depend on the nature of test material, degree of speededness, and test administration procedures. This should be a fruitful and, from a practical perspective, important area for future research.
Similar works
The qualitative content analysis process
2008 · 21,889 citations
Making sense of Cronbach's alpha
2011 · 13,979 citations
Standards for Reporting Qualitative Research
2014 · 11,272 citations
Health professionals for a new century: transforming education to strengthen health systems in an interdependent world
2010 · 5,744 citations
Audit and feedback: effects on professional practice and healthcare outcomes
2012 · 5,513 citations