Computerized adaptive testing

Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level; for this reason, it has also been called tailored testing. It is a form of computer-administered test in which the next item, or set of items, is selected on the basis of the correctness of the test taker's responses to the items administered so far.[1]

Description

CAT successively selects questions (test items) so as to maximize the precision of the exam based on what is known about the examinee from previous questions.[2] From the examinee's perspective, the difficulty of the exam seems to tailor itself to their level of ability. For example, if an examinee performs well on an item of intermediate difficulty, they will then be presented with a more difficult question; if they perform poorly, they will be presented with a simpler one. Compared to static tests that nearly everyone has experienced, in which a fixed set of items is administered to all examinees, computer-adaptive tests require fewer test items to arrive at equally accurate scores.[2]

The basic computer-adaptive testing method is an iterative algorithm with the following steps (a minimal code sketch follows the list):[3]

  1. The pool of available items is searched for the optimal item, based on the current estimate of the examinee's ability
  2. The chosen item is presented to the examinee, who then answers it correctly or incorrectly
  3. The ability estimate is updated, based on all prior answers
  4. Steps 1–3 are repeated until a termination criterion is met
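
The following is a minimal sketch of these four steps, assuming a two-parameter logistic (2PL) IRT model, a small made-up item pool, and a fixed test length as the termination criterion; real CAT systems use calibrated item parameters, more refined estimation, and the termination rules discussed later in this article.

```python
import math
import random

# Hypothetical item pool: (discrimination a, difficulty b) for each item.
POOL = [(1.2, -1.5), (0.9, -0.5), (1.5, 0.0), (1.1, 0.7), (1.3, 1.4)]

def prob_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def update_theta(responses):
    """Step 3: crude EAP update -- posterior mean over a theta grid, N(0,1) prior."""
    grid = [x / 10.0 for x in range(-40, 41)]
    weights = []
    for t in grid:
        w = math.exp(-t * t / 2.0)              # standard normal prior
        for (a, b), u in responses:
            p = prob_correct(t, a, b)
            w *= p if u == 1 else (1.0 - p)
        weights.append(w)
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

def simulate_cat(true_theta, test_length=4):
    theta, responses, available = 0.0, [], list(POOL)    # starting point
    for _ in range(test_length):                          # step 4: fixed-length stop
        # Step 1: most informative remaining item at the current estimate.
        item = max(available, key=lambda ab: item_information(theta, *ab))
        available.remove(item)
        # Step 2: simulate the examinee's response to the chosen item.
        u = 1 if random.random() < prob_correct(true_theta, *item) else 0
        responses.append((item, u))
        # Step 3: re-estimate ability from all responses so far.
        theta = update_theta(responses)
    return theta

print(simulate_cat(true_theta=0.8))
```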


As a result of adaptive administration, different examinees receive quite different tests.[4] Although examinees are typically administered different tests, their ability scores are comparable to one another (i.e., as if they had received the same test, as is common in tests designed using classical test theory). The psychometric technology that allows equitable scores to be computed across different sets of items is item response theory (IRT). IRT is also the preferred methodology for selecting optimal items, which are typically chosen on the basis of item information rather than difficulty per se.[3]

A related methodology called multistage testing (MST) or CAST is used in the Uniform Certified Public Accountant Examination. MST avoids or reduces some of the disadvantages of CAT as described below.[5]

Examples

CAT has existed since the 1970s, and there are now many assessments that utilize it.

  • Graduate Management Admission Test
  • MAP test from NWEA
  • SAT (beginning outside of the US in 2023 and in the US in 2024)[6]
  • National Council Licensure Examination
  • Armed Services Vocational Aptitude Battery

Additionally, a list of active CAT exams is maintained by the International Association for Computerized Adaptive Testing,[7] along with a list of current CAT research programs and a near-inclusive bibliography of published CAT research.

Advantages

Adaptive testing, depending on the item selection algorithm, may reduce the exposure of some items, because examinees typically receive different sets of items rather than the whole population being administered a single set. However, it may increase the exposure of others (namely the medium or medium/easy items presented to most examinees at the beginning of the test).[3]

Disadvantages

The first issue encountered in CAT is the calibration of the item pool. In order to model the characteristics of the items (e.g., to pick the optimal item), all the items of the test must be pre-administered to a sizable sample and then analyzed. To achieve this, new items must be mixed into the operational items of an exam (their responses are recorded but do not contribute to test-takers' scores), a practice called "pilot testing", "pre-testing", or "seeding".[3] This presents logistical, ethical, and security issues. For example, it is impossible to field an operational adaptive test with brand-new, unseen items;[8] all items must be pretested with a large enough sample to obtain stable item statistics. This sample may be required to be as large as 1,000 examinees.[8] Each program must decide what percentage of the test can reasonably be composed of unscored pilot test items.

Review of past items is generally disallowed, as adaptive tests tend to administer easier items after a person answers incorrectly. Supposedly, an astute test-taker could use such clues to detect incorrect answers and correct them. Or, test-takers could be coached to deliberately pick a greater number of wrong answers leading to an increasingly easier test. After tricking the adaptive test into building a maximally easy exam, they could then review the items and answer them correctly—possibly achieving a very high score. Test-takers frequently complain about the inability to review.[9]

Because of its sophistication, the development of a CAT has a number of prerequisites.[10] The large sample sizes (typically hundreds of examinees) required by IRT calibrations must be available. Items must be scorable in real time if the next item is to be selected instantaneously. Psychometricians experienced with IRT calibrations and CAT simulation research are necessary to provide validity documentation. Finally, a software system capable of true IRT-based CAT must be available.

In a CAT with a time limit, it is impossible for the examinee to accurately budget the time they can spend on each test item and to determine whether they are on pace to complete a timed test section. Test takers may thus be penalized for spending too much time on a difficult question presented early in a section and then failing to complete enough questions to accurately gauge their proficiency in areas left untested when time expires.[11] While untimed CATs are excellent tools for formative assessments that guide subsequent instruction, timed CATs are unsuitable for high-stakes summative assessments used to measure aptitude for jobs and educational programs.

Components

There are five technical components in building a CAT (the following is adapted from Weiss & Kingsbury, 1984[2]). This list does not include practical issues, such as item pretesting or live field release.

  1. Calibrated item pool
  2. Starting point or entry level
  3. Item selection algorithm
  4. Scoring procedure
  5. Termination criterion

Calibrated item pool

A pool of items calibrated under an IRT model must be available to draw from. As discussed under Disadvantages, this requires pretesting the items on a large enough sample of examinees to obtain stable item parameter estimates.

Starting point

Before any responses are available, the CAT must assume an initial ability estimate for the examinee. A common choice is an intermediate value such as the mean of the ability distribution, which is why most examinees see items of medium difficulty at the beginning of the test.

Item selection algorithm

Items are typically selected to maximize information at the examinee's current ability estimate, possibly subject to practical constraints such as content balancing and item exposure control (see below).

Scoring procedure

After an item is administered, the CAT updates its estimate of the examinee's ability level. If the examinee answered the item correctly, the CAT will likely estimate their ability to be somewhat higher, and vice versa. This is done by using the item response function from item response theory to obtain a likelihood function of the examinee's ability. Two methods for this are called maximum likelihood estimation and Bayesian estimation. The latter assumes an a priori distribution of examinee ability, and has two commonly used estimators: expectation a posteriori and maximum a posteriori. Maximum likelihood is equivalent to a Bayes maximum a posteriori estimate if a uniform (f(x)=1) prior is assumed.[8] Maximum likelihood is asymptotically unbiased, but cannot provide a theta estimate for an unmixed (all correct or incorrect) response vector, in which case a Bayesian method may have to be used temporarily.[2]
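
The following is a sketch contrasting the two families of estimators, assuming a 2PL model and made-up item parameters; it illustrates why a Bayesian estimate is needed for an unmixed response vector, for which the maximum likelihood estimate is unbounded.

```python
import math

ITEMS = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]    # hypothetical (a, b) pairs
GRID = [x / 20.0 for x in range(-80, 81)]         # theta grid from -4 to 4

def likelihood(theta, items, responses):
    """Likelihood of a 0/1 response pattern under a 2PL model."""
    value = 1.0
    for (a, b), u in zip(items, responses):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        value *= p if u == 1 else (1.0 - p)
    return value

def mle(items, responses):
    """Grid-search maximum likelihood estimate of theta."""
    return max(GRID, key=lambda t: likelihood(t, items, responses))

def eap(items, responses):
    """Expectation a posteriori estimate with a standard normal prior."""
    posterior = [math.exp(-t * t / 2.0) * likelihood(t, items, responses) for t in GRID]
    return sum(t * w for t, w in zip(GRID, posterior)) / sum(posterior)

# Mixed response pattern: both estimators give a finite, similar answer.
print(mle(ITEMS, [1, 0, 1]), eap(ITEMS, [1, 0, 1]))
# All-correct (unmixed) pattern: the MLE runs to the edge of the grid
# (it is unbounded), while the Bayesian EAP estimate remains finite.
print(mle(ITEMS, [1, 1, 1]), eap(ITEMS, [1, 1, 1]))
```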

Termination criterion

The CAT algorithm is designed to repeatedly administer items and update the estimate of examinee ability. This will continue until the item pool is exhausted unless a termination criterion is incorporated into the CAT. Often, the test is terminated when the examinee's standard error of measurement falls below a certain user-specified value; one consequence is that examinee scores are roughly uniformly precise, or "equiprecise."[2] Other termination criteria exist for different purposes of the test, for example if the test is designed only to determine whether the examinee should "pass" or "fail", rather than to obtain a precise estimate of their ability.[2][12]
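
A sketch of such a variable-length stopping rule follows, assuming the usual IRT approximation that the standard error of measurement at theta is one over the square root of the test information (the sum of the administered items' information); the threshold and item parameters are made up.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def should_stop(theta_hat, administered, sem_target=0.30, max_items=30):
    """Stop when the estimated standard error falls below sem_target,
    or when a maximum test length is reached as a safeguard."""
    if len(administered) >= max_items:
        return True
    info = sum(item_information(theta_hat, a, b) for a, b in administered)
    if info <= 0.0:
        return False
    return 1.0 / math.sqrt(info) <= sem_target   # SEM ~ 1 / sqrt(test information)

print(should_stop(0.2, [(1.2, 0.0), (1.0, 0.3), (1.4, -0.2)]))   # False: SEM still too large
```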

Other issues

Pass-fail

When a CAT is used for classification rather than point estimation, a different termination criterion and scoring algorithm must be applied, one that classifies the examinee into a category (e.g., pass or fail) rather than providing a point estimate of ability. There are two primary methodologies available for this. The more prominent of the two is the sequential probability ratio test (SPRT).[13][14] This formulates the examinee classification problem as a hypothesis test that the examinee's ability is equal to either some specified point above the cutscore or another specified point below the cutscore. Note that this is a point hypothesis formulation rather than the composite hypothesis formulation[15] that is more conceptually appropriate; a composite hypothesis formulation would be that the examinee's ability lies in the region above the cutscore or in the region below the cutscore. A confidence interval approach is also used: after each item is administered, the algorithm determines the probability that the examinee's true score is above or below the passing score.[16][17] For example, the algorithm may continue until the 95% confidence interval for the true score no longer contains the passing score. At that point, no further items are needed because the pass-fail decision is already 95% accurate, assuming that the psychometric models underlying the adaptive testing fit the examinee and test. This approach was originally called "adaptive mastery testing",[16] but it can be applied to non-adaptive item selection and to classification situations with two or more cutscores (the typical mastery test has a single cutscore).[17]
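
The following is a sketch of the SPRT termination rule under the point-hypothesis formulation described above, assuming a 2PL model and two hypothesized ability values straddling the cutscore (all parameter values are made up).

```python
import math

def prob_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def sprt_decision(responses, theta_low=-0.3, theta_high=0.3, alpha=0.05, beta=0.05):
    """Return 'pass', 'fail', or 'continue' based on Wald's decision bounds."""
    upper = math.log((1.0 - beta) / alpha)    # exceeding this -> decide theta_high (pass)
    lower = math.log(beta / (1.0 - alpha))    # falling below this -> decide theta_low (fail)
    llr = 0.0                                 # log likelihood ratio: theta_high vs theta_low
    for (a, b), u in responses:
        p_hi = prob_correct(theta_high, a, b)
        p_lo = prob_correct(theta_low, a, b)
        llr += math.log(p_hi / p_lo) if u == 1 else math.log((1.0 - p_hi) / (1.0 - p_lo))
    if llr >= upper:
        return "pass"
    if llr <= lower:
        return "fail"
    return "continue"

# Three responses (items given as (a, b) pairs): not yet enough evidence to classify.
print(sprt_decision([((1.2, 0.0), 1), ((1.0, 0.1), 1), ((1.4, -0.1), 0)]))
```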


The item selection algorithm utilized depends on the termination criterion. Maximizing information at the cutscore is more appropriate for the SPRT because it maximizes the difference in the probabilities used in the likelihood ratio.[18] Maximizing information at the ability estimate is more appropriate for the confidence interval approach because it minimizes the conditional standard error of measurement, which decreases the width of the confidence interval needed to make a classification.[17]
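
A sketch of the two selection targets follows, assuming 2PL item information and made-up item parameters; the same selection rule is used in both cases, and only the ability value it targets changes.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(available, target_theta):
    """Pick the available item with maximum information at the target ability."""
    return max(available, key=lambda ab: item_information(target_theta, *ab))

pool = [(1.3, 0.0), (1.3, 0.9), (0.9, 0.4)]       # hypothetical (a, b) pairs
print(select_item(pool, target_theta=0.0))         # SPRT-style: target the cutscore (here 0)
print(select_item(pool, target_theta=0.9))         # CI approach: target the current estimate
```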

Practical constraints of adaptivity

A simple method for controlling item exposure is the "randomesque" or strata method: rather than selecting the most informative item at each point in the test, the algorithm randomly selects the next item from among the five or ten most informative items. This can be used throughout the test, or only at the beginning.[3] Another method is the Sympson-Hetter method,[19] in which a random number is drawn from U(0,1) and compared to an exposure parameter k_i determined for each item by the test user. If the random number is greater than k_i, the next most informative item is considered instead.[3]
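
The following is a sketch of both ideas, assuming 2PL item information, a made-up item pool, and made-up exposure parameters k_i.

```python
import math
import random

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def randomesque(available, theta, k=5):
    """Randomesque control: choose at random among the k most informative items."""
    ranked = sorted(available, key=lambda ab: item_information(theta, *ab), reverse=True)
    return random.choice(ranked[:k])

def sympson_hetter(available, theta, exposure):
    """Sympson-Hetter-style control: walk down the information ranking and
    administer an item only if a U(0,1) draw does not exceed its parameter k_i."""
    ranked = sorted(available, key=lambda ab: item_information(theta, *ab), reverse=True)
    for item in ranked:
        if random.random() <= exposure[item]:
            return item
    return ranked[0]    # fallback: administer the most informative item

pool = [(1.2, -0.5), (1.0, 0.0), (1.4, 0.5), (0.9, 1.0)]   # hypothetical (a, b) pairs
k_i = {item: 0.7 for item in pool}                          # made-up exposure parameters
print(randomesque(pool, theta=0.0, k=2))
print(sympson_hetter(pool, theta=0.0, exposure=k_i))
```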

Wim van der Linden and colleagues[20] have advanced an alternative approach called shadow testing, in which an entire "shadow test" is assembled as part of selecting each item. Selecting items from shadow tests helps adaptive tests meet test specifications by focusing on choices that are globally optimal for the whole test, as opposed to choices that are only optimal for the item currently being selected.

Multidimensional

Given a set of items, a multidimensional computerized adaptive test (MCAT) selects items from the bank according to the examinee's estimated abilities, resulting in an individualized test. Unlike a unidimensional CAT, which evaluates a single ability, an MCAT seeks to maximize the test's precision with respect to multiple abilities simultaneously, using the sequence of items previously answered (Piton-Gonçalves & Aluísio).
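
The following is a sketch of one common multidimensional selection rule, assuming a two-dimensional compensatory 2PL model and D-optimality (choosing the item that most increases the determinant of the accumulated test information matrix); all parameter values are made up, and this is not claimed to be the specific method of the cited authors.

```python
import math

def prob_correct(theta, a, d):
    """Compensatory two-dimensional 2PL: a = (a1, a2) discriminations, d = intercept."""
    return 1.0 / (1.0 + math.exp(-(a[0] * theta[0] + a[1] * theta[1] + d)))

def info_matrix(theta, a, d):
    """2x2 Fisher information matrix p(1-p) * a a^T for one item."""
    p = prob_correct(theta, a, d)
    w = p * (1.0 - p)
    return [[w * a[i] * a[j] for j in range(2)] for i in range(2)]

def mat_add(m, n):
    return [[m[i][j] + n[i][j] for j in range(2)] for i in range(2)]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def select_item(pool, theta, accumulated_info):
    """Pick the item that most increases det(test information matrix) at theta."""
    return max(pool, key=lambda it: det2(mat_add(accumulated_info, info_matrix(theta, *it))))

# Hypothetical items: ((a1, a2), d).  The second ability dimension is still
# poorly measured here, so an item loading mainly on it is preferred.
pool = [((1.2, 0.2), 0.0), ((0.3, 1.1), -0.5), ((0.8, 0.8), 0.4)]
accumulated = [[0.5, 0.0], [0.0, 0.1]]
print(select_item(pool, theta=(0.0, 0.0), accumulated_info=accumulated))
```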

References

  1. "National Council on Measurement in Education". http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorA. 
  2. Weiss, D. J.; Kingsbury, G. G. (1984). "Application of computerized adaptive testing to educational problems". Journal of Educational Measurement 21 (4): 361–375. doi:10.1111/j.1745-3984.1984.tb01040.x.
  3. Thissen, D.; Mislevy, R. J. (2000). "Testing Algorithms". In Wainer, H. (ed.). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates.
  4. Green, B. F. (2000). "System design and operation". In Wainer, H. (ed.). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates.
  5. See the 2006 special issue of Applied Measurement in Education (https://www.tandfonline.com/doi/abs/10.1207/s15324818ame1903_1) or Computerized Multistage Testing (https://www.routledge.com/Computerized-Multistage-Testing-Theory-and-Applications/Yan-Davier-Lewis/p/book/9781466505773/) for more information on MST.
  6. Knox, Liam (5 March 2024). "College Board launches digital SAT". https://www.insidehighered.com/news/admissions/traditional-age/2024/03/05/college-board-launches-digital-sat. 
  7. "International Association for Computerized Adaptive Testing (IACAT)". http://www.iacat.org/. 
  8. Wainer, H.; Mislevy, R. J. (2000). "Item response theory, calibration, and estimation". In Wainer, H. (ed.). Computerized Adaptive Testing: A Primer. Mahwah, NJ: Lawrence Erlbaum Associates.
  9. Lawrence M. Rudner. "An On-line, Interactive, Computer Adaptive Testing Tutorial". EdRes.org/scripts/cat. http://edres.org/scripts/cat/catdemo.htm. 
  10. "Requirements of Computerized Adaptive Testing". FastTEST Web. http://www.fasttestweb.com/ftw-docs/CAT_Requirements.pdf. 
  11. "GMAT Tip: Adapting to a Computer-Adaptive Test". Bloomberg. April 3, 2013. http://www.businessweek.com/articles/2013-04-03/gmat-tip-adapting-to-a-computer-adaptive-test. 
  12. Lin, C.-J.; Spray, J.A. (2000), Effects of item-selection criteria on classification testing with the sequential probability ratio test. (Research Report 2000-8), Iowa City, IA: ACT, Inc., https://files.eric.ed.gov/fulltext/ED445066.pdf 
  13. Wald, A. (1947). Sequential analysis. New York: Wiley. https://archive.org/details/in.ernet.dli.2015.90255. 
  14. Reckase, M. D. (1983). "A procedure for decision making using tailored testing". In Weiss, D. J. (ed.). New horizons in testing: Latent trait theory and computerized adaptive testing. New York: Academic Press. pp. 237–255. ISBN 0-12-742780-5. https://archive.org/details/newhorizonsintes0000unse.
  15. Weitzman, R. A. (1982). "Sequential testing for selection". Applied Psychological Measurement 6 (3): 337–351. doi:10.1177/014662168200600310. https://archive.org/details/sim_applied-psychological-measurement_summer-1982_6_3/page/336. 
  16. Kingsbury, G. G.; Weiss, D. J. (1983). "A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure". In Weiss, D. J. (ed.). New horizons in testing: Latent trait theory and computerized adaptive testing. New York: Academic Press. pp. 257–283. ISBN 0-12-742780-5. https://archive.org/details/newhorizonsintes0000unse.
  17. Eggen, T. J. H. M.; Straetmans, G. J. J. M. (2000). "Computerized adaptive testing for classifying examinees into three categories". Educational and Psychological Measurement 60 (5): 713–734. doi:10.1177/00131640021970862. https://archive.org/details/sim_educational-and-psychological-measurement_2000-10_60_5/page/712.
  18. Spray, J. A.; Reckase, M. D. (5–7 April 1994). "The selection of test items for decision making with a computerized adaptive test". Annual Meeting of the National Council for Measurement in Education. New Orleans, LA. https://files.eric.ed.gov/fulltext/ED372078.pdf. 
  19. Sympson, B.J.; Hetter, R.D. (1985). "Controlling item-exposure rates in computerized adaptive testing". Annual conference of the Military Testing Association. San Diego. 
  20. van der Linden, W. J.; Veldkamp, B. P. (2004). "Constraining item exposure in computerized adaptive testing with shadow tests". Journal of Educational and Behavioral Statistics 29 (3): 273–291. doi:10.3102/10769986029003273. https://research.utwente.nl/en/publications/constraining-item-exposure-in-computerized-adaptive-testing-with-shadow-tests(2a8d1a25-43fd-4d2b-8cc3-47dbefed16b4).html. 
