Andrea Cappelli

August 15, 2019

How to Improve Assessments with Multiple-Choice Questions

Multiple-choice questions are used primarily in educational and training assessments. However, characterizing a user’s skills also requires characterizing the tests they take. Questions are usually thought of in terms of difficulty, but there are several parameters that can help capture their properties and overall quality.

In this article, we’ll provide a brief overview of assessing a user based on items (e.g., students with high skill correctly answer difficult questions, which in turn are those that only highly skilled students typically can answer, etc.). We’ll also show a few of the properties that make for good and bad questions, to help improve assessments with multiple-choice questions.

We created a Cloud Knowledge Evaluation that uses the principles that we discuss in this article. It tests your current skills to reveal your strengths and weaknesses, and the system estimates your knowledge level based on the difficulty of the questions and the number of correct answers. To see for yourself, take our Cloud Knowledge Evaluation a couple of times and watch how the multiple-choice questions change based on your answers.

Cloud Knowledge Evaluation (Multiple-Choice Questions Example)

Skill assessment

Skill assessment is the process by which the ability of a person performing a task is measured. This is key information in several concrete use cases, such as:

Students or professionals eager to know where their knowledge stands with respect to the job market and where there’s room for improvement.
Company managers investigating the strong and weak points of their team to find who has the skills for a given project or which areas the team needs improvement on.
People involved in the recruiting process to standardize the measurement of a candidate’s ability.
Automated e-learning systems aimed at providing the most appropriate learning material to users of a platform.

In practical terms, assessing a skill means quantifying an individual’s proficiency and computing a score. This score allows you to compare a person’s level to that of another, or that person’s current level to a goal (e.g., the future level the person wants to reach). In order to reliably do this, we have to choose appropriate tools for the job.

Multiple-choice questions and basic assessment

One solid, long-standing way to measure an individual’s skill is to use an appropriate set of challenges (often referred to as items) and to calculate the score on the basis of the degree of success achieved. For this purpose, the most popular strategy is using a multiple-choice question (MCQ) exam and computing the score as the percentage of correct answers over the total. This approach, known as Classical Test Theory (CTT), is very transparent and simple to implement, and it works well when every examinee is required to answer the same set of questions.
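As a minimal sketch of this percent-correct scoring, assume the answers are recorded as a list of booleans, one per question. The function name `ctt_score` and the two examinees are hypothetical and only serve as an illustration.

```python
def ctt_score(answers):
    """Return the percentage of correct answers (0-100).

    `answers` is a list of booleans, one per question: True means correct.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    return 100.0 * sum(answers) / len(answers)


# Two hypothetical examinees who took the same 20-question exam.
alice = [True] * 10 + [False] * 10
bob = [True] * 15 + [False] * 5

print(ctt_score(alice))  # 50.0
print(ctt_score(bob))    # 75.0
```

The two scores can be compared directly only because both examinees answered the same questions.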

However, there are several cases where we might want to vary the questions. Consider a person taking a test twice: a preliminary assessment at on-boarding and a final exam after training to track improvement. Ideally, there should be no overlap between the questions during the first and second attempt. In other cases, we might even want to use a different number of questions, such as a quick, 10-question preliminary assessment and an exhaustive 30-question final exam.

The goal is to compare two users (or the same user in two different attempts) who answered two different question pools about the same topic. Consider the following example: Two users are tested on their Java proficiency and are asked 20 questions each, but the questions are taken from two different sets. If both users answer 10 of the 20 questions correctly, can we state that they have the same level of proficiency? To compare the two users, we need to know which type of questions they were asked. For instance, if one user was asked 20 “easy” questions while the other answered 20 “hard” questions, we can likely declare that the latter is more proficient than the former. Of course, the typical scenario is not as clear-cut, and it requires some formal tools to compare the two users numerically.

Let’s try to better define this concept and see which properties of the questions should be taken into account in the assessment.

Multiple-choice question difficulty

What’s a difficult question? A common definition is “one that a large percentage of people fail.” But there is a caveat: What if a question was answered incorrectly by practically every examinee, but those same people all turned out to be novices in the topic? The risk is that we estimate the question to be more difficult than it really is.

Now let’s rephrase our problem and improve our definition of difficulty:

We expect an unskilled examinee to be able to correctly answer some of the easiest questions, likely failing most of the more difficult ones.
On the other hand, we expect an expert examinee to be able to correctly answer most of the easiest questions and some of the most difficult ones.
However, for whatever reason, an expert may fail some easy questions and, similarly, a novice may correctly guess some very difficult ones.

These considerations suggest tackling the problem using the concept of probability. Whenever an individual “encounters” a test item, the probability of succeeding against it depends on both the user’s skill and the item’s difficulty.

A reasonable and very common function that’s used to model this behavior (known as sigmoid or logistic function) is shown in the graph below.

Multiple-Choice Questions Difficulties Graph

Each curve represents the probability of correctly answering a given question as a function of the examinee’s skill. The graph shows two curves, for two questions with different difficulty parameters. The further a curve is shifted to the right, the harder the question is.

For example, a novice examinee will have less than 10% chance of correctly answering the hard question, but close to 30% for the easy one. Similarly, the expert examinee is expected to correctly answer the easy question (90+% probability), and to have good chances at the hard one (75%).
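The curves described above can be sketched with a plain logistic function. This is only an illustration under stated assumptions: the `scale` parameter, the difficulty values, and the skill values are hypothetical and are not read from the graph.

```python
import math


def p_correct(skill, difficulty, scale=100.0):
    """Probability of a correct answer under a logistic (sigmoid) model.

    `skill` and `difficulty` live on the same scale. `scale` controls how
    quickly the probability changes with their difference; 100.0 is an
    assumption chosen for a 0-1000 scale.
    """
    return 1.0 / (1.0 + math.exp(-(skill - difficulty) / scale))


easy, hard = 400.0, 600.0       # hypothetical item difficulties
novice, expert = 300.0, 700.0   # hypothetical examinee skills

for name, skill in (("novice", novice), ("expert", expert)):
    print(name,
          round(p_correct(skill, easy), 2),
          round(p_correct(skill, hard), 2))
```

When skill equals difficulty the probability is exactly 0.5, and moving the difficulty up shifts the whole curve to the right.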

All of these concepts pave the way for one of the most reliable, standard psychometric techniques: Item Response Theory (IRT). Let’s see what it consists of.

Item response theory

IRT is widely used to serve assessments and estimate examinees’ skills. It treats skill and difficulty as homogeneous quantities, so that they can be represented on the same scale (say 0 to 1000, but the range doesn’t really matter as long as the resulting probabilities are kept invariant).

To estimate the skill of users, IRT needs to know the difficulties of the questions they answered. However, to estimate the difficulty of a question, it needs to know the skill of the users who answered it, as we previously mentioned.

IRT tackles this “loop” problem by “calibrating” the question pool. To do this, it learns the optimal set of person skills and question difficulties that best explains the past answers given by the users. The result is a long list of estimated skills (good to have but not necessary for future use) and calibrated difficulties, which are instrumental to evaluate new examinees.
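The article does not say which estimation method is used for calibration. One simple option, sketched here as an assumption, is gradient ascent on the joint log-likelihood of a one-parameter logistic model. The function names, the learning rate, the step count, and the response matrix are all hypothetical, and the values are on a logit scale rather than 0 to 1000.

```python
import math


def p_correct(skill, difficulty):
    """Logistic probability of a correct answer (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))


def calibrate(responses, steps=2000, lr=0.05):
    """Jointly estimate skills and difficulties by gradient ascent.

    `responses[u][i]` is 1 if user u answered item i correctly, else 0.
    Returns (skills, difficulties); difficulties are centered at zero
    because the scale has no natural origin.
    """
    n_users, n_items = len(responses), len(responses[0])
    skills = [0.0] * n_users
    diffs = [0.0] * n_items
    for _ in range(steps):
        grad_s = [0.0] * n_users
        grad_d = [0.0] * n_items
        for u in range(n_users):
            for i in range(n_items):
                # Observed outcome minus predicted probability.
                resid = responses[u][i] - p_correct(skills[u], diffs[i])
                grad_s[u] += resid  # better than predicted: raise skill
                grad_d[i] -= resid  # better than predicted: lower difficulty
        skills = [s + lr * g for s, g in zip(skills, grad_s)]
        diffs = [d + lr * g for d, g in zip(diffs, grad_d)]
        # Fix the origin; shifting both keeps every probability unchanged.
        shift = sum(diffs) / n_items
        skills = [s - shift for s in skills]
        diffs = [d - shift for d in diffs]
    return skills, diffs


# Hypothetical past answers: 5 users (rows) x 4 items (columns).
responses = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
skills, diffs = calibrate(responses)
print([round(d, 2) for d in diffs])
print([round(s, 2) for s in skills])
```

In this toy data no user and no item has an all-correct or all-wrong record. Such records have no finite maximum-likelihood estimate, so a real calibration would need to handle them separately.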

Once the question pool has been calibrated, when an individual takes the exam and answers a list of questions, we need to estimate the user’s proficiency. IRT looks for the single skill score that best explains the pattern of item successes and failures that the examinee provided.

For concreteness, let’s have a look at these two cases that represent the answer pattern by an examinee in a 10-question exam. X’s are for incorrect answers, and check marks are for correct answers. The x-axis just shows numerical identifiers for the questions. The y-axis reports each question’s difficulty.

Let’s focus on the left case first. Where would you place your estimate of the student’s skill?

No matter the precise location, you probably chose somewhere between 400 and 500, that is, between the most difficult correct answer and the easiest wrong answer. The good news is, that’s what IRT would estimate, too!
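The search for the skill that best explains an answer pattern can be sketched as a maximum-likelihood grid search. The difficulties below are hypothetical stand-ins for the left case (four correct answers up to difficulty 400, six wrong answers from 500 up), and the logistic `scale` of 100 is an assumption.

```python
import math


def p_correct(skill, difficulty, scale=100.0):
    """Logistic probability of a correct answer on a 0-1000 scale."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty) / scale))


def log_likelihood(skill, difficulties, outcomes):
    """Log-probability of the observed answer pattern at a given skill."""
    total = 0.0
    for difficulty, correct in zip(difficulties, outcomes):
        p = p_correct(skill, difficulty)
        total += math.log(p if correct else 1.0 - p)
    return total


def estimate_skill(difficulties, outcomes, lo=0, hi=1000):
    """Return the integer skill in [lo, hi] with the highest likelihood."""
    return max(range(lo, hi + 1),
               key=lambda s: log_likelihood(s, difficulties, outcomes))


# Hypothetical calibrated difficulties and a clean answer pattern.
difficulties = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
outcomes = [True] * 4 + [False] * 6

estimate = estimate_skill(difficulties, outcomes)
print(estimate)
```

With this pattern the estimate falls between 400 and 500, in line with the intuitive answer.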

The case on the right is a fuzzier, more realistic one: Given an examinee’s performance, it’s not always possible to draw a “skill line” that perfectly separates correct and wrong answers, because the examinee might succeed at a tough question by chance or get distracted and fail an easy one. You’d probably end up with an estimate similar to the previous one, although maybe with less confidence. That’s where the IRT estimation criterion comes in handy: It provides a quantitative answer based on data, and a measure of confidence is part of the result.

Multiple-choice question discrimination

Beyond the difficulty, questions have other properties that affect the assessment.

Think about a question that almost every user fails, regardless of proficiency. This can happen when a question falls outside the exam’s topic or is poorly worded. Advanced versions of IRT take this into account and model it through a parameter called discrimination.
For instance, the following graph represents three questions with the same difficulty, but with three different discriminations.

Multiple-Choice Question Discrimination Graph

As you can see, discrimination affects the “slope” of the curve. Very high discrimination implies that examinees with a skill just a bit higher than the question difficulty are expected to ace it. If a user gets it correct, that is very strong evidence that the user’s skill is higher than the question difficulty. On the other hand, the flat curve represents the extreme case of no discrimination: Any examinee, regardless of skill, has a 50% probability of answering correctly. Therefore, even if a user gets it correct, we cannot make any inference about the user’s actual skill.

Interestingly enough, questions can even have a negative discrimination. This typically means there is some built-in problem in the question (e.g., unclear wording) and so it should be reviewed.
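A common way to express discrimination is the two-parameter logistic model, where discrimination multiplies the difference between skill and difficulty. The sketch below assumes that form. The function name, the `scale` value, and the discrimination values are hypothetical.

```python
import math


def p_correct_2pl(skill, difficulty, discrimination, scale=100.0):
    """Two-parameter logistic model: discrimination scales the slope."""
    z = discrimination * (skill - difficulty) / scale
    return 1.0 / (1.0 + math.exp(-z))


difficulty = 500.0
# Hypothetical discriminations: high, moderate, none, negative.
for a in (3.0, 1.0, 0.0, -1.0):
    row = [round(p_correct_2pl(s, difficulty, a), 2) for s in (400, 500, 600)]
    print(a, row)
```

With zero discrimination every skill level gets a probability of 0.5. With a negative value the curve slopes downward, so stronger examinees become less likely to answer correctly.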

Conclusion

Multiple-choice question assessments are the typical tool used to assess users’ skills, such as to prescreen candidates to hire or to assess the current level of proficiency before starting a training path.

While relying on the percentage of correct answers is a straightforward and effective way of measuring the skills of the examinees, there are several scenarios where this simple count can fail and provide inaccurate results. Factors like the number of questions, difficulty, and discrimination strongly affect the assessment and are to be taken into account. We presented IRT as a state-of-the-art technique able to support even complex cases. However, it is important that the person who designs the exam takes these factors into account.

As an example, if the exam includes only easy questions, neither IRT nor any other tool can reliably assess an examinee’s proficiency, because examinees above a certain level will all answer nearly everything correctly. Similarly, if the available questions are not discriminating enough, the resulting skill estimate will carry low confidence.

Last, but not least, the number of questions answered by the examinee affects the confidence of the estimate. While there is no ideal number of questions, IRT is based on probabilities. These probabilities provide both the estimate and a measure of its confidence, which indicates whether the number of questions was sufficient for an accurate assessment.
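The article does not state how the confidence is computed. One standard approximation, used here as an assumption, is the standard error derived from the Fisher information of a one-parameter logistic model. The sketch compares three hypothetical exams for the same examinee; all names and values are illustrative.

```python
import math


def p_correct(skill, difficulty, scale=100.0):
    """Logistic probability of a correct answer on a 0-1000 scale."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty) / scale))


def standard_error(skill, difficulties, scale=100.0):
    """Approximate standard error of a skill estimate.

    Each item contributes p * (1 - p) / scale**2 to the Fisher information;
    the standard error is the inverse square root of the total.
    """
    info = 0.0
    for difficulty in difficulties:
        p = p_correct(skill, difficulty, scale)
        info += p * (1.0 - p) / scale ** 2
    return 1.0 / math.sqrt(info)


skill = 700.0                   # hypothetical expert examinee
targeted_10 = [700.0] * 10      # 10 questions matched to the skill
targeted_30 = [700.0] * 30      # 30 questions matched to the skill
too_easy_10 = [200.0] * 10      # 10 questions far below the skill

se_10 = standard_error(skill, targeted_10)
se_30 = standard_error(skill, targeted_30)
se_easy = standard_error(skill, too_easy_10)
print(round(se_10, 1), round(se_30, 1), round(se_easy, 1))
```

Under these assumptions, more questions shrink the standard error, while questions that are far too easy for the examinee contribute very little information.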