Abstract


Usability evaluation methods have become an integral part of the software design life cycle in many product-based industries. These methods include usability testing, expert review, questionnaires and surveys, among others. This paper focuses on expert review and usability testing and discusses various issues associated with the reliability of these methods. The paper further provides techniques to mitigate these issues in order to increase the reliability of the methods.

Usability testing and expert review are used in this paper as umbrella terms that encompass all variations of the respective methods.



Overview


Usability testing and expert reviews are similar in the sense that both evaluation methods assess the usability of a product, and there is no standard procedure for running either. However, the two differ fundamentally: usability tests are conducted in controlled environments and involve multiple users working through a set of tasks, whereas an expert review involves a usability expert examining the system using his or her knowledge and experience while wearing the hat of a typical user.

The fact that neither usability testing nor expert review follows a standard procedure poses a big question for usability practitioners: are these methods reliable? To answer this question, we first need to understand what reliability means.

According to the Oxford dictionary, reliability is the quality of being trustworthy or of performing consistently. Applying this definition, we can rephrase the question as follows:

Can clients and stakeholders trust the results of the evaluation methods in context?
Will the results be consistent if the same test is run by different evaluators?
Will the results be consistent if the same test is conducted with different participants?

With the intention of answering these questions, usability evaluation methods have been under scrutiny for several years now. A considerable body of literature has examined the reliability of these methods, and the answer to the big question remains no, a conclusion I agree with. In order to understand why these methods are not reliable, it is imperative to understand the problems associated with them.



Problems


Based on the literature, there are many reasons why a usability test or an expert review may not be as reliable as usability practitioners once thought. Let’s look at them one by one.


1. Optimal number of participants


Nielsen (2000) and Virzi (1992) conducted separate studies and concluded that 5 participants uncover about 80% of the usability issues. However, Faulkner (2003) argues that testing with only 5 participants is risky because it does not always yield 80% of the usability problems. In Faulkner's experiment, some randomly selected sets of 5 participants (drawn from 60) found 99% of the problems, while other sets found only 55%. These results demonstrate that the outcome of a usability test can depend heavily on which participants happen to be selected, undermining its reliability.
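
The arithmetic behind the five-participant claim is the cumulative problem-discovery curve reported in Nielsen (2000), found(n) = 1 − (1 − L)^n, where L is the average probability that a single participant uncovers any given problem (roughly 31% in Nielsen's data). The short sketch below, a minimal illustration rather than a reproduction of either study's analysis, shows both why five participants are expected to find about 80% of the problems and how quickly that figure drops when L is lower.

    # Expected proportion of usability problems found by n participants,
    # using the cumulative discovery formula reported in Nielsen (2000):
    #   found(n) = 1 - (1 - L)**n
    # where L is the average probability that one participant hits a given problem.

    def proportion_found(n: int, L: float = 0.31) -> float:
        return 1 - (1 - L) ** n

    if __name__ == "__main__":
        # Nielsen's average L of 0.31 gives roughly 84% for five participants ...
        print(f"L=0.31, n=5 -> {proportion_found(5, 0.31):.0%}")
        # ... but the curve is sensitive to L, which varies across products,
        # problems, and participants (Faulkner, 2003).
        for L in (0.10, 0.20, 0.31, 0.45):
            print(f"L={L:.2f}, n=5 -> {proportion_found(5, L):.0%}")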


2. The ‘evaluator effect’


Evaluators assessing the same system with the same usability evaluation method tend to uncover substantially different sets of problems. This is called the 'evaluator effect' (Hertzum et al., 2002), and it has been validated by a number of studies. Comparative Usability Evaluation (CUE) is one such series of studies, in which a considerable number of professional usability teams from around the world independently and simultaneously evaluate the same system using their own usability testing or expert review practices. The evaluation results are then compared with one another in order to check the reliability of the methods. In CUE-2, Molich et al. (1998) studied nine teams of usability professionals and showed striking differences in approach, reporting, and findings between the teams. In CUE-4, Molich and Dumas (2008) studied seventeen professional teams and showed that the evaluator effect is present in both usability testing and expert review. More recently, in CUE-9, Hertzum et al. (2014) studied nineteen usability professionals and concluded that the evaluator effect is present in both moderated and unmoderated usability tests.
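
The size of the evaluator effect is often summarised in these studies as any-two agreement, the average overlap between the problem sets reported by any pair of evaluators. The sketch below computes it for hypothetical problem sets, assuming each evaluator's findings have already been mapped to shared problem identifiers; low values indicate a strong evaluator effect.

    from itertools import combinations

    def any_two_agreement(problem_sets):
        """Average overlap |A & B| / |A | B| across all pairs of evaluators."""
        pairs = list(combinations(problem_sets, 2))
        overlaps = [len(a & b) / len(a | b) for a, b in pairs if a | b]
        return sum(overlaps) / len(overlaps)

    # Hypothetical problem sets reported by three evaluators of the same system.
    evaluators = [
        {"P1", "P2", "P3", "P7"},
        {"P2", "P3", "P5"},
        {"P1", "P2", "P9"},
    ]
    print(f"Any-two agreement: {any_two_agreement(evaluators):.0%}")  # ~33%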


3. Biases in evaluation methods


Usability tests and expert reviews simulate, as closely as possible, what users would actually do with the system while practitioners observe, record, or evaluate. No matter how realistic the scenario, task, data, software, or environment is, these methods remain simulations, and subtle biases creep into the process (Croskerry, 2002; Nielsen, 2000; Shafir et al., 1993; Sauro, 2012; Loranger, 2016). Biases that can affect the evaluation methods are listed below:

Hawthorne Effect: Participants are more vigilant and determined to do tasks when they are being observed.

Task-Selection Bias: “If you’ve asked me to do it, it must be able to be done.”

Social Desirability Bias: Users generally tell you what they think you want to hear.

Sample Bias: Participants may not accurately represent the actual users.

Honorariums: If the honorarium the user receives is the only motivator, then the quality of the data might be affected.

Selection Bias: “If you’ve asked me about it, it must be important.”

Primacy & Recency Effects: Tendency to weigh events that happened first and last more heavily than those that happened in between.

Negativity Bias: Tendency to give more attention or weight to negative experiences, even when they are inconsequential.

Framing Effect: Tendency to react differently to the same information depending on how it’s worded.

Anchoring Bias: Tendency to heavily depend on specific information or features, usually the ones encountered first.



Mitigation


Every method has issues and biases involved, but that alone is not a reason to dismiss the data. Many basic and applied research studies rely on methods that carry similar issues and biases, and despite these shortcomings we are able to learn and infer a great deal from them. We must, therefore, find ways to mitigate the problems described above.


1. Triangulation


One way to mitigate the problems associated with usability testing and expert review is the explicit use of multiple methods, measures, and approaches to determine the core issues in a system (Wilson, 2006). Results found through multiple methods help converge on the problem areas and increase the overall reliability of the data. For example, while evaluating a speech-enabled interactive voice response system for a banking client, we conducted three separate studies: a usability test, a heuristic evaluation, and an analysis of customer support data. This is referred to as between-method triangulation, and it helped us focus on the core problems that emerged across all three methods. It not only gave us more reliable results but also helped convince the stakeholders and the team to focus on this convergence of results.
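
As a rough illustration of how between-method triangulation can be operationalised, the sketch below intersects the problem lists produced by each method to surface the issues on which the methods converge. The method names and problem identifiers are hypothetical.

    from functools import reduce

    # Hypothetical problem identifiers reported by each evaluation method.
    findings = {
        "usability_test": {"menu_depth", "jargon", "timeout", "no_confirmation"},
        "heuristic_evaluation": {"menu_depth", "jargon", "contrast"},
        "support_data_analysis": {"menu_depth", "timeout", "jargon"},
    }

    # Problems every method converged on: the strongest candidates to report first.
    core_problems = reduce(set.intersection, findings.values())

    # Problems flagged by at least two methods still carry corroborating weight.
    all_problems = set().union(*findings.values())
    corroborated = {p for p in all_problems
                    if sum(p in s for s in findings.values()) >= 2}

    print("Converged across all methods:", sorted(core_problems))
    print("Corroborated by two or more methods:", sorted(corroborated - core_problems))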

Within-method triangulation can help mitigate many of the risks associated with the various biases involved. For example, in one of the studies we conducted for an e-commerce client, the usability tests were run by multiple moderators and evaluators, which helped cancel out individual biases. Triangulating data from multiple practitioners thus reduced the influence of negativity bias, anchoring bias, and primacy and recency effects.


2. Transparency at work


Knowing the individual members of your team and discussing the biases each one brings to the table also helps mitigate the problems associated with evaluation methods. Consider a case where a team of usability practitioners is evaluating a gaming interface. Practitioners who are gamers themselves would evaluate the interface from an expert user's point of view, whereas practitioners who aren't into games would evaluate it from a novice user's frame of reference. Communicating the level of gaming experience each evaluator has can help avoid skewed results.


3. Sampling


Some sampling methods, especially in usability testing studies, rely on volunteers who opt in, and there are concerns about how representative such samples are. For example, if the test sessions run during office hours, the recruits might include only those people who aren't busy. This is an issue if your actual users are engineers, physicians, or other skilled workers who can't easily take time off work. Using sampling methods that select participants against criteria covering all the important characteristics of the actual user population can therefore help mitigate sample bias. In cases like this one, remote testing can further reduce sample bias.
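
A criterion-based screener of this kind can be expressed as a simple filter over the recruiting pool, as sketched below. The candidate fields, target occupations, and thresholds are hypothetical stand-ins for whatever characteristics define the actual user population, and lab availability is included to reflect the point about remote testing.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        name: str
        occupation: str
        years_of_domain_experience: int
        can_travel_to_lab: bool

    # Hypothetical criteria derived from the characteristics of the actual users.
    TARGET_OCCUPATIONS = {"engineer", "physician"}
    MIN_EXPERIENCE_YEARS = 2

    def passes_screener(c: Candidate) -> bool:
        return (c.occupation in TARGET_OCCUPATIONS
                and c.years_of_domain_experience >= MIN_EXPERIENCE_YEARS)

    pool = [
        Candidate("A", "engineer", 5, True),
        Candidate("B", "student", 0, True),
        Candidate("C", "physician", 3, False),
    ]

    recruits = [c for c in pool if passes_screener(c)]
    # Remote testing keeps otherwise-eligible but busy professionals in the sample.
    needs_remote_session = [c.name for c in recruits if not c.can_travel_to_lab]
    print("Eligible recruits:", [c.name for c in recruits])
    print("Schedule as remote sessions:", needs_remote_session)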

In studies where a difference among participants, such as expertise, is the independent variable, care must be taken to ensure that all other parameters are kept constant.


4. Blind raters


One way to reduce biases is to have multiple blind raters, i.e. people other than the moderators or evaluators, categorize and rate the problems. Since the raters have no knowledge of the conditions or participants, there is very little chance of a bias creeping in. The validity of this approach can be further improved by randomizing the order in which the problems are rated.
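
One lightweight way to set this up is sketched below: problem descriptions are stripped of any condition or participant information and presented to each blind rater in a different randomized order, keyed only by neutral problem ids. The ids, descriptions, and rater names are hypothetical.

    import random

    # Problem descriptions with condition and participant metadata already removed.
    problems = {
        "PR-01": "Could not locate the search field on the landing page.",
        "PR-02": "Checkout form rejected valid postal codes.",
        "PR-03": "Error message used internal jargon.",
    }

    def rating_sheet(seed: int):
        """Return problem ids in a randomized, reproducible order for one rater."""
        order = list(problems)
        random.Random(seed).shuffle(order)
        return order

    # Each blind rater sees the same problems in a different order, with no
    # knowledge of which session, participant, or evaluator produced them.
    for seed, rater in enumerate(["rater_a", "rater_b", "rater_c"]):
        print(rater, rating_sheet(seed))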



Conclusion


In an evaluation study, a practitioner would never report all the usability problems found, since such a report would be unusable. He or she would probably select 20-30 of the most severe problems and report those, essentially abandoning the rest. Even so, in many companies usability testing and expert review have proved useful and provided a great return on investment.

I still maintain my position that usability testing and expert review, as usability evaluation methods, are not reliable. However, the problems associated with these methods can be mitigated by communicating individual differences and biases; incorporating appropriate sampling methods; and using multiple methods, measures, and approaches to triangulate results.

In closing, I would like to quote renowned user experience professional Steve Krug (2014): “If you want a great site, you’ve got to test…. Testing one user is 100 percent better than testing none.”



References


Croskerry, P. (2002). Achieving quality in clinical decision making: cognitive strategies and detection of bias. Academic Emergency Medicine, 9(11), 1184-1204.

Faulkner, L. (2003). Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments, & Computers, 35(3), 379-383.

Hertzum, M., Molich, R., & Jacobsen, N. E. (2014). What you get is what you see: revisiting the evaluator effect in usability tests. Behaviour & Information Technology, 33(2), 144-162.

Hertzum, M., Jacobsen, N. E., & Molich, R. (2002, April). Usability inspections by groups of specialists: perceived agreement in spite of disparate observations. In CHI'02 extended abstracts on Human factors in computing systems (pp. 662-663). ACM.

Krug, S. (2014). Don't Make Me Think, Revisited: A Common Sense Approach to Web Usability. Berkeley, CA: New Riders.

Loranger, H. (2016, October 23). The Negativity Bias in User Experience. Retrieved February 24, 2018, from https://www.nngroup.com/articles/negativity-bias-ux/

Molich, R., Bevan, N., Butler, S., Curson, I., Kindlund, E., Kirakowski, J., & Miller, D. (1998). Comparative evaluation of usability tests. In Usability Professionals Association 1998 Conference, 22-26 June 1998 (pp. 189-200). Washington, DC: Usability Professionals Association.

Molich, R., & Dumas, J. S. (2008). Comparative usability evaluation (CUE-4). Behaviour & Information Technology, 27(3), 263-281.

Nielsen, J. (2000). Why you only need to test with 5 users.

Sauro, J. (2012, April 21). 9 Biases in Usability Testing. Retrieved February 23, 2018, from https://measuringu.com/ut-bias/

Shafir, E., Simonson, I., & Tversky, A. (1993). Reason-based choice. Cognition, 49(1-2), 11-36.

Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough?. Human factors, 34(4), 457-468.

Wilson, C. E. (2006). Triangulation: the explicit use of multiple methods, measures, and approaches for determining core issues in product development. interactions, 13(6), 46-ff.