Defining the Problem With Usability Testing (UT)

This document is a basic guide to usability testing for user experience researchers, designers & product teams in the federal space. It covers our guiding principles, constraints unique to federal product development, and tips to produce effective usability tests.

Usability testing minimizes risk by validating features and enhancements before they reach users. It confirms that the correct problem was identified, which is integral to CTG's design philosophy. The type of usability test should change from one design lifecycle stage to the next, so that each test contributes as much as possible to solving the right problem with the most impact.

Federal product development has high stakes and can face lawmaker scrutiny. When a product fails on usability or adoption, the result can be not only a loss of confidence but also a detrimental effect on people's lives on a large scale.

TLDR USABILITY GUIDE

Before You Test: Checklist

Perform initial root cause analysis. Why now, why this?
Identify the stage: Discovery, Concept Alignment, Validation, or Post-Release.
Consider Bailey’s Human Performance Model.
Define your hypothesis and establish a control.
Write tasks that are realistic.
Have a plan for what test to use when.

Completion & Satisfaction Standards

Task completion target: 100% of participants complete the task.
Meets your organization’s definition of usability.
Satisfaction target: at least 85% respond “Like/Agree” on a Likert or Satisfaction scale, netting majority approval.
The solution solves the problem.
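
The two numeric targets above can be checked mechanically once a round of testing is done. The following is a minimal sketch, assuming results are summarized as simple counts; the names RoundResults and meets_standards are hypothetical, and the qualitative criteria (your organization's definition of usability, whether the solution solves the problem) still need human judgment.

```python
from dataclasses import dataclass


@dataclass
class RoundResults:
    """Hypothetical summary of one usability test round."""
    participants: int  # people who attempted the task
    completed: int     # people who finished the task
    satisfied: int     # people who answered "Like/Agree"


def meets_standards(results: RoundResults,
                    completion_target: float = 1.00,
                    satisfaction_target: float = 0.85) -> bool:
    """Return True when both numeric targets from the checklist are met."""
    if results.participants == 0:
        return False  # no participants means nothing was validated
    completion_rate = results.completed / results.participants
    satisfaction_rate = results.satisfied / results.participants
    return (completion_rate >= completion_target
            and satisfaction_rate >= satisfaction_target)
```

For example, a round where all 20 participants finish and 18 respond "Like/Agree" passes, while a round where one participant cannot finish does not, no matter how satisfied the others are.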

01 The Guiding Philosophy

Defining Usability

Every organization defines usability differently according to its needs and product. In the federal space, usability is:

Usable: Task completion gets the user from point A to point B.
Efficient: Intuitive and quick to use, does not negatively impact user metrics.
Flexible: Policy and methods can change rapidly. The product should still work even if policy or scale changes.
Accessible: Federal products require 100% Section 508 compliance.

Usability must also account for external factors. Your users will have inherent biases, such as opinions on the current design system or on policy coming from the business. There will be budget constraints.

During Testing, You Are Not a Designer. You Are a Researcher.

Before you write a single task scenario or recruit a single participant, give yourself ample time to write the objective and overall plan of the test:

Objective: what insight do you hope to gain?
Hypothesis: what is your underlying assumption to validate?
Recruitment plan: who owns reaching out to the users? By what date? Which user types?
Test type: what kind of testing will be done? Why this test?
Test plan: Detailed description of what the test will entail.
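
One way to hold yourself to this is to treat the plan as a record that must be filled in before any recruiting starts. The sketch below is only an illustration under that assumption; ResearchPlan and missing_sections are hypothetical names, and the five fields simply mirror the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class ResearchPlan:
    """Hypothetical record mirroring the five planning items above."""
    objective: str = ""
    hypothesis: str = ""
    recruitment_plan: str = ""
    test_type: str = ""
    test_plan: str = ""


def missing_sections(plan: ResearchPlan) -> list[str]:
    """Return the names of any sections that are still blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


draft = ResearchPlan(
    objective="Learn whether case workers can submit a form unaided",
    hypothesis="A single-page form reduces abandoned submissions",
)
print(missing_sections(draft))
```

An empty list means every section has at least some content; it says nothing about whether that content is any good.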

As researchers, we create a comfortable environment and observe the user’s reactions. We listen, and we only ask questions when appropriate. We do not take offense. Being clear in your testing plan helps you run research confidently.

Testing for MVP

How much is required before it is useful enough to release? That brings us back to our definition of usability in the federal space. Because of a confluence of factors like urgency and policy, your usability test should always probe the user’s threshold for a Minimum Viable Product.

What capabilities must exist on Day 1 for this product to deliver real value? What can users temporarily do without?

Additionally, check that the solution offered encompasses and validates your answers to these integral design questions:

What outcome is the organization trying to achieve?
Who is affected by the problem?
What constraints or resources exist?
How will success be measured?
What would failure look like?

Once that initial use case is usable and ready to go, you can always revisit the product for another use case. Read about Airbnb’s methodology on “getting it right first” before scaling up.

02 The Federal Context

How you plan, recruit, and run your tests must be shaped by the environment you are working in. Here is a great breakdown of usability testing by Homeland Security.

User Availability

Federal users are busy. They face mission-critical workloads. Your average user probably wears many hats in their agency. They work in an office, so there are natural distractions, like noise or environmental changes due to their setup. Most test settings will be remote, so we must be cognizant of those factors.

Federal systems run on legacy technology that gets replaced every five to ten years, sometimes longer. This is a generalization, but a useful one: the long replacement cycle creates a workforce shaped by years of adapting to systems that rarely change. When change finally arrives, reactions can be mixed.

Ronaldinho and Messi

As with many systems that have been around for a long time, we can generalize users into two groups. If you like FC Barcelona, we have Ronaldinho, the old-school champion, and Messi, the rookie. (Both are legendary.)

Ronaldinho carries a wealth of institutional knowledge. He knows the workflows deeply, often better than anyone who designed the current system. But he has also been forced to establish workarounds, and he has built habits.

Messi is more receptive to visual and workflow changes. He is not jaded by system changes. Note: in the federal context, a “rookie” might have five years of service or have transferred in from another agency.

When you recruit users, gather a realistic mix of both the Ronaldinhos and the Messis. Do not test defensive players if you are creating a tool for goalies.

External vs. Internal

Depending on clearance level, and on whether the product is external or internal, usability testing and survey feedback may have to be collected anonymously. Traditional user interviews, which rely on an ongoing relationship with a named participant, may not be available.

The sponging and alignment that would normally happen in interviews can instead be accomplished through desk research on available reports and prior assessments, and through aggregated surveys. In the fed space, we keep it flexible: “We get what we get and we don’t get upset.”

03 Theoretical Foundations

Two frameworks form the intellectual backbone of the approach described in this guide.

Bailey’s Human Performance Model

P = f(Human, Activity, Context)

Bailey's Human Performance Model works alongside usability testing. It posits that the performance of the product is a function of the human, the activity, and the context.

Consider what happens when a group of humans with skillset A designs a product for humans with skillset B, for purpose X. Now let's say that purpose X changes to purpose Y. The product was not designed for that human or that purpose, so it can no longer perform well.

That is exactly what happened at Chernobyl. A group of physicists designed a power plant for physicists, but it was ordinary people running the power plant. When the context changed (an explosion), those "users" did not know what to do and could not understand even the user manual that was written to operate the plant. Furthermore, there was no instruction for the scenario of an explosion, so while the reactor began to melt down, the larger population remained uninformed.

Bailey's Human Performance Model is used with usability testing in order to predict user behavior as accurately as possible. Predicting that behavior, and then observing whether it holds, is what creates the validation. The two work together because of how incredibly complex human behavior is.

The Human is the person using the system: their cognitive abilities, prior knowledge, emotional state, and physical capabilities. Other factors you can keep digging for: what user role is the human? How often does the human get trained in updates? How does the human react to system changes?
The Context is the environmental and situational factors surrounding the task. If it’s a tool to report a natural disaster, the performance of the tool hinges on its usability in multiple environmental extremes.
The Activity is the specific task or operation the user is executing. If it’s a reporting tool for a natural disaster, then the activity is the task of reporting and sharing event details and escalating when necessary.

No single component tells the whole story. A task that is perfectly designed for one human in one context may be genuinely unusable for a different human under different conditions.

Bailey’s Human Performance Model is a framework for asking: who is this person, where are they, and what exactly are they trying to do?
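
To make the three components concrete, it can help to record the human and the context alongside every observed attempt at an activity, then look at completion per combination rather than overall. The sketch below assumes exactly that; the observations are made up for illustration, and the labels ("veteran", "newcomer", "field site") and the function name are hypothetical.

```python
from collections import defaultdict

# Made-up observations for one activity (reporting a natural disaster).
# They are illustrative only, not real test data.
observations = [
    {"human": "veteran", "context": "quiet office", "completed": True},
    {"human": "veteran", "context": "field site", "completed": True},
    {"human": "newcomer", "context": "quiet office", "completed": True},
    {"human": "newcomer", "context": "field site", "completed": False},
]


def completion_by_human_and_context(rows):
    """Group completion rates by (human, context) for a single activity."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [completed, attempted]
    for row in rows:
        key = (row["human"], row["context"])
        tallies[key][1] += 1
        if row["completed"]:
            tallies[key][0] += 1
    return {key: done / attempted for key, (done, attempted) in tallies.items()}


rates = completion_by_human_and_context(observations)
for (human, context), rate in rates.items():
    print(f"{human:9} | {context:12} | {rate:.0%}")
```

In this invented data the overall completion rate looks acceptable, yet one human in one context cannot finish at all, which is the kind of gap an overall average hides.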

The Machine vs. The Person

In the Handbook of Usability Testing, Jeffrey Rubin and Dana Chisnell draw a helpful distinction. During the design and development of a product, the emphasis is on the machine or system, not on the person who is the ultimate end user.

When you are in Figma building a wireframe, you are thinking about layout, hierarchy, component states, and field logic. You are mapping out the service blueprint and understanding backend needs. You are 'in the machine'.

Testing transforms those mockups of clean boxes and fields into a form that a human must complete in real life.

04 Four Approaches to Usability Testing (UT)

Context is everything. We choose the test based on what we understand about the problem. Here we discuss Homeland Security's recommended categories for approaching UT.

Exploratory

Exploratory testing is synonymous with discovery and concept testing. When we are constrained by anonymity or user availability, desk research is a good substitute. Review audit findings, policy documents, prior assessment results, and patterns in existing helpdesk logs and support tickets. Some of our designers even comb through Reddit to unearth frustrations. Play back past user interviews and take notes.

Comparative

Comparative testing happens once the idea has formed and low-fidelity or mid-fidelity wireframes are available for preference or A/B testing. It is pivotal to keep one version as the control and to build the second experience by changing pieces of that control, starting with slight changes and moving to larger ones.

Keep in mind, many users, especially those not accustomed to the development process, will not be able to get over small things like placeholder text or a container being open when it should be collapsed. Plan accordingly.

Assessment

Assessment should be done throughout the design and research lifecycle. The point is to constantly observe and note how users interact with the tests and how they react, fine-tuning as you go like a radio antenna. Satisfaction polls or questionnaires are great for performing assessment.

Validation

Validation is for any test that focuses on quantitative results, such as time to learn, time to complete a task, or completion rates.
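
As a rough illustration of those quantitative results, the sketch below summarizes a set of sessions into a completion rate and a median time on task. It assumes each session is recorded as a (completed, seconds) pair; that record shape and the name summarize_sessions are assumptions for the example, not a prescribed format.

```python
from statistics import median


def summarize_sessions(sessions):
    """Summarize (completed, seconds) pairs from a validation test.

    Time on task is reported only for participants who completed the task,
    so that abandoned attempts do not distort the median.
    """
    attempted = len(sessions)
    completed_times = [seconds for done, seconds in sessions if done]
    return {
        "completion_rate": len(completed_times) / attempted if attempted else 0.0,
        "median_seconds": median(completed_times) if completed_times else None,
    }


# Hypothetical sessions: three completions and one abandoned attempt.
summary = summarize_sessions([(True, 40.0), (True, 60.0), (False, 120.0), (True, 50.0)])
print(summary)
```

The median is used here instead of the mean because a single very slow participant can pull an average far away from the typical experience.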

05 Usability Test Types

In the federal context, when anonymity requirements rule out interviews, we need to be flexible!

Concept Test

Rubin and Chisnell call out in the Handbook of Usability Testing that testing too late in the process is one of the most expensive mistakes a team can make. As early in the development process as you can, show a small group of the intended users a sketch, a rough wireframe, or a diagram, and listen.

Survey

Surveys define the scope. You are not learning the deep why, though a survey does build towards your hypothesis. You are looking for patterns and for the impact the proposed feature or enhancement would have across a larger group.

A/B / Preference Test

You have two versions and you need to know which one actually performs better. Run this when the design is mature enough that both versions could ship, and the difference between them is specific and isolated.

Questionnaire

A questionnaire is a survey you give before or after a test session to capture context about the user or their reaction to what they just did. Pre-session questionnaires establish baseline knowledge and role. Post-session questionnaires capture satisfaction and impressions while the experience is fresh. They are a wrapper around a test, not a test themselves. This is the perfect way to gather insights on a usability test.

End to End Usability Test

This is the full picture: a participant completes an entire workflow from start to finish, not just an isolated task. You run this when you need to know whether the product holds together as a system: whether handoffs between steps work, whether users lose the thread, whether the cumulative experience is coherent. It is the most expensive test to run but also the most effective.

Not a usability test...

Poll

A poll is a single question and pulse check. It is not a usability test. Use it when you need a quick directional read.

06 Designing the Test

Your test should come with some form of a user guide to help users through the journey, especially if you cannot be present with them. Consider that they may be multitasking, distracted, or dealing with images completely foreign to their tenure with a legacy system.

In your documentation, using “realistic” text can have its disadvantages as well because it can cause a user to fixate on inconsequential details.

Always vet your test with your quality engineer, and pass it to your internal team leadership. They are the other players on your team, highly skilled in finding bugs and details. When it goes to the client, it needs to be clear and easy to understand for any kind of user.

Have a control in the back of your mind, and to validate that control, use different tests during different stages. The control you start with should not be the one you end with; otherwise, you have probably not tested correctly!

Variant changes should be labeled or called out as such within your testing materials, such as TEST A VERTICAL LAYOUT, and TEST B HORIZONTAL LAYOUT. Whatever you think is obvious to a user…it may only be obvious to you.

07 Measuring Success

Testing is complete when time and confidence have reached a balance our federal client is satisfied with.

The Confidence Factor: Defining Done

Did 100% of participants successfully complete the task presented to them in the test? A task that some users cannot complete means the work is not ready. Testers represent a small percentage of the population of users, so that number will be further amplified upon release.
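
The amplification effect is simple arithmetic. The sketch below is a naive linear projection, assuming the testers are representative of the full user base; the function name and the numbers are hypothetical, and with small samples the true figure could be far higher or lower.

```python
def projected_blocked_users(testers: int, failures: int, user_population: int) -> int:
    """Scale the observed failure share up to the full user population."""
    if testers <= 0:
        raise ValueError("testers must be a positive number")
    return round(failures * user_population / testers)


# Hypothetical: 1 of 10 testers could not finish, and 50,000 people will use the product.
print(projected_blocked_users(testers=10, failures=1, user_population=50_000))  # 5000
```

One stuck participant in a ten-person test looks minor in the session notes, but under this assumption it stands for thousands of people who cannot complete the task after release.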

Do the users like the solution? This can be determined by a Likert-style scale for agreement (best used when there are multiple questions around the same topic), or by a Level of Satisfaction questionnaire "folded in" after a usability test.

By the way, I ‘Strongly Agree’ that Chocolate ice cream is satisfying.

In the federal space, we’ve found that a hybrid of the above (like a simplified Likert-style scale) has worked best in getting results from overworked users.

Standard Likert scales can run up to ten levels, but federal workers respond significantly better to a three-level scale: Dislike, Neutral, and Like, with an optional comment field.
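
Tallying a three-level scale takes very little machinery. The following is a minimal sketch, assuming responses arrive as a plain list of the three labels; tally_responses is a hypothetical helper, and comment fields would be reviewed separately by a person.

```python
from collections import Counter

SCALE = ("Dislike", "Neutral", "Like")


def tally_responses(responses):
    """Return the share of responses at each level of the three-level scale."""
    counts = Counter(responses)
    unknown = set(counts) - set(SCALE)
    if unknown:
        raise ValueError(f"Unexpected responses: {sorted(unknown)}")
    total = sum(counts.values())
    return {level: (counts[level] / total if total else 0.0) for level in SCALE}


# Hypothetical responses from four participants.
shares = tally_responses(["Like", "Like", "Like", "Neutral"])
print(shares)
```

The "Like" share is the number to compare against whatever satisfaction target the team has set.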

Our advice is to keep it simple. Past testing results show that, within testing materials, users engage significantly less after the third “problem” presented, or they start repeating their qualitative feedback.

Meeting Your Goals Is Never Easy

What makes the federal context distinctive is that federal systems touch enormous populations with high visibility. Raise it to your leads for an extension if there is a genuine lack of confidence in the usability of what is about to ship.

If testing has gone swimmingly well, of course, the chance of you having to come back in 6 months to “fix it” is significantly reduced, so fret not if you’ve gotten all the way up to this part of the article!

Be Wrong Sometimes

Yes, we hear this all the time in design blog articles and at lectures. But you know what I am talking about! When your UT shows you that you went in the wrong direction, that's okay because the product has not come out yet and there's still time to start up the compass again.

Write a Good Task

Tasks are the soccer ball of a usability test. A poorly written task produces misleading results, and the game just can't go on. Include images and accurate test steps, treating them like User Acceptance Testing criteria. And don’t forget to make the task realistic to set the field for the user.