Working with participants in AI data collections: Drawing from User Research and Communication Design
Authored with David Mondello and Karen Chappell

AI products and services require large sets of data to power their predictions. Homogeneous datasets, no matter how large, will have a difficult time handling the variations found across real-world examples. As a result, it is critical to construct training datasets that capture the diversity of the real world. This often means collecting potentially sensitive details about people, such as demographic (e.g., ancestry, socioeconomic status, gender identity) or biometric (e.g., skin tone, speech samples) information, as part of model training and evaluation.
While details such as skin tone or socioeconomic status may help improve the performance of certain AI systems, gaining access to this information requires individuals to self-report potentially sensitive aspects of their identities during the data collection process. Understandably, some participants may question why they need to share such information in the first place. Others may have concerns about the privacy protections in place to safeguard the details they choose to share.
Drawing on established practices from the disciplines of user research and communication design, the following section details some of the key information to surface with participants throughout the data collection process, from recruitment all the way through the session debrief. Providing session moderators (who are not typically user researchers) with this information, and packaging it in a clear and consumable way, can help create comfortable experiences for participants involved in data collection efforts.
Artifacts
Recruiting collateral (e.g., posters, flyers, messages)
— Purpose of the data collection session (e.g., to improve the accuracy of automatic speech recognition systems so that they work well for everyone)
— The nature of the session (in-person, remote)
— Specifics of compensation
— The tasks that are involved and the equipment that participants will be using
— Duration of the session
— Any applicable eligibility criteria
— Link to find detailed information about the project
Consent form
— How the data will be used to improve products
— Any confidentiality clauses (e.g., not to photograph or share information about the project with others)
— Grant of rights to use their data
— The specific data being collected
— Voluntary nature of participation and the right to withdraw at any time
Moderator FAQ / Reference Sheet
— Purpose of the data collection session
— Duration of the session
— Incentive to the participant for participating in the session
— Tasks participants will be doing during the session (e.g., reading sentences)
— The types of data (e.g., speech sample, photo, video) being collected
— Who will have access to the data (e.g., about 20 security specialists within X company)
— Where the data is being stored (e.g., on encrypted hard drives and then uploaded to secure storage)
— The data retention period (e.g., up to 2 years and then deleted)
— Any risks (physical, social, emotional) of participating in the data collection
— Whether the session is video/audio recorded
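For teams that keep the reference sheet in a structured form, the items above could be captured in a small record and checked for gaps before sessions begin. The following is only a minimal sketch in Python: the class name `ModeratorFaqSheet`, its field names, and the helpers `missing_items` and `deletion_deadline` are hypothetical and simply mirror the list above. They do not describe any tool used by the authors.

```python
from dataclasses import dataclass, fields
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ModeratorFaqSheet:
    """Hypothetical record with one field per FAQ item listed above."""
    purpose: str = ""
    duration_minutes: int = 0
    incentive: str = ""
    tasks: str = ""
    data_types: str = ""
    data_access: str = ""
    storage: str = ""
    retention_days: int = 0
    risks: str = ""                  # write "none known" rather than leaving blank
    recorded: Optional[bool] = None  # None means the answer has not been filled in


def missing_items(sheet: ModeratorFaqSheet) -> List[str]:
    """Return the names of FAQ items that have not been filled in yet."""
    missing = []
    for f in fields(sheet):
        value = getattr(sheet, f.name)
        # False is a valid answer for `recorded`, so it must not count as missing.
        is_unset_number = value == 0 and not isinstance(value, bool)
        if value is None or value == "" or is_unset_number:
            missing.append(f.name)
    return missing


def deletion_deadline(collected_on: date, retention_days: int) -> date:
    """Date by which the data should be deleted, given the retention period."""
    return collected_on + timedelta(days=retention_days)
```

A moderator lead could run `missing_items` on a draft sheet before piloting, and use `deletion_deadline` to answer a participant's question about when their data will be deleted.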
Moderator script
— Introductions (e.g., “My name is [X], and I will moderate your session today. I help team X collect human data to improve their products and services.”)
— Prompt to answer any questions that participants may have after they sign the consent form
— An overview of the session objective and how long the session is expected to last
— Clear instructions on the tasks participants will be completing. Ideally, ask participants to repeat back the instructions to gauge their comprehension, and provide practice trials
— An overview of what participants will be using during the session (e.g., a high-fidelity prototype, a prototype with known glitches)
— Structure the session so that the moderator can give the participant undivided attention (rather than doing a lot of data entry)
— Answer any remaining questions during the debrief
— Describe where participants can go if they have additional questions about the collection at a later date
— Thank participants for their time and contributions (e.g., “Thank you so much for taking the time to help us make products that work for everyone.”)
Tips and tricks for moderators
So much of the participant experience during data collections is shaped by the moderator leading the session. It is therefore a vitally important role, and one that can really make a difference in helping to address potential concerns or questions that participants might have. Below are a few things to keep in mind when moderating these sessions to comfortably guide participants through the data collection process.
1. Be calm and confident
2. Engage in the conversation (e.g., avoid side conversations with others or looking at screens)
3. Review the FAQs ahead of time to be able to promptly answer questions from participants
4. Pilot a few sessions and adjust the protocol based on learnings
5. Be mindful of your body posture and language (e.g., do not invade participants' personal space or express frustration)
6. Check in periodically to see how participants are doing and whether they have any questions or want to take a break
7. Accommodate those who are differently abled (e.g., modifying a task for those with physical impairments)
8. Provide incentives as planned to participants even if the session could not be completed (e.g., due to system failure)
9. Procure and test the devices that will be used for the collection and for the secure storage and transfer of data
10. Be gracious, thank participants for their time, and emphasize how their contribution fits into the larger picture