Why It's Important:  Harms from Generative Artificial Intelligence (GenAI) based systems, such as AI psychosis, the exacerbation of echo chambers, and the spread of misinformation, are increasingly coming to light. Researchers are investigating ways to assess these harms through robust evaluation techniques, which usually involve assessing GenAI models on benchmark evaluation datasets to gauge the potential harms a system could cause or perpetuate. Recent work has shown the limitations of a narrow focus on model evaluations, which fail to incorporate end users' experiences as part of system evaluations. To address this gap, researchers have recommended including feedback from the individuals who use these systems as part of holistic system evaluations. Our work builds on such recommendations to study and offer an approach for including system users' voices in evaluating the impact of GenAI-based systems in high-stakes domains.


Our Approach:  Our research goal is to develop a framework that includes system users' experiences as part of impact evaluations of AI-based systems. We draw on the principles of social audits (community-led evaluations of government projects, laws, and policies that aim to ensure the transparency and accountability of government programs) to unpack a method for participatory evaluations of AI-based systems.

To conduct our study, we will collaborate with Noora Health, a public health non-profit organization that works with family caregivers from underserved communities to create awareness around caregiving practices. Together, we will co-design a GenAI-based chatbot to respond to community members' queries on maternal and child health in India. Through log-data analysis and semi-structured interviews, we will study how community members appropriate the chatbot's recommendations in their daily lives. We will also conduct semi-structured interviews with village-level administrative officials and elected representatives in India to understand how they currently conduct social audits of government programs to ensure public accountability, and how these practices might be extended to the evaluation of AI-based programs.

Through our proposed framework, we hope to build on and contribute to the science of evaluating GenAI-based systems within the field of Human-Computer Interaction.