Rethinking Statistical Thresholds and the Rise of Small Data
Have you ever looked down a list of ethnicities in a survey and realized that you weren’t represented, that you literally had to check “Other”? If you’re a member of an Indigenous nation, you may know the feeling. According to Krystal Tsosie, a Navajo geneticist and bioethics scholar at Vanderbilt University, “We either occupy some unknown place in the category of ‘Other,’ or in biological studies we’re not deemed to have enough statistical power to detect associations.”
This state of missingness, or data erasure, is a blind spot in many statistical approaches to datasets, according to Oliver Bear Don’t Walk IV, a member of the Apsáalooke Nation and a PhD candidate in Biomedical Informatics at Columbia University. It’s also something of an anachronism if we’re supposedly living in the Big Data Era.
The implications aren’t pretty. “Indigenous people get left out because we’re not considered a large enough population of interest, of impact, or significant threshold to warrant study,” Tsosie says. “That’s particularly unfortunate when you consider that our communities are small in size for a reason, having to do with structural racism and colonial burdens that have been imposed upon us.”
Rethinking the Threshold
An unnaturally high statistical threshold is something that Indigenous data scientists are beginning to rethink. Specifically, they are revisiting some of the implicit hypotheses that underlie population genetics but that don’t apply to Indigenous peoples. One example is the Hardy-Weinberg principle, which states that allele and genotype frequencies will remain constant in a population from generation to generation in the absence of other evolutionary influences, an idealization that assumes a large, randomly mating population.
“That’s why we both advocate for small data studies, because the populations you may be working with in Indigenous peoples are often just a few hundred individuals,” Tsosie notes. “And that means that most in-field standard statistical approaches do not apply.”
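To make the Hardy-Weinberg principle mentioned above concrete, here is a minimal Python sketch. It is an illustration only, not a method attributed to either scientist: the function names, the allele frequency of 0.05, and the population size of 200 are all hypothetical choices. It computes the expected genotype frequencies p², 2pq, and q² for one gene with two alleles, and shows how small the expected counts become in a population of a few hundred.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies (AA, Aa, aa) when allele A has frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p  # frequency of the other allele, a
    return (p * p, 2.0 * p * q, q * q)


def allele_frequency(genotype_freqs):
    """Recover the frequency of allele A from (AA, Aa, aa) genotype frequencies."""
    hom_a, het, _hom_b = genotype_freqs
    # Each AA carrier contributes two copies of A, each Aa carrier one copy.
    return hom_a + het / 2.0


def expected_counts(p, n):
    """Expected number of individuals per genotype in a sample of n people."""
    return tuple(n * f for f in hardy_weinberg(p))


if __name__ == "__main__":
    p = 0.05  # hypothetical frequency of a rare allele
    n = 200   # hypothetical community of a few hundred individuals

    freqs = hardy_weinberg(p)
    print("genotype frequencies (AA, Aa, aa):", freqs)

    # Under the idealized assumptions, the allele frequency is unchanged
    # in the next generation.
    print("allele frequency next generation:", allele_frequency(freqs))

    # Expected number of AA individuals is well below one person.
    print("expected counts (AA, Aa, aa):", expected_counts(p, n))
```

With these assumed numbers, the expected count of individuals carrying two copies of the rare allele is about half a person. A common rule of thumb for the chi-square test, often used to check Hardy-Weinberg proportions, asks for expected counts of at least five per category, which is one simple way to see why standard approaches can break down at this scale.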
Small data approaches also take in other drivers that relate to outcomes, like structural barriers to health. For example, it may not be possible to wash your hands often as a preventive measure against transmission of SARS-CoV-2 (the virus that causes COVID-19) if the state you live in has usurped your water rights. Or it may not be possible to keep viral transmission rates low if you must drive hundreds of miles to seek preventive health care.
“If we ignore other covariates related to health, especially when they are so prevalent in Indigenous communities, we’re also ignoring key factors that drive health inequities,” Tsosie says. “Small data means looking more holistically at health, not just looking at the genome and what’s in the clinical record, but also looking at environmental, cultural, and traditional factors. If certain parts of our data are erased, it’s critical to rethink the statistics that are presented in front of us.”
Living in the Future
“As science continues to outpace ethics, it’s vitally important that we help ethics to catch up,” Bear Don’t Walk says, “even if many conversations in the computer science domain have gone in circles.” He raises the example of Princeton professor Ruha Benjamin and her insight, in the book Race After Technology, that Black people in America live in the future. “Dr. Benjamin discusses this in the context of how Black people have been a harbinger of inequities encoded in technology. Without an ethical grounding for data, in other words, technology can go awry in ways the dominant white cultures may not experience for years to come,” he adds.
“For many populations ethical issues concerning data are there to examine,” Bear Don’t Walk says, “and we believe we can use genomics, biomedical informatics, data science, computer science, and other disciplines to shore up the ways we think about them.”
In small data studies, no one should expect sweeping, universal rules for ethical data use, because different communities can conceptualize the idea “This is a respectful way to work with my data” in so many different ways. The hard truth, as both scientists point out, is that however much we want a single answer to this question, the answer may be different for each community we study. As in many other walks of life, there is no one-size-fits-all solution.
Krystal Tsosie is a Navajo geneticist and bioethics scholar at Vanderbilt University. She is a co-founder of the Native BioData Consortium, a nonprofit research institution that aims to create an Indigenous-led biological and data repository. Along with Matt Anderson, Assistant Professor at The Ohio State University, and other Indigenous academics, she envisioned and organized IndigiData, a four-day remote workshop that took place for the first time in June 2021.
Oliver J. Bear Don’t Walk IV is a member of the Apsáalooke Nation and a PhD candidate in Biomedical Informatics at Columbia University. He was a key organizer of the first IndigiData workshop.