Last week, I introduced critical AI literacy as an essential element for teaching and learning in the current moment, arguing that generative AI is fundamentally a literacy technology. I had intended to spend this post unpacking the term “critical” in “critical AI literacy,” but in the meantime, I attended two webinars that got me thinking about additional arguments against utilizing so-called AI text detectors in the classroom.
The most prevalent arguments against AI text detection that I have heard are: 1) detectors alone do not teach students about effective and ethical source use, which is genre-, discipline-, and even culture-specific; 2) they set up an adversarial relationship with students, substituting policing for pedagogy; 3) they have unacceptably high false positive rates; and 4) they have high false negative rates, too, which may or may not matter depending on the points raised in #1. (For a deeper, academic take on this topic, see Chris Anson’s 2021 article on AI and the social construction of authorship.)
These arguments featured prominently in a webinar on the impact of AI on higher education hosted by the American Association of Colleges & Universities (AAC&U) on Wednesday. Eddie Watson, an Associate Vice President at AAC&U, specifically cautioned higher education institutions against creating academic dishonesty policies based on these AI detection platforms, mainly because of the false negatives. He also cited a survey in which students said they aren’t as worried about being caught using generative AI as they are about being falsely accused of doing so—a telltale sign of the adversarial relationship these platforms create between students and faculty. (I don’t have a citation for this survey. If you do, please drop it in the comments.)
Then, on Thursday, I attended an NSF-hosted distinguished lecture by Dr. Rebecca Willett, Professor of Statistics and Computer Science at the University of Chicago. Her argument, in brief, is that there is still ample room for additional, foundational research in machine learning (ML). In fact, she argued that such foundational research must precede many, if not most, ML applications, especially in the sciences. Her analogy? Building ML applications without fundamental knowledge of math, statistics, and computer science is like trying to build biotech without a fundamental understanding of biology.
What struck me in particular about her argument is that the probabilistic nature of ML, including large language models (LLMs), leaves these systems open to error, and therefore potentially to bad science. A model that is 90% accurate on average can still fail systematically on rare but meaningful events, because those events barely register in the average. (Hurricanes were her example.) Furthermore, current ML models are just not that good at estimating their own probability of being (in)correct. Put differently, and to paraphrase Dr. Willett, machine learning applications produce plausible, but not necessarily trustworthy, results. For Dr. Willett, ML researchers need to engage in more foundational research to better quantify uncertainty, which should lead to more trustworthy results.
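To make that concrete, here is a toy sketch in Python (with numbers I made up, not Dr. Willett’s) of how a model can look impressively accurate on average while missing every rare event that actually matters:

```python
# Toy illustration: high average accuracy can hide total failure on rare events.
# Hypothetical data: 1,000 weather observations, only 20 of which are hurricanes.
observations = ["hurricane"] * 20 + ["no_hurricane"] * 980

# A useless "model" that always predicts the common case.
def naive_model(observation):
    return "no_hurricane"

predictions = [naive_model(obs) for obs in observations]

accuracy = sum(p == o for p, o in zip(predictions, observations)) / len(observations)
hurricanes_caught = sum(
    o == "hurricane" and p == "hurricane" for p, o in zip(predictions, observations)
)

print(f"Average accuracy: {accuracy:.1%}")                # 98.0% -- sounds great
print(f"Hurricanes detected: {hurricanes_caught} of 20")  # 0 -- scientifically useless
```

The average hides exactly the cases a scientist most needs the model to get right.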
All of this got me thinking again about AI text detectors, which are built on ML foundations. If LLMs produce plausible, but not necessarily trustworthy, results, how can we expect detectors to produce trustworthy results? Take, for instance, GPTZero, whose detection model draws not only on a dataset of student texts and web search, but also on concepts like burstiness and perplexity, which are themselves algorithmic measures of, respectively, stylistic variety across a text and the statistical fit between a language model’s predictions and a given sample.
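To get a rough feel for what those terms mean, here is a minimal Python sketch of perplexity and burstiness, assuming the Hugging Face transformers library and GPT-2 as a stand-in scoring model. This is emphatically not GPTZero’s actual implementation, which is proprietary and more sophisticated; it only illustrates the underlying concepts:

```python
# Illustrative sketch of "perplexity" and "burstiness" using GPT-2 as a scorer.
# Not GPTZero's implementation; just the underlying concepts.
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How 'surprised' the model is by the text; lower means more predictable."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # average negative log-likelihood per token
    return math.exp(loss.item())

def burstiness(sentences: list[str]) -> float:
    """Spread of perplexity across sentences; human prose tends to vary more."""
    scores = [perplexity(s) for s in sentences]
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5

sentences = [
    "The committee will meet on Tuesday to review the proposal.",
    "Honestly, I have no idea why the cat insists on sleeping in the sink.",
]
print([round(perplexity(s), 1) for s in sentences])
print(round(burstiness(sentences), 1))
```

The detector’s wager is that machine-generated text tends toward low perplexity and low burstiness, but both are statistical estimates produced by a model, not certainties about authorship.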
If, as Dr. Willett argues, ML applications still cannot effectively quantify the level of certainty in their output, then surely GPTZero and the like cannot either. False positives and false negatives are, therefore, symptoms of a deeper, more fundamental gap in knowledge about how to produce not just plausible but trustworthy results.
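Even setting uncertainty quantification aside, a little back-of-the-envelope arithmetic shows why those error rates matter at classroom scale. The numbers below are hypothetical, not any vendor’s published figures:

```python
# Hypothetical illustration: even a "small" false positive rate adds up quickly
# when a detector is run across every submission in a program or institution.
false_positive_rate = 0.01   # assume 1% of human-written essays get flagged as AI
essays_per_semester = 600    # e.g., 200 students submitting 3 essays each

expected_false_accusations = false_positive_rate * essays_per_semester
print(f"Expected wrongly flagged essays per semester: {expected_false_accusations:.0f}")  # ~6
```

Half a dozen students wrongly accused each semester is not an edge case; it is a predictable outcome of the statistics.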
Although I would prefer that instructors never use these tools because of the ethical problems they pose for good pedagogy, I would at least hope they would wait to adopt them until we have clear evidence that their results come with well-quantified certainty approaching 100%.
This theoretical argument does bring me back to critical AI literacy. My friend suggested last week that statistical knowledge is a fundamental part of critical AI literacy, and I agree. We don’t need to be mathematicians, statisticians, or computer scientists (I am trained in humanistic and social scientific education scholarship), but we do need enough statistical knowledge to view LLMs with a healthy dose of skepticism. This does not mean we must do away with generative AI altogether in educational settings, but it does mean we should temper our expectations about its trustworthiness, and help our students do so as well. If you know of any good resources on statistical literacy aimed at non-mathematical types, send me a message or drop a link in the comments.