Surveying Rhetorical Readability on Amazon Mechanical Turk
This research was supported by generous funding from the Charles L. Cahill Grant through the University of North Carolina Wilmington and the Stolle Award through the Saint Louis University College of Arts and Sciences.
This study was approved by the University of North Carolina Institutional Review Board (#21-0211).
“GPT-3 Can Summarize Books as Well as a Human Can,” boasts the title of a 2021 news article from the technology website Boing Boing (Frauenfelder, 2021). This statement should be unsurprising; OpenAI’s Generative Pre-trained Transformer-3 (GPT-3) Natural Language Processing (NLP) system has become a bona fide algorithmic celebrity, breathlessly documented by hundreds of media outlets for its wondrous capacity to, as reported in The New York Times, blog, argue, generate tweets, pen poetry, summarize email, answer trivia questions, translate languages, and even write its own computer programs (Metz, 2020). The education website EduRef brought GPT-3 directly into the ambit of higher education by pitting it against human college students in an anonymous multi-subject essay contest, with unsettlingly mixed results: it earned about a C average, besting some human participants in history, law, and research methods, and only wiping out in creative writing (“What Grades,” 2021).
Of course, with its promise comes the menace of replacement anxiety that surrounded GPT-3’s automated chess-playing forebear, the Mechanical Turk. “How Do You Know a Human Wrote This?”, admonishes the title of Farhad Manjoo’s (2020) piece in The New York Times. This question is answered and the specter revealed in a separate piece from The Guardian: “A Robot Wrote This Entire Article,” states its doomsaying title, “Are You Scared Yet, Human?” (”A Robot Wrote,” 2020).
Still, GPT-3’s occult ability to summarize books should be contextualized. The process divides full texts into smaller sections that the software summarizes. Then, in a hybrid arrangement akin to Amazon’s global crowdsourcing marketplace Mechanical Turk (MTurk),1 these sub-summaries are evaluated by humans and assembled by GPT-3 into an aggregate summary-of-summaries for the whole work. In this way, the “human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves” (Wu et al., 2021, p. 1). This seems a bit odd; although OpenAI ambiguously asserts that the process “achieves a 6/7 rating (similar to the average human-written summary) from humans who have read the book 5% of the time and a 5/7 rating 15% of the time,” it is not clear how its summaries might account for large-scale themes that develop throughout the work across sections or intertextual literary and cultural concepts (“Summarizing Books,” 2021).
Despite the trepidation and caveats, NLP summary is a significant aspiration for large tech companies, and it is a knotty and labor-intensive task that resists brute-force methods. Google’s PEGASUS model, described in Zhang et al. (2020), seeks to achieve human-quality abstractive text summarization, which is “one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation” (Liu & Zhao, 2020). Microsoft and the graphics hardware company Nvidia have partnered to construct a generative language model with more than triple the parameters of GPT-3, giving it “unmatched accuracy” in areas including reading comprehension, commonsense reasoning, and word sense disambiguation (Alvi, 2021).
Crucially, these companies are expanding their summary ambitions beyond written text, which makes sense in a world of proliferating multimedia data. The third version of Google’s Inception model can describe images in “natural-sounding English phrases” after fine-tuning itself on a massive corpus of human-generated captions (Shallue, 2016), and Microsoft’s Azure AI “can generate captions for images that are, in many cases, more accurate than the descriptions people write” (Roach, 2020). In an interview with the technology website Engadget, Eric Boyd of Azure AI calls this one of the hardest problems in AI, because it “represents not only understanding the objects in a scene, but how they’re interacting, and how to describe them” (as quoted in Hardawar, 2020). OpenAI’s DALL•E inverts such image summarization processes by using GPT-3 to convert textual descriptions into algorithmically generated illustrations. Additionally, Google’s Chrome browser has made steps toward audio and video summary with Live Caption, which automatically produces real-time captions for online media. Although captions are not quite summaries, neither are they mere transcriptions, as Dr. Sean Zdenek discussed with us; effective captions selectively convey context and meaning.
As we addressed in the first study from the Following Mechanical Turks research project, hybrid human/machine summarization is a situated, multifaceted task bound to the complex concept of readability. Readability, or rather its lack, seems to haunt attempts at automated summary; the panel of expert graders for the essay contest mentioned above, who were unaware if they were scoring a text from a human or an algorithm, remarked that GPT-3’s work was too vague, blunt, and awkward; that it did not display critical thinking, clarity, or varied sentence structure; that it was clichéd, oddly personal(!), and bland; it did not incorporate the five senses and told rather than showed (“What Grades,” 2021). To be sure, human summaries also can display these issues,2 which is precisely what makes automation’s subtle differences uncanny.
This palpable strangeness may reflect how automated NLP summary systems often implicitly frame readability as a feature of the text. Visible in claims that a language model can perform as well as or better than humans is an embedded notion of what is objectively in the document that can be described more or less accurately. Such a perspective allows models to disassemble and analyze the text as a combination of internal features, but readability in part may arise from a broader web of human significations. We might consider, for example, separate lines of NLP research focused on reading human language for other reasons. Eyigoz et al.’s (2020) “Linguistic Markers Predict Onset of Alzheimer’s Disease” demonstrates that the onset of Alzheimer’s disease can be reliably predicted years in advance from an automated evaluation of a written summary. This study asked participants with no current diagnosis or symptoms of Alzheimer’s disease to describe what is known in language research as the “cookie theft picture,” the standard diagnostic tool shown in Figure 1.
The Cookie Theft Picture
Note. From The Assessment of Aphasia and Related Disorders, by Harold Goodglass and Edith Kaplan, 1983.
An AI analysis predicted with approximately 75 percent accuracy which respondents would develop Alzheimer’s disease based on repetitive word usage, mechanical errors, and “telegraphic speech,” language consisting mainly of content words and largely missing function words (Eyigoz et al., 2020, p. 5). This neurolinguistic finding may align with others such as the established presence of handwriting abnormalities in people with Parkinson’s disease (Thomas et al., 2017).
There is in these biological applications of human intelligibility and computational forays into content summary a unifying reduction: the undergirding assumption that information is a thing unbound by form, that it is an ur-substance open for shared processing by different kinds of readers. This presumption, influenced by information science, appears in similar forms such as the now accepted omnipresence of metadata and the faded dream of the semantic web: a structured digital knowledge network intelligible to both humans and machines.3 Forms of readability are complicated. They collapse physiology and intellect, and they invoke human aspects beyond semantic content. This aligns with the project’s first study, which discusses how on MTurk printed receipts, situations, moods, and objects—their boundaries and degree of damage, the point at which they tilt from thing to trash—were variably readable. The first study used aggregate text mining to identify readability as an aspect of what it called telling within MTurk: humans acting as sensors to make phenomena—including themselves—intelligible to non-human readers. The current study advances from this base by transitioning from the language of MTurk’s Human Intelligence Task (HIT) job posts to their effects on and entanglements with both workers (turkers) and requesters. In so doing, this study seeks to address what readability is on MTurk and how it functions. It does this by participating in MTurk directly, collating 2,200 surveys in multiple variations posted to the service that asked respondents to produce and evaluate readable summaries of four different types of content and to provide their perceptions of these tasks. This study’s findings identify ways that readability emerges on MTurk through hybrid human and non-human reciprocity, identifying metadiscourse, enthymeme, audience targeting, and incentives as meaningful rhetorical elements.
The first study distinguished readability and trauma as two principal components of telling. In the current study, we have opted to explore readability further but not trauma due to the sensitivities of treating it as a commercial or scholarly topic, especially when engaged in human subjects research. Our discussion with Dr. Lauren Cagle underscored the ethical dimensions of engaging participants as partners when addressing personal content. The structure of MTurk may hinder such arrangements; it was, after all, explicitly built with the goal of reducing humans to machinic components. News articles such as Dhruv Mehrotra’s “Horror Stories From Inside Amazon’s Mechanical Turk” point out that academic practice is not blameless here. “Yet as benign as academic research may seem,” states Mehrotra, discussing a survey for turkers administered by the technology website Gizmodo, “12 percent of respondents claimed that the worst or strangest experience on Mechanical Turk was due to what can only be described as uncomfortable personal data requests wherein the worker reported feeling emotionally traumatized by an academic survey.” One respondent in this article states of the upsetting feelings that emerged when completing an academic personal information survey on MTurk: “Someone paid me like 50 cents to recall the most painful memory of my life. It fucked me up for the entire day.” Although trauma was a large component of the telling we identified on MTurk through archival text mining in the first study—humans indeed seem uniquely qualified to articulate the ways that humans suffer—it is not an area we felt equipped to engage through human subjects methods.
All research has risks, but examining readability largely avoids these particular concerns, and the complex concept seems well suited to direct engagement with MTurk. As discussed in the project’s first study, readability pertains to multiple forms of media and situations. It collapses physiological and intellectual acts that would be separated in other contexts, and it calls for seemingly characteristic human situational knowledge. As a topic of human subjects research, it readily transitions the project’s focus from HITs’ language to their rhetorical effects on turkers and requesters. Our focus thus is not turker demographics or lived experiences (Ipeirotis, 2010a; Ipeirotis, 2010b; Martin et al., 2014; Ross et al., 2010) but how MTurk generates readability: how turkers tell information through composition in response to persuasion.
To place this research within a context of rhetoric theory, we here participate in unresolved disciplinary conversations about the contemporary status of literacy in an era of digital information networks—what things are readable and what makes them so. This burgeoning x-literacy—engaged through Walter Ong’s (1971, 1982) secondary orality, Gregory L. Ulmer’s (2003) electracy, and Jan Rune Holmevik’s (2012) inter/vention, as well as more general concepts such as digitality or information/media literacy—often suggests a progressive, symmetrical trajectory: information has circulated through symbol use, developing from spoken words through printed texts to digital media. This narrative demarcates three epochs—orality, literacy, and x-literacy—signified by characteristic inventions such as the polis, the printing press, and the computer network, as the following breakdown from Ulmer (2010) in Figure 2 reflects:
Ulmer’s breakdown of the aspects associated with three eras of literacy
Note. From “The Learning Screen,” by Gregory L. Ulmer, 2010, Networked: A (Networked_book) about (Networked_art), Turbulence, (https://turbulence.org/project/the-learning-screen/).
It is valuable to frame the current moment as part of this iterative development—as more literacy—because doing so produces serviceable analogical insights; however, this attempted reinscription also has relevant gaps that become visible when we juxtapose these eras’ base units and means of circulation, as shown in Table 1.
Base units and means of circulation for three eras of literacy
| | Orality | Literacy | X-literacy |
| --- | --- | --- | --- |
| Base Units | spoken word | … | … |
| Means of Circulation | agora | … | … |
Resolving x-literacy’s means of circulation seems feasible, but our attempts to engage, understand, and define the concept have been frustrated by the lack of shared agreement about what constitutes its base unit, analogous to the uttered phoneme or scripted alphabet. Multimedia would seem to be a candidate in keeping with literacy’s lineage, but the concept is neither thoroughly defined nor sufficiently comprehensive, and, as overtures toward visual rhetoric have suggested, its interpretation protocols are not established. The binary bit or even the electron may be an alternative, and although these are intelligible to digital machines, humans would seem to be completely excluded. Computer code is an appealing choice because it bridges humans and digital machines, and certainly the claim that coding is the new literacy has been asserted as much as it has been debunked. In some ways, coding is akin to textual literacy, because it is a specialized skill that may create a new intellectual gap between haves and have-nots, or it may be something that is originally the province of a distinct group but ultimately disseminates into the larger population. However, coding is unlike print literacy because although code is readable by humans, it is only actionable by machines; humans can understand something about what computer code does, but they cannot respond to it as can its intended audience.
There may not be a clear concept that can complete the grid because of an assumption instantiated by Plato and vetted through Cartesian dualism and posthuman prostheticism: that literacy is for us, that humans are apex literate agents en- and de-coding information that depends upon our acts of interpretation for existence and circulation. This anthropocentric paradigm is valuable, but as our studies of MTurk suggest, it may not adequately account for the participation of non-humans in literacy tasks that are significant elements of our cognitive systems. Our existing constructions of x-literacy may be largely untenable and ill suited to the present moment because they neglect the radical significance of digital networks’ emergence. Our interrogation of readability within MTurk attempts to engage these issues directly by examining what is being read, how, and by whom.
To make this exploration practical, we solicited turker labor and feedback through posted HITs. After receiving approval for our study from the University of North Carolina Institutional Review Board (#21-0211), we posted several targeted surveys to MTurk organized into two sequential phases (see Appendix A and Appendix B). In the first phase, surveys composed through MTurk’s built-in survey creation tools gave respondents three prompts:
The first prompt asked respondents to produce a readable one-sentence summary of a provided piece of media content depicting a man helping a boy with his math homework: a 20-second video clip, a 20-second audio clip, an image, or a paragraph of text. We created all of the content so there were no intellectual property concerns, and it was derived from the same source: the audio clip was extracted from the video clip, the image was a still from the video clip, and the paragraph was a description of the video clip.
The second prompt asked respondents to rate the clarity of the survey’s instructions on a four-point scale and to provide one sentence of rationale for their ratings.
The third prompt asked respondents to rate the fairness of their compensation for completing the survey on a four-point scale and to provide one sentence of rationale for their ratings.
There were 32 versions of this survey. Twenty-four of them can be placed into four groups corresponding to the four types of media content. Each of these groups then may be further divided into two subgroups based on differences in the posts’ titles and descriptions:
The title and description of the first subgroup focused on the task the respondent was being asked to complete, for example: TITLE—Summarize a video clip; DESCRIPTION—Summarize a 20-second video clip in one sentence and answer two rating scale questions.
The title and description of the second subgroup focused on how long the task was expected to take, for example: TITLE—2–3 minute research survey; DESCRIPTION—Answer 3 brief questions (2–3 minutes).
All of the other language in the two versions was identical; only the titles and descriptions were different so that we could determine if these language changes correlated with differences in the summaries produced or the ratings of clarity and compensation fairness. (Titles and descriptions that emphasize either task or time are common on MTurk, but the effects of these different strategies have not been thoroughly studied in rhetoric.)
In order to examine how compensation rates affected responses, we posted these survey variations at different times and offered one of three different rewards: $1.00, $1.50, or $2.00. The language of the posts was identical; only the compensation rate changed. (All workers had the opportunity to complete all versions of the survey, preventing a scenario in which participants could be compensated differently for completing the same task.)
Posts at all compensation rates were open to all respondents, but for the eight versions at the $2.00 level, we also made additional requests that were limited to experienced respondents with a high submission acceptance rate by using MTurk’s built-in qualification settings. In accordance with Amazon’s guidance about using qualifications (Amazon Mechanical Turk, 2019) and third-party advice for academic MTurk requesters (“Tips for Academic Requesters,” 2012), these requests were restricted to respondents who had made at least 5,000 approved submissions with an acceptance rate above 98%.4
These combinations accounted for 24 versions of the survey: four different groups corresponding to the four types of media, each containing two subgroups separated by differences in titles and descriptions, each of which were compensated at three different amounts, as shown in Figure 3.
Diagram of the first phase main survey versions
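The factorial structure of these 24 main versions can be sketched programmatically. The labels below are illustrative stand-ins for the study's conditions, not its exact wording:

```python
from itertools import product

# Illustrative labels (not the study's exact wording) for the three
# factors that define the 24 main phase-one survey versions.
MEDIA = ["video clip", "audio clip", "image", "paragraph"]
FRAMING = ["task-focused", "time-focused"]  # title/description emphasis
REWARD = [1.00, 1.50, 2.00]                 # flat compensation in USD

# Cross the three factors: 4 media types x 2 framings x 3 rewards.
versions = [
    {"media": m, "framing": f, "reward": r}
    for m, f, r in product(MEDIA, FRAMING, REWARD)
]

print(len(versions))  # 24
```

Each media type thus appears in six versions (2 framings × 3 rewards), which, at five responses per version, yields the per-type counts reported below.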
The other eight versions of the survey were very similar to these, but they used bonus rewards in addition to a flat $1.00 compensation. (Bonuses are not required on MTurk, but they are common.) These posts were restricted to experienced respondents with a high acceptance rate, and the bonus amount a submission received was determined by the summary’s ranking in the project’s second phase (described below). Submissions in the top third of rankings received a $3.00 bonus, those ranking in the middle third received a $2.00 bonus, and those in the bottom third received a $1.00 bonus.5 Most of the surveys’ instructional language for these posts was the same as the other versions but with revised content explaining how bonuses were to be distributed. Titles and descriptions also were altered to indicate that the tasks offered bonuses.
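The bonus schedule can be expressed as a simple function of a summary's phase-two ranking. This is a minimal sketch of the tiering described above; the handling of boundary ranks between thirds is our assumption, as the study does not specify tie-breaking:

```python
def bonus_for_rank(rank: int, n_ranked: int) -> float:
    """Return the bonus (USD) for a summary ranked `rank` out of
    `n_ranked`, where rank 1 is most readable. Top third earns $3.00,
    middle third $2.00, bottom third $1.00. Boundary handling is an
    assumption, not the study's stated rule."""
    third = n_ranked / 3
    if rank <= third:
        return 3.00
    if rank <= 2 * third:
        return 2.00
    return 1.00
```

Under this sketch, a summary ranked 5th of 30 would receive the $3.00 tier, while one ranked 25th would receive $1.00, on top of the flat $1.00 base reward.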
We collected five responses for each of the 24 survey versions, resulting in 120 responses, and an additional five responses from experienced respondents with a high acceptance rate for each of the eight versions at the $2.00 compensation level, resulting in 40 more responses. Additionally, we collected five responses for the eight survey versions that used bonus payment, resulting in another 40 responses. This made the total number of responses collected in the first phase 200 (50 for each of the four media types).
All first phase surveys notified respondents that their responses were anonymous but the summaries they produced would appear in later tasks posted to MTurk. (This is because the summaries generated in the first phase were reprinted in second phase surveys to evaluate their readability.) We reviewed the summaries to eliminate from the pool any that could be construed as being threatening or obscene to prevent respondents in the study’s second phase from unintentionally encountering objectionable material, which would contravene MTurk policies. No such material appeared so no responses were eliminated. Summaries that were irrelevant or nonsensical were intentionally retained in the pool.
The primary goal of the second phase was for respondents to determine which of the summaries generated in the first phase were more readable. To achieve this, we posted to MTurk surveys we built in Qualtrics that asked respondents to complete three prompts:
The first prompt presented respondents with one of the four pieces of media content from the first phase and asked them to rank four provided one-sentence summaries of it from most to least readable. (These provided options came from the pool of summaries generated by respondents in the first phase and were algorithmically selected by Qualtrics so that they were mostly random but each appeared approximately 20 times for equivalent representation.)
The second prompt asked respondents to rate the clarity of the survey’s instructions on a four-point scale.
The third prompt asked respondents to rate the fairness of their compensation for completing the survey on a four-point scale.
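Qualtrics' selection algorithm is proprietary, but the balance described above, where assignment is mostly random yet each summary appears approximately 20 times with four distinct options per survey, can be sketched with a greedy randomized allocator. The function and its parameters are our own illustrative construction, not the actual Qualtrics mechanism:

```python
import random
from collections import Counter

def balanced_sets(summaries, appearances, k):
    """Allocate `summaries` into surveys of `k` distinct options so
    that each summary appears exactly `appearances` times. A sketch
    of Qualtrics-style balanced randomization: each round greedily
    picks the k summaries with the most unfilled slots, shuffling
    first so that ties break randomly."""
    remaining = Counter({s: appearances for s in summaries})
    n_surveys = len(summaries) * appearances // k
    surveys = []
    for _ in range(n_surveys):
        pool = list(remaining)
        random.shuffle(pool)                    # random tie-breaking
        pool.sort(key=lambda s: -remaining[s])  # most unfilled slots first
        picks = pool[:k]
        for s in picks:
            remaining[s] -= 1
            if remaining[s] == 0:
                del remaining[s]
        surveys.append(picks)
    return surveys
```

For one media type's pool of 50 summaries, 20 appearances each and four options per survey yields the 250 ranking surveys collected per version (50 × 20 ÷ 4 = 250).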
There were four versions of the second phase survey corresponding to the four different pieces of media content. The only differences among versions were the references to the piece of media content; all other language in the four versions was identical. We collected 250 responses for each of the four versions that were open to all respondents, for a total of 1,000. We also collected an additional 250 responses for each of the four versions that were restricted to experienced respondents with a high acceptance rate, which contributed another 1,000 responses for a total of 2,000 responses in the second phase. Respondents were paid $0.50 after submitting a valid completion code assigned by Qualtrics.
Respondents in both phases could complete all survey versions for which they were qualified but only one per version so that a single respondent was not overrepresented in the data pool. Respondent rewards were paid automatically through MTurk no later than seven days after they submitted a completed survey. Processed, anonymous survey data was stored in project spreadsheets for the first phase and the second phase. After both phases of data collection were complete, we were able to make comparisons and identify correlations to address our research questions.
Our findings on readability coalesce around two categories of texts, each with distinct rhetorical situations: the HITs we composed seeking survey replies and the summaries turkers produced in response. The summaries’ readability ratings as determined by turkers in the study’s second phase largely align with commonsense expectations: those that are efficient, nuanced, and grammatical appear near the top and those that are irrelevant, incomplete, or mechanically flawed cluster near the bottom. Yet within this predictable distribution are several threads that provide insight into readability on MTurk.
One salient finding involves summaries’ use of metadiscourse: critically reflective language that comments upon the text or the writer’s position within the rhetorical situation (Hyland, 2017, 2019). Rather than just summarizing the source material, this language signals the act of summary itself by communicating the author’s subjectivity, for example:
- “A child and man are talking and although much of what is said in [sic] unclear they refer to the number eighteen in a way that suggests they are discussing a math problem.”
- “It sounds like an adult and child doing a math problem.”
- “Seems a student is replying a math question to her teacher, however she seems uncertain if the answer is correct.”
- “A man (presumably a father) is helping a little boy (presumably his son) with his homework at a desk in their home, with a cardboard cutout of Garth Brooks in the background.”
This metadiscourse addresses both technical concerns, such as difficulties distinguishing speech from noise in the audio recording, and contextual ambiguity about the depicted scenario, such as what act is being described. It reflects the collapsing of physiological and intellectual acts identified in the previous study’s discussion of readability. This mingling appears to be borne out in turkers’ ratings of the compositional task’s clarity; metadiscourse that acknowledges uncertainty and expresses a subjective degree of confidence appears almost exclusively in summaries of the audio clip, which had the lowest rating of task clarity in the first phase, as Table 2 shows.
Average turker ratings of phase one’s task clarity and compensation fairness separated by content type
| | Task Clarity (out of 4) | Compensation Fairness (out of 4) |
Almost all audio clip summaries that use metadiscourse to contextualize source content were rated in the upper quintile of readability; however, summaries that exclusively contain metadiscourse without referencing source content cluster in the lowest quintile. There are examples of such summaries for all four media types, including:
- “THIS AUDIO CLIP CONTAINS MORE BACKROUND [sic] NOISE AND THE HUMEN’S [sic] VOICE IS NOT CLEAR”
- “THE PICTURE IS VERY CLEAR.”
- “video is very clear”
Metadiscourse’s bimodal distribution in the highest and lowest rating tiers suggests that overtly referencing the subjective act of summarizing is a productive component of readability on MTurk, but sufficient semantic content also must be included. That is to say, “Nice” is, in a sense, readable, because it can be deciphered and comprehended readily; however, it alone does not constitute a readable summary, because it conveys no subject matter. Readability signals an act of interpretation but is not limited to it.
A counterpart of metadiscourse’s explicit recognition of ambiguity is the summaries’ implicit response to enthymeme. This appears in mundane ways, such as the common assumptions that the two people depicted in the scenario are a father and his son or that the boy is completing math homework. The summaries that include what are, from our perspective as the texts’ producers, factual errors underscore that these are contextual beliefs rather than restatements of provided information. Some of these are minor misjudgments of the situation, such as the paragraph summary stating that “the man observes the boy finishing an exam.” Some discrepancies also occur because of a lack of information, such as the audio clip summary rated as the most readable: “A man is helping a young girl with a math problem.” Most informative, however, are summaries that make intuitive leaps based on context clues. For example, one of the math problems the boy is solving mentions the number 18, and so multiple summaries of the audio clip address cultural associations with that age in the USA. This is apparent in the summary that reads: “A girl is asking for a cigarette to a man and he asks if she is 18 years old and she says she is.” There is nothing in the provided content that addresses such issues; they are implications spawned from the incidental mentioning of the number 18.
These items reveal enthymematic conclusions based on necessarily incomplete premises. Situated cultural knowledge is implicitly being brought to bear. These summary acts are not merely mechanical consolidations of provided content. Although this recognition may seem somewhat prosaic, it is crucial because of MTurk’s intentional blurring of distinctions between human behavior and automated acts.
We might turn from turkers’ summaries to the HITs we composed by examining their success at the intended goal—persuading turkers to compose readable summaries—and turkers’ impressions of the tasks’ clarity and compensation fairness. The top five most readable summaries for each of the four media types show little difference between those produced in response to HITs that focused on communicating the requested task (11 out of 20) and those produced in response to HITs that foregrounded the expected time expenditure (9 out of 20). However, when task and time distinctions are set aside and these same 20 top summaries are grouped by compensation strategy, a clearer pattern emerges, as shown in Table 3:
Counts of top five readable summaries for all four content types separated by compensation strategy
Summaries produced in response to HITs offering a bonus constituted nine of the 20 summaries, and they were the top selection for the audio, image, and video media types. This is the case even though the base reward for these HITs was $1.00, which was equal to the lowest phase one compensation amount. These HITs were different because they provided additional scaled rewards that were potentially higher than other HITs’ flat compensation, and these bonuses were advertised as competitively assigned.
Summaries produced in response to HITs offering a bonus performed substantially better than those resulting from flat compensation, even when controlling for turkers’ experience levels. Both the bonus HITs and the $2.00 (qualifications) HITs could be completed only by experienced respondents with a high submission acceptance rate. The same number of HITs were posted for both of these variations, but the bonus HITs led to triple the number of top five summaries and three of four top choices.
The other HITs offering flat rewards of $2.00 or $1.50 each garnered three top five summaries, and—perhaps predictably—the lowest compensation rate of $1.00 produced only two. Collectively, this distribution suggests that in terms of intended outcome, compensation incentives—particularly variable, competitive bonuses—had a larger rhetorical effect on readability than changes in instructional language.
An interesting aberration is found in the paragraph summaries. The top choice for this content type resulted from a $1.00 HIT, two other top five summaries resulted from $1.50 HITs, and bonus and $2.00 (open) HITs produced one each. This differs markedly from the other content types’ results, which suggests that written text has some distinguishing traits that merit further discussion.
Looking at turkers’ ratings of task clarity and compensation fairness also provides insight into HITs’ readability. In phase two, respondents assessed turkers’ summaries from phase one rather than generating their own. As with phase one, phase two’s clarity and fairness ratings are high overall, and distinctions between media types are not significant. The audio clip again is notably the lowest rated content type for both measures, suggesting that the difficulty of discriminating content and invoking a context affects the scores.
An informative trend appears when phase two’s scores are separated by HITs that were restricted to experienced respondents with a high submission acceptance rate (Bonus and $2.00 [qualifications]) and those that were open to all turkers ($2.00 [open], $1.50, and $1.00). Respondents to the restricted HITs uniformly rated tasks as clearer and compensation as fairer than did respondents to open HITs, as Table 4 shows.
Average turker ratings of phase two’s task clarity and compensation fairness separated by HIT category and content type
| HIT Category / Content Type | Task Clarity (out of 4) | Compensation Fairness (out of 4) |
This trend also largely appears in phase one ratings, with minor deviations, as shown in Table 5.
Average turker ratings of phase one’s task clarity and compensation fairness separated by HIT category and content type
| HIT Category / Content Type | Task Clarity (out of 4) | Compensation Fairness (out of 4) |
Experienced, prolific turkers consistently express more favorable perceptions of these summary tasks than do a broader sampling of turkers. This suggests that experience is a component of readability on MTurk; the same text is not equivalently readable for all turkers. Readability emerges from a relationship among the turker, the text, and the task.
In combination, the findings on HITs’ success toward their intended goal and turkers’ perceptions of their clarity and fairness imply that the readability of these summary tasks was more meaningfully shaped by compensation strategies and audience targeting than instructional language differences.
Visibility and legibility animate the findings in ways that align with the discussion of telling in the project’s first study. Through metadiscourse, the summaries explicitly foreground humans’ situation within a rhetorical task. Through enthymeme, the summaries implicitly appeal to subjective contextual information. Through the mechanism of MTurk, these aspects are collated and made intelligible. Readability functions in multiple ways: turkers generate readable information in response to readable HITs and are themselves read and ordered by the MTurk system.
This raises the issue of what readability is on the service. Ratings suggest that it is not simplicity, because single-word statements that are comprehensible but incomplete have low scores. It also is not accuracy, because a top-rated summary states that the boy in the depicted scenario is a girl. It is not a matter of maximal detail or neutral perspective. It appears to be a balanced assemblage of these and other factors.
A nuance of this issue that emerges from the readability ratings is that although summaries from experienced turkers tend to populate the upper ranks, it may not be appropriate to view them as “better.” They may be more readable because they are more finely attuned to readability within the context of MTurk, an entity predicated upon encouraging humans to behave like machines. The service seeks to produce automated outcomes for tasks that resist them. In this sense, MTurk’s readability may be inherently mechanistic.
To explore these questions, we might return to the piece of media that displays significantly different readability ratings from the others: the paragraph. Written text is notable among the study’s media because it is the one most commonly associated with automated summary, as discussed in the introduction. There may be something different in the conventions and expectations of written text that flattens distinctions between human and algorithmic cognitive acts.
We considered this aspect by experimenting with GPT-3’s summary capabilities. Coercing it into recapping the same paragraph we provided to turkers produced interesting results, as shown in Figure 4.
Summaries of the study’s sample paragraph text by GPT-3
Here we see some of the same traits present in turkers’ summaries. GPT-3 provides metadiscourse as it expresses difficulty trying to limit the length of its summary, stating: “I’m not sure if I can do it in one sentence, but I’ll try.” It selectively cuts to what it believes is the passage’s kernel and extrapolates contextual information not contained in the source content through its statement: “I think the most important thing to note is that the man and the boy are not in a classroom.”
These passages were intentionally culled, and as a whole the summaries still come off as—well, odd. We only did preliminary activities in GPT-3 and did not rigorously test its summarization capacity because our goal is to examine readability on MTurk, not to compare human and algorithmic summaries to find a “winner”; however, the presence of these traits compels us to consider their overlaps.
Technology journalist Tristan Greene (2021) might suggest that the operations of such machine learning language models are no less “parlor tricks” than was the original Mechanical Turk. Gary Marcus and Ernest Davis (2020), authors of “GPT-3, Bloviator: OpenAI’s Language Generator Has No Idea What It’s Talking About,” state of GPT-3 that “although its output is grammatical, and even impressively idiomatic, its comprehension of the world is often seriously off, which means you can never really trust what it says.” It has a sophisticated6 capacity to associate words and phrases, even within subject matter domains, but it has no externally grounded semantic or ontological framework. This leads Marcus and Davis to bluntly designate it “a fluent spouter of bullshit.” Bender et al. (2021) seemingly would concur. Their paper7 identifies risks with what it calls a “stochastic parrot”: an automated language model “haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning.”
Nevertheless, there are other ways to view these acts and their output than through the binary of humans who understand meaning and machines that only statistically yoke together words they have tabulated but do not comprehend. GPT-3, like its distant forebear the Mechanical Turk, is not necessarily just faking being human; it is authentically being itself. It may be more appropriate to view the summaries generated by turkers and algorithms as moving toward a lingua franca, an amalgamated notion of what constitutes readability. Although the human is foregrounded through telling, we also see a bi-directional movement: humans become more standardized through their conditioned legibility to algorithmic systems, and the systems necessarily take on some of their character. That is to say, humans also often are parrots fluent in bullshit; our cognitive acts too may be creaky illusions. What may be happening is a growing together, an entanglement. MTurk is generating hybrid readabilities through its operations.
Clive Thompson (2021) provides an illuminating parallel by addressing a similar effect in one of the study’s other media types: still images. More specifically, Thompson speculates why Google’s CAPTCHA8 images—the photo grids that force you to click all the traffic lights to prove that you are human—are so “unbearably depressing.” Thompson argues that because the images are shot from elevated devices on cars (often to train self-driving algorithms), their angles are uncanny because they slightly deviate from human eye lines, and they almost completely exclude the natural world and feature only the built environments surrounding roadways. Additionally, the image quality and focus are poor; we are shown the images precisely because their deficiencies have thwarted automated image recognition systems, so we have been conscripted to assist. They also are devoid of recognizable humans to avert legal privacy issues, and they are segmented into grids, replicating the alien visual scanning field of a machine or the compound eye of an insect. “They weren’t taken by humans, and they weren’t taken for humans,” states Thompson, “They are by AI, for AI.” He goes on to note that the web services company Cloudflare has calculated that each day humans collectively spend 500 years completing CAPTCHAs. He wonders: “What has it done to humanity, to be forced us [sic] to regard these images, for years on end?” He then provides his own answer, noting that we now are compelled to “look at the world the way an AI does.”
The picture in our study, of course, was unlike a Google CAPTCHA image: it presented a posed but recognizably personal moment between a parent and child. However, this may not be a binary issue; as the AI coerces us to look through its lens we oblige it to watch through our eyes. Collectively we are building a shared perspective, as humans did with the still and motion picture cameras before the hobgoblin of image recognition algorithms. We again see in this a function of visibility: the practice of making individual sensors legible at a collective scale. This requires protocols, standard ways of communicating, and those must be negotiated. There is a synthesis in these prolonged couplings.
Addressing another of the study’s media formats, Eugene Wei (2020) explores such overlaps in video in “Seeing Like an Algorithm,” his examination of the clip-sharing service TikTok9. Wei draws from James C. Scott’s (2020) Seeing Like a State, which analyzes how nation-states summarize citizens through reductive abstractions to make aggregate patterns comprehensible. Scott explicates the significant failures of such compelled attempts because, according to Wei, “they impose a false sense of order on a reality more complex than they can imagine.” Yet Wei references the successes of visibility on TikTok, in part because the service has created conditions that induce users to make themselves intelligible. Readability on the system emerges through incentivizing visibility, as Wei contends:
James Scott speaks of “seeing like a state,” of massive shifts in fields like urban design that made quantities like plots of land and their respective owners “legible” to tax collectors. TikTok’s design makes its videos, users, and user preferences legible to its For You Page algorithm. The app design fulfills one of its primary responsibilities: “seeing like an algorithm.”
This algorithmic perspective returns us to telling, to the individual’s systemic value being bound to making the self visible so it may be interpreted at a collective scale. More specifically within the domain of rhetoric, the visibilities of Wei’s algorithms and Scott’s bureaucracies cohere within Casey Boyle’s (2016) “pervasive citizenship,” people functioning as continuously perceptible data points in a feverishly fraught drive toward “data-driven governance” (pp. 269–270). Boyle sees in the contemporary polis a waning of traditional rhetorical/democratic forms associated with previous modes of orality and literacy, of “communicating language for deliberating civic activity” (p. 270). In their absence, there has been an expansion of “sensor technologies and big data methodologies” that allow for collective action through tabulated activity (p. 271). Boyle holds that just as wearable devices and portable computing technologies inform the individual body, this continual telling allows citizens to function as the sensors of a collective social body.
This returns us to our theoretical framework’s question of readability in the context of literacy, because now we may have an item that can fill in our table’s missing cell, if we dislocate ourselves from the central position: the human intelligence task, understood broadly. We may inhabit an era when literacy is best framed not as a human prerogative but as an aspect of complex systems that collate our tellings. We thus form x-literacy’s base unit, sensing information and ourselves. As intelligible symbols for collective networks such as MTurk, we have become newly post-literate. We still engage with orality and print literacy. We still summarize audio, images, paragraphs, video, and situations. But to view contemporary literacy as the human capacity to interpret coded symbols is to look through the wrong end of the lens. We do not read the digital; the digital reads us.
One might attempt to dismiss this inverse reframing of literacy as mere semantics, but from language come concepts. As George Lakoff and Mark Johnson (2003) contend in Metaphors We Live By, linguistic models determine our presumed realities. To rebut the charge that materialist treatments of agency merely constitute empty verbal trickery, Bruno Latour and Michel Callon (1992) assert in “Don’t Throw the Baby Out with the Bath School!” that language unites words and things (p. 354). They argue that “a common vocabulary and a common ontology should be created by crisscrossing the divide by borrowing terms from one end to depict the other” (p. 359). This is Kevin Kelly’s (2010) espoused purpose in choosing the provocative and anthropocentric verb “wants” for the title of his book, What Technology Wants.
This may seem like hyperbole, but in addition to the meta-organization of MTurk, we might consider the services, the social networks, the search engines that Wei and Boyle suggest order our experience. Legal scholars Ryan Calo and Danielle Citron (2019) point out that the US administrative state in the twenty-first century is iteratively shifting from using computed data to assist human decision-makers to empowering automated systems that make autonomous policy decisions in policing, social services, and governance without human beings in the loop. This is no longer a situation of humans generating knowledge and circulating it amongst themselves through the means of the era; technologies are directly reading and writing what we inadequately deem human society.
This need not be bleak; technologists such as Kelly (2010) and cognitive scientists including Andy Clark (2004) may suggest that we always have been entangled with technologies, and those partnerships have proved largely beneficial. We constantly negotiate the boundaries of what “we” are and are not, and the respective roles of each. This is in part occurring through emerging notions of readability on MTurk.
Readability on MTurk is a complex phenomenon arising from human participants’ interactions with the system’s affordances. This study’s examination of survey responses has identified three principal findings regarding readability on MTurk:
Summaries explicitly foreground the situated aspects of readability through metadiscourse that conflates physiological and intellectual capacities, particularly in response to ambiguity.
Summaries implicitly respond to enthymemes through intuitive statements that rely on tacit sociocultural knowledge. These statements function as tells in the sense of poker: unconscious signals of positioning.
Compensation strategies and audience targeting may be more effectual for summary HITs’ readability than instructional language changes.
Collectively these observations situate readability in a context of telling, a reciprocal process of signifying and ordering shared by humans and machines. This places readability and telling within the lineage of literacy. The human intelligence task, which overtly appears on MTurk but also is latent in CAPTCHAs, digital citizenship, and many other forms, becomes the encoded symbol legible to macro-level systems. Contemporary literacy invokes a situation in which we are both readers and read.
It would be possible to further this research by comparing turkers’ summaries with those generated by GPT-3 or image, audio, and video description services—such as those previously identified—which may illuminate differences between them. Although our goal is not to determine which type is objectively more accurate, it may be valuable to triangulate how readability emerges on MTurk in comparison to other contexts.
Additionally, this work can contribute to the broader interrogation of human and machine interactions that is the Following Mechanical Turks project’s primary focus. In the project’s first study, we examined MTurk’s language to ascertain how the human was framed rhetorically. The current study has considered how HITs’ readability functions with the service’s participants. These texts have outlined pathways into a broader theoretical exploration of how the MTurk system, whose affordances entangle human and non-human elements, articulates notions of the human in a contemporary context of automated labor.
For an overview of MTurk, see Tirrell & Rivers (2020).
We suspect that anyone reading this has graded their share of papers.
Wynings (2017) points out that the semantic web, networked content specially tagged with additional metadata to make it readily intelligible to algorithms, was rendered unnecessary by AI and machine learning systems probabilistically reading existing human information and behavior patterns.
We used these criteria rather than MTurk’s “Masters” qualification because that designation is, by Amazon’s own admission, opaque (“FAQs,” 2018), and turkers report that it is capriciously assigned or withheld (asmrgurll, 2018). The qualifications we used are common, clear, and objective. The relationship between concealment and visibility is crucial to this study, and we see in the Masters qualification another sleight-of-hand in a service explicitly named for a deception that hides human labor beneath a veneer of automation.
Although the surveys’ instructional language provided this variable breakdown to respondents, ultimately we gave all of them the highest bonus of $3.00 in addition to the flat $1.00 compensation. We did this because, based on our conversation with Dr. Cagle, we determined that respondents’ belief in a tiered bonus structure was sufficient to identify any observable effects, and the highest bonus amount would more fairly compensate them for their labor. Additionally, we did not want to delay respondents’ bonuses until all second phase data was processed, so promptly awarding all of them the highest bonus was the best solution.
This descriptor may be quite accurate. In Marcus & Davis (2020), AI researcher Douglas Summers-Stay describes GPT-3 in terms that recall Gorgias’s renown as a skilled confabulator and Plato’s more pejorative contrast between cookery and medicine in the eponymous dialogue: “GPT is odd because it doesn’t ‘care’ about getting the right answer to a question you put to it. It’s more like an improv actor who is totally dedicated to their craft, never breaks character, and has never left home but only read about the world in books. Like such an actor, when it doesn’t know something, it will just fake it. You wouldn’t trust an improv actor playing a doctor to give you medical advice.”
According to Hao (2020), Timnit Gebru, one of the paper’s authors and a noted advocate for diversity and ethics in AI research, was removed from her position with Google because of this work, which challenged core elements of the company’s business model.
It is worth noting that CAPTCHA is a contrived acronym that stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.”
Wei’s (2020) discussion of TikTok discerns an “exoticization that often characterizes Western analysis of the Chinese tech scene,” which aligns with the intentional patina of orientalism on the Mechanical Turk.