Spinning words as disguise: Shady services for ethical research?

Joseph Reagle, Manas Gaur

2022-01-11

STATUS: Published.

Spinning words as disguise: Shady services for ethical research?
by Joseph Reagle and Manas Gaur.
First Monday, Volume 27, Number 1 - 3 January 2022
https://firstmonday.org/ojs/index.php/fm/article/download/12350/10588
doi: https://dx.doi.org/10.5210/fm.v27i1.12350

ABSTRACT: Ethical researchers who want to quote public user-generated content without further exposing these sources have little guidance as to how to disguise quotes. Reagle (2021b) showed that researchers’ attempts to disguise phrases on Reddit are often haphazard and ineffective. Are there tools that can help? Automated word spinners, used to generate reams of ad-laden content, seem suited to the task. We select ten quotations from fictional posts on r/AmItheButtface and “spin” them using Spin Rewriter and WordAi. We review the usability of the services and then (1) search for their spins on Google and (2) ask human subjects (N=19) to judge them for fidelity. Participants also disguise three of those phrases and these are assessed for efficacy and the tactics employed. We recommend that researchers disguise their prose by substituting novel words (i.e., swapping infrequently occurring words, such as “toxic” with “radioactive”) and rearranging elements of sentence structure. The practice of testing spins, however, remains essential even when using good tactics; a Python script is provided to facilitate such testing.

Introduction

Reddit is known as the “front page of the web,” claiming “52M+ daily active users” and “100K+ communities” (Reddit, 2021). Millions of Redditors, including minors and other vulnerable populations, have thousands of subreddits to discuss extraordinarily specific and sometimes sensitive topics, including sexuality, health, violence, and drug use.

Given the public prominence, breadth, and depth of Reddit’s user-generated content, researchers use it as a data source. Proferes et al. (2021) identified 727 such studies published between 2010 and May 2020. A fraction of these papers use what Bruckman (2002) characterized as heavy disguise, wherein usernames are elided and phrases reworded so that it’s difficult for others to locate the source. Proferes et al. (2021) found that just 18 (2.5 percent) of their studies claimed to “paraphrase” Redditors in their reports.

The minority of researchers who claim to disguise their sources note that users’ health, relationships, employment, and legal standing are jeopardized by extra exposure. Additionally, users need not be personally identified to feel embarrassed, to be harassed, or to be forced to abandon a long-held pseudonym. Researchers themselves are at risk if their practice has fallen short of approved IRB policy or other regulations, such as health privacy regulations. Even when the use of public sources is outside institutional human subjects review, researchers might face embarrassment or repercussions if a source complains. Disguising sources’ prose can mitigate these harms.

Unfortunately, many attempts at ethical disguise fail. Reagle (2021b) interrogated 22 Reddit research reports; 19 claimed to reword phrases to disguise their sources, but only 8 of those succeeded. Some researchers simply failed to reword phrases. One researcher collected and reported data they said they would not. Two others failed to scrub their reports of locatable information after they opted for heavier disguise during their reports’ review and editing. Some failed to introduce enough change into their phrases. We have no evidence Redditors were affected by these failures, but the potential for harm to them (e.g., embarrassment, harassment, or changes to employment and legal status) and to the researchers’ reputations exists.

Wordspinners disguise text by substituting synonyms and rearranging, condensing, or expanding prose so that the result appears novel. Could researchers be helped by these services that automatically disguise (or “spin”) prose?

Spin Rewriter and WordAi are typically used to build content farms for “search engine optimization” (SEO) without Google detecting the spun content as copied. Spin Rewriter “takes a single article and turns it into dozens of 100% unique, human-quality articles. All these unique articles will let you rank higher, and for more profitable keywords” (INFINET, 2021). Google is likely to see a web of such interlinked articles as authoritative content, driving search traffic to the farm and increasing its advertising revenue. Like Spin Rewriter, WordAi alters sentences with some understanding of semantics, and “this high level of rewriting ensures that Google and Copyscape can’t detect your content while still remaining human readable!” (Cortx, 2021).

Perhaps these shady services, typically used by plagiarists and spammers, can be used for the ethical disguise of researchers’ online sources. We test how difficult it is to locate spun phrases, assess the quality of the resulting prose, and compare both with human efforts.

Background

Reddit and Sensitive Topics

Reddit was founded in June 2005 as a pseudonymous-friendly website for users to share and vote for links they had read (i.e., “I read it.”) Reddit’s development as a forum of forums, where users could trivially create subreddits, each with their own moderators, led the website to succeed over its link-sharing peers such as Digg and Delicious. (It also led to problematic content and behavior in the first half of Reddit’s life.)

Like Twitter and Wikipedia, Reddit serves an extraordinary corpus of mostly public data. That is, while there are private and quarantined subreddits, the vast majority of content is public: transparently accessible to any web browser or search engine. More so than Wikipedia and much of Twitter, Reddit hosts discussions of a personal character. Subreddits on sexuality, health (including mental health and eating disorders), interpersonal abuse and violence, and drug use and cessation have been topics of research. Reddit is a compelling and accessible venue, but with sensitive—even if public—information.

Disguising Sources to Mitigate Their Location

We speak of disguising public sources to prevent them from being located.

Bruckman (2002) identified a spectrum of disguise, from none to heavy. Under light disguise, for example, “an outsider could probably figure out who is who with a little investigation.” Under heavy disguise, some false details are introduced and verbatim quotes are avoided if a “search mechanism could link those quotes to the person in question.” If the heavy disguise is successful, “someone deliberately seeking to find a subject’s identity would likely be unable to do so.” Introducing false or combined details about a source has been referred to as fabrication, a tactic of heavy disguise. Fabrication can conflict with traditional notions of research rigor and integrity. Markham (2012) argues that if done with care, fabrication can be the most ethical approach. If not done with care, however, fabrication can lead to suspicions of fraud (Singal, 2016).

In human subjects research, such as healthcare, de-identification “involves the removal of personally identifying information in order to protect personal privacy” (Guidelines for Data de-Identification or Anonymization, 2015). Anonymized is sometimes used synonymously with de-identified, or can have a stronger connotation of data being rendered incapable of being re-identified (Lubarsky, 2017). We avoid anonymized because it is far too assured a word given the known cases of failure (Bradbury, 2021; Ohm, 2010). And in public data contexts, there might not be personally identifiable information given the use of pseudonyms. Reagle (2021b) provides a more complete review of alternative terms (and of the ethical question of the following section), but we speak of testing the locatability of disguised phrases, especially those spun by automated services. When sources are located, user accounts could be further de-identified by adversaries.

Should Researchers Disguise Public Sources?

If and when to disguise is an ongoing conversation among researchers and ethicists. For example, Sharf (1999, p. 253) argued that researchers should seek the consent of public sources and “implied consent should not be presumed if the writer does not respond.” Rodham & Gavin (2006) responded “that this is an unnecessarily extreme position to take” and wrote, “messages which are posted on such open forums are public acts, deliberately intended for public consumption.” There is a substantive disagreement here, but there are also issues of definition. Sharf was studying a breast cancer email list (“public” because the list is “open” for anyone to join), whereas Rodham and Gavin’s sense (i.e., “intended for public consumption”) permits the content to be transparently accessed by third-party search and archival services. These two senses of “public” can have different ethical implications.

Additionally, whether researchers should disguise is dependent on site-specific considerations, be it at Wikipedia, a mostly-pseudonymous encyclopedia (Pentzold, 2017), at 4chan, a highly-anonymous discussion board (Zelenkauskaite et al., 2020), at sites where “we are studying people who deserve credit for their work” (Bruckman et al., 2015), or public sites where people, nonetheless, discuss sensitive topics or share images (Andalibi et al., 2017; Ayers et al., 2018; Chen et al., 2021; Dym & Fiesler, 2020; Fiesler & Proferes, 2018; Haimson et al., 2016; Sowles et al., 2017). Additionally, websites have affordances that affect how sources can be located, such as novel search capabilities or external archives (Reagle, 2021b).

We take no position on whether researchers should disguise their sources. Rather, we focus on tactics available to researchers who want to disguise their quotes, because they often do a poor job of it (Ayers et al., 2018; Reagle, 2021b).

Locating sources

Concerned researchers have started to assess how often usernames, quotations, and media are included in research reports.

Ayers et al. (2018) analyzed 112 health-related papers discussing Twitter and found 72 percent quoted a tweet, “of these, we identified at least one quoted account holder, representing 84%.” When usernames were disclosed, in 21 percent of the papers, all were trivially located. Ayers et al. wrote that these practices violate International Committee of Medical Journal Editors (ICMJE) ethics standards because (1) Twitter users might protect or delete messages after collection, and (2) revealing this information has no scientific value.

Proferes et al. (2021) performed a systematic overview of 727 research studies that used Reddit data and were published between 2010 and May 2020. “Sixty eight manuscripts (9.4%) explicitly mentioned identifiable Reddit usernames in their paper and 659 (90.7%) did not. Two hundred and seven papers (28.5%) used direct quotes from users as part of their publications, 18 papers used paraphrased quotes, noting they were paraphrased (2.5%) and 502 (69.1%) did not include direct quotes.” 1

In the studies above, researchers who paraphrase quotes are found to be rare and laudable. But are such paraphrases effective? Reagle (2021b) interrogated 22 Reddit research reports: 3 of light disguise, using verbatim quotes, and 19 of heavier disguise, claiming to use reworded phrases. They concluded that disguising sources is effective only if done and tested rigorously because they located all of the verbatim sources (3/3 reports) and many of the reworded sources (11/19 reports). Researchers who elided the forum and year or collected data over multiples thereof were less likely to have their sources located. Conversely, if locating a source is like finding a needle in a haystack, “reports that focus on a single subreddit (as stated or inferred) in a single year winnow away much of the hay,” making the search easier (Reagle, 2021b). A few researchers admirably tested their disguises in Google, though their success was dependent on the specificity of their queries.

Spinning words

Unlike some writing tools, such as QuillBot (2021), that can rewrite prose for clarity’s sake, spinners are designed to be used at scale—creating dozens of variations—while avoiding detection. That is, their spins should be natural sounding and true to the original source without being detected as a copy.

Spin Rewriter was launched in 2011 by Aaron Sustar (2016). In addition to an interactive human interface, it provides a WordPress (blogging platform) plugin and an API with libraries for C#, JavaScript, PHP, and Python (Spin Rewriter API SDK, 2021). As of May 2021, the service costs $47 a month or $77 a year with a special discount. WordAi was launched in 2012 by Alex Cardinell (2012). It too offers an interactive website and API for $49.95 a month or $299.40 a year.

To understand how word spinning works, consider this Spin Rewriter example and the following source prose:

What’s worse is that friend doesn’t seem to understand what the problem is with some stranger coming into our living space and basically wags his tail every time he sees this stranger.

Spin Rewriter, as does WordAi, provides spintax so that the client can see the range of available variation, using curly brackets {..} to represent grouping and pipes | to represent alternatives. In this case:

What’s {worse| even worse} is that {friend| buddy| pal| good friend} {doesn’t| does not} {seem| appear} to {understand| comprehend} what the {problem| issue} is with some {stranger| complete stranger} {coming into| entering| entering into} our {living space| home} and {basically| essentially| generally} wags his tail {every time| each time| whenever} he sees this {stranger| complete stranger}.

One such instance of spun prose is:

What’s worse is that friend does not appear to understand what the issue is with some stranger coming into our living space and generally wags his tail every time he sees this complete stranger.
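The spintax notation above can be expanded mechanically. The following is a minimal sketch of such an expansion, assuming flat or nested groups in curly brackets with pipe-separated alternatives; the function names are ours for illustration and are not part of the Spin Rewriter or WordAi APIs.

```python
import random
import re

# One innermost {a|b|c} group: braces with no other braces inside.
GROUP = re.compile(r"\{([^{}]*)\}")

# The spintax shown above, copied from the Spin Rewriter example.
SPINTAX = (
    "What’s {worse| even worse} is that {friend| buddy| pal| good friend} "
    "{doesn’t| does not} {seem| appear} to {understand| comprehend} what the "
    "{problem| issue} is with some {stranger| complete stranger} "
    "{coming into| entering| entering into} our {living space| home} and "
    "{basically| essentially| generally} wags his tail "
    "{every time| each time| whenever} he sees this {stranger| complete stranger}."
)


def alternatives(group_body):
    """Split a group body on pipes, dropping the padding around each option."""
    return [option.strip() for option in group_body.split("|")]


def spin_once(spintax, rng):
    """Replace every group with one randomly chosen alternative."""
    text = spintax
    # Resolve innermost groups repeatedly so nested spintax is also handled.
    while GROUP.search(text):
        text = GROUP.sub(lambda m: rng.choice(alternatives(m.group(1))), text)
    return text


def count_combinations(spintax):
    """Count the choice combinations offered by flat (non-nested) spintax."""
    total = 1
    for match in GROUP.finditer(spintax):
        total *= len(alternatives(match.group(1)))
    return total


if __name__ == "__main__":
    print(spin_once(SPINTAX, random.Random()))
    print(count_combinations(SPINTAX))
```

For this example the twelve groups offer 2 × 4 × 2 × 2 × 2 × 2 × 2 × 3 × 2 × 3 × 3 × 2 = 27,648 combinations, which illustrates how a single sentence can yield many variants.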

We will assess spins for their (non)locatability and their fidelity to meaning and fluency. Locatability should be inversely related to computational metrics of “lexical dissimilarity”: “how much has the paraphrase changed the original sentence?” 2 The more dissimilar the spin is from its source phrase, the less likely a search engine will return the source. However, dissimilarity is a static function of the phrases; locatability is the result of human searches of a dynamic web using ever-changing indexes and algorithms. Our use of fidelity combines the “adequacy” of meaning preservation and the “fluency” of the result. 3 “Semantic completeness” is an alternative, and preferred, term to “adequacy.” 4
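To make the idea of a static dissimilarity function concrete, here is a rough sketch using Jaccard distance over word sets. This is our own crude stand-in for illustration, chosen because it needs no word embeddings; it is not the metric reported in this study.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of word tokens."""
    normalized = text.lower().replace("’", "'")
    return set(re.findall(r"[a-z0-9']+", normalized))


def jaccard_distance(source, spin):
    """Return 0.0 for identical word sets and 1.0 for disjoint ones."""
    a, b = word_set(source), word_set(spin)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Opening clauses of the source and spun example shown earlier.
SOURCE = "What’s worse is that friend doesn’t seem to understand what the problem is"
SPIN = "What’s worse is that friend does not appear to understand what the issue is"

if __name__ == "__main__":
    print(jaccard_distance(SOURCE, SPIN))
```

A measure like this depends only on the two phrases, so it cannot capture how a search engine's index or ranking changes over time; that is why the spins are also tested with actual searches.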

Our impressions of these services as tools for ethical disguise are as follows.

The Spin Rewriter website was easy to use. The spun content was comprehensible and had few tells of artificial generation. Given these services operate on the shady edges of the web, where services require credit card information during a trial period and might create spurious charges, it was a relief that the Spin Rewriter account was easily canceled, though not without many promotional emails urging a return to the service.

WordAi was not as polished and the results had a noticeable tell: the capitalization of words was odd, such as: “Fortunately a Couple of Days Back That the door was left open…” (We manually corrected these before showing them to human subjects because we wondered if we had made a mistake in configuration, the corrections were easy to make, and they made the comparisons more interesting.) Cancellation required that customer service be contacted, though a representative said that canceling via the website should soon be available. (We have not tested WordAi 5.0, which was released in the last week of June 2021.)

There is significant computer science literature on “paraphrasing” text using artificial intelligence techniques (Androutsopoulos & Malakasiotis, 2010; Celikyilmaz et al., 2021). But word spinning, as an accessible service to non-technical users, is little discussed in the literature. We’d like to see any researcher avail themselves of such a tool, even if they lack the technical expertise (Zelenkauskaite & Bucy, 2016) to understand or implement semantic modeling and transformations. In the educational context, Kannangara (2017) found that word spinners rarely improve prose quality and successfully evade plagiarism detection. The following experiments test these findings from the perspective of a researcher disguising online sources in their reports.

Experiment 1: Locating Automated Spinning

On Reddit, a post is followed by comments within a thread. Posts and comments are, generically, messages. We used five posts tagged as fictional on the subreddit r/AmItheButtface. This subreddit is “the cool, relaxed, bastard nephew of /r/AmItheAsshole” (r/AmItheButtface, 2020). It allows fictional posts, which are often scenarios from popular media (e.g., a TV character looking for his underwear) or the antics of toddlers and pets (e.g., a cat who enjoys swatting knickknacks off the mantel). Consequently, our phrases have the form of personal and possibly sensitive advice disclosures but are labeled as fictional by their authors.

For each of the five posts, we selected a phrase from the post and from a comment, yielding ten quotes altogether.

Users of spinners can configure the spins, varying the amount of fidelity and structural changes performed, as well as providing custom word lists to the spinners. We typically opted for the default settings. At Spin Rewriter, we selected “Most readable: only use synonyms that are definitely correct.” Though we experimented with “Very readable” at WordAi, we opted for the default “Readable” as it provided more varied prose with no loss in fidelity. Playing with the options for rearranging sentences and paragraphs didn’t seem to be of consequence given the source phrases were short.

We developed the reddit-search.py GPLv3-licensed script (Reagle, 2021a) to help locate phrases within the first page of search results. The script iterates through the phrases in a spreadsheet, building search engine queries and opening the results in browser tabs for manual scrutiny. It queries the search engines at Google, Reddit, and RedditSearch/Pushshift (Baumgartner et al., 2020). To ease the testing of disguise, the script can automatically check the search results for the sources’ URLs if provided in the spreadsheet’s url column.
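The general approach can be sketched as follows. This is not the code of reddit-search.py; it is a simplified illustration under our own assumptions: the function names are hypothetical, the query URL patterns for Google and Reddit are the ordinary public search pages, the RedditSearch/Pushshift engine is omitted, and the list of result URLs is assumed to be gathered separately.

```python
import webbrowser
from urllib.parse import quote_plus, urlsplit


def build_queries(phrase, subreddit=None):
    """Return {engine: query URL} for one phrase (hypothetical helper)."""
    site = f"site:reddit.com/r/{subreddit}" if subreddit else "site:reddit.com"
    return {
        "google": "https://www.google.com/search?q=" + quote_plus(f"{site} {phrase}"),
        "reddit": "https://www.reddit.com/search/?q=" + quote_plus(phrase),
    }


def normalize(url):
    """Reduce a URL to host plus path so trivial variants compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    for prefix in ("www.", "old."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    return host + parts.path.rstrip("/")


def source_found(source_url, result_urls):
    """True if the source URL appears among the search result URLs."""
    target = normalize(source_url)
    return any(normalize(url) == target for url in result_urls)


def open_in_tabs(urls):
    """Open each query in a browser tab for manual scrutiny."""
    for url in urls:
        webbrowser.open_new_tab(url)


if __name__ == "__main__":
    queries = build_queries("went within and found that mouse", "AmItheButtface")
    for engine, url in queries.items():
        print(engine, url)
```

Normalizing before comparison is a design choice for this sketch: the same message can be reached through several hostnames or with a trailing slash, and a strict string comparison would miss those matches.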

For the present study, only Google results are reported because the Reddit-specific engines were ineffective against these non-verbatim queries.

Ethical Policy

Redditors of r/AmItheButtface did not consent to the use of their prose. Though we mention the subreddit and provide a few quotes that could be searched for by readers of the present report, we elide usernames and dates. We believe lack of consent and light disguise are appropriate: posts were in a public forum, from obvious pseudonyms, marked as fictional by their authors, and had existed for more than five months without deletion at the time of capture.

This policy is part of Institutional Review Board application #20-08-30 by the first author and “approved” as DHHS Review Category #2: “Exempt… No further action or IRB oversight is required as long as the project remains the same.” The second author joined the project later and only had access to the data within this report itself.

Results

Table 1 includes ten source phrases, their spins, a metric of their dissimilarity from the source, and whether the spins were found. Word Mover’s Distance (WMD) measures the semantic difference between two sentences, where higher numbers indicate greater difference and zero means identical (Kusner et al., 2015). WMD is also provided in Table 4.
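For readers unfamiliar with the metric, WMD (Word Mover’s Distance) as formulated by Kusner et al. (2015) can be summarized as a transport problem over word embeddings. The notation below is our paraphrase of that formulation, not something specified in this study: d and d' are the normalized bag-of-words vectors of the two sentences over a vocabulary of n words, x_i is the embedding of word i, and T_ij is the amount of word i in one sentence that "travels" to word j in the other.

```latex
\mathrm{WMD}(d, d') \;=\; \min_{T \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, c(i, j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j,
\qquad \text{where } c(i, j) = \lVert x_i - x_j \rVert_2 .
```

Two identical sentences need no transport, so their distance is zero; substituting a near synonym moves a little mass over a short embedding distance, which is consistent with the small values reported for single-word substitutions.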

Table 1: Reddit phrases and automated spins
Source Phrases WMD Found
1. Reddit Luckily a few days ago the door was left slightly open when my mom was out and I went inside and found that mouse.
Spin Rewriter Fortunately a few days ago the door was left somewhat open when my mom was out and I went within and found that mouse. 0.393
WordAi Fortunately a Couple of Days Back That the door was left open when my Mother was out and I went inside and Discovered that mouse. 0.472
2. Reddit Maybe she was just surprised and needed a few minutes to think about how to reciprocate this thoughtful gift you gave her.
Spin Rewriter Maybe she was just stunned and needed a few minutes to consider how to reciprocate this thoughtful present you provided her. 0.347
WordAi Perhaps she was surprised and wanted a Couple of minutes to Consider how to exude this thoughtful gift you gave her. 0.307
3. Reddit What’s worse is that friend doesn’t seem to understand what the problem is with some stranger coming into our living space.
Spin Rewriter What’s even worse is that buddy does not appear to understand what the problem is with some stranger coming into our home. 0.244 Google
WordAi What is worse is that that friend does not Appear to know what the issue is with a stranger coming to our living area. 0.303 Google
4. Reddit I think you should have made your opinion known on the first night. You let ambiguity regarding your feelings develop.
Spin Rewriter I believe you must have made your viewpoint known on the first night. You let uncertainty concerning your feelings establish. 0.441
WordAi That I believe that you need to have made your opinion known about the very first night. You allow ambiguity regarding your emotions grow. 0.368
5. Reddit When I first got into her bedroom, I quickly rummaged through the piles of clothes sitting on the floor to confirm my suspicions.
Spin Rewriter When I initially entered into her bed room, I quickly searched through the stacks of clothing sitting on the floor to verify my suspicions. 0.324
WordAi When I got into her bedroom, then I quickly rummaged through the piles of clothing sitting on the ground to verify my feelings. 0.364
6. Reddit She has her boundaries for a reason - she only wants people she can trust in her room, people who won’t go digging through her shit & taking things.
Spin Rewriter She has her borders for a factor - she just wants individuals she can trust in her space, individuals who will not go digging through her shit & taking things. 0.524
WordAi She’s her bounds for a reason - she just needs people she can trust in her room, individuals that will not go digging through her shit & doing matters. 0.372
7. Reddit At this point, I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness.
Spin Rewriter At this point, I had definitely no choice but to press the glowing button on the Xbox and put an end to the insanity. 0.230
WordAi Now, I had no option but to press on the glowing button on the Xbox and put a stop to the insanity. 0.450
8. Reddit Your parents are toxic. Your mom sounds like a narcissist. Your dad just stood by while she said that to you? He’s enabling her. Don’t let them gaslight you.
Spin Rewriter Your moms and dads are hazardous. Your mama seems like a narcissist. Your daddy just waited while she stated that to you? He’s enabling her. Don’t let them gaslight you. 0.298
WordAi Your parents are poisonous. Your mother sounds like a narcissist. Your daddy just stood while she explained that for you? He is enabling her. Do not let them gaslight you. 0.207
9. Reddit Anyway, I told her to stop coming around, but she wouldn’t stop. I ended up having to call the cops on her to keep her away.
Spin Rewriter Anyway, I told her to stop occurring, but she would not stop. I wound up having to call the police officers on her to keep her away. 0.270
WordAi Anyhow, I advised her to quit coming about, but she would not stop. I ended up needing to call the cops on her to maintain her away. 0.302
10. Reddit Dying in hospitals is a new thing; before that, for centuries and centuries, people have died at home.
Spin Rewriter Passing away in healthcare facilities is a new thing; before that, for centuries and centuries, individuals have died at home. 0.370 Google
WordAi Dying in hospitals is a brand new item; earlier this, for centuries and centuries, people have died at home. 0.232

The tactics employed on these short phrases by the spinners are simple: single-word substitutions. In phrase 8 we see that Spin Rewriter (awkwardly) replaced “parents” with “moms and dads” but this rare multi-word substitution is still a single substitution rather than a substitution across many words that is comprehensive of semantics.

Google located both spins of phrase 3; we suspect there were not enough words with applicable synonyms. For phrase 10, Google located the Spin Rewriter version. This was because WordAi was more aggressive: replacing “new thing” with the awkward “new item.” This diverted Google but sounds artificial. Fidelity and variation are often balanced against each other.

Generally, we found Spin Rewriter’s prose was more fluent, especially given WordAi’s odd capitalization in a few examples. Is this impression shared by others? And how do human subjects spin phrases?

Experiment 2: Surveying Researchers

In May 2021, 20 people completed an online survey via a Google Form. We solicited participants from the 3rd Annual Obfuscation Workshop and on the email list of the Association of Internet Researchers (AoIR). One person withdrew at the final stage of selecting “submit” or “withdraw,” for unknown reasons, and their data is not included (N=19).

Both of the solicited communities include people interested or engaged in the practice of ethical disguise. However, this is not a representative sample of those who use ethical disguise in their research reports. Even so, their responses do lead to useful insights about how researchers might spin sources’ phrases and the efficacy of those tactics. The form asked participants to fill in their occupation, which can be summarized as:

Responses are indexed to the row in the resulting Google Form spreadsheet. For example, R02 is the first response, given that the first row has column headings. The two phases of the survey consisted of participants (1) performing their spins of three example phrases, and (2) judging the performance of Spin Rewriter and WordAi. For the experiment, subjects did their own spins before exposure to the automated examples—though this order is reversed in the explanation below.

Ethical Policy

Participants assented via a consent form that was the first page of the online Google Form. No identifying information is provided in this report or the publicly available data.

This policy is part of the same approved application mentioned in Experiment 1.

Experiment 2.1: Judging Automated Spinning

Phrases from r/AmItheButtface (1–3, in Table 1) were spun with Spin Rewriter and WordAi, with the latter’s odd capitalization corrected. Subjects were asked: “Given an original quote and two disguised versions, select the one you think is better with respect to non-discoverability and fidelity. Select ‘equivalent’ if you think them so.” The results are shown in Table 2. One subject wrote they did not understand the “equivalent” option and expressed a preference for each spinner.

Table 2: Subjects’ preferred spins
Phrase   WordAi   Spin Rewriter   Equivalent
1        2        12              5
2        8        6               5
3        4        6               9
Total    14       24              19

Spin Rewriter is favored by subjects, primarily on the strength of the first phrase’s spins. (An analysis of variance shows a statistically significant preference for Spin Rewriter on phrase 1: F = 7.56, p = 0.001 < 0.05.) Even so, a fair number of people expressed no preference between the two.
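The totals in Table 2 can be tallied and expressed as shares of all judgments with a few lines of code. This sketch only re-derives the descriptive totals from the table; it does not reproduce the analysis of variance reported above, and the variable names are ours.

```python
from collections import Counter

# Counts copied from Table 2: preferred spin per phrase (19 judgments each).
VOTES = {
    1: {"WordAi": 2, "Spin Rewriter": 12, "Equivalent": 5},
    2: {"WordAi": 8, "Spin Rewriter": 6, "Equivalent": 5},
    3: {"WordAi": 4, "Spin Rewriter": 6, "Equivalent": 9},
}


def tally(votes):
    """Sum the per-phrase counts into one total per option."""
    totals = Counter()
    for row in votes.values():
        totals.update(row)
    return totals


def shares(totals):
    """Express each total as a percentage of all judgments, to one decimal."""
    n = sum(totals.values())
    return {option: round(100 * count / n, 1) for option, count in totals.items()}


if __name__ == "__main__":
    totals = tally(VOTES)
    print(dict(totals))
    print(shares(totals))
```

Of the 57 judgments, Spin Rewriter was preferred in 24 (42.1 percent), WordAi in 14 (24.6 percent), and 19 (33.3 percent) were judged equivalent.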

Experiment 2.2: Tactics of Human Disguise

Subjects (N=18, R04 abstained from this portion) were asked to disguise phrases 6–8: “The following quotes are from pseudonymous and fictional posts on an advice forum. Disguise (fuzz) them to the degree that you think they will not be discovered via a search engine while maximizing fidelity to the original.” The (colored) boxes of dotted and dashed lines were not visible to participants; instead, we provide them to ease analysis and understanding.

phrase 6
She has her boundaries for a reason - she only wants people she can trust in her room, people who won’t go digging through her shit & taking things.
phrase 7
At this point, I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness.
phrase 8
Your parents are toxic. Your mom sounds like a narcissist. Your dad just stood by while she said that to you? He’s enabling her. Don’t let them gaslight you.

In the resulting data (reddit-mask-survey-spins.csv) and its coding (reddit-mask-survey-spins-coded.xlsx) we see the following spinning tactics.

  1. ungendered nouns and pronouns
  2. single-word substitutions
  3. multiple-words substitutions
  4. rearranged sentence structure
  5. removed elements of sentence structure

R12, for example, replaced the gendered “mom” and “dad” with “parent.” (If you recall, Spin Rewriter clumsily did the opposite, replacing “parents” with “moms and dads.”)

R12’s phrase 8
Your parents are toxic. Your parent sounds like a narcissist. Your other parent just stood by while she said that to you? (…) enabling her. Don’t let them gaslight you.

R19 replaced the gendered pronoun “she” with the singular “they” while accidentally preserving gender at the end of their phrase.

R19’s phrase 6
They have their boundaries for a reason. They only want people they can trust in their room, people who will not go digging through her things.

No one chose to reverse the genders of those discussed.

On phrase 7, R02’s spin showed multi-words substitutions and the rearrangement of the three elements: (a) lack of choice, (b) pressing a button, and (c) ending madness. R02 replaced “end the madness” with “stop the craziness” and moved that element to the middle of the sentence.

Original phrase 7
At this point, I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness.
R02’s phrase 7
At that very moment there was only one way to stop the craziness: to push the lighted X-box switch.

As seen in Table 3, almost all subjects transformed the phrases with multi-words substitutions. About a third rearranged the positions of major elements of the phrase. Only a few subjects exclusively used single-word substitutions or removed elements of the phrase.

Table 3: Subjects’ spin tactics
Phrase   Ungender   Singles   Multiples   Rearrangements   Removals
6        1          1         16          6                5
7                   0         17          5                3
8        2          1         17          8                4

Two of the spins exemplify the tension of balancing fidelity (of prose) against fecundity (of variation).

Across all phrases, R08 aggressively minimized the prose. For phrase 7, they maintained the element of reaching a moment without choice, but removed the elements of turning off a game and ending “madness.”

R08’s phrase 7
You’ve got to know when to say No.

Across all phrases, R12 tended to maintain the original prose with some words trimmed via ellipses (three times in phrase 6; once in phrases 7 and 8). In phrase 7, they removed “At this point,”.

R12’s phrase 7
(…) I had absolutely no choice but to press the glowing button on the Xbox and put an end to the madness.

R08’s spin should not be locatable, but it strays far from the original meaning. R12’s spin maintains much fidelity but might be locatable.

Experiment 2.3: Locating Human Disguises

We used a Python script developed to facilitate the search for Reddit sources (Reagle, 2021a) and reviewed all hits on the first page of Google results (between 7 and 20 results). After confirming the original versions of phrases 6, 7, and 8 could be located, we tested 54 human disguises (3 phrases by 18 participants) and the six automated disguises from Table 1. Table 4 shows the located spins (see reddit-mask-survey-spins.csv for all data).

Table 4: Located disguises for phrases from Table 1
Phrase Subject Spun WMD Found
6 R12 has (…) boundaries for a reason (…) only wants people (…) can trust in (…) room, people who won’t go digging through her shit and taking things. 0.306 Google
6 R19 They have their boundaries for a reason. They only want people they can trust in their room, people who will not go digging through her things. 0.294 Google
7 R15 Now I had to push the button on the console and stop the madness. 0.478 Google
8 R06 Your mom is a narcissist, and your dad just stood by when she said that to you! They are toxic, don’t let them get the better of you. 0.479 Google
8 R11 Your mom and dad are toxic. She seems to be narcissistic and if your dad just stood by while she said those things, then he’s enabling her. They are gaslighting you, don’t let them do it. 0.472 Google
8 R14 You need to not let your parents gaslight you. They seem toxic and I can’t believe your dad just stood by while your narcissistic mother said those things to you. 0.349 Google
8 R16 Your dad just standing by while she said that shows that he is enabling your narcissistic mother, and both of them are toxic. You shouldn’t let your parents gaslight you. 0.313 Google

Recall, from Table 1, that the automated spins of phrases 3 and 10 were located via Google searches. Here, we located the sources of some human disguises of phrases 6, 7, and 8. (This sequence is a coincidence, nothing more.)

It’s impossible to know why Google returned a source in the first page of results for these disguises: it’s a complex and opaque algorithm. (And as Google’s algorithm changes, so could these results.) However, when considering the human tactics mentioned above, we suspect:

Discussion

Rewording prose can be part of effective disguise, especially the combination of:

  1. multiple-word substitutions, focusing on novel words, such as “radioactive” in place of “toxic,” as well as proper nouns and names;
  2. altering or removing elements of sentence structure.

No single tactic, however, is sufficient, and successful disguise is at risk when the source is novel and when the scope of the search is narrow. As Reagle (2021b) noted, it is easy to find a shiny needle (i.e., unusual words) in a small amount of hay (i.e., a given subreddit in a given year).
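As a rough illustration of the first tactic, the following sketch flags the "shiny needles" in a phrase: words absent from a list of common words, which are therefore candidates for substitution. The word list here is a tiny hypothetical stand-in for a real word-frequency list, and the input phrase is an illustrative paraphrase rather than a quotation from the study.

```python
import re

# Hypothetical, deliberately tiny stand-in for a real frequency list.
COMMON_WORDS = {
    "your", "mom", "and", "dad", "are", "is", "a", "the",
    "they", "them", "you", "let", "don't", "to", "of",
}


def novel_words(phrase, common=COMMON_WORDS):
    """Return words not in `common`, in order of first appearance."""
    tokens = re.findall(r"[a-z']+", phrase.lower())
    flagged = []
    for token in tokens:
        if token not in common and token not in flagged:
            flagged.append(token)
    return flagged


phrase = "Your mom and dad are toxic; don't let them gaslight you."
print(novel_words(phrase))  # ['toxic', 'gaslight']
```

A flagged word such as "toxic" could then be swapped for a less expected one (e.g., "radioactive"), after which the disguise still needs to be tested against a search engine.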

For researchers who want to disguise their sources, automated spinners are viable starting tools. Despite their limitations, Spin Rewriter did well, as did WordAi aside from the odd capitalization of some words. The spinners would still need some configuration and experimentation, but their use is more about scale and cost than quality. If a monthly or annual fee saves the researcher time and money, spinners are worth considering. (QuillBot (2021) is another fee-based service, and yields spins of similar quality for free on phrases with fewer than 700 characters.)

Ultimately, even good automated or human spins can fail as effective disguise. R15’s spin in Table 4 is an example of this: we located the source despite multiple-word substitutions and rearranged structure. The most important practice is testing spins to see if their queries yield their sources on the first page of search results.
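One way to record the outcome of such a test is sketched below: given the URLs a researcher collected from the first page of results, check whether any of them points at the source. The function names and the example URLs are hypothetical, and the prefix matching is a simplification that assumes Reddit-style permalinks.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase a URL and drop scheme, common subdomains, and trailing slash."""
    parts = urlsplit(url.lower())
    host = parts.netloc.removeprefix("www.").removeprefix("old.")
    return host + parts.path.rstrip("/")


def is_located(source_url, result_urls):
    """True if any result is the source or a page beneath it (e.g., a comment)."""
    source = normalize(source_url)
    return any(normalize(url).startswith(source) for url in result_urls)


# Hypothetical source and first-page results, for illustration only.
source = "https://www.reddit.com/r/AmItheButtface/comments/abc123/example_post/"
first_page = [
    "https://example.com/unrelated",
    "https://old.reddit.com/r/AmItheButtface/comments/abc123/example_post/def456/",
]
print(is_located(source, first_page))
```

A disguise for which this check succeeds should be reworked and tested again.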

Limitations and Applicability

This work is relatively small in scale, and searching for and assessing spun phrases is subjective and idiosyncratic. The current work is across ten phrases, using two automated spinners, and nineteen human subjects. We used Reddit posts that were tagged as fictional, focused on Google searches, limited queries to exact and inexact variations with some experimentation, and scrutinized only the first page of results.

Because the intention is to limit others from locating research sources, the choices and efforts made here likely exceed those of most members of the public. Testing additional phrases, spinners, or search engines would not likely increase the insights we gained. (Bing and DuckDuckGo were used to search for the disguised phrases Google found in Table 4; they found nothing and are likely no match for Google.)

Despite these limitations, we believe our recommendations are suitable for disguising instances on platforms other than Reddit. Of course, site-specific considerations are important. In their review of Reddit research, Reagle (2021b) also searched for phrases using Reddit itself and Pushshift’s RedditSearch. Site-specificity, however, does not negate our general suggestions of substituting novel words and rearranging sentence structure. What it means is that researchers should test their disguises against whatever other indexes and search services are relevant to their sites of study.

An important limitation is that the present study is static, and the field of study is dynamic. Forums, like Reddit, often make changes that affect their features and how legible they are to external services. Google, and other search engines, are continually updating their algorithms, affecting what users can find. And the larger information infrastructure evolves. For example, as an undergrad, one of us frequented the Internet’s Usenet (est. 1980), a massive decentralized discussion forum that predated the World Wide Web (est. 1991) and Reddit (est. 2005). As a student, he thought he was posting to a relatively ephemeral venue, as messages were deleted on most servers after a few months because storage was limited. A Web-based archive of much of Usenet was made available by Deja News in 1995; they were bought and integrated into Google Search in 2001. Old posts had a visibility and lifespan not previously conceived. Perhaps one day RedditSearch/Pushshift will support inexact/elastic searches rivaling Google. This anecdote shouldn’t be taken as an excuse to do nothing. Rather, it means we should be as informed and rigorous as possible and be careful of the assumptions we make.

Future work

Our intention is to make recommendations to practitioners of ethical disguise: we test extant services, recommend specific tactics, and offer a script for testing disguises. Yet, more work is needed on the technical and applied fronts.

First, we make little use of the Word Mover’s Distance (WMD) metrics in Tables 1 and 4. WMD and other measures of difference between phrases should be assessed for their ability to predict the efficacy of a disguise. Again, a disguise’s “lexical dissimilarity” from its source does not guarantee non-locatability, but perhaps there is a threshold below which a disguise is likely to be insufficient.
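To make the idea of a dissimilarity threshold concrete, the sketch below scores three disguises of phrase 7 against the original. It uses Jaccard distance over word sets as a simple stand-in: this is not WMD, which requires word embeddings, and the threshold value is a hypothetical placeholder rather than a finding.

```python
import re


def word_set(phrase):
    """Lowercased set of the words in a phrase."""
    return set(re.findall(r"[a-z']+", phrase.lower()))


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over word sets; 0 means identical vocabularies."""
    set_a, set_b = word_set(a), word_set(b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)


original = ("At this point, I had absolutely no choice but to press the "
            "glowing button on the Xbox and put an end to the madness.")
spins = {
    "R12": ("(…) I had absolutely no choice but to press the glowing "
            "button on the Xbox and put an end to the madness."),
    "R15": "Now I had to push the button on the console and stop the madness.",
    "R08": "You've got to know when to say No.",
}

THRESHOLD = 0.5  # hypothetical cut-off, for illustration only
for subject, spin in spins.items():
    distance = jaccard_distance(original, spin)
    flag = "likely insufficient" if distance < THRESHOLD else "test it anyway"
    print(f"{subject}: {distance:.2f} ({flag})")
```

Under this stand-in measure R15's spin is quite dissimilar from the original, yet Table 4 shows its source was located, which is why any such threshold could only rule disguises out, not certify them.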

Second, techniques beyond those offered by word spinners should be explored, extended, and applied to ethical disguise. Perhaps rival techniques exist at the intersection of semantic modeling, knowledge graphs, natural language understanding, and reinforcement learning. Moreover, by leveraging the metrics described in this research, we envision a self-supervised tool for creating disguised phrases. A successful tool would maximize both the non-locatability of sources and the fidelity to the source quotation (i.e., semantic completeness and fluency).

Using shady services for ethical purposes has an ironic appeal, but there’s room for techniques specific to ethical disguise, openly specified and perhaps provided as a service by research or disciplinary associations.

Conclusion

Researchers who disguise their online sources would benefit from understanding successful disguise tactics and a tool for testing the efficacy of the results.

In addition to avoiding the “small haystack” of using phrases from too few subreddits over too short a time (Reagle, 2021b), automated word spinners could be a part of an ethical toolkit. The best spinners advertise that they test the results of their algorithms to avoid detection, and so researchers might use this shady practice to better their reporting of online sources.

We selected ten phrases from fictional posts on r/AmItheButtface and “spun” them using Spin Rewriter and WordAi. The results were then (1) searched for on Google and (2) judged for fidelity by human subjects (N=19). The spinning services fared relatively well.

The subjects were also asked to spin three of those phrases, which we assessed for efficacy and the tactics employed. Reagle (2021b) found that altering or removing mention of the source forum and date (e.g., r/AmItheButtface in 2020) limits the likelihood of finding it. We recommend that when it comes to rewording phrases, researchers use multiple-word substitutions, especially of novel words (e.g., “radioactive” in place of “toxic”), and alter or remove elements of sentence structure.

The practice of testing spins by the researcher, however, is necessary. We offer a GPLv3-licensed Python script toward this end (Reagle, 2021a).

About the Authors

Joseph Reagle is an Associate Professor of Communication Studies at Northeastern University. Direct comments to: joseph [at] reagle [dot] org

Manas Gaur is a Graduate Researcher in the Artificial Intelligence Institute at the University of South Carolina. Email: mgaur [at] email [dot] sc [dot] edu

Acknowledgements

We thank Nicholas Proferes for comments on an early draft.

References

Andalibi, N., Ozturk, P., & Forte, A. (2017). Sensitive self-disclosures, responses, and social support on Instagram. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. http://dx.doi.org/10.1145/2998181.2998243
Androutsopoulos, I., & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38, 135–187. http://dx.doi.org/10.1613/jair.2985
Ayers, J. W., Caputi, T. L., Nebeker, C., & Dredze, M. (2018). Don’t quote me: Reverse identification of research participants in social media studies. NPJ Digital Medicine, 1(1). https://doi.org/10.1038/s41746-018-0036-2
Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., & Blackburn, J. (2020). The Pushshift Reddit dataset. Proceedings of The International AAAI Conference on Web and Social Media, 14(1), 830–839. https://ojs.aaai.org/index.php/ICWSM/article/view/7347
Bradbury, D. (2021, September 16). De-identify, re-identify: Anonymised data’s dirty little secret. The Register. https://georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/
Bruckman, A. (2002). Studying the amateur artist: a perspective on disguising data collected in human subjects research on the Internet. Ethics and Information Technology, 4(3). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.432.1591&rep=rep1&type=pdf
Bruckman, A., Luther, K., & Fiesler, C. (2015). When should we use real names in published accounts of internet research? In E. Hargittai & C. Sandvig (Eds.), Digital research confidential: The secrets of studying behavior online. MIT Press.
Cardinell, A. (2012, October 10). How absolutely anyone can make $25/hour with spinning. WordAi Blog. https://wordai.com/blog/how-absolutely-anyone-can-make-25hour-with-spinning/
Celikyilmaz, A., Clark, E., & Gao, J. (2021). Evaluation of text generation: A survey. arXiv. https://arxiv.org/abs/2006.14799
Chen, Y., Sherren, K., Smit, M., & Lee, K. Y. (2021). Using social media images as data in social science research. New Media & Society, 146144482110387. http://dx.doi.org/10.1177/14614448211038761
Cortx. (2021, April 27). The smartest article rewriter ever. WordAi. https://wordai.com/
Dym, B., & Fiesler, C. (2020). Ethical and privacy considerations for research using online fandom data. Transformative Works and Cultures, 33. http://dx.doi.org/10.3983/twc.2020.1733
Fiesler, C., & Proferes, N. (2018). “Participant” perceptions of Twitter research ethics. Social Media + Society, 4(1). https://doi.org/10.1177/2056305118763366
Guidelines for data de-identification or anonymization. (2015, July 24). EDUCAUSE. https://www.educause.edu/focus-areas-and-initiatives/policy-and-security/cybersecurity-program/resources/information-security-guide/toolkits/guidelines-for-data-deidentification-or-anonymization
Haimson, O. L., Andalibi, N., & Pater, J. (2016, December 20). Ethical use of visual social media content in research publications. AHRECS. https://ahrecs.com/ethical-use-visual-social-media-content-research-publications/
INFINET. (2021, January 1). The only article spinner that truly understands the meaning of your content. Spin Rewriter. https://www.spinrewriter.com/
Kannangara, D. N. (2017). Quality, ethics and plagiarism issues in documents generated using word spinning software. MIER Journal of Educational Studies, Trends & Practices, 7(1). https://www.researchgate.net/publication/326332201_QUALITY_ETHICS_AND_PLAGIARISM_ISSUES_IN_DOCUMENTS_GENERATED_USING_WORD_SPINNING_SOFTWARE
Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From word embeddings to document distances. Proceedings of the 32nd International Conference on Machine Learning, 37, 957–966.
Liu, C., Dahlmeier, D., & Ng, H. T. (2010). PEM: A paraphrase evaluation metric exploiting parallel texts. Proceedings of Empirical Methods in Natural Language Processing, 923–932. https://aclanthology.org/D10-1090/
Lubarsky, B. (2017). Re-Identification of “anonymized” data. Georgetown Law Technology Review, 202. https://georgetownlawtechreview.org/re-identification-of-anonymized-data/GLTR-04-2017/
Markham, A. (2012). Fabrication as ethical practice: Qualitative inquiry in ambiguous Internet contexts. Information, Communication & Society, 15(3). https://doi.org/10.1080/1369118x.2011.641993
McCarthy, P. M., Guess, R. H., & McNamara, D. S. (2009). The components of paraphrase evaluations. Behavior Research Methods, 41(3). https://doi.org/10.3758/brm.41.3.682
Ohm, P. (2010). Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Review, 58(2). https://www.uclalawreview.org/broken-promises-of-privacy-responding-to-the-surprising-failure-of-anonymization-2/
Pentzold, C. (2017). “What are these researchers doing in my Wikipedia?”: Ethical premises and practical judgment in internet-based ethnography. Ethics and Information Technology, 19(2), 143–155. http://dx.doi.org/10.1007/s10676-017-9423-7
Proferes, N., Jones, N., Gilbert, S., Fiesler, C., & Zimmer, M. (2021). Studying Reddit: A systematic overview of disciplines, approaches, methods, and ethics. Social Media + Society.
QuillBot. (2021, September 28). Paraphrasing tool. QuillBot.com. https://quillbot.com/
r/AmItheButtface. (2020, December 22). Reddit. https://www.reddit.com/r/AmItheButtface/wiki/index
Reagle, J. (2021a, June 8). Tools for scraping and analyzing Reddit. GitHub reagle/reddit. https://github.com/reagle/reddit
Reagle, J. (2021b). Disguising Reddit sources and the efficacy of ethical research (under review). https://reagle.org/joseph/2020/mask/disguise.html
Reddit. (2021, January 27). Reddit by the numbers. RedditInc. https://www.redditinc.com/press
Rodham, K., & Gavin, J. (2006). The ethics of using the internet to collect qualitative research data. Research Ethics, 2(3), 92–97. http://dx.doi.org/10.1177/174701610600200303
Sharf, B. (1999). Beyond netiquette: The ethics of doing naturalistic discourse research on the Internet. In S. Jones (Ed.), Doing internet research: Critical issues and methods for examining the net. Sage.
Singal, J. (2016, March 9). 3 lingering questions from the Alice Goffman controversy. The Cut. https://www.thecut.com/2016/01/3-lingering-questions-about-alice-goffman.html
Sowles, S. J., Krauss, M. J., Gebremedhn, L., & Cavazos-Rehg, P. A. (2017). “I feel like I’ve hit the bottom and have no idea what to do”: Supportive social networking on Reddit for individuals with a desire to quit cannabis use. Substance Abuse, 38(4), 477–482. http://dx.doi.org/10.1080/08897077.2017.1354956
Spin Rewriter API SDK. (2021, April 14). Spin Rewriter. https://www.spinrewriter.com/cp-api-code-samples
Sustar, A. (2016, September 14). Spin Rewriter. Happy 5th Birthday, Spin Rewriter! https://www.spinrewriter.com/blog/happy-5th-birthday-spin-rewriter
Zelenkauskaite, A., & Bucy, E. P. (2016). A scholarly divide: Social media, big data, and unattainable scholarship. First Monday. http://dx.doi.org/10.5210/fm.v21i5.6358
Zelenkauskaite, A., Toivanen, P., Huhtamäki, J., & Valaskivi, K. (2020). Shades of hatred online: 4chan duplicate circulation surge during hybrid media events. First Monday. http://dx.doi.org/10.5210/fm.v26i1.11075

Notes


  1. Proferes et al. (2021), p. 14

  2. Liu et al. (2010), p. 928

  3. Liu et al. (2010), p. 928

  4. McCarthy et al. (2009), p. 683