Recommendations for the Ethical Disguise of Online Sources

Joseph Reagle, Manas Gaur

2022-10-27

Introduction

Some researchers of user-generated content make an effort to disguise the public comments they quote so as to not bring additional scrutiny to the comments’ authors. Do researchers actually do what they claim? How do they do so? And does it work?

We’ve found that researchers often mistakenly leave quotes in the clear, lack guidance on how to best perform ethical disguise, and consequently fail at the task. We believe that the few cases of failed disguise that come to light (e.g., Barbaro & Zeller, 2006; Morant et al., 2021; Singer, 2015; Zimmer, 2010) are but the tip of the iceberg. We therefore offer recommendations for ethical disguise and link to ongoing work to understand and improve this research practice.

Suggestions

Researchers often include phrases from online sources in their reports; readers who look for the sources of those phrases are, in effect, looking for needles in a haystack of similar content. To prevent the needles from being found, researchers should enlarge the haystack, paint the needle tan, keep track of their efforts, and then test the results by looking for the needles themselves.

  1. Create a “large haystack” of potential sources.
    1. During research, enlarge the scope of research (e.g., sampling across time, forum, or tags rather than selecting content within a more narrow frame).
    2. In research reports, elide, abstract, or fabricate specifics of the data collection (e.g., “data was collected in the summer of 2020 across advice-related subreddits” vs. “data was collected from r/AmItheButtface in 2020 May 5–27”).
    3. Avoid including a verbatim thread title or hashtag, even when disguising the associated messages, because doing so dramatically shrinks the haystack.
  2. Paint the needle tan. When disguising a phrase from an online source, alter it so that search engines are unlikely to return the source.
    1. Use multiple-word substitutions, focusing on novel words, such as “radioactive” in place of “toxic,” as well as proper nouns and names. Replacing ordinary words with synonyms is not enough.
    2. Alter elements and order of sentence structure. Simple elision of portions of a sentence is not enough.
    3. Experiment with automated tools for disguising phrases. For example, Quillbot yields decent spins for free on phrases under 700 characters. We provide a more thorough review of the subscription services Spin Rewriter and WordAi and are working to develop a non-proprietary, openly specified technique.
  3. Manage your process. Many phrases go undisguised in reports because of confusion among authors and changes during collaboration, review, and editing.
    1. Maintain a spreadsheet of phrases, their sources, and disguises used in your reports so your efforts can be coordinated and easily confirmed.
  4. Test your efforts.
    1. Search for your disguised phrases yourself. If you are using Reddit content, for example, search for your phrase with known search engines (e.g., Reddit, Google, and RedditSearch) and the appropriate search facets. For example, the Google query <https://www.google.com/search?q=site:reddit.com r/{subreddit} {source_phrase}> specifies the Reddit website and the subreddit; additional time parameters might also be included. Reagle provides a Python script and demo spreadsheet to help test disguises.
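
As a rough illustration of steps 3 and 4, the sketch below builds the Google query described above and attaches it to a small tracking sheet. It is not the script Reagle provides; the function names, column names, example phrases, and placeholder source URL are all hypothetical, and the sketch only constructs search URLs for a person to open, without contacting any search service.

```python
import csv
import io
from urllib.parse import urlencode


def google_reddit_query(phrase: str, subreddit: str = "") -> str:
    """Return a Google search URL limited to reddit.com (and a subreddit, if given)."""
    terms = ["site:reddit.com"]
    if subreddit:
        terms.append(f"r/{subreddit}")
    terms.append(phrase)
    # urlencode percent-escapes ":" and "/" and turns spaces into "+".
    return "https://www.google.com/search?" + urlencode({"q": " ".join(terms)})


def tracking_sheet(rows: list[dict], subreddit: str) -> str:
    """Render tracked phrases as CSV text, adding a search URL for each disguise."""
    # Hypothetical column names; adapt them to your own spreadsheet.
    fields = ["source_phrase", "source_url", "disguise", "disguise_query"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        query = google_reddit_query(row["disguise"], subreddit)
        writer.writerow({**row, "disguise_query": query})
    return out.getvalue()


if __name__ == "__main__":
    # Invented example row; the source URL is a placeholder, not a real post.
    rows = [{
        "source_phrase": "my roommate is toxic",
        "source_url": "https://example.org/placeholder",
        "disguise": "the person I live with is radioactive",
    }]
    print(tracking_sheet(rows, "AmItheButtface"))
```

Keeping the query URL next to each disguise lets any collaborator repeat the check after a phrase changes during review or editing.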

Research

The recommendations above are based on work in different stages of development. Summaries are provided below.

1. Disguising Reddit sources and the efficacy of ethical research

STATUS: Published (Reagle, 2022)

Concerned researchers of online forums might implement what Bruckman (2002) referred to as disguise. Heavy disguise, for example, elides usernames and rewords quoted prose so that sources are difficult to locate via search engines. This can protect users who might be members of vulnerable populations, including minors, from additional harms such as harassment or additional identification. But does disguise work? I analyze 22 Reddit research reports: 3 of light disguise, using verbatim quotes, and 19 of heavier disguise, using reworded phrases. I test if their sources can be located via three different search services (i.e., Reddit, Google, and RedditSearch). I also interview 10 of the reports’ authors about their sourcing practices, influences, and experiences. Disguising sources is effective only if done and tested rigorously; I was able to locate all of the verbatim sources (3/3) and many of the reworded sources (11/19). There is a lack of understanding, among users and researchers, about how online messages can be located, especially after deletion. Researchers should conduct similar site-specific investigations and develop practical guidelines and tools for improving the ethical use of online sources.

2. Spinning words as disguise: Shady services for ethical research?

STATUS: Published (Reagle & Gaur, 2022)

Ethical researchers who want to quote public user-generated content without further exposing these sources have little guidance as to how to disguise quotes. Reagle (2021) showed that researchers’ attempts to disguise phrases on Reddit are often haphazard and ineffective. Are there tools that can help? Automated word spinners, used to generate reams of ad-laden content, seem suited to the task. We select ten quotations from fictional posts on r/AmItheButtface and “spin” them using Spin Rewriter and WordAi. We review the usability of the services and then (1) search for their spins on Google and (2) ask human subjects (N=19) to judge them for fidelity. Participants also disguise three of those phrases and these are assessed for efficacy and the tactics employed. We recommend that researchers disguise their prose by substituting novel words (i.e., swapping infrequently occurring words, such as “toxic” with “radioactive”) and rearranging elements of sentence structure. The practice of testing spins, however, remains essential even when using good tactics; a Python script is provided to facilitate such testing.
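
Before searching, a quick local check can flag disguises that still share runs of words with their source, which is what tends to happen when only a single synonym is swapped. The sketch below is an assumption of ours and not a method from the paper: the function names, the choice of three-word runs, and the example phrases are hypothetical. An empty result does not show that a disguise is safe, so testing against search engines remains necessary.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase a phrase and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_ngrams(source: str, disguise: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the word n-grams that occur in both the source and the disguise."""
    def ngrams(tokens: list[str]) -> set[tuple[str, ...]]:
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return ngrams(words(source)) & ngrams(words(disguise))


if __name__ == "__main__":
    # Invented phrases for illustration only.
    source = "My roommate is so toxic about the dishes"
    synonym_only = "My roommate is so poisonous about the dishes"
    restructured = "When it comes to washing up, the person I live with is radioactive"

    # A lone synonym swap leaves several three-word runs intact.
    print(sorted(shared_ngrams(source, synonym_only)))
    # Substituting a novel word and reordering the sentence leaves none.
    print(sorted(shared_ngrams(source, restructured)))
```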

3. A technique for the ethical disguise of phrases

STATUS: Ongoing (Gaur, Reagle, et al., n.d.)

Though ethical researchers can use proprietary “word spinning” services to disguise their online sources, we describe a non-proprietary technique. Our approach balances non-locatability (i.e., a phrase’s source is not easily discoverable via search engine) with fidelity and fluency (i.e., preserving the phrase’s semantic completeness, syntactical validity, and naturalness). Following Luo et al. (2019), we (1) build a natural language processing model that can paraphrase texts relative to user-configurable attributes such as formality and tense and (2) share a pretrained classifier that helps with training the model. We hope our technique will be of use to the research community and provided as a tool/service by institutional or disciplinary organizations.

References

Barbaro, M., & Zeller, T., Jr. (2006, August 9). A face is exposed for AOL searcher no. 4417749. The New York Times. https://www.nytimes.com/2006/08/09/technology/09aol.html
Bruckman, A. (2002). Studying the amateur artist: A perspective on disguising data collected in human subjects research on the Internet. Ethics and Information Technology, 4(3). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.432.1591&rep=rep1&type=pdf
Luo, F., Li, P., Zhou, J., Yang, P., Chang, B., Sui, Z., & Sun, X. (2019). A dual reinforcement learning framework for unsupervised text style transfer. arXiv. https://arxiv.org/abs/1905.10060
Morant, N., Chilman, N., Lloyd-Evans, B., Wackett, J., & Johnson, S. (2021). Acceptability of using social media content in mental health research: A reflection. Comment on “Twitter users’ views on mental health crisis resolution team care compared with stakeholder interviews and focus groups: Qualitative analysis.” JMIR Mental Health, 8(8), e32475. http://dx.doi.org/10.2196/32475
Reagle, J. (2021). Disguising Reddit sources and the efficacy of ethical research. In Selected Papers of #AoIR2021 (pp. 6–9). https://doi.org/10.5210/spir.v2021i0.12096
Reagle, J. (2022). Disguising Reddit sources and the efficacy of ethical research. Ethics and Information Technology, 24(3). https://doi.org/10.1007/s10676-022-09663-w
Reagle, J., & Gaur, M. (2022). Spinning words as disguise: Shady services for ethical research? First Monday. https://doi.org/10.5210/fm.v27i1.12350
Singer, N. (2015, February 14). Love in the time of Twitter. The New York Times. https://web.archive.org/web/20190412053116/https://bits.blogs.nytimes.com/2015/02/13/love-in-the-times-of-twitter/
Zimmer, M. (2010). “But the data is already public”: On the ethics of research in Facebook. Ethics and Information Technology, 12(4). https://doi.org/10.1007/s10676-010-9227-5