2022-10-27
Some researchers of user-generated content make an effort to disguise the public comments they quote so as not to bring additional scrutiny to the comments’ authors. Do researchers actually do what they claim? How do they do so? And does it work?
We’ve found that researchers often mistakenly leave quotes in the clear, lack guidance on how to best perform ethical disguise, and consequently fail at the task. We believe that the few cases of failed disguise that come to light (e.g., Barbaro & Zeller, 2006; Morant et al., 2021; Singer, 2015; Zimmer, 2010) are but the tip of the iceberg. We therefore provide recommendations for ethical disguise and provide links to ongoing work to understand and improve this research practice.
Researchers often include phrases from online sources in their reports; looking for the sources of those phrases is like looking for needles in a haystack of similar content. To prevent the needles from being found, researchers should enlarge the haystack, paint the needles tan, keep track of their efforts, and then test the results by looking for the needles themselves.
<https://www.google.com/search?q=site:reddit.com r/{subreddit} {source_phrase}>
specifies the Reddit website and the subreddit; additional time parameters might also be included. Reagle provides a Python script and demo spreadsheet to help test disguises.

The recommendations above are based on work in different stages of development. Summaries are provided below.
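As a minimal sketch of how such a site-restricted query might be constructed and checked by hand (this is not Reagle’s script; the subreddit and phrase below are hypothetical placeholders):

```python
import urllib.parse
import webbrowser


def build_query_url(subreddit: str, phrase: str) -> str:
    """Build a Google search URL restricted to a given subreddit.

    The phrase is quoted so Google matches it exactly; a hit
    means the disguise failed and the source is locatable.
    """
    query = f'site:reddit.com r/{subreddit} "{phrase}"'
    return "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)


# Hypothetical disguised phrase to test against the search engine.
url = build_query_url("AmItheButtface", "swapping infrequently occurring words")
print(url)
# webbrowser.open(url)  # uncomment to inspect the results manually
```

A disguise that returns its source on the first page of results should be reworked and tested again.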
STATUS: Published (Reagle, 2022)
Concerned researchers of online forums might implement what Bruckman (2002) referred to as disguise. Heavy disguise, for example, elides usernames and rewords quoted prose so that sources are difficult to locate via search engines. This can protect users who might be members of vulnerable populations, including minors, from additional harms such as harassment or additional identification. But does disguise work? I analyze 22 Reddit research reports: 3 of light disguise, using verbatim quotes, and 19 of heavier disguise, using reworded phrases. I test if their sources can be located via three different search services (i.e., Reddit, Google, and RedditSearch). I also interview 10 of the reports’ authors about their sourcing practices, influences, and experiences. Disguising sources is effective only if done and tested rigorously; I was able to locate all of the verbatim sources (3/3) and many of the reworded sources (11/19). There is a lack of understanding, among users and researchers, about how online messages can be located, especially after deletion. Researchers should conduct similar site-specific investigations and develop practical guidelines and tools for improving the ethical use of online sources.
STATUS: Published (Reagle & Gaur, 2022)
Ethical researchers who want to quote public user-generated content without further exposing these sources have little guidance as to how to disguise quotes. Reagle (2021) showed that researchers’ attempts to disguise phrases on Reddit are often haphazard and ineffective. Are there tools that can help? Automated word spinners, used to generate reams of ad-laden content, seem suited to the task. We select ten quotations from fictional posts on r/AmItheButtface and “spin” them using Spin Rewriter and WordAi. We review the usability of the services and then (1) search for their spins on Google and (2) ask human subjects (N=19) to judge them for fidelity. Participants also disguise three of those phrases and these are assessed for efficacy and the tactics employed. We recommend that researchers disguise their prose by substituting novel words (i.e., swapping infrequently occurring words, such as “toxic” with “radioactive”) and rearranging elements of sentence structure. The practice of testing spins, however, remains essential even when using good tactics; a Python script is provided to facilitate such testing.
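The word-swapping tactic recommended above can be sketched as a simple substitution pass; the synonym map below is hypothetical, and any output would still need to be search-tested before use:

```python
# Hypothetical map of infrequently occurring replacement words; a real
# disguise would also rearrange sentence structure and be verified
# against search engines, per the recommendations above.
SUBSTITUTIONS = {
    "toxic": "radioactive",
    "angry": "livid",
    "house": "dwelling",
}


def disguise(phrase: str) -> str:
    """Swap mapped words in a phrase, keeping trailing punctuation."""
    words = []
    for word in phrase.split():
        core = word.strip(".,!?").lower()  # look up without punctuation
        if core in SUBSTITUTIONS:
            word = word.lower().replace(core, SUBSTITUTIONS[core])
        words.append(word)
    return " ".join(words)


print(disguise("My roommate is toxic and angry."))
# → "My roommate is radioactive and livid."
```

Even with sensible substitutions, the resulting phrase must be tested for locatability, as the abstract notes.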
STATUS: Ongoing (Gaur, Reagle, et al., n.d.)
Though ethical researchers can use proprietary “word spinning” services to disguise their online sources, we describe a non-proprietary technique. Our approach balances non-locatability (i.e., a phrase’s source is not easily discoverable via search engine) with fidelity and fluency (i.e., preserving the phrase’s semantic completeness, syntactical validity, and naturalness). Following Luo et al. (2019), we (1) build a natural language processing model that can paraphrase texts relative to user-configurable attributes such as formality and tense and (2) share a pretrained classifier that helps with training the model. We hope our technique will be of use to the research community and will be provided as a tool/service by institutional or disciplinary organizations.