Even pseudonyms and throwaways delete their Reddit posts

Joseph Reagle

2023-05-05

Abstract

Concerned researchers of user-generated content might want to avoid using, citing, or quoting sensitive content likely to be deleted by its authors, even when pseudonymous or using one-time “throwaway” accounts. At Reddit, how many authors actually delete their submissions, why, and are they concerned if their deleted submissions end up elsewhere? I analyze the three most popular sensitive-topic subreddits (r/Advice, r/AmItheAsshole, and r/relationship_advice) and show that deleting submissions is common. Roughly half of submissions are deleted by their users, most within the first day and week. Interviews with 30 Redditors reveal that their motives for deletion include ensuring the “internet doesn’t see them,” especially those who might “see it on my Reddit profile,” deciding their issue was resolved, receiving unhelpful or aggressive comments, and concluding their submission was no longer relevant. Most interviewees were not overly concerned about deleted submissions persisting elsewhere (e.g., social media, archives, and datasets) as long as they are not easily connected to their other activity or identity.

STATUS: Published: Reagle, J. (2023). Even pseudonyms and throwaways delete their Reddit posts. First Monday. https://doi.org/10.5210/fm.v28i6.13193

Introduction

Reddit, the “front page of the web,” claims “52M+ daily active users” and “100K+ communities” (Reddit, 2021). Those users, including minors and members of other vulnerable populations, discuss extraordinarily specific and sometimes sensitive topics, including relationships, sexuality, and health.

Given the public prominence, breadth, and depth of Redditors’ discussions, researchers use the website as a data source. For example, to advance the prediction of suicide risk, researchers used a dataset of 8 million posts made between 2005 and 2016 by 270,000 users across 15 mental health-related subreddits (Gaur et al., 2019). That is, they used “Reddit as an unobtrusive data source for gleaning information about suicidal tendencies and other related mental health conditions afflicting depressed users.” This included subreddits such as r/addiction, r/autism, r/bipolar, r/cripplingalcoholism, r/depression, r/opiates, r/selfharm, and r/SuicideWatch.

After data collection and analysis, some researchers include quotes from their Reddit sources in their published reports. A survey of 727 Reddit research reports found that 9.4% “explicitly mentioned identifiable Reddit usernames” and 28.5% “used direct quotes from users as part of their publications,” with 2.5% attempting to disguise their quotes (Proferes et al., 2021).1 Like finding Twitter usernames (Ayers et al., 2018), locating the sources of verbatim quotes on Reddit is not difficult; even among reports claiming to disguise Reddit quotes, Reagle unmasked sources from 58% of the reports (11 out of 19) (Reagle, 2021a, 2021b).

Given the sensitivity of topics and the potential for locating posts’ sources, concerned researchers might want to avoid using, citing, or quoting sources likely to be deleted by their authors (even when pseudonymous or using one-time “throwaway” accounts). Yet, we have little understanding of how often users delete their posts, why they do so, what concerns them, and when such deletions are most likely to occur.

Related Work

Reddit, Sensitive Topics, and Throwaways

In June 2005, Reddit was launched as a pseudonymous-friendly site where users could exchange and vote on links they had seen (i.e., “I read it”); the upvotes on a submission accrue to its author as “karma.” Reddit’s development as a forum of forums, where users could trivially create subreddits with their own moderators, led the website to succeed over its link-sharing peers. Redditors post submissions (or submit posts, synonymously) and others comment; I use the term message for all such content, though the analysis is focused on deleted submissions.

Like Twitter and Wikipedia, Reddit serves an extraordinary corpus of mostly public data. That is, while some subreddits are private (invite only) or quarantined (hidden by administrators from the front page and search engines), the vast majority of content is public: transparently accessible to any web browser or search engine. More so than Wikipedia and much of Twitter, Reddit hosts discussions of a personal character. Researchers have used Reddit to address, for example, gender (Darwin, 2017), sexuality (Robards, 2017), “involuntary celibacy” (Adamczyk, 2016), mental health (Choudhury and De, 2014), eating disorders (Sowles et al., 2018), interpersonal abuse (Schrading et al., 2015), online harassment (Massanari, 2017), and drug use and cessation (Sowles et al., 2017). Reddit is a compelling and accessible venue for research, especially on sensitive topics.

The most popular subreddits dealing with sensitive topics are advice forums, three of which manage to rank in the “Subreddit Stats” leader-board. Out of all subreddits, r/Advice (0.7mil subscribers) is 15th in posts-per-day. r/AmItheAsshole (4.1mil subscribers) has fewer posts but is 2nd in comments-per-day. And r/relationship_advice (6.9mil subscribers) bests r/Advice as 9th in posts-per-day (Subreddit leader-board, 2022). Reddit’s “2022 recap” noted that r/AmItheAsshole had “became the #1 most-viewed community on the platform,” a factoid that was heralded in the media, including at Engadget (Dent, 2022).

Researchers’ presumptions, however, can be at odds with users’ expectations, as a few Twitter studies show. In a survey of 368 Twitter users, “few users were previously aware that their public tweets could be used by researchers, and the majority felt that researchers should not be able to use tweets without consent. However, we find that these attitudes are highly contextual, depending on factors such as how the research is conducted or disseminated, who is conducting it, and what the study is about” (Fiesler and Proferes, 2018). A study of members of fandom communities (via solicitations on Twitter and Tumblr) found similar concerns among some users (Dym and Fiesler, 2020).

There is no similar study of Redditors’ privacy expectations, but Redditors appear more privacy-savvy. Pseudonyms are ubiquitous, and “main” accounts are often complemented by “alt” and “throwaway” accounts for alternative and short-term use, respectively. A study of a large collection of mental health-related posts identified 4.5% of the names as throwaways (Choudhury and De, 2014).2 (The Reddit convention is to include “throw” in such usernames.) The number of actual throwaway accounts is likely greater. A smaller study found that only 40.8% of posts in which authors identified their accounts as a “throwaway” in the prose had the actual term in the username (Leavitt, 2015, p. 321). The use of throwaways appears consequential; they tend to increase disclosure and support seeking on subreddits related to sexual abuse (Andalibi et al., 2016) and to receive longer and higher-rated responses on parenting subreddits (Ammari et al., 2019).

Even if the average Redditor is more privacy savvy than the average Twitter user, we don’t know if Redditors understand that their messages (submissions and comments) can survive deletion.

Deletion

Reddit, like the Internet, suffers from both bit rot and bit flit. A Reddit researcher hoping for a complete dataset might not appreciate that it is in fact incomplete (Gaffney and Matias, 2018). And a Reddit user who deleted their message might not appreciate that it has taken flight across the web (Ayers et al., 2018; Fiesler, 2019) — as might moderated messages or those in banned or quarantined subreddits (Stuck_In_the_Matrix, 2015, 2019).

Consider a submission on the famous advice subreddit r/AmItheAsshole. The rules state that posts are “for more than just submitters. People often find these discussions very engaging, and deleting the thread ends the discussion. We ask you only post here if you’re willing to allow people at least 48 hours of discussion” (r/AmItheAsshole FAQ, 2021). Enough people broke this rule that the moderators use an “automod” bot that immediately copies the post as a comment: “This way, if people decide to come back to a deleted discussion that they were part of, they will still be able to see the full context” (r/AmItheAsshole FAQ, 2021). Deleting the content of a post also nulls the author field. Consequently, deleting the content of the post disassociates the author’s username from the post yet the content remains available in the comment. r/relationship_advice’s automod does something similar: if a post becomes too popular it copies and then removes the text of the submission automatically (De4thbyTw1zzler, 2020). This helps protect authors from overexposure and limits submissions’ virality — and unfortunately makes it difficult for an author to update a submission. Confusingly, this is referred to as a “karma limit” (i.e., a submission got too many upvotes) whereas on other subreddits this term designates the minimum karma an author must have to post.

Outside Reddit, external services collect messages; Pushshift, most notably, is “updated in real-time, and includes historical data back to Reddit’s inception.” The service is for “social media researchers to reduce time spent in the data collection, cleaning, and storage phases of their projects,” though anyone can access it (Baumgartner et al., 2020). Pushshift permits users to request removal from its service, but the backlog of requests was, at one point, over a year (Top-Building2429, 2021), so the data might remain contaminated for some time. And as Proferes et al. noted, “users may be entirely unaware that their data are still circulating in third-party datasets” (Proferes et al., 2021).3

Additionally, other websites use Pushshift to restore missing messages. Such messages include those removed by subreddit moderators or admins and those deleted by their authors. Reveddit, for example, “reveals content removed from Reddit by moderators. It does not show user-deleted content” (Hawkins, 2022). That is, it helps users see if their content is being “shadow banned” (removed unbeknownst to its author). Undelete services include the defunct Ceddit and Removeddit and the live Unddit and ReSavr websites. The last declares:

Civilization is well into the information age and everything that is posted on the internet must remain. After typing out long reddit comments, some users wish to delete them for whatever reason. They do this in spite of the fact that their comment might have been useful to someone else. Well, now everything (over ~650 chars) will stay. (Deleted Reddit comments, 2021)

To see any deleted content in a thread at Unddit, for example, simply replace “re” in the domain name with “un,” as in this post: https://UNddit.com/r/teachingresources/comments/r78sae/.
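The domain swap described above can be sketched in a few lines of Python. This is only an illustration of the substitution: the helper name `to_unddit` is hypothetical, and whether Unddit serves a given thread (or a given subdomain such as `www`) is not guaranteed.

```python
from urllib.parse import urlsplit, urlunsplit


def to_unddit(url: str) -> str:
    """Rewrite a reddit.com thread URL so that it points at unddit.com."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Only rewrite hosts ending in "reddit.com" (e.g., reddit.com, www.reddit.com).
    if not host.endswith("reddit.com"):
        raise ValueError(f"not a reddit.com URL: {url}")
    # Swap the leading "re" of the domain for "un", keeping any subdomain.
    new_host = host[: -len("reddit.com")] + "unddit.com"
    return urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))


print(to_unddit("https://reddit.com/r/teachingresources/comments/r78sae/"))
```

The path, query, and fragment are passed through untouched, so the thread ID in the URL is preserved.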

Finally, a user’s post might end up in a replication of Pushshift. For a brief period an alternative service, Archivesort, used old Pushshift data dumps — with data subsequently removed by Pushshift still present (Top-Building2429, 2021). Pushshift data has also been packaged in common “big data” frameworks, permitting even more powerful queries and analysis. Both BigQuery (Balamuta, 2018) and ConvoKit’s (2018) Reddit datasets have been used by researchers.

Concerned researchers have noted this problem, though the few suggested solutions are just that: tentative and without much agreement or following. The British Psychological Society’s “Ethics Guidelines for Internet-Mediated Research” suggests that as long as “public” data is “handled in accordance with other ethical principles, it is likely that withdrawal procedures [of data from a dataset subsequently deleted] will not be needed.” Such assessments ought to be informed by “the principle of proportionality”: “considerations of the level of risk/harm must be weighed up against scientific value, the quality and authenticity of reports of research findings, and possible practical issues too” (Proferes et al., 2021).4

Fiesler suggested that researchers avoid replicating and redistributing data, but instead share only the message IDs or links, which other researchers can “re-hydrate” if still available (Fiesler, 2019). However, “this introduces a separate problem of having incomplete archives, and thus bringing the reproducibility of that work into question. We do not have a solution to this challenge, but instead note that researchers using Reddit data should carefully consider how and why they are sharing their data” (Proferes et al., 2021). The platforms themselves might update their Terms of Service to prohibit the retention of deleted messages, as Twitter did (Fiesler and Proferes, 2018).5 However, many terms of service are dubious, and platforms have not made this suggestion practicable by allowing anyone to easily query the deletion status of messages at scale. And even if they did so, this might better enable those who wish to archive deleted messages.

Ultimately, researchers need to make site-specific inquiries and adapt their data usage to the affordances, norms, and expectations of their site and its users (Pentzold, 2017).

Method

To assess and understand Redditors’ deletion of their posts, I used a mixed-method approach of collecting and analyzing Reddit/Pushshift data and interviewing Redditors about their motives. I focused on three of the most popular subreddits dealing with sensitive topics, namely requests for personal advice at r/Advice, r/AmItheAsshole, and r/relationship_advice. (I also quantify the deletion rates of two other groups of subreddits, sensitive and non-sensitive, for comparison.)

The JupyterLab notebook showing the programmatic collection and analysis of data is publicly available.

Gathering Data

Gathering data via the Reddit API is constrained in its power and performance. For example, Reddit does not support queries across date ranges. Pushshift’s API supports more powerful queries, though it too has constraints in terms of timeliness and completeness. Both APIs are restricted by rate limits, and elements of their documentation are under-specified and out-of-date.

For static data pulls over larger spans of time, I developed the Python script reddit-query.py (Reagle, 2021c) that queries Pushshift for a number of submissions, up to limit, in a subreddit. If the set of applicable submissions exceeds limit, the user of the script can specify contiguous messages at the start of the set or a sampling throughout. The script uses the resulting message IDs from Pushshift to query Reddit itself using the PRAW library.
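The choice between contiguous messages and a sampling throughout can be illustrated with a small, self-contained sketch. It is not the code of reddit-query.py: the function name `select_ids` is hypothetical, and evenly spaced sampling is an assumption (the script might, for example, sample randomly instead).

```python
def select_ids(ids: list[str], limit: int, sample: bool = False) -> list[str]:
    """Return at most `limit` IDs, either from the start or spread evenly.

    ASSUMPTION: "sampling throughout" is modeled as evenly spaced picks;
    the real script may use a different strategy.
    """
    if limit <= 0:
        return []
    if len(ids) <= limit:
        return list(ids)
    if not sample:
        # Contiguous messages at the start of the set.
        return ids[:limit]
    # Evenly spaced positions across the whole set.
    step = len(ids) / limit
    return [ids[int(i * step)] for i in range(limit)]


ids = [f"id{i}" for i in range(10)]  # stand-ins for Pushshift message IDs
print(select_ids(ids, 5))               # first five IDs
print(select_ids(ids, 5, sample=True))  # every second ID
```

The selected IDs would then be looked up on Reddit itself to obtain each submission's current status.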

Understanding the status of a submission is frustratingly difficult because Reddit uses a number of attributes that are not documented and it changes its practice without notice. I labeled the messages according to the following heuristics:

During 2020–2022 I collected datasets about deletions across three advice subreddits: r/Advice, r/AmItheAsshole, and r/relationship_advice. These datasets can reveal who has deleted their submission between Pushshift’s ingest (often within 24 hours) and when the script was executed.
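As a rough illustration of this kind of status check, and not the heuristics used by reddit-query.py, the sketch below labels a submission from its current author and body text. It assumes the placeholder strings "[deleted]" and "[removed]" commonly seen on Reddit and an author field that is empty once the username is struck; the function name `label_submission` is hypothetical.

```python
def label_submission(author, selftext):
    """Return coarse status flags for one submission.

    ASSUMPTION: "[deleted]" and "[removed]" are the placeholders shown in
    place of a username or body text. This is an illustrative rule set,
    not the labeling heuristics of reddit-query.py.
    """
    author_gone = author is None or author == "[deleted]"
    return {
        "removed": selftext == "[removed]",       # body replaced after a removal
        "author_deleted": author_gone,            # username no longer attached
        "text_deleted": selftext == "[deleted]",  # body replaced after a deletion
    }


print(label_submission(None, "[deleted]"))
print(label_submission("some_user", "[removed]"))
```

Keeping the three flags separate allows a submission to be counted, for instance, as having a struck username without a deleted body.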

To characterize when deletions occur, I developed reddit-watch.py to gather recent message IDs from Pushshift and then query their deletion and moderation status on Reddit over time. To track changes in activity, I added a new heuristic to those above:

On 2022 July 12, I initialized a watch of submissions from the last 24 hours from r/Advice (N=1160), r/AmItheAsshole (N=1689), and r/relationship_advice (N=2541). It ran at least twice a day for 100 days. By day 40, there were only a few user deletions or moderator removals per day (see Figures 1–3). The first day with no activity in each subreddit occurred on the 29th, 35th, and 45th days, respectively, though increasingly thin tails of activity persisted. Ninety days after users’ postings, Reddit itself, rather than subreddit moderators, begins removing some submissions, apparently those of now-deleted user accounts. This previously unreported behavior is neither documented nor explained by Reddit. Even so, the rate of deleted submissions remains low and continues to dissipate.

Data collection and detailed analysis are documented in an export of a JupyterLab notebook (Reagle, 2022).

Interviews

Data from reddit-query.py and reddit-watch.py provide Reddit message IDs, deletion status, and usernames. reddit-message.py was then used to contact users via Reddit’s messaging system.

I messaged recent authors in small batches. I did not want to message more users than necessary — purposefully or mistakenly. Experimentation revealed that Reddit did not permit more than one message every 40 seconds. Because throwaway accounts are often untended or deleted once used, I also focused a round of solicitations to recent posters with “throwaway” in their usernames. For example, one set of 1,405 submissions had 169 “throwaways” that had deleted their submissions, resulting in 3 consents. Finally, on August 05 (roughly three weeks out), I messaged 60 late-deleting Redditors, 5 of whom responded to my questions.
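The pacing constraint can be sketched as a small throttle. This is not the code of reddit-message.py: `send_in_batches` and its `send` callback are hypothetical names, the 40-second interval comes from the experimentation described above, and the clock and sleep functions are injectable so the logic can be dry-run without waiting.

```python
import time


def send_in_batches(usernames, send, min_interval=40.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call send(name) once per unique username, at least min_interval seconds apart.

    ASSUMPTION: `send` is a caller-supplied function that delivers one message;
    no real messaging API is used in this sketch.
    """
    contacted = []
    seen = set()
    last_sent = None
    for name in usernames:
        if name in seen:
            continue  # never solicit the same account twice
        seen.add(name)
        if last_sent is not None:
            wait = min_interval - (clock() - last_sent)
            if wait > 0:
                sleep(wait)
        send(name)
        last_sent = clock()
        contacted.append(name)
    return contacted


# Dry run: record the recipients and skip the real waiting.
outbox = []
print(send_in_batches(["alice", "bob", "alice"], outbox.append, sleep=lambda s: None))
```

Deduplicating inside the loop guards against mistakenly messaging a user who appears in more than one batch.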

About 7% of Redditors responded to my queries; less than 2% of throwaway accounts did so. I successfully interviewed 30 Redditors (6 of whom were throwaways) via Reddit’s messaging system. In the interview results below, I arbitrarily assign my own pseudonyms to interviewees, which are combinations of adjectives and occupations, with the “ta_” prefix signifying a throwaway (e.g., “ta_generous_grocer”).

I asked:

  1. How long after making the post did you delete it? Is this typical?
  2. What motivated you to delete your post (even though pseudonymous)?
  3. Sometimes Reddit posts are archived elsewhere, including at https://www.removeddit.com/ or in a post’s comments by a bot like AutoModerator. Did you know this? Does it concern you?

I ceased soliciting interviews when the answers to my questions became redundant with those already offered (Charmaz, 2006).6 Questions about time and typicality yielded straightforward answers. Motivation was coded into four categories during multiple iterations of the coding and interviews. Awareness was a simple binary; I characterized concern as “None” (if stated as such, e.g., “No, I’m not concerned”), mild (i.e., acknowledged a concern but thought to be unlikely or inconsequential), moderate (i.e., simple affirmation), or high (e.g., vigorous affirmation entailing condemnation of online services or a change in behavior).

Ethics

I initially collected public data from Pushshift and Reddit without institutional review. Before contacting and interviewing Redditors, I submitted an application to Northeastern’s Institutional Review Board (IRB). Application #20-11-08 was “approved” as DHHS Review Category #2: “EXEMPT, CATEGORY #2 Revised Common Rule 45CFR46.104(d)2(II).”

I initiated a trial set of solicitations in the summer of 2021. The first three respondents were someone asking if it’s that easy to see deleted submissions, a 17-year-old declining because of the age requirement in the consent form, and someone wondering if I was “stalking” people’s message history from their profile. (We will see this is a major concern among interviewees.) Those who declined are not included in my data or reporting, aside from the brief characterization of these first three respondents.

I consequently amended the solicitation and IRB application. The new version clarified my research intentions, that I “have not and will not” read the content of submissions, and included an example of a service that shows deleted messages.

Hi! Your username is associated with a deleted post on an advice subreddit — which I have not and will not read.

Many Redditors don’t realize their deleted messages can show up elsewhere (e.g., https://removeddit.com). And researchers who study Reddit can inadvertently collect messages that Redditors subsequently delete. I’d like to understand why people delete their posts and how long they wait before doing so. I can then warn researchers about collecting messages too quickly. …

The revision was accepted by the IRB in July 2021.

Data collection and interviews were conducted with a light touch because of Redditors’ concerns about their user profile and messaging history. I focused queries on message status, without reviewing posts’ content or authors’ profiles. While interviewing I minimized solicitations and follow-ups. I solicited Redditors in small batches, no candidate was solicited more than once, and if an interviewee chose not to answer a question, I did not re-ask it; if anything, I only asked for clarity on what they already shared.

Results

Analytic Findings

How common is deletion?

On 2022 August 18, I collected the first 1000 submissions in a set of subreddits starting on March 01 of that year. The “tech” and “sensitive” bundles of subreddits were collected for purposes of comparison—to see if the advice subreddits are unusual, which they are. For the advice subreddits, I also collected data for 2018 and 2020—to see if there might be changes in moderation and deletion activity, which there are.

Table 1 shows that removal and deletion are common, especially on the advice subreddits. The popular advice subreddits have significantly more deletions (48%) than other sensitive subreddits (32.4%), which have significantly more deletions than tech-related subreddits (20.2%). Moderation has increased over the years, with r/AmItheAsshole going from 14% to 47% to 78%!

Table 1: Percent of submissions deleted and [removed].

subreddit              2018-Mar+        2020-Mar+        2022-Mar+
tech subreddits        -                -                20.0% [38.1%]
sensitive subreddits   -                -                32.4% [16.2%]
Advice                 51.6% [09.7%]    53.0% [12.3%]    47.4% [42.8%]
AmItheAsshole          45.8% [13.9%]    48.9% [47.1%]    43.1% [78.4%]
relationship_advice    55.9% [09.5%]    58.9% [09.8%]    53.7% [48.0%]

The popular technology-related subreddits consisted of: Android, apple, audiophile, buildapc, DataHoarder, electronics, gadgets, hardware, ipad, linux, mac, sysadmin, techsupport, web, windows. The sensitive subreddits were those studied by Gaur et al. (Gaur et al., 2019): Anxiety, BPD, BipolarReddit, BipolarSOs, StopSelfHarm, SuicideWatch, addiction, aspergers, autism, bipolar, depression, opiates, schizophrenia, selfharm; cripplingalcoholism is not included because it was made private earlier in 2022.
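As a quick arithmetic check, the 48% deletion figure for the advice subreddits appears consistent with the unweighted mean of their three 2022 values in Table 1. That it was computed this way is an assumption; the sketch simply reproduces the arithmetic.

```python
# 2022-Mar+ deletion percentages for the advice subreddits, from Table 1.
deleted_2022 = {
    "Advice": 47.4,
    "AmItheAsshole": 43.1,
    "relationship_advice": 53.7,
}

# ASSUMPTION: an unweighted mean across the three subreddits.
mean_deleted = sum(deleted_2022.values()) / len(deleted_2022)
print(f"mean deletion rate: {mean_deleted:.1f}%")
```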

When does deletion happen?

The following bar graphs show the magnitude of actions (moderator removed, author deleted, and text deleted by author) on a log-10 scale (left axis). The complementary dotted line plots the cumulative percent of deletions (right axis). The number of deletions is high, in keeping with Table 1, and quickly dissipates.
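The cumulative line can be computed from the day on which each deletion was first observed. The sketch below is a minimal illustration with made-up inputs; `cumulative_percent` is a hypothetical helper, not part of reddit-watch.py.

```python
from collections import Counter


def cumulative_percent(deletion_days, n_days):
    """Return, for each day 0..n_days-1, the percent of all deletions seen so far.

    `deletion_days` holds one entry per deleted submission: the number of
    days after posting on which the deletion was first observed.
    """
    counts = Counter(deletion_days)
    total = len(deletion_days)
    running = 0
    percents = []
    for day in range(n_days):
        running += counts.get(day, 0)
        percents.append(100.0 * running / total if total else 0.0)
    return percents


# Hypothetical data: two deletions on day 0, one on day 1, one on day 3.
print(cumulative_percent([0, 0, 1, 3], 4))
```

A curve that climbs steeply in the first few entries and then flattens corresponds to the pattern of early deletions followed by a long tail.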

Figure 1: r/Advice actions

In all figures, almost all actions happen within the first day and week, followed by a long tail of diminishing activity.

Figure 2: r/AmItheAsshole actions

Recall that author deletion does not necessarily mean the account was deleted, only that the author’s username is struck along with deleted text within a submission. Therefore, the degree to which author deletions exceed text deletions shows that genuine account deletions do happen, even if not checked for explicitly. Additionally, 90 days out, Reddit apparently removes submissions of those who deleted their account but not their submissions. A possible indicator of aggressive moderation on r/AmItheAsshole is that removals exceed deletions.

Figure 3: r/relationship_advice

Finally, moderation persists on r/relationship_advice longer than the other subreddits, probably because this subreddit automatically removes submissions that become too popular (e.g., “karma limit”), which can happen weeks after posting.

Interviews

Redditors who maintain multiple accounts often fail to check all of them frequently. On 2021 August 04 I collected a test sample of submissions from just two days prior when I was more likely to catch users’ attention. As described, this entails querying Pushshift for sets of message IDs and then querying those IDs on Reddit for their current status. Table 2 shows that quick deletion was common (11–31%) and use of obvious throwaways was significant but less common (3–12%). Deletion is not as high as in Table 1 because the authors haven’t had much time to delete their posts. The “throwaways who delete” shows that roughly a quarter of throwaways still deleted their posts. Perhaps they fear their post could still be identifiable or they use the throwaway for other activity — contra the name.

Table 2: Unique users posting two days prior (on 2021-Aug-04).

subreddit             users   deleted      throwaways   throwaways who delete
Advice                492     93 (19%)     13 (03%)     15 (26% of throwaways)
AmItheAsshole         480     51 (11%)     38 (08%)     07 (18% of throwaways)
relationship_advice   484     149 (31%)    57 (12%)     15 (26% of throwaways)
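The percentages in the "deleted" and "throwaways" columns of Table 2 follow from the raw counts, as this short check illustrates. The variable names are illustrative only, and the sketch covers just the shares that are expressed relative to the number of users.

```python
# (users, deleted, throwaways) per subreddit, copied from Table 2.
table2 = {
    "Advice": (492, 93, 13),
    "AmItheAsshole": (480, 51, 38),
    "relationship_advice": (484, 149, 57),
}

shares = {}
for subreddit, (users, deleted, throwaways) in table2.items():
    # Express each count as a whole-number percent of unique users.
    shares[subreddit] = (
        round(100 * deleted / users),
        round(100 * throwaways / users),
    )
    print(subreddit, shares[subreddit])
```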

As described in “Method,” I used this and similar queries, including one focused on throwaway users and one on those posting weeks later, to solicit interviewees. From the 30 resulting interviews, a handful of themes emerged related to the questions on the motivation, timing, and concerns about posts surviving deletion. Not all interviewees spoke substantively to each of my questions; consequently, the sum of those who explicitly spoke of their awareness need not equal the sum of those who expressed concern.

Table 3: Number of interviewees expressing a sentiment.

#    Motivation                          #    Awareness   #    Concern
10   fear of exposure; profile privacy   06   yes         16   no
09   concern answered/resolved           11   no          04   mild
04   passing emotion or changed mind     -    -           03   moderate
04   felt attacked or misunderstood      -    -           00   high

Motivations

A top reason for deleting submissions was when an interviewee “already received an answer to my question [and] didn’t need 100 more” (large_lyricist, 2021). There’s no need to waste other Redditors’ time or have one’s notifications overrun by an active post. No interviewee spoke of wanting to preserve a post for the benefit of other Redditors. Instead, posting was viewed as a short-term utilitarian act. Conversely, one interviewee spoke of deleting a post if it received little attention because “after a few hours of not much post interactions, the likelihood of the post receiving more attention is very slim on advice subreddits” (safe_swimmer, 2021).

The other major motivation was fear of exposure, especially about sensitive topics:

I delete after I’ve gotten enough responses or if I find the comments not helpful. I also delete because I use Reddit to vent and speak out of frustration and then worry someone may come across my post and know it’s me. Like they may put two and two together. I just don’t want that information shared to the world. Sometimes I’ll come across a post and know exactly who posted it despite the random username. (icy_inventor, 2021)

Having readers connect a sensitive topic with interviewees’ other posts was a significant concern. Any reader can click on the author of a message and peruse the rest of the author’s messages, prompting interviewees to delete submissions so as “to remove evidence of it on my account” (faint_foreman, 2021). Similarly, “I don’t want people to see it on my Reddit profile” (ta_early_exporter, 2021). And, “I didn’t want it just sitting there, as it felt personal… I mean I just don’t want it listed on my page anymore” (ta_rare_ranger, 2021).

This concern about exposure was expressed even by two of the accounts with “throwaway” in their usernames; one noted that deletion was a typical practice for them, one they brought from their main and alternative accounts:

I still prefer to remove direct access to posts I make from my reddit profile. (I have multiple alts and occasionally delete on my mains) I don’t usually use throwaways. This is my first one, and it’s cos I’m going through a break up. (ta_rare_ranger, 2021)

As in other studies (Bruckman et al., 2015; Reagle, 2021b), wider exposure can be a source of both concern and satisfaction. u/narrow_nurse explained that their deletion on an advice subreddit was informed by a post of theirs on r/PurplePillDebate, which is “a neutral community to discuss sex and gender issues, specifically those pertaining to /r/TheBluePill and /r/TheRedPill” (r/PurplePillDebate, 2022). (r/TheBluePill is a satire of r/TheRedPill, which is an infamous and quarantined subreddit for male dating strategies associated with pickup artists.)

I have also had major newspapers and journalists pick up on my reddit posts [on r/PurplePillDebate], that was emotionally terrifying, but also gratifying…. I couldn’t believe it. It was in multiple newspaper advice columns all at the same time. Facebook groups, social media…. unreal. (narrow_nurse, 2021)

Consequently, u/narrow_nurse and others spoke of watching the popularity of their posts with care.

I will delete it if it gets too popular (over 300 upvotes or any awards) because I think it’s more likely to get spread around the internet and the people I have written about in the post will recognize the situation, look up my username, and discover my personal details about my life and how I really feel about the people in it which I do NOT want them to know. (narrow_nurse, 2021)

Interviewees spoke of advice submissions as momentary “venting,” “trauma dumping,” or “coping” that need not persist, especially if they changed their mind: “Sometimes I have impulse posts and within an hour or so I just don’t feel like having it up anymore” (plain_producer, 2021). Deleting was even seen as a healthy alternative to rumination: “I was over the situation I had posted about—I tend to post on advice subreddits as a means to cope in the heat of a tricky situation, so after a day I find it more healthy to delete the post and try and forget about it” (safe_swimmer, 2021).

The final major category of motivation was feeling attacked, misunderstood, or creeped on. One interviewee spoke of criticism characterizing her as a “lazy” and “mediocre housewife”; another wrote that, “as a woman, on occasion, my posts will receive negative and or sexual attention that I simply don’t feel like dealing with” (plain_producer, 2021).

Another interviewee noted that those quick to respond (and receive early votes) might “do so to accumulate karma points and as such don’t bother to properly read the thread or give the proper advice asked in such a situation.” This interviewee also worried that their own karma was negatively affected: “I guess the most shallow or vain of my motives, was because I was loosing too much karma points, of which I had very few since I don’t interact much on reddit” (calm_courier, 2021).

Time before deletion

Because 25 of the interviewees recently deleted their posts, it’s not surprising that many reported short periods between posting and deletion. Most reported deleting their posts within the day, and some within hours: “I deleted my post after about 5 hours. This is fairly typical for me for advice posts only” (fair_flutist, 2021).

Though a few of the interviewees said deleting was not typical behavior, half affirmed that it was.

Five of the 25 interviewed in July plus all 5 of those interviewed in August spoke of longer-term deletions. Two interviewees spoke of some event, such as negative responses, prompting a house cleaning: “every few years or so I’ll be reminded by some event (like this [interview]) that I need to go through and clean house. When I get this urge I’ll usually delete all (or most if I get distracted half way through) posts up until a few months prior” (young_yodeler, 2021). One Redditor shared that “I use this account because I get super bored at work and make fake r/relationships and r/legaladvice posts. So I delete them because it would otherwise be very obvious its creative writing” (bold_bowler, 2021). That is, someone looking at their messaging history would find a history of extraordinary, different, and conflicting biographical details. They delete their past fictions when inclined to write a new one.

Awareness and concern

Of the interviewees who spoke about their awareness, most reported they were unaware of specific archives and services. Even so, only a few interviewees spoke of this as being a moderate concern. The majority recognized that “once on the internet, always on the internet” (quick_quilter, 2021). Some interviewees spoke of familiarity with submissions being screenshotted and circulated on other platforms: “I didn’t know of a specific website that does this but I’m not surprised and I know there are YouTubers that read peoples’ posts.” For example, one such YouTuber has 95K subscribers and 6M views; YouTube’s #aita hashtag has over 22K r/AmItheAsshole narrations or reactions across a thousand channels (Mark2022amn?; #aita, 2022). On Twitter, two popular accounts each have around half a million followers (AITA_online, 2020; redditships, 2020).

The majority of authors’ concerns were about others inferring the authors’ identities.

I had no idea posts are archived at times. If I were using an account associated directly with my identity, I might be concerned. But I only post my vent posts on “throwaway” accounts so I don’t really care as I feel like there’s no way to tie it back to me. (ta_joyful_judge, 2021)

Similarly, a user of r/AmItheAsshole noted “Yes I am completely aware auto mod did repost my original post in a comment but the difference is you can’t click on auto mod to go directly to my profile” (tall_typist, 2021). As another interviewee noted, the existence of undeletion services “makes me a bit uncomfortable but lesser if my user is concealed” (quick_quilter, 2021). Another interviewee spoke to the balance of this ambivalence:

In my mind, it would take a serious stalker mentality to jump through all these hoops to try to read my deleted posts if you recognized me through context. I don’t think anyone in my personal life wants to know my secrets THAT bad. So I guess I’m not that concerned. I’m a little more concerned about [AmItheAsshole] auto-moderators, since it takes less work by a stalker to find those in comments. I don’t like them. I think people should have the right to be forgotten and auto-mods make it too easy to connect details of a person’s life. (young_yodeler, 2021)

Discussion

Popular submissions from r/Advice, r/AmItheAsshole, and r/relationship_advice are copied, narrated, and discussed throughout the Web, including on YouTube and Twitter. Not surprisingly, pseudonymity on Reddit is ubiquitous, and the use of mains, alts, and throwaways is significant (Leavitt, 2015).7

Despite this pseudonymity, I show that deleting submissions, like moderator removal, is a common practice, a finding that coincides with Reddit’s 2021 “Transparency Report” (Transparency report 2021, 2021). Because Reddit aggregated admin and mod removals across all submissions and comments, it’s difficult to compare their figures with the present findings. Additionally, it’s not clear how Reddit counts removal and deletion activity, and their analysis spans the whole of Reddit, which operates at tremendous scale: they removed over 400 thousand subreddits in 2021 alone. (Might all of those subreddits’ messages be counted in their removal figures?) However, our reports do agree that the percentage of moderated content is increasing.

In a study of Twitter users’ perceptions of platform researchers, the authors found that “the majority of respondents are somewhat comfortable or are ambivalent about the idea of tweets being used in research” (Fiesler and Proferes, 2018). My 30 interviewees shared a similar lack of concern about their deleted posts surviving elsewhere as long as the posts could not easily be linked to their identities; that is, the greatest concern was “creeping,” wherein others connect authors’ sensitive/deleted posts with other Reddit activity and, possibly, the authors’ personal identity.

The present findings and methods correspond with the need for site-specific research (Pentzold, 2017). Reddit has its own norms and affordances, and r/Advice, r/AmItheAsshole, and r/relationship_advice vary with respect to their own archival and deletion policies: r/AmItheAsshole’s automod copies submission content into the comments without the author’s username; r/relationship_advice deletes the text and username of posts that become too popular.

Finally, around half of the submissions on the advice subreddits are currently moderated and/or deleted; this is higher than a sample of non-sensitive tech-related subreddits, and even other sensitive-topic subreddits studied by Gaur et al. (2019). Most deletions happen within the first day and week; a month out, activity is down to zero-to-few actions per day. Researchers who want to minimize their use of deleted messages should keep this time frame in mind. Reddit itself seems to remove messages from deleted accounts after 90 days, and this has not been heretofore reported.
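The time frame above can be turned into a simple collection rule. The following is a minimal sketch, not the method used in this study: it assumes each collected submission is a dictionary with a `created_utc` field holding epoch seconds, and it uses a hypothetical seven-day floor chosen only to illustrate the "first day and week" observation. Researchers would need to pick a floor suited to their own site.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a 7-day floor, loosely motivated by the observation that most
# deletions happen within the first day and week. This is illustrative only.
MIN_AGE = timedelta(days=7)


def old_enough(created_utc, collected_at, min_age=MIN_AGE):
    """Return True if a submission had at least `min_age` to be deleted
    by its author before it was collected."""
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return collected_at - created >= min_age


def filter_submissions(submissions, collected_at, min_age=MIN_AGE):
    """Keep only submissions (dicts with a `created_utc` key, an assumed
    layout) that were old enough at collection time."""
    return [
        s for s in submissions
        if old_enough(s["created_utc"], collected_at, min_age)
    ]
```

A longer floor (for example, a month) trades a smaller sample for a lower chance of ingesting a submission that its author later deletes.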

Limitations

The present analysis focuses on submissions only, not the comments that follow. Comments, too, are deleted on Reddit and persist elsewhere. However, comments are not, typically, as personally revealing as a submission; they are not replicated by subreddit bots, nor do they attract the same attention as the original post (both onsite and offsite).

r/Advice, r/AmItheAsshole, and r/relationship_advice are not necessarily representative of all sensitive-topic subreddits. While the insights gained from this analysis should inform researchers about what can happen, concerned researchers should investigate the rules, norms, conventions, and affordances of their own research sites. For those working with Reddit, Python scripts are provided (Reagle, 2021c) so others can replicate the current approach.

Conclusion

How ought researchers make use of public user-generated content? Typically the data is used as-is under the presumption users knowingly posted in a venue that is transparently accessible to any web browser or search engine. Additionally, users can and do take steps to manage their exposure. This presumption holds true for Reddit, except for the issue of deleted posts; one of the ways Redditors manage their exposure is by deleting their messages. Such messages, however, can persist in widely used archives (i.e., Pushshift) and related datasets (e.g., BigQuery and ConvoKit). Even researchers who collect data from Reddit itself, especially if it is contemporaneous, risk ingesting messages that users subsequently delete.
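One way to reduce the risk of ingesting subsequently deleted messages is to re-check archived records against their current state before use. The sketch below is illustrative and is not drawn from the scripts accompanying this paper. It assumes that a re-fetched record is a dictionary with `id`, `author`, and `selftext` fields, and that deleted or removed content is marked with the placeholder strings `[deleted]` and `[removed]`; the function names and the record layout are hypothetical.

```python
# Assumed placeholder strings for content that is no longer available.
DELETED_MARKER = "[deleted]"
REMOVED_MARKER = "[removed]"


def deletion_status(current):
    """Classify a re-fetched record as 'removed', 'deleted', or 'available'.

    Conservative by design: a record whose author field shows the deleted
    marker is treated as deleted even if its text is still present.
    """
    selftext = current.get("selftext", "")
    author = current.get("author", "")
    if selftext == REMOVED_MARKER:
        return "removed"
    if selftext == DELETED_MARKER or author == DELETED_MARKER:
        return "deleted"
    return "available"


def still_available(archived, current_by_id):
    """Keep archived posts whose current record exists and is available.

    `current_by_id` maps a post id to its re-fetched record (assumed layout);
    posts with no current record are dropped rather than assumed safe.
    """
    keep = []
    for post in archived:
        current = current_by_id.get(post["id"])
        if current is not None and deletion_status(current) == "available":
            keep.append(post)
    return keep
```

Dropping posts that cannot be re-fetched errs on the side of the author's apparent wishes, at the cost of some data loss.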

On three popular subreddits (r/Advice, r/AmItheAsshole, and r/relationship_advice), deleting submissions is common. Roughly half of submissions are deleted by their users, most within the first day and week. Interviews with 30 Redditors reveal that their motives for deletion include ensuring the “internet doesn’t see them,” especially those who might “see it on my Reddit profile,” deciding their issue was resolved, receiving unhelpful or aggressive comments, and concluding their submission was no longer relevant. Most interviewees are not overly concerned about deleted submissions persisting elsewhere (e.g., archives and datasets) as long as they are not connected to their other activity or identity.

When making use of public data, concerned researchers should consider the type of research, the sensitivity of the topic, the vulnerability of the sources, the attributes of the venue, and how the data is used (e.g., summarized in the aggregate or specifically quoted). This, then, requires site-specific investigation of the affordances of the venue and the norms and expectations of its users.

No single policy on the use of public data applies for all researchers, Reddit included. Again, topic sensitivity, user vulnerability, venue affordances, and how the data is used and reported need to be considered. However, researchers should proceed with care when quoting vulnerable users on sensitive topics, particularly users who believe their messages have been deleted and for whom the quotation could bring additional unwanted attention. Much of this risk can be minimized by only using messages that have had an opportunity to be deleted.

References

Katarzyna Adamczyk, 2016. “Voluntary and involuntary Singlehood and young Adults’ mental health: An investigation of mediating role of romantic loneliness,” Current Psychology, volume 36, number 4, pp. 888–904, and at http://dx.doi.org/10.1007/s12144-016-9478-3, accessed 23 May 2022.
#aita, 2022. Youtube, at https://www.youtube.com/hashtag/aita, accessed 7 July 2022.
AITA_online, 2020. “Am I the Asshole?” Twitter, at https://twitter.com/AITA_online, accessed 19 August 2022.
Tawfiq Ammari, Sarita Schoenebeck and Daniel Romero, 2019. “Self-declared throwaway accounts on Reddit,” Proceedings of the ACM on Human-Computer Interaction, volume 3, number CSCW, pp. 1–30, and at http://dx.doi.org/10.1145/3359237, accessed 16 August 2021.
Nazanin Andalibi, Oliver L. Haimson, Munmun De Choudhury and Andrea Forte, 2016. “Understanding social media disclosures of sexual abuse through the lenses of support seeking and anonymity,” Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, at http://dx.doi.org/10.1145/2858036.2858096, accessed 6 July 2020.
John W. Ayers, Theodore L. Caputi, Camille Nebeker and Mark Dredze, 2018. “Don’t quote me: Reverse identification of research participants in social media studies,” NPJ Digital Medicine, volume 1, number 1, at http://dx.doi.org/10.1038/s41746-018-0036-2, accessed 20 May 2021.
James Balamuta, 2018. “Using Google BigQuery to obtain Reddit comment phrase counts,” The Coatless Professor, at https://thecoatlessprofessor.com/programming/sql/using-google-bigquery-to-obtain-reddit-comment-phrase-counts/, accessed 11 August 2021.
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire and Jeremy Blackburn, 2020. “The Pushshift Reddit dataset,” In: Proceedings of The International AAAI conference on web and social media, pp. 830–839, and at https://ojs.aaai.org/index.php/ICWSM/article/view/7347, accessed 17 June 2021.
bold_bowler, 2021. “Interview via Reddit messages,” Reddit.
Amy Bruckman, Kurt Luther and Casey Fiesler, 2015. “When should we use real names in published accounts of internet research?” In: E. Hargittai and C. Sandvig (editors). Digital research confidential: The secrets of studying behavior online, MIT Press.
calm_courier, 2021. “Interview via Reddit messages,” Reddit.
Kathy Charmaz, 2006. Constructing grounded theory: A practical guide through qualitative analysis, Sage Publications Ltd.
Munmun De Choudhury and Sushovan De, 2014. “Mental health discourse on Reddit: Self-disclosure, social support, and anonymity,” In: Proceedings of The Eighth International AAAI Conference on Weblogs and Social Media, AAAI, at https://ojs.aaai.org/index.php/ICWSM/article/view/14526, accessed 9 June 2020.
ConvoKit, 2018. “Reddit corpus (by subreddit),” Cornell, at https://convokit.cornell.edu/documentation/subreddit.html, accessed 31 August 2021.
Helana Darwin, 2017. “Doing gender beyond the binary: A virtual ethnography,” Symbolic Interaction, volume 40, number 3, pp. 317–334, and at http://dx.doi.org/10.1002/symb.316, accessed 6 July 2020.
De4thbyTw1zzler, 2020. “Why are the mods at /r/relationship_advice deleting the text from top posts that reach a ‘karma limit’ and then reposting that same text in the comments of that same post so they get the karma?” r/NoStupidQuestions, at https://www.reddit.com/r/NoStupidQuestions/comments/k5ftx5/why_are_the_mods_at_rrelationship_advice_deleting/geejm9x/?context=3, accessed 11 July 2022.
Deleted Reddit comments, 2021. ReSavr, at https://www.resavr.com/, accessed 22 July 2021.
Steve Dent, 2022. “Reddit’s reveals r/AmItheAsshole was its most popular subreddit in 2022,” Engadget, at https://www.engadget.com/reddit-recap-stats-2022-130015151.html, accessed 12 December 2022.
Brianna Dym and Casey Fiesler, 2020. “Ethical and privacy considerations for research using online fandom data,” Transformative Works and Cultures, volume 33, at http://dx.doi.org/10.3983/twc.2020.1733, accessed 19 May 2021.
faint_foreman, 2021. “Interview via Reddit messages,” Reddit.
fair_flutist, 2021. “Interview via Reddit messages,” Reddit.
Casey Fiesler, 2019. “Ethical considerations for research involving (speculative) public data,” Proceedings of the ACM on Human-Computer Interaction, volume 3, number GROUP, pp. 1–13, and at http://dx.doi.org/10.1145/3370271, accessed 19 May 2021.
Casey Fiesler and Nicholas Proferes, 2018. “‘Participant’ perceptions of Twitter research ethics,” Social Media + Society, volume 4, number 1, at http://dx.doi.org/10.1177/2056305118763366, accessed 2 July 2020.
Devin Gaffney and J. Nathan Matias, 2018. “Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus,” PLOS ONE, volume 13, number 7, at http://dx.doi.org/10.1371/journal.pone.0200162, accessed 20 May 2021.
Manas Gaur, Amanuel Alambo, Joy Prakash Sain, Ugur Kursuncu, Krishnaprasad Thirunarayan, Ramakanth Kavuluru, Amit Sheth, Randy Welton and Jyotishman Pathak, 2019. “Knowledge-aware assessment of severity of suicide risk for early intervention,” In: Proceedings of WWW ’19: The World Wide Web Conference, ACM Press, at http://dx.doi.org/10.1145/3308558.3313698, accessed 15 September 2021.
Rob Hawkins, 2022. “reveddit FAQ,” Reveddit, at https://www.reveddit.com/about/faq/#need, accessed 16 June 2022.
icy_inventor, 2021. “Interview via Reddit messages,” Reddit.
large_lyricist, 2021. “Interview via Reddit messages,” Reddit.
Alex Leavitt, 2015. “‘This is a throwaway account’,” Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing - CSCW ’15, at http://dx.doi.org/10.1145/2675133.2675175, accessed 6 July 2020.
Adrienne Massanari, 2017. “#Gamergate and the fappening: How Reddit’s algorithm, governance, and culture support toxic technocultures,” New Media & Society, volume 19, number 3, at http://dx.doi.org/10.1177/1461444815608807, accessed 9 October 2017.
narrow_nurse, 2021. “Interview via Reddit messages,” Reddit.
Christian Pentzold, 2017. “‘What are these researchers doing in my Wikipedia?’: Ethical premises and practical judgment in internet-based ethnography,” Ethics and Information Technology, volume 19, number 2, pp. 143–155, and at http://dx.doi.org/10.1007/s10676-017-9423-7, accessed 10 October 2017.
plain_producer, 2021. “Interview via Reddit messages,” Reddit.
Nicholas Proferes, Naiyan Jones, Sarah Gilbert, Casey Fiesler and Michael Zimmer, 2021. “Studying Reddit: A systematic overview of disciplines, approaches, methods, and ethics,” Social Media + Society, volume 7, number 2, at https://journals.sagepub.com/doi/full/10.1177/20563051211019004, accessed 6 May 2021.
quick_quilter, 2021. “Interview via Reddit messages,” Reddit.
r/AmItheAsshole FAQ, 2021. Reddit, at https://www.reddit.com/r/AmItheAsshole/wiki/faq, accessed 9 August 2021.
Joseph Reagle, 2021a. “Disguising Reddit sources and the efficacy of ethical research,” Selected Papers of #AoIR2021, pp. 6–9, and at https://journals.uic.edu/ojs/index.php/spir/article/view/12096.
Joseph Reagle, 2021b. “Disguising Reddit sources and the efficacy of ethical research (under review),” at https://reagle.org/joseph/2020/mask/disguise.html.
Joseph Reagle, 2022. “Reddit deletions Jupyter notebook 3,” reagle.org, at https://reagle.org/joseph/2022/reddit-deletions-3.html, accessed 21 June 2022.
Joseph Reagle, 2021c. “Tools for scraping and analyzing Reddit,” at https://github.com/reagle/reddit.
Reddit, 2021. “Reddit by the numbers,” RedditInc, at https://www.redditinc.com/press, accessed 27 August 2021.
redditships, 2020. “relationships.txt,” Twitter, at https://twitter.com/redditships, accessed 19 August 2022.
Working Party on Internet-Mediated Research, 2021. Ethics guidelines for internet-mediated research, The British Psychological Society, at https://cms.bps.org.uk/sites/default/files/2022-06/Ethics%20Guidelines%20for%20Internet-mediated%20Research_0.pdf, accessed 28 October 2022.
Brady Robards, 2017. “‘Totally straight’: Contested sexual identities on social media site reddit,” Sexualities, volume 21, numbers 1-2, pp. 49–67, and at http://dx.doi.org/10.1177/1363460716678563, accessed 23 May 2022.
r/PurplePillDebate, 2022. Reddit, at https://www.reddit.com/r/PurplePillDebate/wiki/index, accessed 7 July 2022.
safe_swimmer, 2021. “Interview via Reddit messages,” Reddit.
Nicolas Schrading, Cecilia Ovesdotter Alm, Raymond Ptucha and Christopher Homan, 2015. “An analysis of domestic abuse discourse on Reddit,” In: Proceedings of 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2577–2583, and at https://aclanthology.org/D15-1309.pdf, accessed 23 May 2022.
Shaina J. Sowles, Melissa J. Krauss, Lewam Gebremedhn and Patricia A. Cavazos-Rehg, 2017. “‘I feel like I’ve hit the bottom and have no idea what to do’: Supportive social networking on Reddit for individuals with a desire to quit cannabis use,” Substance Abuse, volume 38, number 4, pp. 477–482, and at http://dx.doi.org/10.1080/08897077.2017.1354956, accessed 23 May 2022.
Shaina J. Sowles, Monique McLeary, Allison Optican, Elizabeth Cahn, Melissa J. Krauss, Ellen E. Fitzsimmons-Craft, Denise E. Wilfley and Patricia A. Cavazos-Rehg, 2018. “A content analysis of an online pro-eating disorder community on Reddit,” Body Image, volume 24, pp. 137–144, and at http://dx.doi.org/10.1016/j.bodyim.2018.01.001, accessed 23 May 2022.
Stuck_In_the_Matrix, 2019. “Pushshift will now be opting in by default to quarantined subreddits,” r/pushshift, at https://www.reddit.com/r/pushshift/comments/bazctc/pushshift_will_now_be_opting_in_by_default_to/, accessed 2 September 2021.
Stuck_In_the_Matrix, 2015. “Reddit data for ~900,000 subreddits (includes both public and private subreddits),” r/datasets, at https://www.reddit.com/r/datasets/comments/3k3mr9/reddit_data_for_900000_subreddits_includes_both/, accessed 2 September 2021.
Subreddit leader-board, 2022. Subreddit Stats, at https://subredditstats.com/, accessed 22 June 2022.
ta_early_exporter, 2021. “Interview via Reddit messages,” Reddit.
ta_joyful_judge, 2021. “Interview via Reddit messages,” Reddit.
tall_typist, 2021. “Interview via Reddit messages,” Reddit.
ta_rare_ranger, 2021. “Interview via Reddit messages,” Reddit.
Top-Building2429, 2021. “There is now another Pushshift-like reddit archival service (Archivesort). If you don’t want your deleted posts easily searchable, you should consider opting out.” r/privacy, at https://www.reddit.com/r/privacy/comments/ny90ku/there_is_now_another_pushshiftlike_reddit/, accessed 24 May 2022.
Transparency report 2021, 2021. Reddit, at https://www.redditinc.com/policies/transparency-report-2021-2/, accessed 11 January 2023.
young_yodeler, 2021. “Interview via Reddit messages,” Reddit.

Notes


  1. Proferes et al. (2021), p. 14

  2. Choudhury and De (2014), p. 78

  3. Proferes et al. (2021), p. 10

  4. Working Party on Internet-Mediated Research (2021), p. 14

  5. Fiesler and Proferes (2018), p. 10

  6. Charmaz (2006), p. 102

  7. Leavitt (2015), p. 321