Introduction¶

A researcher studying Reddit might want to avoid quoting or citing posts likely to be deleted by their authors (even pseudonymous ones), especially on sensitive topics. How many authors delete their Reddit posts, and by when are they likely to have done so?

In [27]:
import collections
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import matplotlib.ticker as mticker
In [28]:
def percent_true(series):  # in pd True=1 False=0, so mean() is percent True
    return "{:.1%}".format(round(series.mean(), 3))
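As a quick illustration (not part of the notebook's pipeline): pandas treats booleans as 1/0, so the mean of a boolean Series is the fraction of True values, which `percent_true` then formats.

```python
import pandas as pd

# Booleans average to the fraction of True values: 3 of 4 here.
flags = pd.Series([True, True, True, False])
print("{:.1%}".format(round(flags.mean(), 3)))  # 75.0%
```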

r/Advice in 2018 (with popular)¶

Let's read in no more than the first 1000 posts on r/Advice starting in March 2018. The year 2018 is ancient history on Reddit (activity was lower), and back then 1000 submissions could span many weeks.

In [29]:
# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r Advice -l 1000
df18advi = pd.read_csv("reddit_20180301-20180531_Advice_l1000_n1000.csv")
df18advi.shape
Out[29]:
(1000, 17)

All data is from the Reddit and Pushshift APIs, fetched using reddit-query.py. Pushshift usually ingests submissions within a day and permits advanced queries, including over time periods. The Reddit API can't do this, but it can then be queried per submission id for the latest state of each submission.
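As a rough sketch of how the two snapshots can be combined (the ids, texts, and `[deleted]` sentinel below are made up for illustration; reddit-query.py's actual logic may differ), join them on submission id and derive `_p`/`_r` flags:

```python
import pandas as pd

# Hypothetical miniature of the two snapshots: Pushshift at ingest,
# Reddit at query time (rows are invented for this example).
pushshift = pd.DataFrame({"id": ["a1", "a2"], "selftext": ["hi", "[deleted]"]})
reddit = pd.DataFrame({"id": ["a1", "a2"], "selftext": ["[deleted]", "[deleted]"]})

merged = pushshift.merge(reddit, on="id", suffixes=("_p", "_r"))
merged["del_text_p"] = merged["selftext_p"] == "[deleted]"
merged["del_text_r"] = merged["selftext_r"] == "[deleted]"
print(merged[["id", "del_text_p", "del_text_r"]])
```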

In [30]:
print(percent_true(df18advi["del_text_p"]))  # "_p" suffix means pushshift data
print(df18advi[df18advi["del_text_p"] == True]["elapsed_hours"].max())
14.8%
19

Within the first 19 hours (Pushshift's longest delay before ingesting a post in this data), 15% of redditors had already deleted their posts.

In [31]:
print(percent_true(df18advi["del_text_r"]))  # "_r" means reddit data
51.6%

Presently, 52% of those posts are deleted on Reddit.

In [32]:
print(percent_true(df18advi["rem_text_r"]))
9.7%

10% of the messages have been removed by moderators.

Reddit shows an author as deleted if (1) they deleted their post or (2) they deleted their account. (To know if a user actually deleted their account we'd have to query each user during data collection, which is not presently done.)
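Following that logic, the share of posts whose author shows deleted but whose text does not gives a rough estimate of genuine account deletions. A toy sketch (made-up flags, not the notebook's data):

```python
import pandas as pd

# Toy flags: the author shows deleted whenever the post was deleted,
# plus one extra case where only the account was deleted.
df = pd.DataFrame({
    "del_text_r":   [True, False, False, False],
    "del_author_r": [True, True,  False, False],
})
account_only = df["del_author_r"] & ~df["del_text_r"]
print("{:.1%}".format(account_only.mean()))  # 25.0%
```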

In [33]:
print(percent_true(df18advi["del_author_r"]))
63.3%

63% of authors from this period in 2018 are shown as deleted in their submissions. This includes the 52% above as well as some genuinely deleted user accounts, whose messages might survive if the owners didn't delete those first.

Popular (commented on) posts¶

Are these numbers also true of popular posts? Let's look at no more than 1000 posts with more than 50 comments in all of 2018 (as there are few matches). (Note: Pushshift records score and num_comments at ingest but updates num_comments as it ingests arriving comments, so num_comments is a more recent representation of Reddit's state and a better proxy for popularity.)
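The `-c ">50"` query filter corresponds to a simple comparison on the ingested data; a toy pandas equivalent (made-up rows):

```python
import pandas as pd

# Equivalent of the -c ">50" filter, assuming a num_comments column.
df = pd.DataFrame({"id": ["a", "b", "c"], "num_comments": [3, 51, 120]})
popular = df[df["num_comments"] > 50]
print(len(popular))  # 2
```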

In [34]:
# reddit-query.py -a 2018-01-01 -b 2018-12-31 -r Advice -c ">50" -l 1000
df18advi_c50 = pd.read_csv("reddit_20180101-20181231_Advice_c50+_l1000_n599.csv")
df18advi_c50.shape
Out[34]:
(599, 17)
In [35]:
print(percent_true(df18advi_c50["del_text_r"]))
38.7%
In [36]:
print(percent_true(df18advi_c50["del_author_r"]))
46.6%

There are only 599 posts with that many comments from back then, and 39% of them are deleted. 47% of the authors' identities show as deleted; given that Reddit marks the author deleted when a post is deleted, this excess suggests some actual account deletions as well. In any case, popular posts are often deleted, just not as often.

r/AmItheAsshole, r/relationship_advice in 2018¶

Let's do the same for r/AmItheAsshole and r/relationship_advice.

And since a removed (moderated) message might be less likely to be deleted by its author, who may already think it is gone, let's also calculate the percentage of deleted non-removed messages (i.e., nr_del). First, we'll set up some pretty printing.

Note: Given the graphs below from tracking actions on Reddit, it appears that at 90 days after posting, Reddit removes extant messages whose authors' accounts have been deleted. This policy is undocumented and not known to the community, and it isn't clear when it was implemented or to what extent it is retroactive.

In [37]:
ROW_HEADER = (
    f"| {'year subreddit':24} | {'shape': ^10} | {'del': ^6}"
    f"| {'username': ^8} | {'removed': ^8} | {'nr_del': ^9} |     "
)
ROW_SPACING = "| {:24} | {: >10s} | {: >5s} | {: >8s} | {: >8s} | {: >9s} |"
In [38]:
def print_df_stats(label, df):
    print(
        ROW_SPACING.format(
            label,
            str(df.shape),
            percent_true(df["del_text_r"]),
            percent_true(df["del_author_r"]),
            percent_true(df["rem_text_r"]),
            percent_true(df[~df["rem_text_r"]]["del_text_r"]),
        )
    )
    return (
        label,
        df.shape[0],
        float(percent_true(df["del_text_r"]).strip("%")),
        float(percent_true(df["del_author_r"]).strip("%")),
        float(percent_true(df["rem_text_r"]).strip("%")),
        float(percent_true(df[~df["rem_text_r"]]["del_text_r"]).strip("%")),
    )

Now, let's look at the specifics.

In [39]:
print(ROW_HEADER)
# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r Advice -l 1000
df18advi = pd.read_csv("reddit_20180301-20180531_Advice_l1000_n1000.csv")
print_df_stats("2018 Advice", df18advi)

# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r AmItheAsshole -l 1000
df18aita = pd.read_csv("reddit_20180301-20180531_AmItheAsshole_l1000_n1000.csv")
print_df_stats("2018 AmItheAsshole", df18aita)

# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r relationship_advice -l 1000
df18rela = pd.read_csv("reddit_20180301-20180531_relationship_advice_l1000_n1000.csv")
print_df_stats("2018 relationship_advice", df18rela)
| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2018 Advice              | (1000, 17) | 51.6% |    63.3% |     9.7% |     55.8% |
| 2018 AmItheAsshole       | (1000, 17) | 45.8% |    62.1% |    13.9% |     51.9% |
| 2018 relationship_advice | (1000, 17) | 55.9% |    64.5% |     9.5% |     60.9% |
Out[39]:
('2018 relationship_advice', 1000, 55.9, 64.5, 9.5, 60.9)

All the advice subreddits show significant deletion. r/AmItheAsshole was more aggressively moderated. Among non-removed messages, the percentage of deleted posts is slightly higher, which suggests some users don't bother to delete a post that was already removed.

Advice subreddits in 2020 and 2022¶

It's 2022-06-21; let's see how the advice subreddits are doing more recently. (The 2022 data was refetched on 2022-11-19 for subsequent comparisons.)

In [40]:
print(ROW_HEADER)
# reddit-query.py -a 2020-03-01 -b 2020-05-31 -r Advice -l 1000
df20advi = pd.read_csv("reddit_20200301-20200531_Advice_l1000_n1000.csv")
print_df_stats("2020 Advice", df20advi)
# reddit-query.py -a 2020-03-01 -b 2020-05-31 -r AmItheAsshole -l 1000
df20aita = pd.read_csv("reddit_20200301-20200531_AmItheAsshole_l1000_n1000.csv")
print_df_stats("2020 AmItheAsshole", df20aita)
# reddit-query.py -a 2020-03-01 -b 2020-05-31 -r relationship_advice -l 1000
df20rela = pd.read_csv("reddit_20200301-20200531_relationship_advice_l1000_n1000.csv")
print_df_stats("2020 relationship_advice", df20rela)
print()
print(ROW_HEADER)
# reddit-query.py -a 2022-03-01 -b 2022-05-31 -r Advice -l 1000
df22advi = pd.read_csv("reddit_20220301-20220531_Advice_l1000_n1000.csv")
print_df_stats("2022 Advice", df22advi)
# reddit-query.py -a 2022-03-01 -b 2022-05-31 -r AmItheAsshole -l 1000
df22aita = pd.read_csv("reddit_20220301-20220531_AmItheAsshole_l1000_n1000.csv")
print_df_stats("2022 AmItheAsshole", df22aita)
# reddit-query.py -a 2022-03-01 -b 2022-05-31 -r relationship_advice -l 1000
df22rela = pd.read_csv("reddit_20220301-20220531_relationship_advice_l1000_n1000.csv")
print_df_stats("2022 relationship_advice", df22rela)
| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2020 Advice              | (1000, 17) | 53.0% |    57.3% |    12.3% |     52.6% |
| 2020 AmItheAsshole       | (1000, 17) | 48.9% |    53.6% |    47.1% |     44.2% |
| 2020 relationship_advice | (1000, 17) | 58.9% |    63.5% |     9.8% |     59.0% |

| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2022 Advice              | (1000, 17) | 47.4% |    49.5% |    42.8% |     17.7% |
| 2022 AmItheAsshole       | (1000, 17) | 43.1% |    46.6% |    78.4% |     13.9% |
| 2022 relationship_advice | (1000, 17) | 53.7% |    55.2% |    48.0% |     23.7% |
Out[40]:
('2022 relationship_advice', 1000, 53.7, 55.2, 48.0, 23.7)

The deletion percentages are still significant but lower in 2022. Perhaps people now simply delete their user accounts (that rate remains high). Or users whose posts make it past moderation are more likely to abide by norms and leave them up (e.g., on r/AmItheAsshole).

As an aside, how many 2020/2022 removed/moderated posts were then deleted by their authors?

In [41]:
print(
    "2020 Advice              "
    + percent_true(df20advi["rem_text_r"] & df20advi["del_text_r"])
)
print(
    "2020 AmItheAsshole       "
    + percent_true(df20aita["rem_text_r"] & df20aita["del_text_r"])
)
print(
    "2020 relationship_advice "
    + percent_true(df20rela["rem_text_r"] & df20rela["del_text_r"])
)
print()
print(
    "2022 Advice              "
    + percent_true(df22advi["rem_text_r"] & df22advi["del_text_r"])
)
print(
    "2022 AmItheAsshole       "
    + percent_true(df22aita["rem_text_r"] & df22aita["del_text_r"])
)
print(
    "2022 relationship_advice "
    + percent_true(df22rela["rem_text_r"] & df22rela["del_text_r"])
)
2020 Advice              6.9%
2020 AmItheAsshole       25.5%
2020 relationship_advice 5.7%

2022 Advice              37.3%
2022 AmItheAsshole       40.1%
2022 relationship_advice 41.4%

Throwaway accounts¶

In [42]:
# reddit-message.py -i reddit_20210803-20210804_Advice_n__l500_sampled_.csv --dry-run
# reddit-message.py -i reddit_20210803-20210804_AmItheAsshole_n__l500_sampled_.csv --dry-run
# reddit-message.py -i reddit_20210803-20210804_relationship_advice_n__l500_sampled_.csv  --dry-run

The commands above yield the information for this table:

| subreddit           | unique users | deleted   | throwaways |
|---------------------|--------------|-----------|------------|
| Advice              | 492          | 93 (19%)  | 13 (3%)    |
| AmItheAsshole       | 480          | 51 (11%)  | 38 (8%)    |
| relationship_advice | 484          | 149 (31%) | 57 (12%)   |
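The rounded percentages in the table are just counts over unique users; for the r/Advice row:

```python
# Reproduce the rounded percentages for the r/Advice row of the table above.
unique_users, deleted, throwaways = 492, 93, 13
print(f"deleted:    {deleted / unique_users:.0%}")     # 19%
print(f"throwaways: {throwaways / unique_users:.0%}")  # 3%
```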

When are posts deleted by?¶

If a concerned researcher wanted to avoid quoting or citing posts likely to be deleted by their authors (even when pseudonymous), how long ought they wait before including them in their data? We only have two snapshots: when Pushshift ingested the posts (typically within a day of posting) and what's on Reddit now.

Watching Reddit¶

reddit-watch.py was used to fetch 24 hours of the most recent messages from the three subreddits and then watch their status over time. This dataset was collected at least twice a day, from 2022-Jul-11 to 2022-Sep-02.

In [43]:
# reddit-watch.py --init "Advice+AmItheAsshole+relationship_advice"
In [44]:
def watched_print(fn: str) -> None:
    df = pd.read_csv(
        fn,
        parse_dates=[
            "created_utc",
            "found_utc",
            "checked_utc",
            "del_author_r_utc",
            "del_text_r_utc",
            "rem_text_r_utc",
        ],
    )
    subreddit = fn.split("-")[1]
    year = fn.split("-")[2][0:4]
    print(ROW_HEADER)
    print_df_stats(f"{year} {subreddit}", df)
    # Rarely, created_utc > found_utc on the first few (most recent) submissions.
    # I'm not sure why my clock and Reddit's differ by < 1 min, so clip at 0.1 days.

    def days_since_created(df: pd.DataFrame, key: str) -> pd.Series:
        return (df[key] - df["created_utc"]).astype("timedelta64[D]").clip(lower=0.1)

    df["del_author_days"] = days_since_created(df, "del_author_r_utc")
    df["del_text_days"] = days_since_created(df, "del_text_r_utc")
    df["rem_text_days"] = days_since_created(df, "rem_text_r_utc")
    # elapsed_days is time between check and found of last row
    # elapsed_days = (df["checked_utc"] - df["found_utc"]).dt.days.iloc[-1]
    elapsed_days = 100  # cutoff at 100
    print(f"{elapsed_days=}\n")

    def count_elapsed(elapsed_range: range, column: pd.Series) -> list[int]:
        count_d = {}
        for counter in elapsed_range:
            count_d[counter] = column[
                column.between(counter - 1, counter, inclusive="right")
            ].count()
        # print(f"{count_d=}")
        return list(count_d.values())

    elapsed_range = range(0, elapsed_days + 1)  # Max span of days to plot
    elapsed_df = pd.DataFrame(index=elapsed_range)

    Label = collections.namedtuple("Label", ["column", "color", "desc"])
    rem_text_l = Label("rem_text_days", "#bc5090", "removed text")  # red
    del_auth_l = Label("del_author_days", "#ffa600", "deleted author")  # yellow
    del_text_l = Label("del_text_days", "#58508d", "deleted text")  # blue
    labels = [rem_text_l, del_auth_l, del_text_l]

    # New dataframe of elapsed_counts for plotting
    for label in labels:
        print(f"{df[label.column].count()=:5} {label.column} ")
        elapsed_df[label.column] = count_elapsed(elapsed_range, df[label.column])
        elapsed_df[f"{label.column}_cp"] = round(
            100 * (elapsed_df[label.column].cumsum() / len(df))
        )
    print(f"{elapsed_df}")

    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
    ax1 = elapsed_df[[label.column for label in labels]].plot.bar(
        figsize=[7, 4],
        width=0.9,
        color={label.column: label.color for label in labels},
        title=f"r/{subreddit} actions on elapsed day",
        xlabel="day",
        ylabel="actions (log-10)",
        logy=True,
        xticks=np.arange(0, elapsed_days, 7),
        rot=0,
        legend=True,
    )
    ax1.yaxis.set_major_formatter(mticker.ScalarFormatter())
    ax1.legend(
        labels=["removed text", "deleted author", "deleted text"], loc="upper center"
    )

    ax2 = ax1.twinx()
    ax2.set_ylabel(r"cumulative % deleted text")
    ax2.plot(
        elapsed_df["del_text_days_cp"],
        linestyle="dotted",
        color=del_text_l.color,
        linewidth=3,
        label=r"cumulative % deleted text",
    )
    ax2.set_ylim(1, 101)
    ax2.tick_params(axis="y")
    ax2.legend(loc="center", bbox_to_anchor=(0.5, 0.7))

    plt.tight_layout()
    plt.savefig(f"{fn}.png", dpi=150)
    # plt.savefig(f"{fn}.svg")
    plt.show()  # display the graph
In [45]:
watched_print("watch-Advice-20220712_n1160.csv")
| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2022 Advice              | (1160, 15) | 44.3% |    47.1% |    33.5% |     29.3% |
elapsed_days=100

df[label.column].count()=  389 rem_text_days 
df[label.column].count()=  546 del_author_days 
df[label.column].count()=  514 del_text_days 
     rem_text_days  rem_text_days_cp  del_author_days  del_author_days_cp  \
0                0               0.0                0                 0.0   
1              153              13.0              298                26.0   
2                3              13.0               27                28.0   
3                3              14.0               15                29.0   
4                0              14.0               13                30.0   
..             ...               ...              ...                 ...   
96               8              33.0                0                47.0   
97               4              33.0                2                47.0   
98               3              33.0                1                47.0   
99               4              34.0                1                47.0   
100              0              34.0                0                47.0   

     del_text_days  del_text_days_cp  
0                0               0.0  
1              283              24.0  
2               28              27.0  
3               15              28.0  
4               10              29.0  
..             ...               ...  
96               0              44.0  
97               2              44.0  
98               1              44.0  
99               1              44.0  
100              0              44.0  

[101 rows x 6 columns]
In [46]:
watched_print("watch-AmItheAsshole-20220712_n1689.csv")
| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2022 AmItheAsshole       | (1689, 15) | 38.7% |    41.2% |    70.3% |     19.9% |
elapsed_days=100

df[label.column].count()= 1187 rem_text_days 
df[label.column].count()=  696 del_author_days 
df[label.column].count()=  654 del_text_days 
     rem_text_days  rem_text_days_cp  del_author_days  del_author_days_cp  \
0                0               0.0                0                 0.0   
1             1072              63.0              493                29.0   
2                0              63.0               19                30.0   
3                1              64.0               16                31.0   
4                0              64.0               10                32.0   
..             ...               ...              ...                 ...   
96               1              70.0                1                41.0   
97               1              70.0                1                41.0   
98               2              70.0                1                41.0   
99               0              70.0                0                41.0   
100              0              70.0                0                41.0   

     del_text_days  del_text_days_cp  
0                0               0.0  
1              466              28.0  
2               18              29.0  
3               13              29.0  
4                9              30.0  
..             ...               ...  
96               1              39.0  
97               1              39.0  
98               1              39.0  
99               0              39.0  
100              0              39.0  

[101 rows x 6 columns]
In [47]:
watched_print("watch-relationship_advice-20220712_n2541.csv")
| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2022 relationship_advice | (2541, 15) | 55.7% |    57.6% |    52.2% |     35.6% |
elapsed_days=100

df[label.column].count()= 1326 rem_text_days 
df[label.column].count()= 1463 del_author_days 
df[label.column].count()= 1416 del_text_days 
     rem_text_days  rem_text_days_cp  del_author_days  del_author_days_cp  \
0                0               0.0                0                 0.0   
1              762              30.0             1012                40.0   
2               14              31.0               56                42.0   
3                1              31.0               36                43.0   
4                0              31.0               25                44.0   
..             ...               ...              ...                 ...   
96              10              51.0                1                57.0   
97               6              52.0                2                57.0   
98              10              52.0                1                58.0   
99               5              52.0                1                58.0   
100              0              52.0                0                58.0   

     del_text_days  del_text_days_cp  
0                0               0.0  
1              979              39.0  
2               56              41.0  
3               36              42.0  
4               20              43.0  
..             ...               ...  
96               1              56.0  
97               2              56.0  
98               1              56.0  
99               1              56.0  
100              0              56.0  

[101 rows x 6 columns]

Appendix: Deletion rates in other sensitive subreddits¶

The largest study of sensitive subreddits that I am familiar with is Gaur et al.'s (2019) "Knowledge-Aware Assessment of Severity of Suicide Risk for Early Intervention". Below are the deletion rates for their study's 15 subreddits, minus r/cripplingalcoholism, which was made private in 2022. Given that Reddit now removes some messages after 90 days, let's consider the rates of action prior to that trigger.

In [48]:
health_stats = []
print(ROW_HEADER)
df22Anxy = pd.read_csv("reddit_20220814-20221109_Anxiety_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 Anxiety", df22Anxy))
df22BPDD = pd.read_csv("reddit_20220814-20221109_BPD_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 BPD", df22BPDD))
df22Bipt = pd.read_csv("reddit_20220814-20221109_BipolarReddit_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 BipolarReddit", df22Bipt))
df22Bips = pd.read_csv("reddit_20220814-20221109_BipolarSOs_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 BipolarSOs", df22Bips))
df22Stom = pd.read_csv("reddit_20220814-20221109_StopSelfHarm_l1000_n105.csv")
health_stats.append(print_df_stats("2022 StopSelfHarm", df22Stom))
df22Suih = pd.read_csv("reddit_20220814-20221109_SuicideWatch_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 SuicideWatch", df22Suih))
df22addn = pd.read_csv("reddit_20220814-20221109_addiction_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 addiction", df22addn))
df22asps = pd.read_csv("reddit_20220814-20221109_aspergers_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 aspergers", df22asps))
df22autm = pd.read_csv("reddit_20220814-20221109_autism_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 autism", df22autm))
df22bipr = pd.read_csv("reddit_20220814-20221109_bipolar_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 bipolar", df22bipr))
df22depn = pd.read_csv("reddit_20220814-20221109_depression_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 depression", df22depn))
df22opis = pd.read_csv("reddit_20220814-20221109_opiates_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 opiates", df22opis))
df22scha = pd.read_csv("reddit_20220814-20221109_schizophrenia_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 schizophrenia", df22scha))
df22selm = pd.read_csv("reddit_20220814-20221109_selfharm_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 selfharm", df22selm))
| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2022 Anxiety             | (1000, 17) | 35.0% |    37.9% |     5.2% |     35.4% |
| 2022 BPD                 | (1000, 17) | 41.5% |    43.6% |    20.4% |     38.8% |
| 2022 BipolarReddit       | (1000, 17) | 25.8% |    30.0% |     2.1% |     25.8% |
| 2022 BipolarSOs          | (1000, 17) | 38.7% |    40.1% |    35.6% |     27.5% |
| 2022 StopSelfHarm        |  (105, 17) | 12.4% |    13.3% |     2.9% |     12.7% |
| 2022 SuicideWatch        | (1000, 17) | 42.9% |    46.7% |    17.9% |     40.2% |
| 2022 addiction           | (1000, 17) | 32.1% |    33.9% |     9.1% |     33.7% |
| 2022 aspergers           | (1000, 17) | 30.7% |    33.8% |     8.5% |     28.4% |
| 2022 autism              | (1000, 17) | 24.5% |    26.7% |     4.0% |     23.1% |
| 2022 bipolar             | (1000, 17) | 32.8% |    34.6% |    25.1% |     28.2% |
| 2022 depression          | (1000, 17) | 38.9% |    42.0% |    19.9% |     37.8% |
| 2022 opiates             | (1000, 17) | 36.2% |    37.1% |    32.1% |     28.9% |
| 2022 schizophrenia       | (1000, 17) | 32.1% |    34.1% |    35.1% |     22.7% |
| 2022 selfharm            | (1000, 17) | 30.0% |    31.7% |     8.8% |     30.6% |
In [49]:
health_df = pd.DataFrame(
    health_stats, columns=["label", "size", "del", "username", "removed", "nr_del"]
)
print(health_df.mean(axis=0, numeric_only=True))
size        936.071429
del          32.400000
username     34.678571
removed      16.192857
nr_del       29.557143
dtype: float64
In [50]:
computer_stats = []
print(ROW_HEADER)
df22Andd = pd.read_csv("reddit_20220820-20221110_Android_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 Android", df22Andd))
df22Datr = pd.read_csv("reddit_20220820-20221110_DataHoarder_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 DataHoarder", df22Datr))
df22appe = pd.read_csv("reddit_20220820-20221110_apple_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 apple", df22appe))
df22aude = pd.read_csv("reddit_20220820-20221110_audiophile_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 audiophile", df22aude))
df22buic = pd.read_csv("reddit_20220820-20221110_buildapc_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 buildapc", df22buic))
df22eles = pd.read_csv("reddit_20220820-20221110_electronics_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 electronics", df22eles))
df22gads = pd.read_csv("reddit_20220820-20221110_gadgets_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 gadgets", df22gads))
df22hare = pd.read_csv("reddit_20220820-20221110_hardware_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 hardware", df22hare))
df22ipad = pd.read_csv("reddit_20220820-20221110_ipad_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 ipad", df22ipad))
df22linx = pd.read_csv("reddit_20220820-20221110_linux_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 linux", df22linx))
df22macc = pd.read_csv("reddit_20220820-20221110_mac_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 mac", df22macc))
df22sysn = pd.read_csv("reddit_20220820-20221110_sysadmin_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 sysadmin", df22sysn))
df22tect = pd.read_csv("reddit_20220820-20221110_techsupport_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 techsupport", df22tect))
df22webb = pd.read_csv("reddit_20220820-20221110_web_design_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 web", df22webb))
df22wins = pd.read_csv("reddit_20220820-20221110_windows_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 windows", df22wins))
| year subreddit           |   shape    |  del  | username | removed  |  nr_del   |     
| 2022 Android             | (1000, 17) | 14.8% |    15.6% |    70.1% |      4.0% |
| 2022 DataHoarder         | (1000, 17) | 14.6% |    15.5% |    13.3% |     12.3% |
| 2022 apple               | (1000, 17) | 25.4% |    27.0% |    81.3% |      5.9% |
| 2022 audiophile          | (1000, 17) | 26.2% |    26.6% |    42.7% |     17.3% |
| 2022 buildapc            | (1000, 17) | 21.7% |    22.3% |     5.7% |     20.8% |
| 2022 electronics         | (1000, 17) | 20.7% |    21.3% |    55.1% |      0.4% |
| 2022 gadgets             | (1000, 17) | 21.1% |    21.9% |    12.0% |     11.7% |
| 2022 hardware            | (1000, 17) | 20.1% |    21.0% |    44.6% |     15.2% |
| 2022 ipad                | (1000, 17) | 25.3% |    26.1% |    72.6% |     18.6% |
| 2022 linux               | (1000, 17) | 21.1% |    23.7% |    61.5% |     10.6% |
| 2022 mac                 | (1000, 17) | 18.6% |    20.4% |     4.4% |     16.4% |
| 2022 sysadmin            | (1000, 17) | 11.8% |    13.1% |     9.6% |     11.8% |
| 2022 techsupport         | (1000, 17) | 18.9% |    20.1% |    10.2% |     17.6% |
| 2022 web                 | (1000, 17) | 19.9% |    20.9% |    43.9% |     11.9% |
| 2022 windows             | (1000, 17) | 19.7% |    21.9% |    44.4% |     10.8% |
In [51]:
computer_df = pd.DataFrame(
    computer_stats, columns=["label", "size", "del", "username", "removed", "nr_del"]
)
print(computer_df.mean(axis=0, numeric_only=True))
size        1000.000000
del           19.993333
username      21.160000
removed       38.093333
nr_del        12.353333
dtype: float64
In [52]:
all_df = pd.concat([health_df, computer_df])
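One next step with the combined frame might be to compare the two groups' mean deletion rates. A sketch (the `category` labels are added here, not present in the CSVs, and single rows with the printed means stand in for the real frames):

```python
import pandas as pd

# Hypothetical continuation: tag each group and compare mean deletion rates.
health_df = pd.DataFrame({"del": [32.4], "category": "health"})      # stand-in
computer_df = pd.DataFrame({"del": [20.0], "category": "computer"})  # stand-in
all_df = pd.concat([health_df, computer_df])
print(all_df.groupby("category")["del"].mean())
```

On these numbers, the sensitive (health) subreddits show roughly 12 points more deletion than the computing subreddits.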