A researcher concerned about Reddit users' privacy might want to avoid quoting or citing posts likely to be deleted by their authors (even when the authors are pseudonymous), especially on sensitive topics. How many authors delete their posts on Reddit, and by when are they likely to have done so?
import collections
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import matplotlib.ticker as mticker
def percent_true(series): # in pd True=1 False=0, so mean() is percent True
return "{:.1%}".format(round(series.mean(), 3))
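For example, on a toy series (redefining the helper so the snippet stands alone; the values are illustrative):

```python
import pandas as pd

def percent_true(series):  # in pd True=1 False=0, so mean() is the share of True
    return "{:.1%}".format(round(series.mean(), 3))

print(percent_true(pd.Series([True, False, True, True])))  # → 75.0%
```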
Let's read in no more than the first 1000 posts starting in April 2018 on r/Advice. The year 2018 is ancient history on Reddit (with less activity), and back then 1000 submissions could span many weeks.
# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r Advice -l 1000
df18advi = pd.read_csv("reddit_20180301-20180531_Advice_l1000_n1000.csv")
df18advi.shape
(1000, 17)
All data is from the Reddit and Pushshift APIs via reddit-query.py. Pushshift usually ingests submissions within a day and permits advanced queries, including over time periods. The Reddit API can't do this, but it can then be queried per submission id for the latest state of each submission.
print(percent_true(df18advi["del_text_p"]))  # "_p" suffix means Pushshift data
print(df18advi[df18advi["del_text_p"] == True]["elapsed_hours"].max())
14.8%
19
Within the first 19 hours (Pushshift's longest delay before ingesting a post in this data), 15% of redditors had already deleted their posts.
print(percent_true(df18advi["del_text_r"]))  # "_r" suffix means Reddit data
51.6%
Presently, 52% of those posts are deleted on Reddit.
print(percent_true(df18advi["rem_text_r"]))
9.7%
10% of the messages have been removed by moderators.
Reddit shows an author as deleted if (1) they deleted their post or (2) they deleted their account. (To know if a user actually deleted their account we'd have to query each user during data collection, which is not presently done.)
print(percent_true(df18advi["del_author_r"]))
63.3%
63% of authors from this period in 2018 are shown as deleted in their submissions. This includes the 52% of deleted posts above as well as some genuinely deleted user accounts, whose messages might survive if the authors didn't delete them first.
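These two cases can be separated heuristically: if Reddit shows the author as deleted but the post text survives, the account itself was likely deleted. A minimal sketch on toy data (the column names match those used in this analysis; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "del_author_r": [True, True, False, True],  # author shows "[deleted]"
    "del_text_r": [True, False, False, False],  # post body was deleted
})
# Author gone but text intact: likely a genuine account deletion
likely_account_del = df["del_author_r"] & ~df["del_text_r"]
print(likely_account_del.sum())  # → 2
```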
Are these numbers also true of popular posts? Let's look at no more than 1000 posts with more than 50 comments in all of 2018 (as there are few matches). (Note: Pushshift records score and num_comments at ingest, but updates num_comments as it ingests arriving comments, so num_comments is a more recent representation of Reddit and a better proxy for popularity.)
# reddit-query.py -a 2018-01-01 -b 2018-12-31 -r Advice -c ">50" -l 1000
df18advi_c50 = pd.read_csv("reddit_20180101-20181231_Advice_c50+_l1000_n599.csv")
df18advi_c50.shape
(599, 17)
print(percent_true(df18advi_c50["del_text_r"]))
38.7%
print(percent_true(df18advi_c50["del_author_r"]))
46.6%
There are only 599 posts that received sufficient comments back then, and 39% of them are deleted. 47% of the authors' identities show as deleted; given that Reddit deletes the author from a message when they delete their post, this excess shows there are some actual account deletions as well. In any case, popular posts are often deleted, just not as often.
Let's do the same for r/Advice, r/AmItheAsshole, and r/relationship_advice. And since a removed (moderated) message might be less likely to be deleted by its author, who may already think it is gone, let's also calculate the percentage of deleted non-removed messages (i.e., nr_del). First, we'll set up some pretty printing.
Note: Given the graphs below from tracking actions on Reddit, it appears that at 90 days after posting, Reddit removes extant messages whose authors' accounts have been deleted. This policy is undocumented and not known to the community, and it isn't clear when it was implemented or to what extent it is retroactive.
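A crude way to look for such a day-90 cluster in elapsed-day data (toy values; the real distributions appear in the watch data below):

```python
import pandas as pd

# Hypothetical elapsed days (post creation to observed removal) for a handful of posts
days = pd.Series([1, 1, 2, 5, 89, 90, 90, 91])
near_90 = days.between(85, 95).sum() / len(days)
print(f"{near_90:.0%} of observed removals fall near day 90")  # → 50% ...
```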
ROW_HEADER = (
f"| {'year subreddit':24} | {'shape': ^10} | {'del': ^6}"
f"| {'username': ^8} | {'removed': ^8} | {'nr_del': ^9} | "
)
ROW_SPACING = "| {:24} | {: >10s} | {: >5s} | {: >8s} | {: >8s} | {: >9s} |"
def print_df_stats(label, df):
print(
ROW_SPACING.format(
label,
str(df.shape),
percent_true(df["del_text_r"]),
percent_true(df["del_author_r"]),
percent_true(df["rem_text_r"]),
percent_true(df[~df["rem_text_r"]]["del_text_r"]),
)
)
return (
label,
df.shape[0],
float(percent_true(df["del_text_r"]).strip("%")),
float(percent_true(df["del_author_r"]).strip("%")),
float(percent_true(df["rem_text_r"]).strip("%")),
float(percent_true(df[~df["rem_text_r"]]["del_text_r"]).strip("%")),
)
Now, let's look at the specifics.
print(ROW_HEADER)
# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r Advice -l 1000
df18advi = pd.read_csv("reddit_20180301-20180531_Advice_l1000_n1000.csv")
print_df_stats("2018 Advice", df18advi)
# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r AmItheAsshole -l 1000
df18aita = pd.read_csv("reddit_20180301-20180531_AmItheAsshole_l1000_n1000.csv")
print_df_stats("2018 AmItheAsshole", df18aita)
# reddit-query.py -a 2018-03-01 -b 2018-05-31 -r relationship_advice -l 1000
df18rela = pd.read_csv("reddit_20180301-20180531_relationship_advice_l1000_n1000.csv")
print_df_stats("2018 relationship_advice", df18rela)
| year subreddit | shape | del | username | removed | nr_del |
| 2018 Advice | (1000, 17) | 51.6% | 63.3% | 9.7% | 55.8% |
| 2018 AmItheAsshole | (1000, 17) | 45.8% | 62.1% | 13.9% | 51.9% |
| 2018 relationship_advice | (1000, 17) | 55.9% | 64.5% | 9.5% | 60.9% |
('2018 relationship_advice', 1000, 55.9, 64.5, 9.5, 60.9)
All the advice subreddits have significant deletion. r/AmItheAsshole was more aggressively moderated. When looking at non-moderated messages, the percentage of deleted posts is slightly higher. This suggests some users might not bother to delete if already removed.
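The nr_del figure is just a deletion rate computed over the boolean mask of non-removed posts; on toy data (values made up, column names as in this analysis):

```python
import pandas as pd

df = pd.DataFrame({
    "del_text_r": [True, True, False, False, True],
    "rem_text_r": [False, True, False, False, False],
})
# Deletion rate among the posts that moderators did not remove
nr_del = df.loc[~df["rem_text_r"], "del_text_r"].mean()
print(f"{nr_del:.1%}")  # → 50.0%
```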
It's 2022-06-21; let's see how the advice subreddits are doing more recently. (The 2022 data was refetched on 2022-11-19 for subsequent comparisons.)
print(ROW_HEADER)
# reddit-query.py -a 2020-03-01 -b 2020-05-31 -r Advice -l 1000
df20advi = pd.read_csv("reddit_20200301-20200531_Advice_l1000_n1000.csv")
print_df_stats("2020 Advice", df20advi)
# reddit-query.py -a 2020-03-01 -b 2020-05-31 -r AmItheAsshole -l 1000
df20aita = pd.read_csv("reddit_20200301-20200531_AmItheAsshole_l1000_n1000.csv")
print_df_stats("2020 AmItheAsshole", df20aita)
# reddit-query.py -a 2020-03-01 -b 2020-05-31 -r relationship_advice -l 1000
df20rela = pd.read_csv("reddit_20200301-20200531_relationship_advice_l1000_n1000.csv")
print_df_stats("2020 relationship_advice", df20rela)
print()
print(ROW_HEADER)
# reddit-query.py -a 2022-03-01 -b 2022-05-31 -r Advice -l 1000
df22advi = pd.read_csv("reddit_20220301-20220531_Advice_l1000_n1000.csv")
print_df_stats("2022 Advice", df22advi)
# reddit-query.py -a 2022-03-01 -b 2022-05-31 -r AmItheAsshole -l 1000
df22aita = pd.read_csv("reddit_20220301-20220531_AmItheAsshole_l1000_n1000.csv")
print_df_stats("2022 AmItheAsshole", df22aita)
# reddit-query.py -a 2022-03-01 -b 2022-05-31 -r relationship_advice -l 1000
df22rela = pd.read_csv("reddit_20220301-20220531_relationship_advice_l1000_n1000.csv")
print_df_stats("2022 relationship_advice", df22rela)
| year subreddit | shape | del | username | removed | nr_del |
| 2020 Advice | (1000, 17) | 53.0% | 57.3% | 12.3% | 52.6% |
| 2020 AmItheAsshole | (1000, 17) | 48.9% | 53.6% | 47.1% | 44.2% |
| 2020 relationship_advice | (1000, 17) | 58.9% | 63.5% | 9.8% | 59.0% |

| year subreddit | shape | del | username | removed | nr_del |
| 2022 Advice | (1000, 17) | 47.4% | 49.5% | 42.8% | 17.7% |
| 2022 AmItheAsshole | (1000, 17) | 43.1% | 46.6% | 78.4% | 13.9% |
| 2022 relationship_advice | (1000, 17) | 53.7% | 55.2% | 48.0% | 23.7% |
('2022 relationship_advice', 1000, 53.7, 55.2, 48.0, 23.7)
The deletion percentages are significant but lower in 2022. Perhaps this is because people simply delete their user accounts instead, a rate that is still high. Or users who make it past moderation are more likely to abide by norms and leave their posts up (e.g., on r/AmItheAsshole).
As an aside, how many 2020/2022 removed/moderated posts were then deleted by their authors?
print(
"2020 Advice "
+ percent_true(df20advi["rem_text_r"] & df20advi["del_text_r"])
)
print(
"2020 AmItheAsshole "
+ percent_true(df20aita["rem_text_r"] & df20aita["del_text_r"])
)
print(
"2020 relationship_advice "
+ percent_true(df20rela["rem_text_r"] & df20rela["del_text_r"])
)
print()
print(
"2022 Advice "
+ percent_true(df22advi["rem_text_r"] & df22advi["del_text_r"])
)
print(
"2022 AmItheAsshole "
+ percent_true(df22aita["rem_text_r"] & df22aita["del_text_r"])
)
print(
"2022 relationship_advice "
+ percent_true(df22rela["rem_text_r"] & df22rela["del_text_r"])
)
2020 Advice 6.9%
2020 AmItheAsshole 25.5%
2020 relationship_advice 5.7%

2022 Advice 37.3%
2022 AmItheAsshole 40.1%
2022 relationship_advice 41.4%
# reddit-message.py -i reddit_20210803-20210804_Advice_n__l500_sampled_.csv --dry-run
# reddit-message.py -i reddit_20210803-20210804_AmItheAsshole_n__l500_sampled_.csv --dry-run
# reddit-message.py -i reddit_20210803-20210804_relationship_advice_n__l500_sampled_.csv --dry-run
The commands above yield information for this table:
subreddit | unique users | deleted | throwaways
---|---|---|---
Advice | 492 | 93 (19%) | 13 (03%)
AmItheAsshole | 480 | 51 (11%) | 38 (08%)
relationship_advice | 484 | 149 (31%) | 57 (12%)
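The exact throwaway heuristic used by reddit-message.py isn't shown here; a minimal sketch, assuming a simple username pattern match (the script's actual rule may differ):

```python
import pandas as pd

# Hypothetical usernames; a common convention includes "throwaway" or a "TA" prefix
users = pd.Series(["ThrowawayAdvice99", "regular_user", "TA_helpme", "throwRA12345"])
is_throwaway = users.str.contains(r"throw|^ta[_\d]", case=False, regex=True)
print(is_throwaway.sum())  # → 3
```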
If a concerned researcher wanted to avoid quoting or citing posts likely to be deleted by their authors (even when pseudonymous), how long ought they wait before including them in their data? We only have two snapshots: when Pushshift ingested the posts (typically within a day of posting) and what's on Reddit now.
reddit-watch.py was used to fetch 24 hours of the most recent messages from the three subreddits and then watch their status over time. This dataset was collected at least twice a day, from 2022-Jul-11 to 2022-Sep-02.
# reddit-watch.py --init "Advice+AmItheAsshole+relationship_advice"
def watched_print(fn: str) -> None:
df = pd.read_csv(
fn,
parse_dates=[
"created_utc",
"found_utc",
"checked_utc",
"del_author_r_utc",
"del_text_r_utc",
"rem_text_r_utc",
],
)
subreddit = fn.split("-")[1]
year = fn.split("-")[2][0:4]
print(ROW_HEADER)
print_df_stats(f"{year} {subreddit}", df)
# Rarely and oddly, created_utc > found_utc on the first few most recent submissions.
# I'm not sure why my computer's clock and Reddit's differ by < 1m, but clip at 0.1.
def days_since_created(df: pd.DataFrame, key: str) -> pd.Series:
    return (df[key] - df["created_utc"]).astype("timedelta64[D]").clip(lower=0.1)
df["del_author_days"] = days_since_created(df, "del_author_r_utc")
df["del_text_days"] = days_since_created(df, "del_text_r_utc")
df["rem_text_days"] = days_since_created(df, "rem_text_r_utc")
# elapsed_days is time between check and found of last row
# elapsed_days = (df["checked_utc"] - df["found_utc"]).dt.days.iloc[-1]
elapsed_days = 100 # cutoff at 100
print(f"{elapsed_days=}\n")
def count_elapsed(elapsed_range: range, column: pd.Series) -> list[int]:
    # Count actions falling within each one-day bucket of elapsed days
    count_d = {}
    for counter in elapsed_range:
        count_d[counter] = column[
            column.between(counter - 1, counter, inclusive="right")
        ].count()
    # print(f"{count_d=}")
    return list(count_d.values())
elapsed_range = range(0, elapsed_days + 1) # Max span of days to plot
elapsed_df = pd.DataFrame(index=elapsed_range)
Label = collections.namedtuple("Label", ["column", "color", "desc"])
rem_text_l = Label("rem_text_days", "#bc5090", "removed text") # red
del_auth_l = Label("del_author_days", "#ffa600", "deleted author") # yellow
del_text_l = Label("del_text_days", "#58508d", "deleted text") # blue
labels = [rem_text_l, del_auth_l, del_text_l]
# New dataframe of elapsed_counts for plotting
for label in labels:
print(f"{df[label.column].count()=:5} {label.column} ")
elapsed_df[label.column] = count_elapsed(elapsed_range, df[label.column])
elapsed_df[f"{label.column}_cp"] = round(
100 * (elapsed_df[label.column].cumsum() / len(df))
)
print(f"{elapsed_df}")
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
ax1 = elapsed_df[[label.column for label in labels]].plot.bar(
figsize=[7, 4],
width=0.9,
color={label.column: label.color for label in labels},
title=f"r/{subreddit} actions on elapsed day",
xlabel="day",
ylabel="actions (log-10)",
logy=True,
xticks=np.arange(0, elapsed_days, 7),
rot=0,
legend=True,
)
ax1.yaxis.set_major_formatter(mticker.ScalarFormatter())
ax1.legend(
labels=["removed text", "deleted author", "deleted text"], loc="upper center"
)
ax2 = ax1.twinx()
ax2.set_ylabel(r"cumulative % deleted text")
ax2.plot(
elapsed_df["del_text_days_cp"],
linestyle="dotted",
color=del_text_l.color,
linewidth=3,
label=r"cumulative % deleted text",
)
ax2.set_ylim(1, 101)
ax2.tick_params(axis="y")
ax2.legend(loc="center", bbox_to_anchor=(0.5, 0.7))
plt.tight_layout()
plt.savefig(f"{fn}.png", dpi=150)
# plt.savefig(f"{fn}.svg")
plt.show()  # display graph
watched_print("watch-Advice-20220712_n1160.csv")
| year subreddit | shape | del | username | removed | nr_del |
| 2022 Advice | (1160, 15) | 44.3% | 47.1% | 33.5% | 29.3% |
elapsed_days=100

df[label.column].count()=  389 rem_text_days
df[label.column].count()=  546 del_author_days
df[label.column].count()=  514 del_text_days
     rem_text_days  rem_text_days_cp  del_author_days  del_author_days_cp  \
0                0               0.0                0                 0.0
1              153              13.0              298                26.0
2                3              13.0               27                28.0
3                3              14.0               15                29.0
4                0              14.0               13                30.0
..             ...               ...              ...                 ...
96               8              33.0                0                47.0
97               4              33.0                2                47.0
98               3              33.0                1                47.0
99               4              34.0                1                47.0
100              0              34.0                0                47.0

     del_text_days  del_text_days_cp
0                0               0.0
1              283              24.0
2               28              27.0
3               15              28.0
4               10              29.0
..             ...               ...
96               0              44.0
97               2              44.0
98               1              44.0
99               1              44.0
100              0              44.0

[101 rows x 6 columns]
watched_print("watch-AmItheAsshole-20220712_n1689.csv")
| year subreddit | shape | del | username | removed | nr_del |
| 2022 AmItheAsshole | (1689, 15) | 38.7% | 41.2% | 70.3% | 19.9% |
elapsed_days=100

df[label.column].count()= 1187 rem_text_days
df[label.column].count()=  696 del_author_days
df[label.column].count()=  654 del_text_days
     rem_text_days  rem_text_days_cp  del_author_days  del_author_days_cp  \
0                0               0.0                0                 0.0
1             1072              63.0              493                29.0
2                0              63.0               19                30.0
3                1              64.0               16                31.0
4                0              64.0               10                32.0
..             ...               ...              ...                 ...
96               1              70.0                1                41.0
97               1              70.0                1                41.0
98               2              70.0                1                41.0
99               0              70.0                0                41.0
100              0              70.0                0                41.0

     del_text_days  del_text_days_cp
0                0               0.0
1              466              28.0
2               18              29.0
3               13              29.0
4                9              30.0
..             ...               ...
96               1              39.0
97               1              39.0
98               1              39.0
99               0              39.0
100              0              39.0

[101 rows x 6 columns]
watched_print("watch-relationship_advice-20220712_n2541.csv")
| year subreddit | shape | del | username | removed | nr_del |
| 2022 relationship_advice | (2541, 15) | 55.7% | 57.6% | 52.2% | 35.6% |
elapsed_days=100

df[label.column].count()= 1326 rem_text_days
df[label.column].count()= 1463 del_author_days
df[label.column].count()= 1416 del_text_days
     rem_text_days  rem_text_days_cp  del_author_days  del_author_days_cp  \
0                0               0.0                0                 0.0
1              762              30.0             1012                40.0
2               14              31.0               56                42.0
3                1              31.0               36                43.0
4                0              31.0               25                44.0
..             ...               ...              ...                 ...
96              10              51.0                1                57.0
97               6              52.0                2                57.0
98              10              52.0                1                58.0
99               5              52.0                1                58.0
100              0              52.0                0                58.0

     del_text_days  del_text_days_cp
0                0               0.0
1              979              39.0
2               56              41.0
3               36              42.0
4               20              43.0
..             ...               ...
96               1              56.0
97               2              56.0
98               1              56.0
99               1              56.0
100              0              56.0

[101 rows x 6 columns]
The largest study of sensitive subreddits that I am familiar with is Gaur et al.'s (2019) "Knowledge-Aware Assessment of Severity of Suicide Risk for Early Intervention". Let's look at the deletion rates for their study's 15 subreddits (minus r/cripplingalcoholism, which was made private in 2022). Given that Reddit now removes some messages after 90 days, let's consider the rates of action prior to that trigger.
health_stats = []
print(ROW_HEADER)
df22Anxy = pd.read_csv("reddit_20220814-20221109_Anxiety_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 Anxiety", df22Anxy))
df22BPDD = pd.read_csv("reddit_20220814-20221109_BPD_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 BPD", df22BPDD))
df22Bipt = pd.read_csv("reddit_20220814-20221109_BipolarReddit_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 BipolarReddit", df22Bipt))
df22Bips = pd.read_csv("reddit_20220814-20221109_BipolarSOs_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 BipolarSOs", df22Bips))
df22Stom = pd.read_csv("reddit_20220814-20221109_StopSelfHarm_l1000_n105.csv")
health_stats.append(print_df_stats("2022 StopSelfHarm", df22Stom))
df22Suih = pd.read_csv("reddit_20220814-20221109_SuicideWatch_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 SuicideWatch", df22Suih))
df22addn = pd.read_csv("reddit_20220814-20221109_addiction_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 addiction", df22addn))
df22asps = pd.read_csv("reddit_20220814-20221109_aspergers_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 aspergers", df22asps))
df22autm = pd.read_csv("reddit_20220814-20221109_autism_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 autism", df22autm))
df22bipr = pd.read_csv("reddit_20220814-20221109_bipolar_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 bipolar", df22bipr))
df22depn = pd.read_csv("reddit_20220814-20221109_depression_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 depression", df22depn))
df22opis = pd.read_csv("reddit_20220814-20221109_opiates_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 opiates", df22opis))
df22scha = pd.read_csv("reddit_20220814-20221109_schizophrenia_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 schizophrenia", df22scha))
df22selm = pd.read_csv("reddit_20220814-20221109_selfharm_l1000_n1000.csv")
health_stats.append(print_df_stats("2022 selfharm", df22selm))
| year subreddit | shape | del | username | removed | nr_del |
| 2022 Anxiety | (1000, 17) | 35.0% | 37.9% | 5.2% | 35.4% |
| 2022 BPD | (1000, 17) | 41.5% | 43.6% | 20.4% | 38.8% |
| 2022 BipolarReddit | (1000, 17) | 25.8% | 30.0% | 2.1% | 25.8% |
| 2022 BipolarSOs | (1000, 17) | 38.7% | 40.1% | 35.6% | 27.5% |
| 2022 StopSelfHarm | (105, 17) | 12.4% | 13.3% | 2.9% | 12.7% |
| 2022 SuicideWatch | (1000, 17) | 42.9% | 46.7% | 17.9% | 40.2% |
| 2022 addiction | (1000, 17) | 32.1% | 33.9% | 9.1% | 33.7% |
| 2022 aspergers | (1000, 17) | 30.7% | 33.8% | 8.5% | 28.4% |
| 2022 autism | (1000, 17) | 24.5% | 26.7% | 4.0% | 23.1% |
| 2022 bipolar | (1000, 17) | 32.8% | 34.6% | 25.1% | 28.2% |
| 2022 depression | (1000, 17) | 38.9% | 42.0% | 19.9% | 37.8% |
| 2022 opiates | (1000, 17) | 36.2% | 37.1% | 32.1% | 28.9% |
| 2022 schizophrenia | (1000, 17) | 32.1% | 34.1% | 35.1% | 22.7% |
| 2022 selfharm | (1000, 17) | 30.0% | 31.7% | 8.8% | 30.6% |
health_df = pd.DataFrame(
health_stats, columns=["label", "size", "del", "username", "removed", "nr_del"]
)
print(health_df.mean(axis=0, numeric_only=True))
size        936.071429
del          32.400000
username     34.678571
removed      16.192857
nr_del       29.557143
dtype: float64
computer_stats = []
print(ROW_HEADER)
df22Andd = pd.read_csv("reddit_20220820-20221110_Android_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 Android", df22Andd))
df22Datr = pd.read_csv("reddit_20220820-20221110_DataHoarder_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 DataHoarder", df22Datr))
df22appe = pd.read_csv("reddit_20220820-20221110_apple_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 apple", df22appe))
df22aude = pd.read_csv("reddit_20220820-20221110_audiophile_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 audiophile", df22aude))
df22buic = pd.read_csv("reddit_20220820-20221110_buildapc_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 buildapc", df22buic))
df22eles = pd.read_csv("reddit_20220820-20221110_electronics_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 electronics", df22eles))
df22gads = pd.read_csv("reddit_20220820-20221110_gadgets_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 gadgets", df22gads))
df22hare = pd.read_csv("reddit_20220820-20221110_hardware_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 hardware", df22hare))
df22ipad = pd.read_csv("reddit_20220820-20221110_ipad_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 ipad", df22ipad))
df22linx = pd.read_csv("reddit_20220820-20221110_linux_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 linux", df22linx))
df22macc = pd.read_csv("reddit_20220820-20221110_mac_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 mac", df22macc))
df22sysn = pd.read_csv("reddit_20220820-20221110_sysadmin_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 sysadmin", df22sysn))
df22tect = pd.read_csv("reddit_20220820-20221110_techsupport_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 techsupport", df22tect))
df22webb = pd.read_csv("reddit_20220820-20221110_web_design_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 web", df22webb))
df22wins = pd.read_csv("reddit_20220820-20221110_windows_l1000_n1000.csv")
computer_stats.append(print_df_stats("2022 windows", df22wins))
| year subreddit | shape | del | username | removed | nr_del |
| 2022 Android | (1000, 17) | 14.8% | 15.6% | 70.1% | 4.0% |
| 2022 DataHoarder | (1000, 17) | 14.6% | 15.5% | 13.3% | 12.3% |
| 2022 apple | (1000, 17) | 25.4% | 27.0% | 81.3% | 5.9% |
| 2022 audiophile | (1000, 17) | 26.2% | 26.6% | 42.7% | 17.3% |
| 2022 buildapc | (1000, 17) | 21.7% | 22.3% | 5.7% | 20.8% |
| 2022 electronics | (1000, 17) | 20.7% | 21.3% | 55.1% | 0.4% |
| 2022 gadgets | (1000, 17) | 21.1% | 21.9% | 12.0% | 11.7% |
| 2022 hardware | (1000, 17) | 20.1% | 21.0% | 44.6% | 15.2% |
| 2022 ipad | (1000, 17) | 25.3% | 26.1% | 72.6% | 18.6% |
| 2022 linux | (1000, 17) | 21.1% | 23.7% | 61.5% | 10.6% |
| 2022 mac | (1000, 17) | 18.6% | 20.4% | 4.4% | 16.4% |
| 2022 sysadmin | (1000, 17) | 11.8% | 13.1% | 9.6% | 11.8% |
| 2022 techsupport | (1000, 17) | 18.9% | 20.1% | 10.2% | 17.6% |
| 2022 web | (1000, 17) | 19.9% | 20.9% | 43.9% | 11.9% |
| 2022 windows | (1000, 17) | 19.7% | 21.9% | 44.4% | 10.8% |
computer_df = pd.DataFrame(
computer_stats, columns=["label", "size", "del", "username", "removed", "nr_del"]
)
print(computer_df.mean(axis=0, numeric_only=True))
size        1000.000000
del           19.993333
username      21.160000
removed       38.093333
nr_del        12.353333
dtype: float64
all_df = pd.concat([health_df, computer_df])
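With both frames concatenated, the category means can be compared directly; a sketch with toy rows (values made up) mirroring the columns above:

```python
import pandas as pd

toy_health = pd.DataFrame(
    {"label": ["2022 Anxiety", "2022 BPD"], "del": [35.0, 41.5], "nr_del": [35.4, 38.8]}
)
toy_computer = pd.DataFrame(
    {"label": ["2022 Android", "2022 linux"], "del": [14.8, 21.1], "nr_del": [4.0, 10.6]}
)
# Tag each frame before concatenating so groupby can compare the two categories
toy_all = pd.concat(
    [toy_health.assign(category="health"), toy_computer.assign(category="computer")]
)
print(toy_all.groupby("category")[["del", "nr_del"]].mean())
```

Even on these toy rows, the health subreddits show markedly higher deletion than the computer ones, matching the real means above.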