Scraping 2 Million Substack Articles

What predicts Substack success: pricing, post frequency, word count, and category, from millions of posts and thousands of publications.

Apr 05, 2025


TL;DR

Frequency > Length: Posting more often (even daily) seems better than writing fewer, longer posts. Consistency in timing isn't key, but volume is.
Price Matters: Higher average subscription prices strongly correlate with higher estimated revenue.
Momentum is Real: A post's likes are overwhelmingly predicted by the average likes of the previous 10 posts (explaining ~86% of the variance!).
Substack Boosts the First Post: Your first post gets a huge boost – make it count.
Paid Post Sweet Spot: Aiming for roughly 50% paid posts appears optimal for maximizing revenue, though the relationship isn't perfectly linear.
Category Counts: Culture, US Politics, and Finance Substacks tend to have higher revenue potential, while Fiction, Philosophy, and Travel lag behind in our model. For individual posts, Comics and Health Politics see the biggest like boosts relative to the baseline (Arts).

Introduction

Substack has exploded, becoming a go-to platform for writers, journalists, and creators looking to build direct relationships with their audience and monetize their work. But with thousands of newsletters vying for attention, what actually separates the breakout hits from the ones that fizzle out?

Is it brilliant prose? Niche topics? A relentless posting schedule? Or just plain luck?

As usual, rather than relying on anecdotes, we decided to dive into the data. We scraped information from a vast number of Substack publications and their posts, connecting it with backend data on pricing, subscriber counts (where available), and post statistics. We then built a couple of models to try and decode the patterns behind Substack success:

Predicting Substack Revenue: What publication-level factors (price, age, frequency, category, etc.) correlate with higher estimated earnings?
Predicting Post Likes: What makes an individual post resonate more with readers (length, paid status, timing, category)?

Let's see what the numbers tell us.

Method

To tackle this, we gathered data on posts (like counts, word counts, publish dates, paid status) and publications (subscriber estimates, pricing plans, categories, creation dates).

Data Prep: We cleaned the data, converted currencies to USD, calculated average subscription prices, and estimated revenue based on Substack's own "Paid Rank" tiers (e.g., "Thousands of paid subscribers" ~ 1000). We focused only on non-podcast newsletter posts.
Substack Revenue Model: We built a linear regression model predicting the (log-transformed and standardized) lower-bound estimated annual revenue. Predictors included average price, publication age (observation period), average time between posts, variance in time between posts, percentage of paid posts, average word counts (imputed where necessary), average description length, and category.
Post Likes Model: We built another linear regression model, this time predicting the (log-transformed) number of likes (reactions) on a post. Predictors included word count, description length, paid status, category, whether it was the first post, and crucially, the moving average of likes from the previous 10 posts.
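To make the revenue estimate from the Data Prep step concrete, here is a minimal sketch of how a lower-bound figure could be computed from a "Paid Rank" tier and an average price. Only the "Thousands of paid subscribers" ~ 1000 mapping comes from the description above. The other tier labels, the function name, and the assumption that the average price is an annualized USD amount are illustrative, and this is not necessarily the exact calculation behind the analysis.

```python
# Hypothetical mapping from "Paid Rank" labels to a lower-bound subscriber count.
# Only the "Thousands" -> 1000 entry is stated in the article; the rest are
# assumed here for illustration.
TIER_LOWER_BOUND = {
    "Hundreds of paid subscribers": 100,
    "Thousands of paid subscribers": 1_000,
    "Tens of thousands of paid subscribers": 10_000,
}


def estimate_annual_revenue(tier_label, avg_annual_price_usd):
    """Lower-bound annual revenue: minimum subscribers in the tier times price.

    Assumes avg_annual_price_usd is already annualized and converted to USD.
    Returns None when the publication has no recognizable tier label.
    """
    lower_bound = TIER_LOWER_BOUND.get(tier_label)
    if lower_bound is None:
        return None
    return lower_bound * avg_annual_price_usd


print(estimate_annual_revenue("Thousands of paid subscribers", 80.0))  # 80000.0
```

Because the subscriber count is only known to an order of magnitude, price ends up driving most of the variation in the estimate within a tier.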

We used standard statistical techniques, including log transformations to handle skewed data (like revenue and likes) and imputation for missing values. The goal wasn't perfect prediction but identifying significant drivers.
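As a rough illustration of those preprocessing steps, the sketch below imputes missing word counts and log-transforms and standardizes a skewed revenue column. The article does not say which imputation method or which log variant was used, so median imputation and log(1 + x) are assumptions, and the numbers are made up.

```python
import math
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values (assumed method)."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def log_standardize(values):
    """Apply log(1 + x), then rescale to mean 0 and sample standard deviation 1."""
    logs = [math.log1p(v) for v in values]
    mean = statistics.fmean(logs)
    sd = statistics.stdev(logs)
    return [(x - mean) / sd for x in logs]


# Made-up example data, not taken from the scraped dataset.
word_counts = [850, None, 1200, 400, None, 3000]
revenue = [5_000, 80_000, 1_200_000, 20_000, 300_000, 50_000]

filled = impute_median(word_counts)
z_revenue = log_standardize(revenue)

print(filled)
print([round(z, 2) for z in z_revenue])
```

After the transform, a coefficient in the revenue model reads as a change in standard deviations of log-revenue rather than in dollars.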

Results: What Drives Substack Success?

Predicting Which Substacks Earn More

Our model looking at publication-level revenue (Adjusted R-squared: 0.314, r=0.56, meaning it explains about 31.4% of the variance) revealed several significant factors:
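For readers who want to check how the two fit statistics relate, here is a small sketch. The r value is simply the square root of the R-squared figure, and adjusted R-squared penalizes plain R-squared for the number of predictors. The sample size and predictor count in the second example are placeholders, since the article does not report them.

```python
import math


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: shrinks R-squared according to the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Correlation between predicted and actual values implied by R-squared = 0.314.
r = math.sqrt(0.314)
print(round(r, 2))  # 0.56

# Placeholder n and p, purely to show the size of the adjustment.
print(round(adjusted_r2(0.32, n_obs=5000, n_predictors=40), 3))
```

With thousands of publications and a few dozen predictors, the adjustment is small, so the two versions of R-squared are nearly interchangeable here.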

Table 1: Price (β=0.47) and posting frequency (β=-0.30 on interval) dominate revenue prediction; word count effects are modest (β≈0.05–0.08).

(Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1. Estimates represent the change in standardized log-revenue for a one-unit change in the predictor. Categories are compared relative to the Arts & Entertainment baseline in the full model.)

Key Takeaways for Substacks:

Charge More: Price is a powerful lever. Note, however, that estimated revenue is itself computed from price, so part of this correlation exists by construction.
Post Often: Reducing the average time between posts (posting more frequently) shows a strong positive association with revenue.
Paid Percentage: More paid posts generally link to higher revenue in the model, though the visual plots (below) suggest a potential curve peaking around 50-60%.
Word Count: Longer posts (both free and paid, after log transformations) show a statistically significant positive correlation with revenue, but the effect sizes are small. Frequency is likely more impactful than length alone.
Consistency? Maybe Not: More variance in posting intervals shows a slight positive correlation with revenue. Frequency seems to matter more than rigid timing.

Category Matters Too:

When we included categories in the model, the relative revenue potential (compared to the Arts & Entertainment baseline) showed this pattern:

Table 2: Culture and U.S. Politics earn ~1.0 log-units above Arts baseline, while Fiction (-0.37) and Philosophy (-0.20) lag significantly.

(Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1)

Culture and US Politics remain top categories for revenue potential in this model, while Fiction and Philosophy show significantly lower potential.

Visualizing the Trends:

The plots tell largely the same story as the regression:

Posting Interval vs. Revenue: Shorter intervals trend higher.

Figure 1: Mean posting interval vs. estimated revenue. Publications that post more often (shorter intervals) earn substantially more.

Average Price vs. Revenue: Strong positive correlation.

Figure 2: Average subscription price vs. estimated revenue. Higher pricing correlates strongly with higher earnings, partly by construction of the revenue estimate.

Substack Age vs. Revenue: Older Substacks tend slightly higher.

Figure 3: Publication age vs. estimated revenue. Longer-running newsletters earn modestly more, reflecting compounding audience growth.

Percent Paid Posts vs. Revenue: Linear trend up, but smoothed curve suggests a peak.

Figure 4: Share of paywalled posts vs. estimated revenue. Revenue rises with paid share but the smoothed fit peaks near 50–60%.
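A positive linear coefficient and a curve that peaks mid-range are not contradictory, and the effect is easy to reproduce. The sketch below uses made-up data shaped like an inverted U peaking at a 55% paid share (not the scraped data) and fits a straight line through it. The helper name ols_slope is ours.

```python
import statistics

# Hypothetical inverted-U relationship peaking at a paid share of 0.55.
shares = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
revenue = [1.0 - 4.0 * (s - 0.55) ** 2 for s in shares]


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


slope = ols_slope(shares, revenue)
peak_share = shares[max(range(len(shares)), key=lambda i: revenue[i])]

print(f"linear slope: {slope:.2f}")  # positive, despite the downturn after the peak
print(f"share with highest revenue: {peak_share:.2f}")
```

Because the peak sits slightly past the midpoint, the straight-line fit still slopes upward, which is why the linear model can report a positive coefficient while the smoothed plot turns down at high paid shares.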

Word Count vs. Revenue: Positive but shallow trend.

Figure 5: Average post word count vs. estimated revenue. The slope is positive but shallow; length helps less than frequency or price.

Top Earning Substacks (Based on Estimates):

Audience size first:

Table 3a: Audience size for top estimated earners — note the order-of-magnitude spread in free subscribers (from ~1k for niche premium to 2.4M for mass-market politics).

And the corresponding monetization for each:

Table 3b: Monetization for the same earners. Two paths lead to ~$30M revenue: premium pricing on small lists vs. cheap subs on huge lists.

(Note: Revenue is estimated based on Substack's tiers and average pricing; actual figures may vary. Paid subscriber estimates are based on order of magnitude.)

Predicting Post Popularity (Likes)

Our second model looked at factors predicting the number of likes (log-transformed) a post receives. This model was incredibly predictive (R-squared: 0.859)!

Table 4: Past performance dominates likes prediction (β=0.94 for prior-10 moving average), with a large first-post bonus (β=1.64) and a paywall penalty (β=-0.11).

(Note: Categories also included, showing varied effects relative to Arts baseline. Comics +0.15, Health Politics +0.06, Business -0.09, US Politics -0.06)

Key Takeaways for Posts:

Momentum is Everything: The single biggest predictor by far is the moving average of likes on the previous 10 posts (MA_10_posts). Success breeds success. If your recent posts did well, your next one likely will too. This explains ~86% of the variance alone!
Nail Your First Post: The first_postTRUE coefficient is enormous. Your very first post gets a massive visibility boost (algorithmic or otherwise). Don't waste it!
Paid Wall Hurts Likes: Unsurprisingly, putting a post behind a paywall significantly reduces its like count. This is the trade-off for monetization.
Length & Description: Longer posts get slightly more likes, while posts with slightly shorter descriptions do better. Keep the summary punchy?
Category Effects: Comics and Health Politics posts tend to get more likes than average, while categories like Business, US Politics, and Technology get fewer, holding other factors constant.
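Since the likes model predicts log-likes, its coefficients are easier to read once converted into multipliers. The sketch below does that conversion for a few of the coefficients reported above. It assumes a natural-log transform of likes; if the model actually used log(1 + likes), the multipliers are only approximate for posts with very few likes. The function name is ours.

```python
import math


def like_multiplier(beta):
    """Multiplicative change in likes implied by a coefficient on natural-log likes."""
    return math.exp(beta)


# Coefficients as reported for the likes model.
coefficients = {
    "first post": 1.64,
    "paid post": -0.11,
    "Comics (vs. Arts)": 0.15,
}

for name, beta in coefficients.items():
    multiplier = like_multiplier(beta)
    print(f"{name}: x{multiplier:.2f} ({(multiplier - 1) * 100:+.0f}%)")
```

Read this way, the first-post coefficient corresponds to roughly five times the likes of a comparable later post, the paywall to about 10% fewer likes, and Comics to about 16% more than the Arts baseline, all else held equal.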

Visualizing Post Popularity:

Moving Average vs. Actual Likes: The relationship is incredibly tight. Past performance is the best predictor of future performance.

Figure 6: 10-post moving average of likes vs. current-post likes. The near-linear relationship explains roughly 86% of the variance; momentum dominates.
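For anyone who wants to build the same momentum feature for their own posts, here is a minimal sketch of a moving average over the previous posts only, so the current post never leaks into its own predictor. It assumes likes are listed in chronological order for a single publication and that early posts use whatever history exists; the article does not specify how partial windows were handled.

```python
def prior_moving_average(likes, window=10):
    """Mean likes of up to `window` posts published before each post.

    The current post is excluded. The first post has no history, so it gets None.
    """
    averages = []
    for i in range(len(likes)):
        previous = likes[max(0, i - window):i]
        averages.append(sum(previous) / len(previous) if previous else None)
    return averages


# Tiny example with a window of 2 so the arithmetic is easy to follow.
likes = [12, 30, 18, 40]
print(prior_moving_average(likes, window=2))  # [None, 12.0, 21.0, 24.0]
```

The None on the first post is also why a separate first-post indicator is needed in the model: that post has no momentum value to lean on.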

Model Predictions vs. Actual Likes: Our model tracks actual likes very well, especially given the dominance of the moving average predictor.

Figure 7: Predicted vs. actual log-likes. Points hug the identity line, confirming the model's R-squared of 0.86 holds across the full like distribution.

Most Liked Posts (Raw Counts):

Table 5: Top-liked posts skew heavily political/resignation-themed; Heather Cox Richardson tops the list at 37k likes, nearly double the next entry.

Best Performing Posts (Relative to Model):

Table 6: Posts with the largest positive residuals (≈6+ log-units above prediction), likely viral breakouts driven by topic or off-platform amplification.

These might represent posts with particularly viral topics, exceptional writing, or perhaps successful off-platform promotion.
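To put a residual of about 6 log-units in perspective, the sketch below computes a residual as actual minus predicted log-likes and converts it back to a ratio. It assumes natural logs, and the example numbers are invented rather than taken from the table.

```python
import math


def log_residual(actual_likes, predicted_log_likes):
    """Residual on the log scale: how far actual likes exceed the prediction."""
    return math.log(actual_likes) - predicted_log_likes


# Invented example: the model expects about 7 likes (log-likes of 2.0),
# but the post receives 3,000.
residual = log_residual(3000, 2.0)
ratio = math.exp(residual)

print(f"residual: {residual:.2f} log-units")
print(f"actual / predicted: about {ratio:.0f}x")
```

A residual of 6 corresponds to a post collecting roughly 400 times the likes the model expected, which is why these outliers look more like one-off viral events than like a repeatable pattern.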

Conclusion: Key Strategies for Substack Growth

Synthesizing the model results and observations:

Post Frequently: Still appears highly beneficial for revenue. Volume likely trumps perfectionism.
Build Momentum: Crucial for post visibility, given the massive impact of past likes.
Optimize Your First Post: Leverage that initial algorithmic (?) boost.
Price Strategically: Higher prices strongly correlate with higher revenue potential. Note the huge price difference between nextplayinvesting and heathercoxrichardson despite similar estimated revenue tiers – audience size and willingness to pay interact complexly.
Balance Free vs. Paid: The ~50% paid post mark still looks like a reasonable target based on visual inspection of the plots, despite the linear model showing a positive coefficient overall.
Consider Your Category: Significant differences in revenue potential exist between categories.
Word Count Matters (a little): Longer posts have a small, positive association with revenue, but don't sacrifice frequency for extreme length.

Ultimately, data provides patterns, not guarantees. Quality content and audience connection are paramount. However, understanding these underlying dynamics can help you make more informed decisions as you navigate the Substack landscape. Good luck!


Leon Voß

Apr 5, 2025 (edited)

It would be interesting to use AI, perhaps with some kind of validator, to judge each writer's average complexity or writing IQ. For example, validate the classifier by showing a strong correlation with aggregate human-judged complexity on a random subset of articles. This could replicate some old data I have, where I asked my audience to rate the average content complexity of several different Substack authors in the right-wing politics/intellectual sphere and found a strong negative correlation between sophistication and popularity.
