I Web Scraped 2 Million Substack Articles. This is What I Learnt.
We use data from millions of Substack posts and thousands of publications to determine what predicts success – from post frequency and pricing to word counts and category choice.
***Note: this post was written with the assistance of Gemini 2.5 Pro. I sourced all the data and produced the plots and model designs myself; Gemini assisted with the writing.***
TL;DR
Frequency > Length: Posting more often (even daily) seems better than writing fewer, longer posts. Consistency in *timing* isn't key, but *volume* is.
Price Matters: Higher average subscription prices strongly correlate with higher estimated revenue.
Momentum is Real: A post's likes are overwhelmingly predicted by the average likes of the previous 10 posts (the model built around this predictor explains ~86% of the variance!).
Substack Boosts the First Post: Your first post gets a *huge* boost – make it count.
Paid Post Sweet Spot: Aiming for roughly 50% paid posts appears optimal for maximizing revenue, though the relationship isn't perfectly linear.
Category Counts: Culture, US Politics, and Finance Substacks tend to have higher revenue potential, while Fiction, Philosophy, and Travel lag behind in our model. For individual posts, Comics and Health Politics see the biggest like boosts relative to the baseline (Arts).
Introduction
Substack has exploded, becoming a go-to platform for writers, journalists, and creators looking to build direct relationships with their audience and monetize their work. But with thousands of newsletters vying for attention, what actually separates the breakout hits from the ones that fizzle out?
Is it brilliant prose? Niche topics? A relentless posting schedule? Or just plain luck?
As usual, rather than relying on anecdotes, we decided to dive into the data. We scraped information from a vast number of Substack publications and their posts, connecting it with backend data on pricing, subscriber counts (where available), and post statistics. We then built a couple of models to try and decode the patterns behind Substack success:
1. Predicting Substack Revenue: What publication-level factors (price, age, frequency, category, etc.) correlate with higher estimated earnings?
2. Predicting Post Likes: What makes an individual post resonate more with readers (length, paid status, timing, category)?
Let's see what the numbers tell us.
Method
To tackle this, we gathered data on posts (like counts, word counts, publish dates, paid status) and publications (subscriber estimates, pricing plans, categories, creation dates).
Data Prep: We cleaned the data, converted currencies to USD, calculated average subscription prices, and estimated revenue based on Substack's own "Paid Rank" tiers (e.g., "Thousands of paid subscribers" ~ 1000). We focused only on non-podcast newsletter posts.
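As a rough illustration of the tier-based estimate, here is a minimal sketch. Only the "Thousands of paid subscribers" ~ 1000 mapping comes from our data prep; the other tier labels, their lower bounds, the function name, and the assumption that the average price is already an annual USD figure are all hypothetical.

```python
# Lower-bound subscriber counts per "Paid Rank" tier.
# Only the "Thousands" ~ 1000 mapping is from the post; the other
# entries are illustrative assumptions.
TIER_LOWER_BOUND = {
    "Hundreds of paid subscribers": 100,
    "Thousands of paid subscribers": 1_000,
    "Tens of thousands of paid subscribers": 10_000,
}


def estimate_annual_revenue(tier_label, avg_annual_price_usd):
    """Lower-bound annual revenue: lower-bound subscribers * average annual price."""
    subscribers = TIER_LOWER_BOUND.get(tier_label)
    if subscribers is None:
        return None  # no recognised tier, so no estimate
    return subscribers * avg_annual_price_usd


print(estimate_annual_revenue("Thousands of paid subscribers", 80.0))
```

Because price enters the estimate directly, any model that predicts this figure from price will pick up some correlation by construction.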
Substack Revenue Model: We built a linear regression model predicting the (log-transformed and standardized) lower-bound estimated annual revenue. Predictors included average price, publication age (observation period), average time between posts, variance in time between posts, percentage of paid posts, average word counts (imputed where necessary), average description length, and category.
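To make "log-transformed and standardized" concrete, here is a small sketch of that target transformation. The function name is hypothetical, and the choice of natural log and sample standard deviation is an assumption rather than a record of the exact pipeline.

```python
import math
import statistics


def log_standardize(values):
    """Take natural logs, then z-score them (mean 0, sample SD 1).

    Assumes every value is strictly positive.
    """
    logs = [math.log(v) for v in values]
    mean = statistics.fmean(logs)
    sd = statistics.stdev(logs)
    return [(x - mean) / sd for x in logs]


# Revenues a factor of 10 apart end up evenly spaced after the transform.
z_scores = log_standardize([10_000, 100_000, 1_000_000])
print([round(z, 3) for z in z_scores])
```

On this scale a coefficient reads as "standard deviations of log revenue per unit of predictor", which is why effect sizes are comparable across predictors but not directly interpretable in dollars.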
Post Likes Model: We built another linear regression model, this time predicting the (log-transformed) number of likes (reactions) on a post. Predictors included word count, description length, paid status, category, whether it was the *first* post, and crucially, the moving average of likes from the previous 10 posts.
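The moving-average predictor can be sketched like this. It assumes the likes are listed in chronological order for a single publication, averages whatever history exists when there are fewer than 10 earlier posts, and returns `None` for the very first post; those edge-case choices and the function name are assumptions, not a description of the exact code we ran.

```python
def prior_moving_average(likes, window=10):
    """For each post, the mean likes of up to `window` posts that came before it."""
    averages = []
    for i in range(len(likes)):
        previous = likes[max(0, i - window):i]  # excludes the current post
        averages.append(sum(previous) / len(previous) if previous else None)
    return averages


print(prior_moving_average([10, 20, 30, 40], window=2))
```

The key design point is that the current post's own likes never enter its predictor, so the feature only uses information available before publication.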
We used standard statistical techniques, including log transformations to handle skewed data (like revenue and likes) and imputation for missing values. The goal wasn't perfect prediction but identifying significant drivers.
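A minimal sketch of those two steps follows. Median imputation and `log1p` (log of 1 + x, so that zero likes stay defined) are assumptions chosen for illustration; the post does not specify which imputation method or log variant was used.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def log1p_all(values):
    """log(1 + x) for each value, which compresses the long right tail."""
    return [math.log1p(v) for v in values]


word_counts = impute_median([800, None, 1200])
print(word_counts)
print(log1p_all([0, 9, 99]))
```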
Results: What Drives Substack Success?
Predicting Which Substacks Earn More
Our model looking at publication-level revenue (Adjusted R-squared: 0.314, r=0.56, meaning it explains about 31.4% of the variance) revealed several significant factors:
**Key Takeaways for Substacks:**
**Charge More:** Price remains a powerful lever. Note, however, that estimated revenue is itself computed from price, so part of this correlation is built in by construction.
**Post Often:** Reducing the average time between posts (posting more frequently) still shows a strong positive association with revenue.
**Paid Percentage:** The positive correlation holds – more paid posts generally link to higher revenue in the model, though the visual plots (below) still suggest a potential curve peaking around 50-60%.
**Word Count:** Longer posts (both free and paid, after log transformations) still show a *statistically significant* positive correlation with revenue, but the effect sizes are smaller than before. Frequency likely remains more impactful than length alone.
**Consistency? Maybe Not:** The slight positive correlation for *more* variance in posting intervals persists. Frequency seems to matter more than rigid timing.
**Category Matters Too:**
When we included categories in the model, the relative revenue potential (compared to the Arts & Entertainment baseline) showed this pattern:
Culture and US Politics remain top categories for revenue potential in this model, while Fiction and Philosophy show significantly lower potential.
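For readers unfamiliar with what "compared to a baseline" means in a regression, here is a sketch of dummy coding with a reference category. The function name and the short category list are hypothetical; the point is only that the baseline gets no column of its own, so every category coefficient is read as a difference from it.

```python
def dummy_code(category, categories, baseline):
    """One 0/1 indicator per non-baseline category."""
    return {c: int(c == category) for c in categories if c != baseline}


categories = ["Arts & Entertainment", "Culture", "Fiction"]
print(dummy_code("Fiction", categories, baseline="Arts & Entertainment"))
print(dummy_code("Arts & Entertainment", categories, baseline="Arts & Entertainment"))
```

A publication in the baseline category has all indicators at zero, so its predicted value comes from the intercept and the other predictors alone.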
**Visualizing the Trends:**
The visual patterns remain largely the same:
*Posting Interval vs. Revenue:* Shorter intervals trend higher.
*Average Price vs. Revenue:* Strong positive correlation.
*Substack Age vs. Revenue:* Older Substacks tend slightly higher.
*Percent Paid Posts vs. Revenue:* Linear trend up, but smoothed curve suggests a peak.
*Word Count vs. Revenue:* Positive trend, now reflecting the corrected word count scale.
**Top Earning Substacks (Based on Estimates):**
Predicting Post Popularity (Likes)
Our second model looked at factors predicting the number of likes (log-transformed) a post receives. This model was incredibly predictive (R-squared: 0.859)!
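For reference, R-squared is the share of variance in the (log) likes that the predictions account for. The sketch below shows the standard formulas; the function names are ours and the sample numbers are made up purely to exercise the code.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Toy values only: perfect predictions give 1.0, predicting the mean gives 0.0.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```

With millions of posts and a handful of predictors, the adjustment is negligible, so plain and adjusted R-squared will be nearly identical for a model like this one.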
**Key Takeaways for Posts:**
**Momentum is Everything:** The single biggest predictor by far is the moving average of likes on the previous 10 posts (`MA_10_posts`). Success breeds success. If your recent posts did well, your next one likely will too. The full model, dominated by this predictor, explains ~86% of the variance!
**Nail Your First Post:** The `first_post` coefficient is enormous. Your very first post gets a massive visibility boost (algorithmic or otherwise). Don't waste it!
**Paywall Hurts Likes:** Unsurprisingly, putting a post behind a paywall significantly reduces its like count. This is the trade-off for monetization.
**Length & Description:** Longer posts get slightly *more* likes, while posts with slightly *shorter* descriptions do better. Keep the summary punchy?
**Category Effects:** Comics and Health Politics posts tend to get more likes than average, while categories like Business, US Politics, and Technology get fewer, holding other factors constant.
**Visualizing Post Popularity:**
*Moving Average vs. Actual Likes:* The relationship is incredibly tight. Past performance is the best predictor of future performance.
*Model Predictions vs. Actual Likes:* Our model tracks actual likes very well, especially given the dominance of the moving average predictor.
**Most Liked Posts (Raw Counts):**
(If you want to copy+paste the links, visit the table on my site)
**Best Performing Posts (Relative to Model):**
(If you want to copy+paste the links, visit the table on my site)
These might represent posts with particularly viral topics, exceptional writing, or perhaps successful off-platform promotion.
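One way to produce a "relative to model" ranking is to sort posts by their residual: actual log likes minus predicted log likes. The sketch below assumes that definition; the function name and the toy post IDs and values are hypothetical.

```python
def rank_by_residual(posts):
    """Sort posts by how far actual log likes exceed the model's prediction.

    `posts` is a list of (post_id, actual_log_likes, predicted_log_likes).
    """
    scored = [(post_id, actual - predicted) for post_id, actual, predicted in posts]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy values for illustration only.
toy_posts = [("a", 5.0, 4.0), ("b", 6.0, 3.0), ("c", 2.0, 2.5)]
print(rank_by_residual(toy_posts))
```

Because the residual is on the log scale, a gap of 1.0 means roughly 2.7 times as many likes as predicted, regardless of how large the publication is.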
Conclusion: Key Strategies for Substack Growth
Synthesizing the updated model results and observations:
**Post Frequently:** Still appears highly beneficial for revenue. Volume likely trumps perfectionism.
**Build Momentum:** Crucial for post visibility, given the massive impact of past likes.
**Optimize Your First Post:** Leverage that initial algorithmic (?) boost.
**Price Strategically:** Higher prices strongly correlate with higher revenue potential. Note the huge price difference between `nextplayinvesting` and `heathercoxrichardson` despite similar estimated revenue tiers – audience size and willingness to pay interact in complex ways.
**Balance Free vs. Paid:** The ~50% paid post mark still looks like a reasonable target based on visual inspection of the plots, despite the linear model showing a positive coefficient overall.
**Consider Your Category:** Significant differences in revenue potential exist between categories.
**Word Count Matters (a little):** Longer posts have a small, positive association with revenue, but don't sacrifice frequency for extreme length.
Ultimately, data provides patterns, not guarantees. Quality content and audience connection are paramount. However, understanding these underlying dynamics can help you make more informed decisions as you navigate the Substack landscape. Good luck!
---
*Disclaimer: This analysis is based on publicly available data and statistical modeling. Revenue figures are estimates based on Substack's tiers and may not reflect actual earnings.*
It would be interesting to use AI, perhaps with some kind of validator, to judge each writer's average complexity or writing IQ. For example, validate the classifier by showing a strong correlation with the aggregate human-judged complexity of a random subset of articles. This could replicate some old data I have, where I asked my audience to rate the average content complexity of several different Substack authors in the right-wing politics/intellectual sphere and found a strong negative correlation between sophistication and popularity.
> Higher average subscription prices strongly correlate with higher estimated revenue.
What does this mean? I haven't read the rest of the post yet.