Mark Nottingham

Web Feeds in 2026: A Survey

Sunday, 10 May 2026

Web Feeds

A couple of weeks ago, I made a straw-man proposal for a new Web feed autodiscovery mechanism. I got some encouragement and some pushback (usual for this sort of thing). One of the issues raised concerned internationalisation – a few people said that I should support multiple languages in the format, rather than relying on HTTP content negotiation.

I felt my approach was adequate, given the specifics of the use case. However, it bugged me: I didn’t have data to back me up – AFAICT there’s no significant information about how feeds are used on the Web today.

I realised that it didn’t have to stay that way. Two things helped: a friend at Common Crawl, who assured me that they do indeed crawl feeds, and AI. I don’t have nearly enough time to learn the ins and outs of CC dumps, map/reduce, and the assorted data science bits, but I can babysit a couple of agents1 through it while I do other things. So I did.
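
If you’re curious what the plumbing looks like, here’s a minimal sketch of pulling feed-like responses out of a single Common Crawl WARC file. It is not the actual pipeline used for the survey; it assumes the warcio package and a segment that has already been fetched locally (the file name is hypothetical).

```python
# A minimal sketch (not the survey's actual pipeline) of pulling feed-like
# responses out of one Common Crawl WARC file, assuming the `warcio`
# package is installed and a segment has already been fetched locally.
from warcio.archiveiterator import ArchiveIterator

FEEDISH_TYPES = ("application/rss+xml", "application/atom+xml",
                 "application/xml", "text/xml")

def feed_responses(warc_path):
    """Yield (url, content_type) for responses whose Content-Type looks feed-ish."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            ctype = (record.http_headers.get_header("Content-Type") or "").lower()
            if any(t in ctype for t in FEEDISH_TYPES):
                yield record.rec_headers.get_header("WARC-Target-URI"), ctype

if __name__ == "__main__":
    # Hypothetical local file name; real segment paths come from the crawl index.
    for url, ctype in feed_responses("CC-MAIN-sample.warc.gz"):
        print(ctype, url)
```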

You can peruse the full (inaugural?) survey report. Below are the major takeaways from my perspective.

Web Feeds are Still a Thing…

In the top 500,000 web sites (per Tranco) seen by Common Crawl, the run analyzed 196,598 registrable sites and found 303,790 parseable feeds. 35.9% of sites exposed feed autodiscovery, and 19.7% of analyzed HTML responses had feed links.

That’s huge – more than a third of sites offering some sort of feed is a big statement about the nature of the Open Web.

… But a LOT of Them are Abandoned.

High-quality feeds are a minority. Using a quality metric that considers feed recency, content, and metadata, 57,995 feeds scored above 0.5 – only 19.1% of parsed feeds. Only 100,643 feeds (33.1% of parsed feeds) had any freshness signal within a 365-day cutoff, and only 67,997 were both fresh and had entries.
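
To make “a quality metric that considers recency, content, and metadata” a bit more concrete, here’s a hypothetical sketch of how those signals might fold into a 0–1 score. It is not the formula the survey actually uses; it just shows why a stale, empty feed lands well under a 0.5 bar.

```python
# A hypothetical 0-1 quality score combining recency, content, and metadata.
# This is NOT the survey's formula; it only illustrates the kind of signals
# such a metric might weigh.
from datetime import datetime, timezone

def feed_quality(last_updated, entry_count, has_title, has_description):
    now = datetime.now(timezone.utc)
    age_days = (now - last_updated).days if last_updated else 10_000
    recency = max(0.0, 1.0 - age_days / 365)        # 1.0 today, 0.0 after a year
    content = min(entry_count, 10) / 10             # saturates at ten entries
    metadata = (int(has_title) + int(has_description)) / 2
    return 0.5 * recency + 0.3 * content + 0.2 * metadata

# A feed with full metadata but no entries and no update in two years
# scores 0.2 here, nowhere near a 0.5 threshold.
print(feed_quality(datetime(2024, 1, 1, tzinfo=timezone.utc), 0, True, True))
```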

So there are a lot of abandoned feeds on the Web.

I suspect a major contributor is the tendency of content management systems to automatically expose feeds, even when publisher customisation means those feeds no longer reflect the useful parts of the site. For example, only 19.3% of feeds that we could fingerprint as being created by WordPress exceeded our quality bar (a measure of recency, content, and metadata). Drupal was a bit higher but still very low at 24.9%, and for Blogger, it was an abysmal 3.5%.

In other words, when you go to a site hosted by a popular CMS, chances are uncomfortably high that the feed it creates will be stale, empty, or otherwise not very useful. The takeaway here is simple but urgent: CMS software should not silently create and advertise feeds that publishers never see or maintain; feeds should be visible, testable, and consciously enabled. Fixing that in the next release of a couple of platforms could dramatically increase the quality of feeds on the Web in a short time.

Autodiscovery is Not a Quality Signal.

17.5% of feeds had HTML feed autodiscovery pointing at them, but those links don’t necessarily lead to higher-quality feeds. Although autodiscovered feeds showed a slight bump in measured quality, the mean was still low – 0.251, vs. 0.179 for feeds without autodiscovery.

This compromises autodiscovery as a user-facing affordance; if people experience lots of stale or zero-entry feeds when they use autodiscovery information (and that is my personal experience!), they won’t rely upon it.

That failure is why I made a proposal for a new feed autodiscovery mechanism, with a prototype implementation as an extension for major browsers. The reasoning is straightforward: if autodiscovery is more deliberate and in a central place on the site, it has a better chance of leading to working, useful feeds.

An aside: autodiscovery overwhelmingly means rel=alternate. Feed autodiscovery via rel=alternate appeared on 81.94M pages; rel=feed was tiny by comparison at 12,793 pages. WHATWG should deprecate the feed link relation; the cowpath is well and truly paved.
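
For anyone who hasn’t written one lately, this is what conventional rel=alternate autodiscovery looks like in a page’s head, along with a minimal, standard-library-only way a reader might extract it. It’s a sketch; production feed readers are far more forgiving.

```python
# A sketch of conventional rel=alternate autodiscovery: the markup a page
# carries in its <head>, and a minimal (standard-library-only) way a reader
# might extract it. Production feed readers are far more forgiving.
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if "alternate" in rels and (a.get("type") or "").lower() in FEED_TYPES:
            self.feeds.append((a.get("type"), a.get("href")))

page = """<html><head>
  <link rel="alternate" type="application/atom+xml"
        title="Example Feed" href="/feed.atom">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(page)
print(finder.feeds)   # [('application/atom+xml', '/feed.atom')]
```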

Most Feeds Parse.

The run checked 311,382 feed URLs and parsed 303,790 RSS/Atom feeds, a 97.6% parse success rate. So broken XML exists, but outright parse failure is not the main quality problem.

This was a major concern in the early RSS and Atom days; XML was still new, it was complex,2 and implementations – both of XML parsers and feed software – weren’t quite baked. From what we see here, this doesn’t seem to be a concern any more.

The biggest problem – over 52% of the errors encountered – was ‘XML declaration allowed only at the start of the document’. Next at about 13% was ‘EntityRef: expecting ‘;’’ and then ‘CData section not finished’ at about 8%.
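
For illustration, here’s a quick local check for that most common failure – anything emitted before the XML declaration (a BOM, whitespace, stray template output) invalidates it. It’s a sketch against a locally saved feed file, not a substitute for a real validator.

```python
# A quick local check for the most common failure above: anything emitted
# before the XML declaration makes it invalid. A sketch only - it is no
# substitute for a real feed validator, which checks much more than
# well-formedness.
import sys
import xml.etree.ElementTree as ET

def check_feed(path):
    raw = open(path, "rb").read()
    body = raw.lstrip(b"\xef\xbb\xbf \t\r\n")
    if body.startswith(b"<?xml") and body != raw:
        print("warning: bytes before the XML declaration")
    try:
        ET.fromstring(raw)
        print("well-formed")
    except ET.ParseError as e:
        print("parse error:", e)

if __name__ == "__main__":
    check_feed(sys.argv[1])
```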

So, yes, sites should still use feed validators. As an ecosystem, however, we shouldn’t worry too much about this aspect of feed quality.

RSS and Atom Co-Exist.

Of 303,790 parsed feeds, about 200k were RSS-family and 104k were Atom. RSS 2.0 alone was 181,975 feeds. There are minimal quality differences between them. So the feed wars rage on more than twenty years later – or, alternatively, no one really cares.

The takeaway for sites is simple: Don’t over-fixate on choosing a format. Choose one – either will do – and don’t double up the feeds. Consuming software will support both.

One thing I specifically tried to check for was pages that advertised both Atom and RSS versions of the same feed. From what we can see here, that isn’t happening much; only seven sampled pages appeared to be doing this.

Feeds are Monolingual.

Finally, to the question that sparked this for me – 19.2% of feeds have HTTP Content-Language, while almost exactly 50% have feed-level language information (e.g., dc:language, xml:lang, or an RSS language tag). However, only 1.2% (3,571 feeds) have entry-level language information, and only 2,527 feeds (0.8% of parsed feeds) showed multiple entry languages.
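
For reference, here’s a deliberately simplistic, standard-library-only sketch of where those in-document signals live – <language> and dc:language in RSS channels, xml:lang attributes in Atom – while HTTP Content-Language travels on the response headers rather than in the document.

```python
# A sketch (standard library only, deliberately simplistic) of where the
# in-document language signals live: <language>/dc:language in RSS channels,
# xml:lang attributes in Atom. HTTP Content-Language, by contrast, is on the
# response headers, not in the document.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DC = "{http://purl.org/dc/elements/1.1/}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def feed_languages(xml_bytes):
    root = ET.fromstring(xml_bytes)
    langs = set()
    if root.tag == ATOM + "feed":                     # Atom: xml:lang attributes
        for el in [root] + root.findall(ATOM + "entry"):
            if el.get(XML_LANG):
                langs.add(el.get(XML_LANG))
    else:                                             # RSS: <language> or dc:language
        channel = root.find("channel")
        if channel is not None:
            for el in channel.findall("language") + channel.findall(DC + "language"):
                if el.text:
                    langs.add(el.text.strip())
    return langs

rss = b"""<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example</title><language>en-au</language>
</channel></rss>"""
print(feed_languages(rss))   # {'en-au'}
```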

So, based upon the current Web, mixing languages in a feed isn’t a widespread practice.

A Few Words About Methodology

Like all statistics, these should be taken with a grain (or three) of salt. The code is AI-written (although guided by me, so I get to take the blame). I have not reviewed every line of code; this is a side project. Common Crawl doesn’t see the whole Web (although the open Web nature of feeds means it’s a good match!). I filtered to the Tranco top 500,000 sites to reduce the influence of very low-quality domains. The run completed 44,281 WARCs and skipped/failed 71, so the missing-WARC rate was about 0.16%. I’m sure there are many more caveats, but you get the idea. Take a look at the code and file an issue if you see something.

  1. First Gemini, then Claude, then Codex – in decidedly increasing order of effectiveness.

  2. Still is.