Saving the Open Web

AI and other advancements that can help save the open web

Jul 31, 2023

Tanay’s Newsletter is a weekly newsletter about the business of the technology industry.

Hi friends!

Last week, I wrote about how AI might further the decline of the open web. But all hope isn’t lost. This week, I’ll discuss some things underway that could help save the open web, including AI.

1/ Decentralization and the Fediverse

Even though the web has been open, much of the activity has happened on a few centralized platforms such as Twitter and Facebook. Users on these platforms have always been at risk of being de-platformed, and thereby losing both their audience and their voice.

Today, more platforms have started supporting the concept of a decentralized Fediverse, where users can leave a platform but take their audience (their social graph) with them, and where posts written on one platform can be read by people on other platforms that are part of that Fediverse.

A 5-minute tour of the Fediverse | Opensource.com

The most notable examples are Mastodon and Instagram’s Threads; the latter intends to support this soon (though its user numbers are dropping quickly after a splashy launch).

2/ Growth of Open Protocols

Another promising trend is the continued growth of podcasting and email newsletters. In both cases, users can pick the app in which they receive content and are not necessarily subject to algorithms from centralized entities [1].

The number of free newsletters and podcasts (and indeed of people writing newsletters) has grown significantly over the past few years, making a wealth of information accessible via open protocols for free. Companies such as Apple, Spotify, Substack, and Beehiiv have certainly helped make this the case, although the former two have also tried to make podcasting more closed in some ways, for example through gated podcasts. But by and large, the growth of these formats has led to a more open web.

US Podcast Listener Numbers (2022–2026) [Updated Jan 2023]

A related theme is that even as Twitter, Discord, Reddit and others may be moving towards being more closed, other UGC platforms such as TikTok and Threads, and community platforms such as Outverse, are public in that their content can be indexed by search engines.

3/ LLMs making more content accessible

One of the more interesting ways the web could become more open is through the use of LLMs. Specifically, there are two ways I envision LLMs helping here:

A. Making it easier to consume/access the vast knowledge on the internet:

There’s so much great content available in video and audio form that traditional search engines haven’t handled well in the past. LLMs could make it possible to actually leverage the untapped knowledge in podcasts, YouTube, TikTok, etc., and deliver the right subset at the right time in response to a specific question put to a chatbot (or whatever the interface might be). Three things make this practical: better speech-to-text models, the growth of techniques such as RAG [2] that use models to answer questions over arbitrary datasets, and better infrastructure for embedding internet content and then storing and retrieving those embeddings. Together, they mean that a world where we can actually tap all the information in these formats is already here.

A toy example of what this might look like on the Huberman podcast knowledgebase (via Addcontext)
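
To make the retrieval step concrete, here is a minimal Python sketch of the idea, not how Addcontext or any production system works. It assumes the audio has already been transcribed into text chunks (the snippets below are made up), and it uses simple word-count vectors as a stand-in for learned embeddings. The names `TRANSCRIPT_CHUNKS`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical transcript snippets. A real system would get these from a
# speech-to-text model run over podcast or video audio.
TRANSCRIPT_CHUNKS = [
    "Getting morning sunlight in your eyes helps set your circadian rhythm.",
    "Caffeine late in the afternoon can reduce the quality of your sleep.",
    "Strength training twice a week supports long-term bone density.",
]


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def vectorize(text):
    """Word counts stand in for a learned embedding vector (an assumption)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=1):
    """Assemble the text that would be sent to an LLM (the call is omitted)."""
    context = "\n".join(retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("Does caffeine affect sleep?", TRANSCRIPT_CHUNKS))
```

The point of the sketch is the shape of the pipeline: transcribe, index, retrieve the relevant subset, and only then hand it to the model along with the question.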

B. Opening up access to closed/paid information

There’s a lot of data and content that is behind a paywall. This includes media content but also various datasets that require enterprise licenses to purchase. One of the interesting possibilities with LLMs is that this data could be made available to the LLM, with the user only charged on a pay-per-query basis. So for example, rather than a user paying for tens of news subscriptions, the LLM could output information from the articles if relevant to the question asked and only charge for that query.

In this way, I’m optimistic that LLMs could actually be a conduit for some kind of micropayment model potentially taking off, which could make it easier to consume paywalled information when needed by only paying for what you need.
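
As a rough illustration of what pay-per-query could mean mechanically, here is a small Python sketch. Everything in it is an assumption made for the example: the `Article` and `Ledger` types, the per-use prices, the publisher names, and the crude keyword match used to decide which articles are "relevant". A real system would use proper retrieval and a real payment rail.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    publisher: str
    text: str
    price_cents: int  # hypothetical per-use price set by the publisher


@dataclass
class Ledger:
    """Tracks how much the user owes each publisher."""
    charges: dict = field(default_factory=dict)

    def charge(self, publisher, cents):
        self.charges[publisher] = self.charges.get(publisher, 0) + cents

    def total(self):
        return sum(self.charges.values())


def answer_with_paywalled_sources(question, articles, ledger):
    """Pick articles sharing a word with the question; charge only for those."""
    words = set(question.lower().split())
    used = [a for a in articles if words & set(a.text.lower().split())]
    for article in used:
        ledger.charge(article.publisher, article.price_cents)
    return used


# Made-up articles and publishers, for illustration only.
articles = [
    Article("Daily Ledger", "New chip export rules announced today", 5),
    Article("Metro Times", "Local team wins championship game", 3),
]
ledger = Ledger()
used = answer_with_paywalled_sources("chip export rules", articles, ledger)
print([a.publisher for a in used], ledger.total())
```

In this run only the first article matches the question, so the user is charged for that one source rather than for two subscriptions.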

4/ Progress on AI data use

In “AI and the decline of the open web”, I wrote that one reason companies like Twitter and Reddit were increasing API prices and gating content was to better monetize their data for use by LLMs and to curtail the risks of scraping. We’ve seen similar issues with images, with the likes of Getty filing lawsuits against Stability AI after its Stable Diffusion model generated images bearing Getty’s watermarks.

I’m optimistic that we can make progress on how AI uses data, such that the risk of companies closing themselves off to models goes down over time.

Some examples of advancements that I’m excited about include:

Standards around controlling data use: Today, website owners can control whether search engines are allowed to index certain webpages via tags such as ‘noindex’. I expect similar standards to appear for models, where content owners can restrict their content from either being used to train an LLM, or from being retrieved as part of an answer via RAG. I do hope the emerging standards make the distinction between the two, and that separate tags emerge for each. For example, it could look like the below:

Search Engines
<meta name="robots" content="noindex"> <!-- don't index for search engines -->

AI models
<meta name="robots" content="notrain"> <!-- don't use for training -->
<meta name="robots" content="noretrieve"> <!-- don't use for retrieval -->
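
Here is a minimal sketch of how a crawler might honor such tags, using Python's standard `html.parser`. Note that `noindex` is an existing robots directive, while `notrain` and `noretrieve` are only the hypothetical values proposed above, not part of any standard today. The class and function names are also made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # A content attribute may hold several comma-separated directives.
        content = attr.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def allowed_uses(html):
    """Report which uses a page permits, based on its robots meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "search_index": "noindex" not in found,
        "training": "notrain" not in found,      # hypothetical directive
        "retrieval": "noretrieve" not in found,  # hypothetical directive
    }


page = '<html><head><meta name="robots" content="notrain"></head></html>'
print(allowed_uses(page))
```

Keeping the three answers separate is the design point: a publisher could stay visible in search and in retrieved answers while still opting out of training.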

Attribution and Licensing by Model Providers: Companies that train models will likely end up actively licensing content from the appropriate data sources, and there should be an increase in commercially safe models [3], some of which may even have figured out attribution mechanisms to make payments based on which images or text inspired a specific output.

Two side-by-side images show soccer players fighting for a ball. The left is a real photograph but the right is an AI-generated version, with distortion of the players’ bodies and faces. (Stable Diffusion generating the Getty Images watermark)
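
The attribution-based payments mentioned above are easiest to picture as a proportional split. The Python sketch below assumes a model provider can already produce an attribution score per source for a given output, which is the hard and unsolved part. The function name, source names, and scores are all hypothetical.

```python
def split_payment(fee_cents, attribution_scores):
    """Split a fee across sources in proportion to their attribution scores."""
    total = sum(attribution_scores.values())
    if total <= 0:
        # Nothing to attribute, so nobody is paid.
        return {source: 0.0 for source in attribution_scores}
    return {
        source: fee_cents * score / total
        for source, score in attribution_scores.items()
    }


# Made-up scores: source A influenced the output three times as much as B.
payouts = split_payment(100, {"stock_library_a": 3, "stock_library_b": 1})
print(payouts)
```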

Thanks for reading! I write about things related to technology and business once a week on Mondays.

[1] However, with podcasts, most consumption happens via Spotify and Apple Podcasts, so there is some element of centralization.

[2] A good explainer of Retrieval Augmented Generation is here

[3] Adobe’s Firefly, which was trained on Adobe Stock images, is one good example.