Is the Web becoming a walled garden?

A walled garden, in the context of the Internet and freedom of data, is an environment where the data inside the ‘garden’ is controlled and can only be accessed via approved apps or websites.

There are many well-known examples of these, such as app stores, social media platforms, and messaging and collaboration platforms (including Slack, Teams, and WhatsApp). It is worth pointing out that the content being ‘walled off’ is generally not created by the companies providing the tools, but by the users inside.

The reasons for doing this are plentiful: controlling the flow of data makes it much easier to retain users, charge for access or promotion, analyse traffic, serve ads, and build profiles of users.

In this article, I’m not looking too much at social media platforms; I am much more concerned about the increasing centralisation of the Web and what it means in terms of privacy and censorship.

It is becoming ever more apparent that even content existing outside of the services that typically host walled gardens is now subject to at least some control by a handful of ‘Big Tech’ companies.

Implicit Censorship

A website is essentially just code and data delivered on demand to your browser, which then renders it. To get this code, your browser connects to another computer (a server) and asks to download it. This server has to be running and reachable somewhere - this is what we mean by hosting.
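
To make that concrete, here is a minimal Python sketch of the request-and-download step, using only the standard library; example.com is just a stand-in for any site.

```python
# A minimal sketch of what a browser does when you visit a page: ask a
# server for the code and data behind a URL. "example.com" is a stand-in.
import urllib.request

with urllib.request.urlopen("https://example.com/") as response:
    html = response.read().decode("utf-8")  # the code/data the browser would render

print(html[:200])  # first few hundred characters of the page source
```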

As more and more hosting converges on a handful of large providers, they can essentially dictate which websites are and are not online; smaller providers are often unwilling to go against this, due to the risk of targeted denial-of-service attacks. Even if a website has hosting, it still needs a registrar and a DNS provider (so that browsers can find where it is hosted) before a typical end user can see it.
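
To illustrate that dependency chain, here is a small Python sketch (again with a stand-in domain): if the registrar or DNS provider drops the name, the lookup simply fails and the site is effectively offline for most users, regardless of whether the hosting itself is still running.

```python
# Before a browser can contact the host at all, the domain name has to
# resolve via DNS. "example.com" is a stand-in domain.
import socket

domain = "example.com"

try:
    addresses = {info[4][0] for info in socket.getaddrinfo(domain, 443)}
    print(f"{domain} resolves to: {sorted(addresses)}")
except socket.gaierror:
    print(f"{domain} does not resolve - to a typical user, the site is simply 'offline'")
```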

Even then, a webpage is unlikely to be found if search engines don’t display it on the front page of results, if social media platforms silently discard links to the page, if mobile apps pointing to the site are removed from app stores, and if large email providers divert anything containing those links to spam.

Additionally, the greater the share of email accounts hosted by a few providers, the more justified those providers may feel in ‘distrusting’ email sent from anywhere else - directing it straight to spam, or rejecting it outright.

It is easy to think that the above will only apply to controversial websites, apps, or emails - which, in itself, is still problematic and goes against the freedom of the Internet - but there is little transparency in what, exactly, is filtered out. It is not hard to imagine a future in which popular search engines only show first-page results for sites hosted within an ‘approved network’ consisting of a handful of public cloud providers, CDNs, and a limited set of datacentre providers - a network that could simultaneously decide that a particular app is no longer welcome.

Data Collection and User Profiling

Even if censorship is not a concern, there’s the question of how much companies can determine about a user based on their visited websites and emails. Most people are painfully aware of cookies by now, but what about when cookies are not needed?

It takes surprisingly little of your browsing history for your personality and demographic to be determined, for example: “users’ personality traits and demographic information can be predicted based on the browsing logs, even when the URLs in the logs are preprocessed by a many-to-one pseudonym”.

The more traffic that flows through a particular network, the greater the potential for analysis - this is even more pronounced when traffic is flowing through a handful of endpoints (such as Cloudflare), even if the final destination is somewhere else.
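
One rough way to see this convergence is to resolve a set of domains and count how many distinct front-end IP addresses actually serve them: the smaller that set, the more traffic passes through the same endpoints. A sketch, with a purely illustrative domain list:

```python
# Resolve each domain and group domains by the front-end IPs that serve them.
# The domain list is illustrative - substitute your own.
import socket
from collections import defaultdict

domains = ["example.com", "example.org", "example.net"]

ip_to_domains = defaultdict(set)
for domain in domains:
    try:
        for info in socket.getaddrinfo(domain, 443, proto=socket.IPPROTO_TCP):
            ip_to_domains[info[4][0]].add(domain)
    except socket.gaierror:
        pass  # unresolvable domains are skipped in this sketch

print(f"{len(domains)} domains are served from {len(ip_to_domains)} distinct IPs")
for ip, served in sorted(ip_to_domains.items()):
    print(f"  {ip}: {', '.join(sorted(served))}")
```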

Everyone who hosts their email with Gmail can have their emails read by Google, even though Google announced in 2017 that this analysis would no longer be used for targeted ads. The same applies to everyone who sends (including via CC or BCC) email to someone using Gmail - which, with GSuite allowing custom domains for Gmail, may not always be obvious. This often leads to people modifying their behaviour, as highlighted in Life After Gmail: Why I Opted for a Private Email Server:

“The experience made me wonder if Google’s data collection practices had been restricting my thoughts. This seemed half-crazy until I started asking around about the idea. A 2017 study published in The Cambridge Handbook of Surveillance Law showed that web searches for health-related terms fell after the 2013 disclosure by National Security Agency whistleblower Edward Snowden revealed previously unknown levels of government spying on internet activity.”

The article goes on to add:

“For years, I realized, I’d been self-censoring my emails, too, keeping certain thoughts and feelings out of even personal correspondence because of a fear that they might wind up in a hack, or a lawsuit, or some advertiser’s data dump. People do this at work all the time, but it seems slightly insidious as more of our personal communication moves to electronic forms.”
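
As an aside on the ‘may not always be obvious’ point above: because GSuite allows organisations to use their own domains, you often cannot tell from an address alone whether Google handles the mail, but the domain’s MX records will. A quick sketch using the third-party dnspython package, with a stand-in domain:

```python
# Check whether a domain's incoming mail is routed through Google by
# looking at its MX records. Requires `pip install dnspython`.
import dns.resolver

domain = "example.com"  # stand-in for the recipient's domain

try:
    answers = dns.resolver.resolve(domain, "MX")
    hosts = [str(rdata.exchange).rstrip(".").lower() for rdata in answers]
    if any(h.endswith((".google.com", ".googlemail.com")) for h in hosts):
        print(f"Mail for {domain} is routed through Google: {hosts}")
    else:
        print(f"Mail for {domain} is handled elsewhere: {hosts}")
except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
    print(f"No MX records found for {domain}")
```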

How much of a problem is this?

None of the issues I’ve described here are new, but I was interested in how widespread they are, and if they are getting worse.

Recently, in collaboration with The Internet Society, I re-designed a crawler (a program that systematically browses the Internet) to compare IPv6 support across the top million domain names. I ran a crawl in January 2022 that allowed me to provide some answers to this (a rough sketch of the kind of DNS aggregation involved follows the list below):

  • Fewer than 28% of all collected IP addresses were unique;
  • Three companies working together could at least temporarily disable DNS for 17.04% of the top million sites;
  • Almost 6% of the top sites could have their users tracked via just 7 IPs (belonging to only three providers);
  • If Google and Microsoft decided your emails should not be delivered, you would be unable to send email to at least 36% of the most popular domains (and you might not even know that your emails were not being received).
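
For a sense of the mechanics, here is a much-simplified sketch of the kind of DNS aggregation behind these numbers: resolving the addresses and nameservers for a list of domains and measuring how concentrated they are. It uses the third-party dnspython package and a three-domain illustrative list, whereas the real crawl covered the top million domains and collected far more data.

```python
# Collect A/AAAA and NS records for a domain list, then measure how
# concentrated the results are. Requires `pip install dnspython`.
import dns.resolver
from collections import Counter

domains = ["example.com", "example.org", "example.net"]  # illustrative list

all_ips = []
nameserver_counts = Counter()

for domain in domains:
    for rtype in ("A", "AAAA"):
        try:
            all_ips += [rdata.address for rdata in dns.resolver.resolve(domain, rtype)]
        except Exception:
            pass  # keep the sketch short: skip any failed lookup
    try:
        for rdata in dns.resolver.resolve(domain, "NS"):
            # crude grouping: last two labels of the nameserver as a 'provider'
            provider = ".".join(str(rdata.target).rstrip(".").split(".")[-2:])
            nameserver_counts[provider] += 1
    except Exception:
        pass

if all_ips:
    unique = len(set(all_ips))
    print(f"{unique} of {len(all_ips)} collected IPs are unique ({unique / len(all_ips):.0%})")
print("Nameserver providers by record count:", nameserver_counts.most_common())
```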

I personally find this direction really concerning; these figures only cover the top million domains, and I would expect less centralisation there than across the wider Web, since bigger companies are much more likely to host on their own infrastructure.

In the technical write-up, I provide some comparison with 2010 (did you know that almost 10% of the top domains were Blogspot back then?), and if you want to run your own crawler, you can find instructions here - and the Internet Society would love to hear from you.