The Emergence of a Permanent Web (Your Teachers Were Right)
“Be careful with what you post on the Internet. Once it’s up there, it’s up there forever.”
This was the refrain employed by school teachers to discourage the modern web’s first wave of children from publishing compromising content. Looking back, they were mostly, but not entirely, correct.
Content, especially popular content, is nearly impossible to scrub from the Internet once it’s in the hands of the masses. Even unremarkable content, like my coworker’s cringeworthy and now-deleted Instagram post supporting Donald Trump, can often be tracked down with about five minutes of Google-sleuthing. In the short term, content is pretty sticky on the web.
But over time, these pages have a tendency to disappear. This is known as “link rot” and even Pulitzer Prize-winning work is not immune to it.
One study of content hosted by Reuters found the median lifespan of a web page to be about nine years (and this is content under the management of Reuters). Other studies have found that 25% to 50% of web pages eventually disappear or move, depending upon the sample of pages analyzed. In the case of JournalSpace, 100% of its content was permanently destroyed by a single disgruntled employee.
While this may be good news for angsty teen bloggers (raises hand) or my wayward Trump-supporting, Instagram-post-deleting coworker, it can be a significant concern for the professional creators, curators, and publishers of the web.
The Internet Archive is hoping to slow link rot by enabling its famous archive of the Internet (aka The Wayback Machine) to be queried via keywords—akin to a “historical view” search engine. This is big news, as currently the Wayback Machine can only be queried if the user knows the precise URL to access, a significantly more difficult piece of information for a user to input than, say, “coworker name + Donald Trump”.
While this has tremendous value for preserving art, culture, and knowledge (not to mention dank memes), the archive will also be saving everything else: public blog posts, social media musings, defamatory content, and so on. Just about everyone has a list of web content they’d prefer be left permanently inaccessible. Today, all you generally need is time and thick skin. In the future, that may not be enough.
This raises the question of how accessible information on the web ought to be, especially when it’s of a personal nature. Several categories of web content (impersonation, revenge porn, mugshot sites, etc.) have been the subject of legal scrutiny almost as long as the web has existed. Most notably, in 2014 the European Court of Justice declared that individuals have a Right to be Forgotten from search engines, extending privacy legislation to cover search results. This has not been the case in the United States, where a vigorous debate continues between this right to be forgotten and the right of public access. For instance, a consumer may deserve to be aware of a business’ shady past even if the business has served its legal penance.
Throughout all of this, the question remains: what can the average user of the web do to protect their privacy? Unfortunately, few options exist and most of them reside comfortably within the neighborhood of Common Sense, USA:
- If something is posted publicly, assume it is permanent. Assume that employers and future generations of your family will be able to access it.
- Once you’ve ignored that step, frequently review the privacy settings of the platform you’re publishing on. This assumes that none of your Internet friends will republish your content elsewhere and asshole coworkers won’t screenshot your posts for blackmail (or lulz). For example, Tumblr allows users to publish content with a noindex tag, which discourages major search engines from indexing and surfacing the content.
These are not exactly comforting options.
Fortunately, if you are an individual webmaster or blog owner, you have a little bit more control. The Internet Archive collects all of this data by using web crawlers (aka “bots”; theirs is called archive.org_bot) that download content from web pages and store it for later processing and indexing, just like Google and Bing. With access to a site’s robots.txt file, a webmaster can disallow the Internet Archive’s bot from accessing the site’s content, which tells the bot not to store that content for later access.
Here are a few ways to do it:
Manual File Edit
1) Access the site’s root directory. Personally, I use an FTP client.
2) Navigate to the robots.txt file and open it (if the file is missing or empty, open a new .txt file and skip to step 4)
3) If there is text, copy and paste it into a new .txt file
4) Add the following lines to the bottom of the file:
User-agent: archive.org_bot
Disallow: /
5) Save the file as robots.txt and upload it to the root directory
6) If a robots.txt file already exists, overwrite it
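If you would rather script the edit than make it by hand, the steps above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function name add_archive_rule is made up for this example, and it assumes you have already downloaded the current contents of robots.txt as a string (an empty string if no file exists).

```python
# The two lines from the steps above, as they should appear in robots.txt.
ARCHIVE_RULE = "User-agent: archive.org_bot\nDisallow: /\n"


def add_archive_rule(robots_txt: str) -> str:
    """Return robots.txt content with the archive.org_bot group appended once.

    Hypothetical helper for illustration. If a group for archive.org_bot
    already exists, the text is returned unchanged, even if that group's
    rules differ from "Disallow: /".
    """
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        is_user_agent = field.strip().lower() == "user-agent"
        if is_user_agent and value.strip().lower() == "archive.org_bot":
            return robots_txt

    # No existing file (or an empty one): the rule is the whole file.
    if not robots_txt.strip():
        return ARCHIVE_RULE

    # Otherwise keep the existing rules and separate the new group
    # from them with a blank line.
    return robots_txt.rstrip("\n") + "\n\n" + ARCHIVE_RULE


existing = "User-agent: *\nDisallow: /private/\n"
print(add_archive_rule(existing))
```

The result still has to be uploaded to the site’s root directory as robots.txt; the sketch only prepares the text.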
WordPress Plugin
1) Install Yoast SEO Plugin
2) Follow these steps to navigate to an edit screen for the robots.txt file
3) Add the following lines at the bottom of the file:
User-agent: archive.org_bot
Disallow: /
4) Save!
Drupal Plugin
1) Install the Robots.txt Module for multiple sites (for a single site, head straight to the Admin panel)
2) Add the following lines to the bottom of the robots.txt file:
User-agent: archive.org_bot
Disallow: /
3) Save!
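Whichever route you take, it is worth confirming that the finished file says what you think it says. Here is a minimal sketch using the robots.txt parser in Python’s standard library. The helper name is_blocked and the example.com URL are placeholders for this example, and the check only tells you how a rule-following parser reads the file, not what any particular bot will actually do.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text disallows user_agent from url.

    Hypothetical helper for illustration; example.com is a placeholder.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


rules = """User-agent: archive.org_bot
Disallow: /
"""

print(is_blocked(rules, "archive.org_bot"))  # True: the archive bot is shut out
print(is_blocked(rules, "Googlebot"))        # False: other bots are unaffected
```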
Alternatively, a meta tag can be added to the <head> of any page you don’t want to be archived:
<meta name="archive.org_bot" content="noindex">
This will allow other bots to index your site content, but will specifically discourage Archive.org from indexing (i.e. archiving) the page.
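To double-check that the tag actually made it into a page, something like the following sketch can help. It uses Python’s built-in HTML parser; the class and function names are invented for this example, and it assumes you already have the page’s HTML as a string. It only confirms the tag is present in the markup. Whether a crawler honors it is up to the crawler.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of <meta> tags addressed to one bot.

    Hypothetical helper for illustration.
    """

    def __init__(self, bot_name: str):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == self.bot_name:
            content = attributes.get("content", "")
            self.directives.extend(
                part.strip().lower() for part in content.split(",")
            )


def has_noindex(html: str, bot_name: str = "archive.org_bot") -> bool:
    """Return True if the HTML contains a noindex meta tag for bot_name."""
    finder = RobotsMetaFinder(bot_name)
    finder.feed(html)
    return "noindex" in finder.directives


page = '<html><head><meta name="archive.org_bot" content="noindex"></head></html>'
print(has_noindex(page))
```

One practical note: the tag needs plain straight quotes. A copy typed with curly “smart” quotes will not be matched by this check.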
So there you have it. We’re all basically writing in an unlocked journal for future generations to see, and unless you own the site, there’s nothing to be done.
Well, almost nothing.
Kyle Risley is an SEO expert at Vistaprint and also provides freelance SEO consulting. When he is away from his keyboard, he’s usually at a concert or digging through records.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.