
blogslut

(37,981 posts)
Sun Apr 23, 2017, 10:57 AM

Internet Archive to ignore robots.txt directives

https://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html

Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn't index. The Internet Archive isn't a search engine, but has historically obeyed exclusion requests from robots.txt files. But it's changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to [email protected]). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
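For anyone unfamiliar with the file being discussed, here is a minimal, hypothetical robots.txt (the paths are invented for illustration):

```
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: ia_archiver
Disallow: /
```

`ia_archiver` is the user-agent the Internet Archive's crawler has historically announced, and a `Disallow: /` rule under it was the conventional way to ask that a site be kept out of the Wayback Machine — the very convention the Archive is now reconsidering.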

PSPS

(13,577 posts)
2. Their description of what they're doing makes perfect sense.
Sun Apr 23, 2017, 11:33 AM

Malicious spiders have always ignored robots.txt anyway, and the Archive will continue to honor removal requests. This change seems targeted at parked, inactive domains, keeping their history available, which is a very good idea.

emulatorloo

(44,057 posts)
6. I don't. Things on the net shouldn't be allowed to disappear down the memory hole
Sun Apr 23, 2017, 12:09 PM

I am not a fan of whitewashing, gaslighting or revisionist history.

If a company or a person can't own up to what they put on their website in the past, then maybe they shouldn't have put it up in the first place.

MineralMan

(146,248 posts)
7. If you do not want content hosted on a domain you own to
Sun Apr 23, 2017, 12:11 PM

be visible to search engines and others, there are far, far better ways than using robots.txt. In fact, that file can be used to discover exactly what a site doesn't want seen, and then to go peek at it.
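The point above is easy to demonstrate: robots.txt is a public file, so anyone can read its Disallow lines to see which paths the owner hoped to keep quiet. A quick sketch in Python (the sample rules are made up for illustration):

```python
# Sketch: pull the Disallow paths out of a robots.txt body.
# The sample below is hypothetical, not from any real site.
def disallowed_paths(robots_txt: str) -> list[str]:
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

sample = """\
User-agent: *
Disallow: /private/   # the "hidden" admin area, now advertised to everyone
Disallow: /drafts/
Allow: /public/
"""
print(disallowed_paths(sample))  # ['/private/', '/drafts/']
```

In other words, every entry meant to hide something doubles as a signpost pointing straight at it, which is why real access control (authentication, server-side permissions) is the better tool.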

One of my old websites exists now only on archive.org. When I shut down a website, I remove all of the content on it, post a notice at the root URL that the site is gone, and then discontinue hosting. However, I do still occasionally look at that site as it appears on archive.org. I can go back to its early, early days in the mid 1990s, if I like. I like having that archive available, just for historical interest.


bitterross

(4,066 posts)
9. I disagree with you. Your argument is not logical.
Sun Apr 23, 2017, 01:24 PM

If a person or company is so naive as to believe there is any such thing as an expectation of privacy on the Internet, they are foolish. Once you publish something to the web, it is part of history, for better or worse and whether you like it or not.

What you are actually saying is the same as arguing that when a business puts a racist, bigoted sign in its front window, I shouldn't be able to take a picture of it and publish it later in a newspaper; that I should honor the owner's wish to hide the bigotry by not publishing the picture, or even keeping a copy of it. That makes no sense at all.

More importantly, if politicians and governments are using robots.txt to try to prevent their content from being archived so it can be disappeared later, then this move makes even more sense.

MineralMan

(146,248 posts)
5. robots.txt is a very, very weak tool.
Sun Apr 23, 2017, 12:06 PM

Search engines use it to identify content the website owner wants made public and to skip content the owner doesn't, but that's only a convention. Any web-crawling spider can freely ignore robots.txt and do as it pleases. Many do ignore it.

Website owners need a different method to keep content that is not meant for the public behind closed doors, and smart website owners use one. robots.txt is essentially useless as a security measure, since observing it is voluntary.
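For contrast, this is all a polite crawler does with the file. Python's standard library even ships the convention as `urllib.robotparser`; the crawler asks permission before each fetch, but nothing enforces the answer. A minimal sketch with hypothetical rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real crawler would fetch them
# from https://example.com/robots.txt instead.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler checks before fetching; an impolite one simply doesn't.
print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))         # True
```

The `False` is purely advisory: the crawler itself decides whether to call `can_fetch` at all, which is exactly why robots.txt provides etiquette, not security.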

hunter

(38,301 posts)
8. I'm unexpectedly ambivalent about this.
Sun Apr 23, 2017, 12:17 PM

Maybe it's because the bad guys of the internet never paid attention to robots.txt files anyway.

I haven't paid attention to robots.txt on my personal sites for many years now, but I do remember when I used to care.
