
blogslut

(37,981 posts)
Sun Apr 23, 2017, 10:57 AM

Internet Archive to ignore robots.txt directives

https://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html

Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn't index. The Internet Archive isn't a search engine, but has historically obeyed exclusion requests from robots.txt files. But it's changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to [email protected]). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
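For anyone unfamiliar with the file being discussed, here is a minimal, hypothetical robots.txt (the paths are invented for illustration):

```
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: ia_archiver
Disallow: /
```

`ia_archiver` is the user-agent the Internet Archive's crawler has historically announced, and a `Disallow: /` rule under it was the conventional way to ask that a site be kept out of the Wayback Machine — the very convention the Archive is now reconsidering.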

PSPS

(13,577 posts)
2. Their description of what they're doing makes perfect sense.
Sun Apr 23, 2017, 11:33 AM

Malicious spiders have always ignored robots.txt anyway, and the Archive will continue to honor removal requests. This change seems targeted at parked, inactive domains, keeping their history available, which is a very good idea.

emulatorloo

(44,057 posts)
6. I don't. Things on the net shouldn't be allowed to disappear down the memory hole
Sun Apr 23, 2017, 12:09 PM

I am not a fan of whitewashing, gaslighting or revisionist history.

If a company or a person can't own up to what they put on their website in the past, then maybe they shouldn't have put it up in the first place.

MineralMan

(146,248 posts)
7. If you do not want content hosted on a domain you own to
Sun Apr 23, 2017, 12:11 PM

be visible to search engines and others, there are far, far better ways than using robots.txt. In fact, that file can be used to discover exactly what a site doesn't want seen, and then to go peek at it.
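The point above is easy to demonstrate: robots.txt is a public file, so anyone can read its Disallow lines to see which paths the owner hoped to keep quiet. A quick sketch in Python (the sample rules are made up for illustration):

```python
# Sketch: pull the Disallow paths out of a robots.txt body.
# The sample below is hypothetical, not from any real site.
def disallowed_paths(robots_txt: str) -> list[str]:
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:
                paths.append(path)
    return paths

sample = """\
User-agent: *
Disallow: /private/   # the "hidden" admin area, now advertised to everyone
Disallow: /drafts/
Allow: /public/
"""
print(disallowed_paths(sample))  # ['/private/', '/drafts/']
```

In other words, every entry meant to hide something doubles as a signpost pointing straight at it, which is why real access control (authentication, server-side permissions) is the better tool.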

One of my old websites exists now only on archive.org. When I shut down a website, I remove all of the content on it, post a notice at the root URL that the site is gone, and then discontinue hosting. However, I do still occasionally look at that site as it appears on archive.org. I can go back to its early, early days in the mid 1990s, if I like. I like having that archive available, just for historical interest.


bitterross

(4,066 posts)
9. I disagree with you. Your argument is not logical.
Sun Apr 23, 2017, 01:24 PM

If a person or company is so naive as to believe there is any such thing as an expectation of privacy on the Internet, they are foolish. Once you publish something to the web, it is part of history, for better or worse and whether you like it or not.

What you are actually saying is the same as arguing that when a business puts a racist, bigoted sign in its front window, I shouldn't be able to take a picture of it and publish it later in a newspaper; that I should honor the owner's wish to hide the bigotry by not publishing the picture, or even keeping a copy of it. That makes no sense at all.

More importantly, if politicians and governments are using robots.txt to try to prevent their content from being archived so it can be disappeared later, then this move makes even more sense.

MineralMan

(146,248 posts)
5. robots.txt is a very, very weak tool.
Sun Apr 23, 2017, 12:06 PM

Search engines use it to identify content the website owner wants made public and to skip content the owner doesn't, but that's only a convention. Any web-crawling spider can freely ignore robots.txt and do as it pleases. Many do ignore it.

Website owners need a different method to keep content that is not meant for the public behind closed doors, and smart website owners use one. robots.txt is essentially useless as a security measure, since observing it is voluntary.
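For contrast, this is all a polite crawler does with the file. Python's standard library even ships the convention as `urllib.robotparser`; the crawler asks permission before each fetch, but nothing enforces the answer. A minimal sketch with hypothetical rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real crawler would fetch them
# from https://example.com/robots.txt instead.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler checks before fetching; an impolite one simply doesn't.
print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))         # True
```

The `False` is purely advisory: the crawler itself decides whether to call `can_fetch` at all, which is exactly why robots.txt provides etiquette, not security.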

hunter

(38,301 posts)
8. I'm unexpectedly ambivalent about this.
Sun Apr 23, 2017, 12:17 PM

Maybe it's because the bad guys of the internet never paid attention to robots.txt files anyway.

I haven't paid attention to robots.txt on my personal sites for many years now, but I do remember when I used to care.
