General Discussion
Internet Archive to ignore robots.txt directives
https://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive's goal is to create complete snapshots of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is blocked from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these disappeared sites almost daily.
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to [email protected]). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
50 Shades Of Blue
(9,916 posts)
PSPS
(13,577 posts)
Malicious spiders have always ignored robots.txt anyway, and the Archive will continue to honor removal requests. This change seems targeted at parked, inactive domains, to keep their history available, which is a very good idea.
50 Shades Of Blue
(9,916 posts)
emulatorloo
(44,057 posts)
I am not a fan of whitewashing, gaslighting or revisionist history.
If a company or a person can't own up to what he put on his website in the past, then maybe he shouldn't have put it up in the first place.
MineralMan
(146,248 posts)
be visible to search engines and others, there are far, far better ways than using robots.txt. In fact, that file can be used to find out what isn't visible on a website and go peek into it.
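A minimal sketch of that last point, assuming nothing beyond the standard robots.txt line format: anyone who can read the file can read the list of paths the owner hoped to hide. The sample file contents and the helper name `disallowed_paths` are made up for illustration.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path listed in a robots.txt body."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:  # an empty Disallow blocks nothing, so skip it
                paths.append(value)
    return paths


# Hypothetical file contents, invented for this example.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /drafts/  # not ready
Disallow:
Allow: /public/
"""

print(disallowed_paths(SAMPLE))  # → ['/admin/', '/drafts/']
```

The file is public by design, so listing a directory there advertises it rather than protecting it.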
One of my old websites exists now only on archive.org. When I shut down a website, I remove all of the content on it, post a notice at the root URL that the site is gone, and then discontinue hosting. However, I do still occasionally look at that site as it appears on archive.org. I can go back to its early, early days in the mid 1990s, if I like. I like having that archive available, just for historical interest.
bitterross
(4,066 posts)
A person or company naive enough to believe there is any expectation of privacy on the Internet is foolish. Once you publish something to the web, it is part of history, for better or worse and whether you like it or not.
What you are actually saying is the same as if you felt that when a business puts up a racist, bigoted, etc. sign in its front window then I shouldn't be able to take a picture of that and publish it later in a newspaper. I should honor their wishes to hide the fact they are racist, bigoted, etc. by not publishing the picture or even just keeping a copy of it. That makes no sense at all.
More importantly, if politicians and governments are using robots.txt to try to keep their content from being archived so it can be disappeared later, then this move makes even more sense.
UTUSN
(70,641 posts)
MineralMan
(146,248 posts)
While search engines choose to use it to identify content the website owner wants to be made public and to ignore content the owner doesn't want to be public, that's only a convention. Any web crawling spider can freely ignore robots.txt and do as it pleases. Many do ignore it.
Website owners need to choose a different method to put content that is not meant for the public behind closed doors. Smart website owners do that. robots.txt is essentially useless as any sort of security measure, since observing it is voluntary.
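To make the "voluntary" part concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleBot`, the `example.com` URL, and both helper functions are assumptions for illustration; the robots.txt body is the kind of blanket block a parked domain might serve. The point is that the check runs entirely on the crawler's side, so a crawler that skips it meets no resistance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def polite_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """A well-behaved crawler asks the parsed rules before fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def rude_may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """A crawler that ignores robots.txt never consults the rules at all."""
    return True


url = "http://example.com/old/page.html"
print(polite_may_fetch(ROBOTS_TXT, "ExampleBot", url))  # → False
print(rude_may_fetch(ROBOTS_TXT, "ExampleBot", url))    # → True
```

Nothing on the server enforces the first answer. Content that must stay private needs real access control, such as authentication, not a line in robots.txt.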
hunter
(38,301 posts)
Maybe it's because the bad guys of the internet never paid attention to robots.txt files anyways.
I haven't paid attention to robots.txt on my personal sites for many years now, but I do remember when I used to care.