To whom it may concern:
I've noted what Archive.Org appears to be doing, and it deserves some comment as both a fact and a trend.
The Internet Archive, aka archive.org
is compiling a very extensive database of web pages, both old
and new. Their collection goes back (in my case) as far as 1996.
They're a 501(c)(3) nonprofit organization, which doesn't mean they
can't charge money or collect a paycheck. For instance, a check
on GuideStar
reveals that they paid approximately $78,000 in salaries, $44,000
in rent, $11,000 in professional services and $83,000 for technology
consulting in 1999 (there's nothing nefarious about any of this;
I do note that their directors didn't receive any compensation)
according to what they filed on their federal form 990.
Their stated purpose for doing what they're doing is "the gathering, archiving, and serving of digital information as an educational and historical resource", also from their form 990.
According to their own website and that of Alexa, a considerable portion of their content is "donated" by Alexa from Alexa's crawls of the web; Alexa in turn is owned by Amazon.Com.
If you visit content there, you'll find it rewritten so that the links you follow stay within the archive.
Looking around their site, you'll find various commentary about fair use, copyright, the alleged "failure" of copyright holders to preserve works, the seemingly perpetual extension by Congress of copyright duration, the need to balance copyright and public right, and various other noble, high-minded and public-spirited sentiments. Most of these sentiments I'm inclined to agree with, at least as general principles; but then when I start ruminating closely on it, I find I have various and sundry reservations.
How serious are these reservations? In the grand scheme of things, not very. I'm inclined to leave things as they are; I don't see that the material harm is great at this time. I do worry about the trend, and I worry about where it came from and exactly what the philosophical underpinnings truly are. I see that material harm could be done. But, for instance, copyright law itself changed about a decade ago; and although an exceedingly strict reading might have put the whole Internet out of business, that obviously didn't happen... at least not yet. Get my point? In principle this is all great, but out there where "meat space" is met, things are a lot murkier.
So with that background information, I have some perhaps disjointed but I think necessary observations to make and opinions to state, in the sections below. I welcome response from the principals at archive.org or Alexa, or any other large search engine, cacheing, data mining or archiving concern: put up a page with your comments, and I'll provide a link to it at the end of this document.
If you run a web site and check the logs, you notice rather quickly that you get "crawled" by various automated agents from time to time. A few of these are personal or targeted, and we'll put those under the heading of data mining and be done with them. The rest of them are either indexers or archivers; it's not always easy to know which is which. They come by and visit all or most of your web site, and then they go away. What are they doing with this information? More generally, how does this information actually get to somebody who uses it? How do they use it? How does that impact you as a content provider and creator of intellectual property?
Indexers are the search engines such as Google, Yahoo and AltaVista (there are others). Some of these may cache content; but their primary function is to provide an index to your web content.
Caches are "transparent" from the standpoint of logs. To "optimize" content delivery, large ISPs hit your site once, and then serve that copy to multiple users; as a consequence you usually only see one hit in your log for the page, in a given period of time. The polite ones usually have the word "cache" somewhere in the resolved domain name.
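As a rough illustration of the hostname convention just described, here is a minimal Python sketch that separates hits from "polite" caches from everything else. The hostnames, the (host, path) entry format and the function names are all hypothetical; a real log would need its own parsing, and a cache that doesn't announce itself in its name would go uncounted.

```python
from collections import Counter

def is_cache_host(hostname):
    """Guess whether a resolved hostname belongs to a cache.

    Assumption: the cache announces itself with the word "cache"
    somewhere in its name, as the polite ones usually do.
    """
    return "cache" in hostname.lower()

def hits_by_kind(entries):
    """Count log entries from self-identified caches versus all other hosts."""
    counts = Counter()
    for hostname, _path in entries:
        counts["cache" if is_cache_host(hostname) else "other"] += 1
    return counts

# Hypothetical, already-resolved log entries: (hostname, requested path).
sample = [
    ("cache-03.proxy.example.net", "/hot-list.html"),
    ("dialup-17.example.com", "/hot-list.html"),
    ("dialup-17.example.com", "/index.html"),
]
summary = hits_by_kind(sample)
```

Note that the single cache hit here may stand for any number of actual readers, which is exactly the problem.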
Archivers are basically coming through and copying your entire web site and making a permanent record of it.
I came to the conclusion a couple of years ago that the "Home Page" is dead: search engines are the de facto home pages for most internet users, and forcing people to enter a site through a home page is nothing more than the proprietor's personal vanity and serves primarily as an irritant for people who come seeking information. An examination of the visiting habits at my site confirms this: people seldom visit the "home page", they know what they want, they get it, and they leave.
Crawling by search engines therefore is a "necessary evil", and not really evil at all. Furthermore, when people come in to your site from a search engine, the referrer field in the log usually contains the search terms, which can be very good intelligence to receive: in this sense search engines are "good neighbors", sharing valuable information on a quid pro quo basis. The fact that Google for instance offers cached pages is therefore tolerable; it also allows those who for whatever reasons cannot visit the actual page to still view the content. Furthermore I use Google. All in all, I view their cacheing as a rental of content in consideration of the services they provide.
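To make the point about search terms in the log concrete, here is a minimal Python sketch that pulls the terms out of a referring URL. The list of query-string keys and the example URL are assumptions for illustration; each search engine chooses its own parameter names, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed query-string keys; real search engines vary in what they use.
QUERY_KEYS = ("q", "p", "query")

def search_terms(referrer):
    """Return the search terms carried in a referring URL, or None."""
    query = parse_qs(urlsplit(referrer).query)
    for key in QUERY_KEYS:
        if key in query:
            # parse_qs has already decoded "+" and %-escapes.
            return query[key][0]
    return None

# Hypothetical referrer of the kind a search engine hands over.
example = "http://www.google.com/search?q=internet+archive+copyright&hl=en"
terms = search_terms(example)
```

A referrer with no recognizable query key simply yields None, which is what you get from an archive or cache that passes nothing along.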
Cacheing, in its invisible manifestation, is understandable but onerous: it deprives people of the actual content. If you don't think this matters, try updating a web page on a third-party web site from behind the firewall of an ISP which performs this "service" on behalf of its users; pretty soon you'll find yourself serving content on some weird port to try and get around it. Although the "polite" ones name their crawlers "cache", as a content provider you still don't know exactly how many hits you've got, nor do you get statistics on how the ISP's users used your site (the order in which they visited pages). As a counterpoint to this, for corporate, governmental and other users (is it a coincidence that since 9/11 .gov and .mil traffic has dropped to near zero, and the number of unresolved IP addresses has jumped?) this provides some degree of anonymity. Are they entitled to that? In my personal case, I don't care a lot about specifically who is visiting me; but yes I would like to know exactly how much traffic I'm really getting and what the usage patterns are (I'll give specific reasons in the section Benefits and Harms). This sort of "service" is not the "real world wide web", and I think that ISPs who represent otherwise are being misleading; but that's not my problem, really.
Archiving simply and plainly doesn't sit well with me. Under the law as I understand it, it is a theft and conversion of intellectual property, plain and simple. That doesn't mean that under some circumstances it isn't forgivable. There is no standard for when it is forgivable and when it is not: that is up to the content provider, not the usurper. I believe that it has been decided by the courts that, for instance, printing a book of usenet posts without the permission of the authors (all of them) is unforgivable. Pressing CDs of "all the web" is presumably also unforgivable. Usenet is specified as a store-and-forward medium; latency of content is implicit. Every usenet server can be considered a cache or archive; not so with the web.
Archives, caches and some "in betweens" may engage in rewriting of content. This includes actual rewriting, display of others' content in frames, redirecting links to stay within the cache or archive, and other practices. Again, my understanding is that under the law this is unlawful creation of derivative works. You don't even technically have an inherent right to link to anyone else's content, although most people welcome referrals and desire to be cited for attribution; indeed that's part of the point of the whole world wide web (I've never received nastymail from anyone for a link, not even NOSC). Nonetheless, TicketMaster and Microsoft went to court over this point several years ago, and that's the facts, Jack.
In response to these and other concerns, the web crawler writers
and vendors developed a standard for a robots.txt
file which allows content providers to tell robots what to crawl
and what not to crawl. But the onus is on the content provider,
which is akin to the spammers and telemarketers providing "opt-out":
it's strictly voluntary on the part of the transgressor and does
nothing about "rogue" operations, and for every one
you tell "put me on your DO NOT CALL list", another
one springs up to take its place.
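For readers who haven't seen one, here is a minimal sketch of how a well-behaved robot is supposed to consult robots.txt, using the parser in Python's standard library. The rules, the paths and the agent name SomeIndexer are made up for illustration, and the agent name ia_archiver is an assumption about what the archiving crawler calls itself. Nothing here compels a robot to ask in the first place.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one archiver entirely and
# keep every other robot out of /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

def build_parser(text):
    """Parse robots.txt content supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# What a compliant robot would conclude before fetching each path.
archiver_allowed = parser.can_fetch("ia_archiver", "/hot-list.html")
indexer_allowed = parser.can_fetch("SomeIndexer", "/hot-list.html")
private_allowed = parser.can_fetch("SomeIndexer", "/private/notes.html")
```

The check happens entirely on the robot's side, which is the opt-out problem in a nutshell.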
Given that the state of the art in crawlers is sufficiently advanced that Alexa can discern my contact information from page content, crawlers should be able to discern a standardized copyright statement; that they choose not to do so should indicate that archivers know their position is precarious, that there is so much content out there with explicit copyright statements that if they honored them all then they would "lose" a significant portion of the content that they consider valuable! So, they "voluntarily" choose not to do so.
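Discerning a copyright statement is not hard. As a minimal sketch, assuming a notice of the common form "Copyright (c) YEAR", a few lines of Python suffice; the pattern and the function name are hypothetical, and a real standard would have to pin down the wording.

```python
import re

# Assumed notice format: the word "copyright", an optional "(c)" or
# copyright sign, then a four-digit year.
COPYRIGHT_RE = re.compile(
    r"copyright\s*(?:\(c\)|©)?\s*(\d{4})",
    re.IGNORECASE,
)

def find_copyright_year(page_text):
    """Return the year from the first copyright notice found, or None."""
    match = COPYRIGHT_RE.search(page_text)
    return int(match.group(1)) if match else None

# Hypothetical page text carrying an explicit notice.
year = find_copyright_year("Copyright (c) 2001 by Example Author")
```

A crawler that found a year could skip the page or flag it for permission; a notice deliberately broken up with slashes would not match this pattern at all.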
Archives, caches and search engines which do not provide referral information have the technology to provide content providers with aggregate surfing and use pattern information; it's the least they should be prepared to do. I'm prepared to forgive an archive which meets my standards for curation and scholarship, but it should still be prepared and willing to furnish me with information about use.
There has been a lot of debate over whether or not electronic media would be the death of copyright; however one thing remains indisputable, which is that editorship is as important as ever.
Editorship ranges from personal links pages to the criteria search engines use when indexing. In between are a large number of journals whose content is largely links to or reprints of content elsewhere: the key nature of a journal being that it is kept "fresh" on at least a semi-regular basis. Is archive.org's home page in the nature of a journal or a museum exhibit? I don't know, that's why I'm asking your opinion!
Libraries and museums are also historically curated, which amounts to largely the same function: deciding what is worthy of preservation and display. It is this human investment of judgement which distinguishes a library or museum from a junk heap.
I'll discuss this further in the section "Copyright, scholarship, fair use and all that", but I don't think you can pretend to be a library or museum without exercising such faculties before the fact: your collection has to be informed by such considerations as it is built, not afterward as you decide how to present it to your audience. Traditional "brick and mortar" libraries and museums have been constrained in this regard not only by funds but also by space.
Theft of antiquities and public treasures is often rationalized in the service of science and scholarship; in the second hand trade it results in seizures, fines and jail terms.
In the preceding I have made a point of differentiating my definition and understanding of education and research. Indeed, I cannot presume that my own definition is the same as anyone else's. The definition of education, research, scholarship, fair use, et cetera is not up to me or to any one party; ultimately it will have to be decided either by legislation or else by legal precedent, to the extent that it hasn't already thus been decided.
This will likely please no one completely; but it's the way it is. Hereafter, I am presuming that such a generally accepted definition does exist.
Precedent for fair use of transitory works goes as far back as the early days of radio, wherein it was established that it was acceptable to make a recording or transcription of a radio transmission which one listened to for reference purposes; but that this recording or transcription could only be shared with others who had likewise listened to the same transmission. Obviously some exceptions exist for the case when the content is specifically intended for someone who is not present at the time of the transmission.
Had the technology to monitor multiple frequencies simultaneously and record them all existed at that time, trawling the airwaves in such a fashion would probably have been viewed in a less than favorable light. The concept of time shifting practiced by home users with VCRs to view commercial broadcasts at a more convenient time is a relatively recent development, and the fact that it has a special term of reference is no accident. But still, there is an undeniable intent by the home user to make such a recording for their own, personal use, and not for resale or redistribution.
If blatant copying for "scholarly" purposes is "fair use", then I suppose we can conclude that the academics who write and the publishers who publish textbooks used in the classroom don't expect to receive compensation for their works.
Ignoring for a moment the nature of copyright under the Berne Convention, I don't put copyright statements on all documents, only on some of them. Is there some reason to think that when I go to the trouble to do so, that I don't mean what I say? For instance, is the statement
C/o/p/y/r/i/g/h/t/ /(/c/)/ /2/0/0/1/ [intentionally munged to indicate that 1) no copyright is asserted in this work, and 2) so as not to upset any crawler which might be behaving in the desired manner] by Fred Morris, DBA Fred Morris Consulting. Tip O' Th' Hat to whoever was answering the mail at jericho@attrition.org for help with conceptualization. Fred Morris hereby grants you (this means you) a perpetual and royalty-free license to derive from, modify and redistribute it provided that you adhere to the following two conditions: 1) you must reproduce this copyright statement and additionally make visible alterations sufficient to prevent confusion with the original work, and 2) you hold Fred Morris and Fred Morris Consulting harmless for any damages incurred as a result of your use or attempt to use this page.
unclear in any manner? Can we assume that the archive, by making a copy, is agreeing to such a license? If so, then why are they displaying it without making the necessary visible alterations?
Again, I can forgive it in a specific instance if it is not done for profit or with the intent to usurp authorship or the author's right to publish or sell the material, in the ordinary course of scholarly or academic research, and within the accepted guidelines for fair use. But even given that, I am not waiving the assumption of liability. So who is assuming that liability?
Is it a museum or library, or a junk pile of counterfeit and stolen merchandise? Does the wholesale trawling and duplication of content which is still available (not "out of print") and at no appreciable cost, by automated crawlers, and which is not intended for use by the person making such copies constitute necessary curatory diligence? Does accepting the goods from a third party materially alter the situation?
There are benefits and harms from the current practices of archiving and cacheing. This list is not represented to be complete or authoritative.
A notion is bandied about that we are (or were) in some "Dark Ages" of the Internet, and a library of the alleged scope of the Library of Alexandria will somehow bootstrap us into a "Renaissance" if not modernity. From a historical perspective, I think this argument is flawed in that the Library of Alexandria was destroyed relatively early in the Renaissance (wouldn't you agree?).
During the Dark Ages, nobody could read and write beyond the priesthood.. at least not in Europe, although elsewhere in the world it was a different story. During the Renaissance literacy spread beyond the priesthood and the exchange of ideas was no longer constrained to such a great extent by political/power boundaries. During the Gutenberg Age literacy came to the masses.. the extent to which intellectualism came with it is still debatable. The number of (written) works surviving from either the Dark Ages or the Renaissance would not seem to be particularly indicative of much, the vastly greater number surviving from afterwards being more an artifact of mechanization than of curatory virtue.
If there was a Dark Ages of the Internet, I would put it back in the time when PDPs still roamed the earth and LSD was as common as Unix on the Berkeley campus; the quintessential (byte) hacker had long hair, a leather jacket, rode a motorcycle, and was funded by DARPA; throughout the '80s, literacy existed, just not in the Europa which now fashions itself the Internet, and instead it flourished on bulletin board systems and proprietary services such as CompuServe. The Renaissance can be laid to Clinton's decriminalization (or commercialization, if you prefer) of the Internet and the subsequent spread of the technology beyond government and academia. We've been in the Gutenberg Age since the first GUI-based HTML editor. Indeed, crawlers are the harbinger of the Industrial Revolution. As with the real Ages of the same name, a centralized library of grand design has played little practical role: unless McNealy is right and the whole internet is viewed as a library, the network as the computer.... interesting to generalize that as a "group mind" to the development of thought and identity in the Middle Ages, but I digress.
Characterizing the period from 1996 to the present as an Internet Dark Ages is condescending, self-serving, arrogant and presumptuous. It doesn't win any points with me. I think it's just plain wrong.
An historical, versionable archive of the Internet is a sort of World Wide Web in Amber.
Personally, yeah sure, the fact that the Hot List
is there, in all of its editions from 1996 to the present (first
at http://www.halcyon.com/m3047/hot-list.html and
then at http://www.inwa.net/~m3047/hot-list.html)
is pretty neat. I'd thought of putting up all of the versions
myself to better illustrate the evolution of the Web,
but I didn't think there was much point since none of the content
it linked to was around any more; being able to actually check
out some of the linked content makes it work.
The site I'd like to see come back would be the NOSC's Planet Earth.. but not trapped in amber, but instead current and up-to-date, and open to the public.
It's also personally gratifying to see that the old issues of the Slime from when I was the editor are up somewhere. Personally I think it's fine, I think some of the interviews are still interesting to read; but I can't speak for the people who I interviewed, and I think they've got a right to be listened to when a work is republished in what is essentially a new medium, not envisioned at the time. Screw the law, you can't suddenly resort to hiding behind it now!
Regardless, a Web In Amber is not the Web. Hyperlinks exist exactly so that documents can be independently maintained and edited, and the text as a whole stays current; go ask NCSA if you don't believe me. That's why "under construction" signs on web sites are so stupid: of course it's under construction, it's the Web!
There is no "scholarly" or "fair use" right to record a performance, in contravention of the performer's expressed wishes. You can try; and your ticket to the performance as well as your recording equipment can be confiscated. You can face legal action if you attempt to deal in your transcription of the performance. I do not believe that all content on the Web, even if it is presented in the relatively (relative to what?) fixed medium of HTML, is immune from characterization as performance of a transitory nature.
I fervently hope that Archive.Org obtained releases from each and every content provider who is part of the 9/11 "collection". God forbid it should happen, but if somebody's heartfelt eulogy to a lost loved one, left on public display as part of mourning and then removed as part of the process of healing, should present itself to them again unbidden as a consequence of your "collection", they would have every right to condemn you for the sick bastards trading in other people's misery that you would manifestly be.
Some flowers are meant to fade, wither, dry and blow away. You'd have to be inhuman not to realize that.
What was in the mythical Library at Alexandria? Certainly some fine things, and perhaps some not so fine: tax records, criminal records, records that made you a bastard because your grandmother was a whore?
Who could go there, was it really open to all? Were the plebes allowed? What did it cost?
Were all of the contents traded for fairly, or was perhaps larceny, desecration, even murder rationalized away for the acquisition of some of the prized gems of the collection?
As I stated at the beginning of this essay, non-profit does not equate to free. As a content provider, am I allowed at least limited free use? Will I be denied use altogether at some point because I am not affiliated with an "approved" academic institution (just a lowly intellectual serf)?
Will you lose some lawsuit, or go bankrupt, or need desperately to raise cash: in other words will your "collection" come onto the market as an item to be sold to the highest bidder?
In the end, will your Internet Archive become nothing more than yet another surveillance tool of the State, and plaything of the elite?
Then burn it, I say! The Renaissance will come, unmediated by sanctimonious pretenders to intellectual mastery.
Really, I wish the Internet Archive all the best in living up to your lofty ideals. Although I have my misgivings, and fear you tread on unsteady ground, you have a vision and you have fine words. Here's hoping that your actions match your words, and that the details sort themselves out.
FRED MORRIS, 16-Jan-2002
m3047@inwa.net
a071b540083a728d2a02d824558347b7 dated-auth