I have reservations about WebCite

Via BBGM, I hear of WebCite, an on-demand Wayback Machine for web content cited within academic publications. It’s important to make sure that links to web content in academic publications don’t fail to resolve to their intended content over time, but how valuable is it, and whose responsibility is it?

If the citing author feels it’s important, they should make a local copy. They have the same right to make a local copy as a repository does. If the cited author feels the link is important, he should take steps to maintain accessibility of his content. If neither of these things happens, this raises the question of whether the value of the potentially inaccessible content is greater than the cost of a high-availability mirror of the web whose funding will come from as yet unspecified publishing partners.

These things aside, there are some important technical flaws with the project:

  • The URL scheme removes any trace of human readable information. It’s another one of those damn http://site.com/GSYgh4SD63 URL schemes.
  • All sites have downtime. Is the likelihood of any given article being available made greater by putting it all under one roof?
  • What about robots.txt excluded material? A search engine isn’t allowed to archive it, and many publishers have somewhat restrictive search engine policies.

Of course, it’s much easier to find flaws in a solution than to come up with a solution in the first place, but it seems to me that a DOI-like system, where semantic permalinks always point to content wherever it moves around the web, would work better, lead to a more complete index, and be much cheaper to run as well. I know they chose archiving as opposed to redirecting because they wanted to link to the version of the page on the day it was cited, and that’s a good idea, but if having a copy of the page as it was is important, the author needs to make a local copy, rather than hope some third party will do it for him.

    Jon Udell likes it, but I’m feeling like it needs a little work.

    15. November 2007 by Mr. Gunn

    Comments (15)

    1. It’s worth noting that you don’t have to use the shortened URLs provided for accessing WebCite archives. You can pass query parameters that use the original URL plus a date (in which case the nearest cached copy to that date will be returned) or an identifier of the referring paper for which the pages were cached. For examples of the latter see the Journal of Biology article linked from http://blogs.openaccesscentral.com/blogs/bmcblog/entry/webcite_links_provide_access_to

    2. That’s certainly much more useable, but the central point of failure problem, exemplified by the recent TinyURL downtime, remains.

      Then there’s the issue of the archive maintainer flogging his favorite political candidate on the front page.

    3. [I just found this in my spam folder. My apologies, Gunther.]
      Point 1: “The URL scheme removes any trace of human readable information. It’s another one of those damn http://site.com/GSYgh4SD63 URL schemes.”

      Response:
      If you check the WebCite technical documentation at http://www.webcitation.org/doc/WebCiteBestPracticesGuide.pdf you will see that WebCite also supports a transparent format like http://www.webcitation.org/query?url=this&date=that (retrieving THIS url at THAT date). The abbreviated format using an ID instead (http://www.webcitation.org/aSDHJE) is handy for print publications to save space, and is also used by publishers who in the references section cite the original URL plus the WebCited (archived) URL (see for example references in http://www.jmir.org, or also the BMC journals).

      I guess we could (and will) also make the database table public (the table which maps WebCite snapshot IDs to URLs/dates).
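The transparent format mentioned above can be sketched in a few lines of Python. This is only an illustration: the base URL and the `url` and `date` parameter names come from the comment, while the function name, the example cited URL, and the date string format are assumptions, so check the WebCite guide for the date formats it actually accepts.

```python
from urllib.parse import urlencode

# Base address as quoted in the comment above.
WEBCITE_QUERY = "http://www.webcitation.org/query"


def webcite_query_url(cited_url: str, cited_date: str) -> str:
    """Build a human-readable lookup link from the cited URL and the citation date."""
    # Assumption: the date is passed as a plain string; the accepted
    # formats are not stated in this thread.
    return WEBCITE_QUERY + "?" + urlencode({"url": cited_url, "date": cited_date})


# Hypothetical citation: a page as it was seen on a given day.
link = webcite_query_url("http://www.jmir.org/2007/2/e15", "2007-11-15")
print(link)
```

The cited address stays recoverable from the link itself, which is the property the short ID format gives up.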

      Point 2: “All sites have downtime. Is the likelihood of any given article being available made greater by putting it all under one roof?”

      WebCite was developed to be used for the references section of an academic paper. Today, authors mostly cite a URL, which might be gone or changed by the time the reader accesses it. We advocate that authors/editors/publishers also publish a link to the archived version of that URL, as the author saw it when he cited it, by adding a WebCite link in addition to the original URL. So this will indeed increase the likelihood that the cited web document is available to the reader.

      Point 3: “What about robots.txt excluded material? A search engine isn’t allowed to archive it, and many publishers have somewhat restrictive search engine policies.”

      WebCite does honor robots exclusion standards, no-cache tags etc. (for copyright reasons). So these webdocuments cannot be cached, unless legislators change the copyright law to the effect that archiving for scholarly reasons is permissible. I would think that in these cases scholars should think twice if they want to cite material which is not archived and likely not available to the reader in the future.
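As a rough sketch of what honoring the robots exclusion standard means for an archiver, the check below uses Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleArchiveBot` user agent, and the example.org URLs are all made up for illustration; this is not WebCite's actual code, and it does not cover no-cache tags.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a cited site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the site's robots.txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A page under the excluded path cannot be archived; other pages can.
blocked = may_archive(ROBOTS_TXT, "ExampleArchiveBot", "http://example.org/private/report.html")
allowed = may_archive(ROBOTS_TXT, "ExampleArchiveBot", "http://example.org/blog/post.html")
print(blocked, allowed)
```

Any page for which this check fails is simply never snapshotted, which is why such pages end up cited by URL only.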

      Gunther Eysenbach
      WebCite initiator

    4. WebCite also supports a transparent format

      yes, I was a little hasty to condemn this, but I think it’ll still be irritating in practice, if people tend to use the shortened URLs.

      adding a WebCite link in addition to the original URL

      This is the main problem your service solves, and I agree that it does serve that purpose, but I still wonder how this will scale, how your index will remain clear of irrelevant material, and how a service like this would protect future visitors from visiting dangerous links if the service became compromised or changed hands.

      scholars should think twice if they want to cite material which is not archived and likely not available to the reader in the future.

      Or they could just make a personal copy citing fair use rights and send it into the journal for private storage and display only if the original page goes down. One can imagine research being done on pages which isn’t flattering to the page authors, so respecting robots.txt makes quite a few pages ineligible for indexing. If they’re publishing the page, shouldn’t researchers be able to report on it?

    5. Or they could just make a personal copy citing fair use rights and send it into the journal for private storage and display only if the original page goes down.

      I wonder why no journal is currently offering / encouraging / mandating their authors to do this? Perhaps because editors and publishers (and authors) are not overly excited about the idea of having to monitor cited URLs, to answer requests of readers when a cited URL is broken, having to send archived material out individually by email (or having to invent an infrastructure that displays cited material automatically when a URL goes 404 – wait a moment, doesn’t such an infrastructure already exist? yes, and it is called WebCite). I am an editor and publisher myself, and I developed the WebCite infrastructure (and made it available to other publishers) precisely to save myself this work.

      I also doubt the notion that individual publishers (including micro-publishers, or the 1000+ small self-published open access journals) are more sustainable and stable than a central archive run by a consortium of publishers (which is what WebCite aspires to be). Individual publishers are ceasing to exist all the time, are taken over by larger publishing houses, etc. I don’t think that submitting copies of cited webpages to individual publishers and journals is in any way a more sustainable, durable, or efficient way of preserving important scholarly material. How many of the existing journals, publishers, and authors will still exist in 20, 50 or 100 years to monitor their cited URLs and display the archived copy if the original page goes down or changes from the cited version?

      how your index will remain clear of irrelevant material

      WebCite is specifically and explicitly designed to archive material that is cited in an academic context. Should this (archiving “irrelevant” material) really become a problem (at the moment it isn’t), it would be quite trivial to scan the references of scholarly journals, books, theses to positively flag “relevant” material, and to move “irrelevant” material to a secondary archive until we get a user request proving that it has been used in a reference. Storage space is cheap, and I do not see any scalability problems.

    6. PS: Here is a view from another participating publisher (BioMed Central), who uses WebCite.

      the author needs to make a local copy, rather than hope some third-party will do it for him

      Making a local copy is not the problem. The problem is how to ensure accessibility for the reader, including when the citing author might be no longer around.

    7. Perhaps because editors and publishers (and authors) are not overly excited about the idea of having to monitor cited URLs

      But you’ll have to do this for copyright takedown requests, so you’re not really saving yourself work or keeping things available, are you?

      How many of the existing journals, publishers, and authors will still exist in 20, 50 or 100 years?

      How many papers published by now defunct publishers are still available? Because research papers are self-contained and present in libraries all over the world, their content remains no matter what happens to the publisher. This approach has worked since the invention of the library, and I don’t see why it shouldn’t continue.

      For online publications, the archiving has been done by the publishers instead of by the libraries, and perhaps this isn’t the best infrastructure, but it’s far better than depending on one central archive to hold everything.

    8. But you’ll have to do this for copyright takedown requests, so you’re not really saving yourself work or keeping things available, are you?

      WebCite has to respond to takedown requests, but individual publishers do not. And publishers are saved the work of disseminating archived copies.

      How many papers published by now defunct publishers are still available? Because research papers are self-contained and present in libraries all over the world, their content remains no matter what happens to the publisher.

      This is not entirely true – I do know of electronic journals which have ceased to exist and where the articles are lost forever.
      But this is not the point. Even if it is true that nowadays most academic articles are archived somewhere using systems like LOCKSS, we are talking here about non-journal material (webpages, blogs), cited in scholarly articles. There is no archiving infrastructure which would ensure that this material is archived, which is exactly the problem WebCite tries to solve. And yes, WebCite does feed its content to libraries.

    9. I do know of electronic journals which have ceased to exist and where the articles are lost forever.

      That shows the problems associated with centralized archives. If the e-journal had been archived in the libraries like with print journals, for example, there would still be a copy of all their issues. WebCite may save publishers the work of disseminating copies, but one could argue that disseminating articles is kinda the job of a publisher.

      More to the point, however, if the articles were self-contained like I’m arguing they should be, the failure of the journal wouldn’t result in the loss of content. You wouldn’t need the publisher to read the paper just like you don’t need the publisher to be in business to read a print copy.

      As you say, it’s not the job of WebCite to archive journals, but cited websites. So how can the uptime of millions of websites be improved by creating one more? Mirrors, sure, but that’s a different matter.

    10. WebCite may save publishers the work of disseminating copies, but one could argue that disseminating articles is kinda the job of a publisher.

      It is getting a bit confusing. We are not talking about disseminating articles, we are talking about disseminating cited webpages. If you are saying that it is the responsibility of the publisher to make these available, then I would agree – this is why WebCite – run by publishers – has been created. If you are advocating a low-tech solution such as including cited webpages etc. as supplementary files, then I would argue that this creates more problems than it solves.

      So how can the uptime of millions of websites be improved by creating one more?

      Citing a webcited archived copy of the original URL (which is also fed into libraries) in addition to providing the original URL obviously increases the overall likelihood that either of these is still available for the reader. Perhaps you misunderstand the system as replacing the cited URL with a WebCite link. However, this is not how this is implemented. See http://www.jmir.org/2007/2/e15#ref1 for an example.
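The arithmetic behind this claim can be written out as a short sketch. It assumes the original site and the archive fail independently of each other, which is a simplification, and the probabilities used are made-up numbers rather than measurements.

```python
def at_least_one_available(p_original: float, p_archive: float) -> float:
    """Probability that the reader can reach the original page or its archived copy.

    Assumes the two links fail independently (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_original) * (1.0 - p_archive)


# Hypothetical figures: each link alone works half the time.
combined = at_least_one_available(0.5, 0.5)
print(combined)  # 0.75
```

Under this assumption the combined figure is never lower than either link alone, so adding a second link cannot hurt availability; how much it helps depends on how reliable the archive actually is.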

    11. I would argue that this creates more problems than it solves.

      Maybe. It’s just an idea borrowing from the publisher+library system that’s been so successful over centuries and regime changes and wars and so on. I would be out of my depth to comment further, but it could just be that you don’t think having a central point of failure is that big of a problem. I would argue that this is all fine and good for TinyURL, but making it distributed is the main architectural challenge in making something like this work for academics.

      I know you’re not expecting to replace the URL with a WebCite URL, but let’s remember how services can become taken for granted, experience “mission creep”, and become depended on. I’d hate to see publishers make the mistake the Twitter team did by coding TinyURL support into Twitter and being embarrassed when the site went down.

    12. You keep bringing up the “central point of failure” argument, but I don’t think that WebCite is any more a central point of failure than for example http://dx.doi.org/, through which references in scholarly papers are cross-referenced to other papers, using their DOI. Whenever you see a [CrossRef] link or something similar behind a reference in a publication, this is routed through http://dx.doi.org/, which resolves it to the “real” URL (e.g. http://dx.doi.org/10.2196/jmir.9.4.e29 resolves to the paper with the DOI 10.2196/jmir.9.4.e29, which happens to be at http://www.jmir.org/2007/4/e29/ ). Publishers got together and agreed on this. Surely they created a central point of failure – if doi.org is down or gets hacked, the system breaks down. But it works.
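The indirection described here can be illustrated with a toy model. The resolver address, the DOI, and its current location are taken from the comment; the in-memory `registry` dictionary, the function names, and the archive.example.org address are hypothetical stand-ins, not how doi.org is implemented.

```python
RESOLVER = "http://dx.doi.org/"

# Toy stand-in for the resolver's registry: DOI -> current location.
registry = {"10.2196/jmir.9.4.e29": "http://www.jmir.org/2007/4/e29/"}


def citation_link(doi: str) -> str:
    """The stable link that gets printed in a reference list."""
    return RESOLVER + doi


def resolve(link: str, registry: dict) -> str:
    """Map a citation link to wherever the paper currently lives."""
    doi = link[len(RESOLVER):]
    return registry[doi]


link = citation_link("10.2196/jmir.9.4.e29")
before_move = resolve(link, registry)

# If the paper moves (hypothetical new address), only the registry entry changes.
registry["10.2196/jmir.9.4.e29"] = "http://archive.example.org/jmir/2007/4/e29/"
after_move = resolve(link, registry)
print(link, before_move, after_move)
```

The printed link never changes, but every lookup passes through the one resolver, which is the trade-off being argued over in this thread.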

      I’d hate to see publishers make the mistake the Twitter team did by coding TinyURL support into Twitter and being embarrassed when the site went down.

      The model is that WebCite is owned by publishers, so they have control over whether or not it “goes down” (and have a vested interest in keeping it alive).

    13. At this point, I don’t know if I’m just too stupid to understand (and maybe Clifford Lynch is too), but I can’t help thinking about what would have happened to the state of the world’s knowledge if the Romans had gathered all the books from all the monasteries and private villas around the empire and put them in one central library in Rome. I’m fairly certain that if anyone had asked, “But what if Rome gets sacked and the library burned to the ground?”, they also would have been reassured that that could never happen.

      In this, I suppose we’ll just have to acknowledge that we have a difference of opinion, but I would rather see the data format be the unifying factor, rather than the domain. Then you could have a protocol where you could ask any one of a number of services for a piece of data identified by some unique identifier(s) and any of those services could direct you to a currently functioning repository, somewhat like the way p2p systems work today.
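To make the idea concrete, here is a minimal in-memory sketch of such a protocol: a client asks several independent resolvers for an identifier and takes the first answer it gets. Every name here (the `Resolver` class, the identifier, the example.org addresses) is hypothetical, and a real system would talk over the network rather than to local objects.

```python
from typing import Dict, List, Optional


class Resolver:
    """Toy resolver: knows where some identifiers currently live."""

    def __init__(self, name: str, known: Dict[str, str], online: bool = True):
        self.name = name
        self.known = known
        self.online = online

    def lookup(self, identifier: str) -> Optional[str]:
        if not self.online:
            raise ConnectionError(self.name + " is unreachable")
        return self.known.get(identifier)


def resolve(identifier: str, resolvers: List[Resolver]) -> Optional[str]:
    """Ask each resolver in turn; skip any that are down or do not know the identifier."""
    for resolver in resolvers:
        try:
            location = resolver.lookup(identifier)
        except ConnectionError:
            continue  # one resolver being down does not break the lookup
        if location is not None:
            return location
    return None


resolvers = [
    Resolver("first", {"doc-123": "http://a.example.org/doc-123"}, online=False),
    Resolver("second", {}),
    Resolver("third", {"doc-123": "http://c.example.org/doc-123"}),
]
found = resolve("doc-123", resolvers)
print(found)  # http://c.example.org/doc-123
```

No single resolver is required for the lookup to succeed, at the cost of needing several parties to agree on the identifier format.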

      Since you’re the expert, can you explain to me why a centralized domain lookup service is better than a distributed hash-based system?

    14. You wrote:

      “…but I can’t help thinking about what would have happened to the state of the world’s knowledge if the Romans had gathered all the books from all the monasteries and private villas around the empire and put them in one central library in Rome.”

      I don’t know where the impression comes from that net archivists gather everything in a “central library”. In fact, the International Internet Preservation Consortium (IIPC) (of which WebCite is a member) consists of a number of national libraries as well as transnational initiatives such as WebCite and the Internet Archive, so this is pretty decentralized and there are a lot of redundancies. I assume in the future there will be a fair amount of data exchange between these organizations.
      WebCite already allows hash-based lookup, and we are certainly working towards an infrastructure where other Internet archives are queried. What we cannot change overnight is the way scholarly authors currently cite webpages, blogs, wikis etc., which is not by hash, but by URL and date – so we have to work with these realities. That’s why the primary recommended entry point to retrieve an archived / webcited document is something like webcitation.org/?date=..&url=…, rather than /?hash=…. (where the part after cache is the hash sum of the cited document). This is much easier to implement for publishers/editors, as they know the URL and the date of the material the author is citing, but not the hash – no scholarly author currently cites something using the hash. If this changes, then we are certainly open to changing our recommendation, so that the “preferred” use of WebCite could be the query-by-hash format, which is something like http://www.webcitation.org/cache/73e53dd1f16cf8c5da298418d2a6e452870cf50e
      However, even if editorial/style guidelines on how to cite Internet material would be changed asking authors to include the hash, a system like WebCite would still be needed to determine the hash and to make sure that at least one archive actually has a copy of the cited document.
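For readers unfamiliar with query-by-hash, the sketch below shows how such a link could be derived from the bytes of a cited document. The `/cache/` prefix comes from the example URL above. The choice of SHA-1 is an assumption: the example identifier has the 40 hex digits of a SHA-1 digest, but the thread does not say which algorithm WebCite uses or exactly which bytes it hashes.

```python
import hashlib

# Prefix taken from the example URL in the comment above.
CACHE_BASE = "http://www.webcitation.org/cache/"


def cache_url(document: bytes) -> str:
    """Derive a query-by-hash link from the content of a cited document."""
    # Assumption: SHA-1 over the raw bytes; the real service may differ.
    digest = hashlib.sha1(document).hexdigest()
    return CACHE_BASE + digest


# Hypothetical page content as the citing author saw it.
snapshot = b"<html><body>Cited page as seen on the citation date</body></html>"
link = cache_url(snapshot)
print(link)
```

Because the identifier depends only on the content, any archive holding the same bytes could answer the same query, but an author cannot produce it without fetching and hashing the page first.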

    15. Well, that’s one way to do it, but then you run into the problem YouTube et al. have, where the content at the endpoint is behind some inscrutable hash that shouldn’t be public-facing. I’m not saying I’d like a hash-based system, just that I’d prefer that over a centralized one, i.e., one which requires use of URLs all resolving at the same domain.

      I appreciate your taking the time to go over these things with me. I hope you understand my standpoint on things is that of a writer, not a librarian or publisher, and I’m speaking out because it seems like it’s only librarians and publishers involved in the discussion at the moment, and if it continues this way, we’ll end up with some painful-to-use system that is more complicated than the current one, and no one will use it until forced by the publishers. I’m aware there are significant technical reasons for the way things are currently set up.

      As a content creator, I’d like to maximize both the availability and exposure of my work by having human-readable URLs that are resolvable by a distributed system of resolvers, OpenURL-style. That’s the viewpoint I’d like to put out there, for anyone working on services for academics.
