I’ve written before about what seems to be the most persistent and error-proof way to handle citing journal articles in blog posts and blog posts in journal articles (1,2). Some people have gone to quite extensive efforts to address this problem, apparently without checking whether someone else had already started on a solution. I’m glad to see that people are starting to talk to one another about how to handle things, as opposed to each creating their own version of the wheel.
But what if we provided a different service for more informal content? Recently we have been talking with Gunther Eysenbach, the creator of the very cool WebCite service, about whether CrossRef could/should operate a citation caching service for ephemera.
As I said, I think WebCite is wonderful, but I do see a few problems with it in its current incarnation.
The first is that, the way it works now, it seems to effectively leech usage statistics away from the source of the content. If I have a blog entry that gets cited frequently, I certainly don’t want all the links (and their associated Google-juice) redirected away from my blog. As long as my blog is working, I want traffic coming to my copy of the content, not some cached copy of the content (gee- the same problem publishers face, no?). I would also, ideally, like that traffic to continue to come to my blog if I move hosting providers, platforms (WordPress, Movable Type), blog conglomerates (Gawker, Weblogs, Inc.), etc.
The second issue I have with WebCite is simpler. I don’t really fancy having to actually recreate and run a web-caching infrastructure when there is already a formidable one in existence.
The people at CrossRef know about Purl.org and Archive.org, and they share my rather dim opinion of the NLM’s recommendations for citing websites. However, the people at WebCite.org apparently didn’t know that you can deposit things upon request into Archive.org. If CrossRef goes forward with their idea, perhaps working with Purl.org like they did with DOI, it would pretty much make WebCite irrelevant, and I wouldn’t have to be frustrated by seeing http://webcitation.org/f973p4y in a paper and never knowing if it’s worth following the link or not (at least there’s a greasemonkey fix for YouTube links).
“Bloggers for Peer-Reviewed Research Reporting strives to identify serious academic blog posts about peer-reviewed research by offering an icon and an aggregation site where others can look to find the best academic blogging on the Net.”
It is all great except that it already existed, and for a long time before BPR3. You can go to the papers section in Postgenomic and select papers by the date they were published or blogged about, by how many bloggers mentioned the paper, or limit the search to a particular journal. I even used this earlier this year to suggest that the number of citations increases with the number of blog posts mentioning the paper.
CrossRef isn’t exactly a public service, you know… I wouldn’t like to be relying on a commercial service whose only customers are journal publishers to curate links to blog articles on the web. But maybe that’s just me being paranoid 🙂
The deposit-on-demand service from archive.org that you link to seems to be aimed at publishers (to archive their content as it’s published) unlike WebCite which is aimed at researchers (to archive other people’s content when they access it – rather than when the paper is written/published).
When I’ve used WebCite links in the past, it’s always been alongside normal web links and marked as an archived/cached version of the page, so you don’t have to worry that the original link context will be lost.
The main problem with WebCite, I think, is that it’ll just get full of junk because there’s no way to discriminate between valid caching for research and random movies/porn/music/etc storage. A service where users paid an appropriate fee to be able to archive pages on demand would make sense, I think, but would open up possible copyright difficulties.
CrossRef is a business, but they didn’t invent DOI, they’re just a major user of it, right? The point I’m getting at is that both the smart people who are rolling their own archive services, and the experienced people who are working with established services, need to get together.
A DOI-like system for URLs just seems to me like a better way right now to handle persistent links than WebCite, because how can an archiving solution handle both links to pages like this one but also dynamically changing displays of data generated from a database? Surely it doesn’t make sense to store a separate copy of a database for each way of filtering or displaying the data?
By no means am I implying that we should just sit back and wait for CrossRef to do something, just that people such as the developers behind bpr3.org or WebCite might benefit from talking to people who’ve previously worked on closely-related projects. I don’t want some half-baked idea to become a de facto standard that I have to use just because the people who came up with it were well-connected.
Journals already include supplementary material. How about storing a copy of the cited pages/databases, too?
I’m sure the people who built WebCite (I worked with them for a while, though not on this project) have done their background research.
The point of WebCite is that a researcher can archive a copy of a web page *at the time they visited it*. In theory, if browsers made it easy to save and read archived web pages in Mozilla Archive Format or similar (i.e. all zipped up in an archive), then journals could add those archived pages as supplementary information. In the absence of that method, WebCite is the next best thing, and the only existing solution to this problem that I know of.
Thanks for your comments, Alf. I know you’ve thought about this stuff for a long time, and have worked on many different aspects of the project (Zotero, etc.), so I’m always grateful for your comments.
Since persistence of content is the combined responsibility of the researchers and the publishers, and it’s already easy for someone to save an archived copy of a page, the publishers need to get to work on their end, don’t they?
Using a designated third-party archive is fine, and if handled right, could solve both the problem of which content gets archived and the problem of paying for a whole second mirror of the web. What if publishers made a deal with Archive.org, with whom many have an existing business relationship, to make an archiving tool for authors?
Since there’s no way to distinguish research content from general web content, do you think you’d end up with something useable by restricting who could use the repository, maybe by giving an invite to people with their first paper or something, and then spam-filtering the repository?
That’s the problem – it’s not already easy for someone to save an archived copy of a page. However, a central archiving service also has a problem: the server might not be seeing the same thing as the signed-in user.
Maybe Zotero’s shared online space, with snapshots taken client-side of bookmarked pages, will be a good place to handle this. If they can’t guarantee persistence then publishers can store the archived snapshots as supplementary data.
This is where mark-up and metadata come in, right, so that when the snapshot is taken, the archive knows what to make of the various parts of the page?
That certainly helps, yes.
Sorry I’m late to this thread. I just wanted to clarify something concerning baoilleach’s worry about us (CrossRef) being a “commercial service.” CrossRef is, in fact, a non-profit. We are a membership organization and our members are made up of non-profit publishers, commercial publishers, open access publishers, etc. We are completely business model neutral. The only reason we charge for our service is to ensure that it is sustainable. Clearly, we would have to charge for a WebCite-type service if we were to use the Internet Archive (or Portico) as the back-end archiving option, if for no other reason than the fact that *they* charge for these services and we’d need to recover our costs.
I think the main point is that there needs to be at least some redundancy built in if there’s going to be persistence, so a combination of IA or Zotero online storage with individual publisher archiving would achieve that.
Of course, some links are meant to be ephemeral.
WebCite is actually older than archive.org – I mentioned WebCite in a BMJ article in 1998 (a prototype was available in 1998 under the old URL webcite.net); see http://www.bmj.com/cgi/content/full/317/7171/1496 . So much for the point that “the people at WebCite didn’t know about archive-it etc.” Archive-it is a relatively recent development, and Alf is correct that its purpose is somewhat different.
We are working on and have implemented some specific features for the academic publishing environment (including metadata extraction, assigning DOIs to cached material, ability for publishers to submit manuscripts which will be scanned for URLs which will be cached etc.), and this is where I see the value of WebCite.
WebCite is already used by hundreds of journals and there is no way back… We hope that some organizations such as CrossRef or any other non-profit organization will take over the future development of the service (as I am primarily a researcher), but if not, then we will have to create our own non-profit organization steering the future development of the service.
Thanks for that background, Gunther. Would you care to address in further detail how WebCite would scale, and your plans for the sustainability and guaranteed availability of the service?
What about how to identify academic content vs. non?
I have made several attempts to “donate” the service and technology to CrossRef, as I think it is ultimately the publishers who should run the service.
There is a range of possible business models out there which I don’t want to go into in detail here (context-sensitive advertising being the obvious one, but also mining the data to determine the “impact” of web publications, creating a royalty scheme for authors of copyrighted cached work (and keeping a small percentage of this to fund the service), value-added services for paying members, etc.).
In terms of guaranteed availability we are talking to the guys who have the most experience in archiving material for centuries: libraries. Any library interested in this can sign up as an archiving partner and can host a dark mirror.
As Gunther said, CrossRef has had some discussions with WebCite about taking over the service and this is the reason that I’ve been raising this issue on the CrossTech Blog (cited above) and with people like Jon Udell (http://itc.conversationsnetwork.org/shows/detail1772.html). If we are going to try to make the case to the board and our membership (again- we are a non-profit membership organization) that we should run a service like this, it would be helpful to have some community input (e.g. this entire comment thread).
And again, to be clear- I didn’t mean to imply that we were thinking of working with IA, Portico, etc. *instead* of WebCite. I’m sorry if my post was misleading on that point. What I was trying to get across was that I suspect (and this has been confirmed by this thread) that people would be more comfortable with such a service if it were backed by multiple, redundant archiving options (library, commercial, etc.) and that this might cost money which we would have to recover (e.g. by charging publishers).
(Thanks for the clarification Geoffrey. Apologies for the error.)
Geoffrey – You’re exactly right. I’m interested in this because I recognize the value of preventing link rot, especially from within scientific publications. But the people uniquely able to determine which links they want to save are the people writing and publishing the paper, so an archive containing author-submitted archived pages, maintained and curated by publishers, seems like a more sustainable and confidence-inspiring approach compared to a private entity that archives anything sent to it.
Whatever archiving system the publishers want to use, be it individual archiving as supplementary info or use of a third-party service such as WebCite/IA, isn’t really my concern. The things that are important to me are meaningful links and a shared, open archive format, so that if WebCite gets hacked or, decades down the road, becomes owned by someone of less integrity, all those links out there (even the ones printed using actual ink) don’t become useless or dangerous. Lacking centralization, this could never happen to the whole archive at one time unless every publisher was compromised simultaneously.
There is value in a central archive for interlinking and so that all the archived material is in the same format, but if a centralized system is used, care must be taken so that no one publishing company is favored by the archive maintainer.
This is great, but you’ll still need a curated archive or people will just game the system.
This is fine, too, but how would it work? The currency of the web is links and traffic. Paying people whose stuff is linked to the most is a great incentive to contribute quality content, but where would the money come from? Making people pay for access devalues the content because it’ll be linked to less, and making people pay for inclusion turns the whole thing into a weird pyramid scheme.
Thanks, that is exactly the idea behind WebCite.
WebCite has meaningful links, in addition to an abbreviated format; the latter is designed to be used in conjunction with the original URL (so the citation format would be Author, Title, original URL: xxx, Archived at: http://www.webcitation.org/ID). Even if WebCite were to cease to exist, anybody could easily write a harvester to create a mapping table. We will also make the mapping tables downloadable to create redundancies.
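As a rough illustration of the ID-to-URL mapping table idea, the sketch below loads a downloaded table and resolves an ID back to its original URL and archive date. The CSV layout, field names, and sample ID are purely hypothetical; WebCite’s actual export format may differ.

```python
import csv
import io

# Hypothetical mapping-table export: WebCite ID -> original URL + archive date.
# The field names and the sample ID are illustrative, not WebCite's real format.
SAMPLE_TABLE = """id,original_url,archived_on
5eWUvH9x,http://example.org/post,2007-11-02
6aBcDeFg,http://example.org/data,2007-12-15
"""

def load_mapping(table_text):
    """Build an ID -> (original URL, archive date) lookup from a mapping table."""
    reader = csv.DictReader(io.StringIO(table_text))
    return {row["id"]: (row["original_url"], row["archived_on"]) for row in reader}

mapping = load_mapping(SAMPLE_TABLE)
url, date = mapping["5eWUvH9x"]
```

With tables like this mirrored in several places, a dead http://www.webcitation.org/ID link could still be reconstructed into the original URL and visit date, which is the redundancy being described.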
WebCite is “open” in the sense that everybody can go into the archive and access whatever we archive. The difficulty for us is to strike a balance between enabling “openness” / creating multiple copies and redundancies, and still staying within the boundaries of copyright law. If we created the ability to download our entire archive as XML files, I am not sure what this would do to our liability and our capacity to remove copyrighted material from the database, should a copyright holder request removal from the archive. These are all extremely complex issues. WebCite is absolutely open to constructive ideas on how to strike the balance.
That’s an interesting research question. I don’t think that people would be more able to game the system than journal editors are currently gaming the system to inflate their journal impact factor, but that’s a different matter…
Copyright holders of archived material sometimes contact WebCite to request removal of archived material from WebCite – not because they fear losing links and traffic, but because they have taken down the content on their site and hope to make money with it. There are some publication / business models out there where content is initially free but becomes “closed” after a while (many newspapers work that way: you can read the content of the current day, but to request content from an issue published a year ago you have to pay). The idea would be that WebCite could offer these copyright holders (who request removal of their archived content from WebCite because they fear it affects their revenues from selling archived content themselves) the option to make their archived content available on a pay-per-view basis rather than removing it, where users pay to retrieve the archived version (much as they would do on the newspaper site, requesting a past issue from the archive). WebCite could then keep a fraction of this as a commission. Of course, in an ideal world, everything in the archive would remain free. But in reality, if a copyright holder approaches WebCite to remove it, there is no choice – so the better way would be to keep it in the archive and to pay royalties to the author.
This is just one of the many business models which are possible to support WebCite in the long run – apart from context sensitive advertising (e.g. to advertise journals, books and articles which are related to the cited or archived webpage!).
Thanks for taking the time to discuss these issues with me, Gunther. Obviously the copyright issue is the thorniest one, because dealing with takedown requests is a big problem that I know you’re anxious to avoid. It just seems like using a distributed archive, instead of having everything under the same domain, would get around this problem. I don’t know how to ensure that links always resolve without giving up the distributed redundancy that is the strength of the web, but I think that’s a problem that needs a technical solution, not the expedient of simply archiving everything under one roof. I recognize that you’ve put a great deal of effort into working with publishers to ensure the longevity of your service, but what you’re basically asking us to believe is that the problem of web addresses occasionally failing to resolve can be fixed by using one address for everything. That’s a hard sell, even if it were the NLM proposing it. We need a better way of doing this, because the solution to web site unavailability can’t be achieved by making another website.
What are your plans for publisher-curation of your archives?
How will you enforce use of the long format citations where space isn’t at a premium?
Acknowledging that gaming occurs in every search engine index and ranking system, what safeguards are you going to take to prevent that in your index?
Have you compared the likelihood that an item will remain in your archive over 10 years with the likelihood that a given link will resolve properly over 10 years? It seems like takedown notices could be a bigger problem than link rot, something that could be entirely avoided by fair use author archiving.
I think at the heart of the issue is that a research paper should be self-contained. If the information is important, it should be included with the paper, and one shouldn’t simply hope that a website will remain up, no matter if it’s some guy’s blog or some guy’s web archive.
No research paper is fully self-contained. All papers rely on (cite) previous and related research or previous opinions without including the results or full text of the work they are citing. They can do this because there is an infrastructure in place that gives authors reasonable reassurance that any reader can retrieve the cited information in libraries, because journal articles are usually archived somewhere. The same assurance does not exist for web references. Which is exactly what WebCite tries to address.
I don’t see WebCite “enforcing” anything (although, if it were run by CrossRef, there would certainly be ways to do it). Rather, we could make the mapping tables (ID to URL/date) public, to address concerns that if WebCite ceases to exist, cited URLs can still be reconstructed.
WebCite has been around for 9 years. In one study of webpages cited in scholarly articles, 13% of Internet references were inactive after 27 months.
But that is not the point, because the approach is not to REPLACE the cited original URL, but to ADD a link to an archived copy (see the reference section in JMIR articles or the BMC implementation as examples – all use the WebCite link in addition to the original URL). Surely everybody would agree that ADDING a WebCite link increases the overall likelihood that the cited material is still available through at least one of these channels!
Mathematically speaking: if p is the likelihood that WebCite still exists in 10 years, and q is the likelihood that the original URL does not change or disappear within 10 years, then the likelihood that at least one of these is still available is
p + q – pq, which is ALWAYS at least as large as q, regardless of what probabilities you plug in.
(We are dealing with independent events here, and the product pq is subtracted to account for the likelihood that both events occur together.)
Thanks for your other questions, which are certainly interesting, and plenty of fodder for information science PhD students to work on (interested students looking for a thesis, please contact me! We are working with the Faculty of Information Science in Toronto on some of these issues). I have also been thinking about a Napster-like storage architecture, among other ideas.
I think my bottom line is that, as Matt Cockerill writes, a basic principle of digital archiving is that the sooner you start, the less you lose. So we can either wait until we have figured out all the details of the perfect solution, or we can start archiving now and iteratively refine our approach. The alternative model – authors archive the stuff they are citing themselves (or publishers do it) – has been around since the start of the Web, and has been shown to be ineffective.