Web Spiders and Crawlers in E-Discovery
Tuesday, May 22, 2007 at 08:03PM
Ira P. Rothken in Issues

In the course of analyzing web sites and pages in your electronic investigations you will likely come across important evidence that you will need to preserve - from web site text manifesting trademark infringement, to web site photos related to copyright infringement or other illegal conduct, to web site files demonstrating security breaches.

You may need to analyze, after the fact, the way certain web pages and sites appeared on a certain date and time. You may also need to introduce certain web site manifestations into evidence in court.

Given the dynamic and voluminous nature of the web, this raises an obvious question:

How do you preserve web pages and whole web sites for later use as evidence?

The answer can be found in web spider and crawler software programs you can use to "mirror" web pages and whole sites for a given date and time. The terms web spider and web crawler (and web robot) are used interchangeably and in essence mean the same thing - a program or script that browses the world wide web from link to link in an automated fashion.

Generally speaking, most web spider and crawler "capture" programs work the same way - you provide a starting URL, presumably either the "front page" of the site or a deep link to the main offending page, and then you "tell" the program how many levels deep of the site or third party sites you want to spider and capture.
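
As a rough sketch of those two inputs using the free wget tool discussed later in this article, the -r switch turns on recursive spidering from the starting URL and -l caps how many levels deep the crawl goes (the depth of two here is just an example):

% wget -r -l 2 http://www.moredata.com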

You can use web spider software to capture a single page or an entire web site, or even to follow links to third party sites. Needless to say, you need to be careful when you input capture criteria and start the spidering software, as the amount of data captured grows exponentially based on decisions about how many levels to crawl, which linked pages to store, and whether to follow links to third party sites and servers.
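
One way to keep a capture from ballooning, again sketched with wget, is to confine the crawl deliberately: the --no-parent switch stops the spider from climbing above a starting deep link, -l caps the depth, and omitting the -H (span hosts) switch keeps it off third party servers. The page name below is just a placeholder for the offending deep link:

% wget -r -l 2 --no-parent http://www.moredata.com/offending-page.html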

You should also be mindful of a web spider side effect: in some instances you may place a large load on the target web server, which gets "pounded" with requests from your IP address - and that traffic may prematurely reveal your investigation.
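
If server load or stealth is a concern, the capture can be throttled. With wget, for example, -w waits a number of seconds between requests, --random-wait varies that delay so the traffic looks less mechanical, and --limit-rate caps the bandwidth used; the values here are arbitrary:

% wget -m -w 5 --random-wait --limit-rate=50k http://www.moredata.com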

I have found that in many instances, especially in the civil litigation context, the simplest and narrowest method of spidering is usually sufficient - especially when coupled with a web video, produced using a software tool like Camtasia Studio, that provides a video exemplar of web site browsing clearly showing the offending content.

In other words, in the current civil litigation climate it may be enough to use the built-in "web capture" tool in Internet Explorer or in Adobe Acrobat to capture a small number of pages or a single web page, and such an approach will rarely be seriously disputed from an evidentiary perspective. The added advantage of using Adobe Acrobat for web spidering and capture is that you can Bates stamp the resulting PDF pages and have your e-discovery document production of the target web pages ready to go in an easy-to-use format.

Adobe Acrobat converts the pages from their native format and thus may not be suitable if anything other than the general manifestation of the visible content on the target web pages is at issue - so choose a more robust solution if, for example, the underlying page formatting, page metadata, site directory structure, link structure, or dynamically generated code of the target web pages matters in your case.
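
If the native format or the server's own metadata matters, one fallback is a command line mirror, which stores the HTML exactly as it was served; wget's -S switch will also echo the server response headers (content types, last-modified dates, and so on) into a log file named with -o - the log file name below is just an example:

% wget -m -S -o server-headers.log http://www.moredata.com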

But, again, rarely does anyone waste time challenging the integrity or admissibility of an Internet Explorer or Adobe Acrobat capture of the target web page(s). If they do challenge the integrity of such a spidering effort, then serve a subpoena or a request for production of documents to obtain a mirror or copy of the legacy web code and see whether any of the alleged differences are material to the issues in dispute.

More complex spidering tools allow for much more accurate page and site capture and are particularly useful for copying a large number of web pages and sites over a long period of time and fully automating the process - including periodic programmatic review of the target web sites for changes and subsequent copies. Some of the leading programs in this category include Grab-a-Site, Web Copier, and WebWhacker.
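
That kind of periodic, automated re-capture can also be approximated with ordinary scheduling tools. The crontab entry below is only a sketch, with a placeholder evidence path and schedule - it re-mirrors the target site every night at 2:00 a.m. into a date-stamped folder (note that percent signs must be escaped in a crontab):

0 2 * * * wget -m -P /evidence/$(date +\%Y\%m\%d) http://www.moredata.com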

Programs like Grab-a-Site use specialized methods of web spidering and capture that preserve the integrity of the original site and thus may reduce or mitigate attacks on evidence admissibility. Some of these methods include maintaining actual filenames, server directory structure, and Unix compatibility.
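
Whatever tool you choose, a general practice (not a feature of Grab-a-Site itself) that further supports integrity is to hash every captured file immediately after the crawl and preserve the manifest together with a record of the capture time - here assuming the mirror landed in a www.moredata.com folder, as in the wget example later in this article:

% find www.moredata.com -type f -exec sha1sum {} \; > capture-manifest.sha1
% date -u > capture-date.txt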

WebWhacker may change the filenames in the spidering and capture process but does allow, using proxy server technology, for a relatively realistic offline simulation of the online browsing experience for the captured site. Thus there is an argument that you should use both Grab-a-Site and WebWhacker to capture a target web site, given the pros and cons of each.

If you use Linux you may want to consider using the "wget" command - it is free and there is also a version for Windows. For example, to capture this site, moredata.com, for your own offline review you can use the command:

% wget -m http://www.moredata.com

If there are problems because internal links on the mirrored site are absolute links pointing back to the live web, that can be handled by adding the -k (convert links) switch to the wget command:

% wget -m -k http://www.moredata.com

The above change in the wget command will likely change the links on the captured site so that they are no longer identical to the original site. Thus, if you want to err on the side of evidentiary conservatism, you may want to mirror target sites and pages using both wget methods around the same period of time.
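
One way to do that is to run both forms of the command around the same time into separate folders using wget's -P (directory prefix) switch; the folder names below are just placeholders:

% wget -m -P capture-original-links http://www.moredata.com
% wget -m -k -P capture-converted-links http://www.moredata.com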

It is important to remember that, unless you ethically hack a server, your web spidering software cannot capture much of the web site's source code, such as server side scripts. Therefore, depending on the issues in your case, you may need to use litigation methods like subpoenas and document requests to get the original web source code. Indeed, web spidering software is usually limited to what could be manifested in a user's browser - typically static HTML, browser side scripts, or the output dynamically generated by the web site's underlying source code - but often this is enough evidence to support the material allegations of illegal conduct on a given web site.
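
If the goal is to reproduce what a user's browser would have displayed, wget's -p (page requisites) switch is one option worth knowing - it also pulls the images, style sheets, and browser side scripts each page needs to render, while -k adjusts the links for offline viewing:

% wget -m -p -k http://www.moredata.com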

If you need access to historical manifestations of web pages and sites you can try searching the Google cache or using the Wayback Machine.

You can search the Google cache by using the "cache:URL" syntax in the Google search box - for example, "cache:www.moredata.com" returns this site's home page as Google last cached it.

The Wayback Machine, at web.archive.org, works in a similar way - enter a URL and it returns archived snapshots of that page over time. My law firm web site's home page, for example, can be found there.

Given the likelihood that you will need to find, mirror, and preserve electronic evidence obtained from the web, you should seriously consider acquiring the proper web spider software as part of your electronic investigation toolkit.

Article originally appeared on Moredata - Electronic Discovery and Evidence (http://www.moredata.com/).
See website for complete article licensing information.