http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=512

By Simson L. Garfinkel, Ph.D.
ACM Queue vol. 5, no. 7
November/December 2007

A computer used by Al Qaeda ends up in the hands of a Wall Street Journal reporter. A laptop from Iran is discovered that contains details of that country's nuclear weapons program. Photographs and videos are downloaded from terrorist Web sites. As evidenced by these and countless other cases, digital documents and storage devices hold the key to many ongoing military and criminal investigations.

The most straightforward approach to using these media and documents is to explore them with ordinary tools - open the Word files with Microsoft Word, view the Web pages with Internet Explorer, and so on.

Although this straightforward approach is easy to understand, it can miss a lot. Deleted and invisible files can be made visible using basic forensic tools. Programs called carvers can locate information that isn't even a complete file and turn it into a form that can be readily processed. Detailed examination of e-mail headers and log files can reveal where a computer was used and other computers with which it came into contact. Linguistic tools can discover multiple documents that refer to the same individuals, even though names in the different documents have different spellings and are in different human languages. Data-mining techniques such as cross-drive analysis can reconstruct social networks - automatically determining, for example, if the computer's previous user was in contact with known terrorists.

This sort of advanced analysis is the stuff of DOMEX, the little-known intelligence practice of document and media exploitation. The U.S. intelligence community defines DOMEX as "the processing, translation, analysis, and dissemination of collected hard-copy documents and electronic media, which are under the U.S. government's physical control and are not publicly available."[1] That definition goes on to exclude "the handling of documents and media during the collection, initial review, and inventory process." DOMEX is not about being a digital librarian; it's about being a digital detective.

Although very little has been disclosed about the government's DOMEX activities, in recent years academic researchers - particularly those concerned with electronic privacy - have learned a great deal about the general process of electronic document and media exploitation.

My interest in DOMEX started while studying data left on hard drives and memory sticks after files had been deleted or the media had been "formatted." I built a system to automatically copy the data off the hard drives, store it on a server, and search for confidential information. In the process I built a rudimentary DOMEX system. Other recent academic research in the fields of computer forensics, data recovery, machine translation, and data mining is also directly applicable to DOMEX.

This article introduces electronic document and media exploitation from that academic perspective. It presents a model for performing this kind of exploitation and discusses some of the relevant academic research. Properly done, DOMEX goes far beyond recovering documents from hard drives and storing them in searchable archives. Understanding this engineering problem gives insight that will be useful for designing any system that works with large amounts of unstructured, heterogeneous data.

Why "Exploitation?"

When researchers say that their work is centered on information or document "exploitation," eyebrows are invariably raised. The word exploitation is provocative, attracting unwarranted attention to a process that could just as easily be classified as "computer forensics" or even "data recovery." But, in fact, the word is apropos.
The words exploit and exploitation imply using something in a manner that's "unfair or selfish."[2] And it's true. People who are in the business of document and media exploitation really do seek to make unfair use of computer documents and electronic storage devices. Fair, after all, means following the rules.

The "rules" of a computer system are the APIs, the data-storage standards, the file permissions, and other interfaces that were intended to be used by the file's creator. When a file in the computer's electronic trash is deleted by "emptying the trash," the rules say that the file's contents should no longer be accessible. The "undelete" command that is part of every forensic toolkit takes advantage of the fact that computer systems generally do not overwrite the contents of deleted files. This is a common problem in computer systems, affecting not only deleted files in file systems but also deleted paragraphs in word processors and even unallocated pages in virtual memory systems.

Computer forensic practitioners working for police departments and litigation support firms also make their living by recovering intentionally deleted data, but even these processes follow rules - though those involved in exploitation might choose to ignore them. The goal of computer forensics is to assist in some kind of investigation, which usually begins because a crime was committed and, hopefully, ends with the perpetrator being convicted in a court of law. With conviction as a goal, forensic practitioners must be concerned with the evidentiary integrity and chain of custody - and they need to limit their search to information that is relevant to that investigation. In many cases the evidence will have been obtained under a search warrant or discovery procedure, the terms of which may limit the forensic examiner's actions or even which kinds of files may be examined.

Evidence obtained by breaking the rules may even be suppressed. For example, in the case of U.S. v. Carey, an investigator executing a warrant on narcotics discovered files with a JPG extension that contained child pornography. Carey was indicted and convicted for possession of child pornography, but the appellate court reversed the ruling and remanded the case back to the trial court, arguing that "the seizure of evidence was beyond the scope of the warrant."[3] The evidence should have been suppressed.

Unlike the investigators in the Carey case, those engaged in document and media exploitation are not bound by any rules other than laws of physics and nature. The goal of information exploitation is to get and use the data - the ends justify the means. It's OK if these results aren't good enough for a conviction. Exploitation rarely seeks to prove or disprove the details of a case; instead, it seeks to make the fullest use of all the data that has been obtained. The standard of success is the usefulness of the result, not the reliability of the process.

If you find the preceding paragraph alarming, remember that DOMEX is about exploiting data, not people. "Exploitation" is precisely the attitude that you want when you take a crashed hard drive to a data-recovery firm. If you've just lost the only copy of a 400-page manuscript, it's probably OK with you if the firm is able to recover the first 200 pages of the September 20 version and the last 180 pages of the August 19 version. Although a good defense attorney might be able to suppress a document that was made by stitching together those two halves, you probably don't care about that if you are the author and the alternative is rewriting the 400 pages from memory.

Likewise, if you are using some kind of desktop search system to index the files on your hard drive, you don't mind if the product makes a mistake or two and shows you files that you aren't "allowed" to see - just as long as you find what you're searching for.

[...]