
Re: [Phys-l] uncovering MS-Word document info



On 04/11/2007 03:26 PM, Zajac, Richard wrote:

> By opening the file under the "recover all text" option in Word, I can find
> a wealth of file information beyond what is generally listed in the Document
> Properties, such as previous file names and directories, none of which is
> conclusive. Is there more information that can be extracted from the
> non-text content of the file?

There are about 100 categories of metadata that /might/ be there.

Some documents have the "Track Changes" option in effect. It's more
common in "workplace" documents than in "student documents". If
you find such a document, you'll have a cornucopia of forensic
information.

The same goes for the "Fast Save" option.

Other than a few things like that, it's mostly a matter of picking up
breadcrumbs in the forest, which requires some nontrivial effort
and some understanding of arcane details. It probably won't be
conclusive by itself; a better idea would be to get a detailed
statement from the student of where the document came from, and
see if that is /consistent/ with the metadata.

> Does anyone have any suggestions for extracting additional document info
> from a MS-Word document?

1) I'm not an expert. I don't do windows.

2) If it's the latest version, i.e. Office 2007, you're in luck,
because its native .docx files are zip archives of XML parts, so
you can unzip one and just eyeball the contents.

3) Older .doc files are binary. To read them you need some
sort of tool.

3a) The most-obvious options include looking at the document
with msword, abiword, openoffice, or the like, and asking to
see the metadata, the Tracked Changes, et cetera.

3b) For non-critical tasks the linux tools are the quickest:
extract -V
antiword -r -s -f

3c) Also: I'm told Document Detective
http://www.stg.srs.com/eds/docdet/
is pretty good. I'm not 1000% sure how well it works for
_examining_ the hidden contents. It's touted as a tool
for _detecting_ and _eliminating_ hidden data, and there
are lots of situations where I might want others to detect
and/or eliminate hidden data in my files, without necessarily
wanting them to rummage through any hidden data that was
found. In any case, simple detection is already valuable,
because it tells you whether it is worth additional effort
to track down the details.

3d) Apache Jakarta POI is an open-source thingy that can
understand .doc binary format. I've never tried it, but
I reckon it could be taught to spill 100% of the available
beans.
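
For the Office 2007 case in item (2), here is a minimal sketch of
what "eyeballing" could look like, assuming the file is a .docx
(a zip archive) and that it carries the usual docProps/core.xml
and docProps/app.xml parts. The function name docx_metadata is
my own invention, not part of any library, and a given file
may lack either part.

```python
import zipfile
import xml.etree.ElementTree as ET

# Parts of a .docx package that normally hold document properties.
# Assumption: a particular file may omit either of them.
PROPERTY_PARTS = ("docProps/core.xml", "docProps/app.xml")


def docx_metadata(path_or_file):
    """Return {part name: {property name: text}} for a .docx file.

    Hypothetical helper for illustration; accepts a file name or a
    file-like object, as zipfile.ZipFile does.
    """
    result = {}
    with zipfile.ZipFile(path_or_file) as zf:
        present = set(zf.namelist())
        for part in PROPERTY_PARTS:
            if part not in present:
                continue
            root = ET.fromstring(zf.read(part))
            props = {}
            for elem in root.iter():
                text = (elem.text or "").strip()
                if text:
                    # ElementTree tags look like "{namespace}name";
                    # keep only the bare property name.
                    props[elem.tag.split("}")[-1]] = text
            result[part] = props
    return result
```

Calling docx_metadata on the submitted file and printing the result
should show fields such as creator and lastModifiedBy if they are
present. As with everything else here, treat them as breadcrumbs,
not proof.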

==========

> The student claims to have used someone else's file only as a formatting
> aid, that he forgot to change the header info, and that the text and data
> are in fact his own work.

That part is not implausible. I haven't written a document
"from scratch" in years and years. Normally I just copy an
older document and modify it. Usually I copy something of
my own, or copy the "formatted example" provided by the
journal publisher ... but if somebody looked closely at
the changelogs they would see some rather peculiar lineages.

My point is, peculiar is not necessarily sinister.

In any event, it would be unwise to focus too much on a single
document. A proper forensic investigation would consider
many sources of information, including a close look at the
computer on which the document was prepared.

A super-simple and super-powerful check would be to ask the
student how he obtained the other student's file. Then go
through the same steps to obtain another copy of the file,
and just compare it to what was turned in. Since we have a
name, a date, and other details, it should be straightforward
to locate the right file. If this hasn't already been done,
it would IMHO be well worth doing.
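
The comparison itself can be sketched in a few lines. This is
only an illustration under my own assumptions: the helper names
are made up, and the two text strings are assumed to have been
extracted already by some other tool (for example one of those
listed in item 3 above).

```python
import difflib
import hashlib


def file_digest(path):
    """SHA-256 of a file's raw bytes; equal digests mean identical files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def text_similarity(text_a, text_b):
    """Word-level similarity in the range 0.0 (disjoint) to 1.0 (same)."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return matcher.ratio()


def changed_lines(text_a, text_b):
    """Lines removed from text_a ("- ") or added in text_b ("+ ")."""
    diff = difflib.ndiff(text_a.splitlines(), text_b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

Identical digests would mean the submitted file is a byte-for-byte
copy. A high similarity score with only a few changed lines is
merely suggestive, and should be weighed against the student's
account rather than treated as conclusive.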