File Analysis: File Identification and Profiling

Please read pp. 283-378 of MF.

Taking the same dubious file that is analyzed on p. 284 of MF, "Hot New Video", what are the first items to look at?

What is this file? Broad steps...

  1. Details about where we found this file.
  2. Hash it to obtain a "fingerprint" of this file to find identical files.
  3. "Fuzzy" hash it to find similar files.
  4. Classify its format, target architecture (hardware and software), the language it was written in, and the compiler/assembler used to create it.
  5. Scan the file for malware signatures
  6. Analyze the file for malware properties (are there oddly obfuscated areas? are there odd calls?)
  7. Extract strings and any symbol table information
  8. Look for armoring, such as wrappers, packers, or encryption — very few legitimate executables have any need for these
  9. Look at the linking: is it statically linked? Is it dynamically linked? If dynamically linked, what shared libraries does it use?
  10. Look online to see whether other people have more information about the file.
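
Step 2 above can be sketched in a few lines of Python. This is only an illustration, not the procedure MF uses: the function name file_hashes and its parameters are made up for this example, and the choice of MD5 plus SHA-256 is an assumption. MD5 is still widely used for looking up known samples, but it is collision-prone, so a second, stronger hash is cheap insurance.

```python
import hashlib


def file_hashes(path, algorithms=("md5", "sha256"), chunk_size=65536):
    """Return {algorithm name: hex digest} for the file at `path`.

    `file_hashes` is a hypothetical helper for these notes, not an MF tool.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a large sample is never held in memory whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling file_hashes on the suspect file gives digests that can be compared against other copies or searched for online. Note that this finds only identical files; a single changed byte gives a completely different digest, which is why step 3 uses fuzzy hashing.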

Ahem, what is an executable?

A compiled executable is (usually) a creation from some sort of source code, such as C, Java, C++, C#, assembly language, or a whole host of other languages.

Compilation and its backscatter

Compilation and linking both (generally) introduce recognizable structure, such as symbol tables, linkage maps, and debugging information. These can help you distinguish what compilers and linkers might have been involved in the creation of an executable.

Of course, malware authors often remove identifying information (though apparently not in this case!). Also, the main page for analyzing this trojan is worth reading: http://www.skullsecurity.org/blog/?p=627


As we have mentioned before, file "types" as indicated by extensions such as .exe or .jpg are not at all reliable for identifying files.

In the Unix/Linux world, we have long had file (and now we also have the very useful objdump for looking at executable characteristics), but there's no completely equivalent program from Microsoft for the Windows world. Outside of Microsoft's port of Unix file, MF lists GT2, which appears to offer quite a bit of functionality.
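
The core idea behind file is to look at the leading "magic" bytes rather than the extension. A very rough sketch of that idea follows; the signature table is a tiny hand-picked sample (the real file command uses a far larger database and many more tests), and the names MAGIC_SIGNATURES, identify_by_magic, and identify_file are made up for this example.

```python
# A few well-known leading byte sequences; deliberately incomplete.
MAGIC_SIGNATURES = [
    (b"MZ", "DOS/Windows executable (MZ header)"),
    (b"\x7fELF", "ELF executable or object"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (also used by JAR and newer Office files)"),
]


def identify_by_magic(data):
    """Return a description based only on the first bytes of `data`."""
    for magic, description in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return description
    return "unknown"


def identify_file(path):
    """Read the start of the file at `path` and classify it by magic bytes."""
    with open(path, "rb") as fh:
        return identify_by_magic(fh.read(16))
```

A file named video.jpg whose first two bytes are "MZ" would be reported here as a DOS/Windows executable, which is exactly the kind of mismatch worth noticing.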

Malware identification

Because of the vast proliferation of malware, many businesses and other types of organizations have sprung up to help fight the problem.

These can be good sources of signatures for identifying malware; for instance, the signature databases that ClamAV provides (ClamWin is its Windows front end) can be used to identify problematic software.

I don't personally recommend using any online engines. You have no idea what's in the suspected malware, and you have no idea who you are giving the information to. If it turns out that keylogging information, for instance, has been buffered in the file you submit, you can be exposing sensitive information to unknown parties.

Of course, one thing to look for (as always) is strings. What might be in the strings that you find?

From MF, pp. 314-316:

MF points out on page 316 that strings can also be intentionally misleading.
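
For reference, here is a minimal sketch of what a strings-style extraction does: it scans the raw bytes for runs of printable characters. The function names and the default minimum length of 4 are assumptions for this example (4 happens to be the usual default of the Unix strings program). The second function looks for 16-bit little-endian text, which is common in Windows binaries and is missed by a plain ASCII scan.

```python
import re


def extract_strings(data, min_length=4):
    """Yield (offset, text) for runs of printable ASCII bytes in `data`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")


def extract_wide_strings(data, min_length=4):
    """Yield (offset, text) for runs of UTF-16LE-encoded printable ASCII."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("utf-16-le")
```

Things worth scanning the output for include URLs, IP addresses, file paths, registry keys, and library or function names. Bear in mind the caution above: anything found this way could have been planted.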


While some malware is written in straight assembly language that potentially has very little linkage (and none if the writer can figure out how to have it executed directly rather than having the operating system load the executable into a process), other malware is written in higher-level languages that almost always require some linkage (there aren't any malware writers using Forth or Mint, I guess!)

In the Windows world, DUMPBIN can be a lot of help for a standard binary produced by Microsoft's programming tools. It can list sections and what is linked in. The Linux program ldd has been ported to the Windows world, and it is helpful for showing what is dynamically linked (but note that your own environment's PATH information can also influence dynamic linking characteristics!). DUMPBIN can also list symbolic information, much like nm can in the Unix/Linux world.


Metadata is not limited to Windows executables (although those binaries certainly can contain a surprising amount of metadata). Microsoft Office files can also embed a lot of metadata, although it is often very "stale" since many people simply re-use Office files in order to keep the embedded formatting. Other formats, such as the EXIF information in JPEG and TIFF files, can also contain useful metadata. Adobe's formats may also contain metadata that can be useful.

What to look for:


There is even (at least occasionally) an Obfuscated C Code Contest, so obfuscation is not solely the province of hackers. Other folks, such as Zend, the PHP folks, have provided mechanisms such as bytecoding for the explicit purpose of obfuscation.

But the malware malefactors have taken such obfuscation to entirely new levels, creating "digital armor", using packing, encryption, and "binders".

Obfuscation: Packing

The idea of packing is actually an old one. Back in the day, we had limited resources, be those memory, disk, or bit transportation (often via the U.S. mail and sneakernet.) Packing anything and everything conserved these resources. Even these days we use programs like zip, gzip, and other compressors to reduce the size of files.

But malefactors use packing for an entirely different purpose. They want to obscure their malware (and perhaps minimize its footprint), and packing is an efficient method to do so. They also use "embedded" packers, which are far less common in legitimate applications today. The decompression routine typically appears at the end of the file; the routine decompresses the executable and then executes it.

Obfuscation: Encryption

Encryption accomplishes roughly the same goal for malware as packing: it obscures the nature of the program.

Packers remove redundancy, so packed files are detectably different from ordinary binaries (which typically have plenty of redundancy). Encryptors do not (generally) shrink a file, but they also leave very little visible redundancy: well-encrypted data looks statistically close to random.
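
One common way to put a number on redundancy is the Shannon entropy of the byte values, measured in bits per byte. The sketch below is a generic calculation, not the scoring used by any particular tool mentioned in MF, and the function name shannon_entropy is made up for this example.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Ordinary code and data usually score well below the maximum of 8, while compressed or encrypted regions score close to it. Where to draw the line is a judgment call, and a high score only says "look closer"; it does not by itself prove that a file is malicious.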

Using tools to detect "unusual" binaries

MF mentions several tools to detect anomalous binaries, including tools of dubious origin. One that is without cost is Mandiant.com's "Red Curtain" tool (in the text, this is usually abbreviated "MRC".)

Binders, joiners, and wrappers

The concept here is very reminiscent of the old days of computing, when we used "overlays" to separate and recombine bits of binaries in order to conserve memory space.


COFF, which has been used in several operating systems, was superseded by the PE (Portable Executable) format in the Windows world.

This format has the typical ideas that one expects: sectioning of binaries into areas such as .text, .data, and .bss sections. Unfortunately, in its quest for backward compatibility, it also kept things such as the "MS-DOS stub" area, which can be used by attackers.
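
To make the layout concrete, here is a minimal sketch that walks the headers by hand: the "MZ" signature at offset 0, the 4-byte pointer at offset 0x3C to the "PE" signature, the 20-byte COFF file header after it, and then the section table with one 40-byte entry per section. The function name parse_pe_headers and the keys of the returned dictionary are made up for this example; a real analysis would use a dedicated tool, and a hostile file may lie in any of these fields.

```python
import struct


def parse_pe_headers(data):
    """Return basic header facts for a PE image held in `data` (bytes)."""
    if data[:2] != b"MZ":
        raise ValueError("no MZ signature: not a DOS/PE executable")
    if len(data) < 0x40:
        raise ValueError("truncated DOS header")
    # e_lfanew: file offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("no PE signature at the offset given by e_lfanew")
    # COFF file header: 20 bytes immediately after the signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, _characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    # The section table follows the optional header.
    table = pe_offset + 4 + 20 + opt_size
    sections = []
    for i in range(n_sections):
        raw_name = data[table + 40 * i:table + 40 * i + 8]
        if len(raw_name) < 8:
            break  # section table runs past the end of the data
        sections.append(raw_name.rstrip(b"\x00").decode("ascii", errors="replace"))
    return {
        "pe_offset": pe_offset,
        # Bytes between the 64-byte DOS header and the PE signature.
        "dos_stub_size": max(0, pe_offset - 0x40),
        "machine": machine,
        "timestamp": timestamp,
        "sections": sections,
    }
```

Unusual section names, or a gap before the PE signature that is much larger than the usual small stub program, are the sort of details that deserve a second look.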

CFF Explorer is used extensively in the MF book for extracting data from Windows executables.