Jan 4 2019

A dive into the world of MS-DOS viruses


This post is a textual version of a talk I gave at the 35th Chaos Communication Congress (35C3) at the end of 2018. You can watch the talk that was recorded by the wonderful C3VOC team below if that’s your preferred medium:

Or watch using the C3VOC/media.ccc.de player

intro slide

So I have an admission to make: MS-DOS is slightly older than I am. Regardless, MS-DOS malware has always fascinated me to some degree. But first we must ask: “What is DOS?”

DOS is the “one up” of CP/M, another very old operating system
The DOS family covers a wide range of vendors; just because it’s DOS does not mean it’s going to be running on an 8086 CPU or better
Some of these DOS vendors share API compatibility, meaning that some have shared malware!

But really, most of our memories of the DOS era are of its strong aesthetic, and of how the computers looked at the time:

old DOS computers

This is the era of “computing beige” and the Model M keyboard, which may be famous or infamous depending on whether you enjoy loud keyboards or not.

DOS prompt

Some of us may have memories of using DOS, and some might still use DOS!

WordStar for MS-DOS

For example, George R. R. Martin, who wrote Game of Thrones, reportedly uses WordStar on DOS to write the books!

QBASIC for MS-DOS

We also cannot overlook QBASIC; for many this would have been their first exposure to programming!

techno virus

But sometimes life using DOS was not so great. Sometimes you would be using DOS and all of a sudden things like this would happen. This sample also plays a small tune on the PC speaker while it’s printing, so this could be really embarrassing in an office environment.

ambulance virus

Some are a little more “cute”. This example just shows an ASCII art ambulance scrolling across the screen, and then allows the program you ran to continue: at worst a mild inconvenience.

Thanks to a bunch of malware archivists running under the name VX Heavens, we have a good historical archive of DOS malware, or at least we did until the Ukrainian police raided the site:

Friday, 23 March, the server has being seized by the police forces due to the criminal investigation (article 361-1 Criminal Code of Ukraine - the creation of the malicious programs with an intent to sell or spread them) based on someone’s tip-off on “placement into the free access malicious software designed for the unauthorized breaking into computers, automated systems, computer networks”.

Luckily, there are still copies of the site’s database around on popular torrent websites that can provide us with a lovely dataset:

$ tar -tvf viruses-20070914.tar | wc -l
66714

$ ls -alh viruses-20070914.tar
6.6G viruses-20070914.tar

However, before we start looking into these samples, we first need to understand their typical propagation flow, given that these programs were running in a pre-internet era:

dos malware lifecycle

Once you have got an infected file on your system and run it, the malware will either actively search for files to infect, or install syscall hooks to catch the programs you run afterwards. It will often do this in a subtle and non-visible way to avoid detection. Subtlety is important because, to spread, this malware needs to either be given to another system through a media copy (floppy disk), or be uploaded to another distribution point like a BBS.

malware decision tree, payload highlighted

At runtime, the malware has two options: it can either stay hidden and infect new files, or it can display its payload.

Some of the payloads are quite pretty! The example below uses fancy features such as 256 color graphics:

havoc the chaos malware

Or this one that is playing around with your screen buffer:

burma virus

malware decision tree, infect highlighted

However, for the most part the malware will stay quiet and try to find files to infect. Infecting most files is super easy. For example, if you view a COM file as a long tape of machine code:

COM file tape

Then “all you need to do” is insert a JMP at the start of the program, and append the malware’s own code to the end of the program, leaving you with something that looks like this:

COM file tape that has been infected
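To make that layout concrete from the defender's side, here is a minimal Python sketch that checks whether a COM image starts with a near JMP that lands in the tail of the file. The function names and the `tail_fraction` threshold are my own inventions for illustration, and plenty of perfectly clean COM files also start with a JMP, so treat this as a rough heuristic rather than a detector:

```python
def leading_jmp_target(image: bytes):
    """Return the file offset targeted by a leading near JMP, or None."""
    # 0xE9 is the opcode for JMP rel16. The two bytes after it are a signed,
    # little-endian displacement measured from the end of the 3-byte instruction.
    if len(image) < 3 or image[0] != 0xE9:
        return None
    displacement = int.from_bytes(image[1:3], "little", signed=True)
    return 3 + displacement


def looks_appended(image: bytes, tail_fraction: float = 0.25) -> bool:
    """Heuristic: does the file begin by jumping into its last `tail_fraction`?"""
    target = leading_jmp_target(image)
    if target is None or not 0 <= target < len(image):
        return False
    return target >= len(image) * (1 - tail_fraction)


# Synthetic example data (not a real sample): 97 NOP bytes standing in for a
# host program, followed by 20 placeholder bytes standing in for appended code.
infected_like = b"\xE9\x61\x00" + b"\x90" * 97 + b"\xCC" * 20
clean_like = b"\xB4\x09" + b"\x90" * 118

print(looks_appended(infected_like), looks_appended(clean_like))
```

Because the jump is relative, the check works on file offsets directly and does not need to know that COM files are loaded at offset 100h.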

Some code was smarter and would find “empty space” in a binary and write itself there. This prevented the binary from getting bigger, which is a possible red flag for an antivirus to use.

malware decision tree, infect highlighted

However, thinking back, I also mentioned syscall hooking. Even though the execution runtime of MS-DOS is very basic and carries almost no protection at all (you can trivially boot Linux from a COM file), it still carries a full API so that applications do not need their own file system implementations. Here is what some of the syscall functions look like:

MSDOS syscall small list

These work by calling a software interrupt, where the program asks the CPU to jump to another section of system memory to handle something:

dos to syscall graph

MS-DOS also offers the ability to add or modify these calls (with another call), allowing the system to be extended so that new drivers can be loaded at runtime. However, this is also a perfect place to add hooks for malware:

dos to syscall infected graph

This was a well used trick, since you could hook the “Open File” call and then use that to discover new binaries being run on the system… and infect them.
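The chaining idea is easier to see in a toy model. The Python sketch below is not DOS code: `vectors`, `dos_int21` and `install_hook` are hypothetical names, and a dictionary stands in for the interrupt table. On a real system the equivalent "get vector" and "set vector" steps are themselves int 21h calls (AH=35h and AH=25h):

```python
from typing import Callable, Dict, List

Handler = Callable[[int], str]  # takes the AH value, returns a result string


def dos_int21(ah: int) -> str:
    """Stand-in for the original DOS handler."""
    return f"DOS handled AH={ah:02X}h"


vectors: Dict[int, Handler] = {0x21: dos_int21}
seen: List[int] = []


def install_hook(vector: int) -> None:
    """Swap a vector for a wrapper that records the call, then chains on."""
    original = vectors[vector]      # the "get vector" step

    def hooked(ah: int) -> str:
        seen.append(ah)             # the hook gets to look at every call...
        return original(ah)         # ...before handing over to the old handler

    vectors[vector] = hooked        # the "set vector" step


install_hook(0x21)
result = vectors[0x21](0x3D)        # a program asks to open a file
print(result, seen)
```

The caller still gets the normal result back, which is why a hook like this is invisible to the program being watched.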

As a quick example of how these are used, let’s look at a simple “Hello World” program:

MS-DOS hello world

As we can see, there are two int calls here. We use 21h (h = hex) as the master syscall number, and we can specify which action we want MS-DOS to perform based on the value of AH:

MS-DOS syscall highlighted

In this case, the program makes one call to print a string, and then a call to exit with a 0 (unset) return code.
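For anyone reading without the slide, here is a small Python sketch that hand-assembles an equivalent program into a COM image. This is my own reconstruction rather than the exact listing from the talk; in particular it sets AL to 0 explicitly via `mov ax, 4C00h`, and the message text is an assumption:

```python
MESSAGE = b"Hello, World!$"   # strings for AH=09h are terminated by '$'
LOAD_OFFSET = 0x100           # COM images are loaded at offset 100h
CODE_LENGTH = 12              # total size of the five instructions below

# The message sits directly after the code, so this is the value DX needs.
msg_offset = LOAD_OFFSET + CODE_LENGTH

code = bytes([
    0xB4, 0x09,                                # mov ah, 09h    (print string)
    0xBA, msg_offset & 0xFF, msg_offset >> 8,  # mov dx, msg
    0xCD, 0x21,                                # int 21h
    0xB8, 0x00, 0x4C,                          # mov ax, 4C00h  (exit, code 0)
    0xCD, 0x21,                                # int 21h
])
assert len(code) == CODE_LENGTH

com_image = code + MESSAGE
print(com_image.hex(" "))
```

Writing `com_image` out to a file with a `.COM` extension should give something a DOS environment can run, though I would treat that as an experiment rather than a promise.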

As previously mentioned, when you call int 21h the CPU looks up where to jump to in the IVT (Interrupt Vector Table). Inside that handler there is often a router-type segment that directs the different major calls around; in the case of int 21h it routes to different functions based on the value of AH. Once we get there, an actual call handler deals with the task at hand, then runs iret to return execution to the main program, often leaving the results of the call behind in registers:

DOS interrupt call loop
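As a rough illustration of the lookup step, here is a Python sketch that reads a vector out of a real-mode memory dump. In real mode each IVT entry is 4 bytes at physical address `vector * 4`, stored as an offset word followed by a segment word. The helper names are mine, and the table contents below are made-up values rather than ones taken from a real MS-DOS install:

```python
import struct


def ivt_entry(memory, vector: int):
    """Return (segment, offset) stored in the real-mode IVT for `vector`."""
    # Little-endian offset word first, then the little-endian segment word.
    offset, segment = struct.unpack_from("<HH", memory, vector * 4)
    return segment, offset


def physical(segment: int, offset: int) -> int:
    """Convert segment:offset into a 20 bit physical address."""
    return ((segment << 4) + offset) & 0xFFFFF


# Synthetic 1 KiB table (256 vectors * 4 bytes) with one made-up entry.
memory = bytearray(1024)
struct.pack_into("<HH", memory, 0x21 * 4, 0x40F8, 0x0019)

seg, off = ivt_entry(memory, 0x21)
print(f"int 21h -> {seg:04X}:{off:04X} (physical {physical(seg, off):05X})")
```

Pointing this at a memory dump from your own VM is one way to find where a handler breakpoint needs to go.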

So, if we want to see all the syscalls a program runs, we can breakpoint the start of the interrupt handler and check what the value of AH is:

Interrupt handler highlighted

We do this because the Interrupt handler is always in a fixed location in MS-DOS (this is way before the era of ASLR and Kernel ASLR) and the program location is not.

Syscall highlight

Once we run it, we can see the calls this sample made. On the screen it only printed out a Goat file notice (a Goat file is a file designed to be infected, like a sacrificial goat), but we can also see that this program is doing more than just printing a string. It’s checking the DOS version (likely for compatibility checks) and then opening, reading and writing data!
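Turning the raw register values into something readable is mostly a lookup table. Here is a minimal Python sketch, assuming the tracer hands us the AX value captured at each breakpoint hit. The `trace` list is invented for illustration, and the table only covers a handful of the int 21h functions:

```python
from collections import Counter

# A small subset of int 21h functions, keyed by the value of AH.
INT21_NAMES = {
    0x09: "print string",
    0x2A: "get date",
    0x2C: "get time",
    0x30: "get DOS version",
    0x3C: "create file",
    0x3D: "open file",
    0x3E: "close file",
    0x3F: "read from file",
    0x40: "write to file",
    0x4C: "exit",
}


def describe(ax: int) -> str:
    """Name the int 21h function selected by the high byte of AX."""
    ah = (ax >> 8) & 0xFF
    return INT21_NAMES.get(ah, f"unknown (AH={ah:02X}h)")


def summarise(ax_values) -> Counter:
    """Count how often each function shows up in a trace."""
    return Counter(describe(ax) for ax in ax_values)


# Made-up trace: AX as seen at the handler breakpoint, in call order.
trace = [0x3000, 0x0900, 0x3D02, 0x3F00, 0x4000, 0x3E00, 0x4C00]
for name, count in summarise(trace).items():
    print(f"{count:3d}  {name}")
```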

Interesting syscalls highlighted

This is interesting! But we would like to know more about what the syscalls in red are doing, since they must have input data in them for things like filenames, and data to write to the files/screen.

For this we need to look at the other registers during the syscall:

16bit registers

Using the “Print String” as a simple example, we can see what the usage looks like:

MS-DOS print string docs

What is DS:DX ? Why are there two registers here, and how do we get the data location out of these two?

For this we need to understand a little more about the 8086 CPU.

real mode memory layout

The 8086 CPU is a 16 bit CPU, but with 20 bits of memory addressing. This means a single 16 bit register can only point into 64KB, which is a problem when the memory space is up to 1MB.

To get around this, we need to understand segment registers:

16bit cpu registers

The 8086 CPU has 4 segment registers that we will need to care about:

CS - Code Segment
DS - Data Segment
SS - Stack Segment
ES - Extra Segment ( In case you need another one to pass around )

There are a whole bunch of other “general purpose” registers too, which save you from using the memory too much, and let you pass along parameters to other functions.

Segment registers work by sliding a 64KB window across the RAM:

segment moving the memory window

This allows a 16 bit CPU to reach the full 20 bit address space: every time the segment value (DS here) goes up by one, the window shifts by 16 bytes.

DS:DX explanation

In the case of this call, DS selects where the window sits, and DX is the pointer inside that 64KB window to where the start of the string is. The string printer will then scan until it finds a $ symbol and then stops. This is similar to other systems that use a null byte instead of a $.
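Put into code, the address calculation is just a shift and an add. The Python sketch below assumes we have a dump of guest memory as a byte buffer; the helper names, the `limit` parameter and the example contents are mine, and it ignores the case where a string wraps around the end of a segment:

```python
def physical_address(segment: int, offset: int) -> int:
    """segment * 16 + offset, wrapped to the 8086's 20 bit address space."""
    return ((segment << 4) + offset) & 0xFFFFF


def read_dollar_string(memory, ds: int, dx: int, limit: int = 256) -> bytes:
    """Read the '$'-terminated string that AH=09h would print from DS:DX."""
    start = physical_address(ds, dx)
    end = memory.find(b"$", start, start + limit)
    if end == -1:                       # no terminator found: cap the read
        end = min(start + limit, len(memory))
    return bytes(memory[start:end])


# Synthetic memory dump with a string placed at 1234:0010.
memory = bytearray(0x20000)
where = physical_address(0x1234, 0x0010)
memory[where:where + 14] = b"Goat file$junk"

print(hex(where), read_dollar_string(memory, 0x1234, 0x0010))

# Many segment:offset pairs name the same byte, since segments overlap.
print(physical_address(0x1234, 0x0010) == physical_address(0x1235, 0x0000))
```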

CPU arch upgrade gif

Not much has changed as the x86 ISA aged. Instead, as the bit size of the CPUs has gone up, the same registers have just gotten wider.

So with that known, we can build a “todo” list for tracing these programs:

tracing checklist

With this setup, we can throw some big computers at the problem for a few hours, and collect up the results!

big computer htop

And after around a CPU core month, we get…

lack of malware activations

Nothing.

That’s disappointing. We burned at least a hamster’s worth of power and got almost no cool activations!

syscall smoking gun

If we look at some of the samples, we see a smoking gun here. A decent chunk of samples are checking for the date or time.

If we take a look at the documentation for these calls, we see that the syscall returns the values to the program in registers:

date or time syscall docs

So we can brute force this! All we need to do is something like this:

sample testing life cycle
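To give a feel for what the brute force has to feed in, here is a Python sketch that generates the register values the "get date" call (AH=2Ah) would return for every day of a year: CX holds the year, DH the month, DL the day and AL the day of the week with 0 meaning Sunday. The choice of 1996 is arbitrary on my part, and how these values actually get injected into the VM is not shown:

```python
from datetime import date, timedelta


def date_registers(day: date) -> dict:
    """Registers that int 21h AH=2Ah would hand back for `day`."""
    return {
        "CX": day.year,
        "DH": day.month,
        "DL": day.day,
        # isoweekday() is 1=Monday..7=Sunday; DOS wants 0=Sunday..6=Saturday.
        "AL": day.isoweekday() % 7,
    }


def every_day(year: int):
    """Yield each calendar day of `year` in order."""
    current = date(year, 1, 1)
    while current.year == year:
        yield current
        current += timedelta(days=1)


candidates = [date_registers(d) for d in every_day(1996)]
print(len(candidates), candidates[0])
```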

But there is one problem with this method.

red highlighted sample life cycle

Each sample test takes around 15 seconds, since it uses a full QEMU emulation process and the program can take up to 15 seconds to fully run in the VM. Since DOS does not have power saving features, when DOS is idle it sits in a busy loop, so every VM keeps a CPU core fully busy for the whole run.
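A quick back-of-the-envelope calculation shows why this hurts. The 15 seconds comes from above; trying one run per day of a leap year is my assumption about what a naive brute force would look like, and it does not even touch time-of-day checks:

```python
SECONDS_PER_RUN = 15      # one full QEMU run of one sample, from the text
DATES_PER_SAMPLE = 366    # assumption: one run per day of a leap year
SECONDS_PER_DAY = 86_400


def brute_force_cpu_days(samples: int) -> float:
    """CPU-days needed to try every date against `samples` samples."""
    total_seconds = samples * DATES_PER_SAMPLE * SECONDS_PER_RUN
    return total_seconds / SECONDS_PER_DAY


print(f"{brute_force_cpu_days(1):.3f} CPU-days for one sample")
print(f"{brute_force_cpu_days(1000):.1f} CPU-days for a thousand samples")
```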

So we could look at this problem in a different way, by looking at what code would be run after a date/time request.

Since our tracer is placed in the Interrupt Handler, we do not know out of the box where the program is:

call location map

For that we need to look at the stack, where there is the CS and IP registers waiting for us!

registers x86

Once we grab these two off the stack, we can use them to read the code at the return address (the code that will run right after the syscall), making our checklist look like this:

new tracing checklist
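Here is a rough Python sketch of that extra step, assuming the tracer can give us SS, SP and a dump of guest memory at the moment the handler breakpoint fires. An int instruction pushes FLAGS, then CS, then IP, so IP is sitting on top of the stack at handler entry. The function names and the example memory contents are made up for illustration:

```python
import struct


def physical_address(segment: int, offset: int) -> int:
    """segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def interrupt_return_address(memory, ss: int, sp: int):
    """Return the (CS, IP) pair that iret will go back to."""
    # Only valid at handler entry, before the handler pushes anything itself.
    ip, cs, _flags = struct.unpack_from("<HHH", memory, physical_address(ss, sp))
    return cs, ip


def code_after_call(memory, ss: int, sp: int, count: int = 16) -> bytes:
    """Grab the bytes the program will execute once the syscall returns."""
    cs, ip = interrupt_return_address(memory, ss, sp)
    start = physical_address(cs, ip)
    return bytes(memory[start:start + count])


# Synthetic machine state: stack at 1000:FFF8, caller returns to 0800:0105.
memory = bytearray(0x20000)
struct.pack_into("<HHH", memory, physical_address(0x1000, 0xFFF8),
                 0x0105, 0x0800, 0x0202)
memory[0x8105:0x8108] = b"\x80\xFA\x1E"

print(interrupt_return_address(memory, 0x1000, 0xFFF8))
print(code_after_call(memory, 0x1000, 0xFFF8, count=3).hex(" "))
```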

Once we have done that and re-run the testing on the dataset, we get to see what some of the code at the return address looks like!

x86 date code

Here is a sample of one. Here we can see a comparison is being done on DL and 0x1e.

date or time syscall docs

If we look back to our documentation, we can see that DL is the day of the month, meaning we can parse the top 3 opcodes as the following:

translated code
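A crude way to pull these checks out automatically is to pattern match the compare instructions against the registers the date call fills in. The Python sketch below is a naive byte scan, not a disassembler: it only knows three encodings (`80 FA ib` for `cmp dl, imm8`, `80 FE ib` for `cmp dh, imm8` and `81 F9 iw` for `cmp cx, imm16`), it can produce false positives on data bytes, and the `sample` bytes are ones I made up rather than a real capture:

```python
# Two-byte instruction prefix -> (register, meaning after AH=2Ah, operand width)
DATE_COMPARES = {
    b"\x80\xFA": ("DL", "day of month", 1),  # cmp dl, imm8
    b"\x80\xFE": ("DH", "month", 1),         # cmp dh, imm8
    b"\x81\xF9": ("CX", "year", 2),          # cmp cx, imm16
}


def find_date_compares(code: bytes):
    """List (offset, register, meaning, value) for each recognised compare."""
    hits = []
    for i in range(len(code) - 1):
        entry = DATE_COMPARES.get(code[i:i + 2])
        if entry is None:
            continue
        register, meaning, width = entry
        operand = code[i + 2:i + 2 + width]
        if len(operand) < width:    # instruction runs off the end of the buffer
            continue
        hits.append((i, register, meaning, int.from_bytes(operand, "little")))
    return hits


# Made-up bytes: cmp dl, 1Eh / jne +5 / cmp dh, 04h / je +10h
sample = bytes.fromhex("80 fa 1e 75 05 80 fe 04 74 10")
for hit in find_date_compares(sample):
    print(hit)
```

For the made-up bytes above this reports a check against day 30 and a check against month 4.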

We could go and manually review all of these, but there are a lot of these samples that check for the time, around 4700:

bar chart of malware types

So instead, we need to do something different. We need to write something… We need to write…

the worlds worst x86 emulator

The world’s worst x86 emulator, dubbed BenX86, is an emulator that is designed exactly for our needs, and not much else:

benx86 features

But it does have some advantages in its speed:

benx86 advantages
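To show the general idea (and only the idea, this is not BenX86's code), here is a toy Python emulator that understands exactly three instructions: `cmp dl, imm8`, `je rel8` and `jne rel8`. It runs the same snippet once per possible day of the month and groups the days by the path execution took, so a date that triggers different behaviour shows up as its own group. The `sample` bytes are invented for the example:

```python
from collections import defaultdict


def run(code: bytes, dl: int, max_steps: int = 64):
    """Execute the tiny supported subset and return the offsets visited."""
    ip, zero_flag, path = 0, False, []
    for _ in range(max_steps):
        if not 0 <= ip < len(code):
            break
        path.append(ip)
        opcode = code[ip]
        if code[ip:ip + 2] == b"\x80\xFA" and ip + 2 < len(code):
            zero_flag = (dl == code[ip + 2])            # cmp dl, imm8
            ip += 3
        elif opcode in (0x74, 0x75) and ip + 1 < len(code):
            rel = int.from_bytes(code[ip + 1:ip + 2], "little", signed=True)
            taken = zero_flag if opcode == 0x74 else not zero_flag
            ip += 2 + (rel if taken else 0)             # je / jne rel8
        else:
            break       # anything else is outside this toy's instruction set
    return tuple(path)


def paths_by_day(code: bytes) -> dict:
    """Group every day of the month by the execution path it produces."""
    groups = defaultdict(list)
    for day in range(1, 32):
        groups[run(code, day)].append(day)
    return dict(groups)


# Made-up bytes: cmp dl, 1Eh / jne +2 / (two payload bytes) / ret
sample = bytes.fromhex("80 fa 1e 75 02 cd 20 c3")
for path, days in paths_by_day(sample).items():
    print(path, days)
```

Skipping the VM entirely and only emulating the handful of instructions after the date call is what makes this kind of approach fast enough to brute force.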

final counts

We added 10k different execution tests based on paths we found with bruteforce using BenX86. So I’ll finish up with some of my favorite discoveries that are time activated:

happy new year malware

This sample activates on New Year’s Day and hangs your system after displaying a greeting. This might be a good thing if you are stuck in the office on New Year’s Day, or a bad thing if you really needed to get something done that day.

self removing malware

This sample was very surprising to me. It activates at the start of 1995, informs the user of all of the files that it has infected, then removes the infection (by removing the jump at the start), and does nothing else. For some reason it does say you should buy McAfee, though; clearly this message didn’t age well.

hate icon malware

This one frankly really confuses me. On the 8th of November of any year, it will turn all 0’s on the system into tiny “hate” glyphs. If you know why you would do this, let me know…

eating drive c malware

This one might be my nightmare output of any program: upon start, it will tell you that it failed to “eat” your primary drive. This would be incredibly alarming to see out of the blue.

navy seal copy pasta

Finishing off, we have what I’m pretty sure is the Navy Seal Copypasta version of DOS malware. Unsure why this author dislikes Aladdin, but whatever, you do you, person.

If you are interested in the code that ran behind this, I have released my tooling on GitHub, with no guarantees. If you want to run this code yourself, you will need to do some work to ensure it works with your MS-DOS install (correcting the handler breakpoint).

However, if you are just looking to see what I saw when working on this project, I have archived the web UI here: https://dosv.benjojo.co.uk/

If this kind of obscurity is the thing you dig, then you may like the rest of my blog. If you want to stay up to date with the new stuff I post, you should either use my site’s RSS or follow me on Twitter.

Until next time!
