Since the 2013 version, Microsoft Outlook users have been able to turn on automatic reminders about messages in which an attachment may have been forgotten.
When sending a message, the following warning can be shown:
With a bit of investigation (playing with variations of texts), you find out that:
- only English is supported (perhaps other languages will be added later);
- if keywords are misspelled, the warning will not activate (to be exact, Microsoft Outlook only accounts for one potential error, “ATACHMENT”);
- the system does not treat every word that you would consider a keyword as one;
- the algorithm is sophisticated, but far from ideal.
Jumping ahead, let’s say right away that the keywords and the algorithm are hard-coded, and you cannot add keywords or exceptions. All you can do is turn the entire function on or off. With that in mind, some questions arise: why do “see picture” and “see gif” activate the warning, but “see photo” and “see pdf” do not? Why does “file not attached” keep the system quiet, but “file attached not” trigger a reaction? How does it work?
How it’s built
The algorithm is implemented in the MSFAD.DLL library (we examined a version of this file dated November 1, 2013), which is located in the Microsoft Office folder. This library contains only one function, HasAttachments, which receives the subject and body of the message. In response, the function returns a decision: show the warning to the user or not. The size of the library is more than 300 kilobytes. That is a bit big for simply searching for one string in a given text. There was a time when huge programs could fit in 300 kilobytes. Can it really be that all the program is doing is checking the message text for keywords?
But that really is the case. Eighty-six kilobytes of the library are occupied by data directly connected to analysis of the text. You cannot see the keywords in the body of the library, even with a hexadecimal editor. The vocabulary list is stored in compressed form and contains about 650 keywords. So if the words, even in encoded form, take up a little more than 5 kilobytes, what is in the other 80 kilobytes?
The answer can be derived from the names of the functions that are found in the library code: ChunkGrammarRule, ChunkGrammarLevel, CompoundAnalyzer, StringAnalyzer, TemplateLexiconBasedStringAnalyzer, FlatLexiconStringAnalyzer, MorphLayerStringAnalyzer, ScriptStringAnalyzer, PartOfSpeechDisambiguator. The 80 kilobytes turn out to be data for a natural language processing system!
There it is! Almost at the level of artificial intelligence! But is it really appropriate for this task?
How others do it
Reminders about forgotten attachments have existed for about 15 years in all manner of plugins for Microsoft Outlook. For example, in the Swiss army-knife program known as MAPILab Toolbox for Outlook, there is an “Attachments Forget” component, the settings for which are shown in the image.
It works quite simply. If a given string is found in a message – you get a warning. There is no natural language analysis, and ‘fooling’ it is very simple.
Nonetheless, despite this simplicity it works, and quite effectively. Additionally, it can learn your message-writing style and operate in whatever languages you use. If you are simply sending an invoice by email, you can teach MAPILab Toolbox to react to the phrase “see invoice” in 2 clicks. As for all of the sophisticated natural-language analysis in Microsoft Outlook 2013, it still can’t be taught to react to “see invoice” and it will never learn from your message-sending habits. It has no self-learning capability.
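A static keyword check of this kind fits in a few lines of Python. This is only a rough sketch of the general idea, not MAPILab Toolbox's actual code: the function name, its parameters and the starting phrase list are all made up for illustration.

```python
def needs_attachment_warning(body, has_attachment, phrases):
    """Return True if the text mentions an attachment but none is present."""
    if has_attachment:
        return False
    text = body.lower()
    # Plain substring search: no syntax, no semantics.
    return any(phrase.lower() in text for phrase in phrases)


# Hypothetical starting list; the user extends it over time.
phrases = ["attached", "attachment", "enclosed"]

print(needs_attachment_warning("Please see invoice.", False, phrases))  # False

phrases.append("see invoice")  # "teaching" the checker a new phrase
print(needs_attachment_warning("Please see invoice.", False, phrases))  # True
```

Because the phrase list is ordinary user data, adding a phrase in another language or in your own jargon costs nothing. The price is that a check like this is trivially fooled.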
Looking deeper under the hood
We were initially very intrigued and impressed by the new capabilities in Microsoft Outlook, but were left rather disappointed after giving them a practical work-out.
There are some ‘power’ words which, if they appear in an otherwise empty message body, trigger the warning. There are nine of them: ATTACHED, ATTACHMENT, ATTACHMENTS, FYI, ATTACHING, REATTACHING, ENCL, ENCLOSURE, and ENCLOSURES. These words can also trigger a warning in short phrases. For example, the phrase “WHUSGD YODJHHW IS ATTACHED” triggers the warning. This is, however, not much different from the algorithm in MAPILab Toolbox, which knows 10 words and can learn dozens of combinations.
Let’s return to the natural-language analysis. The phrase “HE WAS VERY ATTACHED TO THE OLD LADY” does not trigger the warning, whereas “THEY FOUND A FIRE IN THE ATTACHED GARAGE OF A SINGLE-FAMILY HOME” does. Because the system has a limited vocabulary, these phrases look to it like “HE WAS VERY ATTACHED TO THE * *” and “THEY FOUND A * IN THE ATTACHED * OF A * *” (the asterisks represent words unknown to the system). The analyzer, it seems, can tell the difference between “very attached” and “in the attached”. Here we see that the system deals with syntax reasonably well, but that semantics are beyond its capabilities. A working vocabulary of 650 words is not enough.
Now let’s get away from words associated with ATTACHMENT, and let’s see how the system does. The somewhat incorrect phrase “I SEND YOU THE FILE” does not trigger the warning, but “I AM SENDING YOU THE FILE” does. It should be noted that the system holds the text to the rules of English grammar quite strictly: if an article is missing, for example, even an obvious phrase will often go unnoticed.
Many of the words in the system are processed by the same semantic code. For example, it is the same for the words CONTRACT, DOCUMENT, EXCEL, FILE, FORM, PHOTO, RESUME, SPREADSHEET, WORKBOOK and several others. So changing the word FILE in “I SEND YOU THE FILE” to any of these words will not have any effect. The list of words is quite limited, and we can easily find variations which will not trigger the warning. The phrases “I AM SENDING YOU THE BILL” and “I AM SENDING YOU THE NON-DISCLOSURE AGREEMENT” both go without warnings.
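The masking effect of a small vocabulary is easy to reproduce. The sketch below is not the code from MSFAD.DLL. It assumes a tiny stand-in lexicon made up only of the words that stay unmasked in the two examples above, and the function name is hypothetical.

```python
# Assumed subset of the lexicon: just the words left unmasked in the
# two example phrases. The real list holds about 650 entries.
KNOWN_WORDS = {"HE", "WAS", "VERY", "ATTACHED", "TO", "THE",
               "THEY", "FOUND", "A", "IN", "OF"}


def mask_unknown(phrase, known=KNOWN_WORDS):
    """Replace every word outside the lexicon with an asterisk."""
    return " ".join(w if w in known else "*" for w in phrase.upper().split())


print(mask_unknown("He was very attached to the old lady"))
# HE WAS VERY ATTACHED TO THE * *
print(mask_unknown("They found a fire in the attached garage of a single-family home"))
# THEY FOUND A * IN THE ATTACHED * OF A * *
```

Any later analysis has only the masked text to work with, so it can judge how ATTACHED sits among the known words but not what the unknown words mean.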
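One way to picture words that share a semantic code is a lookup table in which several words map to the same value. The sketch below is an assumption about the general shape of the data, not the library's actual format. The code value is invented, the function name is hypothetical, and the word list contains only the nine words named above, not the "several others".

```python
# Hypothetical code value; the real values are arbitrary identifiers.
DOCUMENT_CODE = 1

# Only the nine words named in the text; the real group is larger.
SEMANTIC_CODES = {
    word: DOCUMENT_CODE
    for word in ("CONTRACT", "DOCUMENT", "EXCEL", "FILE", "FORM",
                 "PHOTO", "RESUME", "SPREADSHEET", "WORKBOOK")
}


def semantic_code(word):
    """Return the word's code, or None when the word is not in the table."""
    return SEMANTIC_CODES.get(word.upper())


print(semantic_code("file") == semantic_code("spreadsheet"))  # True
print(semantic_code("bill"))                                   # None
```

With a table like this, swapping FILE for SPREADSHEET changes nothing for the analyzer, while a word such as BILL that is absent from the table falls through as unknown.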
The vocabulary list
In the image below about one fifth of the words in the system’s vocabulary are listed, sorted by semantic code (CODE, its value being arbitrary). Here we took the beginning, middle and end of the full list:
The list, in our view, seems small considering that it is being applied to a massive set of emails in the real world. Half of the words are there for analysis of syntax. The other half consists of words closely associated with attachments in emails. Only the most popular such words made it into the list. Words such as HOME, GIRL, CAR, WORLD, and PEACE did not make the cut. So “ATTACHED GARAGE” and “ATTACHED STATEMENT” look exactly the same to the system: a combination of a known first word and an unknown second one.
The system produces a large number of false positive results (sending a warning when unnecessary), as well as false negatives (not reacting to phrases such as “THIS EMAIL CONTAINS AN IMPORTANT ATTACHMENT”).
Compared with a primitive keyword search, the algorithm produces quite similar results. Why then did Microsoft choose such a complicated path, with a thousand times more code, to achieve mediocre results?
Is Google to blame for this?
An attachment reminder appeared in Gmail in 2010 (though it came out in Gmail Labs 2 years before that). Hotmail (now Outlook.com) got this functionality a year later. The competition between these two giants shows up even in little things. If Google has done something “the simple way”, then Microsoft will come out with a technological marvel and a condescending smile.
In 2009, researchers at a German university published an article, “Learning to Recognize Missing E-mail Attachments”, presenting data that showed the superiority of self-learning algorithms over the static keyword method. That very article might have been what planted the idea at Microsoft of creating a ‘smart’ attachment reminder. Add to that the huge number of messages Microsoft processes and the potential usefulness of the resulting technology in Outlook.com, Microsoft Outlook, and probably even mobile apps, and you may have the explanation.
Here is how Attachment Reminders reacted to several test phrases in Microsoft Outlook 2013 and in popular online services. (Yes – warning displayed, green – correct response):
This mini-test cannot be used to support an absolute claim about the algorithm, but it is adequate to conjecture that Gmail uses a primitive static keyword method. It activates without fail when the phrases “I HAVE ATTACHED” and “IS ATTACHED” appear, without regard to semantics or syntax. Outlook.com also works by this method, but with a larger number of key phrases than Gmail. It is clear that the advanced technology in Microsoft Outlook 2013 has not yet been implemented in Outlook.com.
Only Microsoft Outlook 2013 demonstrates an attempt to analyze the text, but it is not always successful in doing that. There is no clear leader in the table. An increase in the size of the vocabulary list (by an order of magnitude) would probably result in higher quality results from the algorithm.
In practical application, the static keyword method with user-adjustable settings seems to yield a more dependable system, since electronic communications often include abbreviations and expressions, professional jargon, and other nuances that can make full text analysis difficult.
In any case, Microsoft has created a cool and unusual thing, which was very interesting to investigate. We’ll see what it becomes in a few years!
We also looked over the July 16, 2014 version of MSFAD.DLL, which was included in the KB2883094 update (the latest available at the time of writing). In the new version, the vocabulary list and the data for analysis of syntax are unchanged, and the algorithm is the same; there were only bug fixes. So, apparently, Microsoft has not been actively developing the attachment reminder over the last few months, and a major update does not seem to be on the horizon.