Email Retention – automated organization

Like most people who work with computers on a regular basis, I receive a high volume of email every day. I would like software to classify and organize my email automatically, so that important and unique information is retained and easy to find later. I spend a significant amount of time doing this by hand, and it seems like a task that software could do more efficiently and effectively.

Most email accounts have a storage limit. Even when there is no technical limit, users want to keep their mailbox pruned so that it is not cluttered with messages that waste space. The available space should be used efficiently, and enough room should always be left free so that inbound mail is not rejected because the quota has been exceeded.

It is helpful to classify email by the type of content it represents, so that an appropriate retention policy can be applied to each type. Classification may be done by organizing email into folders and by tagging messages with keywords or categories. An email’s classifiers can then be used to determine which retention policies apply to it.
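As a rough illustration of how classifiers could drive retention, here is a minimal Python sketch. The category names, the retention periods, and the rule that the most conservative matching policy wins are all my own assumptions for the example, not features of any existing mail client.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class RetentionPolicy:
    name: str
    keep_for: Optional[timedelta]  # None means "keep indefinitely"
    archive_offline: bool


# Hypothetical policy table keyed by classifier (a folder name or a tag).
POLICIES = {
    "notification": RetentionPolicy("notification", timedelta(days=7), False),
    "newsletter": RetentionPolicy("newsletter", timedelta(days=30), False),
    "project": RetentionPolicy("project", None, True),
}
DEFAULT_POLICY = RetentionPolicy("default", timedelta(days=365), True)


def select_policy(classifiers: Iterable[str]) -> RetentionPolicy:
    """Pick the most conservative policy among those matching the classifiers."""
    matches = [POLICIES[c] for c in classifiers if c in POLICIES]
    if not matches:
        return DEFAULT_POLICY

    # "Keep indefinitely" outranks any finite period; otherwise the longest wins.
    def sort_key(policy: RetentionPolicy):
        return (policy.keep_for is None, policy.keep_for or timedelta(0))

    return max(matches, key=sort_key)
```

With this table, a message tagged both "notification" and "project" is kept indefinitely, because losing a project message is assumed to cost more than storing a stray notification.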

Email can be moved to offline repositories for archival purposes, in order to stay within the storage limit of the account on the email server. This allows email to be retained despite server storage limits, so that valuable information remains searchable and accessible. Often the usefulness of a message does not become apparent until the need for it arises later.
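One simple form of offline repository is a local mbox file. The sketch below uses Python's standard mailbox module to append messages to such a file. The function name and the idea of passing in already-fetched messages are assumptions made for the example; fetching from the server and deleting the server copies afterwards are left out.

```python
import mailbox
from email.message import EmailMessage
from typing import Iterable


def archive_messages(messages: Iterable[EmailMessage], archive_path: str) -> int:
    """Append messages to a local mbox file and return how many were written."""
    box = mailbox.mbox(archive_path)  # the file is created if it does not exist
    written = 0
    try:
        for message in messages:
            box.add(message)
            written += 1
        box.flush()  # make sure everything is on disk before reporting success
    finally:
        box.close()
    return written
```

A real tool would remove the server copy only after the archive has been written and verified, so that a failed write never loses mail.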

Often when a user replies to an email, the entire content of the previous email in the thread is embedded as a quote. Over the course of a thread, the same history is duplicated in quotes again and again. When determining which messages to retain, it is usually sufficient to keep the latest message in each branch of the thread, provided that it quotes the earlier messages in full.
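A minimal sketch of that check follows, assuming plain-text bodies where quoted lines begin with ">" and a thread that is already sorted oldest first. Both are simplifications of mine; HTML mail, top-posting styles without quote markers, and attachments (which replies normally do not carry) would need separate handling.

```python
from typing import List, Set, Tuple


def normalize(line: str) -> str:
    """Strip quote markers and collapse whitespace so quoted text compares equal."""
    return " ".join(line.lstrip("> \t").split())


def own_lines(body: str) -> Set[str]:
    """Non-blank lines the sender wrote, i.e. lines that are not quoted."""
    return {
        normalize(line)
        for line in body.splitlines()
        if line.strip() and not line.lstrip().startswith(">")
    }


def quoted_lines(body: str) -> Set[str]:
    """Non-blank lines that appear inside quotes, at any nesting depth."""
    quoted = {
        normalize(line)
        for line in body.splitlines()
        if line.lstrip().startswith(">")
    }
    quoted.discard("")
    return quoted


def redundant_messages(thread: List[Tuple[str, str]]) -> List[str]:
    """Return ids of messages whose own text is fully quoted by a later message.

    `thread` is a list of (message_id, body) pairs in chronological order.
    """
    redundant = []
    for index, (message_id, body) in enumerate(thread):
        mine = own_lines(body)
        later_bodies = [later for _, later in thread[index + 1:]]
        if mine and any(mine <= quoted_lines(later) for later in later_bodies):
            redundant.append(message_id)
    return redundant
```

Messages reported as redundant are only candidates for deletion. A message that is quoted in part stays off the list, which errs on the side of keeping too much.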

It is also helpful to prioritize messages for retention (or deletion) based on topics of significance or recognizable patterns. We have seen how Bayesian spam filtering successfully distinguishes spam from non-spam messages. Using similar techniques, it should be possible to distinguish email worth retaining from email that is not.
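To show what "similar techniques" might look like, here is a small naive Bayes text classifier with add-one smoothing, written in plain Python. It is a toy sketch under my own assumptions: the labels "keep" and "discard", the simple word tokenizer, and the class name are illustrative, and a usable tool would need far more training data and better features (sender, headers, thread position).

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional, Set


class RetentionClassifier:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {}  # label -> word frequencies
        self.doc_counts: Counter = Counter()       # label -> number of messages
        self.vocabulary: Set[str] = set()

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text: str, label: str) -> None:
        tokens = self.tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text: str) -> Optional[str]:
        """Return the most probable label, or None if nothing has been trained."""
        tokens = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods for each word.
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The classifier would be trained from the user's own behavior: messages the user files or archives become "keep" examples, and messages the user deletes become "discard" examples.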

Certain categories of email contain time-sensitive information. Notifications and reminders should expire fairly quickly and then be deleted. A list of this month’s events should expire at the end of the month. It should be possible to recognize these types of content, classify them, and apply expiration policies to them automatically.
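Once a message has a category, the expiration rule itself is easy to express. The sketch below assumes three hypothetical categories and arbitrary lifetimes (one day for reminders, three days for notifications, end of the calendar month for event lists); the hard part, recognizing the category, is not shown.

```python
import calendar
from datetime import datetime, timedelta
from typing import Callable, Dict, Optional


def end_of_month(received: datetime) -> datetime:
    """Last second of the calendar month in which the message was received."""
    last_day = calendar.monthrange(received.year, received.month)[1]
    return received.replace(day=last_day, hour=23, minute=59, second=59, microsecond=0)


# Hypothetical categories and lifetimes, chosen only for illustration.
EXPIRY_RULES: Dict[str, Callable[[datetime], datetime]] = {
    "reminder": lambda received: received + timedelta(days=1),
    "notification": lambda received: received + timedelta(days=3),
    "monthly-events": end_of_month,
}


def expires_at(category: str, received: datetime) -> Optional[datetime]:
    """Return the expiry time for a message, or None if it never expires."""
    rule = EXPIRY_RULES.get(category)
    return rule(received) if rule else None


def is_expired(category: str, received: datetime, now: datetime) -> bool:
    deadline = expires_at(category, received)
    return deadline is not None and now > deadline
```

Categories without a rule never expire, so an unrecognized message is kept by default.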

These are some of the more obvious requirements that an email classification and retention component should satisfy. An add-on to popular email clients like Thunderbird and Outlook would be a valuable time-saver for many users. I am hopeful that someone will take up the challenge of developing such a useful tool.