Blog Fingerprinting: Identifying Anonymous Posts Written by an Author of Interest Using Word and Character Frequency Analysis
"Internet blogs are an easily accessible means of global communications. Monitoring blogs for criminal and terrorist activity is a serious challenge, due to blogs' anonymous nature and the sheer volume of data. The intelligence community is often faced with more information than it can process. The need exists to develop methods for processing the massive amounts of data this medium presents, without a significant increase in manpower. An automated tool capable of identifying posts written by an individual, given a sample of his writing, would allow law enforcement and intelligence agencies to gather evidence that would otherwise be overlooked due to manpower and time constraints. This research focuses on identifying blog posts written by a particular author, when we do not have a model of every potential author. Previous research either builds a distinct model for every possible author, or limits itself to large documents. Neither approach is appropriate for processing blog posts. Blog posts tend to be short documents, and building a distinct model of each author is unreasonable if you are looking for one author among millions. We address this problem by combining sample posts by other authors to create a model of an 'average author.'"
Naval Postgraduate School, Dudley Knox Library: http://www.nps.edu/Library/index.aspx
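
The 'average author' idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the features or the scoring rule, so everything here is an assumption made for illustration: character trigrams as features, add-one smoothing, and a per-trigram log-likelihood ratio as the score. The function names (`char_ngrams`, `build_model`, `score_post`) are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_model(posts, n=3):
    """Pool n-gram counts from several posts into one frequency model.

    Called on the author of interest's samples, this gives the author model.
    Called on posts by many other authors, it gives an 'average author'.
    """
    counts = Counter()
    for post in posts:
        counts.update(char_ngrams(post, n))
    return counts


def score_post(post, author_model, average_model, n=3, alpha=1.0):
    """Mean log-likelihood ratio per n-gram (assumed scoring rule).

    A score above 0 means the post looks more like the author of interest
    than like the pooled 'average author'; below 0 means the opposite.
    """
    grams = char_ngrams(post, n)
    if not grams:
        return 0.0
    # Shared vocabulary so both models are smoothed over the same events.
    vocab_size = len(set(author_model) | set(average_model) | set(grams))
    author_total = sum(author_model.values()) + alpha * vocab_size
    average_total = sum(average_model.values()) + alpha * vocab_size
    ratio = 0.0
    for g in grams:
        p_author = (author_model[g] + alpha) / author_total
        p_average = (average_model[g] + alpha) / average_total
        ratio += math.log(p_author) - math.log(p_average)
    return ratio / len(grams)


# Toy data, invented purely to exercise the functions.
author_model = build_model(["zzzz zzzz zzzz", "zzzz zzzz"])
average_model = build_model(["aaaa aaaa aaaa", "bbbb bbbb"])
print(score_post("zzzz zzzz", author_model, average_model) > 0)
print(score_post("aaaa bbbb", author_model, average_model) < 0)
```

Only two models are ever built, regardless of how many candidate authors exist, which is the property the abstract emphasizes. A real system would need a decision threshold tuned on held-out posts, and short posts give few n-grams, so scores on them are noisy.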