Mining online diaries for blogger identification

Mohtasseb, Haytham and Ahmed, Amr (2009) Mining online diaries for blogger identification. In: The 2009 International Conference of Data Mining and Knowledge Engineering - The World Congress on Engineering, 1 - 3 July, 2009, London, UK.

Item Type: Conference or Workshop contribution (Paper)
Item Status: Live Archive

Abstract

In this paper, we present an investigation of authorship identification on personal blogs or diaries, which differ from other types of text such as essays, emails, or articles in their text properties. The investigation uses a couple of intuitive feature sets and studies various parameters that affect identification performance.

Many studies have addressed the problem of authorship identification in manually collected corpora, but only a few have used real data from existing blogs. The complexity of the language model in personal blogs motivates the task of identifying the corresponding author. The main contribution of this work is at least threefold. First, we use the LIWC and MRC feature sets together, both developed with a psychology background, for the first time for authorship identification on personal blogs. Second, we analyze the effect of various parameters and feature sets on identification performance. These parameters include the number of authors in the data corpus, the post size (word count), and the number of posts per author.

Finally, we study applying authorship identification to a limited set of users who share common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in the literature for such a task, especially in the domain of personal blogs.

The results and evaluation show that the features used are compact, while their performance is highly comparable with that of other, larger feature sets. The analysis also confirmed the most effective parameters, their ranges in the data corpus, and the usefulness of the common-users classifier in improving performance on the author identification task.
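
As a rough illustration of identifying an author from a compact, category-based feature set, the following minimal Python sketch may help. It rests on several assumptions: the `CATEGORIES` lexicon is a tiny hypothetical stand-in and not the LIWC or MRC feature sets, the nearest-centroid classifier is swapped in for simplicity because the abstract does not name the classifier used in the paper, and the function names `features`, `train`, and `identify` are invented for this example.

```python
import math

# Hypothetical stand-in lexicon; NOT the actual LIWC or MRC categories.
CATEGORIES = {
    "first_person": {"i", "me", "my", "mine", "myself"},
    "negation": {"no", "not", "never"},
    "social": {"we", "us", "our", "friend", "friends", "family"},
}


def features(post):
    """Map a post to a compact vector: one relative frequency per category."""
    words = [w.strip(".,!?;:\"'").lower() for w in post.split()]
    words = [w for w in words if w]
    total = max(len(words), 1)  # avoid division by zero on empty posts
    return [sum(w in vocab for w in words) / total
            for vocab in CATEGORIES.values()]


def train(posts_by_author):
    """Average each author's post vectors into a single centroid."""
    centroids = {}
    for author, posts in posts_by_author.items():
        vectors = [features(p) for p in posts]
        centroids[author] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return centroids


def identify(post, centroids):
    """Return the author whose centroid is nearest (Euclidean distance)."""
    vec = features(post)
    return min(centroids, key=lambda a: math.dist(vec, centroids[a]))
```

In a sketch like this, the parameters named in the abstract correspond to the number of keys in `posts_by_author` (number of authors), the length of each post string (post size), and the length of each author's list (posts per author), so their effect could be explored by varying the corpus passed to `train`.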

Keywords: Data Mining, Authorship Identification, Machine Learning, Text Mining, Information Retrieval, Psycholinguistic
Subjects: G Mathematical and Computer Sciences > G700 Artificial Intelligence
G Mathematical and Computer Sciences > G760 Machine Learning
G Mathematical and Computer Sciences > G400 Computer Science
G Mathematical and Computer Sciences > G720 Knowledge Representation
Divisions: College of Science > School of Computer Science
ID Code: 1857
Deposited On: 21 Apr 2009 13:32
