More blogging features for author identification

Mohtasseb, Haytham and Ahmed, Amr (2009) More blogging features for author identification. In: The 2009 International Conference on Knowledge Discovery (ICKD'09), 2009, Manila.

Documents
More Blogging Features for Author Identification
This paper presents a novel improvement in the field of authorship identification in personal blogs. The improvement in authorship identification, in our work, is by utilizing a hybrid collection of linguistic features that best capture the style of users in diaries blogs. The features sets contain LIWC with its psychology background, a collection of syntactic features & part-of-speech (POS), and the misspelling errors features.
[img]
[Download]
[img]
Preview
PDF
ickd_paper.pdf

276kB

Abstract

In this paper we present a novel improvement in the field of authorship identification in personal blogs. The improvement in authorship identification, in our work, is by utilizing a hybrid collection of linguistic features that best capture the style of users in diaries blogs. The features sets contain LIWC with its psychology background, a collection of syntactic features & part-of-speech (POS), and the misspelling errors features.

Furthermore, we analyze the contribution of each feature set on the final result and compare the outcome of using different combination from the selected feature sets. Our new categorization of misspelling words which are mapped into numerical features, are noticeably enhancing the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification such as the author numbers, words number in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with other much larger feature sets.

Item Type:Conference or Workshop Item (Paper)
Additional Information:In this paper we present a novel improvement in the field of authorship identification in personal blogs. The improvement in authorship identification, in our work, is by utilizing a hybrid collection of linguistic features that best capture the style of users in diaries blogs. The features sets contain LIWC with its psychology background, a collection of syntactic features & part-of-speech (POS), and the misspelling errors features. Furthermore, we analyze the contribution of each feature set on the final result and compare the outcome of using different combination from the selected feature sets. Our new categorization of misspelling words which are mapped into numerical features, are noticeably enhancing the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification such as the author numbers, words number in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with other much larger feature sets.
Keywords:Authorship Identification, Machine Learning, Text Mining, Psycholinguistic, Blogosphere Text Analysis, Computational Stylistics
Subjects:G Mathematical and Computer Sciences > G700 Artificial Intelligence
G Mathematical and Computer Sciences > G760 Machine Learning
G Mathematical and Computer Sciences > G720 Knowledge Representation
Divisions:College of Science > School of Computer Science
ID Code:1862
Deposited By:INVALID USER
Deposited On:30 Apr 2009 11:46
Last Modified:13 Mar 2013 08:32

Repository Staff Only: item control page