Mohtasseb, Haytham and Ahmed, Amr (2009) More blogging features for author identification. In: The 2009 International Conference on Knowledge Discovery (ICKD'09), 2009, Manila.
Documents: PDF, ickd_paper.pdf (276kB)
Item Type: | Conference or Workshop contribution (Paper) |
---|---|
Item Status: | Live Archive |
Abstract
In this paper we present a novel improvement in the field of authorship identification in personal blogs. The improvement comes from using a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background, a collection of syntactic and part-of-speech (POS) features, and misspelling-error features.
Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of using different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped into numerical features, noticeably enhances the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.
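As a rough illustration of the idea in the abstract, the sketch below maps misspelled words to numerical features and concatenates them with a small set of syntactic features, so that different feature-set combinations can be compared. It is not the paper's method: the error categories (`repeated_letters`, `transposition`, `other`), the tiny `KNOWN_WORDS` vocabulary, the `FUNCTION_WORDS` list, and all function names are hypothetical, and the LIWC and POS feature sets are omitted because they depend on external resources.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary; a real system would use a full dictionary.
KNOWN_WORDS = {"i", "went", "to", "the", "shop", "and", "it", "was", "good"}
# Hypothetical function-word list standing in for a syntactic feature set.
FUNCTION_WORDS = ("i", "the", "and", "to", "it")
# Hypothetical misspelling categories (the abstract does not list the paper's own).
CATEGORIES = ("repeated_letters", "transposition", "other")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def misspelling_category(word):
    """Assign an out-of-vocabulary word to one of the assumed categories."""
    if re.search(r"(.)\1\1", word):
        return "repeated_letters"  # e.g. "sooo"
    if any(sorted(word) == sorted(known) for known in KNOWN_WORDS):
        return "transposition"  # same letters as a known word, e.g. "teh"
    return "other"


def misspelling_features(tokens):
    """Relative frequency of each misspelling category in the post."""
    counts = Counter(
        misspelling_category(t) for t in tokens if t not in KNOWN_WORDS
    )
    n = max(len(tokens), 1)
    return [counts[c] / n for c in CATEGORIES]


def syntactic_features(tokens):
    """Average word length plus relative function-word frequencies."""
    n = max(len(tokens), 1)
    avg_len = sum(len(t) for t in tokens) / n
    return [avg_len] + [tokens.count(w) / n for w in FUNCTION_WORDS]


def feature_vector(text, use_syntactic=True, use_misspelling=True):
    """Concatenate the selected feature sets into one numeric vector."""
    tokens = tokenize(text)
    vector = []
    if use_syntactic:
        vector += syntactic_features(tokens)
    if use_misspelling:
        vector += misspelling_features(tokens)
    return vector


post = "I went to teh shop and it was sooo good"
print(feature_vector(post))
print(feature_vector(post, use_syntactic=False))
```

Toggling `use_syntactic` and `use_misspelling` gives one vector per feature-set combination. Each vector could then be passed to any standard classifier to compare how much each set contributes.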
Keywords: | Authorship Identification, Machine Learning, Text Mining, Psycholinguistic, Blogosphere Text Analysis, Computational Stylistics |
---|---|
Subjects: | G Mathematical and Computer Sciences > G700 Artificial Intelligence; G Mathematical and Computer Sciences > G760 Machine Learning; G Mathematical and Computer Sciences > G720 Knowledge Representation |
Divisions: | College of Science > School of Computer Science |
ID Code: | 1862 |
Deposited On: | 30 Apr 2009 11:46 |