Stylometry research using syntax-based features and machine learning techniques

Kim Luyckx
Researcher
Center for Dutch Language and Speech (CNTS), University of Antwerp
Preference: Oral presentation

The computational extraction and analysis of style is a search for linguistic features on the morphological, syntactic and lexical level that describe the style of individual authors (both fiction and non-fiction) or groups of authors and that are not under the author's conscious control. Recent advances in language technology and shallow parsing allow us to use insights from these fields in this type of research. Combining shallow text analysis with Machine Learning techniques enables us to predict a number of author-specific features (e.g. identity, gender, region, age) that can be used in systems for plagiarism detection, authorship attribution, gender detection and document dating.

This PhD research aims at creating a technical and methodological infrastructure for applied stylometry of Dutch. A central issue is the development of tools (corpora, benchmarks, software for linguistic analysis) for scientific research and concrete applications. The methodology covers several aspects:

(1) Automatic linguistic analysis of documents by means of available text analysis tools on the level of morphological structure, part of speech, global syntactic structures and semantic roles (subject, object, temporal, location) for the construction of potentially relevant stylistic characteristics.
(2) Unsupervised and supervised learning techniques for selecting characteristics with high information value and constructing a model of authorial style.
(3) Evaluation of these models by (a) comparison with stylistic analyses in linguistics and literary science and (b) empirical testing of the predictive power of the models.
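To make the selection step in (2) more concrete, the following is a minimal sketch of ranking candidate features by information gain, one common way of quantifying "information value". The abstract does not name a specific selection measure, so the choice of information gain, the feature names and the toy data are all assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder


# Toy data, invented for illustration: one author label per document
authors = ["A", "A", "B", "B"]
# Hypothetical binary features, one value per document
features = {
    "mentions_parliament": [True, False, True, False],    # same for both authors
    "starts_with_adverbial": [True, True, False, False],  # separates the authors
}

# Rank the candidate features from most to least informative
ranking = sorted(features,
                 key=lambda name: information_gain(features[name], authors),
                 reverse=True)
print(ranking)
```

In this toy setting the topic-related feature carries no information about authorship, while the syntactic one separates the two authors completely, so it is ranked first.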
For our experiments in Authorship Attribution, we selected a corpus of articles about national politics from the Flemish newspaper "De Standaard", written by different authors. At first sight the authors are hard to tell apart, even on token-based and lexical features, because they write about similar subjects. The focus of our investigation, however, is on syntax-based features. These style characteristics are not under the author's conscious control and are therefore good clues for stylometric studies. By combining syntax-based, token-based and lexical features, we aim to build an ad hoc profile that characterizes each author. Classification is performed by Machine Learning algorithms that learn and improve with experience: as more text is seen, the learners continually update the stylistic profiles. We consider this methodology a promising approach to Authorship Attribution.
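As a rough illustration of how such a combined profile might be built and used, the sketch below computes one token-based feature, a few lexical features and part-of-speech bigram frequencies, and attributes an unseen text to the most similar training profile. Everything specific here is an assumption: the input is taken to be already tagged by some shallow parser, and the tag names, the short function-word list, the "long word" threshold and the cosine-based nearest-neighbour classifier are illustrative stand-ins, not the system described above.

```python
import math
from collections import Counter

# Small illustrative list of Dutch function words (assumption)
FUNCTION_WORDS = ("de", "het", "een", "en", "van", "dat")


def profile(tagged_sentences):
    """Build a feature dict from sentences given as lists of (token, pos_tag)."""
    tokens = [tok.lower() for sent in tagged_sentences for tok, _ in sent]
    features = {}
    # Token-based feature: share of tokens longer than six characters
    features["tok:long"] = sum(len(t) > 6 for t in tokens) / len(tokens)
    # Lexical features: relative frequency of each function word
    counts = Counter(tokens)
    for word in FUNCTION_WORDS:
        features["lex:" + word] = counts[word] / len(tokens)
    # Syntax-based features: relative frequency of part-of-speech bigrams
    bigrams = Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        bigrams.update(zip(tags, tags[1:]))
    n_bigrams = sum(bigrams.values()) or 1
    for (first, second), n in bigrams.items():
        features["syn:%s_%s" % (first, second)] = n / n_bigrams
    return features


def cosine(a, b):
    """Cosine similarity between two sparse feature dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def attribute(unknown, training):
    """Return the author whose training profile is most similar (1-NN)."""
    return max(training, key=lambda item: cosine(unknown, item[1]))[0]


# Toy data, invented for illustration: same topic, different word order
texts = {
    "A": [[("de", "DET"), ("minister", "N"), ("spreekt", "V")],
          [("het", "DET"), ("parlement", "N"), ("stemt", "V")]],
    "B": [[("vandaag", "ADV"), ("spreekt", "V"), ("de", "DET"), ("minister", "N")],
          [("morgen", "ADV"), ("stemt", "V"), ("het", "DET"), ("parlement", "N")]],
}
training = [(author, profile(sents)) for author, sents in texts.items()]
unknown = [[("nu", "ADV"), ("vergadert", "V"), ("de", "DET"), ("regering", "N")]]
predicted = attribute(profile(unknown), training)
print(predicted)
```

In the toy data both authors use nearly the same vocabulary, so the token-based and lexical features barely differ, and it is the part-of-speech bigrams that drive the decision. A real setup would also need feature scaling and far more text per author.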