iVia Notes on Genre Identification of Corpora, with an Eye Towards Assigning Intellectual Level of Resource

Work next year may include explorations of genre identification, both to create an additional metadata element and as an aid to classification. A particular target is the intellectual level of the audience (expertise of user) for which the resource is intended. The goal is the automatic assignment of metadata indicating elementary/middle school, high school, and university level educational materials. There has been some work on this, with mixed results, but it has been fairly minimal. Genres or corpora that can be effectively identified include class, gender, region/dialect... Just on the surface, challenges may include working with Web resources/sites as "corpora", in terms of how much text (how many documents) is needed for accurate identification. Probably a lot (beyond the limits of most individual Web resources?). It may turn out that elementary school material (as one pole) and university level material (as the other pole) are easiest to identify (the diction being most different), with middle and high school material found by elimination? We could use GEM and other education/teacher-oriented collections to try to develop a "profile" for this type of material? There is also the challenge of filtering material for the "lifelong learner" (i.e., material for adults such as is found in public libraries). IPL, LII, and Yahoo content might help identify lifelong learner profiles?


Web Genres Visualizer

Maya Dimitrova, Aidan Finn, Nicholas Kushmerick and Barry Smyth
http://www.smi.ucd.ie/misty/
http://www.smi.ucd.ie/misty/aboutWGV.html

You Are Here is a technique for personalized search-result visualization that uses shallow natural-language processing techniques to map documents into a two-dimensional space that captures genre dimensions such as level of expertise and amount of detail, together with a simple visualization interface that helps users rapidly find appropriate documents. We are developing the Web Genre Visualizer, a user interface that replaces conventional ranked document lists with a graphical depiction of the retrieved documents. Our interface classifies documents according to various dimensions of Web document "genre", such as the degree of expertise assumed by the document, the amount of detail presented, or whether the document reports primarily facts or opinions.


Hyppia

http://smi.ucd.ie/hyppia/
The Hyppia demo shows how genre classification can be used in a digital library setting. It allows news articles to be filtered and searched based on genre information. The genre class in this demo is whether the document is subjective or objective.


Exploring the Use of Linguistic Features in Domain and Genre Classification

Maria Wolters and Mathias Kirsten
http://ais.gmd.de/~leopold/textcat.pdf
The central questions are: How useful is information about part-of-speech frequency for text categorisation? Is it feasible to limit word features to content words for text classification? This is examined for domain and genre classification tasks using LIMAS, the German equivalent of the Brown corpus. Because LIMAS is too heterogeneous, neither question can be answered reliably for any of the tasks. However, the results suggest that both questions have to be examined separately for each task at hand, because in some cases the additional information can indeed improve performance. Discusses identification of academic texts.


User Assessment of a Visual Web Genres Classifier

Dimitrova, M., Kushmerick, N., Radeva, P. & Villanueva J.J. (2003)
3rd IASTED Int. Conf. on Visualization, Imaging, and Image Processing (Malaga).
http://www.cs.ucd.ie/staff/nick/home/research/download/dimitrova-viip2003.pdf
http://www2003.org/cdrom/papers/poster/p143/p143-dimitrova.htm
Users assess the "appropriateness" of web documents in many ways. Traditionally, appropriateness has been solely a matter of relevance to a particular topic. But users are concerned with other aspects of document "genre", such as the level of expertise assumed by the author, or the amount of detail. In previous work, we have used machine learning to automatically classify documents along a variety of genre dimensions, and we have developed a graphical interface that depicts documents visually along orthogonal genre dimensions. In order to validate the design of our interface, we have performed two experiments, a brainstorming session and a web-based survey, which have shown that users perceive genre dimensions as independent. In the present paper we elaborate in more detail the idea behind the classifier and draw upon the possibility of user "first-glance" biases in assessment of web documents.


Learning to classify documents according to genre

http://www.cs.ucd.ie/staff/nick/home/research/download/finn-ijcai03-style.pdf
Genre or style analysis can be used to improve results achieved using standard IR techniques. A genre class is a group of documents that are written in a similar style. Genre classification can identify documents that are written in a style most likely to satisfy a user's information need. We consider the use of Machine Learning techniques applied to the task of automatic genre classification. We investigate two sample genre classification tasks: whether a news article is subjective or objective; and whether a review is positive or negative. We investigate the use of three different feature-sets for building genre classifiers. We argue that traditional methods of evaluating text classifiers are insufficient for genre classifiers and emphasize domain transfer for the generated classifiers. Domain transfer indicates the ability of a genre classifier to classify documents that are about topics other than those it was trained on. For both sample genre classification tasks, we build classifiers that perform well within a single topic domain. We also investigate and evaluate the performance of these classifiers when transferred to new subject domains. We describe a method of combining evidence based on different feature-sets. We show that an ensemble learner based on different feature-sets improves performance for genre classification. We further combine predictions from different feature-sets to selectively sample which documents to add to the training set and show that this approach improves the learning rate of the resulting genre classifier.
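
Note to self: the idea of an ensemble over feature-sets can be sketched very simply. The snippet below uses plain majority voting as a stand-in; the abstract does not say how the paper actually combines evidence, so the voting rule, the feature-set names, and the example predictions are all assumptions for illustration only.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most per-feature-set classifiers.

    `predictions` is an iterable with one label per classifier.
    Majority voting is an assumed stand-in, not the paper's method.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions for one news article, one per feature-set.
votes = {
    "bag_of_words": "subjective",
    "pos_statistics": "objective",
    "text_statistics": "subjective",
}
print(majority_vote(votes.values()))  # prints "subjective"
```

With an odd number of classifiers and two classes there are no ties; a real system would more likely weight each vote by classifier confidence.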


Genre Classification and Domain Transfer for Information Filtering (2002)

Finn, A., Kushmerick, N. & Smyth, B. (2002). Genre classification and domain transfer for information filtering. In Proc. European Colloquium on Information Retrieval Research (Glasgow).
http://smi.ucd.ie/hyppia/publications/ECIR02/
The World Wide Web is a vast repository of information, but the sheer volume makes it difficult to identify useful documents. We identify document genre as an important factor in retrieving useful documents and focus on the novel document genre dimension of subjectivity. We investigate three approaches to automatically classifying documents by genre: traditional bag of words techniques, part-of-speech statistics, and hand-crafted shallow linguistic features. We are particularly interested in domain transfer: how well the learned classifiers generalize from the training corpus to a new document corpus.
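
Note to self: a domain transfer test amounts to training on documents about one topic and scoring on documents about another. The sketch below assumes a toy document format (dicts with "topic", "genre", and "text" keys) and a deliberately naive cue-word learner; neither comes from the paper, and both are hypothetical placeholders for a real corpus and classifier.

```python
def train_cue_words(train_docs):
    """Toy learner (assumption): words seen only in subjective training
    documents become cues; any cue word marks a document as subjective."""
    subjective, objective = set(), set()
    for text, genre in train_docs:
        target = subjective if genre == "subjective" else objective
        target.update(text.lower().split())
    cues = subjective - objective
    return lambda text: "subjective" if cues & set(text.lower().split()) else "objective"

def domain_transfer_accuracy(docs, train_topic, test_topic, train):
    """Train on one topic domain, then measure accuracy on another."""
    train_docs = [(d["text"], d["genre"]) for d in docs if d["topic"] == train_topic]
    test_docs = [(d["text"], d["genre"]) for d in docs if d["topic"] == test_topic]
    classify = train(train_docs)
    correct = sum(classify(text) == genre for text, genre in test_docs)
    return correct / len(test_docs)

# Made-up miniature corpus, for illustration only.
docs = [
    {"topic": "football", "genre": "subjective", "text": "i think the team played badly"},
    {"topic": "football", "genre": "objective", "text": "the team lost two games"},
    {"topic": "politics", "genre": "subjective", "text": "i think the vote was unfair"},
    {"topic": "politics", "genre": "objective", "text": "the vote passed on monday"},
]
print(domain_transfer_accuracy(docs, "football", "politics", train_cue_words))  # prints 1.0
```

The same harness could be run over our own collections by swapping in a real learner for `train_cue_words`; topical cue words (e.g. "team") would be expected to transfer poorly, which is the point of the test.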

Integrating Automatic Genre Analysis into Digital Libraries (2001)

Andreas Rauber, Alexander Müller-Kögler
JCDL'01, June 24-28, 2001, Roanoke, Virginia
http://www.ifs.tuwien.ac.at/ifs/research/pub_html/rau_jcdl01/
In this paper we present a way to provide automatic analysis of the structure of text documents. This analysis is based on a combination of various surface level features of texts, such as word statistics, punctuation information, the occurrences of special characters and keywords, as well as mark-up tags capturing image, equation, hyperlink and similar information. Based on these structural descriptions of documents, the self-organizing map (SOM) [12], a popular unsupervised neural network, is used to cluster documents according to their structural similarities. This information is incorporated into the SOMLib digital library system [17] which provides an automatic, topic-based organization of documents using again the self-organizing map to group documents according to their content. The libViewer, a metaphor-graphical interface to the SOMLib system depicts the documents in a digital library as hardcover or paperback books, binders, or papers, sorted by content into various bookshelves, labeled by automatically extracted content descriptors using the LabelSOM technique. Integrating the results of the structural analysis of documents allows us to color the documents, which are sorted by subject into the various shelves, according to their structural similarities, making e.g. complex descriptions stand apart from summaries or legal explanations on the same subject. Similarly, interviews on a given topic are depicted differently from reports, as are numerical tables or result listings. We demonstrate the benefits of an automatic structural analysis of documents in combination with content-based classification using a collection of news articles from several Austrian daily, weekly and monthly news magazines.
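
Note to self: surface-level features of this kind are cheap to compute. The following is a minimal sketch of a feature extractor for plain text; the particular four features, their names, and the crude tokenization are assumptions chosen for illustration, not the feature set used in the paper (which also draws on mark-up tags).

```python
import re
import string

def surface_features(text):
    """Return a few surface-level features of a plain-text document."""
    words = re.findall(r"[A-Za-z']+", text)
    # Crude sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)   # guard against empty input
    n_words = max(len(words), 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
    }

features = surface_features("Hi there. How are you?")
print(features["avg_sentence_length"])  # prints 2.5
```

Vectors like these could feed a clustering step or a classifier for audience level, where word and sentence length might plausibly separate elementary from university material.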


Text Mining in the SOMLib Digital Library System: The Representation of Topics and Genres

A. Rauber, D. Merkl
In: Applied Intelligence, Vol. 18, No. 3, pp. 271-293, Kluwer, May/June 2003.
This paper presents the SOMLIB digital library system, built on neural networks to provide text mining capabilities. At its foundation we use the self-organizing map to provide content-based clustering of documents. By using an extended model, i.e. the growing hierarchical self-organizing map, we can further detect subject hierarchies in a document collection, with the neural network adapting its size and structure automatically during its unsupervised training process to reflect the topical hierarchy. By mining the weight vector structure of the trained maps our system is able to select keywords describing the various topical clusters. Text mining has to incorporate more than the mere analysis of content. Structural and genre information are key in organizing and locating information.


Automatic Detection of Text Genre

Brett Kessler, Geoffrey Nunberg, Hinrich Schütze
Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
http://arxiv.org/abs/cmp-lg/9707002
As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.


Coordinating Information Using Genres (2000)

http://citeseer.ist.psu.edu/yoshioka00coordinating.html


An Attempt to Use Weighted Cusums to Identify Sublanguages

http://cnts.uia.ac.be/conll98/pdf/131139so.pdf
This paper explores the use of weighted cusums, a technique found in authorship attribution studies, for the purpose of identifying sublanguages. The technique, and its relation to standard cusums (cumulative sum charts) is first described, and the formulae for calculations given in detail. The technique compares texts by testing for the incidence of linguistic "features" of a superficial nature, e.g. proportion of 2- and 3-letter words, words beginning with a vowel, and so on, and measures whether two texts differ significantly in respect of these features. The paper describes an experiment in which 14 groups of three texts each representing different sublanguages are compared with each other using the technique. The texts are first compared within each group to establish that the technique can identify the groups as being homogeneous. The texts are then compared with each other, and the results analysed. Taking the average of seven different tests, the technique is able to distinguish the sublanguages in only 43% of the cases. But if the best score is taken, 79% of pairings can be distinguished. This is a better result, and the test seems able to quantify the difference between sublanguages.
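
Note to self: for orientation, here is a minimal sketch of the standard (unweighted) cusum applied to one of the superficial features mentioned, the number of 2- and 3-letter words per sentence. It does not implement the weighted variant or the significance test from the paper, whose formulae are not reproduced in these notes; the function names and the sentence splitting are assumptions.

```python
import re

def short_word_counts(text):
    """Count the 2- and 3-letter words in each sentence of `text`."""
    counts = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        if words:
            counts.append(sum(len(w) in (2, 3) for w in words))
    return counts

def cusum(values):
    """Standard cusum: running sum of each value's deviation from the mean."""
    mean = sum(values) / len(values)
    total, series = 0.0, []
    for v in values:
        total += v - mean
        series.append(total)
    return series

counts = short_word_counts("The cat sat. A very large elephant walked by.")
print(counts)         # prints [3, 1]
print(cusum(counts))  # prints [1.0, 0.0]
```

A cusum series always ends at zero (up to floating-point rounding); what differs between texts is the shape of the path in between.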


Synthesizing composite topic structure trees for multiple domain specific documents.

Min-Yen Kan, Judith L. Klavans, Kathleen R. McKeown
2001, Columbia University Computer Science Technical Report, CUCS-003-01.
http://www.cs.columbia.edu/~min/papers/cucs-003-01.pdf
Domain specific texts often have implicit rules on content and organization. We introduce a novel method for synthesizing this topical structure. The system uses corpus examples and recursively merges their topics to build a hierarchical tree. A subjective cross domain evaluation showed that the system performed well in combining related topics and highlighting important ones. We have defined the nature of the relatedness between documents that gives rise to similar topical structure: the notion of a text type, defined by the intersection of domain and genre. The challenges here are in some ways the opposite of those concerning document boundary identification. Here the concern is with merging and identifying common approaches and text structures that characterize a generalized text type.

Literature Genre & Cybergenre

Bibliography
http://www.cs.uu.nl/people/leen/GenreDev/Bibliography.htm