Work next year may include explorations of genre identification, both to create an additional metadata element and as an aid to classification. A particular target is the intellectual level of the audience (the expertise of the user) for which the resource is intended. The goal is the automatic assignment of metadata indicating elementary/middle school, high school, and university-level educational materials. There has been some work on this, with mixed results, but it has been fairly minimal. Genres or corpora that can be effectively identified include class, gender, region/dialect... On the surface, challenges may include working with Web resources/sites as "corpora", in terms of how much text (how many documents) is needed for accurate identification: probably a lot (beyond the limits of most individual Web resources?). It may turn out that elementary school material (as one pole) and university-level material (as the other pole) are easiest to identify, the diction being most different, with middle and high school material found by elimination? We could use GEM and other education/teacher-oriented collections to try to develop a "profile" for this type of material. There is also the challenge of filtering material for the "lifelong learner" (i.e., material for adults such as is found in public libraries). IPL, LII, and Yahoo content might help identify lifelong learner profiles?
Maya Dimitrova, Aidan Finn, Nicholas Kushmerick and Barry Smyth
http://www.smi.ucd.ie/misty/
http://www.smi.ucd.ie/misty/aboutWGV.html
You Are Here is a technique for personalized search-result visualization that uses shallow natural-language processing
techniques to map documents into a two-dimensional space that captures genre dimensions such as level of expertise and
amount of detail, together with a simple visualization interface that helps users rapidly find appropriate documents. We
are developing the Web Genre Visualizer, a user interface that replaces conventional ranked document lists with a
graphical depiction of the retrieved documents. Our interface classifies documents according to various dimensions of
Web document "genre", such as the degree of expertise assumed by the document, the amount of detail presented, or
whether the document reports primarily facts or opinions.
http://smi.ucd.ie/hyppia/
The Hyppia demo shows how genre classification can be used in a digital library setting. It allows news articles to be
filtered and searched based on genre information. The genre class in this demo is whether the document is subjective or
objective.
Maria Wolters and Mathias Kirsten
http://ais.gmd.de/~leopold/textcat.pdf
The central questions are: How useful is information about part-of-speech frequency for text categorisation? Is it
feasible to limit word features to content words for text classification? This is examined for domain and genre
classification tasks using LIMAS, the German equivalent of the Brown corpus. Because LIMAS is too heterogeneous, neither
question can be answered reliably for any of the tasks. However, the results suggest that both questions have to be
examined separately for each task at hand, because in some cases the additional information can indeed improve performance.
Discusses identification of academic texts.
Dimitrova, M., Kushmerick, N., Radeva, P. & Villanueva J.J. (2003)
3rd IASTED Int. Conf. on Visualization, Imaging, and Image Processing (Malaga).
http://www.cs.ucd.ie/staff/nick/home/research/download/dimitrova-viip2003.pdf
http://www2003.org/cdrom/papers/poster/p143/p143-dimitrova.htm
Users assess the "appropriateness" of web documents in many ways. Traditionally, appropriateness has been solely a
matter of relevance to a particular topic. But users are concerned with other aspects of document "genre", such as the
level of expertise assumed by the author, or the amount of detail. In previous work, we have used machine learning to
automatically classify documents along a variety of genre dimensions, and we have developed a graphical interface that
depicts documents visually along orthogonal genre dimensions. In order to validate the design of our interface, we have
performed two experiments, a brainstorming session and a web-based survey, which have shown that users perceive genre
dimensions as independent. In the present paper we elaborate in more detail the idea behind the classifier and draw upon
the possibility of user "first-glance" biases in assessment of web documents.
http://www.cs.ucd.ie/staff/nick/home/research/download/finn-ijcai03-style.pdf
Genre or style analysis can be used to improve results achieved using standard IR techniques. A
genre class is a group of documents that are written in a similar style. Genre classification can identify
documents that are written in a style most likely to satisfy a user's information need. We consider the use of Machine
Learning techniques applied to the task of automatic genre classification. We investigate two sample genre
classification tasks: whether a news article is subjective or objective; and whether a review is positive or negative.
We investigate the use of three different feature-sets for building genre classifiers.
We argue that traditional methods of evaluating text classifiers are insufficient for genre classifiers and
emphasize domain transfer for the generated classifiers. Domain transfer indicates the ability of a genre classifier to
classify documents that are about topics other than those it was trained on. For both sample genre classification tasks,
we build classifiers that perform well within a single topic domain. We also investigate and evaluate the performance of
these classifiers when transferred to new subject domains.
We describe a method of combining evidence based on different feature-sets. We show that an ensemble
learner based on different feature-sets improves performance for genre classification. We further combine predictions
from different feature-sets to selectively sample which documents to add to the training set and show that this approach
improves the learning rate of the resulting genre classifier.
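The ensemble and selective-sampling ideas in this abstract can be sketched roughly as follows. This is only an illustration: the three toy classifiers, their cue lists, and the "select documents on which the classifiers disagree" rule are assumptions made here, not the feature-sets or the sampling criterion used by the authors.

```python
from collections import Counter

# Three toy classifiers, each looking at a different kind of evidence.
# All three are hypothetical stand-ins, not the classifiers from the paper.
def by_words(text):
    cues = {"think", "feel", "awful", "great"}  # assumed opinion words
    return "subjective" if cues & set(text.lower().split()) else "objective"

def by_pronouns(text):
    tokens = text.lower().split()
    first_person = sum(t in {"i", "we", "my", "our"} for t in tokens)
    return "subjective" if first_person > 0 else "objective"

def by_punctuation(text):
    return "subjective" if "!" in text or "?" in text else "objective"

CLASSIFIERS = [by_words, by_pronouns, by_punctuation]

def ensemble_predict(text):
    """Majority vote over the per-feature-set predictions."""
    votes = Counter(clf(text) for clf in CLASSIFIERS)
    return votes.most_common(1)[0][0]

def select_for_labelling(pool):
    """Pick unlabelled documents on which the classifiers disagree
    (an assumed selection rule; the paper may use a different one)."""
    return [t for t in pool if len({clf(t) for clf in CLASSIFIERS}) > 1]
```

With an odd number of classifiers and two classes the vote is never tied. Documents returned by `select_for_labelling` are the ones a human would label next, on the assumption that disagreement marks the most informative examples.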
Finn, A., Kushmerick, N. & Smyth, B. (2002). Genre classification and domain transfer for information filtering. In
Proc. European Colloquium on Information Retrieval Research (Glasgow).
http://smi.ucd.ie/hyppia/publications/ECIR02/
The World Wide Web is a vast repository of information, but the sheer volume makes it difficult to identify useful
documents. We identify document genre as an important factor in retrieving useful documents and focus on the novel
document genre dimension of subjectivity. We investigate three approaches to automatically classifying documents by
genre: traditional bag of words techniques, part-of-speech statistics, and hand-crafted shallow linguistic features. We
are particularly interested in domain transfer: how well the learned classifiers generalize from the training corpus to
a new document corpus.
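A domain-transfer test of the bag-of-words approach might look something like the sketch below: train a subjectivity classifier on documents about one topic and measure accuracy on documents about another. The naive Bayes learner, the tokenizer, and the toy football/politics sentences are all assumptions chosen for illustration; the paper's own learner and corpora are not reproduced here.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_nb(docs):
    """docs: list of (text, label). Returns a multinomial naive Bayes model."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for w in tokenize(text):
            counts[w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def classify(model, text):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(label_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, docs):
    return sum(classify(model, t) == y for t, y in docs) / len(docs)

# Toy corpora (invented): train on football, test on politics.
football = [
    ("i think the team played a wonderful match", "subjective"),
    ("i feel the referee was terrible", "subjective"),
    ("the team won the match by two goals according to officials", "objective"),
    ("officials reported the match ended on saturday", "objective"),
]
politics = [
    ("i think the minister gave a terrible speech", "subjective"),
    ("officials reported the vote ended on tuesday according to the minister", "objective"),
]

model = train_nb(football)
transfer_accuracy = accuracy(model, politics)
```

In this toy case the transfer works because the genre cues ("i think", "officials reported") are shared across topics while the topic words ("minister", "vote") are simply unseen. On real corpora, bag-of-words features tend to pick up topic vocabulary, which is the reason for evaluating on a different domain in the first place.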
Andreas Rauber, Alexander Müller-Kögler
JCDL'01, June 24-28, 2001, Roanoke, Virginia
http://www.ifs.tuwien.ac.at/ifs/research/pub_html/rau_jcdl01/
In this paper we present a way to provide automatic analysis of the structure of text documents. This analysis is based
on a combination of various surface level features of texts, such as word statistics, punctuation information, the
occurrences of special characters and keywords, as well as mark-up tags capturing image, equation, hyperlink and similar
information. Based on these structural descriptions of documents, the self-organizing map (SOM) [12], a popular
unsupervised neural network, is used to cluster documents according to their structural similarities. This information
is incorporated into the SOMLib digital library system [17] which provides an automatic, topic-based organization of
documents using again the self-organizing map to group documents according to their content. The libViewer, a
metaphor-graphical interface to the SOMLib system depicts the documents in a digital library as hardcover or paperback
books, binders, or papers, sorted by content into various bookshelves, labeled by automatically extracted content
descriptors using the LabelSOM technique. Integrating the results of the structural analysis of documents allows us to
color the documents, which are sorted by subject into the various shelves, according to their structural similarities,
making e.g. complex descriptions stand apart from summaries or legal explanations on the same subject. Similarly,
interviews on a given topic are depicted differently from reports, as are numerical tables or result listings. We
demonstrate the benefits of an automatic structural analysis of documents in combination with content-based
classification using a collection of news articles from several Austrian daily, weekly and monthly news magazines.
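As a rough illustration of the kind of surface-level description mentioned above, the sketch below turns a text into a small dictionary of structural features. The particular features and their names are assumptions made for this example, not the feature set used in SOMLib, and the clustering step (the self-organizing map) is left out entirely.

```python
import re

def surface_features(text):
    """Compute a few structure-oriented features that ignore topic.
    Feature names and choices are illustrative only."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "question_marks": text.count("?") / n_chars,  # frequent in interviews
        "quotes": text.count('"') / n_chars,
        "digits": sum(c.isdigit() for c in text) / n_chars,  # tables, result lists
        "colons": text.count(":") / n_chars,
    }
```

Vectors like these could then be fed to any clustering method. An interview would be expected to score high on question marks and quotes, and a result listing high on digits, regardless of subject.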
A. Rauber, D. Merkl
In: Applied Intelligence, Vol. 18, No. 3, pp. 271-293, Kluwer, May/June 2003.
This paper presents the SOMLIB digital library system, built on neural networks to provide text mining capabilities. At
its foundation we use the self-organizing map to provide content-based clustering of documents. By using an extended
model, i.e. the growing hierarchical self-organizing map,
we can further detect subject hierarchies in a document collection, with the neural network adapting its size and
structure automatically during its unsupervised training process to reflect the topical hierarchy.
By mining the weight vector structure of the trained maps our system is able to select keywords describing the various
topical clusters. Text mining has to incorporate more than the mere analysis of content.
Structural and genre information are key in organizing and locating information.
Brett Kessler, Geoffrey Nunberg, Hinrich Schütze, Proceedings of the Thirty-Fifth Annual Meeting of the Association for
Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
http://arxiv.org/abs/cmp-lg/9707002
As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for
computational linguistics as a complement to topical and structural principles of classification. We propose a theory of
genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface
cues is as successful as detection based on deeper structural properties.
http://citeseer.ist.psu.edu/yoshioka00coordinating.html
http://cnts.uia.ac.be/conll98/pdf/131139so.pdf
This paper explores the use of weighted cusums, a technique found in authorship attribution studies, for
the purpose of identifying sublanguages. The technique, and its relation to standard cusums (cumulative
sum charts) is first described, and the formulae for calculations given in detail. The technique compares
texts by testing for the incidence of linguistic "features" of a superficial nature, e.g. proportion of 2- and
3-letter words, words beginning with a vowel, and so on, and measures whether two texts differ significantly in respect
of these features. The paper describes an experiment in which 14 groups of three texts each representing different
sublanguages are compared with each other using the technique. The texts are first compared within each group to
establish that the technique can identify the groups as being homogeneous. The texts are then compared with each other,
and the results analysed. Taking the average of seven different tests, the technique is able to distinguish the
sublanguages in only 43% of the cases. But if the best score is taken, 79% of pairings can be distinguished. This is a
better result, and the test seems able to quantify the difference between sublanguages.
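For orientation, a weighted cusum over one such feature could be computed along the lines below. The summary above does not reproduce the paper's formulae, so the form used here (running sum of feature count minus overall rate times sentence length) is an assumption, as are the function names and the punctuation handling. The significance test that compares two texts is not shown.

```python
def feature_counts(sentence):
    """Return (number of words, number of 2- or 3-letter words) for a sentence."""
    words = sentence.split()
    hits = sum(len(w.strip('.,;:!?"\'')) in (2, 3) for w in words)
    return len(words), hits

def weighted_cusum(sentences):
    """Assumed form: S_i = sum over j <= i of (x_j - rate * w_j), where w_j is the
    word count of sentence j, x_j its feature count, and rate = sum(x) / sum(w)."""
    pairs = [feature_counts(s) for s in sentences]
    total_words = sum(w for w, _ in pairs)
    total_hits = sum(x for _, x in pairs)
    rate = total_hits / total_words
    running, series = 0.0, []
    for w, x in pairs:
        running += x - rate * w  # deviation weighted by sentence length
        series.append(running)
    return series
```

By construction the series returns to zero at the last sentence, so what distinguishes texts is the shape of the path in between. Weighting by sentence length keeps long sentences from dominating merely because they contain more words.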
Min-Yen Kan, Judith L. Klavans, Kathleen R. McKeown, 2001,
Columbia University Computer Science Technical Report, CUCS-003-01.
http://www.cs.columbia.edu/~min/papers/cucs-003-01.pdf
Domain specific texts often have implicit rules on content and organization. We introduce a novel method for
synthesizing this topical structure. The system uses corpus examples and recursively merges their topics to build a
hierarchical tree. A subjective cross domain evaluation showed that the system performed well in combining related
topics and highlighting important ones. We have defined the nature of the relatedness between documents that gives
rise to similar topical structure: the notion of a text type, defined by the intersection of domain and genre.
The challenges here are in some ways the opposite of those concerning document boundary identification. Here the concern
is with merging and identifying common approaches and text structures that characterize a generalized text type.
Bibliography http://www.cs.uu.nl/people/leen/GenreDev/Bibliography.htm