Information Extraction and Classification: Notes, Articles, Tools

Interactive Information Extraction with Constrained Conditional Random Fields.
Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. Nineteenth National Conference on Artificial Intelligence (AAAI 2004). San Jose, CA. (Winner of Honorable Mention Award.)

http://www.cs.umass.edu/~mccallum/papers/addrie-aaai04.pdf
Interactive component and notion of confidence estimates (something we've talked about)

Confidence Estimation in Extraction

Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, short paper.
http://www.cs.umass.edu/~mccallum/papers/crfcp-hlt04.pdf
Confidence estimation: when not to apply metadata

Using Unlabeled Data for Training/Boosting - New Approach:

A Note on Semi-supervised Learning using Markov Random Fields. Wei Li and Andrew McCallum. Technical Note, February 3, 2004.
http://www.cs.umass.edu/~mccallum/papers/li-ssmrf.pdf
Important: develop a good distance metric, with class labels/boundaries that reflect the nature of the training data.

Unified Information Extraction and Data Mining:

A Note on the Unification of Information Extraction and Data Mining using Conditional-Probability, Relational Models. Andrew McCallum and David Jensen. IJCAI'03 Workshop on Learning Statistical Models from Relational Data, 2003.
http://www.cs.umass.edu/~mccallum/papers/iedatamining-ijcaiws03.pdf
Need to address aboutness, rich text identification in Soumen's work.

Information Extraction

Y. Matsuo and M. Ishizuka:
Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information, Int'l Journal on Artificial Intelligence Tools, Vol.13, No.1, pp.157-169 (2004.3)
http://www.miv.t.u-tokyo.ac.jp/papers/matsuoIJAIT04.pdf

Yutaka Matsuo, Yukio Ohsawa, Mitsuru Ishizuka:
KeyWorld: Extracting Keywords in a Document as a Small World, Proc. Discovery Science(DS'2001), pp. 271-281, Washington D.C., USA, Nov. 2001.
http://www.miv.t.u-tokyo.ac.jp/papers/matsuoDS01.pdf

Yutaka Matsuo, Yukio Ohsawa and Mitsuru Ishizuka:
Document as a Small World, in New Frontiers in Artificial Intelligence -- Joint JSAI 2001 Workshop Post-Proceedings (T. Terano, T. Nishida, A. Namatame, S. Tsumoto, Y. Ohsawa, T. Washio (eds.)), LNAI 2253, pp.444-448, Springer-Verlag (2001)
http://www.miv.t.u-tokyo.ac.jp/papers/matsuoJSAI01ws.pdf

Classification:

Y. Yang, J. Zhang and B. Kisiel. A scalability analysis of classifiers in text categorization. ACM SIGIR'03, pp 96-103, 2003.

Software from CMU Auton Project
http://www.autonlab.org/autonweb/software.jsp
Bayes Net Learner; Dense kNN; Dense NB; Sparse kNN; Sparse NB

Linear Classification:

J. Zhang and Y. Yang. Robustness of regularized linear classification methods in text categorization. ACM SIGIR'03, pp 190-197, 2003.

Logistic Regression:

Fast Logistic Regression for Data Mining, Text Classification and Link Detection
Paul Komarek and Andrew Moore
http://www.autonlab.org/autonweb/documents/papers/komarek:nips2003.pdf

Paul Komarek
Logistic Regression for Data Mining and High-Dimensional Classification
http://www.autonlab.org/autonweb/showPaper.jsp?ID=komarek:lr_thesis
Our Logistic Regression implementation offers superior results for large classification problems.
The focus of this thesis is fast and robust adaptations of logistic regression (LR) for data mining and high-dimensional classification problems. LR is well-understood and widely used in the statistics, machine learning, and data analysis communities. Its benefits include a firm statistical foundation and a probabilistic model useful for ``explaining'' the data. There is a perception that LR is slow, unstable, and unsuitable for large learning or classification tasks. Through fast approximate numerical methods, regularization to avoid numerical instability, and an efficient implementation we will show that LR can outperform modern algorithms like Support Vector Machines (SVM) on a variety of learning tasks. Our novel implementation, which uses a modified iteratively re-weighted least squares estimation procedure, can compute model parameters for sparse binary datasets with hundreds of thousands of rows and attributes, and millions or tens of millions of nonzero elements in just a few seconds. Our implementation also handles real-valued dense datasets of similar size.
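
Sketch of the iteratively re-weighted least squares (IRLS) idea mentioned in the abstract above, written as Newton updates for ridge-regularized logistic regression. This is a toy pure-Python illustration for small dense data, not the thesis implementation (which is described as a modified procedure aimed at large sparse datasets). The function names, the `ridge` parameter, and the dense Gaussian-elimination solver are assumptions made for the example.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, row):
    return sigmoid(sum(w * x for w, x in zip(weights, row)))


def fit_logistic_irls(rows, labels, ridge=1.0, iterations=20):
    """Fit weights for 0/1 labels; each row should include a constant 1.0
    if an intercept is wanted (assumption: the intercept is penalized too)."""
    d = len(rows[0])
    w = [0.0] * d
    for _ in range(iterations):
        # Newton step: (X^T S X + ridge * I) delta = X^T (y - p) - ridge * w,
        # where S is diagonal with entries p * (1 - p) (the IRLS weights).
        hessian = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        gradient = [-ridge * wi for wi in w]
        for x, y in zip(rows, labels):
            p = predict_probability(w, x)
            s = p * (1.0 - p)
            for i in range(d):
                gradient[i] += (y - p) * x[i]
                for j in range(d):
                    hessian[i][j] += s * x[i] * x[j]
        delta = solve_linear_system(hessian, gradient)
        w = [wi + di for wi, di in zip(w, delta)]
    return w
```

Usage: `w = fit_logistic_irls([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], [0, 0, 1, 1])`, then `predict_probability(w, [1.0, 3.0])`. The ridge term keeps the linear system well conditioned even when the classes are separable, which is the kind of numerical instability the abstract says regularization is meant to avoid.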

Anna Goldenberg et al.
A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion
http://www.autonlab.org/autonweb/showPaper.jsp?ID=linkcomplete2003
This paper examines the task of link completion, relative algorithm performance, and what this can tell us about the structure of the data.

Paul Komarek and Andrew Moore
Fast Robust Logistic Regression for Large Sparse Datasets with Binary Outputs
http://www.autonlab.org/autonweb/showPaper.jsp?ID=komarek-fast
Logistic regression can provide faster, better results than SVM for life-sciences datasets with hundreds of thousands of attributes. This paper consists of an empirical examination of the first assumption, and surveys, implements and compares techniques by which logistic regression can be scaled to data with millions of attributes and records. Our results, on a large life sciences dataset, indicate that logistic regression can perform surprisingly well, both statistically and computationally, when compared with an array of more recent classification algorithms.

Tong Zhang and F. Oles
Text Categorization Based on Regularized Linear Classification Methods
http://www.research.ibm.com/people/t/tzhang/pubs.html
A number of linear classification methods such as the linear least squares fit (LLSF), logistic regression, and support vector machines (SVM's) have been applied to text categorization problems. These methods share the similarity by finding hyperplanes that approximately separate a class of document vectors from its complement. However, support vector machines are so far considered special in that they have been demonstrated to achieve the state of the art performance. It is therefore worthwhile to understand whether such good performance is unique to the SVM design, or if it can also be achieved by other linear classification methods. In this paper, we compare a number of known linear classification methods as well as some variants in the framework of regularized linear systems. We will discuss the statistical and numerical properties of these algorithms, with a focus on text categorization. We will also provide some numerical experiments to illustrate these algorithms on a number of datasets.
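
One way to read the comparison in the abstract above: least squares fit, logistic regression, and SVM all fit a hyperplane by minimizing an average per-document loss plus a penalty on the weights, and differ mainly in the loss. The sketch below uses the standard textbook forms of the three losses on the margin y * (w . x) with labels in {-1, +1}. The squared-norm penalty and all function names are assumptions for illustration, not the paper's exact formulation.

```python
import math


def squared_loss(margin):
    """Least squares fit: penalizes any deviation of the margin from 1."""
    return (1.0 - margin) ** 2


def logistic_loss(margin):
    """Logistic regression: log(1 + exp(-margin)), computed stably."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def hinge_loss(margin):
    """SVM: zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)


def regularized_risk(weights, rows, labels, loss, penalty=0.1):
    """Average loss over the data plus penalty * ||weights||^2."""
    total = 0.0
    for x, y in zip(rows, labels):
        margin = y * sum(w * xi for w, xi in zip(weights, x))
        total += loss(margin)
    return total / len(rows) + penalty * sum(w * w for w in weights)
```

Passing `squared_loss`, `logistic_loss`, or `hinge_loss` as the `loss` argument gives three objectives over the same hyperplane family, which is what makes a like-for-like comparison possible.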

Tong Zhang, Vijay S. Iyengar: Recommender Systems Using Linear Classifiers. Journal of Machine Learning Research 2: 313-334 (2002)
http://www.research.ibm.com/people/t/tzhang/pubs.html
Recommender systems use historical data on user preferences and other available data on users (for example, demographics) and items (for example, taxonomy) to predict items a new user might like. Applications of these methods include recommending items for purchase and personalizing the browsing experience on a web-site. Collaborative filtering methods have focused on using just the history of user preferences to make the recommendations. These methods have been categorized as memory-based

Structural Logistic Regression for Link Analysis,
Alexandrin Popescul, Lyle H. Ungar,
Workshop on Multi-Relational Data Mining at KDD 2003.
http://www.cis.upenn.edu/~popescul/Publications/popescul03mrdm.pdf

Towards Structural Logistic Regression: Combining Relational and Statistical Learning, Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, David M. Pennock, Workshop on Multi-Relational Data Mining at KDD 2002.
http://www.cis.upenn.edu/~popescul/Publications/popescu02structural.ps

kNN

Ting Liu et al.
Efficient Exact k-NN and Nonparametric Classification in High Dimensions
http://www.autonlab.org/autonweb/showPaper.jsp?ID=Liu-knn
This paper is about non-approximate acceleration of high dimensional nonparametric operations such as $k$ nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure $k$-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based $k$-NN. These results include datasets with up to $10^6$ dimensions and $10^5$ records, and show non-trivial speedups while giving exact answers.
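
For reference, a minimal exact k-NN classifier by brute force, i.e. the baseline that tree-based methods try to speed up. It is not the ball-tree algorithm from the paper. Note that the classifier only uses the label counts among the k nearest points, which is the "properties of that set" observation the abstract builds on. Function and parameter names are assumptions for the example.

```python
import heapq
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query.

    Exact but slow: every training point is compared with the query.
    """
    distances = (
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = heapq.nsmallest(k, distances, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Usage: `knn_classify([(0, 0), (0, 1), (5, 5)], ["a", "a", "b"], (0.2, 0.3), k=3)`. Cost per query is linear in the number of training points times the dimension, which is what makes acceleration worthwhile at the dataset sizes quoted above.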

Naïve Bayes Improvement:

Tackling the Poor Assumptions of Naive Bayes Text Classifiers
Jason D. M. Rennie, Lawrence Shih, Jaime Teevan and David R. Karger.
Proceedings of the Twentieth International Conference on Machine Learning. 2003.
http://www.ai.mit.edu/~jrennie/papers/icml03-nb.pdf
Quick and robust, these changes may improve accuracy to SVM levels.
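
A rough sketch of complement-class estimation, which, as far as I recall, is one of the changes the paper proposes: each class's word weights are estimated from the documents of all other classes, and a document goes to the class whose complement fits it worst. This is a simplified illustration only. It leaves out the paper's other adjustments, and the function names, smoothing constant `alpha`, and token-list input format are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_complement_nb(documents, labels, alpha=1.0):
    """documents: list of token lists. Returns {label: {token: log-weight}}."""
    vocabulary = sorted({token for doc in documents for token in doc})
    class_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        class_counts[label].update(doc)
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    weights = {}
    for label, counts in class_counts.items():
        # Complement counts: token occurrences in every class except `label`.
        complement = {t: total_counts[t] - counts[t] for t in vocabulary}
        denominator = sum(complement.values()) + alpha * len(vocabulary)
        weights[label] = {
            t: math.log((complement[t] + alpha) / denominator) for t in vocabulary
        }
    return weights


def classify_complement_nb(weights, document):
    """Choose the class whose complement model gives the document the lowest score."""
    def complement_score(label):
        w = weights[label]
        # Tokens never seen in training are ignored.
        return sum(w[token] for token in document if token in w)

    return min(weights, key=complement_score)
```

Usage: train on tokenized documents with `train_complement_nb(docs, labels)`, then call `classify_complement_nb(weights, ["goal", "team"])`.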


SVM Improvements:

Not Too Hot, Not Too Cold: The Bundled-SVM is Just Right!
Lawrence Shih, Yu-Han Chang, Jason D. M. Rennie and David Karger.
Proceedings of the ICML-2002 Workshop on Text Learning. 2002.
http://www.ai.mit.edu/~jrennie/papers/icml02-bundled.ps.gz

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, Proceedings of the International Conference on Machine Learning (ICML), 2004.
http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf


SIMPL:

Fast and accurate text classification via multiple linear discriminant projections.
Soumen Chakrabarti, Shourya Roy and Mahesh Soundalgekar. VLDB Journal, 12(2), pages 170-185, 2003.
http://www.cs.berkeley.edu/~soumen/doc/vldbj2003/simpl2003.pdf


Text Classification Tools - Miscellaneous:

Orange: component-based data mining software in C++; includes SVM, logistic regression, clustering, and much more.
http://magix.fri.uni-lj.si/orange/

OpenNLP Tools Common API v1.0.0
http://opennlp.sourceforge.net/api/index.html

MLC++:
A library of C++ classes for supervised machine learning.
http://www.sgi.com/tech/mlc/index.html

MALLET:
A Machine Learning for Language Toolkit
http://mallet.cs.umass.edu/
new release as of 4/04