Information Extraction and Classification Notes, Articles, Tools
Interactive Information
Extraction with Constrained Conditional Random Fields.
Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. Nineteenth National Conference
on Artificial Intelligence (AAAI 2004). San Jose, CA. (Winner of Honorable Mention
Award.)
http://www.cs.umass.edu/~mccallum/papers/addrie-aaai04.pdf
Interactive component and notion of confidence estimates (something we've talked about)
Confidence Estimation in Extraction
Confidence Estimation for
Information Extraction. Aron Culotta and Andrew McCallum. Proceedings of Human Language
Technology Conference and North American Chapter of the Association for Computational Linguistics
(HLT-NAACL), 2004, short paper.
http://www.cs.umass.edu/~mccallum/papers/crfcp-hlt04.pdf
Confidence estimation: when not to apply metadata
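The note above (skip applying extracted metadata when the extractor is not confident) can be sketched as a simple thresholding rule. This is only an illustration of the idea, not the estimation method from the paper: the function name apply_if_confident, the 0.9 cutoff, and the example field values are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def apply_if_confident(
    extracted: Dict[str, Tuple[str, float]],
    threshold: float = 0.9,  # hypothetical cutoff; would be tuned on held-out data
) -> Dict[str, Optional[str]]:
    """Keep a field value only when its confidence meets the threshold.

    `extracted` maps field name -> (value, confidence in [0, 1]).
    Fields below the threshold map to None so they can be left blank
    or routed to a person for review.
    """
    return {
        field: (value if conf >= threshold else None)
        for field, (value, conf) in extracted.items()
    }


# Hypothetical extractor output: one confident field, one doubtful field.
fields = {"author": ("A. McCallum", 0.97), "year": ("2040", 0.41)}
result = apply_if_confident(fields)
```

How the confidence scores themselves are produced is the subject of the paper and is not shown here.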
Using Unlabeled Data for Training/Boosting - New Approach:
A Note on Semi-supervised Learning
using Markov Random Fields. Wei Li and Andrew McCallum. Technical Note, February 3, 2004.
http://www.cs.umass.edu/~mccallum/papers/li-ssmrf.pdf
Developing a good distance metric, with class labels/boundaries that reflect the nature of the
training data, is important
Unified Information Extraction and Data Mining:
A Note on the
Unification of Information Extraction and Data Mining using Conditional-Probability, Relational
Models. Andrew McCallum and David Jensen. IJCAI'03 Workshop on Learning Statistical Models
from Relational Data, 2003.
http://www.cs.umass.edu/~mccallum/papers/iedatamining-ijcaiws03.pdf
Need to address aboutness, rich text identification in Soumen's work.
Information Extraction
Y. Matsuo and M. Ishizuka:
Keyword Extraction from a Single Document using Word Co-occurrence
Statistical Information, Int'l Journal on Artificial Intelligence Tools, Vol.13, No.1, pp.157-169
(2004.3)
http://www.miv.t.u-tokyo.ac.jp/papers/matsuoIJAIT04.pdf
Yutaka Matsuo, Yukio Ohsawa, Mitsuru Ishizuka:
KeyWorld: Extracting Keywords in a Document
as a Small World, Proc. Discovery Science(DS'2001), pp. 271-281, Washington D.C., USA, Nov.
2001.
http://www.miv.t.u-tokyo.ac.jp/papers/matsuoDS01.pdf
Yutaka Matsuo, Yukio Ohsawa and Mitsuru Ishizuka:
Document as a Small World, in New
Frontiers in Artificial Intelligence -- Joint JSAI 2001 Workshop Post-Proceedings (T. Terano, T.
Nishida, A. Namatame, S. Tsumoto, Y. Ohsawa, T. Washio (eds.)), LNAI 2253, pp.444-448,
Springer-Verlag (2001)
http://www.miv.t.u-tokyo.ac.jp/papers/matsuoJSAI01ws.pdf
Classification:
Y. Yang, J. Zhang and B. Kisiel. A scalability analysis of classifiers
in text categorization. ACM SIGIR'03, pp 96-103, 2003.
Software from CMU Auton Project
http://www.autonlab.org/autonweb/software.jsp
Bayes Net Learner; Dense kNN; Dense NB; Sparse kNN; Sparse NB
Linear Classification:
J. Zhang and Y. Yang. Robustness of regularized
linear classification methods in text categorization. ACM SIGIR'03, pp 190-197,
2003.
Logistic Regression:
Fast Logistic Regression for Data Mining, Text Classification and Link Detection
Paul Komarek and Andrew Moore
http://www.autonlab.org/autonweb/documents/papers/komarek:nips2003.pdf
Paul Komarek
Logistic Regression for Data Mining and High-Dimensional Classification
http://www.autonlab.org/autonweb/showPaper.jsp?ID=komarek:lr_thesis
Our Logistic Regression implementation offers superior results for large classification
problems.
The focus of this thesis is fast and robust adaptations of logistic regression (LR) for data
mining and high-dimensional classification problems. LR is well-understood and widely used in the
statistics, machine learning, and data analysis communities. Its benefits include a firm
statistical foundation and a probabilistic model useful for ``explaining'' the data. There is a
perception that LR is slow, unstable, and unsuitable for large learning or classification tasks.
Through fast approximate numerical methods, regularization to avoid numerical instability, and an
efficient implementation we will show that LR can outperform modern algorithms like Support Vector
Machines (SVM) on a variety of learning tasks. Our novel implementation, which uses a modified
iteratively re-weighted least squares estimation procedure, can compute model parameters for
sparse binary datasets with hundreds of thousands of rows and attributes, and millions or tens of
millions of nonzero elements in just a few seconds. Our implementation also handles real-valued
dense datasets of similar size.
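For reference, the basic iteratively re-weighted least squares (IRLS) procedure that the abstract builds on can be sketched as below. This is a textbook Newton/IRLS loop with a ridge penalty, written in plain Python for tiny dense problems. It is not the thesis's modified implementation and uses none of its fast approximate numerical methods. The names solve and fit_irls, the ridge value, and the toy data are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_irls(xs: Matrix, ys: List[int], ridge: float = 1.0, iters: int = 25) -> Vector:
    """Ridge-regularized logistic regression fitted by Newton/IRLS steps."""
    d = len(xs[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient of the penalized log-likelihood: X^T (y - p) - ridge * w
        grad = [-ridge * wj for wj in w]
        # Negative Hessian: X^T W X + ridge * I, with W = diag(p * (1 - p))
        hess = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            weight = p * (1.0 - p)
            for i in range(d):
                grad[i] += (y - p) * x[i]
                for j in range(d):
                    hess[i][j] += weight * x[i] * x[j]
        step = solve(hess, grad)
        w = [wj + sj for wj, sj in zip(w, step)]
    return w


# Toy data: first column is a constant bias feature, labels are 0/1.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [0, 0, 1, 1]
w = fit_irls(xs, ys)
```

The ridge term keeps the linear system well conditioned, which is one simple form of the "regularization to avoid numerical instability" the abstract mentions. The dense solve above costs cubic time in the number of attributes, so it would not scale to the dataset sizes described.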
Anna Goldenberg, et al.
A Comparison of
Statistical and Machine Learning Algorithms on the Task of Link Completion
http://www.autonlab.org/autonweb/showPaper.jsp?ID=linkcomplete2003
This paper examines the task of link completion, relative algorithm performance, and what this
can tell us about the structure of the data.
Paul Komarek and Andrew Moore
Fast Robust Logistic
Regression for Large Sparse Datasets with Binary Outputs
http://www.autonlab.org/autonweb/showPaper.jsp?ID=komarek-fast
Logistic regression can provide faster, better results than SVM for life-sciences datasets with
hundreds of thousands of attributes. This paper empirically examines that claim, and surveys,
implements and compares techniques by which logistic regression can be
scaled to data with millions of attributes and records. Our results, on a large life sciences
dataset, indicate that logistic regression can perform surprisingly well, both statistically and
computationally, when compared with an array of more recent classification algorithms.
Tong Zhang and F. Oles
Text Categorization Based on Regularized Linear Classification Methods
http://www.research.ibm.com/people/t/tzhang/pubs.html
A number of linear classification methods such as the linear least squares fit (LLSF), logistic
regression, and support vector machines (SVM's) have been applied to text categorization problems.
These methods share the similarity by finding hyperplanes that approximately separate a class of
document vectors from its complement. However, support vector machines are so far considered
special in that they have been demonstrated to achieve the state of the art performance. It is
therefore worthwhile to understand whether such good performance is unique to the SVM design, or
if it can also be achieved by other linear classification methods. In this paper, we compare a
number of known linear classification methods as well as some variants in the framework of
regularized linear systems. We will discuss the statistical and numerical properties of these
algorithms, with a focus on text categorization. We will also provide some numerical experiments
to illustrate these algorithms on a number of datasets.
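One way to read the "framework of regularized linear systems" in the abstract above is that least squares fit, logistic regression, and SVMs all minimize an average per-example loss on the margin plus a penalty on the weights, and differ only in the loss. The sketch below makes that concrete with plain gradient descent. It is an illustration under assumptions, not the paper's algorithms: the names LOSS_GRADS and train, the L2 penalty, the step size, and the toy data are all hypothetical choices.

```python
import math
from typing import Callable, Dict, List

# Derivative of each loss with respect to the margin m = y * (w . x), y in {-1, +1}.
LOSS_GRADS: Dict[str, Callable[[float], float]] = {
    # (1 - m)^2: least squares fit to the +/-1 labels
    "least_squares": lambda m: -2.0 * (1.0 - m),
    # log(1 + exp(-m)): logistic regression (exponent capped to avoid overflow)
    "logistic": lambda m: -1.0 / (1.0 + math.exp(min(m, 700.0))),
    # max(0, 1 - m): SVM hinge loss (subgradient)
    "hinge": lambda m: -1.0 if m < 1.0 else 0.0,
}


def train(
    xs: List[List[float]],
    ys: List[int],
    loss: str,
    lam: float = 0.1,   # weight of the (lam / 2) * ||w||^2 penalty
    lr: float = 0.1,    # gradient descent step size
    epochs: int = 500,
) -> List[float]:
    """Minimize mean loss(y * w.x) + (lam / 2) * ||w||^2 by gradient descent."""
    dloss = LOSS_GRADS[loss]
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]
        for x, y in zip(xs, ys):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            g = dloss(margin) * y / n
            for j in range(d):
                grad[j] += g * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data: constant bias feature plus one real feature, labels in {-1, +1}.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [-1, -1, 1, 1]
weights = {name: train(xs, ys, name) for name in LOSS_GRADS}
```

All three variants learn a separating hyperplane on this toy set; the interesting differences the paper studies show up only on real text collections.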
Tong Zhang, Vijay S.
Iyengar: Recommender Systems Using Linear Classifiers. Journal of
Machine Learning Research 2: 313-334 (2002)
http://www.research.ibm.com/people/t/tzhang/pubs.html
Recommender systems use historical data on user preferences and other available data on users (for
example, demographics) and items (for example, taxonomy) to predict items a new user might like.
Applications of these methods include recommending items for purchase and personalizing the
browsing experience on a web-site. Collaborative filtering methods have focused on using just the
history of user preferences to make the recommendations. These methods have been categorized as
memory-based
Structural Logistic
Regression for Link Analysis,
Alexandrin Popescul, Lyle H. Ungar,
Workshop on Multi-Relational Data Mining at KDD 2003.
http://www.cis.upenn.edu/~popescul/Publications/popescul03mrdm.pdf
Towards
Structural Logistic Regression: Combining Relational and Statistical Learning, Alexandrin
Popescul, Lyle H. Ungar, Steve Lawrence, David M. Pennock, Workshop on Multi-Relational Data
Mining at KDD 2002.
http://www.cis.upenn.edu/~popescul/Publications/popescu02structural.ps
kNN
Ting Liu, et al.,
Efficient Exact k-NN and
Nonparametric Classification in High Dimensions
http://www.autonlab.org/autonweb/showPaper.jsp?ID=Liu-knn
This paper is about non-approximate acceleration of high dimensional nonparametric operations such
as k nearest neighbor classifiers and the prediction phase of Support Vector Machine
classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric
queries, we usually do not need to explicitly find the datapoints close to the query, but merely
need to ask questions about the properties of that set of datapoints. This offers a small
amount of computational leeway, and we investigate how much that leeway can be exploited. For
clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs.
We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up
to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results
include datasets with up to 10^6 dimensions and 10^5 records, and show non-trivial speedups
while giving exact answers.
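As a point of comparison for the abstract above, exact k-NN classification by brute force looks like the sketch below: every query scans all training points. This linear scan is the cost that tree-based methods try to avoid; it is not the ball tree algorithm from the paper. The function name knn_classify, the squared Euclidean distance, and the toy points are assumptions for illustration.

```python
import heapq
from collections import Counter
from typing import List, Sequence, Tuple

Point = Sequence[float]


def knn_classify(train: Sequence[Tuple[Point, str]], query: Point, k: int = 3) -> str:
    """Exact k-NN by linear scan: find the k closest points, take a majority vote."""

    def sq_dist(a: Point, b: Point) -> float:
        # Squared Euclidean distance; gives the same neighbor ranking as the true distance.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = heapq.nsmallest(k, train, key=lambda item: sq_dist(item[0], query))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set with two well-separated classes.
points: List[Tuple[Point, str]] = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]
label = knn_classify(points, (0.5, 0.5))
```

Note that the vote only needs the class counts among the k nearest points, not the identity of each neighbor, which is the kind of leeway the abstract describes exploiting.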
Naïve Bayes Improvement:
Tackling the Poor Assumptions of
Naive Bayes Text Classifiers
Jason D. M. Rennie,
Lawrence Shih, Jaime Teevan and David R. Karger.
Proceedings of the Twentieth International
Conference on Machine Learning. 2003.
http://www.ai.mit.edu/~jrennie/papers/icml03-nb.pdf
Quick and robust, these changes may improve accuracy to SVM levels.
SVM Improvements:
Not Too Hot, Not Too Cold:
The Bundled-SVM is Just Right!
Lawrence Shih,
Yu-Han Chang, Jason D. M. Rennie and David Karger.
Proceedings of the ICML-2002 Workshop on Text Learning.
2002.
http://www.ai.mit.edu/~jrennie/papers/icml02-bundled.ps.gz
I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for
Interdependent and Structured Output Spaces, Proceedings of the International Conference on
Machine Learning (ICML), 2004.
http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf
SIMPL:
Fast and accurate text classification
via multiple linear discriminant projections.
Soumen Chakrabarti, Shourya Roy and Mahesh Soundalgekar. VLDB Journal, 12(2), pages 170--185.
http://www.cs.berkeley.edu/~soumen/doc/vldbj2003/simpl2003.pdf
Text Classification Tools - Miscellaneous:
Orange:
component-based data mining software in C++; includes SVM, logistic regression, clustering, and
lots more.
http://magix.fri.uni-lj.si/orange/
OpenNLP Tools Common API v1.0.0
http://opennlp.sourceforge.net/api/index.html
MLC++ :
a library of C++ classes for supervised machine learning.
http://www.sgi.com/tech/mlc/index.html
MALLET:
A Machine Learning for Language Toolkit
http://mallet.cs.umass.edu/
new release as of 4/04