CLASK: Combining Linguistic and Statistical Knowledge
(1993-96)
The aim of the CLASK project was to outline directions for
research into robustness techniques, covering both the ambiguity
problem and the problem of ill-formed or unexpected input. At the
heart of the enterprise is the belief that, given the current
state of linguistic knowledge, solutions have to be based in part
on what we understand (i.e. the linguistic knowledge which can be
expressed in discrete rules), and in part on what we do not quite
understand, but are capable of measuring precisely enough to
allow for extrapolation (i.e. patterns of linguistic behavior
which can be described statistically). Rather than setting
rule-based and statistical approaches against each other and
arguing over which is to be preferred, the project aims to combine
them so that their strengths are exploited to the full and the
negative effects of their shortcomings are kept to a minimum. The
framework adopted by the project is the Data-Oriented Parsing (DOP)
framework, which has the clear advantage that linguistic
knowledge is not sacrificed to existing probabilistic methods,
and which integrates statistical knowledge in a way compatible
with the description of complex linguistic phenomena and the
building of rich interpretations in conformance with mainstream
linguistic (or semantic) representation theories. The intended
output of the project is a collection of tools, methods and
techniques that should help to reduce the robustness problem and
that are general enough to be applicable in different contexts
and environments.
During the first phase of CLASK, the activities were directed
toward designing and implementing an efficient deterministic
integrated parser and disambiguator for DOP grammars. A pilot
implementation has been operational since February 1995 and
performs very efficiently, in both time and space, on realistic
DOP grammars when compared with the previous non-deterministic
unit implemented at Alfa-Informatica in Amsterdam. The design of
this unit is based on the observation
that DOP models are stochastically enriched constrained
Context-Free Grammars (CFGs), which implies that the
parser+disambiguator unit is an extension of a CKY-based CFG
parser. The second phase of CLASK is directed both toward
improving the performance of this unit and toward the problem of
dealing with ill-formed text (error correction). The
error-correction task can also be seen as a case study of the
effectiveness of integrating statistical and syntactic knowledge.
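To make the idea of a CKY-based parser that also disambiguates more concrete, here is a minimal sketch of Viterbi CKY for a probabilistic CFG in Chomsky normal form. It is not the CLASK implementation. A DOP model attaches probabilities to tree fragments rather than to single rules, and this sketch deliberately substitutes the simpler rule-based case. The function name `viterbi_cky`, the rule dictionaries and the toy grammar are all assumptions made for illustration.

```python
from collections import defaultdict


def viterbi_cky(words, lexical, binary, start="S"):
    """Return (probability, tree) for the most probable parse, or None.

    lexical maps (A, word) to P(A -> word); binary maps (A, B, C) to
    P(A -> B C). Both are hypothetical inputs for this sketch.
    """
    n = len(words)
    # chart[(i, j)] maps a nonterminal to (best probability, backpointer).
    chart = defaultdict(dict)

    # Fill the diagonal with the lexical rules that match each word.
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w and p > chart[(i, i + 1)].get(A, (0.0, None))[0]:
                chart[(i, i + 1)][A] = (p, w)

    # Combine adjacent spans bottom-up, keeping only the best analysis
    # per nonterminal and span. This is where disambiguation happens.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        q = p * chart[(i, k)][B][0] * chart[(k, j)][C][0]
                        if q > chart[(i, j)].get(A, (0.0, None))[0]:
                            chart[(i, j)][A] = (q, (k, B, C))

    if start not in chart[(0, n)]:
        return None

    def build(A, i, j):
        # Follow backpointers to recover the tree as nested tuples.
        _, back = chart[(i, j)][A]
        if isinstance(back, str):
            return (A, back)
        k, B, C = back
        return (A, build(B, i, k), build(C, k, j))

    return chart[(0, n)][start][0], build(start, 0, n)


# Toy grammar, invented for this example only.
LEXICAL = {("NP", "she"): 0.5, ("NP", "fish"): 0.5, ("V", "eats"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}

result = viterbi_cky("she eats fish".split(), LEXICAL, BINARY)
```

Because parsing and the choice of the best analysis share one chart, the disambiguator is an extension of the parser rather than a separate pass, which is the design point the paragraph above makes for the DOP unit.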
Some papers and reports
- Efficient Disambiguation by means of Stochastic Tree Substitution Grammars.
  Sima'an, K., Bod, R., Krauwer, S., and Scha, R. (1994).
  In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP'94),
  Manchester, pages 50-58. Centre for Computational Linguistics, UMIST.