Daniel Marcu
Department of Computer Science
University of Toronto
Toronto, Ontario
Canada M5S 3G4

marcu@cs.toronto.edu

Strand: computational

Title: The automatic derivation of complex conjunctive structures

I present an algorithm that takes unrestricted natural language text as input and outputs the set of valid complex discourse (conjunctive) structures that can be associated with that text. The algorithm is grounded in two components: (a) a first-order formalization of complex discourse structures that relies both on the concept of nuclearity, put forth by Rhetorical Structure theorists, and on the systemic view of language, which emphasizes the relationship between discourse structures and lexicogrammatical resources; and (b) an extensive corpus analysis of more than 450 discourse markers and 7900 text fragments that were randomly extracted from the Brown corpus. The analysis provides data on the types of relations and the sizes of the textual spans that each discourse marker relates.

In step 1, the algorithm performs a shallow analysis of the input text: using a collection of heuristics derived from the corpus analysis, it breaks the text into textual units and hypothesizes conjunctive relations between those units. In step 2, the algorithm finds all possible discourse structures that can be associated with the set of textual units determined in step 1. The algorithm ensures that each of these discourse structures satisfies not only the overall discourse structure constraints mentioned in (a), but also a subset of the hypothesized relations that encompasses all textual units in the text.
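The two steps can be illustrated with a minimal sketch. It is not the algorithm presented in the paper: the three-entry marker table, the `segment` and `structures` functions, and the licensing test are all hypothetical simplifications, and nuclearity is left out entirely. Step 1 splits the text at punctuation and at discourse markers and hypothesizes one relation per marker; step 2 enumerates every binary tree over the units in which each internal node is licensed by a hypothesized relation crossing its two children.

```python
import re
from typing import Iterator, List, NamedTuple, Tuple, Union

# Hypothetical marker table (illustration only): marker -> relation it may signal.
MARKERS = {"although": "concession", "because": "cause", "but": "contrast"}
MARKER_RE = re.compile(r"\b(" + "|".join(MARKERS) + r")\b", re.IGNORECASE)


class Hypothesis(NamedTuple):
    relation: str
    left: int   # index of the earlier unit
    right: int  # index of the later unit


# A tree is a unit index (leaf) or a pair of subtrees.
Tree = Union[int, Tuple["Tree", "Tree"]]


def segment(text: str) -> Tuple[List[str], List[Hypothesis]]:
    """Step 1 (sketch): break text into units and hypothesize relations."""
    units: List[str] = []
    marked = {}  # index of the unit a marker opens -> marker
    for clause in re.split(r"[,.;]", text):
        clause = clause.strip()
        if not clause:
            continue
        start = 0
        for m in MARKER_RE.finditer(clause):
            head = clause[start:m.start()].strip()
            if head:  # material before the marker is a unit of its own
                units.append(head)
            start = m.start()
            marked[len(units)] = m.group(1).lower()
        units.append(clause[start:].strip())
    hyps: List[Hypothesis] = []
    for i, marker in marked.items():
        # Assumed heuristic: relate the marked unit to the preceding unit,
        # or to the following one when the marker opens the text.
        other = i - 1 if i > 0 else i + 1
        if other < len(units):
            hyps.append(Hypothesis(MARKERS[marker], min(i, other), max(i, other)))
    return units, hyps


def structures(i: int, j: int, hyps: List[Hypothesis]) -> Iterator[Tree]:
    """Step 2 (sketch): all binary trees over units i..j in which every
    internal node is licensed by a hypothesis spanning its two children."""
    if i == j:
        yield i
        return
    for k in range(i, j):  # left child covers i..k, right child covers k+1..j
        if not any(i <= h.left <= k < h.right <= j for h in hyps):
            continue
        for left in structures(i, k, hyps):
            for right in structures(k + 1, j, hyps):
                yield (left, right)


units, hyps = segment("Although it rained, we went out because we were bored.")
trees = list(structures(0, len(units) - 1, hyps))
```

On this example sentence the sketch finds three units, two hypothesized relations, and two candidate trees, which shows in miniature why a single text can receive more than one valid structure.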

As expected, the algorithm generates more than one valid discourse structure for a given text. A metric defined over the set of possible structures induces a total ordering on the solutions, which permits the selection of the "best" ones.
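The abstract does not say which metric is used, so the following sketch only shows how a numeric score can order candidate structures. The `skew` score, which favors right-branching trees, is an assumption made for illustration, as are the function names; ties are broken by a deterministic secondary key so that the resulting order is total.

```python
from typing import List, Tuple, Union

# A tree is a unit index (leaf) or a pair of subtrees.
Tree = Union[int, Tuple["Tree", "Tree"]]


def leaves(tree: Tree) -> int:
    """Number of textual units dominated by a tree."""
    if isinstance(tree, int):
        return 1
    return leaves(tree[0]) + leaves(tree[1])


def skew(tree: Tree) -> int:
    """Hypothetical metric: sum, over internal nodes, of the number of
    leaves under the left child. Lower means more right-branching."""
    if isinstance(tree, int):
        return 0
    left, right = tree
    return leaves(left) + skew(left) + skew(right)


def rank(trees: List[Tree]) -> List[Tree]:
    """Order candidates best-first; repr() breaks ties deterministically."""
    return sorted(trees, key=lambda t: (skew(t), repr(t)))


candidates = [((0, 1), 2), (0, (1, 2))]
best_first = rank(candidates)
```

Any other scoring function could be substituted for `skew` without changing `rank`; only the tie-breaking key is needed to turn a score into a total ordering.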

The paper emphasizes general lessons that were learned from the corpus analysis about the structure of English discourse, and presents numerous examples of the operation of the algorithm.