December 9, 2014, 9:30-10:30

John Nerbonne
University of Groningen & University of Freiburg

''Computational Linguistics and Digital Humanities''

Interest in the digital humanities (DH) is rising very quickly. The two most recent digital humanities conferences in Europe, DH 2012 in Hamburg and DH 2014 in Lausanne, attracted 500 and 740 participants, respectively. The greatest interest is found in text-oriented work, especially in literature and history. Some areas of linguistics that had earlier attracted little interest from computational linguistics (CL), such as historical linguistics, dialectology and psycholinguistics, are likewise witnessing an upsurge of interest in using computational techniques.

CL offers indispensable tools for the preparation of large text collections (corpora), including tools for word tokenization, for lemmatization, for named-entity recognition, for recognizing paragraph and sentence boundaries, for recognizing spelling alternatives and for normalizing spelling, for linking texts to external knowledge sources such as lexica and gazetteers, for sentiment analysis and for topic modeling. All these CL methods are now in use in DH, and their intelligent use naturally benefits from the involvement of experienced developers.
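
A minimal sketch of such a corpus-preparation pipeline (an illustration added here, not part of the abstract) might look as follows, assuming spaCy and its small English model en_core_web_sm are installed:

```python
# Illustrative sketch only: tokenization, lemmatization, sentence splitting and
# named-entity recognition with spaCy (assumes: pip install spacy and
# python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

text = ("Dr. Johnson travelled from London to Edinburgh in 1773. "
        "His journal was later edited and published.")
doc = nlp(text)

# Sentence boundaries
for sent in doc.sents:
    print("SENT:", sent.text)

# Word tokenization and lemmatization
for token in doc:
    print(token.text, "->", token.lemma_)

# Named-entity recognition (persons, places, dates), whose output could then be
# linked to external resources such as lexica and gazetteers.
for ent in doc.ents:
    print("ENT:", ent.text, ent.label_)
```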

DH also offers CLers a wealth of opportunities to become involved in areas of science and scholarship outside CL's usual "home fields" of linguistics and computer science. Exciting horizons are becoming visible!

December 10, 2014, 17:30-18:30

Eduard Hovy
Carnegie Mellon University

''NLP as a Core Driver for AI: Its Past, Present, and Possible Future''

Artificial Intelligence (AI) started in the late 1950s, with one of its main challenges being Machine Translation (MT). Despite progress over the decades, no one would say today that MT has been solved. Instead, it has been joined by other challenges involving human language, resulting today in a mixture of (sub)fields, including Natural Language Processing, Information Retrieval, and Speech Processing. Looking at their histories, two conclusions can be drawn: (1) much of language processing consists of representation transformations, for example between natural languages, syntactic representations, semantic representations, or others, and (2) every subarea has undergone a shift of methodology, when the inadequacy of linguistics-oriented hand-crafted rules and the associated rule bases and knowledge structures (such as ontologies) forces the realization that learning representation transformations automatically is more effective. In the new paradigm, the focus shifts to engineering techniques that produce larger, multiple-solution systems that are rarely exactly correct but that do not fail catastrophically. Eventually, the limit of representation expressiveness and transformation power is reached, and the subfield 'matures': progress slows down, commercialization occurs, and research funding disappears. This has already happened to speech recognition (Apple's Siri) and information retrieval (Google and Yahoo!), and the process is nearly complete for MT (Google Translate and others), Information Extraction (several small companies), Text Summarization and QA (some small companies), and NL access to databases (parts of commercial packages).

One can (re)interpret the state of most subareas of AI through the same historical lens. Many of them have experienced a parallel evolution and paradigm shift. The more 'mature' branches of AI, including Robotics and Scheduling, all rely on evaluation-driven engineering and offer commercially available solutions that work acceptably but not perfectly. The less 'mature' branches, such as Knowledge Representation, are almost all still working in the pre-automation/learning paradigm and require long apprenticeship training of students and postdocs in the 'art' of the area. For them, evaluations are scarce or contrived, and engineering is much less developed.

If Newell and Simon were correct, and success in AI is indeed mostly a problem of choosing the most appropriate representation, then AI researchers should become skilled in: the styles and powers of different representations (from connectionist to distributional/deep to symbolic); methods of performing transformations on them (from manual rules using finite state automata to various forms of machine learning to neural networks); the kinds of resources required for each style (from basic collections of elements such as lexicons, to corpora used for training, with the attendant challenge of annotation, to the kinds of information best suited for unsupervised learning); and the techniques and nuances of evaluation, including sophisticated statistical reasoning and experimental design. Few students today are trained this way, which is a problem for both AI and language processing.