LGTagger is an open-source part-of-speech tagger that also recognizes multiword units (MWUs).
It is based on Conditional Random Fields (CRF) and large-coverage lexical resources. The lexical resources can consist of morphosyntactic dictionaries (covering both simple and compound words) and strongly lexicalized local grammars. It currently works for French.
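To make the idea of tagging and multiword-unit recognition in a single pass more concrete, here is a minimal sketch of a begin/inside label encoding that lets a sequence labeller such as a CRF do both at once. It is an illustration only: the function names, the `+B`/`+I` label format and the tag names (`CLS`, `V`, `DET`, `NC`) are assumptions made for this example, not LGTagger's actual interface or tagset.

```python
def encode_mwu_labels(units):
    """Flatten lexical units into per-token labels.

    `units` is a list of (tokens, pos) pairs, where `tokens` is a list of
    strings. The first token of a unit is labelled "<pos>+B" and every
    following token of the same unit is labelled "<pos>+I".
    (Hypothetical label format, chosen for illustration.)
    """
    tokens, labels = [], []
    for unit_tokens, pos in units:
        for i, tok in enumerate(unit_tokens):
            tokens.append(tok)
            labels.append(f"{pos}+{'B' if i == 0 else 'I'}")
    return tokens, labels


def decode_mwu_labels(tokens, labels):
    """Rebuild (tokens, pos) units from per-token labels."""
    units = []
    for tok, label in zip(tokens, labels):
        pos, _, position = label.rpartition("+")
        # An "I" label continues the previous unit if the category matches.
        if position == "I" and units and units[-1][1] == pos:
            units[-1][0].append(tok)
        else:
            units.append(([tok], pos))
    return units


# "pomme de terre" (potato) is treated as one compound noun.
units = [
    (["Il"], "CLS"),
    (["mange"], "V"),
    (["une"], "DET"),
    (["pomme", "de", "terre"], "NC"),
]
tokens, labels = encode_mwu_labels(units)
print(list(zip(tokens, labels)))
```

With such an encoding, a tagger that predicts one label per token implicitly predicts the unit boundaries as well, and lexicon matches (for instance, a compound found in a dictionary) can be supplied to the model as features.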
Reference: Matthieu Constant and Anthony Sigogne. MWU-aware Part-of-Speech Tagging with a CRF model and lexical resources. ACL Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE'11). 2011.
Download LGTagger-1.1 (release 13/02/2013): lighter release with better French models.
Download LGTagger-1.0 (release 15/12/2011): added lemmatisation and fixed a bug in sentence segmentation.
Download LGTagger-0.1 (release 26/10/2011)
Download LGTagger-0.1-beta (historical beta version)
LGExtract is an open-source tool for converting Lexicon-Grammar tables (M. Gross 1975) into an XML-structured syntactic lexicon. It is implemented in Java. It has been used to generate the LGLex syntactic lexicon of simple verbs, verbal nouns, fixed expressions and adverbials in French.
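As a rough illustration of the kind of conversion described above, the sketch below turns a toy table, with one row per lexical entry and one column per property marked `+` or `-`, into a small XML tree. Everything in it is assumed for the example: the semicolon-separated input, the element and attribute names (`lexicon`, `entry`, `property`), the placeholder entries and the function name `table_to_xml`. It is written in Python for brevity and does not reproduce LGExtract's Java implementation or the LGLex format.

```python
import csv
import io
import xml.etree.ElementTree as ET


def table_to_xml(table_text, table_id):
    """Convert a semicolon-separated +/- table into an XML element tree.

    The first row holds the column names; the first column holds the entry.
    (Hypothetical input and output layout, for illustration only.)
    """
    reader = csv.reader(io.StringIO(table_text), delimiter=";")
    header = next(reader)
    root = ET.Element("lexicon", {"table": table_id})
    for row in reader:
        entry = ET.SubElement(root, "entry", {"lemma": row[0]})
        for name, value in zip(header[1:], row[1:]):
            ET.SubElement(
                entry,
                "property",
                {"name": name, "value": "true" if value == "+" else "false"},
            )
    return root


# Invented toy data: placeholder entries and values, not taken from a real table.
SAMPLE = "entry;N0 =: Nhum;N0 V\nverb_a;+;+\nverb_b;+;-\n"

root = table_to_xml(SAMPLE, "demo")
print(ET.tostring(root, encoding="unicode"))
```

A real converter would also need to handle properties that are not stated in a table's columns but hold for the table as a whole, which this sketch leaves out.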
Reference: Matthieu Constant and Elsa Tolone. A generic tool to generate a lexicon for NLP from Lexicon-Grammar tables. In Michele De Gioia, editor, Actes du 27e Colloque international sur le lexique et la grammaire (L'Aquila, 10-13 septembre 2008). Seconde partie, volume 1 of Lingue d'Europa e del Mediterraneo, Grammatica comparata, pages 79-93. Aracne, April 2010. ISBN 978-88-548-3166-7.
Distagger is an open-source tool that automatically detects disfluencies in spoken-language transcripts.
Reference: Matthieu Constant, Anne Dister. 2010. Automatic detection of disfluencies in speech transcriptions. In: Massimo Pettorino, Antonella Giannini, Isabella Chiari, Francesca M. Dovetto (Eds.). Spoken Communication. Cambridge Scholars Publishing. pp. 259-272.
Download Distagger-0.2 jar file