The Old Spanish Textual Archive, Design and Development of a Corpus of Medieval Texts: Lemmatization and PoS Tagging

Authors

  • Francisco Gago Jover, College of the Holy Cross (USA)
  • Francisco Javier Pueyo Mena, College of the Holy Cross (USA)

Abstract

This paper presents aspects of the processing of forms, lemmas, grammatical analysis and texts in the Old Spanish Textual Archive (OSTA), a linguistic corpus of more than 32 million words. The corpus is based on more than 400 semi-paleographic transcriptions of medieval texts written in Castilian, Asturian, Leonese, Navarro-Aragonese and Aragonese, prepared by the collaborators of the Hispanic Seminary of Medieval Studies (HSMS). The paper also describes the process of tagging and lemmatization using FreeLing, a Natural Language Processing tool, and HSMS-app, a textual analysis tool developed for this project.

Keywords

Electronic corpus design, corpus annotation, digital Medieval Spanish corpus, Medieval Spanish

Published

15-10-2018