The Old Spanish Textual Archive, Design and Development of a Corpus of Medieval Texts: Lemmatization and PoS Tagging
Abstract
This paper describes how forms, lemmas, grammatical analysis and texts are processed in the Old Spanish Textual Archive (OSTA), a linguistic corpus of more than 32 million words. The corpus is built from more than 400 semi-paleographic transcriptions of medieval texts written in Castilian, Asturian, Leonese, Navarro-Aragonese and Aragonese, prepared by collaborators of the Hispanic Seminary of Medieval Studies (HSMS). The paper also describes the tagging and lemmatization process, which uses FreeLing, a natural language processing tool, and HSMS-app, a textual analysis tool developed for this project.
Keywords
Electronic corpus design, corpus annotation, digital Medieval Spanish corpus, Medieval Spanish
Copyright (c) 2018 Francisco Gago Jover, Francisco Javier Pueyo Mena
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.