CoDiAJe – The annotated diachronic corpus of Judeo-Spanish. Description of a multi-alphabetic corpus and its textual and linguistic annotations
Abstract
Judeo-Spanish differs from both late 15th-century Spanish and modern Spanish in several respects, including morphology, syntax, and semantics, but the most visible difference is its alphabet. Since the end of the 19th century, Judeo-Spanish has been written in various alphabets (Greek, Cyrillic, and especially Latin). The Hebrew alphabet, however, had been in use since ancient times and was finally abandoned only in the 1940s. As a result, the majority of Judeo-Spanish texts are written in Hebrew characters.
CoDiAJe is an annotated diachronic corpus, developed in TEITOK, that includes documents produced from the 16th century up to the present day. TEITOK is significant here because it processes linguistic data in all of the alphabets mentioned above and allows users to view each text in five orthographic forms: the original version in which it was written, its transcription in Latin characters, an expanded form that completes abbreviations or corrects defective writing, a version in modern Judeo-Spanish, and a version in modern Spanish orthography. CoDiAJe lets users search not only for a specific word but also for all of its linguistic and orthographic variants in the different alphabets. During the annotation process, some tags from the EAGLES tagset for Spanish were modified and others were created; these are only first steps towards an accurate tagset for Judeo-Spanish. The digitized texts are also enriched with semantic-conceptual information and with information on the affiliation of all non-Romance elements.
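The idea of a token that carries several orthographic forms, and of a search that matches any of them, can be illustrated with a minimal Python sketch. It does not reproduce TEITOK's data model or query interface: the `Token` class, its field names, the `search` function, and the two sample tokens are all hypothetical and serve only to show the principle.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Token:
    """Hypothetical record holding the five orthographic forms of one word."""
    original: str       # form as written in the source alphabet
    transcription: str  # transcription in Latin characters
    expanded: str       # abbreviations completed, defective writing corrected
    modern_js: str      # modern Judeo-Spanish form
    modern_es: str      # modern Spanish orthography

    def forms(self):
        """Return the set of all orthographic forms of this token."""
        return {getattr(self, f.name) for f in fields(self)}


def search(tokens, query):
    """Return the positions of tokens for which any form equals the query."""
    q = query.casefold()
    return [
        i for i, tok in enumerate(tokens)
        if q in {form.casefold() for form in tok.forms()}
    ]


# Illustrative sample only, not taken from the corpus.
tokens = [
    Token("לה", "la", "la", "la", "la"),
    Token("קאזה", "kaza", "kaza", "kaza", "casa"),
]

# The same token is found from its Spanish, Latin-script, or Hebrew-script form.
hits_spanish = search(tokens, "casa")
hits_latin = search(tokens, "kaza")
hits_hebrew = search(tokens, "קאזה")
```

Under these assumptions all three queries return the same position, which is the behaviour the abstract describes for variant-aware searching; a real corpus would also need lemma-level matching to cover linguistic variants that share no surface form.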
Keywords
Judeo-Spanish, Multi-alphabetic corpus, Corpus annotation, Linguistic variation, Diachrony
Copyright (c) 2020 Aldina Quintana
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.