CoDiAJe – THE ANNOTATED DIACHRONIC CORPUS OF JUDEO-SPANISH. DESCRIPTION OF A MULTI-ALPHABETIC CORPUS AND ITS TEXTUAL AND LINGUISTIC ANNOTATIONS



INTRODUCTION
Judeo-Spanish is an autonomous linguistic diasystem made up of a continuum of dialects that have developed without contact with Peninsular and American Spanish. The exception is the North African or Hakitia variety, which never ceased to maintain contact with Peninsular Spanish. CoDiAJe, the Corpus diacrónico anotado del judeoespañol (The Annotated Diachronic Corpus of Judeo-Spanish), is a project¹ whose purpose is to build a resource that allows researchers to study the evolution of Judeo-Spanish in depth. In addition to providing paleographic information about the texts, this resource enriches them with linguistic information (POS tagging and lemmatization) and is designed to be easy to use for non-experts in NLP. It is easily maintainable and can be continuously improved by non-NLP experts, following OLDES (see Janssen et al. 2017), the first model from which its development started. Nevertheless, it should also offer satisfactory solutions to the specific problems that Judeo-Spanish texts pose.
Developed from the popular Spanish spoken in the late 15th century (cf. Arnold in this volume), Judeo-Spanish is the only historical variety of Spanish that does not conform to its unity. This is not only due to the remarkable differences that Judeo-Spanish exhibits in many respects, such as its phonology, morphology, syntax, and semantics (cf. Penny 2000: 176-192; Cárdenas 2004; Lleal 2004; Bradley and Delforge 2006; Minervini 2006; Varvaro and Minervini 2008; García Moreno 2010; Hualde and Şaul 2011; Hualde 2013), in comparison with Spanish, but primarily because of the different systems used in the graphic representation of the language, and the extent of graphemic variation found in its documents (cf. Quintana 2010; Bunis 2019). Therefore, the problems taken into account before starting the building of CoDiAJe can be summarized in the following points:
1. Most of the Sephardic textual heritage, preserved in both printed and manuscript documents, is written in the Hebrew alphabet. It was only in the late 19th century that Sephardim began to make progressive use of the Latin alphabet in various versions (French, Italian, Serbo-Croatian, Romanian, Turkish and, to a lesser extent, Spanish), together with the Cyrillic and Greek alphabets, depending on their dominant language, and to adapt them to the phonemic characteristics of Judeo-Spanish (cf. Bunis 2019). However, these alphabets did not fully replace the Hebrew alphabet until after World War II, and only in the last few decades has a relatively unified system of writing in Latin characters been imposed. This means that perhaps 90% of Judeo-Spanish texts were written using the Hebrew alphabet. Another consequence is that the Judeo-Spanish texts accessible to scholars who have not specialized in the study of this variety are very few compared to those which make up its textual legacy. Nor are they accessible to speakers, both because speakers are literate in other languages and consequently cannot read these scripts, and because Judeo-Spanish works have not been republished for one, two, or more centuries. As for the manuscripts, few scholars have acquired the ability to read and understand them.
¹ CoDiAJe is being developed within the framework of two research projects funded by grant 473/11 (completed) and grant 486/19 of the Israel Science Foundation (ISF). This paper was written in the framework of the latter project at the Hebrew University. Other members of the project are Josep M. Fontana (Universitat Pompeu Fabra, Barcelona) and Maarten Janssen (Charles University, Prague).
2. In addition to the graphemic variation that the texts display in each alphabet, one must bear in mind the variation of the language in all its dimensions (diachronic, diatopic, diastratic and diaphasic) as a consequence of the situation of low normative pressure (cf. Quintana 2006), which allowed for a flexible internal development of Judeo-Spanish in keeping with universal tendencies of natural human languages (Trudgill 2011). Particularly in the texts written in the 18th and the early 19th centuries, the language also shows a significant degree of medium-transferability (Lyons 1981: 12), meaning that a high percentage of units of the abstract language system became medium-independent, giving rise to a considerable degree of variation. To illustrate, Figure 1 shows the linguistic and orthographic variants of the lemma adientro ʻinsideʼ in CoDiAJe, which in Judeo-Spanish expresses both situation and movement, unlike in Standard Spanish, and the frequency of each of them².
Figure 1. Linguistic and orthographic variants of the lemma adientro ʻinsideʼ with their frequency in CoDiAJe.
3. Although the principle of language representation has always been phonemic, texts written or printed in the Hebrew alphabet have the peculiarity of representing the vowels with matres lectionis, i.e. consonants that are used to indicate a vowel. In Hebrew and Judeo-Spanish they are א (aleph), ה (he), ו (waw) and י (yod). Without taking into account the representation of diphthongs, which is even more complex, two consonant graphemes <א, -ה> represent the vowel /a/ and two others <ו, י> are used to write the vowels /o, u/ and /e, i/ respectively, except in Hebrew words and expressions, where the etymological form based only on consonants was maintained and generally follows the Sephardic tradition of using haser or defective spelling (i.e. only consonants without any indication of vowels) in the Hebrew script. This is because the Hebrew language exhibits a pattern of stems consisting of three-consonant roots. Moreover, the affricate consonant phonemes /ʦ/, /ʣ/, /ʧ/, /ʤ/ and the palatal /ʒ/, which do not exist in Hebrew, are also represented by Hebrew letters bearing diacritical marks. The result is that at some point the grapheme <ג'> represented up to three phonemes: /ʧ/, /ʤ/ and /ʒ/. Hebrew does not have the voiced palatal nasal /ɲ/, represented by <ני> or <ניי> in Judeo-Spanish. This makes it hard to differentiate geographical variants, such as nieve <נייב'י> or ñeve <נייב'י> ʻsnowʼ, since both readings are possible. Another problem lies in reading words that are pronounced with a voiced alveolar tap /ɾ/ or a trill /r/ (e.g. <פירה>, which may have two readings: /'peɾa/ ʻpearʼ or ʻbitchʼ and /'pera/ ʻbitchʼ, since only one grapheme is available for writing the two phonemes still preserved in some varieties). Adjustments to the spelling system imposed by phonological changes also need to be borne in mind.
4. Contact of Judeo-Spanish speakers with Hebrew as the Sephardim's ethnoreligious language, and the different types of contact with the surrounding Romance and non-Romance languages, must also be considered. These languages are Turkish, Greek, Slavic languages, Arabic, Romanian and Italian dialects, to which German, but especially Italian and French (and to a lesser extent Spanish), as languages of culture since the mid-19th century, must be added³. Judeo-Spanish contact with the last three furthered its revival and the re-Romanization of its regional norms. Before that, however, mainly in works belonging to the rabbinical style (which are almost all of them) and in private letters, not only are quotations from Hebrew sources embedded in Judeo-Spanish texts, but all kinds of Hebrew nouns in construct state or smikhut ʻgenitiveʼ and other words pertaining to all parts of speech may also appear merged with Hebrew inflectional morphemes. Some examples are shown in Figure 2. While Hebrew single words and nouns in construct state do not pose problems and can be lemmatized as integrated words in Judeo-Spanish, it is impossible to lemmatize words merged with all kinds of Hebrew inflectional morphemes following the rules of Hebrew.
5. Finally, an essential aspect taken into account in the development of CoDiAJe is the ascription of Judeo-Spanish to different cultural traditions, mainly the Hispanic and Jewish ones. All this, naturally, without «concealing the Judeo-Spanish nature of the text, the characteristics, and history of the Sephardic language» (Busse 2005: 105).
In short, a single corpus should a) process linguistic data in the alphabets mentioned above, b) allow the visualization of each text in the original version independently of the alphabet in which it was written, its Latinized transcription, and a modern standardized version, and c) enable the user to conduct searches not only for a specific word but also for all its linguistic and orthographic variants in the different alphabets.
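The ambiguity described in point 3 above can be made concrete with a short sketch that enumerates every reading compatible with a Hebrew-script spelling. The vowel and r-values come from the description above; the value assigned to pe and the function name `candidate_readings` are assumptions made only for this illustration, and the table is far from a complete grapheme inventory.

```python
from itertools import product

# Candidate phonemic values per Hebrew grapheme, following point 3 above.
# Only the letters needed for the example are listed.
READINGS = {
    "פ": ["p"],        # assumption for this example only
    "י": ["e", "i"],   # yod writes /e/ or /i/
    "ו": ["o", "u"],   # waw writes /o/ or /u/
    "א": ["a"],        # aleph writes /a/
    "ה": ["a"],        # word-final he writes /a/
    "ר": ["ɾ", "r"],   # a single grapheme for tap and trill
}


def candidate_readings(word):
    """Return every phonemic string compatible with the given spelling."""
    options = [READINGS[letter] for letter in word]
    return ["".join(combo) for combo in product(*options)]


readings = candidate_readings("פירה")
print(readings)
```

For this four-letter spelling the two ambiguous graphemes (yod and resh) yield four candidates, among them the two readings /'peɾa/ and /'pera/ discussed above. Choosing between candidates requires lexical and dialectal knowledge that a table of this kind does not contain.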
The development of CoDiAJe originated from the need to recover, for research and for the speech community, at least part of the nearly 4,000 Judeo-Spanish printed books, some of them of more than 1,000 pages, and about 250 periodicals, some of which were published for over 30 years. To these one must add thousands of manuscripts that were never published. Most of this textual legacy in Judeo-Spanish is hidden in archives around the world. Their publication would make it possible, in many cases, to piece together fragments of a single document, now scattered over different collections and archives.
The present paper is organized as follows: Section 1 provides the reader with a brief description of the corpus and the current state of its development. Section 2 deals with metadata (2.1), the multi-alphabetic nature of CoDiAJe (2.2), the problems raised in the tasks of textual (2.3) and linguistic annotations (2.4) and their solutions. Section 3 addresses the advantages offered by the search options of the corpus and presents examples of the frequency distribution of some of the search results. Finally, Section 4 contains concluding remarks and offers directions for the future development of multi-alphabetic corpora for languages whose texts have similar characteristics to those of Judeo-Spanish.

BRIEF DESCRIPTION OF CODIAJE
After several attempts using other tools, with unsatisfactory results, CoDiAJe was created in TEITOK⁴, initially developed at the Centro de Linguística da Universidade de Lisboa (Janssen 2016: 4037), and follows the structure of the project P. S. Post Scriptum, with the replacement of some features and the addition of others to satisfy the specific requirements of Judeo-Spanish texts.
Taking advantage of TEITOK as a web-based platform for visualizing, searching, and editing TEI/XML-based corpora (Janssen 2018) that combines textual and linguistic annotation within a single TEI-based XML document, CoDiAJe is a structured multi-genre diachronic corpus that includes documents produced from the 16th century up to the 21st century (this does not preclude the addition of older texts in the future), enriched with different kinds of textual and linguistic information. Every document is also accompanied by metadata.
CoDiAJe currently contains 74 documents totaling 352,357 tokens. Of these, 16 documents were written in the Hebrew alphabet, 1 in the Cyrillic script, and 57 in Latin characters, representing 98,112, 323, and 253,922 tokens respectively, as shown in Figures 3 and 4⁵. This disproportion derives from the fact that some texts had already been transcribed in Latin characters during earlier, unsuccessful attempts to develop a corpus of Judeo-Spanish, before its development in TEITOK began. In the future, we will correct this by returning each of the documents originally written in Hebrew characters to its original alphabet.
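The figures just given can be cross-checked and turned into relative shares with a few lines of arithmetic. The sketch below uses only the counts stated above; the variable names are ours.

```python
# Token and document counts per alphabet, as reported in the text.
tokens = {"Hebrew": 98_112, "Cyrillic": 323, "Latin": 253_922}
documents = {"Hebrew": 16, "Cyrillic": 1, "Latin": 57}

total_tokens = sum(tokens.values())
total_documents = sum(documents.values())

# Share of the corpus (in percent of tokens) held by each alphabet.
shares = {
    alphabet: round(100 * count / total_tokens, 1)
    for alphabet, count in tokens.items()
}

print(total_documents, total_tokens)
print(shares)
```

The per-alphabet counts add up to the reported totals of 74 documents and 352,357 tokens, and the shares make the disproportion explicit: roughly seven tokens in ten currently come from texts stored in Latin characters.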
24 of the documents (35,624 tokens) have been fully annotated.This still small set of texts is being used for corpus training, which will allow annotations to be made automatically in the future, requiring only their revision by the CoDiAJe editors to exclude possible errors.

Metadata
The list of metadata has been carefully planned to allow advanced corpus search options, and could be improved in the future. The metadata of each document provide information of scientific interest about the author (name, gender, year and birthplace, place of residence), and about the document, such as date, genre, alphabet, or documentary source, as shown in Figure 5. There is also the option of viewing more data, which include information about the medium in which the text was created, whether it is an original or a translation, and who was responsible for the tasks of transcription, normalization, and tagging of each text, among other data (Figure 6). In this case, they correspond to the letter in Figure 5.

The multi-alphabetic corpus of Judeo-Spanish
The digitization of the texts is carried out with the ABBYY OCR software, except for the texts written in Hebrew characters, where Transkribus is used after being trained for the recognition of texts in Judeo-Spanish⁶. The first version usually contains a large number of errors that must be corrected manually. In spite of this, the task is exceptionally profitable, since, in the time required to copy a page manually, it is possible to check between 35 and 40 pages once the optical character recognition has been completed. The documents are incorporated into CoDiAJe using the XML-TEI format in order to enter all the necessary metadata for further processing.
A particularly innovative aspect of CoDiAJe is the possibility of incorporating texts in the alphabets in which they were originally written or published and visualizing them correctly. Therefore, the first step in the development of CoDiAJe consisted in adapting TEITOK to the peculiar multi-alphabetic character of the Judeo-Spanish documents. The possibility of including orthographic forms in the Hebrew alphabet raised the difficulty of writing and reading in a right-to-left direction without disturbing the opposite directionality of the Latin, Greek and Cyrillic alphabets. After expanding the potential of CoDiAJe to allow the inclusion of texts written in different alphabets and the coexistence of multiple orthographies, it was necessary to determine the layers required to encode the multiple orthographies and alphabets in which a word can be written. At present CoDiAJe has a total of five different orthographic realizations: 1) an original spelling (Transcription), which is an exact copy of the text (Figure 7); 2) a transcription in Latin characters (Romanized form), available only when the Transcription is written in non-Latin characters (Figure 8); 3) an Expanded form, used to expand the numerous abbreviations in the Judeo-Spanish texts and to correct defective spellings, which are very frequent in manuscripts and documents in Hebrew script (cf. the tokens hashem and vegomer in the Expanded form after the completion of the abbreviations h´ and vego´ shown in the Romanized form in Figure 11); 4) a Normalized form, in which each word appears standardized according to the spelling rules (cf. Álvarez López 2017) authorized by the National Authority of Ladino on August 13, 2018, and to the standard characteristics of the Judeo-Spanish spoken in Istanbul, which is the variety currently spoken by the largest number of Judeo-Spanish speakers (Figure 9); 5) a Hispanized form (Spanish equivalent) that in the future will allow for a visualization of the texts in modern Spanish spelling.
These results can be shown in different layers (including the original spelling of the TEI-based XML file, as shown in Figure 7, and a Romanized version when the file is not in the Latin alphabet; see Figure 8), which allows for a visualization of the diversity of variants.
All the orthographic options can be visualized by clicking on the corresponding buttons on top to switch between the various layers. It should be stressed that due to the specific characteristics of Judeo-Spanish, which derive from a low standardization level, the aim here is to achieve a structured approach to the visualization of diversity, focusing on variant detection. The multi-alphabetic character of CoDiAJe therefore does not affect the search function. One of CoDiAJe's achievements is that a single query can yield results for all variants without losing track of the original spelling form, as shown in Figure 10.
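How a viewer might pick the form to display for a given layer can be sketched as follows. This is a minimal illustration, not TEITOK's own code: the attribute names `roman`, `nform` and `spa` are the ones visible in the corpus XML quoted in this paper, whereas `fform` for the Expanded form, the fallback order, and the function name `form_for_layer` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# For each layer, the attributes to try in order. If none is present,
# the original spelling (the element text) is used. The fallback order
# is an assumption of this sketch.
LAYER_ATTRS = {
    "transcription": [],
    "romanized": ["roman"],
    "expanded": ["fform", "roman"],
    "normalized": ["nform", "fform", "roman"],
    "spanish": ["spa", "nform", "fform", "roman"],
}


def form_for_layer(tok, layer):
    """Return the form of a <tok> element to display in the given layer."""
    for attr in LAYER_ATTRS[layer]:
        value = tok.get(attr)
        if value:
            return value
    return tok.text or ""


# A token with Hebrew-script original spelling and three encoded layers.
tok = ET.fromstring(
    """<tok roman="avizi" nform="avizi" spa="avisé">אב'יזי</tok>"""
)

for layer in LAYER_ATTRS:
    print(layer, form_for_layer(tok, layer))
```

Because every layer is stored on the same token, switching the display never detaches a form from its original spelling, which is what allows a single query to return all variants.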

Textual annotation: Problems and solutions
Certain forms in the process of grammaticalization or lexicalization may present problems when making textual annotations in historical texts. Good examples in Judeo-Spanish texts are some adverbs ending in -mente. Forms written as orthographic variants such as סולא מינטי or סולה מינטי (= sola mente) ʻonlyʼ or קואל מינטי (= kual mente) ʻalsoʼ emerge in texts from the 16th to the 18th century, while in modern Judeo-Spanish they are written as one word. TEITOK offers the possibility of merging two or more adjacent words in a single token, while preserving their original orthography in two or more words, as shown in Figure 13. The same phenomenon, and its reverse, occurs in sequences formed by some prepositions, such as a or de, followed by the definite article el, which may appear written separately (a el, de el) or contracted (al, del). In this case, we follow the mixed approach of TEITOK (cf. Janssen 2016: 4038), according to which contractions are annotated as one orthographic <tok> with two grammatical <dtok> elements, but preserve their original spelling, whether they are written together or separately. The same procedure is followed in ala, alas, alos, dela, delas, delos, enlos and other similar tokens, always annotated in one orthographic <tok> with two grammatical <dtok> elements (Figure 14). Complex forms, such as verbs with enclitic pronouns, are also annotated as one orthographic <tok> with the corresponding grammatical <dtok> elements.
<tok roman="avizi" nform="avizi" spa="avisé">אב'יזי</tok>
<tok roman="perasha" nform="perasha" spa="perašá">פרשה</tok>
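The one-<tok>-with-several-<dtok> pattern described above can be sketched with the standard library. The element and attribute names follow the XML quoted in this paper; the helper names and the POS values given to de and el (EAGLES-style SPS00 and DA0MS0) are illustrative assumptions, not tags taken from the corpus.

```python
import xml.etree.ElementTree as ET


def make_contraction(surface, parts):
    """Build one orthographic <tok> holding one <dtok> per grammatical unit."""
    tok = ET.Element("tok")
    tok.text = surface  # the original spelling is kept untouched
    for form, lemma, pos in parts:
        ET.SubElement(tok, "dtok", {"form": form, "lemma": lemma, "pos": pos})
    return tok


def grammatical_units(tok):
    """Return (form, pos) pairs: the <dtok> children, or the token itself."""
    dtoks = tok.findall("dtok")
    if dtoks:
        return [(d.get("form"), d.get("pos")) for d in dtoks]
    return [(tok.text, tok.get("pos"))]


# The contraction "del" analysed as preposition + article
# (tag values are hypothetical).
del_tok = make_contraction(
    "del", [("de", "de", "SPS00"), ("el", "el", "DA0MS0")]
)

print(ET.tostring(del_tok, encoding="unicode"))
print(grammatical_units(del_tok))
```

Keeping the surface form on the <tok> and the analysis on the <dtok> children means that a search by spelling and a search by grammatical category both reach the same token.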
The vast majority of nouns in construct state borrowed from Hebrew are elements of the Judeo-Spanish lexicon, in which they are collocations, although morphologically they preserve their Hebrew structure. In this case, they are annotated as one orthographic <tok>. For example, the two parts of divre tora ʻwords of the Torahʼ, אומות העולם ʻnations of the worldʼ, ביקור חולים ʻvisiting the sickʼ, בית דין ʻrabbinical courtʼ or בית החיים ʻcemeteryʼ, which fulfil the conditions of non-compositionality, non-substitutability and non-modifiability⁷ (Manning and Schütze 1999: 184), are considered collocations. The same is true for אנשי כנסת הגדולה ʻThe Men of the Great Assemblyʼ, which, in addition to the two nominal parts, also contains an adjective (Figure 15).
⁷ These characteristics are highlighted in the creation of new simple lexies, such as amares (sg.), amareses (pl.) ʻignorant, especially in matters of Jewish law and custom; boorish, unlettered personʼ (Bunis 1993: 368, #3169), arising from the contraction of the two parts of the noun in the construct state עם הארץ (= am a-areṣ).

Different approaches are followed in relation to Hebrew words merged with Hebrew inflected morphemes and tokens involving more than one word that are not included in the patterns already discussed. If they are frequently used forms in Judeo-Spanish texts, albeit sometimes only as diaphasic variants, they are considered elements of its lexicon: for example, rabotenu (N+Poss.), lit. ʻour wise lordsʼ, when preceded by verba dicendi (dezir 'to say', avizar 'to point out', and expressions of similar meaning), belongs to the rabbinical style, while other genres display more standardized phrases such as muestros hahamim or muestros sinyores savios. However, in this context, rabotenu limits the reference to the sages of Israel from the time of the Mishnah and the Talmud, and it is understood as such.
Since Judeo-Spanish also has the word ribi (lit. ʻmy lordʼ), which is a term of respect and courtesy that comes before the names of men, and also before the names of the wise, the two forms could reasonably be considered to belong to the same nominal Judeo-Spanish paradigm, of which ribi is the unmarked singular form and rabotenu the plural. One of the several lexical units involving more than one word borrowed from Hebrew is בצער (= beṣar), lit. 'with regret', composed of the preposition be- (with which the noun to which it is linked can also be adverbialized) and the common noun ṣar ʻpain, grief, sadness, sufferingʼ. In Judeo-Spanish, besar belongs to the class of adverbs. Therefore, it is annotated in a single <tok>.
Hebrew forms that are only found once or have a low frequency in CoDiAJe, with no other evidence of their use in Judeo-Spanish, are annotated as alien elements and, as we saw in §2.2, their linguistic affiliation is the only information that is tagged. In multiple-word sequences in other languages, such as Hebrew ve-ahar kah ʻand after thisʼ, ke-dereh a-soharim ʻlike the merchantsʼ, mizerah amiluha ʻof royal lineageʼ, the textual annotation appears in one <tok>, and they lack standard linguistic annotations.
Frozen idioms, sayings and proverbs, and similar multiple-word units, as well as quotations from Hebrew sources or other languages, are annotated in one <tok> without further linguistic information (see Ish bitiren para ʻwhat puts an end to the work is the moneyʼ in Figure 18).As we will see later, these sequences, whether in Hebrew, Turkish or another language, are assigned a special mark to indicate their linguistic affiliation, in view of the fact that they are not part of the Judeo-Spanish lexicon.
The task of annotation of alien words that are not integrated into the Judeo-Spanish system is an arduous undertaking that involves a careful analysis of their function in the system and their frequency in the documents, which often necessitates modifying their textual and linguistic annotations more than once.

Linguistic annotation
The exploitation potential offered by CoDiAJe would not have been achievable without accurate and careful linguistic annotations (POS and lemma), and without the patience needed for the lemmatization of poorly standardized languages, in the absence of ancillary resources, such as a good dictionary, especially when their speakers are unaware of vast portions of the lexicon used by previous generations. Initially this corpus was tagged using a custom-built version of Freeling (Padró 2011; Padró et al. 2010) for Old Spanish (Sánchez-Marco et al. 2010; Sánchez-Marco et al. 2011; Sánchez-Marco et al. 2012) and the EAGLES tagset for Spanish adapted for Old Spanish within the framework of the OntoSEM project. Although the level of reliability of the automatic tagging was very high (approximately 85%) when using this tool for 15th-17th-century Sephardic texts manually transcribed in Latin alphabet, the incorporation of documents from later centuries transcribed in an adaptation of the modern orthography of Judeo-Spanish or copied in Hebrew characters revealed that the efficiency of automatic tagging was not sufficient.
Since Judeo-Spanish differs from late 16th-century Spanish and modern Spanish in a number of respects, such as in its morphology, syntax, and semantics, in addition to the orthographic representations, POS tagging is done manually using the EAGLES tagset for Spanish. When the corpus contains a considerably greater number of incorporated texts, the intention is that these tasks will be carried out directly with NeoTag, trained on the already tagged files. This training is necessary since the POS tagger NeoTag uses lexical smoothing to detect grammatical neologisms in a corpus, and therefore tags and lemmatizes known and unknown words alike (cf. Janssen 2012).
The EAGLES tagset had to be modified in many cases, and, in addition, new tags have been created to describe all Judeo-Spanish forms accurately. The digitized texts are also enriched with semantic-conceptual information: it is possible to identify expressions made up of one or more words, automatically classified as names of people, places, institutions, titles, and names assigned to God. Quotations in other languages, expressions and all kinds of forms of one or more words that do not belong to Judeo-Spanish are also tagged. Finally, all non-Romance forms have been enriched with information about their linguistic affiliation. This information is combined with the linguistic tagging of the texts. This means that the semantic-conceptual information and the language affiliation are part of the POS, together with the morpho-syntactic information. Figure 19 shows the full tagging of the geographic variant bugitus ʻlittle packagesʼ, a common noun (NC) in masculine (M), plural (P), and diminutive form (D) of the lemma bogo. The final Y of the POS also provides information about the language from which this word was borrowed in Judeo-Spanish, in this case Turkish.
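The reading of the tag NCMP00DY given above can be sketched as a small positional decoder for noun tags. The position layout is inferred from that single example and from the paper's statement that semantic-conceptual codes occupy position 5, so it should be taken as an assumption rather than as the corpus tagset definition. The language letters other than Y (H, K, A) are the ones this paper gives for Hebrew, Greek and non-Hispanic Arabic; the function name is ours.

```python
# Language letters used in the end position of the POS tag.
LANGUAGES = {
    "H": "Hebrew",
    "Y": "Turkish",
    "K": "Greek",
    "A": "non-Hispanic Arabic",
}
NOUN_TYPE = {"C": "common", "P": "proper"}
GENDER = {"M": "masculine", "F": "feminine"}
NUMBER = {"S": "singular", "P": "plural"}


def decode_noun_tag(pos):
    """Decode a 7-character noun tag, optionally followed by a language letter."""
    if len(pos) not in (7, 8) or not pos.startswith("N"):
        raise ValueError(f"not a noun tag in the assumed layout: {pos!r}")
    return {
        "category": "noun",
        "type": NOUN_TYPE.get(pos[1], pos[1]),
        "gender": GENDER.get(pos[2], pos[2]),
        "number": NUMBER.get(pos[3], pos[3]),
        "semantic_class": None if pos[4] == "0" else pos[4],
        "diminutive": pos[6] == "D",
        # Only an eighth character is read as a language letter.
        "source_language": LANGUAGES.get(pos[7]) if len(pos) == 8 else None,
    }


decoded = decode_noun_tag("NCMP00DY")
print(decoded)
```

Reading the language letter only from an eighth position is a deliberate choice in this sketch: it avoids confusing a trailing A that marks Arabic origin with any seventh-position value that might also be written with the same letter.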
As already explained in §2.1, complex forms of two or more words, such as verbs with postponed clitics, are annotated in a single orthographic form but separated according to their morphological characteristics. Figure 20 contains the textual and linguistic information of מאלסינארלו ʻto denounce him, to slander himʼ⁸. Although this verb is a derivative of the Hebrew noun malšīn, it is not clear to which medieval social group its formation should be assigned. Therefore, the verb malsinar is here considered a patrimonial word transferred from the Iberian Peninsula and not a Hebrew word.
<tok id="w-4052" nform="bogitos" spa="boguitos" pos="NCMP00DY" lemma="bogo" lemmaes="paquete" glosses="paquetitos">bugitus</tok>
<tok id="w-22" roman="malsinarlo">מאלסינארלו<dtok id="d-22-1" form="malsinar" pos="VMN0000" lemma="malsinar" lemmaes="malsinar"/><dtok id="d-22-2" form="lo" lemma="lo" pos="L3MSA0 " lemmaes="lo"/></tok>
The lemmatization is done according to the orthographic rules of modern Judeo-Spanish and the norm of Istanbul as much as possible, since that is the variety with the most speakers in the world, and the one most frequently used in writing. The corpus is also lemmatized in modern Spanish. As surprising as that may seem, it can be of great help in the lemmatization of CoDiAJe, a task that sometimes becomes difficult due to the dialectal variation. With a quick search by the Spanish lemma, possible errors in the lemmatization of a Judeo-Spanish token can be easily detected. Consequently, its lemmatization can be unified for all the geographical variants belonging to the lemma of the standardized variety, in this case that of Istanbul. A good example is the geographical variant dishipla (Bitola, Salonika, Sarajevo) ʻmaidʼ, while in Istanbul a maid is called mosa. In Bitola, however, mosa means 'young womanʼ (djovena in Istanbul), and in Salonika it can be used with both meanings. The problem is solved by assigning to the lemma djoven 'young' all the textual occurrences with this meaning, like djovena, mosa, moso, moçuelo (Spanish lemma: joven), and to the lemma moso 'servant' (Spanish lemma: criado) all those that have the same meaning, like mosa and dishipla. If we search for the Spanish lemma criado, the results show in the KWIC line only occurrences of mosa or moso meaning 'servant' in the texts, with the Judeo-Spanish lemma moso, while when mosa or moso have the meaning of 'young', they always appear under the Spanish lemma joven, and djoven must also be the Judeo-Spanish lemma. Otherwise, the search results would contain errors. Therefore, it is essential to ensure uniformity in the task of the Judeo-Spanish lemmatization, following as much as possible the variety spoken in Istanbul, selected as the standard variety for the lemmatization in CoDiAJe, and the lemmatization in Spanish can help in this task. The problem arises when there is not a total equivalence of meaning between cognate forms such as the Spanish noun joven and the Judeo-Spanish djoven. Judeo-Spanish words such as muchacha and manseva may also appear in the KWIC line on a search for the Spanish lemma joven as a noun, because the former has the meaning of 'young' in the Sarajevo variety and the latter is used with the meaning 'young' in all Judeo-Spanish varieties. It has been proved that lemmatization in Spanish can also facilitate the query when the user lacks knowledge of Judeo-Spanish, because it is possible to extract from the texts all the words and variants that in Judeo-Spanish have the same meaning as the Spanish lemma.
In Figure 21, the first 7 lines show the linguistic annotation of the occurrences discussed in the previous paragraph. Lines 8-18 contain tags that needed to be modified or created. From line 19 to 24, we can see POS with semantic-conceptual information (G = geographical names; P = private or fictional names of people; O = institutional names; T = titles of works; V = epithets referring to God) in position 5 of the tag. Capital letters in the end position of the POS offer information concerning the linguistic affiliation of words or groups of words borrowed from non-Romance languages that are part of the Judeo-Spanish lexicon (for instance, H indicates a Hebrew origin of the term; Y refers to Turkish, K to Greek, and A to non-Hispanic Arabic, as seen in lines 25-35). The same tags without any other information indicate only terms, expressions, sayings, proverbs, blessings, curses, and quotations in non-Romance languages interpolated in Judeo-Spanish texts (lines 36-38). With this information, it is possible to retrieve the complete list of terms from Hebrew or Turkish or other non-Romance contact languages documented in CoDiAJe in a single query. A summarized description of the corpus tagset is available in the main menu of CoDiAJe.
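The use of the Spanish lemma as a cross-check on the Judeo-Spanish lemmatization can be sketched as a simple grouping: every Spanish lemma that is paired with more than one Judeo-Spanish lemma is flagged for review. The forms and lemmas are the ones discussed above; the last record is a deliberately wrong annotation added for the illustration, and the function name is ours. Since the equivalence between cognates is not always total, as noted above for joven, a flag is a prompt for manual revision and not proof of an error.

```python
from collections import defaultdict

# (form, Judeo-Spanish lemma, Spanish lemma)
annotations = [
    ("djovena", "djoven", "joven"),
    ("mosa", "djoven", "joven"),      # mosa meaning 'young woman'
    ("moçuelo", "djoven", "joven"),
    ("mosa", "moso", "criado"),       # mosa meaning 'maid'
    ("dishipla", "moso", "criado"),
    ("mosa", "moso", "joven"),        # deliberately inconsistent record
]


def lemma_conflicts(records):
    """Map each Spanish lemma to its Judeo-Spanish lemmas when there are several."""
    by_spanish = defaultdict(set)
    for _form, lemma, lemma_es in records:
        by_spanish[lemma_es].add(lemma)
    return {
        lemma_es: sorted(lemmas)
        for lemma_es, lemmas in by_spanish.items()
        if len(lemmas) > 1
    }


print(lemma_conflicts(annotations))
```

Removing the last record leaves no conflicts, which corresponds to the uniform lemmatization the text argues for.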

QUERY
CoDiAJe is searchable via TEITOK, which interfaces with a local CQP. With the annotated corpus adequately indexed for exploitation via the CQP search engine, it has become possible to conduct searches not only for a specific word and any of its variants through XML files directly using the CQP query, but also to query using CQL directly on the website. This makes it possible to search for all kinds of variants, and for specific sequences of expressions, grammatical categories and combinations thereof, and makes it easy to carry out various types of quantitative analyses (e.g. relative frequencies, distribution). This is very important in order to draw statistical inferences about the degree of incidence of several variants exposed to linguistic changes, to map out the relative frequency of each form, and to draw conclusions from the use of variants in Judeo-Spanish. One of CoDiAJe's achievements is that a single query can yield results for all variants without losing track of the original spelling form. For example, a search for the lemma defter yields results for all the orthographic and grammatical forms of this lemma in the corpus. The results appear in the browser, showing the KWIC line for each of them.
Displaying them in the Romanized transcription is also possible, which allows for visualization of all variants in the transcription in Latin characters (Figure 22). When the purpose of the query is not the variation, the results can be displayed in modern standardized Judeo-Spanish or modern Spanish spelling. It is also possible to search in a single text or in a group of texts, predetermined in the query. Figure 23 shows the result of the search for the lemma sinyor in 16th-century texts.
By clicking on the word context in each line of the KWIC, it is possible to display the word form of the lemma in the text in the selected orthographic layer (Figure 24) or to switch between the various options on the top menu. The text can also be visualized with or without the linguistic annotations by clicking on the Tags buttons at the top (Figure 25).

In TEITOK, the CWB files are created directly from the XML files (Janssen 2018). It is therefore possible to define more complex queries that combine filters at all levels. In Figure 26 the search is limited to adjectives whose lemma ends in the morpheme -li in texts written in the 20th century. Their distribution and frequency by author are shown in Figure 27. Figure 28 shows the Romanized visualization of the results of the search for any infinitive preceded by a clitic personal pronoun after a preposition in the text collection of CoDiAJe, and their distribution by the place in which the texts with this word order were written. As Figure 29 shows, it is also possible to retrieve morphological information, for example on the gender of feminine Hebrew nouns ending in -ut borrowed into Judeo-Spanish, and to confirm that they occur only with masculine determiners.

As already noted in §2.2, CoDiAJe allows searching for specific semantic-conceptual information, such as place names mentioned in the whole collection of texts or in particular texts. For example, the bar chart in Figure 30 presents the distribution of the two variants of the lemma yerushalayim found in texts from Sarajevo. A search containing, for example, H or Y in the POS will retrieve all forms borrowed from Hebrew, Turkish, or another non-Romance language with which the Judeo-Spanish speakers were in contact, or quotations, sayings, and proverbs in these languages, as well as other lexical elements not integrated into the Judeo-Spanish system. Figure 31 shows the two kinds of lexical units: all the conventional POS tags ending in Y correspond to integrated words borrowed from Turkish, while the isolated Y tag refers to Turkish lexicon that sometimes appears inserted in Judeo-Spanish texts.

In the near future, CoDiAJe will have other resources that are included in TEITOK, such as (a) a dictionary with the vocabulary of the corpus, and (b) a module for the development of linguistic maps. Later, (c) the module for syntactic annotation will be added.
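The combined filters described above can be illustrated with a minimal Python sketch. It does not use TEITOK or CWB: it operates on a small in-memory list of invented token records, and the field names (`lemma`, `pos`, `year`, `author`), the tag values, and the placeholder lemmas are all assumptions made for illustration. It shows only the logic of two of the queries: adjectives whose lemma ends in -li in 20th-century texts, counted by author, and the distinction between tags ending in Y and the isolated Y tag.

```python
import re
from collections import Counter

# Invented toy records: field names, tags and values are placeholders,
# not data taken from CoDiAJe.
TOKENS = [
    {"lemma": "aaali", "pos": "AQ0CS0", "year": 1925, "author": "author_1"},
    {"lemma": "bbbli", "pos": "AQ0CS0", "year": 1931, "author": "author_2"},
    {"lemma": "cccli", "pos": "AQ0CS0", "year": 1880, "author": "author_1"},
    {"lemma": "ddd", "pos": "NCMS000Y", "year": 1910, "author": "author_2"},
    {"lemma": "eee", "pos": "Y", "year": 1910, "author": "author_2"},
    {"lemma": "fff", "pos": "H", "year": 1905, "author": "author_1"},
    {"lemma": "aaali", "pos": "AQ0CS0", "year": 1950, "author": "author_1"},
]


def adjectives_in_li(tokens, first_year=1901, last_year=2000):
    """Return adjectives (tag starting with 'A') whose lemma ends in -li
    and whose text falls within the given range of years."""
    return [
        t for t in tokens
        if t["pos"].startswith("A")
        and re.search(r"li$", t["lemma"])
        and first_year <= t["year"] <= last_year
    ]


def turkish_status(pos):
    """Classify a tag using the convention described for Figure 31:
    an isolated 'Y' marks non-integrated Turkish lexicon, while a
    conventional tag ending in 'Y' marks an integrated borrowing."""
    if pos == "Y":
        return "non-integrated"
    if pos.endswith("Y"):
        return "integrated"
    return None


hits = adjectives_in_li(TOKENS)
by_author = Counter(t["author"] for t in hits)
turkish = Counter(
    status for status in (turkish_status(t["pos"]) for t in TOKENS) if status
)
```

In the corpus itself such restrictions are expressed as queries over the CWB files; the sketch is meant only to make explicit which conditions are being combined.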

CONCLUDING REMARKS
CoDiAJe now has the functionality afforded by TEITOK's basic design, to which several new features have been added to adapt it to the specific needs of a Judeo-Spanish corpus. First, to the three standard attributes of each tok (a transcription, an expanded form, and a normalized form), two new attributes were added (a Romanized form and a Spanish equivalent), yielding up to five orthographic forms of the same word. This addition makes it possible to visualize each text in five different orthographic forms and, not least, to retrieve all the orthographic and linguistic variants of a lemma through a query.
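As a rough illustration of how five orthographic layers can sit on a single token, the following Python sketch reads a `<tok>` element and returns the requested layer. Several points are assumptions rather than a description of the CoDiAJe files: the attribute names `rform` and `sform` for the Romanized form and the Spanish equivalent are hypothetical, the attribute values are made up, and the fallback rule (a missing normalized form falls back to the expanded form, which falls back to the transcription) is a simplified reading of how such layers are usually inherited.

```python
import xml.etree.ElementTree as ET

# Assumed inheritance between the three standard layers. The hypothetical
# layers "rform" (Romanized) and "sform" (Spanish) have no fallback here.
FALLBACK = {"nform": "fform", "fform": "form"}

# Illustrative token: the attribute values are invented for this sketch.
SAMPLE = (
    '<s>'
    '<tok id="w-1" nform="sinyora" rform="siniora" sform="señora">שיניורה</tok>'
    '<tok id="w-2" nform="kaza">קאזה</tok>'
    '</s>'
)


def layer(tok, name):
    """Return the value of an orthographic layer for one token.

    The transcription ("form") defaults to the text inside the element;
    other layers follow FALLBACK, and None is returned when a layer is
    absent and has no fallback.
    """
    while name is not None:
        value = tok.get(name)
        if value is not None:
            return value
        if name == "form":
            return "".join(tok.itertext())
        name = FALLBACK.get(name)
    return None


toks = ET.fromstring(SAMPLE).findall("tok")
first, second = toks
views = {name: layer(first, name)
         for name in ("form", "fform", "nform", "rform", "sform")}
```

Here `views` holds all five layers of the first token, while `layer(second, "rform")` returns `None` because that token carries no Romanized form in the sample.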
In more recent times, since the 1940s, Judeo-Spanish documents have been written in Latin characters, albeit using various versions of orthographic systems. Some documents have also been written in the Cyrillic and Greek alphabets. However, until the late 19th century all documents were written in Hebrew characters. Consequently, a requisite in the development of CoDiAJe was to obtain a tool that could incorporate texts in all the alphabets in which they were originally written or published, and visualize them correctly without interfering with searches in the corpus. This paper shows that all these requirements were met thanks to Maarten Janssen, the developer of TEITOK and a collaborator on this project.

CoDiAJe, like any other corpus created in TEITOK, is easy to use both for editors who are not experts in NLP and for users. Another important advantage lies in the ease with which errors detected in the corpus can be corrected, or necessary changes made, at any time. It is well known that textual and linguistic annotation tasks in a corpus of a language with limited standardization, such as Judeo-Spanish, are not always easy. The annotation in CoDiAJe requires continuous revision every time that forms emerge in the texts for which the EAGLES tagset for Spanish (and for other European languages as well) has no tags. Tags modified or especially created for Judeo-Spanish are simply steps in the creation of an accurate tagset for Judeo-Spanish. The tagging work itself, and the testing of the adequacy of each annotation through searches, offer the possibility of analyzing the language and frequently reveal the need to modify the annotation of a particular token when morphological features not mentioned in the secondary literature are detected.
An unlimited number of new texts can be incorporated into CoDiAJe, and the annotation can be improved at any time.The most immediate goal is to incorporate new texts in Hebrew characters that have already been extracted from the image and converted into Word text, as a result of which the corpus will soon have more than three million words.
CoDiAJe also meets other conditions required in corpus linguistics: the possibility of attaching the facsimile alongside the text, descriptive statistics, and the complete download of documents in XML format and plain text.Moreover, it offers the possibility of adding other resources, including (a) a dictionary with the vocabulary of the corpus, (b) a module for the development of linguistic maps, and (c) a module for syntactic annotation.
In view of CoDiAJe's potential for text processing and its capacity to store an unlimited number of documents in a single virtual library, a significant part of the Sephardic documentary heritage will be made available to a large number of scholars from different fields, and to readers who do not have access to it at present. Despite being still in its development phase, CoDiAJe should serve as a guide in the development of diachronic corpora of majority or minority languages that have been written in different alphabets throughout their history, and of historical varieties of languages that have been written in a different alphabet from the standardized variety, for example, part of the morisco texts, or documents written in other Judeo-Romance varieties. A corpus like this makes it easier to detect the linguistic variation that characterizes such languages and linguistic varieties. The multi-alphabetic nature of CoDiAJe additionally allows the texts to be visualized in their original orthography, regardless of the alphabet in which they were written, and in their Romanized and modernized versions. This would contribute to disseminating these cultural heritages and would allow linguists and philologists to analyze each variety of the language while taking all its other varieties into account. Scholars in other fields could also benefit from the information frozen in the endless list of documents still hidden in archives and, in many cases, unintelligible for one reason or another.

Figure 3. Current number of tokens distributed by alphabet in which the texts were loaded onto CoDiAJe.

Figure 4. Number of tokens classified by century according to the date of composition of the CoDiAJe text collection.

Figure 5. A letter in CoDiAJe, handwritten originally in Hebrew characters, known as solitreo script.

Figure 6. Example of the metadata that accompany each document. In this case, they correspond to the letter in Figure 5.
Figures 7, 8, and 9 show different orthographic realizations of the text next to its facsimile image, in this case the reverse of a postcard sent from Rhodes to Los Angeles at the end of the 1920s, listed as lad801 in CoDiAJe.

Figure 11. Text view options in CoDiAJe.

Figure 15. Textual annotations of characteristic Hebrew nouns in construct state that in Judeo-Spanish became collocations.

Figure 16. <tok> of lexical units of two or more Hebrew words.

Figure 17. Textual annotation of two Hebrew participles in Judeo-Spanish.

Figure 22. Variants of the lemma defter ʻnotebook, account bookʼ, queried by lemma and visualized by the orthographic transcription and the Romanized transcription.

Figure 24. Visualization of the form ‫שיניורה‬ (occurrence in KWIC line 9 in Figure 23) in the text, displayed in the original orthographic form.

Figure 25. Visualization of the original orthographic form ‫שיניורה‬ in modern standardized Judeo-Spanish, displayed with lemma and POS tags.

Figure 26. Adjectives with lemma ending in -li in texts written in the 20th century.

Figure 28. Output of the preverbal position of clitic personal pronouns before an infinitive, and their distribution by place.

Figure 30. Distribution of local variants of the lemma yerushalayim ʻJerusalemʼ in Sarajevo texts.

Figure 31. Distribution by POS tags of all Turkish forms occurring in Judeo-Spanish texts already incorporated in CoDiAJe and tagged.
Possible visualizations of the Judeo-Spanish diatopic variants vežežirije ʻold ageʼ, ‫עזיז‬ ‫אל‬ ‫חאב‬ ʻCyperus esculentusʼ, and the Hebrew constructus ‫הרע‬ ‫לשון‬ (H.) ʻdefamationʼ.
Other Judeo-Spanish texts involve greater difficulties than the one shown in Figures 7, 8, and 9. For example, quotations from other texts, often in Hebrew, are usually not printed with a distinctive typeface. Their normalization in capital letters facilitates their quick location not only in the Normalized form, but also in the Transcription with the original spelling, as can be verified in the brief extract of a text first printed in 1730 (Figure 11), included in CoDiAJe.
contains the linguistic annotations of 38 forms included in CoDiAJe, tagged according to the exposed criteria.
Different kinds of orthographic forms tagged according to the criteria for CoDiAJe.