Lexico-morphology (<TAN-A-lm>)

TAN-A-lm files are used to annotate a class-1 source by specifying the lexical and morphological properties of its tokens or morphemes.

Every TAN-A-lm file has two types of dependencies: a class-1 source (optional) and the grammatical rules defined in one or more TAN-mor files. This section should therefore be read in close conjunction with the section called “Morphological Concepts and Patterns (TAN-mor)”.

TAN-A-lm files are either source-specific or language-specific.

Source-specific TAN-A-lm files depend exclusively upon one class-1 source. Source-specific TAN-A-lm files are useful for closely analyzing the grammatical properties of the words in one particular text. Well-curated source-specific TAN-A-lm files are enormously useful for other applications, e.g., quotation detection. Any source-specific TAN-A-lm file can be converted into a language-specific one, to be used as noted below.

Language-specific TAN-A-lm files depend upon an unknown number of sources. Some language-specific TAN-A-lm files might be based upon a small, specific corpus, perhaps just one text. Others might rely upon a vast, general one. Language-specific TAN-A-lm files are useful for building language resources for computer applications. Many language-specific TAN-A-lm files become the basis for a local language catalog, which can be used to populate a new source-specific TAN-A-lm file.

Editors of TAN-A-lm files should understand the vocabulary and grammar of the languages of their sources. They should have a good sense of the rules established by the lexical and grammatical authorities adopted. They should be familiar with the conventions and assumptions of the TAN-mor files being used.

Although you must assume the point of view of a particular grammar and lexicon, you need not hold to a single one. In addition, you may bring to the analysis your own expertise and supply lexical headwords unattested in published authorities.

Although TAN-A-lm files are simple, they can be laborious to write and edit, more than any other type of TAN file. They can also be hard to read if the morphological codes are cryptic. It is customary for an editor of a TAN-A-lm file to use tools to create and edit the data.

The root element of a lexico-morphological file is <TAN-A-lm>.

If the file is source-specific, <source> points to the one and only TAN-T(EI) file that is the object of analysis. If the file is language-specific, <for-lang> is used in the declarations section of the <head> to indicate the languages that are covered.
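
For example, the head of a source-specific file and that of a language-specific file might include, respectively, declarations along these lines (a minimal sketch; the abbreviated content and the language code are hypothetical):

   <!-- source-specific: exactly one class-1 source, identified through the
        digital entity metadata pattern -->
   <source>...</source>

   <!-- language-specific: no single source; declare the language(s) covered instead -->
   <for-lang>grc</for-lang>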

For highly inflected languages, language-specific TAN-A-lm files can be enormous in size or quantity. To improve performance when validating and processing numerous or large language-specific TAN-A-lm files, the <head> may also include <tok-starts-with> and <tok-is>. It is common for language-specific TAN-A-lm files to be cataloged in a <collection> file. These become part of the local language catalog, bound to the global parameter $tan:lang-catalog-map, found in parameters/params-application-language.xsl. By registering in that parameter your collections of language-specific TAN-A-lm files, you open those resources to use in a variety of other applications. In that <collection> file, the individual <doc>s that point to language-specific TAN-A-lm files should include as children any <tok-starts-with> and <tok-is> elements present in the original files.
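
Such a catalog might be sketched as follows (the file names and element content are hypothetical, and the exact shape of the <doc> entries may vary in your setup):

   <collection>
      <doc href="grc-alpha.TAN-A-lm.xml">
         <tok-starts-with>α</tok-starts-with>
      </doc>
      <doc href="grc-beta.TAN-A-lm.xml">
         <tok-starts-with>β</tok-starts-with>
      </doc>
   </collection>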


Conversion from a source-specific TAN-A-lm to a language-specific one is a one-way operation. There is at present no mechanism for automatically reconstructing the corpus that underlies a language-specific TAN-A-lm file.

<vocabulary-key> takes the elements other class-2 files take (see the section called “Class 2 metadata (<head>)”). It also permits two elements unique to TAN-A-lm: <lexicon> (optional) and <morphology> (mandatory). Any number of lexica and morphologies may be declared; the order is inconsequential.
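
A sketch of such a <vocabulary-key> (the idrefs are hypothetical; the children of each item, abbreviated here, follow the digital entity metadata pattern described elsewhere in these guidelines):

   <vocabulary-key>
      <lexicon xml:id="lex">...</lexicon>
      <morphology xml:id="mor">...</morphology>
   </vocabulary-key>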

There is, at present, no TAN format for lexica and dictionaries. So even if a digital form of a dictionary is identified through the section called “Digital entity metadata pattern”, the Schematron validation routine will not attempt to check the TAN-A-lm data against the lexical authorities cited.

Because you or other TAN-A-lm editors are likely to be authorities in your own right, a <person> can be treated as if it were a <lexicon>, and be referred to by @lexicon.

The <body> of a TAN-A-lm file takes, in addition to the customary optional attributes found in other TAN files (see the section called “Edit stamp”), @lexicon and @morphology, which specify the default lexicon and morphology.

<body> has only one type of child: one or more <ana>s (short for analysis), each of which matches one or more tokens (<tok>) to one or more lexemes or morphological assertions (<lm>, which takes zero or more <l>s followed by one or more <m>s).
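
Putting the pieces together, a minimal <body> might look like this (a sketch; the idrefs and values are hypothetical, and @val on <tok>, assumed here from general class-2 conventions, is only one way of identifying a token):

   <body lexicon="lex" morphology="mor">
      <ana>
         <tok val="..."/>
         <lm>
            <l>...</l>   <!-- lexeme (dictionary headword) -->
            <m>...</m>   <!-- code interpreted against the TAN-mor file -->
         </lm>
      </ana>
   </body>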

An <ana> may take a @tok-pop, to specify the number of tokens that the assertion applies to. This is particularly helpful for language-specific files based upon a limited corpus of texts, where the underlying data for the assertion might be difficult or impossible to retrieve. The token population can be used to calibrate levels of certainty, or to compare statistical profiles of one TAN-A-lm file against another.
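
For example, an assertion drawn from a corpus in which the token was observed forty-two times might read (values hypothetical):

   <ana tok-pop="42">
      <tok val="..."/>
      <lm>
         <l>...</l>
         <m>...</m>
      </lm>
   </ana>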

If you wish to make a claim about a single word that straddles more than one token, you should use multiple <tok>s, wrapping them in a <group>.
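
A sketch of the pattern (token values hypothetical):

   <ana>
      <group>
         <tok val="..."/>   <!-- first part of the word -->
         <tok val="..."/>   <!-- second part of the word -->
      </group>
      <lm>
         <l>...</l>
         <m>...</m>
      </lm>
   </ana>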

Any token may be the object of as many <ana>s as you like. In fact, this is preferred if you wish to register competing claims or alternatives.

Claims within an <ana> are distributed. That is, every combination of <l> and <m> (governed by <lm>) is asserted to be true for every <tok> or <group>.
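
For example, the following <ana> (values hypothetical) pairs each <l> with each <m>, making four claims about each of its two tokens:

   <ana>
      <tok val="token-a"/>
      <tok val="token-b"/>
      <lm>
         <l>lexeme-1</l>
         <l>lexeme-2</l>
         <m>code-1</m>
         <m>code-2</m>
      </lm>
   </ana>
   <!-- i.e., for each token: lexeme-1 + code-1, lexeme-1 + code-2,
        lexeme-2 + code-1, lexeme-2 + code-2 -->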

If an <lm> lacks an <l>, the token value itself, as resolved by each <tok>, is taken to be the default value of the lexeme.

All assertions are assumed to be made with 100% confidence unless @cert is invoked. This still holds even when a <tok> is the subject of multiple <ana>s, because it is possible to be completely confident that a given word has two different grammatical profiles in the target text (e.g., puns, wordplay).

Many TAN-A-lm files will be generated by an algorithm that automatically lists all possible morphological values of each token. It is advised that such automatic calculations always include in their output @cert, with weighted values. That is, if an algorithm identifies two possible lexico-morphological profiles for a word, but one occurs nine times more than the other, then it is advised that this be reflected in the two resultant elements, e.g.: <lm cert="0.9">...</lm> and <lm cert="0.1">...</lm>. If an algorithm is written with a more sophisticated way to weigh possibilities, then adjust the value of @cert accordingly. Be certain that the <algorithm> is credited in the <vocabulary-key> and in a <resp>.
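
Continuing the example, such an algorithm might emit (lexemes and codes hypothetical):

   <ana>
      <tok val="..."/>
      <lm cert="0.9">
         <l>...</l>
         <m>...</m>
      </lm>
      <lm cert="0.1">
         <l>...</l>
         <m>...</m>
      </lm>
   </ana>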

As with TAN-A-tok files, not every word needs to be explained or described. In fact, exhaustive annotation is often undesirable, because it produces files that are overly long and time-consuming to validate or process.

A TAN-A-lm file is rendered more efficient when claims can be grouped. If a particular token invariably has a single lexico-morphological profile, this can be declared once, in a <tok> that has no @ref. If the token has a particular profile only in a given region of text, that region can be specified through a @ref that encompasses it. You do not need to provide a <tok> for every token, which would entail restricting @ref to leaf divs. You may do so, but such an approach can result in very long files that are time-consuming to validate, process, and edit. It is more advantageous to declare lexico-morphological properties more generally, thereby replacing numerous leaf-div <tok>s.
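
The two declarative patterns might be sketched as follows (references and values hypothetical):

   <!-- one profile everywhere in the source: no @ref -->
   <ana>
      <tok val="..."/>
      <lm>...</lm>
   </ana>

   <!-- a profile that holds only in one region: @ref encompasses that region -->
   <ana>
      <tok ref="1 4" val="..."/>
      <lm>...</lm>
   </ana>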

The benefits in processing time are significant. In early versions of TAN, the lexico-morphological values of the Greek Septuagint (8.3 MB) were converted to a TAN-A-lm file of 407,811 <tok>s, one per token per leaf div, grouped in 52,703 <ana>s (25.8 MB). Early 2020 validation routines took about 25 minutes (2018 validation routines took hours). The long processing time was due primarily to the TAN-A-lm file itemizing every single token in the text. That same file was revised to be more declarative along the lines advocated above. If a particular token had only one lexico-morphological profile throughout the text, then every instance was reduced to a single <ana>, with no @ref in <tok>. When a particular token value had different lexico-morphological profiles, @ref targeted the rootmost <div> that encompassed them all. This revision resulted in a smaller file (15.8 MB; 158,376 <tok>s in 54,335 <ana>s) that validated in about a third of the time (8.5 minutes).

In general, there is always a trade-off between convenience and efficiency. If your priority is speed, you should break a large file into several smaller ones, perhaps recombining them in a master file via <inclusion> (see the section called “Networked Files”).

Applications can be written to convert TAN-A-lm <m> data from one morphological system to another. This is a two-step process facilitated by the functions tan:morphological-code-conversion-maps() and tan:convert-morphological-codes(). See documentation in these guidelines or in functions/language/TAN-fn-language-extended.xsl.