TextServer
Company Structured Text Software Solutions Customers Support
Demonstrations Documentation Downloads Publications Acknowledgements



The Text Model

Each SGML document indexed by tokenizer.exe has a grammar derived from the content of this document, and a parse tree derived from parsing this document. This parse tree consists of nodes, which identify elements, attributes, comments, and individual words in the document.

Nodes within this parse tree are assigned node numbers which have an ordering consistent with the order in which these nodes would be visited by a preorder traversal of this parse tree. Nodes are also assigned a span. This span is merely the number of nodes which are descendants of this node. The node number and span serve to identify the region of text spanned by any given conceptual node.

Elements span attribute nodes, and words of text bracked by element markup, potentially comments, and processing instructions. Attribute nodes span the words contained within the values associated with these attributes. Comments span the words contained within them. Processing instructions are currently not indexed.

Each structured text value (except null) is associated with a physical text document, indexed using tokenizer.exe. Because this association exists, each structured text value has a grammar. Each structured text value begins at some node, which itself spans some set of nodes.

Each structured text value contains a list of zero or more marks. A mark is merely a node number and span, which spans a collection of nodes contained within the textwithin which it occurs. Marks within this list are ordered by their node number, and within this by their span.

The node numbers which occur in the structured text value may have spans which differ from those associated with this node within the parse tree. In particular any sequence of consecutive words (represented within the parse tree as distinct nodes) will internally be represented by a node beginning at the first word, and spanning all nodes within this sequence.

Collating sequence

Instances of type text have the following ascending collating sequence. Firstly, they will be grouped in no specific order by their input source. Within this they will be sorted by their node number. Within this they will be ordered by their descending span. Finally they will be ordered by their list of marks, with marks themselves order by ascending node number, and within this by descending span.

Internal Structure of a Text Index

The internal structure of a text index is quite complex. An attempt is made to describe it here, primarily so that the potential of this text index may be grasped. The text index is built upon ISAM software also designed and owned by Dealers Choice Software. The structure of this "ISAM" file is therefore first explained.

Internal structure of an ISAM file

A single ISAM file is internally divided into multiple logical subfiles. Each subfile is internally identified by a consecutive integer, beginning with zero.

Block 0 in the file, records both the blocking factor of the file, which may be 1, 2, 4 or 8K bytes/block, and acts as the header for a Unix style of free list, which may span multiple blocks. This free list is used to both register those blocks within the file which are currently free, and to on demand allocate these blocks to arbitrary subfiles.

Block 1 in the file contains a single record recording all active subfile. This record may itself span multiple blocks. Each subfile may optionally have records indexed by a key (of not more than 255 characters), and thus have a root index block. Keys may be either fixed length or variable length. Quasi B-tree's are used to support record keys. These B-trees behave appropriately under expansion, but do not free blocks and contract the index until the block becomes empty. This results in a considerable performance improvement when (as is typically the case) data is added rather than removed from the file, and only moderate degradation in storage optimisation when records are deleted. [In the case of text indices, information within the index is never deleted so the whole issue of how the B-tree's behave under deletion become irrelevant.

Each subfile may also be scanned sequentially, and thus also has a starting data block. Subfiles may either be allocated a fixed number of consecutive blocks which they will permanently occupy, or may be constructed by chaining consecutive blocks together. Subfiles which are preallocated consecutive disk space may be accessed randomly, without the need for either indices or sequential traversal of the blocks in these subfiles. However, the management of the contents of the space occupied by such subfiles is the applications responsibility.

Records in any subfile may be of essentially arbitrary length (the current physical limitation is that block addresses use 32 bit arithmetic). If a record cannot fit into a block, (even if the block is fragmented into multiple blocks according to the standard B-tree rules) then the key of this record is written instead as a "tag". The tag provides a reference to the address where the actual block where the record begins. Such "overflow" records may occupy multiple overflow blocks which are chained. A tag may also contain an index detailing the contents of each such overflow block, allowing these blocks to be read randomly, without the need to walk through chains of blocks.

Multiple threads and processes may concurrently access an ISAM file. Concurrency problems are avoided by low level locking of information within the ISAM file structure. The problem of flushing cached data across all processes whenever any process performs an update to an ISAM file is resolved by monitoring the last update time on the ISAM file.

Text Index Organisation

(1) Text subfile
The text subfile contains the actual text being indexed, if the index program has indicated that the text should be incorporated into the index file. This text is stored within a preallocated set of consecutive blocks. This strategy simplifies the administration of texts, and ensures that the indices refer to the correct text, but denies direct access to the text being indexed by external programs unfamiliar with the internal organisation of the text index.

(2) Vector subfile
The vector subfile provides the primary source of information about the actual text spanned by each text node (element/attribute/word etc). This vector is stored within a preallocated set of consecutive blocks. For any preorder text node number, this vector records the character offset within the text where this text node number begins, the number of characters spanned by this text node, and the number of nodes spanned by this text node. Node 0 represents the invalid node number, node 1 represents the entire text, and node 2 represents the document root of the text tree. Note that text processing instructions and comments may legally lie outside the text spanned by node 2. Leaf nodes (those spanning no other nodes) contain a reference to the nearest ancestor that they are not the left most child of, thus allowing efficient navigation from an arbitrary text node to its parent or ancestor text nodes. [Technically this data structure is referred to as a left-threaded tree]. Each entry in this vector has the same size, allowing arbitrary access to elements of this vector. All values stored in entries of this vector are capable of storing 64 bit integer values. This vector is highly compressed, and is most certainly not human readable without supporting software.

(3) Global subfile
The global subfile currently occupies one block, and records version numbers, the encoding of the text (ascii/unicode/reverse unicode), the GUID of the dll used to translate character strings into words, so that the same dll may later be employed when searching the internal text, the time the text file was created, the size of the DTD and internal text, the number of nodes in the text, and the name of the external text file if any. This record also contains sizing information needed in order to determine precisely how to uncompress information in the text index file.

(4) Grammar subfile
Each element and attribute type encountered within the text file being indexed is recorded by name and type in the grammar subfile. If any numeric values are associated with the words/attribute values spanned by this type then a subfile is allocated for the express purpose of allowing such values to be retrieved using their numeric ordering as well as their lexigraphical ordering. The frequency with which this named type occurs is also recorded for cost optimisation. Finally the minimum and maximum frequency with which every named type occurs beneath this type is recorded, together with an indication of which types occasionally occur as children of the described type. The information in this subfile can be recovered by using the grammar_root, grammar_elements and grammar_hierarchy table functions.

(5) Description subfile
The description subfile contain records that correlate the elements described in the grammar subfile with human readable descriptions of the purpose of these elements. These descriptions are returned by the grammar_elements table function. This subfile may be updated as desired by using the comment program.

(6) Genid subfile
This subfile contains records describing each genid associated with elements in the text. Each such record is indexed by the characters that represent the genid, and contains an ordered list of node numbers rooted at this genid. This list for efficiency also redundantly contains the span of each recorded node number. Each record is highly compressed, using different compression statistics for each block containing this record. When this record overflows a single block the tag referencing it contains an auxiliary index detailing each block used to store this record, and the first node number mentioned within this block. Records in this subfile are used to resolve queries involving genid's.

(7) Attribute subfile
This subfile behaves identically to the genid subfile but records information about attribute names rather than genid names.

(8) Value subfile
This subfile behaves identically to the genid subfile but records information about the words used within attribute values. There is no requirement that the words used as keys within this subfile correspond to the original word used within an attribute value. The mapping here is determined entirely by the tokenizer dll used when indexing the text. In particular words in the text may be associated with zero, one, or multiple representations as strings within the index, none of which need in theory have any obvious correlation to the words being indexed. It must also be stressed that it is entirely up to the tokenizer to determine where word boundaries occur. The only restriction is that keys representing words longer than 256 bytes will be truncated to their first 256 bytes. This means that the maximum effective length of a unicode word is 128 bytes.

(9) Word subfile
This subfile behaves identically to the value subfile but indexes words within the text rather than words within the markup.

(10) Comment subfile
This subfile contains a single record having the structure of records in the word, and value subfile, but recording the locations of comments within the text.

(11) Comment word subfile
This subfile identical in structure to the value and word subfile, records words that occur within comments.

(12) Process instruction nodes
This subfile contains a single record documenting the node numbers of all processor instructions occurring int the text.

(13) Process instruction words
This subfile behaves identically to the value subfile but indexes words within processing instructions rather than words within the markup.

(14) Id mappings
This subfile maps id strings to the node number which has this id.

(15) Genid in namespace
This subfile records for each namespace the genid node numbers that are members of this namespace.

(16) Attributes in namespace
This subfile records for each namespace the attribute node numbers that are members of this namespace.

(17) Entities
This subfile records information about entities that occur in the text. The information in this subfile can be recovered using the grammar_entities table functions.

(18) Namespace scoping
This subfile records the namespace scopes that exist within the input text.

(19) References
This subfile maps attributes identified by node number to the element node number that they contain an idref to.

(20) Anti-references
This subfile maps elements node numbers to the set of attributes that contain idref's to them.

(30) and on..
These subfiles record numeric values occuring directly within specific types of named element or attribute. The mapping to these subfiles is via the Grammar subfile.

Maintainer
webmaster@textserver.com
Back