TextServer



Indexer

An XML indexer.

SYNOPSIS

indexer -iinput -ooutput [-a] [-bblksize] [-ccatalog] [-ddirectory] [-eerrorlimit] [-f] [-gguid] [-mmin_sort_memory] [-mmax_sort_memory] [-mmemorymap_size] [-s] [-ttemp_directory] [-uordering] [-v] [-V] [-w{ignore|caution|verbose}]

string_to_text(string, options)

DESCRIPTION

The program indexer parses and validates the XML document contained in the input file. Some allowance is made for SGML documents which depart from XML, either through design or error. Upon successfully parsing the input text, it outputs a text index suitable for use by TextServer.

The same program can be invoked within TextServer by calling the string_to_text function, which converts an input text into a temporary index.

In either case, the constructed index contains hierarchical indexes that facilitate rapid retrieval of textual components within the indexed document.

The following options are available:

-a
Attributes that are defined in the DTD with a default value but are not present in the text are indexed as if the attribute-value pair occurred at the '>' character terminating the start tag, with the value occupying zero bytes of text.

-b blocksize
This specifies the block size to use when building the text index. Currently acceptable values are 1K, 2K, 4K, and 8K. Small block sizes potentially improve information retrieval speeds when accessing small documents. However, small block sizes may increase the number of index levels used in retrieving records, and may cause an unacceptable number of records, each larger than the block size, to be placed in overflow chains. Very large documents should use larger block sizes, both to reduce the number of levels of indirection needed to recover data, and to increase the physical file sizes supported, since an index may contain at most 2^32-1 logical blocks. The default block size is 4K.
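
The relationship between block size and maximum index size can be worked out directly from the 2^32-1 block limit. The following Python sketch assumes that 1K means 1024 bytes and that the block count is the only constraint on file size; neither assumption is stated above.

```python
# Upper bound on index size per block size, assuming 1K = 1024 bytes
# and that the 2^32 - 1 logical block limit is the only constraint.
MAX_BLOCKS = 2**32 - 1


def max_index_bytes(block_size_k: int) -> int:
    """Largest index size in bytes for a block size given in K."""
    if block_size_k not in (1, 2, 4, 8):
        raise ValueError("acceptable block sizes are 1K, 2K, 4K and 8K")
    return MAX_BLOCKS * block_size_k * 1024


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(f"-b{k}K: at most {max_index_bytes(k)} bytes")
```

Under these assumptions each doubling of the block size doubles the largest index that can be addressed.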

-c catalog
Catalog files contain mappings from public entity names to the filenames containing the contents referenced by these entities. Each such mapping consists of a line in a catalog file containing three parts: the keyword PUBLIC, the public identifier in either single or double quotes, and then the file containing the public entity. For example, the line PUBLIC "-//IETF//DTD HTML Strict Level 1//EN" html-1s.dtd maps that public identifier to the file html-1s.dtd. Multiple catalogs can optionally be specified. They are searched in the order specified.
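
The catalog line format described above can be illustrated with a short Python sketch. This is not the indexer's own parser: the function names are hypothetical, filenames are assumed to contain no spaces, and "searched in the order specified" is taken to mean that the first catalog containing the identifier wins.

```python
import re

# One entry: the keyword PUBLIC, a quoted public identifier, a filename.
# Assumption: the filename contains no whitespace.
CATALOG_LINE = re.compile(
    r"""^\s*PUBLIC\s+(?:"([^"]*)"|'([^']*)')\s+(\S+)\s*$"""
)


def parse_catalog(text):
    """Return {public identifier: filename} for one catalog's text."""
    mapping = {}
    for line in text.splitlines():
        m = CATALOG_LINE.match(line)
        if m:
            public_id = m.group(1) if m.group(1) is not None else m.group(2)
            # Keep the first mapping if an identifier is repeated.
            mapping.setdefault(public_id, m.group(3))
    return mapping


def resolve(public_id, catalogs):
    """Search parsed catalogs in order; return a filename or None."""
    for catalog in catalogs:
        if public_id in catalog:
            return catalog[public_id]
    return None


if __name__ == "__main__":
    first = parse_catalog('PUBLIC "-//IETF//DTD HTML Strict Level 1//EN" html-1s.dtd')
    print(resolve("-//IETF//DTD HTML Strict Level 1//EN", [first]))
```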

-d directory
Specifies the directories in which named entity files may reside. Multiple directories can optionally be specified. They are searched in the order specified.

-e errorlimit
The maximum number of warnings to be produced before terminating in error.

-f
If this option is specified then the index is created anew, even if it already exists.

-g guid
Specifies the GUID of the External Word Tokenizer COM object to be employed in translating the input text into words, numerics, etc. By default, the default external word tokenizer is used. The format of a GUID is: 8 hexadecimal digits followed by a hyphen, then three groups of 4 hexadecimal digits each followed by a hyphen, then 12 hexadecimal digits.
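
The GUID layout described above (8-4-4-4-12 hexadecimal digits) can be checked before invoking the indexer. The Python sketch below assumes that both upper- and lower-case digits are accepted and that no surrounding braces are used; the helper name is hypothetical.

```python
import re

# 8 hex digits, a hyphen, three groups of 4 hex digits each followed
# by a hyphen, then 12 hex digits.
GUID = re.compile(r"[0-9A-Fa-f]{8}-(?:[0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}")


def is_guid(text: str) -> bool:
    """True if text has the GUID layout described for the -g option."""
    return GUID.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_guid("01234567-89ab-cdef-0123-456789abcdef"))
```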

-i input
Specifies the input XML source to index. This may be either a local file name, or a URL beginning with "ftp:", "http:" or "https:".

-m memory
This option may appear up to three times. The memory size specified may optionally be followed by 'K', 'KB', 'M', 'MB', 'G' or 'GB' to convey a size in kilobytes, megabytes, or gigabytes. The first occurrence of this option specifies the minimum memory to be reserved when sorting, the second occurrence specifies the maximum memory to be used when sorting, and the third occurrence, if specified, requests that a memory-mapped file space of the indicated size be associated with the index being built. As a consequence, while the index is being built this file is preallocated to occupy this much space on disk.
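
The positional meaning of repeated -m options can be sketched in Python as follows. The role names are taken from the SYNOPSIS; the helper names are hypothetical, and the sketch assumes binary units (1K = 1024 bytes), upper-case suffixes only, and that a size without a suffix is a byte count, none of which is stated above.

```python
import re

# Assumption: binary multipliers; a bare number is a count of bytes.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
_SIZE = re.compile(r"(\d+)(K|KB|M|MB|G|GB)?")

# Meaning of the first, second and third -m, as named in the SYNOPSIS.
_ROLES = ("min_sort_memory", "max_sort_memory", "memorymap_size")


def parse_size(text: str) -> int:
    """Convert a size such as '64M' or '2GB' to a number of bytes."""
    m = _SIZE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a valid memory size: {text!r}")
    number, suffix = m.groups()
    return int(number) * _UNITS[(suffix or "")[:1]]


def memory_roles(values):
    """Map the -m values, in order of appearance, to their roles."""
    if len(values) > 3:
        raise ValueError("-m may appear at most three times")
    return {role: parse_size(v) for role, v in zip(_ROLES, values)}


if __name__ == "__main__":
    print(memory_roles(["16M", "256MB", "1G"]))
```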

-o outputfilename
This mandatory option specifies the name of the output index to construct.

-s
Instructs the indexer to save a copy of the input in the output index file. This ensures that the indexed text, and indices relating to this text remain in sync, but results in additional storage overheads.

-t directory
Specifies the temporary directory to be used for sorting etc. If not specified, the TEMP directory is employed. Note that a large file is also created in the TEMP directory prior to actually building the index, so it may be desirable both to change the TEMP directory and to redirect sort output to a second, distinct temporary directory.

-u {FFFE|FEFF}
If specified, explicitly identifies the input as being in Unicode, and identifies the ordering of bytes according to the indicated byte order mark.
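
One way to choose a value for -u is to inspect the first two bytes of the input. The Python sketch below assumes that FFFE denotes input beginning with the bytes FF FE (little-endian UTF-16) and FEFF denotes input beginning with FE FF (big-endian UTF-16). That mapping is an interpretation of the option names and is not confirmed above; the function name is hypothetical.

```python
import codecs


def byte_order_option(first_bytes: bytes):
    """Return 'FFFE', 'FEFF', or None when no UTF-16 mark is present.

    Assumption: the option value spells out the first two bytes of the
    input in the order in which they appear in the file.
    """
    if first_bytes[:2] == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "FFFE"
    if first_bytes[:2] == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "FEFF"
    return None


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_BE + "<doc/>".encode("utf-16-be")
    print(byte_order_option(sample))
```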

-v
Print version information identifying the software being executed.

-V
Validate the indexes produced by this execution of the tokenizer. The indices produced should always be valid, but this option essentially checks that the tokenizer is performing as expected.

-w {ignore|caution|verbose}
Indicates how departures from XML in the input source from which the parser can reasonably recover are to be handled. If ignore is specified such departures are unreported. If caution is specified such departures are briefly reported. If verbose is specified such departures are extensively documented.

-w {xml}
Warn about constructs that are not allowed by XML.
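
The options above can be combined into a command line programmatically. The Python sketch below only assembles an argument list and does not run the indexer. It assumes that option values are attached directly to their flags, as written in the SYNOPSIS, and that multiple catalogs or directories are given by repeating -c or -d; the function and parameter names are hypothetical.

```python
def indexer_argv(input_source, output_index, *, block_size=None,
                 catalogs=(), directories=(), memory=(),
                 temp_directory=None, save_input=False, force=False,
                 warnings=None):
    """Build an argument list for indexer from the documented options."""
    if len(memory) > 3:
        raise ValueError("-m may appear at most three times")
    if warnings not in (None, "ignore", "caution", "verbose", "xml"):
        raise ValueError(f"unknown -w value: {warnings!r}")

    # -i and -o are the two mandatory options.
    argv = ["indexer", f"-i{input_source}", f"-o{output_index}"]
    if block_size:
        argv.append(f"-b{block_size}")
    # Assumption: repeated flags express multiple catalogs/directories,
    # kept in the order given because they are searched in that order.
    argv += [f"-c{c}" for c in catalogs]
    argv += [f"-d{d}" for d in directories]
    # Order matters: minimum sort memory, maximum sort memory, map size.
    argv += [f"-m{m}" for m in memory]
    if temp_directory:
        argv.append(f"-t{temp_directory}")
    if save_input:
        argv.append("-s")
    if force:
        argv.append("-f")
    if warnings:
        argv.append(f"-w{warnings}")
    return argv


if __name__ == "__main__":
    print(indexer_argv("book.xml", "book.idx", block_size="8K",
                       memory=("16M", "256M"), force=True))
```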

CAVEATS

XML markup is presumed to delimit words. Thus 'a<b>b</b>c' will be treated as the three words a, b, c rather than the single word abc. Similarly, if CDATA sections are used, 'a<![CDATA[b]]>c' will be treated as three words. It is strongly recommended that CDATA sections not be employed, for the simple reason that when extracting fragments of text from source text potentially containing CDATA sections, it is not possible to distinguish sequences of characters intended to represent data from those intended to represent markup. This may confuse both internal and external software that operates on such fragments of text.
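
The effect of treating markup as a word delimiter can be illustrated with a small Python sketch. This is a simplified illustration and not the indexer's tokenizer: the regular expressions, function names and sample string are all assumptions, and CDATA sections are not handled.

```python
import re

TAG = re.compile(r"<[^>]*>")  # simplified: any start or end tag
WORD = re.compile(r"\w+")


def words_markup_delimits(text):
    """Tags act as word boundaries (the behaviour described above)."""
    return WORD.findall(TAG.sub(" ", text))


def words_markup_ignored(text):
    """Tags are dropped without breaking the surrounding word."""
    return WORD.findall(TAG.sub("", text))


if __name__ == "__main__":
    sample = "data<b>base</b>s"
    print(words_markup_delimits(sample))  # ['data', 'base', 's']
    print(words_markup_ignored(sample))   # ['databases']
```

A search for the word "databases" would therefore not match the sample text when markup delimits words.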

Maintainer
webmaster@textserver.com