|
Indexer
An XML indexer.
SYNOPSIS
indexer
-i input
-o output
[-a]
[-b blksize]
[-c catalog]
[-d directory]
[-e errorlimit]
[-f]
[-g guid]
[-m min_sort_memory]
[-m max_sort_memory]
[-m memorymap_size]
[-s]
[-t temp_directory]
[-u ordering]
[-v]
[-w{ignore|caution|verbose}]
string_to_text(string, options)
DESCRIPTION
The indexer program parses and validates the XML document contained in the input file.
Some allowance is made for SGML documents which depart from XML, either through design or error. Upon
successfully parsing the input text, it outputs a text index suitable for use by TextServer.
This same program can be invoked from within TextServer by calling the string_to_text function, which converts
an input text into a temporary index.
In either case the index constructed contains hierarchical indexes which facilitate rapid retrieval of textual
components within the indexed document.
The following options are available:
-a
- Attributes that are defined in the DTD with default values but are not present in the text are indexed as if the
attribute-value pair occurred at the '>' character terminating the start tag, with the value occupying
zero bytes of text.
-b blocksize
- Specifies the block size to use when building the text index. Currently acceptable values are
1K, 2K, 4K, and 8K. Small block sizes can improve retrieval speed when accessing small
documents. However, small block sizes may increase the number of index levels traversed when retrieving records, and
may cause an unacceptable number of records, each larger than the block size, to be placed in overflow chains.
Very large documents should use larger block sizes, both to reduce the number of levels of indirection needed
to recover data and to increase the physical file size supported, since an index may contain at most 2^32-1
logical blocks. The default block size is 4K.
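The effect of the 2^32-1 block limit on the largest supported index can be worked out directly. The following is a minimal sketch, assuming that "1K" means 1024 bytes (this page does not say) and ignoring any per-file overhead; the function name is hypothetical.

```python
# Upper bound on index size for each accepted block size.
# Assumption: "1K" means 1024 bytes; this page does not state the unit.
MAX_LOGICAL_BLOCKS = 2**32 - 1  # limit quoted in the -b description


def max_index_bytes(block_size_k: int) -> int:
    """Return the largest index size, in bytes, addressable with this block size."""
    if block_size_k not in (1, 2, 4, 8):
        raise ValueError("block size must be 1K, 2K, 4K, or 8K")
    return MAX_LOGICAL_BLOCKS * block_size_k * 1024


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(f"{k}K blocks: at most {max_index_bytes(k)} bytes")
```

Under these assumptions each doubling of the block size doubles the ceiling, from just under 4 TiB at 1K to just under 32 TiB at 8K.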
-c catalog
- Catalog files map public entity names to the files containing the
content these entities reference. Each mapping is a single line in a catalog file consisting of three
parts: the keyword PUBLIC, the public identifier in either single or double quotes, and the name of the
file containing the public entity. For example:
PUBLIC "-//IETF//DTD HTML Strict Level 1//EN" html-1s.dtd
Multiple catalogs may be specified. They are searched in the order specified.
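The three-part catalog line described above can be illustrated with a small parser. This is only a sketch of the format as stated here, not the indexer's own parser: it assumes file names contain no whitespace, silently skips lines that do not match, and lets the first mapping for an identifier win, mirroring the in-order search. The names parse_catalog and _ENTRY are hypothetical.

```python
import re
from typing import Dict, Iterable

# Matches: PUBLIC "public identifier" filename  (single or double quotes).
# Assumption: the file name contains no whitespace.
_ENTRY = re.compile(r"""^\s*PUBLIC\s+(["'])(.*?)\1\s+(\S+)\s*$""")


def parse_catalog(lines: Iterable[str]) -> Dict[str, str]:
    """Map public identifiers to file names; the first mapping seen wins."""
    mapping: Dict[str, str] = {}
    for line in lines:
        match = _ENTRY.match(line)
        if match:
            mapping.setdefault(match.group(2), match.group(3))
    return mapping


if __name__ == "__main__":
    sample = ['PUBLIC "-//IETF//DTD HTML Strict Level 1//EN" html-1s.dtd']
    print(parse_catalog(sample))
```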
-d directory
- Specifies the directories in which named entity files may reside. Multiple directories can optionally be
specified. They are searched in the order specified.
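The in-order search over directories can be sketched as follows. This is an illustration of first-match lookup only; how the indexer itself resolves entity file names is not described on this page, and find_entity_file is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_entity_file(name: str, directories: Iterable[str]) -> Optional[Path]:
    """Return the first existing file called `name`, searching directories in order."""
    for directory in directories:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None  # not found in any of the listed directories
```

With this rule, a file present in two directories is always taken from whichever directory was listed first.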
-e error count
- The maximum number of warnings to be produced before terminating in error.
-f
- If this option is specified then the index is created anew, even if it already exists.
-g guid
- Specifies the GUID of the External Word Tokenizer COM object to
be used to translate the input text into words, numerics, etc. By default, the default external word
tokenizer is used. The format of a GUID is: 8 hexadecimal digits followed by a hyphen, then three groups of 4
hexadecimal digits each followed by a hyphen, then 12 hexadecimal digits.
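The GUID layout described above (8-4-4-4-12 hexadecimal digits) can be checked before invoking the indexer. A minimal sketch, assuming both upper- and lower-case hexadecimal digits are acceptable and that no surrounding braces are used (neither point is stated on this page); is_guid is a hypothetical helper.

```python
import re

# 8 hex digits, then three hyphen-prefixed groups of 4, then a hyphen and 12 hex digits.
# Assumption: either letter case is accepted and the GUID is not wrapped in braces.
_GUID = re.compile(r"[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}")


def is_guid(text: str) -> bool:
    """Return True if `text` has the 8-4-4-4-12 hexadecimal layout."""
    return _GUID.fullmatch(text) is not None
```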
-i input
- Specifies the input XML source to index. This may be either a local file name or a URL beginning with
"ftp:", "http:" or "https:".
-m memory
- This option may appear up to three times. The memory size specified may optionally be followed by 'K', 'KB',
'M', 'MB', 'G' or 'GB' to convey a size in kilobytes, megabytes, or gigabytes. The first occurrence of this option
specifies the minimum memory to be reserved when sorting, the second occurrence specifies the maximum memory to
be used when sorting, and the third occurrence, if specified, requests that a memory-mapped file space of the
indicated size be associated with the index being built. As a consequence, while the index is being built this
file is preallocated to occupy this much space on disk.
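The positional meaning of repeated -m options and the size suffixes can be sketched as below. This is an illustration rather than the indexer's actual parsing: it assumes a kilobyte is 1024 bytes, that a bare number is a byte count, and that suffixes are upper case as shown above. The dictionary keys reuse the names from the SYNOPSIS; the function names are hypothetical.

```python
import re
from typing import Dict, Sequence

# Assumptions: binary multiples (K = 1024), bare numbers are bytes, upper-case suffixes.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
_SIZE = re.compile(r"(\d+)(?:([KMG])B?)?")
_ROLES = ("min_sort_memory", "max_sort_memory", "memorymap_size")


def parse_size(text: str) -> int:
    """Convert a size such as '64M' or '2GB' to a number of bytes."""
    match = _SIZE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised memory size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2) or ""]


def assign_memory_options(values: Sequence[str]) -> Dict[str, int]:
    """Pair the first, second, and third -m values with their roles."""
    if len(values) > len(_ROLES):
        raise ValueError("-m may appear at most three times")
    return {role: parse_size(value) for role, value in zip(_ROLES, values)}
```

For instance, `assign_memory_options(["16M", "256MB"])` sets only the two sort limits and leaves the memory-mapped size unspecified.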
-o outputfilename
- This mandatory option specifies the output index name to construct.
-s
- Instructs the indexer to save a copy of the input in the output index file. This ensures that the indexed
text and the indices relating to it remain in sync, but incurs additional storage overhead.
-t directory
- Specifies the temporary directory to be used for sorting etc. If not specified, the TEMP directory is
used. Note that a large file is also created in the TEMP directory prior to actually building the index,
so it may be desirable both to change the TEMP directory and to redirect sort output to a second, distinct
temporary directory.
-u {FFFE|FEFF}
- Explicitly identifies the input as being in Unicode, and identifies the ordering of bytes
according to the indicated mark character.
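When it is unclear which ordering applies to a file, the leading bytes can be inspected for a UTF-16 byte order mark. The sketch below only reports what the file starts with; how the two -u values correspond to these byte sequences is not spelled out on this page, so that mapping is deliberately left out. The function name is hypothetical.

```python
import codecs
from typing import Optional


def detect_utf16_order(head: bytes) -> Optional[str]:
    """Report the byte order implied by a leading UTF-16 byte order mark, if any."""
    # Note: a UTF-32 little-endian mark also begins with FF FE; this sketch ignores UTF-32.
    if head.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    if head.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    return None  # no mark present; the ordering cannot be inferred this way
```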
-v
- Print version information identifying the software being executed.
-V
- Validate the indexes produced by this execution of the tokenizer. The indices produced should always be
valid, but this option essentially checks that the tokenizer is performing as expected.
-w {ignore|caution|verbose}
- Indicates how departures from XML in the input source from which the parser can reasonably recover are to
be handled. If ignore is specified such departures are unreported. If caution is specified such
departures are briefly reported. If verbose is specified such departures are extensively documented.
-w {xml}
- Warn about constructs that are not allowed by XML.
CAVEATS
XML markup is presumed to delimit words. Thus 'a<b>c' will be treated as the three words a, b, c
rather than a single word abc. Similarly, if CDATA sections are used, 'a<![CDATA[b]]>c' will be treated as
three words.
It is strongly recommended that CDATA sections not be employed, for the simple reason that when extracting
fragments of text from source text potentially containing CDATA sections, it is not possible to distinguish
sequences of characters intended to represent data, from those intended to represent markup. This may
confuse both internal and external software that operates on such fragments of text.
|