INTRODUCTION:
TextServer is an OLE DB data service provider that allows structured XML text to be efficiently extracted from
XML documents and transformed for subsequent processing or presentation. As such, TextServer should be of
interest to anyone who wants to index and exploit the full information content of existing and future XML
documents.
Structured text is any text that contains markup clarifying how the text is to be presented and interpreted.
Historically, text providers developed ad hoc schemes for marking up text. SGML (Standard Generalized Markup
Language) is an industry standard defining how text should be marked up. HTML (Hypertext Markup Language),
which conforms to the SGML standard, is universally used to allow client browsers to present textual
information obtained via the World Wide Web. HTML focuses on how a document should be presented. As a
consequence, it contains few descriptive rules that would allow the text to be searched intelligently. There is
thus a growing interest in XML (Extensible Markup Language), which defines a small workable subset of SGML and
allows descriptive markup to be inserted into text and interpreted without regard to context.
At present, there are few commercially available products for searching structured text. As the volume of text
on the World Wide Web grows exponentially, it will be increasingly difficult to find useful information.
Structured text not only allows for more effective and efficient searching of text on the internet or CD-ROM,
but also enables the integration of text with conventional database products. Data mining is another potential
application. In a world that sees value in information, there will be a demand for more effective management
of text-based information.
BACKGROUND:
Originally, computers were viewed as glorified calculators. They could perform numeric computations very
rapidly, and were of immediate use to mathematicians and scientists. Later, as the business world began to
appreciate the potential of computers, they were used to store corporate records. These records were
initially managed through ad hoc software, but it soon became clear that database management systems were
needed to simplify the task of managing business records.
Historically, relational database technology has been widely used to manage information. However, until
comparatively recently, database technology largely ignored the issue of how textual information (including
information incorporating graphical images, such as the page that you are reading) might be managed,
manipulated and presented. This is perhaps surprising, since most information is textual in nature.
The development of the World Wide Web, the availability of ever faster computers, and the ability to make vast
amounts of material cheaply available to all have led to a revolution in the types of information that we
wish to store and retrieve, and present new challenges to software developers everywhere.
TEXTSERVER:
TextServer is a key component that gives you the tool you need to translate large, highly structured documents
into manageable fragments of text that can be rapidly retrieved based on both context and content. TextServer
dynamically translates the requested fragments of text into virtual SQL tables. TextServer implements a full,
self-contained SQL2 query engine, which allows the resulting texts to be related and manipulated. This SQL2
query engine supports the SQL2 information schema, multiple users, schemas, and catalogs, and allows initially
unknown texts to be externally described in sufficient detail to allow intelligent use of these texts by
informed users and applications.
An SGML document of essentially arbitrary size must first be indexed. TextServer includes a tool that will
parse an SGML document, construct an index file describing this document, and transform this SGML document
into XML for subsequent retrieval and processing. This index describes the entities used in the document, the
inclusion relationships between these entities, and the various attributes of each entity. Using this index,
document fragments can be marked and/or retrieved by entity name, attribute name, attribute value, textual
content, and according to the actual inclusion relationships present within the document. The index allows
this same marking and retrieval process to be applied to any retrieved fragment of text, and also facilitates
arbitrary traversal of the indexed text from any identifiable fragment of this text.
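The kind of index described above can be illustrated with a minimal Python sketch. It is not TextServer's
index format or indexing tool. The function build_index, the SAMPLE document and the path notation are all
hypothetical, and the sketch assumes that each XML element plays the role of an "entity" in the sense used
above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document, standing in for text already converted to XML.
SAMPLE = """<book id="b1">
  <chapter n="1"><title>Origins</title><para>Early markup schemes.</para></chapter>
  <chapter n="2"><title>XML</title><para lang="en">Descriptive markup.</para></chapter>
</book>"""


def build_index(xml_text):
    """Build a toy index of element names, inclusions, attributes and words."""
    root = ET.fromstring(xml_text)
    index = {
        "entities": defaultdict(list),    # element name -> [path, ...]
        "inclusions": set(),              # (parent name, child name) pairs
        "attributes": defaultdict(list),  # (attr name, attr value) -> [path, ...]
        "words": defaultdict(set),        # lowercase word -> {path, ...}
    }

    def visit(elem, path):
        index["entities"][elem.tag].append(path)
        for name, value in elem.attrib.items():
            index["attributes"][(name, value)].append(path)
        # Only the text directly inside the element is indexed here;
        # text that follows a child element is ignored to keep the sketch short.
        for raw in (elem.text or "").split():
            word = raw.strip(".,;:").lower()
            if word:
                index["words"][word].add(path)
        counts = defaultdict(int)
        for child in elem:
            counts[child.tag] += 1
            index["inclusions"].add((elem.tag, child.tag))
            visit(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    visit(root, f"/{root.tag}[1]")
    return index


index = build_index(SAMPLE)
```

With this structure, a lookup by entity name is index["entities"]["chapter"], a lookup by attribute value is
index["attributes"][("lang", "en")], and a lookup by content is index["words"]["markup"]. Because every entry
is a path, any retrieved fragment can serve as the starting point for a further lookup or traversal.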
The second major component of TextServer is an OLE DB service provider. When presented with SQL2 queries
describing how fragments of text are to be recovered, marked and transformed for subsequent processing, this
service provider returns the result set as a virtual relational table. This allows the result set to be
integrated with and related to other result sets delivered by other OLE DB or vendor-specific sources. Because
TextServer is a conforming OLE DB service provider, it can also be employed by any OLE DB aware application. In
particular, it can be invoked through ActiveX Data Objects (ADO) using VBScript, JavaScript, or any other
scripting language capable of employing ADO. It would thus be very natural to employ TextServer under any web
server that supports server-side Active Server Pages, possibly in conjunction with client-side scripting or
applets.
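The idea of presenting text fragments as a relational table that can be joined with other data can be
sketched in Python. This sketch uses the standard sqlite3 module as a stand-in for a SQL engine. It does not
use OLE DB, ADO or TextServer itself, and the table names, the DOC sample and the PRICES data are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical structured text.
DOC = """<catalog>
  <item sku="A1"><name>Widget</name><note>Ships in two days.</note></item>
  <item sku="B2"><name>Gadget</name><note>Back ordered.</note></item>
</catalog>"""

# Hypothetical data from a conventional (non-text) source.
PRICES = [("A1", 9.50), ("B2", 24.00)]


def fragments_as_rows(xml_text, tag):
    """Yield one (sku, name, note) row per matching fragment."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(tag):
        yield (elem.get("sku"), elem.findtext("name"), elem.findtext("note"))


conn = sqlite3.connect(":memory:")

# The text fragments become an ordinary table ...
conn.execute("CREATE TABLE item_text (sku TEXT, name TEXT, note TEXT)")
conn.executemany("INSERT INTO item_text VALUES (?, ?, ?)",
                 fragments_as_rows(DOC, "item"))

# ... which can then be related to data from another source.
conn.execute("CREATE TABLE price (sku TEXT, amount REAL)")
conn.executemany("INSERT INTO price VALUES (?, ?)", PRICES)

rows = conn.execute(
    "SELECT t.name, p.amount FROM item_text AS t "
    "JOIN price AS p ON p.sku = t.sku "
    "WHERE t.note LIKE '%ordered%' ORDER BY t.name"
).fetchall()
```

The sketch copies the fragments into a real table, whereas the text above describes a virtual table produced
on demand. The join itself is the point: once fragments are rows, they can be related to any other result set.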
TextServer is essentially language neutral (although diagnostic messages are available only in English at the
present time). Internally, all strings employ Unicode to maximise the usefulness of the product, and the rules
defining how words are to be tokenized and indexed are established in a provided DLL. If desired, the rules
governing such issues as capitalisation, hyphenation, synonyms, and the formats and interpretation of numeric
values, dates and times, etc. may be redefined in a substitute DLL conforming to the same TextServer interface
specification. Each text is associated with an appropriate DLL when that text is initially indexed. Thus
different texts may use very different indexing schemes, and have radically different interpretations, while
being internally fully compatible with each other.
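The arrangement of replaceable tokenization rules, chosen per text at indexing time, can be sketched in
Python. The class TokenRules, its fields and the RULES_BY_TEXT mapping are hypothetical and do not reflect the
actual TextServer DLL interface, which is not described here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TokenRules:
    """Hypothetical stand-in for the rule set a tokenization DLL would supply."""
    fold_case: bool = True        # capitalisation rule
    split_hyphens: bool = False   # hyphenation rule
    synonyms: dict = field(default_factory=dict)  # term -> preferred term

    def tokenize(self, text):
        # Words are runs of Unicode letters and digits; hyphenated words are
        # kept whole unless split_hyphens is set.
        word = r"[^\W_]+"
        pattern = word if self.split_hyphens else word + r"(?:-" + word + r")*"
        tokens = re.findall(pattern, text)
        if self.fold_case:
            tokens = [t.casefold() for t in tokens]
        return [self.synonyms.get(t, t) for t in tokens]


# Two rule sets with the same interface but different behaviour.
default_rules = TokenRules()
substitute_rules = TokenRules(split_hyphens=True, synonyms={"colour": "color"})

# Each text is bound to one rule set when it is first indexed.
RULES_BY_TEXT = {"manual.xml": default_rules, "statutes.xml": substitute_rules}


def index_terms(text_name, content):
    """Tokenize content using the rule set associated with the named text."""
    return RULES_BY_TEXT[text_name].tokenize(content)
```

Under these assumptions, index_terms("manual.xml", "Full-Text Colour") keeps the hyphenated word intact, while
the same call for "statutes.xml" splits it and maps "colour" to "color". Both texts are still indexed through
the same interface.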
TextServer extends the SQL2 language by providing support for functions that manipulate text and marks within
text, for converting requested fragments of text (wherever they might be found within one or more texts) into
new relational tables that preserve the contexts between textual fragments, and for transducing fragments of
XML text into HTML or other encodings, based on rules that may be encoded either within strings or, if
desired, within external text files.
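Rule-driven transduction of an XML fragment into HTML can be sketched in Python. The "name = tag" rule syntax,
the function names and the sample fragment are assumptions made for illustration. TextServer's own rule
language and SQL functions are not shown here.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rules held in a string: XML element name = HTML tag.
# The same text could equally be read from an external file.
RULES_TEXT = """
article = div
heading = h1
para    = p
term    = em
"""


def parse_rules(text):
    """Parse 'source = target' lines into a dictionary, skipping other lines."""
    rules = {}
    for line in text.splitlines():
        if "=" in line:
            source, target = line.split("=", 1)
            rules[source.strip()] = target.strip()
    return rules


def transduce(elem, rules):
    """Render elem as HTML; elements without a rule contribute only their content."""
    parts = [escape(elem.text or "")]
    for child in elem:
        parts.append(transduce(child, rules))
        parts.append(escape(child.tail or ""))  # text following the child
    inner = "".join(parts)
    target = rules.get(elem.tag)
    return f"<{target}>{inner}</{target}>" if target else inner


fragment = ET.fromstring(
    "<article><heading>Markup</heading>"
    "<para>XML allows <term>descriptive</term> markup &amp; more.</para></article>"
)
html_out = transduce(fragment, parse_rules(RULES_TEXT))
```

Keeping the rules as data rather than code means the same fragment can be rendered into a different encoding
simply by supplying a different rule string.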