TextServer



INTRODUCTION:

TextServer is an OLE DB data service provider that allows structured XML text to be efficiently extracted from XML documents and transformed for subsequent processing or presentation. As such, TextServer should be of interest to anyone who wants to index and exploit the full information content of existing and future XML documents.

Structured text is any text that contains markup clarifying how the text is to be presented and interpreted. Historically, text providers developed ad hoc schemes for marking up text. SGML (Standard Generalized Markup Language) is an industry standard defining how text should be marked up. HTML (Hypertext Markup Language), which conforms to the SGML standard, is universally used to allow client browsers to present textual information obtained via the World Wide Web. HTML focuses on how a document should be presented. As a consequence, it contains few descriptive rules that would allow the text to be searched intelligently. There is thus a growing interest in XML (Extensible Markup Language), which defines a small workable subset of SGML and allows descriptive markup to be inserted into text and interpreted without regard to context.

At present, few commercially available products support searching structured text. As the volume of text on the World Wide Web grows exponentially, it will become increasingly difficult to find useful information. Structured text not only allows for more effective and efficient searching of text on the internet or on CD-ROM, but also enables the integration of text with conventional database products. Data mining is another potential application. In a world that sees value in information, there will be a demand for more effective management of text-based information.

BACKGROUND:

Originally, computers were viewed as glorified calculators. They could perform numeric computations very rapidly, and were of immediate use to mathematicians and scientists. Later, as the business world began to appreciate the potential of computers, they were used to store corporate records. These records were initially managed through ad hoc software, but it soon became clear that database management systems were needed to simplify the task of managing business records.

Historically, relational database technology was widely used to manage information. However, until comparatively recently, database technology largely ignored the issue of how textual information (including information incorporating graphical images, such as the page that you are reading) might be managed, manipulated and presented. This is perhaps surprising, since most information is textual in nature.

The development of the World Wide Web, the availability of ever faster computers, and the ability to make vast amounts of material cheaply available to all have led to a revolution in the types of information which we wish to store and retrieve, and present new challenges to software developers everywhere.

TEXTSERVER:

TextServer is a key component designed to provide you with the tool you need to translate large, highly structured documents into manageable fragments of text that can be rapidly retrieved based on both context and content. TextServer dynamically translates the requested fragments of text into virtual SQL tables. TextServer implements a full, self-contained SQL2 query engine which allows the resulting texts to be related and manipulated. This SQL2 query engine supports the SQL2 information schema, multiple users, schemas, and catalogs, and allows initially unknown texts to be externally described in sufficient detail to allow intelligent use of these texts by informed users and applications.

An SGML document of essentially arbitrary size must first be indexed. TextServer includes a tool which will parse an SGML document, construct an index file describing this document, and transform this SGML document into XML for subsequent retrieval and processing. This index describes the entities used in the document, the inclusion relationships between these entities, and the various attributes of each entity. Using this index, document fragments can be marked and/or retrieved by entity name, attribute name, attribute value, textual content, and according to the actual inclusion relationships present within the document. The index allows this same marking and retrieval process to be applied to any retrieved fragment of text, and also facilitates arbitrary traversal of the indexed text from any identifiable fragment of this text.
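The kind of index described above can be illustrated with a minimal Python sketch. It is not TextServer's indexing tool or index file format; the sample document, the function names and the dictionary layout are all assumptions made for illustration, and the standard xml.etree.ElementTree parser stands in for the real SGML parser.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document, already in XML form.
SAMPLE = """<book id="b1">
  <chapter id="c1" title="Intro"><para>Structured text</para></chapter>
  <chapter id="c2" title="Usage"><para lang="en">Indexed text</para></chapter>
</book>"""

def build_index(xml_text):
    """Index elements by name, by attribute value, and by inclusion."""
    root = ET.fromstring(xml_text)
    index = {
        "by_name": defaultdict(list),   # element name -> elements
        "by_attr": defaultdict(list),   # (attribute, value) -> elements
        "includes": defaultdict(set),   # parent name -> names of its children
        "parent": {},                   # element -> enclosing element
    }
    for element in root.iter():
        index["by_name"][element.tag].append(element)
        for name, value in element.attrib.items():
            index["by_attr"][(name, value)].append(element)
        for child in element:
            index["includes"][element.tag].add(child.tag)
            index["parent"][child] = element
    return index

def text_of(element):
    """Return the textual content of a fragment, markup removed."""
    return "".join(element.itertext()).strip()

index = build_index(SAMPLE)
chapters = index["by_name"]["chapter"]                # retrieve by entity name
usage = index["by_attr"][("title", "Usage")][0]       # retrieve by attribute value
second_para = index["by_name"]["para"][1]
enclosing = index["parent"][second_para]              # traverse outwards
```

Because each retrieved fragment is itself an element, the same lookups can be repeated on it, which mirrors the claim that marking and retrieval apply equally to any retrieved fragment.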

The second major component of TextServer is an OLE DB service provider. When presented with SQL2 queries describing how fragments of text are to be recovered, marked and transformed for subsequent processing, this service provider returns the result set as a virtual relational table. This allows the result set to be integrated with and related to other result sets delivered by other OLE DB or vendor-specific sources. Because TextServer is a conforming OLE DB service provider, it can also be employed by any OLE DB aware application. In particular, it can be invoked through ActiveX Data Objects (ADO) using VBScript, JavaScript, or any other scripting language capable of employing ADO. It would thus be very natural to employ TextServer under any web server that supports server-side Active Server Pages, possibly in conjunction with client-side scripting or applets.
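The idea of treating retrieved text fragments as a relational table, and relating them to conventional data, can be sketched as follows. This is only an analogy: Python's built-in sqlite3 module stands in for the TextServer query engine and OLE DB provider, and the table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for a virtual table of text fragments.
    CREATE TABLE fragment (doc_id TEXT, entity TEXT, content TEXT);
    -- Stand-in for a conventional table from another data source.
    CREATE TABLE catalogue (doc_id TEXT PRIMARY KEY, author TEXT);
""")
conn.executemany(
    "INSERT INTO fragment VALUES (?, ?, ?)",
    [
        ("d1", "title", "Structured Text"),
        ("d1", "para", "XML is a subset of SGML."),
        ("d2", "para", "Relational tables store records."),
    ],
)
conn.executemany(
    "INSERT INTO catalogue VALUES (?, ?)",
    [("d1", "A. Author"), ("d2", "B. Writer")],
)

# Select fragments by context (entity) and content, then relate
# them to the conventional table through the document identifier.
rows = conn.execute("""
    SELECT c.author, f.content
    FROM fragment AS f
    JOIN catalogue AS c ON c.doc_id = f.doc_id
    WHERE f.entity = 'para' AND f.content LIKE '%SGML%'
""").fetchall()
```

Once fragments are exposed as rows, ordinary joins and filters apply to them in the same way as to any other table.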

TextServer is essentially language neutral (although diagnostic messages are available only in English at the present time). Internally, all strings employ Unicode to maximise the usefulness of the product, and the rules defining how words are to be tokenized and indexed are established in a provided DLL. If desired, rules governing such issues as capitalisation, hyphenation, synonyms, the formats and interpretation of numeric values, dates and times, etc. may be redefined in a substitute DLL conforming to the same TextServer interface specification. Each text is associated with an appropriate DLL when that text is initially indexed. Thus different texts may use very different indexing schemes and have radically different interpretations, while being internally fully compatible with each other.
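The notion of replaceable tokenization rules can be sketched in Python as two interchangeable functions sharing one signature. This is an illustration only: the real rules live in a DLL whose interface is not described here, and the function names and the specific rules shown (case folding, hyphen handling) are assumptions.

```python
import re
from typing import Callable, Dict, List

# Any rule set is a function from text to a list of tokens.
Tokenizer = Callable[[str], List[str]]

def default_rules(text: str) -> List[str]:
    """Fold case; keep a hyphenated word as a single token."""
    return [w.lower() for w in re.findall(r"[^\W_]+(?:-[^\W_]+)*", text)]

def split_hyphen_rules(text: str) -> List[str]:
    """Substitute rule set: preserve case and split at hyphens."""
    return re.findall(r"[^\W_]+", text)

def build_word_index(text: str, rules: Tokenizer) -> Dict[str, List[int]]:
    """Map each token to the token positions at which it occurs."""
    index: Dict[str, List[int]] = {}
    for position, token in enumerate(rules(text)):
        index.setdefault(token, []).append(position)
    return index

# Each text is indexed with the rule set chosen for it.
first = build_word_index("Full-text search of Full-Text data", default_rules)
second = build_word_index("Full-text search", split_hyphen_rules)
```

Both indexes have the same shape even though they were built under different rules, which is the property that lets differently indexed texts be used side by side.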

TextServer extends the SQL2 language with functions that manipulate text and marks within text, that convert requested fragments of text (wherever they might be found within one or more texts) into new relational tables which preserve the contexts between textual fragments, and that transduce fragments of XML text into HTML or other encodings, based on rules that may be encoded either within strings or, if desired, within external text files.
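Rule-driven transduction of an XML fragment into HTML can be sketched as below. The rule format (an element name mapped to an HTML template string) and the function name are assumptions made for illustration; they are not TextServer's rule syntax or its SQL2 extension functions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rules: XML element name -> HTML template string.
# "{}" marks where the converted content of the element is placed.
RULES = {
    "article": "<div>{}</div>",
    "heading": "<h1>{}</h1>",
    "para": "<p>{}</p>",
    "term": "<em>{}</em>",
}

def transduce(element, rules):
    """Recursively rewrite an XML element into HTML using the rules."""
    parts = [escape(element.text or "")]
    for child in element:
        parts.append(transduce(child, rules))
        parts.append(escape(child.tail or ""))
    # Elements without a rule contribute only their content.
    template = rules.get(element.tag, "{}")
    return template.format("".join(parts))

fragment = ET.fromstring(
    "<article><heading>XML &amp; SGML</heading>"
    "<para>A <term>mark</term> is kept.</para></article>"
)
html_text = transduce(fragment, RULES)
```

Since the rules are plain strings, the same mapping could equally be loaded from an external text file before the conversion is run.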

Maintainer
webmaster@textserver.com