TextServer



Introduction

TextServer provides extended support for employing Xpath expressions directly within SQL. The set of nodes recovered by an Xpath expression is represented within TextServer as a set of zero or more marks within an output text. Apart from the marks it contains, this output text is identical to the input text. Each mark is simply a start node number, together with the range of node numbers that the mark spans. The advantage of treating sets of nodes as marks within a text, rather than as a new set of texts, is that the context (documenting where the nodes originate within a text) is not lost. In addition, this approach lends itself to presenting fragments of text with suitable highlighting of the marked sections.
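As an illustration of this representation, the following minimal Python sketch models a marked text as a list of nodes plus a list of marks. The names Mark, MarkedText, with_marks and highlighted are hypothetical and are not part of TextServer. It is also an assumption here that the range counts the nodes spanned, including the start node.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mark:
    start: int  # node number at which the mark begins
    span: int   # assumed: number of nodes covered, including the start node


@dataclass
class MarkedText:
    nodes: list                                # the text's nodes, indexed by node number
    marks: list = field(default_factory=list)  # zero or more marks

    def with_marks(self, marks):
        # The output text shares the input's nodes; only the marks differ.
        return MarkedText(self.nodes, list(marks))

    def highlighted(self):
        # Render the whole text, bracketing marked nodes so context is kept.
        marked = {n for m in self.marks for n in range(m.start, m.start + m.span)}
        return " ".join(f"[{w}]" if i in marked else w
                        for i, w in enumerate(self.nodes))


text = MarkedText(["the", "quick", "brown", "fox"])
result = text.with_marks([Mark(start=1, span=2)])
print(result.highlighted())  # the [quick] [brown] fox
```

Because the marked result still carries every node of the input, the unmarked words around a match remain available for display.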

Xpath implementation and extensions

Some trivial modifications are made to SQL2 in order to make TextServer more compatible with Xpath. The DIV keyword may be used anywhere as an alternative to the standard '/' divide operator, and the MOD operator, with the semantics described in the Xpath specification, may also be used anywhere. In some circumstances the '/' may not be employed as a division operator when constructing Xpath expressions, because of the resulting ambiguity.

The boolean functions true(), false(), and unknown() are, for completeness with Xpath, defined as valid functions in SQL, and may be used anywhere within SQL.

The Xpath implementation is extended to be fully compatible with SQL. An arbitrary SQL predicate may occur within any Xpath nodeset filter. The SQL predicates are in turn extended to include arbitrary Xpath expressions as pseudo predicates. The RANGE keyword (proposed for matching ranges of context nodes in XQuery) is changed to the BETWEEN keyword for consistency with SQL. Thus chapter[1] is shorthand for chapter[position() = 1], while chapter[between 2 and 3] is shorthand for chapter[position() between 2 and 3], which is equivalent to chapter[position() >= 2 and position() <= 3].
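The equivalence of these shorthand forms can be illustrated with a small Python sketch. The helper filter_position and the sample chapter list are hypothetical. The sketch only models the 1-based position() test, not TextServer's evaluator.

```python
def filter_position(nodes, predicate):
    # position() counts from 1 within the nodeset being filtered.
    return [n for pos, n in enumerate(nodes, start=1) if predicate(pos)]


chapters = ["ch1", "ch2", "ch3", "ch4"]

# chapter[1]  is shorthand for  chapter[position() = 1]
first = filter_position(chapters, lambda pos: pos == 1)

# chapter[between 2 and 3]  is shorthand for  chapter[position() between 2 and 3]
between = filter_position(chapters, lambda pos: 2 <= pos <= 3)

# ... which is equivalent to  chapter[position() >= 2 and position() <= 3]
expanded = filter_position(chapters, lambda pos: pos >= 2 and pos <= 3)
```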

The context node employed when evaluating nodeset filters is typically implicit within Xpath. However, there may be cases where it is desirable to employ it explicitly within an appropriate SQL function. For this reason the function context() when employed within an Xpath expression evaluates to a text node corresponding to the current context node and span, having no internal marks.

Because Xpath is being integrated with SQL, it seems undesirable to have different rules for describing variables (such as column references) inside and outside Xpath expressions. Therefore, variable names should not be preceded by a '$' in TextServer's syntax.

Adopting this approach necessitates some means of distinguishing an arbitrary Xpath expression from an arbitrary standard SQL expression. Most integral Xpath expressions must therefore be preceded by a '/'. The sole exception to this rule is that degenerate standalone Xpath functions may, when used in SQL boolean expressions, optionally omit the preceding '/'. To root the context of an expression at an arbitrary node (such as the root node) the text function root_text(...) [rather than a '/'] would be used. The main impact of this rule is that Xpath expressions occurring within predicates, as allowed for within the Xpath specification, must also be preceded by a '/'. Thus, for example, the Xpath expression chapter/section[topic and @note and comment()] would be expressed either as /chapter/section[/topic and /@note and comment()] or as /chapter/section[/topic and /@note and /comment()].

The set of functions that conventionally act as filters of nodesets on a given axis are extended to include:

*
Any node of the principal node type is matched.
name
Any node of the principal node type having the indicated name is matched. To avoid conflict with SQL reserved words, it is recommended that this name be double quoted.
namespace : *
Any node in the indicated namespace is matched. To avoid conflict with SQL reserved words, it is recommended that this name be double quoted.
namespace : name
Only nodes in the indicated namespace having the indicated name are matched.
node()
Essentially a no-op. Matches whatever the indicated axis matches.
node(set expression)
The set expression is evaluated using the same immediate ancestor as its node of the xpath expression. Only nodes within the resulting set expression may occur at this level of the xpath tree.
text()
The text() function may be employed on the child/descendant axis of (1) attributes to recover attribute values, (2) comments to recover the text of a comment, (3) processing instructions to recover the text of a processing instruction, and/or (4) other types of node to recover the sections of text occurring outside of markup, under elements.

word()
This test is similar to the text() function, but stipulates that the marks within the nodeset associated with this step are to be those associated with each word of the text, rather than the nodes associated with each fragment of text, as defined in the Xpath standard.

marks()
Each input text has associated with it a set of zero or more marks. This function specifies that only marks within this input nodeset may be selected as candidates for inclusion within the output set of marks.

contains(string [, integer])
Computes the marks (i.e. node spans) which minimally contain, in sequence, all of the words occurring within string. If no integer is specified (or if it is negative), arbitrary numbers of words may occur between consecutive words in this sequence. Otherwise at most the specified number of additional words may be subsumed by each such mark. The parameters to this function may be derived from any appropriate SQL expression. Note that the text context determines the definition of what constitutes a word. There is no intrinsic requirement that the words occurring under a text match in any visual fashion the words occurring within the specified string.

starts-with(string [, integer])
Computes the set of word nodes beginning at the first node number within the prior nodeset which contain words matching the specified string. Otherwise identical to the contains() function.

equals(string [, integer])
Similar to the starts-with() function, but also requires that the last node in the region being tested is the last node under this node's parent.

like(string [, integer])
Behaves in a fashion similar to contains(), but interprets '%' and '_' as wildcards. When these wildcards occur adjacent to characters within a word, character wildcarding within that word is implied. When occurring in other contexts, these symbols are interpreted as wildcarding unknown words in the text.

starts-like(string [, integer])
Computes the set of word nodes beginning at the first node number within the prior nodeset which are like words matching the specified string. Otherwise identical to the like() function.

numeric(double1 [, double2 ])
Computes the set of word nodes whose content, when interpreted as a double (according to the semantics of the text), is indexed as having the indicated value or, if two parameters are specified, a value within the indicated range.

lang(string)
Computes the set of nodes within the prior nodeset which are under an element which qualifies the language of subordinate nodes as belonging to the specified language.
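The character wildcarding described for like() can be sketched in Python as follows. This is only a sketch under stated assumptions: the name like_word is hypothetical, '%' is taken to match any run of characters (including none) and '_' exactly one character, as in SQL LIKE, and the word-level wildcarding that applies when the symbols stand alone is not modelled.

```python
import re


def like_word(pattern, word):
    # Translate the SQL LIKE wildcards into a regular expression,
    # escaping every other character so it matches literally.
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # assumed: any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # assumed: exactly one character
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), word, flags=re.DOTALL) is not None
```

For example, like_word("te%", "text") and like_word("te_t", "text") both hold, while like_word("te_t", "tet") does not.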

Note that the contains(), starts-with(), and lang() functions, in computing nodesets within TextServer, provide more power than in the Xpath specification, where they operate as boolean functions returning either true or false. This historic functionality is still available: when used without qualification within a boolean expression they evaluate to true when they compute a non-empty nodeset, and to false when they compute the empty nodeset. Also note that the contains() and starts-with() functions depart from the Xpath specification, which defines them as having two string parameters. It is not possible to say whether one string contains words specified in another string without resorting to a definition of what constitutes a word, and in the absence of an input text no such definition exists.
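The nodeset behaviour of contains(string [, integer]) and its boolean interpretation can be sketched in Python. This is a simplified model rather than TextServer's algorithm: words are assumed to be whitespace-separated tokens compared by plain equality (in TextServer the text context defines what a word is), spans are (start, end) word positions, and the parameter name max_extra is hypothetical.

```python
def contains(words, query, max_extra=-1):
    """Return the minimal (start, end) spans holding the query words in order."""
    wanted = query.split()
    if not wanted:
        return []
    spans = []
    for start, word in enumerate(words):
        if word != wanted[0]:
            continue
        pos, found = start, True
        for w in wanted[1:]:
            try:
                pos = words.index(w, pos + 1)  # earliest later occurrence
            except ValueError:
                found = False
                break
        if found:
            spans.append((start, pos))
    # Keep only minimal spans: drop any span that encloses another span.
    minimal = [s for s in spans
               if not any(o != s and s[0] <= o[0] and o[1] <= s[1] for o in spans)]
    if max_extra >= 0:
        # Limit the number of additional words subsumed by each span.
        minimal = [(a, b) for a, b in minimal
                   if (b - a + 1) - len(wanted) <= max_extra]
    return minimal


words = "the quick brown fox jumps over the lazy dog".split()
print(contains(words, "the dog"))          # [(6, 8)]
print(contains(words, "quick fox", 0))     # []
print(bool(contains(words, "lazy cat")))   # False
```

Used as a boolean, the result is simply tested for emptiness, which mirrors the rule that a non-empty nodeset evaluates to true.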

The set of axes used to traverse to desired nodesets is extended to include:

=>
The dereference axis (proposed in XQuery) takes those nodes within the prior nodeset which are attributes, computes the values associated with these attributes, and recovers those element nodes with matching IDs.
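The dereference step can be sketched with Python's standard xml.etree.ElementTree module. The sample document, the attribute name target, and the helper dereference are hypothetical. Treating an attribute literally named id as the element ID is also an assumption, since no DTD is involved here.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<book>'
    '<section id="s1"><title>Intro</title></section>'
    '<section id="s2"><title>Usage</title></section>'
    '<xref target="s2"/><xref target="missing"/>'
    '</book>'
)


def dereference(root, attribute_values):
    # Recover the elements whose id matches one of the attribute values.
    wanted = set(attribute_values)
    return [el for el in root.iter() if el.get("id") in wanted]


# Values of the attributes in the prior nodeset (here: every xref/@target).
values = [x.get("target") for x in doc.iter("xref")]
targets = dereference(doc, values)
print([t.get("id") for t in targets])  # ['s2']
```

Attribute values with no matching element, such as "missing" above, simply contribute nothing to the result.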

The concept of an Xpath is generalized to allow for branching. Any step of an Xpath location expression may be bracketed, and branches within such brackets may be separated by the keywords AND and OR, with optional sub-bracketing. Arbitrary Xpath expressions may iteratively occur underneath such branches. Note that this blurs the distinction between the role of xpath expressions and the subordinate filter predicates. The key distinction remains that subordinate filter predicates are evaluated only once, and are subsequently not affected by future computations.

Thus for example chapter/(bold AND (italic OR strikeout)) describes a pattern in which chapters have bold tags within them as well as either italic, strikeout or both.
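The chapters selected by this branching pattern can be illustrated with a short Python sketch using the standard xml.etree.ElementTree module. The sample document and the helper has are hypothetical, and the sketch assumes the branches are tested as direct children of each chapter.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<book>"
    "<chapter n='1'><bold/><italic/></chapter>"
    "<chapter n='2'><bold/></chapter>"
    "<chapter n='3'><italic/><strikeout/></chapter>"
    "<chapter n='4'><bold/><strikeout/><italic/></chapter>"
    "</book>"
)


def has(el, tag):
    # True when the element has at least one child with the given tag.
    return el.find(tag) is not None


# chapter/(bold AND (italic OR strikeout))
matched = [c.get("n") for c in doc.findall("chapter")
           if has(c, "bold") and (has(c, "italic") or has(c, "strikeout"))]
print(matched)  # ['1', '4']
```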

Each step in an Xpath expression may optionally be preceded by any combination of the modifiers MARK::, number::, and OUTER::, in an arbitrary order.

All nodesets corresponding to steps marked as MARK will (when not occurring within an Xpath predicate) be unioned together in order to determine the set of marks returned by the xpath_value function, and to determine the nodes that should be marked within texts returned by the xpath_table function. All steps marked with a column number (when not occurring within an Xpath predicate) will be extracted into distinct columns by the xpath_table function. The first column is thus denoted by a 1, the second by a 2, and so on. This notation improves readability and allows extended xpath expressions to be rewritten and/or internally simplified while preserving portability across distributed systems.

All node sets marked as OUTER are considered optional. The absence of any nodes within such node sets does not restrict the legitimate nodes occurring in earlier xpath steps. If any path under an OR is considered optional then the OR is itself considered optional. If every path under an AND is considered optional then so is the AND. Note that bracketed sequences of ANDs or ORs may themselves be explicitly marked OUTER.
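The propagation of optionality through AND and OR branches can be expressed as a short recursive function. The tuple representation (kind, body, outer) used for pattern nodes is a hypothetical encoding chosen for this sketch, not TextServer's internal form.

```python
def is_optional(node):
    kind, body, outer = node
    if outer:
        # Explicitly marked OUTER, whether a step or a bracketed AND/OR.
        return True
    if kind == "step":
        return False
    if kind == "or":
        # An OR is optional if any path under it is optional.
        return any(is_optional(child) for child in body)
    if kind == "and":
        # An AND is optional only if every path under it is optional.
        return all(is_optional(child) for child in body)
    raise ValueError(f"unknown pattern kind: {kind}")


bold = ("step", "bold", False)
note = ("step", "note", True)  # OUTER::note

print(is_optional(("or", [bold, note], False)))   # True
print(is_optional(("and", [bold, note], False)))  # False
print(is_optional(("and", [bold, bold], True)))   # True
```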

Set expressions

The valid Xpath node tests are extended to include four new set operations. These are:

  • node(xpath)
  • union(xpath1, xpath2)
  • intersect(xpath1, xpath2)
  • except(xpath1, xpath2)

Each of these nodetests begins by resolving its parameters. Each such xpath is resolved with respect to the previous rather than the current xpath step, using the currently established nodeset associated with this prior step. Once these xpaths are fully resolved, each path is searched for steps within it, not themselves contained within subordinate set expressions, having the set attribute. If no steps within an xpath parameter have this attribute, the first step in the xpath is implicitly presumed to have this attribute. Set operations are performed on the union of node sets associated with all such explicitly or implicitly identified xpath steps.

Having computed the nodesets associated with each xpath parameter, the specified action if any is then performed on these nodeset nodes, producing a new nodeset. The node() set expression simply preserves the earlier computed nodeset.

Each of these set expression nodetests constrains the resulting members of the nodeset associated with the current xpath step to also be members of the computed nodeset. Thus, logically, the result returned by a set expression is intersected with the nodeset otherwise evaluated by an xpath step, to suitably constrain the nodeset associated with this xpath step.
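In terms of plain sets of node numbers, the effect of a set expression nodetest can be sketched as below. The function apply_set_test and its arguments are hypothetical. The parameters are assumed to be already resolved to nodesets, so the resolution rules described above are not modelled.

```python
def apply_set_test(step_nodes, operation, first, second=None):
    # Compute the nodeset named by the set expression ...
    if operation == "node":
        computed = set(first)  # node(xpath) preserves the computed nodeset
    elif operation == "union":
        computed = set(first) | set(second)
    elif operation == "intersect":
        computed = set(first) & set(second)
    elif operation == "except":
        computed = set(first) - set(second)
    else:
        raise ValueError(f"unknown set operation: {operation}")
    # ... then intersect it with the nodeset the step would otherwise match.
    return set(step_nodes) & computed


step = {1, 2, 3, 4, 5}
print(sorted(apply_set_test(step, "union", {1, 2}, {2, 9})))        # [1, 2]
print(sorted(apply_set_test(step, "except", {1, 2, 3}, {2})))       # [1, 3]
print(sorted(apply_set_test(step, "intersect", {1, 2, 3}, {2, 3})))  # [2, 3]
```

Node 9 in the union example is dropped because it is not a member of the step's own nodeset.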

Subsequently the xpath parameters contained within such nodesets behave as if they simply occurred directly under the parent xpath step, and may thus have their nodesets further reduced. As a special case, the xpath expression associated with the second parameter of an except() expression is, once used, presumed to have steps that contain empty nodesets. This subsequent reduction allows steps within these xpath expressions to be assigned the mark attribute, and thus to contribute towards the process of creating marks within the resulting text.

Xpath functions

All xpath operations are performed within the context of a specified text. This text may be instantiated within a TextServer database as a value, or itself be constructed using appropriate functions.

xpath_value(text/xpath)
Computes the collection of marks produced by resolving the xpath expression, in the given text context. Returns a new text that is the same as the input text, having the computed marks.

xpath_table(text/xpath [, integer])
Computes the collection of nodesets produced by resolving the xpath expression, in the given text context. Outputs a table having as many columns as specified by integer, if present. (This option exists primarily to allow portability of rewritten xpath expressions across distributed networks.) Otherwise outputs as many columns as the maximum column number (>= 1) with which any nodeset in the xpath expression is explicitly marked.

For each way in which the pattern can be matched against the text, while producing distinct output rows, an output row is formed. Each attribute in the output row contains a text derived from the input text, but rooted at the indicated position within the xpath pattern. If there is no such text, the output is null. All subordinate nodes under this subtree which are to be marked in the pattern are marked in the output text. An AND produces output from distinct branches which apply for each node in turn above the AND.

Nulls are produced in the following circumstances.

  1. An optional pattern fails to be matched in at least one way.
  2. Sub-patterns in an or list are processed separately, with only one sub-pattern at a time producing output rows. Columns associated with other sub-patterns within an or list are set null.
  3. Any extracted columns in the nodesets used to compute UNION, INTERSECT and EXCEPT will always be null.
  4. Any output column not associated with any extraction will be null.
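The second and fourth circumstances can be illustrated with a small Python sketch in which None stands for null. The function or_rows, its (column number, matches) input format, and the sample values are hypothetical. The sketch does not attempt to reproduce how xpath_table derives its rows from a text.

```python
def or_rows(branches, n_columns):
    # Each sub-pattern of the OR list produces its own rows; the columns
    # belonging to the other sub-patterns stay null (None).
    rows = []
    for column, matches in branches:
        for value in matches:
            row = [None] * n_columns
            row[column - 1] = value  # column numbers count from 1
            rows.append(tuple(row))
    return rows


# Column 1 comes from one OR branch, column 2 from the other;
# column 3 is not associated with any extraction and is always null.
rows = or_rows([(1, ["italic-a", "italic-b"]), (2, ["strike-a"])], n_columns=3)
for row in rows:
    print(row)
```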

Standard Xpath Functions

In addition to supporting all of the SQL2 standard functions, TextServer also supports all of the proposed Xpath core functions. The functions supported are:

Arbitrary XPath Node Set expression

last()
Returns as an integer the number of nodes in the current context.

position()
Returns as an integer the logical offset (from 1) of the node representing the current context within the nodeset being reduced by an xpath predicate. Note that position, when given operands, is interpreted as the standard SQL2 function.

local_name([NodeSet])
Returns the local name of the current context node if nodeset is not specified. Otherwise returns the local name of the first node in the specified nodeset. Internally this computation is performed by computing either text_to_string(context(), 'localname'), or text_to_string(mark_to_text(xpath(context(), NodeSet)), 'localname'). See also text_to_string(text,'localname').

namespace-uri([NodeSet])
Returns the namespace URI of the current context node if nodeset is not specified. Otherwise returns the namespace URI of the first node in the specified nodeset. See also text_to_string(text,'namespaceuri').

name([NodeSet])
Returns the fully qualified name of the current context node if nodeset is not specified. Otherwise returns the fully qualified name of the first node in the specified nodeset. See also text_to_string(text,'name').

substring-after(String1, String2)
Returns the substring of String1 occurring after the first occurrence of String2 in String1.

substring-before(String1, String2)
Returns the substring of String1 occurring before the first occurrence of String2 in String1.

normalize-space([String])
Strips leading and trailing white space from the string, and reduces multiple adjacent whitespace to a single space character. If no string is provided the input string is computed as text_to_string(context(), 'xpath').

translate(String1, String2, String3)
Replaces characters in String1 that occur in String2 by the corresponding character (if any) in String3.

count(NodeSet)
Internally implemented as text_to_integer(xpath_value(context(), NodeSet), 'marks').

max(NodeSet)
Internally implemented using standard SQL aggregation functions.

min(NodeSet)
Internally implemented using standard SQL aggregation functions.

sum(NodeSet)
Internally implemented using standard SQL aggregation functions.

avg(NodeSet)
Internally implemented using standard SQL aggregation functions.

floor(value)
The floor of the value.

ceiling(value)
The ceiling of the value.

round(value)
The rounded value of the input.

value1 MOD value2
The remainder when value1 is divided by value2.
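The semantics of several of the functions listed above can be sketched in Python. The sketch follows the definitions in the XPath 1.0 specification (an absent second string yields an empty result, translate() removes characters that have no counterpart in its third string, and MOD takes the sign of the dividend). Whether TextServer agrees in every corner case is an assumption, and the Python function names are chosen for illustration only.

```python
import math
import re


def substring_before(s1, s2):
    i = s1.find(s2)
    return s1[:i] if i >= 0 else ""  # empty string when s2 is absent


def substring_after(s1, s2):
    i = s1.find(s2)
    return s1[i + len(s2):] if i >= 0 else ""


def normalize_space(s):
    # XPath whitespace is limited to space, tab, carriage return and newline.
    return re.sub(r"[ \t\r\n]+", " ", s.strip(" \t\r\n"))


def translate(s1, s2, s3):
    mapping = {}
    for i, ch in enumerate(s2):
        if ch not in mapping:  # the first occurrence in s2 wins
            mapping[ch] = s3[i] if i < len(s3) else ""  # no counterpart: remove
    return "".join(mapping.get(ch, ch) for ch in s1)


def xpath_mod(a, b):
    # Remainder of truncating division; the sign follows the dividend.
    # Note: math.fmod raises ValueError for b == 0, where XPath yields NaN.
    return math.fmod(a, b)


print(substring_before("1999/04/01", "/"))  # 1999
print(substring_after("1999/04/01", "/"))   # 04/01
print(translate("--aaa--", "abc-", "ABC"))  # AAA
print(xpath_mod(-5, 2))                     # -1.0
```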

Maintainer
webmaster@textserver.com