|
Introduction
TextServer provides extended support for employing Xpath expressions
directly within SQL. The set of nodes recovered by an Xpath expression is represented within TextServer as a
set of zero or more marks within an output text, that is, apart from the set of marks it contains, identical
to the input text. Each such mark is simply a start node number, together with the range of node numbers spanned
by this mark. The advantage of considering sets of nodes to be marks within a text rather than a new set of
texts is that the context (documenting from where sets of nodes originate within a text) is not lost. In
addition this approach lends itself to presenting fragments of text with suitable highlighting of marked
sections of text.
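The mark representation described above can be sketched in a few lines of Python. This is only an illustrative model, not TextServer's internal data structure: the names Mark, MarkedText and highlight are hypothetical, and the sketch assumes that a span counts the start node itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    start: int  # node number at which the mark begins
    span: int   # number of node numbers covered, start included (assumed convention)

    def nodes(self) -> range:
        return range(self.start, self.start + self.span)


@dataclass(frozen=True)
class MarkedText:
    nodes: tuple       # the unchanged input text, one entry per node number
    marks: tuple = ()  # zero or more Mark values

    def highlight(self) -> str:
        # Bracket every node that falls inside some mark; context is kept intact.
        marked = {n for m in self.marks for n in m.nodes()}
        return " ".join(
            f"[{tok}]" if i in marked else tok for i, tok in enumerate(self.nodes)
        )


text = MarkedText(nodes=("the", "quick", "brown", "fox", "jumps"))
# The output text is identical to the input apart from the marks it carries.
result = MarkedText(nodes=text.nodes, marks=(Mark(start=1, span=2),))
print(result.highlight())
```

Because the output keeps every input node, the surrounding context of each marked region remains available for display.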
Xpath implementation and extensions
Some trivial modifications are made to SQL2 in order to make TextServer more compatible with Xpath.
The DIV keyword may be used as an alternative to the standard '/' divide operator anywhere, and the
MOD operator having the semantics described in the Xpath specification may be used anywhere. Under some
circumstances when constructing xpath expressions the '/' may not be employed as a division operator, because of
the resulting ambiguity.
The boolean functions true(), false(), and unknown() are, for
completeness with Xpath, defined as valid functions in SQL, and may be used anywhere within SQL.
The Xpath implementation is extended to be fully compatible with SQL. An arbitrary SQL predicate may occur
within any Xpath nodeset filter. The SQL predicates are extended to include as pseudo predicates arbitrary
Xpath expressions. The RANGE keyword (proposed for matching ranges of context nodes in XQuery) is changed
to the BETWEEN keyword for consistency with SQL. Thus chapter[1] is semantic shorthand for
chapter[position() = 1] while chapter[between 2 and 3] is semantic shorthand for
chapter[position() between 2 and 3] which is equivalent to chapter[position() >= 2 and position()
<= 3].
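The equivalence between the positional shorthands and their expanded forms can be modelled as follows. This is a minimal Python sketch of the semantics only; filter_by_position is a hypothetical helper, not a TextServer function.

```python
def filter_by_position(nodes, low, high=None):
    """Keep the nodes whose 1-based position lies between low and high, inclusive.

    With a single bound this models chapter[n], i.e. position() = n.
    With two bounds it models chapter[between low and high].
    """
    if high is None:
        high = low
    return [n for pos, n in enumerate(nodes, start=1) if low <= pos <= high]


chapters = ["ch1", "ch2", "ch3", "ch4"]
first = filter_by_position(chapters, 1)      # chapter[1]
middle = filter_by_position(chapters, 2, 3)  # chapter[between 2 and 3]
```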
The context node employed when evaluating nodeset filters is typically implicit within Xpath. However, there
may be cases where it is desirable to employ it explicitly within an appropriate SQL function. For this reason
the function context() when employed within an Xpath expression evaluates to a text node corresponding to the
current context node and span, having no internal marks.
Because Xpath is being integrated with SQL, it seems undesirable to have different rules for describing
variables (such as column references) inside and outside Xpath expressions. Therefore, variable names should
not be preceded by a '$' in TextServer's syntax.
Adopting this approach necessitates that some means be available to distinguish an arbitrary Xpath expression
from an arbitrary standard SQL expression. Most integral Xpath expressions must therefore be preceded by a '/'.
The sole exception to this rule is that degenerate standalone Xpath functions may when used in SQL boolean
expressions optionally omit the preceding '/'. To root the context of an expression at an arbitrary node (such
as the root node) the text function root_text(...) [rather than a
'/'] would be used. The main impact of this rule is that Xpath expressions occurring within predicates as
allowed for within the Xpath specification must also be preceded by a '/'. Thus, for example, the Xpath expression
chapter/section[topic and @note and comment()] would be expressed either as /chapter/section[/topic
and /@note and comment()] or as /chapter/section[/topic and /@note and /comment()].
The set of functions that conventionally act as filters of nodesets on a given axis are extended to include:
- *
- Any node of the principal node type is matched.
- name
- Any node of the principal node type having the indicated name is matched.
To avoid conflict with SQL reserved words, it is recommended that this name
be double quoted.
- namespace : *
- Any node in the indicated namespace is matched.
To avoid conflict with SQL reserved words, it is recommended that this name
be double quoted.
- namespace : name
- Only nodes in the indicated namespace having the indicated name are matched.
- node()
- Essentially a no-op. Matches whatever the indicated axis matches.
- node(set expression)
- The set expression is evaluated using the same immediate ancestor as its node of the xpath expression. Only
nodes within the resulting set expression may occur at this level of the xpath tree.
- text()
- The text() function may be employed on the child/descendant axis of (1) attributes to recover attribute
values, (2) comments to recover the text of a comment, (3) processing instructions to recover the text of a
processing instruction, and/or (4) other types of node to recover the sections of text occurring outside of
markup, under elements.
- word()
- This test is similar to the text() function, but stipulates that the marks within the nodeset associated
with this step are to be those associated with each word of the text, rather than the nodes associated with
each fragment of text, as defined in the Xpath standard.
- marks()
- Each input text has associated with it a set of zero or more marks. This function specifies that only
marks within this input nodeset may be selected as candidates for inclusion within the output set of marks.
- contains(string [, integer])
- Computes the marks (i.e. node spans) which minimally contain in sequence all of the words occurring within
string. If no integer is specified (or if it is negative) arbitrary numbers of words may occur between
consecutive words in this sequence. Otherwise at most the specified number of additional words may be
subsumed by each such mark. The parameters to this function may be derived from any appropriate SQL
expression. Note that the text context determines the definition of what constitutes a word. There is no
intrinsic requirement that the words occurring under a text match in any visual fashion the words occurring
within the specified string.
- starts-with(string [, integer])
- Computes the set of word nodes beginning at the first node number within the prior nodeset which contain
words matching the specified string. Otherwise identical to the contains() function.
- equals(string [, integer])
- Similar to the starts-with() function, but also requires that the last node in the region being tested is
the last node under this node's parent.
- like(string [, integer])
- Behaves in a fashion similar to contains(), but interprets '%' and '_' as wildcards. When these
wildcards occur adjacent to characters within a word, character wildcarding within that word is implied. When
occurring in other contexts these symbols are interpreted as wildcarding unknown words in the text.
- starts-like(string [, integer])
- Computes the set of word nodes beginning at the first node number within the prior nodeset which are like
words matching the specified string. Otherwise identical to the like() function.
- numeric(double1 [, double2 ])
- Computes the set of word nodes whose content, when interpreted as a double (according to the semantics of
the text), is indexed as having the indicated value or (if two parameters are specified) range of values.
- lang(string)
- Computes the set of nodes within the prior nodeset which are under an element which qualifies the language
of subordinate nodes as belonging to the specified language.
Note that the contains(), starts-with(), and lang() functions in computing nodesets within TextServer, provide
more power than provided in the Xpath specification, where they operate as boolean functions returning either
true or false. This historic functionality is still implied since when used without qualification within a
boolean expression they evaluate to true when they compute a non-empty nodeset, and to false when they compute
the empty nodeset. Also note that the contains(), and starts-with() functions depart from the Xpath
specification which defines them as having two string parameters. It is not possible to say whether one string
contains words specified in another string, without resorting to a definition of what constitutes a word, and
in the absence of an input text, no such definition exists.
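The word-sequence semantics of contains(string [, integer]) and its boolean use can be illustrated with a small Python model. This is a sketch under stated assumptions rather than TextServer's algorithm: contains_marks is a hypothetical name, words are taken to be whitespace-separated tokens compared by exact equality (whereas in TextServer the text context defines what a word is), and marks are returned as (start, span) pairs.

```python
def contains_marks(words, phrase, max_extra=-1):
    """Return minimal (start, span) marks containing the phrase words in order.

    A negative max_extra places no limit on intervening words; otherwise a
    mark may subsume at most max_extra words beyond those in the phrase.
    """
    query = phrase.split()  # assumption: whitespace tokenization
    if not query:
        return []
    start_for_end = {}
    for start, word in enumerate(words):
        if word != query[0]:
            continue
        pos, matched = start, True
        for q in query[1:]:
            try:
                pos = words.index(q, pos + 1)  # earliest later occurrence
            except ValueError:
                matched = False
                break
        if matched:
            # A later start reaching the same end gives a tighter span.
            start_for_end[pos] = start
    marks = []
    for end, start in sorted(start_for_end.items(), key=lambda kv: kv[1]):
        span = end - start + 1
        if max_extra < 0 or span - len(query) <= max_extra:
            marks.append((start, span))
    return marks


words = "a b x a y b c".split()
unlimited = contains_marks(words, "a b")   # any gap allowed
adjacent = contains_marks(words, "a b", 0)  # no additional words allowed
as_boolean = bool(contains_marks(words, "c a"))  # empty nodeset acts as false
```

The last line mirrors the rule that an unqualified use inside a boolean expression is true exactly when the computed nodeset is non-empty.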
The set of axes used to traverse to desired nodesets is extended to include:
- =>
- The dereference axis (proposed in XQuery) takes those nodes within the prior nodeset which are attributes,
computes the values associated with these attributes, and recovers those element nodes with matching IDs.
The concept of an Xpath is generalized to allow for branching by allowing any step of an Xpath location
expression to be bracketed and for branches within such brackets to be separated by the keywords AND,
OR and optionally sub-bracketing. Arbitrary Xpath expressions may iteratively occur underneath such
branches. Note that this blurs the distinction between the role of xpath expressions and the subordinate filter
predicates. The key distinction remains that subordinate filter predicates are evaluated only once, and are
subsequently not affected by future computations.
Thus for example chapter/(bold AND (italic OR strikeout)) describes a pattern in which chapters have bold
tags within them as well as either italic, strikeout or both.
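The branching pattern in this example can be modelled with a small recursive matcher. The Python below is a simplified sketch, not the TextServer evaluator: it assumes each chapter is represented only by the set of tag names of its children, and the nested-tuple pattern encoding and the function name matches are hypothetical.

```python
def matches(child_tags, pattern):
    """Test a branch pattern against the set of child tag names of one node.

    A pattern is either a tag name or a tuple ("AND" | "OR", branch, ...).
    """
    if isinstance(pattern, str):
        return pattern in child_tags
    op, *branches = pattern
    results = (matches(child_tags, b) for b in branches)
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {op}")


# chapter/(bold AND (italic OR strikeout))
pattern = ("AND", "bold", ("OR", "italic", "strikeout"))
chapters = {
    "ch1": {"bold", "italic"},
    "ch2": {"bold"},
    "ch3": {"italic", "strikeout"},
    "ch4": {"bold", "italic", "strikeout"},
}
selected = [name for name, tags in chapters.items() if matches(tags, pattern)]
```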
Each step in an Xpath expression may optionally be preceded by any combination of the modifiers, MARK::,
number::, OUTER::, in an arbitrary order.
All nodesets corresponding to steps marked as MARK will (when not occurring within an XPath predicate) be
unioned together in order to determine the set of marks returned by the xpath_value function, and to determine
the nodes that should be marked within texts returned by the xpath_table function. All steps marked with a
column number (when not occurring within an Xpath predicate) will be extracted into distinct columns by the
xpath_table function. The first column is thus denoted by a 1, the second by a 2, and so on. This notation
improves readability and allows extended xpath expressions to be rewritten and/or internally simplified while
preserving portability across distributed systems.
All node sets marked as OUTER are to be considered optional. The absence of any nodes within such node sets is
not to result in any restrictions being imposed on the legitimate nodes occurring in earlier xpath steps. If
any path under an OR is considered optional then the OR is itself considered optional. If every path under an
AND is considered optional then so is the AND. Note that bracketed sequences, ANDs, or ORs may themselves
be explicitly marked OUTER.
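The propagation rules for OUTER can be written as a short recursive function. The Python below is a sketch of those rules only; the Step and Branch classes are hypothetical, and each path under an AND or OR is simplified to a single step or nested branch.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    outer: bool = False  # step explicitly marked OUTER


@dataclass
class Branch:
    op: str              # "AND" or "OR"
    children: list
    outer: bool = False  # bracketed group explicitly marked OUTER


def is_optional(node):
    """Apply the rules: explicit OUTER, OR if any path, AND if every path."""
    if node.outer:
        return True
    if isinstance(node, Step):
        return False
    flags = [is_optional(child) for child in node.children]
    if node.op == "OR":
        return any(flags)
    if node.op == "AND":
        return all(flags)
    raise ValueError(f"unknown operator: {node.op}")


either = Branch("OR", [Step("italic", outer=True), Step("strikeout")])
both = Branch("AND", [Step("bold"), Step("italic", outer=True)])
```

Here either is optional because one of its paths is optional, while both is not, since the bold step remains mandatory.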
Set expressions
The valid Xpath node tests are extended to include four new set operations. These are:
- node(xpath)
- union(xpath1, xpath2)
- intersect(xpath1, xpath2)
- except(xpath1, xpath2)
Each of these nodetests begins by resolving its parameters. Resolution of each such xpath is
with respect to the previous rather than the current xpath step, using the currently established
nodeset associated with this prior step. Having fully resolved these xpaths each path is searched
for steps within it, not themselves contained within subordinate set expressions, having the set
attribute. If no steps within an xpath parameter have this attribute, the first step
in the xpath is implicitly presumed to have this attribute. Set operations are performed on the union of
node sets associated with all such explicitly or implicitly identified xpath steps.
Having computed the nodesets associated with each xpath parameter, the specified action if any is then
performed on these nodeset nodes, producing a new nodeset. The node() set expression simply preserves
the earlier computed nodeset.
Each of these set expression nodetests constrains the resulting members of the nodeset associated with the
current xpath to also be members of the computed nodeset. Thus logically, the result returned by a
set expression is intersected with the nodeset otherwise evaluated by an xpath step, to suitably
constrain the nodeset associated with this xpath step.
Subsequently the xpath parameters contained within such nodesets behave as if they simply occurred directly
under the parent xpath step, and may thus have their nodesets further reduced. As a special case the xpath
expression associated with the second parameter of an except() expression, once used, is presumed to have
steps that contain empty nodesets. This subsequent reduction allows steps within these xpath expressions to
be assigned the mark attribute, and thus to contribute towards the process of creating marks within
the resulting text.
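The way a set expression first combines its parameter nodesets and is then intersected with the nodeset of the current step can be modelled with ordinary sets. This Python sketch treats nodesets as sets of node numbers; apply_set_test is a hypothetical name, and the resolution of the xpath parameters themselves is assumed to have already happened.

```python
def apply_set_test(step_nodes, op, first, second=frozenset()):
    """Combine resolved parameter nodesets, then constrain the current step."""
    first, second = set(first), set(second)
    if op == "node":
        computed = first            # simply preserves the computed nodeset
    elif op == "union":
        computed = first | second
    elif op == "intersect":
        computed = first & second
    elif op == "except":
        computed = first - second
    else:
        raise ValueError(f"unknown set operation: {op}")
    # The result is intersected with the nodeset the step would otherwise yield.
    return set(step_nodes) & computed


step_nodes = {1, 2, 3, 4, 5}
a, b = {2, 3, 9}, {3, 4}
```

For instance, apply_set_test(step_nodes, "union", a, b) drops node 9 because it is not a member of the nodeset for the current step.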
Xpath functions
All xpath operations are performed within the context of a specified text. This
text may be instantiated within a TextServer database as a value, or itself be constructed using appropriate
functions.
- xpath_value(text/xpath)
- Computes the collection of marks produced by resolving the xpath expression, in the given
text context. Returns a new text that is the same as the input text, having the computed marks.
- xpath_table(text/xpath [, integer])
- Computes the collection of nodesets produced by resolving the xpath expression, in the given
text context. Outputs a table having as many columns as specified by integer if present.
(This option exists primarily to allow portability of rewritten xpath expressions across distributed
networks). Otherwise outputs as many columns as the maximum column number associated with any nodeset in the
xpath expression that is explicitly marked with a column number >= 1.
For each way in which the pattern can be matched against the text, while producing distinct output rows,
an output row is formed. Each attribute in the output row contains a text derived from the input text,
but rooted at the indicated position within the xpath pattern. If there is no such text, the output is
null. All subordinate
nodes under this subtree which are to be marked in the pattern are marked in the output text.
An AND produces output from distinct branches which apply for each node in turn above the AND.
Nulls are produced in the following circumstances.
- An optional pattern fails to be matched in at least one way.
- Sub-patterns in an or list are processed separately, with only one sub-pattern at a time producing
output rows. Columns associated with other sub-patterns within an or list are set null.
- Any extracted columns in the nodesets used to compute UNION, INTERSECT and EXCEPT will always be null.
- Any output column not associated with any extraction will be null.
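Two of the rules above, the choice of column count and the null-filling of columns belonging to other sub-patterns in an OR list, can be illustrated as follows. This Python sketch is a deliberately reduced model, not the xpath_table implementation: column_count and or_rows are hypothetical helpers, each OR sub-pattern is assumed to extract a single column, and None stands in for SQL null.

```python
def column_count(step_columns, requested=None):
    """Use the explicit integer if given, else the largest column number >= 1."""
    if requested is not None:
        return requested
    return max((c for c in step_columns if c >= 1), default=0)


def or_rows(sub_patterns, width):
    """Emit rows one OR sub-pattern at a time; other columns stay null.

    sub_patterns is a list of (column_number, matched_values) pairs.
    """
    rows = []
    for column, values in sub_patterns:
        for value in values:
            row = [None] * width
            row[column - 1] = value  # column numbers start at 1
            rows.append(row)
    return rows


width = column_count([1, 2])
rows = or_rows([(1, ["italic-1", "italic-2"]), (2, ["strikeout-1"])], width)
```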
Standard Xpath Functions
In addition to supporting all of the SQL2 standard functions, TextServer also supports all of the proposed
Xpath core functions. The functions supported are:
- Arbitrary XPath Node Set expression
- last()
- Returns as an integer the number of nodes in the current context.
- position()
- Returns as an integer the logical offset (from 1) of the node representing the current context within the
nodeset being reduced by an xpath predicate. Note that position when having operands is interpreted as the
standard SQL2 function.
- local_name([NodeSet])
- Returns the local name of the current context node if nodeset is not specified. Otherwise returns the
local name of the first node in the specified nodeset. Internally this computation is performed by computing
either text_to_string(context(), 'localname'), or text_to_string(mark_to_text(xpath(context(), NodeSet)),
'localname'). See also text_to_string(text,'localname').
- namespace-uri([NodeSet])
- Returns the namespace URI of the current context node if nodeset is not specified. Otherwise returns the
namespace URI of the first node in the specified nodeset. See also text_to_string(text,'namespaceuri').
- name([NodeSet])
- Returns the fully qualified name of the current context node if nodeset is not specified. Otherwise
returns the fully qualified name of the first node in the specified nodeset. See also
text_to_string(text,'name').
- substring-after(String1, String2)
- Returns the substring in String1 occurring after the first occurrence of String2 in String1.
- substring-before(String1, String2)
- Returns the substring in String1 occurring before the first occurrence of String2 in String1.
- normalize-space([String])
- Strips leading and trailing white space from the string, and reduces multiple adjacent whitespace to a
single space character. If no string is provided the input string is computed as
text_to_string(context(), 'xpath').
- translate(String1, String2, String3)
- Replaces characters in string1, occurring in string2, by the corresponding (if any) character in string3.
- count(NodeSet)
- Internally implemented as text_to_integer(xpath_value(context(), NodeSet), 'marks').
- max(NodeSet)
- Internally implemented using standard SQL aggregation functions.
- min(NodeSet)
- Internally implemented using standard SQL aggregation functions.
- sum(NodeSet)
- Internally implemented using standard SQL aggregation functions.
- avg(NodeSet)
- Internally implemented using standard SQL aggregation functions.
- floor(value)
- The floor of the value.
- ceiling(value)
- The ceiling of the value.
- round(value)
- The rounded value of the input.
- value1 MOD value2
- The remainder when value1 is divided by value2.
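The string functions listed above follow the Xpath core function library, and their behaviour can be checked against a small Python model. The sketch below reflects the semantics of the Xpath 1.0 specification; the Python function names are illustrative, and whether TextServer matches the specification in every edge case is an assumption.

```python
def substring_before(s1, s2):
    """Text of s1 before the first occurrence of s2; empty if s2 is absent."""
    i = s1.find(s2)
    return s1[:i] if i >= 0 else ""


def substring_after(s1, s2):
    """Text of s1 after the first occurrence of s2; empty if s2 is absent."""
    i = s1.find(s2)
    return s1[i + len(s2):] if i >= 0 else ""


def normalize_space(s):
    """Strip leading/trailing whitespace and collapse inner runs to one space.

    Note: str.split() treats more characters as whitespace than Xpath does.
    """
    return " ".join(s.split())


def translate(s1, s2, s3):
    """Replace characters of s1 found in s2 by the character at the same
    index in s3; characters with no counterpart in s3 are removed."""
    table = {}
    for i, ch in enumerate(s2):
        if ord(ch) not in table:  # the first occurrence in s2 wins
            table[ord(ch)] = s3[i] if i < len(s3) else None
    return s1.translate(table)


print(substring_before("1999/04/01", "/"))
print(substring_after("1999/04/01", "/"))
print(normalize_space("  two   words \n"))
print(translate("--aaa--", "abc-", "ABC"))
```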
|