StringExtractor

Declaration: StringExtractor (InString: string; Commands: TStringList; Output: TAssocArray): integer;
The function StringExtractor implements an information extraction engine that can be used to extract numeric or string data from complex strings. The principle of StringExtractor is quite simple: the input string InString is processed by a set of commands, which are passed to the extraction engine via the parameter Commands. The results of the extraction process are returned in the associative array Output.

The extraction commands always operate on the source string at the current position of the execution pointer. The execution pointer can be moved along the source string by several of the commands (such as "pos", "inc", or "find"). StringExtractor lets you define up to 100 variables, which can be filled with information obtained from the source string. The variables are Variants and need not be declared - they are created automatically whenever a command references one for the first time. These variables can then be used for calculations using the built-in equation interpreter (see the section "Math Expressions" below for details on the available functions).

Commands

In general, each command has the same structure: the command name is followed by the required parameters enclosed in parentheses and is terminated by a semicolon. Commands must not be nested (i.e. a command cannot be called from within another command). If a command creates a variable, the variable is created automatically without explicit declaration. Variables are always Variants.

Commands starting with a hash character (#) are treated as comments. Do not forget to close a comment with a semicolon; otherwise the next command is interpreted as part of the comment and ignored.
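The effect of a missing semicolon can be illustrated with a small Python sketch. It is not the engine's parser: the helper split_commands is hypothetical and simply assumes that a script is cut at every semicolon and that any piece starting with '#' is discarded (semicolons inside quoted strings are ignored for simplicity).

```python
def split_commands(script: str) -> list[str]:
    """Split a script at semicolons and drop pieces that start with '#'.

    Simplification (assumption): semicolons inside quoted strings
    are not treated specially.
    """
    commands = []
    for chunk in script.split(";"):
        command = " ".join(chunk.split())  # collapse line breaks and spaces
        if command and not command.startswith("#"):
            commands.append(command)
    return commands


good = "# skip the time field;\nfind(1, ',');\ninc(1);"
bad = "# skip the time field\nfind(1, ',');\ninc(1);"  # comment not closed

print(split_commands(good))  # ["find(1, ',')", 'inc(1)']
print(split_commands(bad))   # ['inc(1)']
```

In the second script the unterminated comment swallows the find command, so only inc(1) remains.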

Command Description
assign(destvar=value) Assigns the value (which may be either a numeric value or a string) to the variable destvar. String values have to be enclosed in single quotes. The variable identifier destvar may be any string starting with a letter and containing only letters and digits. The variable identifier must not use reserved words of mathematical expressions (see "Math Expressions" below).
calc(destvar=expr) Calculates the arithmetic/logic expression expr and stores the result in the variable destvar. The variable identifier destvar may be any string starting with a letter and containing only letters and digits. The variable identifier must not use reserved words of mathematical expressions (see "Math Expressions" below). The expression may contain any number of variables previously created by commands such as copy or calc.
copy(n,destvar) Copies n characters, starting at the current execution pointer, to the variable destvar. The execution pointer is advanced by n characters after the command. The variable identifier destvar may be any string starting with a letter and containing only letters and digits. The variable identifier must not use reserved words of mathematical expressions (see "Math Expressions" below).
copyuntil('str',destvar) Copies all characters between the current execution pointer and the position of the substring 'str' to the variable destvar (the substring itself is not copied). The execution pointer is advanced to the first character after the substring 'str'. If 'str' is not contained in the source string, the entire rest of the source string is copied. The variable identifier destvar may be any string starting with a letter and containing only letters and digits. The variable identifier must not use reserved words of mathematical expressions (see "Math Expressions" below).
emit(id=expr) Calculates the arithmetic/logic expression expr and assigns the result to the identifier id. The calculated value is added to the Output array. The id may be any combination of letters and digits. The expression may contain any number of variables previously created by commands such as copy or calc.
exiton(varname) Stops the extraction script if the variable varname is TRUE. In this case the function StringExtractor returns a value of -1.
find(n,'str') Positions the execution pointer at the n-th occurrence of the string str, starting at the position of the current execution pointer. If str cannot be found, the execution pointer is left unchanged. The find command is not case-sensitive.
findbw(n,'str') Positions the execution pointer at the n-th occurrence of the string str, starting at the position of the current execution pointer. The search for str is performed backwards, starting at the current execution pointer and scanning the string from higher to lower indices. If str cannot be found, the execution pointer is left unchanged. The findbw command is not case-sensitive.
inc(dx) Moves the execution pointer by dx characters. dx may be negative or positive. If the value of dx results in an execution pointer that is beyond the limits of the source string, the execution pointer is restricted to the beginning or end of the source string (whichever is closer).
makelc(varname) Converts the contents of variable varname to lower case letters. The command makelc has no effect if applied to numeric data. makelc may be used to convert the entire source string to lower case characters by using the special variable name $sourcestring.
makeuc(varname) Converts the contents of variable varname to upper case letters. The command makeuc has no effect if applied to numeric data. makeuc may be used to convert the entire source string to upper case characters by using the special variable name $sourcestring.
pos(x) Positions the execution pointer at the character at position x. If x is less than or equal to 1, the execution pointer is set to the beginning of the source string; if x is greater than the length of the source string, the execution pointer is set to the last character of the string.
scandatetime('fmt',destvar) Scans the source string, starting at the current execution pointer, for a date/time string using the format specifier fmt. The format specifier uses the same syntax as the ScanDateTime routine. The result is stored in the variable destvar. Please note that the command scandatetime does not change the execution pointer (in contrast to several other commands). The variable identifier destvar may be any string starting with a letter and containing only letters and digits. The variable identifier must not use reserved words of mathematical expressions (see "Math Expressions" below).
strcomp(destvar=srcvar,'str') Compares the contents of the variable srcvar to the string str and stores the result in the variable destvar. The comparison is case-sensitive. Please note that the result can be used either as a boolean variable (TRUE or FALSE) or as an arithmetic variable (-1 or 0).
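How the execution pointer moves under pos, inc, copy, and copyuntil can be sketched with a toy model in Python. The class Cursor and its attribute names are hypothetical and are not part of the library. The sketch assumes 1-based positions, and it assumes that the pointer ends up just past the end of the string when copyuntil does not find its marker (the description above does not specify this).

```python
class Cursor:
    """Toy model of the execution pointer (positions assumed 1-based)."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.position = 1

    def pos(self, x: int) -> None:
        # Clamp to the first or last character, as described for pos(x).
        self.position = min(max(x, 1), max(len(self.source), 1))

    def inc(self, dx: int) -> None:
        # A relative move is restricted to the string limits in the same way.
        self.pos(self.position + dx)

    def copy(self, n: int) -> str:
        start = self.position - 1
        text = self.source[start:start + n]
        self.position += len(text)  # pointer advances past the copied text
        return text

    def copyuntil(self, marker: str) -> str:
        start = self.position - 1
        hit = self.source.find(marker, start)
        if hit < 0:
            # Marker missing: the entire rest of the source string is copied.
            text = self.source[start:]
            self.position = len(self.source) + 1  # assumption, see lead-in
            return text
        self.position = hit + len(marker) + 1  # first character after marker
        return self.source[start:hit]


cursor = Cursor("id=42;name=probe")
cursor.pos(4)
print(cursor.copyuntil(";"))  # 42
print(cursor.copy(4))         # name
cursor.inc(100)
print(cursor.position)        # 16
```

The final inc(100) would move the pointer far beyond the 16-character string, so it is restricted to the last character.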

 

Math Expressions

Expressions passed to commands such as calc and emit may be mathematical or logical. Expressions are not case-sensitive; however, you should be careful to avoid improper mixing of boolean and arithmetic subexpressions, for example:
(a>5) and (b=0) yields a boolean result, while
(a+5) and (b=0) yields an integer value (the and operator is used as a bitwise and)
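The difference between the two expressions can be reproduced in Python under the convention that true is -1 and false is 0. This is only a sketch of the arithmetic, not of the interpreter itself: the helper compare is hypothetical, and the values of a and b are chosen for illustration.

```python
TRUE, FALSE = -1, 0  # numeric values of the constants true and false


def compare(condition: bool) -> int:
    """Map a Python bool to the -1/0 convention."""
    return TRUE if condition else FALSE


a, b = 7, 0

# (a>5) and (b=0): both operands are -1 or 0, so the result is as well.
boolean_result = compare(a > 5) & compare(b == 0)

# (a+5) and (b=0): 12 combined bitwise with -1 leaves 12 unchanged.
mixed_result = (a + 5) & compare(b == 0)

print(boolean_result)  # -1
print(mixed_result)    # 12
```

Because -1 has all bits set, a bitwise and with a true comparison passes the arithmetic operand through unchanged, which is rarely what a mixed expression is meant to compute.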

You may use any number of user-defined variables, provided that the variable names are not equal to any of the reserved function names (see below). A user-defined variable always starts with a letter and may consist of any number of digits and letters and the underscore character ('_').

The expression may use the following pre-defined constants, operators, and functions:

--- Constants ---
true logical true (or -1, if used as number)
false logical false (or 0, if used as number)
pi the number Pi (3.14159...)
--- Arithmetic Operators ---
+ sum: x+y
- difference: x-y
* product: x*y
/ division: x/y
# modulo: round(x) mod round(y)
^ power: exponentiation x^y, x>0, y..any real value
--- Logic Operators ---
> greater than
>= greater than or equal
= equal
>< not equal
< less than
<= less than or equal
and boolean or bitwise and
not boolean or bitwise not
or boolean or bitwise or
xor boolean or bitwise exclusive or
--- Functions ---
abs absolute value: abs(x), x..any real value
arccos inverse cosine: arccos(x), result..angle in radians
arcsin inverse sine: arcsin(x), result..angle in radians
arctan inverse tangent: arctan(x), result..angle in radians
cos cosine: cos(x), x..angle in radians
exp exponential function: exp(x)
frac fraction: frac(x) = x - int(x)
gauss gauss creates normally distributed random numbers with zero mean and unit standard deviation
int round towards zero: int(x)
lg decadic logarithm: lg(x)
ln natural logarithm: ln(x)
mean returns the mean of a list of variables: mean(list), with list containing a list of variables separated by commas (1); a range of numbered variables may be abbreviated by the ':' sign (e.g. "xx8:11" expands to "xx8, xx9, xx10, xx11").
nddens density of the standard normal distribution: nddens(x)
ndint integral of the standard normal distribution from -infinity to x: ndint(x)
ndquant quantile of the standard normal distribution for a probability x: ndquant(x)
rand uniformly distributed random numbers: rand(x), x..amplitude of noise (mean = 0.0)
round round to the nearest integer: round(x)
sign sign of x: sign(x)
sin sine: sin(x)
sqr square: sqr(x)
sqrt square root: sqrt(x)
sum returns the sum of a list of variables: sum(list), with list containing a list of variables separated by commas (1); a range of numbered variables may be abbreviated by the ':' sign (e.g. "zz1:3" expands to "zz1, zz2, zz3").
tan tangent: tan(x)
var returns the variance of a list of variables: var(list), with list containing a list of variables separated by commas (1); a range of numbered variables may be abbreviated by the ':' sign (e.g. "xx8:10,aux4,y1:3" expands to "xx8, xx9, xx10, aux4, y1, y2, y3").



(1) The number of list elements may not exceed 2000. However, the exact limit depends on the structure of the formula, so the practical maximum is slightly lower (most often between 1980 and 1999).
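The ':' abbreviation used in the argument lists of mean, sum, and var can be mimicked with a few lines of Python. The function expand_list is a hypothetical helper written for this illustration. It assumes that a range entry consists of a purely alphabetic prefix followed by the first number, a colon, and the last number; entries of any other shape are passed through unchanged.

```python
import re


def expand_list(spec: str) -> list[str]:
    """Expand entries such as 'xx8:10' into 'xx8', 'xx9', 'xx10'."""
    names = []
    for item in spec.split(","):
        item = item.strip()
        match = re.fullmatch(r"([A-Za-z_]+)(\d+):(\d+)", item)
        if match is None:
            names.append(item)  # plain variable name, kept as is
            continue
        prefix = match.group(1)
        first, last = int(match.group(2)), int(match.group(3))
        names.extend(f"{prefix}{i}" for i in range(first, last + 1))
    return names


print(expand_list("xx8:10,aux4,y1:3"))
# ['xx8', 'xx9', 'xx10', 'aux4', 'y1', 'y2', 'y3']
```

The expanded list is what counts towards the element limit given in note (1), so a short range expression can still produce a long list.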

Example: The following short script extracts the geographical latitude from a string produced by the GPS recorder of an airplane.

The recorder delivers the following comma-delimited data (time, latitude, north/south, longitude, east/west, date). The latitude is given by a substring whose first two digits represent the degrees, the rest being arc minutes:

140054,4934.708,N,00949.363,E,070403

To extract the latitude as decimal degrees, one could pass the following commands to StringExtractor:

## find the first comma in the string;
find (1, ',');
inc(1);
## copy the first two digits to the variable "Deg";
copy (2, Deg);
## copy the next 6 characters to the variable "Min";
copy (6, Min);
## find the north/south indicator and copy it to "LatSign";
find (1, ',');
inc (1);
copy (1,LatSign);
makelc (LatSign);
## compare LatSign against 's';
strcomp(issouth=LatSign, 's');
## calculate and emit the Latitude;
emit (Latitude=(Deg+Min/60)*2*(issouth+0.5));
Please note a simple but efficient trick in the last line: as the variables are of the type Variant, the boolean variable "issouth" can be treated as a numeric variable as well (true = -1, false = 0). The factor 2*(issouth+0.5) therefore evaluates to -1 for southern latitudes and to +1 for northern latitudes, so multiplying by it creates the negative sign for southern latitudes.
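The arithmetic of the emit command can be checked with a short Python sketch. It only reproduces the formula from the script. The function latitude_from_fields is a hypothetical helper, and its three arguments stand for the values that the script stores in Deg, Min, and LatSign.

```python
def latitude_from_fields(deg: str, minutes: str, hemisphere: str) -> float:
    """Mirror emit(Latitude=(Deg+Min/60)*2*(issouth+0.5))."""
    issouth = -1 if hemisphere.lower() == "s" else 0  # true = -1, false = 0
    return (float(deg) + float(minutes) / 60) * 2 * (issouth + 0.5)


north = latitude_from_fields("49", "34.708", "N")
south = latitude_from_fields("49", "34.708", "S")
print(round(north, 4))  # 49.5785
print(round(south, 4))  # -49.5785
```

For the sample record the latitude is 49 degrees plus 34.708/60 degrees, i.e. about 49.5785 degrees north. The same fields with an 'S' indicator would give the negative value.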