Mappa.Mundi Magazine - space.cgi

space.cgi


The space.cgi Interface


Space.cgi is a web proxy which makes the services of the SpaceServer engine available to traditional web browsers. The EDGARspace portal is built using this interface. Underneath the space.cgi interface is a rich architecture of protocols, servers, and other modules which will be described in future issues of Rocket Science. To learn more about the basic architecture, see The Importance of Being EDGAR.

This tutorial explains how you, as a web developer, can use the space.cgi interface to issue calls to a SpaceServer engine. You place your calls to space.cgi using the cgi-bin interface, which is typically invoked using a form. For more information on how to build forms and use the cgi-bin interface, we recommend you read one of the O'Reilly and Associates books:

If you're really serious about learning modern HTML, the definitive guide is Danny Goodman's Dynamic HTML: The Definitive Reference.

If you're looking for a top-notch tutorial on JavaScript, we recommend David Flanagan's JavaScript: The Definitive Guide.

Tech Note: We support both the GET and POST methods of calling space.cgi. Because the number of parameters is large, you should seriously consider using the POST method.

Quick Start: If you don't believe in reading documentation, may we suggest you simply look at some examples. Rocket Science has several examples or you can go look at the source for one of the forms in EDGARspace, such as the basic Classic Query Form.

CGI Parameters

There are four sets of parameters in a call to space.cgi, prefixed by either "retrieve.", "evaluate.", "publish.", or "url." (e.g., "retrieve.tag.merge").

If you are using the GET method, you simply create a call to space.cgi, and add the parameters:

CGI Parameters

http://edgar.space.invisible.net/cgi-bin/space.cgi
?retrieve.tag.merge=and
&retrieve.tag.subtrees=doc.edgar
&retrieve.tag.e.conformed.name=micro
&retrieve.tag.a.form/type=10-K
&evaluate.scripts=evaluate.null.1
&publish.script=publish.debug.1

NOTE: This would all be on one line as part of a single call to space.cgi.

This query searches EDGARspace for 10-Ks (annual reports) filed by companies with "micro" in the name and publishes the results in a simple debug format.
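The GET call above can also be assembled programmatically. The following is a minimal sketch in Python, not part of space.cgi itself: the helper name build_get_url is hypothetical, and the parameter values are copied from the example. Note that standard URL encoding turns the "/" in retrieve.tag.a.form/type into %2F; the sketch assumes the receiving CGI program decodes it in the usual way.

```python
from urllib.parse import urlencode, parse_qsl, urlsplit

BASE_URL = "http://edgar.space.invisible.net/cgi-bin/space.cgi"

# Parameters from the example above, in the same order.
params = [
    ("retrieve.tag.merge", "and"),
    ("retrieve.tag.subtrees", "doc.edgar"),
    ("retrieve.tag.e.conformed.name", "micro"),
    ("retrieve.tag.a.form/type", "10-K"),
    ("evaluate.scripts", "evaluate.null.1"),
    ("publish.script", "publish.debug.1"),
]


def build_get_url(base_url, pairs):
    """Return base_url followed by '?' and the URL-encoded pairs (hypothetical helper)."""
    return base_url + "?" + urlencode(pairs)


url = build_get_url(BASE_URL, params)
print(url)
```

A list of pairs is used instead of a dictionary so that a parameter can appear more than once in the same call.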

An important principle to understand is that of passthrough parameters. Any parameter that is not understood by one stage of the space.cgi process is passed on to the next stage and ultimately back to the client as part of the options array (if the publish script decides to pass on the options array).

While you can use any parameter names you want, we strongly recommend that you use "user" as the prefix for any parameters that are going to be used by the user interface. If you are writing an evaluate script, you should use evaluate.user as your prefix.

A good example of passthrough parameters is documented in the companion Rocket Science describing Danny Goodman's JavaScript tour de force interface to space.cgi. By passing in user parameters, Danny has set up his interface to perform actions on return of data to the user. For example, if the user.saveMapOptions.islandSortType parameter has a value of "filer.name" and the data coming back is from doc.edgar, the visual interface will perform a row sort based on the filer's company name.
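To make the naming convention concrete, here is a small Python sketch that sorts a set of parameters by their leading prefix. It is an illustration only: the function name group_by_prefix and the "passthrough" bucket are assumptions made for this example, and the sketch says nothing about how space.cgi itself decides which parameters a stage understands.

```python
STAGE_PREFIXES = ("retrieve", "evaluate", "publish", "url")


def group_by_prefix(params):
    """Group parameters by leading prefix; anything else is treated as passthrough.

    Hypothetical helper for illustration, not space.cgi's own logic.
    """
    groups = {prefix: {} for prefix in STAGE_PREFIXES}
    groups["passthrough"] = {}
    for name, value in params.items():
        prefix = name.split(".", 1)[0]
        target = prefix if prefix in STAGE_PREFIXES else "passthrough"
        groups[target][name] = value
    return groups


example = {
    "retrieve.tag.subtrees": "doc.edgar",
    "publish.script": "publish.debug.1",
    "user.saveMapOptions.islandSortType": "filer.name",
}
groups = group_by_prefix(example)
print(groups["passthrough"])
```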

Retrieval Parameters

The retrieval is the first stage of the process of bringing data out of the SpaceServer and over to the user interface. We provide you with several different ways of specifying the retrieval. In all cases, you are specifying a value for an element or an attribute for one of the pieces of meta-data that are stored in the SpaceServer. The examples section of this Rocket Science can give you a quick start, and the DTDs description will show you all the possible elements and attributes you can specify for your search.

For the retrieval step, you have three different styles of call you can place:

The simplest syntax is the basic retrieve.tag syntax which is used to put together a very simple form in which you can choose to "and" or "or" all your search terms together.

A variant of the retrieve.tag syntax is the retrieve.tag.boolean syntax, which gives you more control over the Boolean structure of your query.

The most powerful (read "complex") syntax is the union style of calls in which you use XML to construct arbitrarily complex searches.

In addition, you may specify retrieve.blocks to bypass the search process and go directly to a named block in the SpaceServer or retrieve.text to do full-text searches.

Basic retrieve.tag

The first step in specifying a search is to set the retrieve.tag.subtrees parameter, which specifies which portions of the Blocks Namespace you wish to search. The value of the tag is a space-separated list of Blocks subtrees. To date, there are two valid values:

doc.edgar - The SEC's EDGAR database

doc.rfc - The Internet Request for Comment series

Basic Retrieve Tag

retrieve.tag.subtrees = "doc.edgar"

retrieve.tag.subtrees = "doc.edgar doc.rfc"

The next tag you specify is retrieve.tag.merge which has two valid values: "and" or "or." This tag is applied to all of your search terms that are done using Style 1. Note that other ways of specifying Boolean operations are used for all the other styles. Note also that if you combine styles, or use the retrieve.blocks or retrieve.text method of searching, all of these are combined together before being handed over to the evaluate and publish stages.

The Merge Tag

retrieve.tag.merge = "and"

retrieve.tag.merge = "or"

Note: You may only have one retrieve.tag.merge tag in your call.

Next, you specify how to search the XML content. Searching is case-insensitive. There are three ways to do this:

tag.e.*, which looks for text inside an element's character data;

tag.a.*, which looks for text inside a particular attribute value in any element; or,

tag.x.*, which looks for text inside any attribute value in a particular element.

These parameters may occur multiple times (with results combined according to the value of the tag.merge parameter).

For rfc.space, here are the parameters to try, along with suggested values (if any):

tag.a.category: std, bcp, info, or exp

tag.e.doc.area: Applications, General, Internet, Management, Operations, Routing, Security, Transport, User

tag.e.doc.workgroup

tag.e.doc.keyword

tag.e.doc.title

tag.x.doc.author

So, if you wished to construct a query that looks for any attribute of the doc.author element with a value of "Rose" and a category of "std" (Internet Standard), your call to space.cgi would consist of the following parameters:

Sample Query

retrieve.tag.subtrees = "doc.rfc"

retrieve.tag.merge = "and"

retrieve.tag.a.category = "std"

retrieve.tag.x.doc.author = "Rose"
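The sample query can be generated from a short list of search terms. The Python sketch below is one possible way to do that; the function basic_tag_query and its argument layout are assumptions for illustration, while the parameter names and allowed values come from the text above.

```python
def basic_tag_query(subtrees, merge, terms):
    """Build basic retrieve.tag parameters as a list of (name, value) pairs.

    terms is a list of (kind, path, value) where kind is "e", "a", or "x".
    Hypothetical helper; only the parameter names are taken from the tutorial.
    """
    if merge not in ("and", "or"):
        raise ValueError("retrieve.tag.merge must be 'and' or 'or'")
    pairs = [
        ("retrieve.tag.subtrees", " ".join(subtrees)),
        ("retrieve.tag.merge", merge),
    ]
    for kind, path, value in terms:
        if kind not in ("e", "a", "x"):
            raise ValueError("search kind must be 'e', 'a', or 'x'")
        pairs.append((f"retrieve.tag.{kind}.{path}", value))
    return pairs


query = basic_tag_query(
    ["doc.rfc"],
    "and",
    [("a", "category", "std"), ("x", "doc.author", "Rose")],
)
for name, value in query:
    print(f'{name} = "{value}"')
```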

One extension to the retrieve.tag syntax allows you to specify a path. For example, in the EDGARspace DTD, you will notice that many examples of dates are included, including the filing date and the period for which the filing pertains.

By specifying a path, you can indicate which specific date is to be searched. Paths are wild-carded, so you do not specify the full path to an attribute or element, only the pieces of the path that make it unique. The path is specified by separating each of the elements with a "/" character.

Retrieve Tag Example Attribute "year"

retrieve.tag.a.filing.date/year = "1999"

This retrieve.tag parameter looks for the attribute year which is contained within an element called filing.date. Note that the SpaceServer uses wild-card paths, which means that filing.date is simply some higher element above. For example, the above search would match the following XML structure:

XML Result

<docroot>
<filing.date type="gregorian">
<actual.date year="1999" />
</filing.date>
</docroot>
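One way to picture the wild-card path is as a search for the named attribute anywhere at or below the named element. The Python sketch below models that reading for a two-part path such as filing.date/year. It is an interpretation for illustration, not the SpaceServer's matching code: the function name is hypothetical, the period.date element in the sample is invented to show a non-match, and longer paths are not handled.

```python
import xml.etree.ElementTree as ET


def attribute_path_matches(xml_text, ancestor, attribute, term):
    """Return True if `attribute` at or below an `ancestor` element contains `term`.

    Matching is case-insensitive, as the tutorial describes for searches.
    """
    root = ET.fromstring(xml_text)
    needle = term.lower()
    for element in root.iter(ancestor):
        for node in element.iter():  # the element itself and all descendants
            value = node.get(attribute)
            if value is not None and needle in value.lower():
                return True
    return False


# period.date is a hypothetical element added only to show a non-matching branch.
SAMPLE = """<docroot>
  <filing.date type="gregorian">
    <actual.date year="1999" />
  </filing.date>
  <period.date type="gregorian">
    <actual.date year="1998" />
  </period.date>
</docroot>"""

print(attribute_path_matches(SAMPLE, "filing.date", "year", "1999"))
```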

Boolean retrieve.tag

The retrieve.tag syntax of calls described in the previous section is very simple, but it doesn't give you a lot of control over your searches. In particular, a single variable, retrieve.tag.merge, specifies the "and" or "or" joining of your search results. The Boolean variant of the retrieve.tag syntax is slightly more complex, but gives you more control.

Note: When you submit a query to space.cgi, if the retrieve.tag.boolean tag is present and not empty, the SpaceServer will ignore any instances of retrieve.tag.e, retrieve.tag.a, retrieve.tag.x, and retrieve.tag.merge. You can pick one or the other, with the retrieve.tag.boolean taking precedence.

As before, you would select a subtree using the retrieve.tag.subtrees parameter. Then, you would specify three tags:

retrieve.tag.k.* specifies the name of the element or attribute you want to search

retrieve.tag.v.* is the value of the element.

retrieve.tag.boolean specifies how your terms are put together.

You may have many instances of retrieve.tag.k.* and retrieve.tag.v.*, but they must always be in pairs. The last component of each parameter name is the name of a variable, which can be anything you want. The contents of the k tag are a path in the same style as the basic retrieve.tag syntax. The value is any valid search term.

Example: You want to search in the doc.edgar subtree for the element company name (conformed.name) for two different companies ("Amazon" and "Yahoo") and the attribute for form type with a value of "10-K."

Boolean Retrieve Example

retrieve.tag.subtrees = "doc.edgar"

retrieve.tag.k.Variable1 = "retrieve.tag.e.conformed.name"

retrieve.tag.v.Variable1 = "Amazon"

retrieve.tag.k.Variable2 = "retrieve.tag.e.conformed.name"

retrieve.tag.v.Variable2 = "Yahoo"

retrieve.tag.k.Variable3 = "retrieve.tag.a.type"

retrieve.tag.v.Variable3 = "10-K"

The next tag you specify is the retrieve.tag.boolean tag, which indicates how you want your search put together:

Boolean Tag

retrieve.tag.boolean = "(Variable1 OR Variable2) AND Variable3"

Note: The Boolean tag may contain only variable names that you declared in your k/v tags, parentheses, and the words "and" or "or".
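Before submitting a Boolean query, a form generator might check that the expression uses only declared variables, parentheses, and the two operators. The Python sketch below does that and then emits the k/v pairs. The function names are hypothetical, the check covers vocabulary and parenthesis balance only (not full grammar), and accepting the operators in either upper or lower case is an assumption.

```python
import re


def check_boolean_expression(expression, variables):
    """Check tokens and parenthesis balance of a retrieve.tag.boolean value."""
    tokens = re.findall(r"[()]|[^\s()]+", expression)
    depth = 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                return False
        elif token.lower() in ("and", "or"):
            continue
        elif token not in variables:
            return False
    return depth == 0


def boolean_query(subtree, terms, expression):
    """terms maps a variable name to a (key path, value) tuple."""
    if not check_boolean_expression(expression, terms):
        raise ValueError("expression uses undeclared names or unbalanced parentheses")
    pairs = [("retrieve.tag.subtrees", subtree)]
    for variable, (key, value) in terms.items():
        pairs.append((f"retrieve.tag.k.{variable}", key))
        pairs.append((f"retrieve.tag.v.{variable}", value))
    pairs.append(("retrieve.tag.boolean", expression))
    return pairs


terms = {
    "Variable1": ("retrieve.tag.e.conformed.name", "Amazon"),
    "Variable2": ("retrieve.tag.e.conformed.name", "Yahoo"),
    "Variable3": ("retrieve.tag.a.type", "10-K"),
}
pairs = boolean_query("doc.edgar", terms, "(Variable1 OR Variable2) AND Variable3")
for name, value in pairs:
    print(f'{name} = "{value}"')
```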

You may think that switching from Style 1 (simple retrieve.tag) to Style 2 loses you something important: the ability to give the user control over Boolean operations. In Style 1, you can set up radio buttons that let the user choose "and" or "or":

Merge Example

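<INPUT type="radio" name="retrieve.tag.merge" value="and">
and (matches all your criteria)
<BR>
<INPUT type="radio" name="retrieve.tag.merge" value="or">
or (matches any one of your criteria)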

The same trick can be used in Boolean mode, but this time you can have more complex searches:

Boolean Example

<INPUT type="radio" name="retrieve.tag.boolean"
value="(Name AND SIC) AND YEAR" CHECKED>
and (Company Name AND SIC Code for this Year)
<BR>
<INPUT type="radio" name="retrieve.tag.boolean"
value="(Name OR SIC) AND YEAR">
or (Company Name OR SIC Code for this Year)

retrieve.text

If you want to retrieve based on source text, use the retrieve.text.* parameters. The CGI script will use a full-text search engine on the Web server that corresponds to the doc.edgar or doc.rfc subtree. So, the first text parameter is retrieve.text.subtree, and valid values are "doc.edgar" and "doc.rfc". You may only specify one value for this parameter.

First retrieve.text Example

retrieve.text.subtree = "doc.edgar"

Next, you specify a value consisting of one or more words separated by spaces. Searching is case-insensitive.

retrieve.text Value Example

retrieve.text.subtree = "doc.rfc"

retrieve.text.full = "Marshall T. Rose"

Note: The full text searching uses our SpaceEngine, a full-text, XML-aware search engine written for Invisible Worlds by Nassib Nassar, the author of the classic Isearch open-source engine. This particular full-text search uses phrase searching with proximity matching.

You may specify the retrieve.text.full parameter more than once, and additionally may specify a retrieve.text.merge parameter which may have a value of "and" or "or". A phrase search is submitted for the contents of each retrieve.text.full parameter and the results are merged together.

retrieve.blocks

If you always want the retrieval to include certain blocks (or don't want to search at all), simply specify the blocks you're interested in, separated by spaces:

retrieve.blocks Example

retrieve.blocks = "doc.rfc.1006 doc.rfc.2223"

The block names to use in this parameter can be found by examining your query results in debug mode. See the publish debug options below.


The retrieve.blocks parameter is quite useful as part of building new things into search result pages. For example, if you search EDGARspace for annual reports, you might gather the metadata about each block found to format a list that includes pointers to the full text, the company name, and the form type. Next to that, you could preformat a search that extracts the business address of the company.

retrieve.blocks Example

http://edgar.space.invisible.net/cgi
?retrieve.blocks=doc.edgar.1999.101829.27
&evaluate.scripts=evaluate.null.1
&publish.script=publish.doc.edgar.showme
&user.showme=business.address
&user.baseURL=http://edgar.space.invisible.net/


retrieve.union

The third, and most advanced style, uses the powerful union operator. A separate tutorial is provided to explain this advanced syntax.

Other Retrieve Parameters

We have covered several different variants of retrieve parameters, including:

The basic retrieve.tag.* style

The Boolean retrieve.tag.* variant

retrieve.blocks

retrieve.text

retrieve.union

Normally, you would pick one of those styles and submit your query. It is possible to use multiple styles in a single query. The rules for combining your results are as follows:

If you specify both the basic retrieve.tag parameters (retrieve.tag.merge and retrieve.tag.{a,e,x}.*) and retrieve.tag.boolean, the Boolean syntax wins and the basic tags are ignored.

Each of the retrieve.tag, retrieve.blocks, retrieve.text, and retrieve.union queries is submitted and the results are merged together as a Boolean "or."

retrieve.maxhits

While your query may (potentially) bring back an arbitrary number of results, retrieve.maxhits tells the SpaceServer it may safely ignore any results that exceed the maxhits parameter.

Maxhits Parameter

retrieve.maxhits = "24"

Note: The default value for maxhits if this parameter is not present is 25. At present, the SpaceEngine, our data repository, uses arbitrary relevance ranking. In the future, you'll be able to specify relevance ranking, and then apply maxhits to those results.

An enhancement of the maxhits parameter is the optional retrieve.offset parameter. Maxhits specifies how many results you want returned. The offset specifies an offset in the total query results. Thus, if you are implementing a search results form that shows a user 25 results at a time, the offset parameter would let you implement a "next 25" button.

Offset Parameter

retrieve.maxhits = "25"

retrieve.offset = "25"

Note: Offsets start at 0 (e.g., the first block returned is numbered 0) so an offset of 1 would bring you back result number 2.
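The arithmetic behind a "next 25" button is a single multiplication. The Python sketch below shows it; the function name page_parameters and the idea of numbering pages from 0 are assumptions for this example, while the two parameter names come from the text above.

```python
def page_parameters(page_number, page_size=25):
    """Return retrieve.maxhits and retrieve.offset for a zero-based page number."""
    if page_number < 0 or page_size < 1:
        raise ValueError("page_number must be >= 0 and page_size must be >= 1")
    return {
        "retrieve.maxhits": str(page_size),
        "retrieve.offset": str(page_number * page_size),
    }


first_page = page_parameters(0)
second_page = page_parameters(1)
print(first_page, second_page)
```

With a page size of 25, page 1 yields the same values as the Offset Parameter example: a maxhits of 25 and an offset of 25.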

retrieve.tag.n

A special parameter that you may wish to use normalizes a user query. In the SEC's EDGAR database, a variety of transformations are applied to the name of a company, such as stripping out spaces, hyphens, and other types of punctuation. If your user searched for "U.S.", the query would return zero hits because the term "U.S." is not in the database.

The retrieve.tag.n parameter specifies an element or attribute that you wish to normalize before sending the query into the SpaceServer. At present, this is only valid for one element, conformed.name in the doc.edgar subtree, and the only valid value is 1, which specifies the generic normalization rules for EDGAR names.

Normalization Example

retrieve.tag.n.conformed.name = "1"

The normalization rules consist of the following heuristics applied to the user query:

If an apostrophe is present, it is stripped out. Thus, "O'Gready" is turned into "OGready".

If a hyphen is present, it is turned into a space. Thus, "Invisible-Worlds" is turned into "Invisible Worlds".

If periods are present in single letter words, they are stripped out. Thus "U.S." becomes "US" and "F.H.W.B." becomes "FHWB". Caution: The SEC process of stripping out these periods is inconsistent across company names. Thus, "U.S." sometimes becomes "US" and sometimes becomes "U S ".

Any other periods are turned into spaces. Thus "Broadcast.com" is turned into "Broadcast Com".

The following words are stripped from the query: CO COMPANY CORP CORPORATION INC INCORPORATED LIMITED LTD and THE.
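The heuristics listed above can be approximated in a few lines. The Python sketch below is a reading of those rules, not the code space.cgi runs: the function name is hypothetical, the rules are applied in the order listed, and letter case is left unchanged (searching is case-insensitive), so "Broadcast.com" comes out as "Broadcast com".

```python
import re

STOP_WORDS = {
    "CO", "COMPANY", "CORP", "CORPORATION",
    "INC", "INCORPORATED", "LIMITED", "LTD", "THE",
}


def normalize_edgar_name(query):
    """Approximate the EDGAR name normalization rules described in the tutorial."""
    text = query.replace("'", "")      # apostrophes are stripped
    text = text.replace("-", " ")      # hyphens become spaces
    # Periods after single-letter words are stripped: "U.S." -> "US"
    text = re.sub(r"\b(?:[A-Za-z]\.)+", lambda m: m.group(0).replace(".", ""), text)
    text = text.replace(".", " ")      # any other period becomes a space
    words = [word for word in text.split() if word.upper() not in STOP_WORDS]
    return " ".join(words)


for sample in ["O'Gready", "Invisible-Worlds", "U.S.", "F.H.W.B.", "Broadcast.com"]:
    print(sample, "->", normalize_edgar_name(sample))
```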

url.*

Two parameters starting with url.* are provided so that you can call a web site that is password protected. These two tags specify the username and password to access the site. You would typically include those tags as hidden fields in your form, but be aware that view source is not considered an advanced function by most web users so this method does not provide a great deal of security. (The Blocks protocol includes full security at the transport and user layers, so you will be able to use much more sophisticated security mechanisms in the future when we release the protocol specifications.)

The syntax of the username and password parameters consists of the keyword "url" followed by a valid domain name, followed by the keyword "username" or "password." The domain is the domain name you are specifying in your calls to space.cgi or the domain name which you are specifying for your evaluate and publish scripts. Note that these might be different domain names and you can have several different url tags in your form or other calls.

url.* Example

url.edgar.space.invisible.net.username = "Invisible"

url.edgar.space.invisible.net.password = "Worlds"

url.myscripts.myhome.com.username = "myname"

url.myscripts.myhome.com.password = "mypassword"
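Because the domain name itself contains periods, the parameter name has to be split from both ends when it is read back. The Python sketch below builds a username and password pair for a domain and parses such a name again; both function names are hypothetical and only the naming pattern comes from the text above.

```python
def credential_parameters(domain, username, password):
    """Return the two url.* parameters for one domain."""
    return {
        f"url.{domain}.username": username,
        f"url.{domain}.password": password,
    }


def split_credential_name(name):
    """Split 'url.<domain>.<field>' into (domain, field)."""
    prefix, _, rest = name.partition(".")
    domain, _, field = rest.rpartition(".")
    if prefix != "url" or field not in ("username", "password") or not domain:
        raise ValueError(f"not a url.* credential parameter: {name}")
    return domain, field


creds = credential_parameters("edgar.space.invisible.net", "Invisible", "Worlds")
print(creds)
print(split_credential_name("url.myscripts.myhome.com.password"))
```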

retrieve.server

One final retrieve parameter, which is included for completeness, is the retrieve.server parameter. This is included because the space.cgi interface uses the Blocks protocol to communicate with a SpaceServer engine. In the future, a single space.cgi implementation might be used to retrieve data from several different SpaceServer engines. The valid value for this parameter is an IP address or domain name.

retrieve.server Example

retrieve.server = "mySpaceServer.invisible.net"

Evaluation Parameters

You can specify a chain of evaluation scripts that are written in Tcl and run in sequence. Those scripts can be specified in one of two ways: the "scripts" method is based on scripts that are stored in the SpaceServer, and the "URL" method is used to specify any script available on the World Wide Web. Scripts are run in a Safe Tcl sandbox, with safe filesystem and socket access allowed. More information on safe interpreters can be found on the Scriptics web site.

To specify evaluate scripts, use one of the styles of parameters:

Evaluation Parameters

evaluate.scripts = "evaluate.doc.rfc.1"

evaluate.uris = "http://example.com/evaluate.tcl"

Some scripts will want to fetch additional information to perform their evaluations. For example, the evaluate.doc.rfc.1 script looks for a parameter called luminance, which gives the location of a text file available via HTTP. This file contains a series of block names (e.g., "doc.rfc.2629"), each followed by a set of name-value pairs that specify the "luminance" (the importance) of a particular RFC:

Sample Evaluation Parameter

doc.rfc.2629 luminance 100

To tell this evaluate script where to find the file, you would specify the evaluate.luminance parameter:

evaluate.luminance Parameter Example

evaluate.luminance = "http://example.com/important/luminance.txt"
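For readers who want to generate or check such a file, the Python sketch below parses the format shown in the sample line. It assumes one block per line with whitespace-separated fields, which the tutorial does not state explicitly; the function name is hypothetical, and the second line of the sample data is invented for the example.

```python
def parse_luminance(text):
    """Parse lines of the form '<block name> <name> <value> ...' into a dict."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        block, rest = fields[0], fields[1:]
        if len(rest) % 2:
            raise ValueError(f"unpaired name/value on line: {line!r}")
        table[block] = dict(zip(rest[0::2], rest[1::2]))
    return table


# The first line is from the tutorial; the second is hypothetical sample data.
sample_file = "doc.rfc.2629 luminance 100\n\ndoc.rfc.1006 luminance 40\n"
table = parse_luminance(sample_file)
print(table)
```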

The evaluate script is run after the retrieve operation is done. Prior to calling the script, an options array is initialized which contains an entry for each argument that was passed into space.cgi:

Entries for Arguments Passed into space.cgi

set options(publish.script) publish.doc.rfc.1

set options(retrieve.maxhits) 25

Next, for each block that was retrieved, an item is added to the blocks array. The blocks array contains an entry for each retrieved block. The contents of each entry are a list. The first item of the list is the name of the element. The second item of the list is the element's attributes, represented as a serialized array. The remaining items of the list, three through N, are the element's children. Each child is stored in the same list format. If the element contains text, then it has exactly one child and the name of that child is ".pcdata".

Sample XML Block

<outer name1="value1" name2="value2">
<middle>some text</middle>
<middle><inner/></middle>
</outer>

is represented as:

Evaluation Array

{ outer { name1 value1 name2 value2 }

{ middle {} {.pcdata {} "some text"} }

{ middle {} {inner {} } } }
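The same nesting can be reproduced outside Tcl to see how an element maps onto the list form. The Python sketch below converts a small XML document into nested lists of name, attributes, and children. It is an illustration of the layout only: the function name is hypothetical, the XML string is inferred from the list shown above, and mixed content (text alongside child elements) is not handled.

```python
import xml.etree.ElementTree as ET


def to_block_list(element):
    """Return [name, attributes, child, child, ...] for an element."""
    entry = [element.tag, dict(element.attrib)]
    text = (element.text or "").strip()
    if text:
        # Text is stored as a single child named ".pcdata"
        entry.append([".pcdata", {}, text])
    for child in element:
        entry.append(to_block_list(child))
    return entry


XML = (
    '<outer name1="value1" name2="value2">'
    "<middle>some text</middle>"
    "<middle><inner/></middle>"
    "</outer>"
)
block = to_block_list(ET.fromstring(XML))
print(block)
```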

When the evaluate script completes, it should have created (or updated, if it is a script in the middle of an evaluation chain) the relates array, which contains an entry for each block. The syntax of each entry is a "name value name value..." list, e.g.,

Evaluation Syntax

set relates(doc.rfc.1234) \
{score 1 superiors {doc.rfc.5678 doc.rfc.3456} luminance 1}

The semantics of each entry depend on how the publication script will use it.

Publication Parameters

The CGI script will execute a Tcl script to generate output data that is returned using HTTP to the caller of space.cgi. As with evaluate scripts, the publish script is specified using one of two methods:

Publication Parameters

publish.script = "publish.doc.rfc.1"

publish.uri = "http://example.com/publish.tcl"

Note: The words "script" and "uri" are singular because only one publish script may be specified.

Of special utility to publication scripts are two special parameters, retrieve.names and retrieve.allhits, both of which are created after the retrieval and before the first evaluation script is run.

The retrieve.names parameter is the list of block names, in order, that are being returned. Space.cgi will preserve name ordering from the things it talks to. So, if you want the "default" order in which to publish things, traverse retrieve.names, e.g.,

retrieve.names Parameter

foreach name $options(retrieve.names) ...

The retrieve.allhits parameter is the total size of the result set. What gets returned is up to retrieve.maxhits entries from the result set, starting at position retrieve.offset. Accordingly, if you are writing a publication script and you want to know whether there is more to come, you can do this:

Sample Publication Script

if {$options(retrieve.offset)+$options(retrieve.maxhits) < $options(retrieve.allhits)} {
# create a more button, using identical parameters EXCEPT options(retrieve.offset)
# is incremented by options(retrieve.maxhits)
}
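The same "is there more to come" test can be modeled outside Tcl. The Python sketch below mirrors the comparison and returns the parameters a "more" button would carry. It is a model for illustration, not a publish script: the function name is hypothetical, a missing retrieve.offset is assumed to mean 0, and a missing retrieve.maxhits falls back to the documented default of 25.

```python
def next_page_options(options):
    """Return options for the next page, or None if no results remain."""
    offset = int(options.get("retrieve.offset", 0))    # assumed default of 0
    maxhits = int(options.get("retrieve.maxhits", 25))  # documented default
    allhits = int(options["retrieve.allhits"])
    if offset + maxhits >= allhits:
        return None
    following = dict(options)  # identical parameters except the offset
    following["retrieve.offset"] = str(offset + maxhits)
    return following


current = {
    "retrieve.offset": "0",
    "retrieve.maxhits": "25",
    "retrieve.allhits": "60",
    "publish.script": "publish.doc.rfc.1",
}
print(next_page_options(current))
```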

The best way to understand scripts is to go to the area of this Rocket Science tutorial that describes the scripts that were developed by Invisible Worlds. You can write your own, or you can simply specify one of the canned scripts already developed.
