The space.cgi Interface
Space.cgi is a web proxy which makes the services of the SpaceServer
engine available to traditional web browsers. The EDGARspace portal is
built using this interface. Underneath the space.cgi interface is
a rich architecture of protocols, servers, and other modules which will
be described in future issues of Rocket Science. To learn
more about the basic architecture, see
The Importance of Being EDGAR.
This tutorial explains how you, as a web developer, can use the space.cgi
interface to issue calls to a SpaceServer engine. You place your calls
to space.cgi using the cgi-bin interface, which is typically invoked
using a form. For more information on how to build forms and use
the cgi-bin interface, we recommend you read one of the O'Reilly and
Associates books.
Tech Note: We support both the GET and POST methods of
calling space.cgi. Because the number of parameters is large,
you should seriously consider using the POST method.
Quick Start: If you don't believe in reading documentation,
may we suggest you simply look at some examples.
Rocket Science has several examples
or you can go look at the source for one of the forms
in EDGARspace, such as the basic
Classic Query Form.
There are four sets of parameters in a call to space.cgi, prefixed by
either "retrieve.",
"evaluate.", "publish.", or "url."
(e.g., "retrieve.tag.merge").
If you are using the GET method, you simply create a call to
space.cgi, and add the parameters:
CGI Parameters |
http://edgar.space.invisible.net/cgi-bin/space.cgi
?retrieve.tag.merge=and
&retrieve.tag.subtrees=doc.edgar
&retrieve.tag.e.conformed.name=micro
&retrieve.tag.a.form/type=10-K
&evaluate.scripts=evaluate.null.1
&publish.script=publish.debug.1
|
|
NOTE: This would all be on one line as part of a single
call to space.cgi.
This query searches EDGARspace for 10-Ks (annual reports) for
companies with micro in the name and publishes the results in
a simple debug format.
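As a rough illustration, the same query can be assembled in Python for either calling method. This is a sketch only: it assumes the endpoint above is reachable, and it builds the GET URL and the POST request without sending anything.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint copied from the example above; its availability is an assumption.
SPACE_CGI = "http://edgar.space.invisible.net/cgi-bin/space.cgi"

PARAMS = [
    ("retrieve.tag.merge", "and"),
    ("retrieve.tag.subtrees", "doc.edgar"),
    ("retrieve.tag.e.conformed.name", "micro"),
    ("retrieve.tag.a.form/type", "10-K"),
    ("evaluate.scripts", "evaluate.null.1"),
    ("publish.script", "publish.debug.1"),
]

def get_url(params):
    """Return the single-line GET URL for a space.cgi call."""
    # safe="/" keeps the path separator in names such as "form/type" readable.
    return SPACE_CGI + "?" + urlencode(params, safe="/")

def post_request(params):
    """Return an unsent POST request carrying the same parameters in its body."""
    body = urlencode(params, safe="/").encode("ascii")
    return Request(SPACE_CGI, data=body, method="POST")

url = get_url(PARAMS)
request = post_request(PARAMS)
print(url)
print(request.get_method(), len(request.data), "bytes in body")
```

Passing `request` to `urllib.request.urlopen` would actually submit the query.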
An important principle to understand is that of passthrough
parameters. Any parameter that is not understood by one stage
of the space.cgi process is passed on to the next stage and
ultimately back to the client as part of the options array (if
the publish script decides to pass on the options array).
While you can use any parameters you want, we strongly recommend
that you use "user" as the prefix for anything that is going
to be used by the user interface. If you are writing an
evaluate script, you should use evaluate.user as your prefix.
A good example of passthrough parameters is documented in the
companion Rocket Science describing Danny Goodman's JavaScript
tour de force interface to space.cgi. By passing in user parameters,
Danny has set up his interface to perform actions on return
of data to the user. For example, if the user.saveMapOptions.islandSortType
parameter has a value of "filer.name" and the data coming back
is from doc.edgar, the visual interface will perform a row sort
based on the filer's company name.
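The passthrough idea can be pictured with a small Python sketch. It is a simplification, not the actual space.cgi logic: it assumes a parameter's leading prefix decides which stage looks at it, and the function name `split_by_stage` is hypothetical.

```python
STAGES = ("retrieve", "evaluate", "publish", "url")

def split_by_stage(params):
    """Group parameters by leading prefix; anything else is treated as passthrough."""
    groups = {stage: {} for stage in STAGES}
    groups["passthrough"] = {}
    for name, value in params.items():
        prefix = name.split(".", 1)[0]
        bucket = prefix if prefix in STAGES else "passthrough"
        groups[bucket][name] = value
    return groups

groups = split_by_stage({
    "retrieve.tag.subtrees": "doc.edgar",
    "publish.script": "publish.debug.1",
    "user.saveMapOptions.islandSortType": "filer.name",
})
# The user.* parameter is untouched by every stage and reaches the client as is.
print(groups["passthrough"])
```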
The retrieval is the first stage of the process of bringing
data out of the SpaceServer and over to the user interface.
We provide you with several different ways of specifying the
retrieval. In all cases, you are specifying a value for
an element or an attribute for one of the pieces of meta-data
that are stored in the SpaceServer. The
examples section of this
Rocket Science can give you a quick start, and the
DTDs description will show you
all the possible elements and attributes you can specify
for your search.
For the retrieval step, you have three different styles of
call you can place:
- The simplest syntax is the basic retrieve.tag
syntax which is used to put together a very simple form
in which you can choose to "and" or "or" all your search
terms together.
- A variant of the retrieve.tag syntax is the retrieve.tag.boolean
syntax, which gives you more control over the Boolean structure
of your query.
- The most powerful (read "complex") syntax is the union style
of calls in which you use XML to construct arbitrarily
complex searches.
- In addition, you may specify retrieve.blocks to bypass
the search process and go directly to a named block in the
SpaceServer or retrieve.text to do full-text searches.
Basic retrieve.tag
The first step in specifying a search is to set the
retrieve.tag.subtrees parameter,
which specifies which portions of the Blocks Namespace you wish to
search. The value of the tag is a space-separated list of
Blocks subtrees. To date, there are two valid values:
- doc.edgar - The SEC's EDGAR database
- doc.rfc - The Internet Request for Comment series
Basic Retrieve Tag |
retrieve.tag.subtrees = "doc.edgar"
retrieve.tag.subtrees = "doc.edgar doc.rfc"
|
|
The next tag you specify is retrieve.tag.merge which
has two valid values: "and" or "or." This tag is applied to
all of your search terms that are done using Style 1. Note
that other ways of specifying Boolean operations are used
for all the other styles. Note also that if you combine styles,
or use the retrieve.blocks or retrieve.text method of searching,
all of these are combined together before being handed over
to the evaluate and publish stages.
The Merge Tag |
retrieve.tag.merge = "and"
retrieve.tag.merge = "or"
|
|
Note: You may only have one retrieve.tag.merge tag in your call.
Next, you specify how to search the XML content. Searching is
case-insensitive. There are three ways to do this:
- tag.e.*, which looks for text inside an element's character data;
- tag.a.*, which looks for text inside a particular attribute value in any
element; or,
- tag.x.*, which looks for text inside any attribute value in a particular
element.
These parameters may occur multiple times (with results combined
according to the value of the tag.merge parameter).
For rfc.space, here are the parameters to try, along with suggested values
(if any):
- tag.a.category: std, bcp, info, or exp
- tag.e.doc.area: Applications, General, Internet, Management, Operations,
Routing, Security, Transport, User
- tag.e.doc.workgroup
- tag.e.doc.keyword
- tag.e.doc.title
- tag.x.doc.author
So, if you wished to construct a query that looks for any attribute
or element of type author with a value of "Rose," and a
status of "Internet Standard," your call to space.cgi would consist
of the following parameters:
Sample Query |
retrieve.tag.subtrees = "doc.rfc"
retrieve.tag.merge = "and"
retrieve.tag.a.category = "std"
retrieve.tag.x.doc.author = "Rose"
|
|
One extension to the retrieve.tag syntax allows you to
specify a path. For example, in the EDGARspace DTD,
you will notice that many kinds of dates are included,
including the filing date and the period to which
the filing pertains.
By specifying a path, you can indicate which specific
date is to be searched. Paths are wild-carded, so you
do not specify the full path to an attribute or
element, only the pieces of the path that make it
unique. The path is specified by separating each
of the elements with a "/" character.
Retrieve Tag Example Attribute "year" |
retrieve.tag.a.filing.date/year = "1999"
|
|
This retrieve.tag parameter looks for the attribute
year which is contained within an element called
filing.date. Note that the SpaceServer uses wild-card
paths, which means that filing.date can be any
ancestor element above the attribute, not necessarily its
immediate parent. For example, the above
search would match the following XML
structure:
XML Result |
<docroot>
<filing.date type="gregorian">
<actual.date year="1999" />
</filing.date>
</docroot>
|
|
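To make the wild-card behavior concrete, here is a small Python sketch that mimics the matching locally. It reflects one reading of the rules above, not the SpaceServer implementation: each path component before the last must name some ancestor (or the owning element) in order, and the value test is a case-insensitive substring match. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<docroot>
  <filing.date type="gregorian">
    <actual.date year="1999" />
  </filing.date>
</docroot>
"""

def attribute_path_matches(xml_text, path, term):
    """True if the attribute named by path (e.g. "filing.date/year") contains term."""
    *ancestors, attribute = path.split("/")
    needle = term.lower()

    def walk(element, remaining):
        # Consume the next required ancestor when this element carries its name.
        if remaining and element.tag == remaining[0]:
            remaining = remaining[1:]
        # Once every ancestor has been seen, test the attribute on this element.
        if not remaining and needle in element.get(attribute, "").lower():
            return True
        return any(walk(child, remaining) for child in element)

    return walk(ET.fromstring(xml_text), ancestors)

print(attribute_path_matches(SAMPLE, "filing.date/year", "1999"))
print(attribute_path_matches(SAMPLE, "period/year", "1999"))
```

The first call matches even though year sits on actual.date rather than directly on filing.date. The second does not, because no element named period appears above the attribute.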
Boolean retrieve.tag
The retrieve.tag syntax of calls described in the previous
section is very simple, but it doesn't give you a lot
of control over your searches. In particular, a single
variable, retrieve.tag.merge, specifies the "and" or "or" joining
of your search results. The Boolean variant of the
retrieve.tag syntax is slightly more
complex, but gives you more control.
Note: When you submit a query to space.cgi, if the retrieve.tag.boolean
tag is present and not empty, the SpaceServer will ignore
any instances of retrieve.tag.e, retrieve.tag.a, retrieve.tag.x,
and retrieve.tag.merge. You can pick one or the other, with the
retrieve.tag.boolean taking precedence.
As before, you would select a subtree using the retrieve.tag.subtrees
parameter. Then, you would specify three tags:
- retrieve.tag.k.* specifies the name of the element or attribute you want to search.
- retrieve.tag.v.* specifies the value to search for in that element or attribute.
- retrieve.tag.boolean specifies how your terms are put together.
You may have many instances of retrieve.tag.k.* and retrieve.tag.v.*, but
they must always be in pairs.
The last component of each parameter name in a pair is the name of a variable,
which can be anything you want. The contents of the k tag are
a path in the same style as the basic retrieve.tag syntax.
The value is any valid search term.
Example: You want to search in the doc.edgar subtree for the
element company name (conformed.name) for two different
companies ("Amazon" and "Yahoo") and
the attribute for form type with a value of "10-K."
Boolean Retrieve Example |
retrieve.tag.subtrees = "doc.edgar"
retrieve.tag.k.Variable1 = "retrieve.tag.e.conformed.name"
retrieve.tag.v.Variable1 = "Amazon"
retrieve.tag.k.Variable2 = "retrieve.tag.e.conformed.name"
retrieve.tag.v.Variable2 = "Yahoo"
retrieve.tag.k.Variable3 = "retrieve.tag.a.type"
retrieve.tag.v.Variable3 = "10-K"
|
|
The next tag you specify is the retrieve.tag.boolean tag,
which indicates how you want your search put together:
Boolean Tag |
retrieve.tag.boolean = "(Variable1 OR Variable2) AND Variable3"
|
|
Note: The Boolean tag may contain only variable names
that you declared in your k/v tags, parentheses, and
the words "and" or "or".
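If you generate Boolean queries from code, a helper along these lines can build the k/v pairs and reject expressions that break the rule in the note. This is a Python sketch under assumptions: the helper name `boolean_params` is hypothetical, "and"/"or" are accepted in any letter case, and parentheses are not checked for balance.

```python
import re

def boolean_params(terms, expression):
    """Build retrieve.tag.k/v pairs plus retrieve.tag.boolean, checking the tokens."""
    allowed = {"(", ")", "and", "or"}
    for token in re.findall(r"[()]|[^\s()]+", expression):
        if token.lower() not in allowed and token not in terms:
            raise ValueError(f"unexpected token in Boolean expression: {token!r}")
    params = []
    for variable, (path, value) in terms.items():
        params.append((f"retrieve.tag.k.{variable}", path))
        params.append((f"retrieve.tag.v.{variable}", value))
    params.append(("retrieve.tag.boolean", expression))
    return params

# Terms copied from the example above: variable name -> (k value, v value).
TERMS = {
    "Variable1": ("retrieve.tag.e.conformed.name", "Amazon"),
    "Variable2": ("retrieve.tag.e.conformed.name", "Yahoo"),
    "Variable3": ("retrieve.tag.a.type", "10-K"),
}
params = boolean_params(TERMS, "(Variable1 OR Variable2) AND Variable3")
for name, value in params:
    print(name, "=", value)
```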
You may think that switching from Style 1 (simple retrieve.tag)
to Style 2 loses you something important: the ability to give
the user control over Boolean operations. In Style 1, you
can set up a pair of radio buttons that lets the user choose
"and" or "or":
Merge Example |
<INPUT type="radio" name="retrieve.tag.merge" value="and" CHECKED>
and (matches all your criteria)
<BR>
<INPUT type="radio" name="retrieve.tag.merge" value="or" >
or (matches any one of your criteria)
|
|
The same trick can be used in Boolean mode, but this time you can
have more complex searches:
Boolean Example |
<INPUT type="radio" name="retrieve.tag.boolean"
value="(Name AND SIC) AND YEAR" CHECKED>
and (Company Name AND SIC Code for this Year)
<BR>
<INPUT type="radio" name="retrieve.tag.boolean"
value="(Name OR SIC) AND YEAR" >
or (Company Name OR SIC Code for this Year)
|
|
retrieve.text
If you want to retrieve based on source text, use the retrieve.text.*
parameters. The CGI script will use a full-text search engine on the
Web server which conforms to the doc.edgar or doc.rfc subtrees. So, the
first text parameter is retrieve.text.subtree, and
valid values are "doc.edgar" and "doc.rfc". You may only specify
one value for this parameter.
First retrieve.text Example |
retrieve.text.subtree = "doc.edgar"
|
|
Next, you specify a value consisting of one or more words
separated by spaces. Searching is case-insensitive.
retrieve.text Value Example |
retrieve.text.subtree = "doc.rfc"
retrieve.text.full = "Marshall T. Rose"
|
|
Note: The full text searching uses our SpaceEngine, a full-text,
XML-aware search engine written for Invisible Worlds by Nassib
Nassar, the author of the classic Isearch open-source engine. This
particular full-text search uses phrase searching with proximity
matching.
You may specify the retrieve.text.full parameter more than
once, and additionally may specify a retrieve.text.merge
parameter which may have a value of "and" or "or".
A phrase search is submitted for the contents of each
retrieve.text.full parameter and the results are merged together.
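Because retrieve.text.full may repeat, the query string has to carry the same parameter name more than once. A minimal Python sketch follows. The second phrase is invented for illustration, and the helper always sends retrieve.text.merge explicitly because the default is not stated here.

```python
from urllib.parse import urlencode

def text_query(subtree, phrases, merge="and"):
    """Return a query string with one retrieve.text.full entry per phrase."""
    params = [("retrieve.text.subtree", subtree), ("retrieve.text.merge", merge)]
    params += [("retrieve.text.full", phrase) for phrase in phrases]
    return urlencode(params)

# "network management" is a made-up second phrase, used only as an example.
query = text_query("doc.rfc", ["Marshall T. Rose", "network management"], merge="or")
print(query)
```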
retrieve.blocks
If you always want the retrieval to include certain blocks (or don't want to
search at all), simply specify the blocks you're interested in, separated by
spaces:
retrieve.blocks Example |
retrieve.blocks = "doc.rfc.1006 doc.rfc.2223"
|
|
Block names can be found by examining your query results
in debug mode. See the publish debug options below.
The retrieve.blocks parameter is quite useful as part of
building new things into search result pages. For example,
if you search EDGARspace for annual reports, you might
gather the metadata about each block found to format a list
that includes pointers to the full text, the company name,
and the form type. Next to that, you could preformat
a search that extracts the business address of the
company.
retrieve.blocks Example |
http://edgar.space.invisible.net/cgi
?retrieve.blocks=doc.edgar.1999.101829.27
&evaluate.scripts=evaluate.null.1
&publish.script=publish.doc.edgar.showme
&user.showme=business.address
&user.baseURL=http://edgar.space.invisible.net/
|
|
This command feeds in a specific block, runs it through a null
evaluation, and then calls the showme publication script to
pull out the business address.
retrieve.union
The third, and most advanced, style uses the powerful union
operator. A separate tutorial
is provided to explain this
advanced syntax.
Other Retrieve Parameters
We have covered several different variants of retrieve parameters,
including:
- The basic retrieve.tag.* style
- The Boolean retrieve.tag.* variant
- retrieve.blocks
- retrieve.text
- retrieve.union
Normally, you would pick one of those styles and submit your
query. It is possible to use multiple
styles in a single query. The rules for combining your results are as follows:
- If you specify both the basic retrieve.tag parameters (i.e.,
retrieve.tag.{a,e,x} and retrieve.tag.merge) and retrieve.tag.boolean, the
Boolean syntax wins and the basic tags are ignored.
- The retrieve.tag, retrieve.blocks, retrieve.text,
and retrieve.union queries are each submitted and the results are
merged together as a Boolean "or."
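The two rules can be sketched in Python as follows. This models the behavior described here, not the server code. Both function names are hypothetical, and keeping first-seen order when merging is an assumption.

```python
BASIC_PREFIXES = ("retrieve.tag.e.", "retrieve.tag.a.", "retrieve.tag.x.")

def effective_params(params):
    """Drop the basic retrieve.tag terms when retrieve.tag.boolean is non-empty."""
    if not params.get("retrieve.tag.boolean"):
        return dict(params)
    return {
        name: value
        for name, value in params.items()
        if name != "retrieve.tag.merge" and not name.startswith(BASIC_PREFIXES)
    }

def merge_or(*result_lists):
    """Union of block names across retrieval styles, without duplicates."""
    seen = {}
    for results in result_lists:
        for name in results:
            seen.setdefault(name, None)
    return list(seen)

kept = effective_params({
    "retrieve.tag.subtrees": "doc.edgar",
    "retrieve.tag.merge": "and",
    "retrieve.tag.e.conformed.name": "micro",
    "retrieve.tag.boolean": "A AND B",
})
merged = merge_or(["doc.rfc.1006", "doc.rfc.2223"], ["doc.rfc.2223", "doc.rfc.2629"])
print(sorted(kept))
print(merged)
```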
retrieve.maxhits
While your query may (potentially)
bring back an arbitrary number of results, retrieve.maxhits tells
the SpaceServer it may safely ignore any results that
exceed the maxhits parameter.
Maxhits Parameter |
retrieve.maxhits = "24"
|
|
Note: The default value for maxhits if this parameter is
not present is 25. At present, the SpaceEngine, our
data repository, uses arbitrary relevance ranking. In
the future, you'll be able to specify relevance ranking,
and then apply maxhits to those results.
An enhancement of the maxhits parameter is the optional
retrieve.offset parameter. Maxhits specifies how many results
you want returned. The offset specifies an offset in the total
query results. Thus, if you are implementing a search
results form that shows a user 25 results at a time,
the offset parameter would let you implement a "next 25"
button.
Offset Parameter |
retrieve.maxhits = "25"
retrieve.offset = "25"
|
|
Note: Offsets start at 0 (i.e., the first block returned
is numbered 0), so an offset of 1 would start with the
second result.
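One way to picture how maxhits and offset interact is as a slice over the full result list. The Python sketch below simulates that locally. The result names are invented, and the slice model is an interpretation of the description rather than the server's code.

```python
def page(results, maxhits=25, offset=0):
    """Return the results selected by retrieve.maxhits and retrieve.offset."""
    return results[offset:offset + maxhits]

# Hypothetical result set of 60 blocks, numbered from 0.
hits = [f"block{n}" for n in range(60)]

first = page(hits)                # blocks 0 through 24
second = page(hits, offset=25)    # the "next 25": blocks 25 through 49
last = page(hits, offset=50)      # only 10 results remain
print(first[0], second[0], len(last))
```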
retrieve.tag.n
A special parameter that you may wish to use is for
normalizing a user query. In the SEC's EDGAR database,
a variety of transformations are applied to the name of
a company, such as stripping out spaces, hyphens, and other
types of punctuation. If your user searched for "U.S.",
the query would return zero hits since the term "U.S." is not
in the database.
The retrieve.tag.n parameter specifies an element or attribute that
you wish to normalize before sending the query into the
SpaceServer. At present, this is only valid
for one element, conformed.name in the doc.edgar subtree, and
the only valid value is 1, which specifies the generic normalization
rules for EDGAR names.
Normalization Example |
retrieve.tag.n.conformed.name = "1"
|
|
The normalization rules consist of the following heuristics
applied to the user query:
- If an apostrophe is present, it is stripped out. Thus,
"O'Gready" is turned into "OGready".
- If a hyphen is present, it is turned into a space. Thus,
"Invisible-Worlds" is turned into "Invisible Worlds".
- If periods are present in single letter words, they are
stripped out. Thus "U.S." becomes "US" and "F.H.W.B." becomes
"FHWB". Caution: The SEC process of stripping out these
periods is inconsistent across company names. Thus, "U.S."
sometimes becomes "US" and sometimes becomes "U S ".
- Any other periods are turned into spaces. Thus "Broadcast.com"
is turned into "Broadcast Com".
- The following words are stripped from the query:
CO COMPANY CORP CORPORATION INC INCORPORATED LIMITED LTD and THE.
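If you want to preview what normalization does to a name before submitting it, the heuristics above can be approximated in Python. This is a sketch, not the code space.cgi runs. The order in which the rules are applied is an assumption, and letter case is left alone (so "Broadcast.com" yields "Broadcast com"), which should not matter because searching is case-insensitive.

```python
import re

STOP_WORDS = {"CO", "COMPANY", "CORP", "CORPORATION", "INC",
              "INCORPORATED", "LIMITED", "LTD", "THE"}

def normalize_name(query):
    """Apply the EDGAR name heuristics listed above to a user query."""
    text = query.replace("'", "")   # O'Gready -> OGready
    text = text.replace("-", " ")   # Invisible-Worlds -> Invisible Worlds
    # Runs of dotted single letters lose their periods: U.S. -> US
    text = re.sub(r"\b(?:[A-Za-z]\.){2,}",
                  lambda match: match.group(0).replace(".", ""), text)
    text = text.replace(".", " ")   # any other period becomes a space
    words = [word for word in text.split() if word.upper() not in STOP_WORDS]
    return " ".join(words)

for name in ["O'Gready", "Invisible-Worlds", "F.H.W.B.", "Broadcast.com",
             "The U.S. Widget Co."]:
    print(name, "->", normalize_name(name))
```

Given the caution about "U S " above, a careful client might also try the spaced variant when the compact form returns nothing.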
url.*
Two parameters starting with url.* are provided so that you can
call a web site that is password protected.
These two tags specify the username and password to access
the site. You would typically include those tags as hidden
fields in your form, but be aware that view source is not
considered an advanced function by most web users so this method
does not provide a great deal of security. (The Blocks protocol
includes full security at the transport and user layers, so you
will be able to use much more sophisticated security mechanisms
in the future when we release the protocol specifications.)
The syntax of the username and password parameters consists of
the keyword "url" followed by a valid domain name, followed
by the keyword "username" or "password." The domain is
the domain name you are specifying in your calls to space.cgi
or the domain name which you are specifying for your evaluate
and publish scripts. Note that these might be different domain
names and you can have several different url tags in your
form or other calls.
url.* Example |
url.edgar.space.invisible.net.username = "Invisible"
url.edgar.space.invisible.net.password = "Worlds"
url.myscripts.myhome.com.username = "myname"
url.myscripts.myhome.com.password = "mypassword"
|
|
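Since these tags usually travel as hidden form fields, a small generator can keep the parameter names consistent. The Python sketch below is illustrative only. The helper name is hypothetical, and the same caveat applies: anyone who views the page source can read the values.

```python
from html import escape

def hidden_credential_fields(domain, username, password):
    """Return hidden INPUT tags for the url.<domain>.username/password pair."""
    fields = {
        f"url.{domain}.username": username,
        f"url.{domain}.password": password,
    }
    return "\n".join(
        f'<INPUT type="hidden" name="{escape(name)}" value="{escape(value)}">'
        for name, value in fields.items()
    )

markup = hidden_credential_fields("edgar.space.invisible.net", "Invisible", "Worlds")
print(markup)
```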
retrieve.server
One final retrieve parameter, which is included for completeness,
is the retrieve.server parameter. This is included because
the space.cgi interface uses the Blocks protocol to communicate
with a SpaceServer engine. In the future, a single space.cgi
implementation might be used to retrieve data from several
different SpaceServer engines. The valid value for this
parameter is an IP address or domain name.
retrieve.server Example |
retrieve.server = "mySpaceServer.invisible.net"
|
|
You can specify a chain of evaluation scripts that are written
in Tcl and run in sequence. Those scripts can be specified in
one of two ways: the "scripts" method is based on scripts that
are stored in the SpaceServer and the "URL" method is
used to specify any script available on the World Wide Web.
Scripts are run in a Safe Tcl sandbox, with
safe filesystem and socket access allowed. More
information on safe interpreters can be found on the
Scriptics web site.
To specify evaluate scripts, use one of the following two styles of parameters:
Evaluation Parameters |
evaluate.scripts = "evaluate.doc.rfc.1"
evaluate.uris = "http://example.com/evaluate.tcl"
|
|
Some scripts will want to fetch additional information
to perform their evaluations. For example, the evaluate.doc.rfc.1
script looks for a parameter called luminance which names a
text file available via HTTP. This file contains
a series of block names (e.g., "doc.rfc.2629") along with a set
of name value pairs that specify the "luminance" (the importance)
of a particular RFC:
Sample Evaluation Parameter |
doc.rfc.2629 luminance 100
|
|
To tell this evaluate script where to find the file, you
would specify the evaluate.luminance parameter:
evaluate.luminance Parameter Example |
evaluate.luminance = "http://example.com/important/luminance.txt"
|
|
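If you maintain a luminance file yourself, a quick parser helps check it before publishing it. The Python sketch below assumes one block per line, with the block name followed by whitespace-separated name/value pairs. That layout is inferred from the single sample line, so treat it as a guess at the format.

```python
def parse_luminance(text):
    """Map each block name to a dict of the name/value pairs on its line."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        block, pairs = fields[0], fields[1:]
        table[block] = dict(zip(pairs[0::2], pairs[1::2]))
    return table

luminance = parse_luminance("doc.rfc.2629 luminance 100\n")
print(luminance)
```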
The evaluate script is run after the retrieve operation is done.
Prior to calling the script, an options array is initialized
which contains an entry for each argument that was passed into
space.cgi:
Entries for Arguments Passed into space.cgi |
set options(publish.script) publish.doc.rfc.1
set options(retrieve.maxhits) 25
|
|
Next, for each block that was retrieved, an entry is added
to the blocks array.
The content of each entry is a list. The first item of the list is the name of
the element. The second item of the list
holds any attributes, represented as a serialized array.
The remaining items of the list, three through N, are the
element's children. Each child is stored in the same list format. If the element
contains text, then it has exactly one child and the name of that child is
".pcdata".
Sample XML Block |
<outer name1="value1" name2="value2">
<middle>some text</middle>
<middle><inner /></middle>
</outer>
|
|
is represented as:
Evaluation Array |
{ outer { name1 value1 name2 value2 }
{ middle {} {.pcdata {} "some text"} }
{ middle {} {inner {} } } }
|
|
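For readers more at home in Python than Tcl, the same shape can be reproduced with nested Python lists. This sketch only mirrors the layout described above (name, attributes, then children). It ignores mixed content, and it is not the code that builds the real blocks array.

```python
import xml.etree.ElementTree as ET

SAMPLE_BLOCK = """
<outer name1="value1" name2="value2">
  <middle>some text</middle>
  <middle><inner /></middle>
</outer>
"""

def to_block_list(element):
    """Return [name, attributes, child, ...]; text becomes a ".pcdata" child."""
    node = [element.tag, dict(element.attrib)]
    text = (element.text or "").strip()
    if text:
        node.append([".pcdata", {}, text])
    node.extend(to_block_list(child) for child in element)
    return node

block = to_block_list(ET.fromstring(SAMPLE_BLOCK))
print(block)
```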
When the evaluate script completes, it should have created (or updated, if it
is a script in the middle of an evaluation chain) the relates array,
which contains an entry for each block. The syntax of each entry is
a "name value name value..." list, e.g.,
Evaluation Syntax |
set relates(doc.rfc.1234) \
{score 1 superiors {doc.rfc.5678 doc.rfc.3456} luminance 1}
|
|
The semantics of each entry depend on how the publication script will
use it.
The CGI script will execute a Tcl script to generate output
data that is returned using HTTP to the caller of space.cgi.
As with evaluate scripts, the publish script is specified using
one of two methods:
Publication Parameters |
publish.script = "publish.doc.rfc.1"
publish.uri = "http://example.com/publish.tcl"
|
|
Note: The words "script" and "uri" are singular
because only one publish script may be specified.
Of special utility to publication scripts are two special
parameters, retrieve.names and retrieve.allhits, both of
which are created after the retrieval and before the
first evaluation script is run.
The retrieve.names parameter is the list of block names, in order,
that are being
returned. Space.cgi preserves the name ordering it receives from the servers it talks to.
So, if you want the "default" order in which to publish things, traverse
retrieve.names, e.g.,
retrieve.names Parameter |
foreach name $options(retrieve.names) ...
|
|
The retrieve.allhits parameter is the total size of the result set.
What gets returned is up to retrieve.maxhits from the result set
starting at position retrieve.offset. Accordingly, if you are writing a
publication script and you
want to know whether there is more to come, you can do this:
Sample Publication Script |
if {$options(retrieve.offset) + $options(retrieve.maxhits) < $options(retrieve.allhits)} {
    # create a "more" button, using identical parameters EXCEPT
    # options(retrieve.offset) is incremented by options(retrieve.maxhits)
}
|
|
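The same test can be expressed from the client side, for example when generating the link behind a "more" button. The Python sketch below is illustrative. It assumes the options are available as a dictionary of strings, that a missing offset means 0, and that a missing maxhits means the default of 25. The function name is hypothetical.

```python
def next_page_params(options):
    """Return parameters for a "more" link, or None when nothing is left."""
    offset = int(options.get("retrieve.offset", 0))
    maxhits = int(options.get("retrieve.maxhits", 25))
    allhits = int(options["retrieve.allhits"])
    if offset + maxhits >= allhits:
        return None
    # Drop the values space.cgi computes itself, then advance the offset.
    computed = ("retrieve.allhits", "retrieve.names")
    following = {k: v for k, v in options.items() if k not in computed}
    following["retrieve.offset"] = str(offset + maxhits)
    return following

more = next_page_params({
    "publish.script": "publish.debug.1",
    "retrieve.maxhits": "25",
    "retrieve.offset": "25",
    "retrieve.allhits": "60",
})
print(more)
```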
The best way to understand scripts is to go to the area
of this Rocket Science tutorial that describes the scripts
that were developed by Invisible Worlds. You can write your
own, or you can simply specify one of the canned scripts
already developed.