
The Importance of Being EDGAR
The goal of our company, Invisible
Worlds, is to give you easy, powerful
access to information. Today, we're showing what our technology can do with
some challenging information sources, beginning with the
EDGARspace™
portal. Think of the Internet as a largely unexplored planet; we're
building surveying tools to gather and manage the data required to make
maps. We see a future Internet that is a lot smarter about organizing
itself. Although there is a lot of information available on the big search
portals, they index only about 16 percent of the public data on the Web,
according to researchers at the NEC Research Institute. They completely miss the
deep wells of specialized information such as the EDGAR database, because
those documents are inaccessible to today's indexing spiders.
Invisible Worlds is developing a new
protocol and class of distributed
Internet servers, aimed at making information about information, or
meta-information, easy to use and share. At its simplest, meta-information
is what you see in a library card catalog or the properties dialog in
Microsoft Office® documents: titles, authors, versions, subjects and so
forth. Another basic form of meta-information is structure: identifying the
sections of a document or collection. In EDGAR, this includes the various
kinds of filings as well as specific items such as balance sheets and
beneficial ownership data. Looking further into the future,
meta-information can also include data such as reviewers' opinions,
statistics, comments, ratings, and all sorts of highly specific information
about information. Meta-information may be as commonplace as suggested
recipes for a food item or as esoteric as DNA sequence correlations.
Invisible Worlds' SpaceServer™
engine supports a new meta-information
communications protocol developed by the same people who helped invent the
Internet's protocols for mail, domain names, and other fundamental services.
What can you do with meta-information? Quite a bit. We know that we cannot
even imagine all of the kinds of information about information people will
access via our own SpaceServer and other vendors' compatible engines.
We're applying
this general-purpose tool to specific kinds of information services,
starting with the rich financial and company information in the EDGAR
database. Our theory is that if we can make this software work on 5
terabytes of data that we skulk (our term for getting documents and
enhancing them with meta-information) from the Internet, it will probably
handle your corporate Intranet. EDGAR is thus our testbed.
Overview of the Blocks Architecture
At the core of this system is a new protocol, the Blocks Protocol. The SpaceServer speaks this protocol to communicate with other SpaceServers and to communicate with two other kinds of software components we have developed, Builders and Mixers.
- Mixers. Mixers are tools that find meta-information from a variety of
sources on the Internet. They use the Blocks protocol to send the data into
SpaceServer engines. In a broad sense, Mixers are text- or data-mining
tools. Some are highly generic, such as tools that find features like
names; others are artfully handcrafted to discover features such as the
implicit structure in a specific type of EDGAR filing.
- The SpaceServer engine. The SpaceServer engine is a general-purpose server
that receives, stores and shares meta-information. Our primary design
considerations have been speed and global scalability. It includes a full-text
search engine and incorporates other data storage systems such as databases
and text search engines. The Blocks protocol will allow a very large number of
SpaceServer engines to share meta-information on the Internet.
- Builders. Builders use the Blocks protocol to retrieve meta-information from
SpaceServer engines and to prepare it for display by a Web browser or other
tool you might be using to search, view and analyze information.

Figure: Overview of the Blocks Architecture
When users send queries to the
SpaceServer engine, Builders follow three steps to bring the results
back to the desktop:
- A set of retrieve operations specifies which data are
to be taken out of the data store. A retrieve operation can pull data out using any XML tag or attribute as a search criterion.
- The results from the retrieve operations are fed into the evaluate step, where a Tcl script (or a script in another language) looks for relationships among the data.
For example, an evaluate script might look for references from
one EDGAR filing to another, or use a heuristic formula to determine
the relative importance of one document versus another.
- Last is the publish step, where the data are formatted for
the user interface, which might be an HTML browser, a JavaScript or
Java array that feeds an application, or a spreadsheet.
While JavaScript and HTML are examples of publish output
for the existing browser world, the publish step can just as easily turn
data into an SQL update query, send a message to your pager,
or send a triggered
request to E-Trade to sell stock if an event happens, such
as the CTO of one of your portfolio companies filing an insider
trading report to sell half of his or her stock position.
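The three Builder steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Blocks protocol or the SpaceServer API: the in-memory `STORE`, the `filing` and `ref` element names, their attributes, and the function names `retrieve`, `evaluate`, and `publish` are all hypothetical, and a Python function stands in for the Tcl evaluate script.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stored blocks; element and attribute names are invented here.
STORE = [
    ET.fromstring('<filing id="f1" type="10-K" company="Acme"><ref target="f2"/></filing>'),
    ET.fromstring('<filing id="f2" type="8-K" company="Acme"/>'),
    ET.fromstring('<filing id="f3" type="10-K" company="Globex"><ref target="f2"/></filing>'),
]


def retrieve(store, tag, **attrs):
    """Retrieve step: select elements by tag name and optional attribute values."""
    return [e for e in store
            if e.tag == tag and all(e.get(k) == v for k, v in attrs.items())]


def evaluate(filings):
    """Evaluate step: rank filings by how many retrieved filings reference them."""
    counts = {f.get("id"): 0 for f in filings}
    for f in filings:
        for ref in f.iter("ref"):
            target = ref.get("target")
            if target in counts:
                counts[target] += 1
    # Most-referenced first; ties are broken by id to keep the order stable.
    return sorted(filings, key=lambda f: (-counts[f.get("id")], f.get("id")))


def publish(filings):
    """Publish step: format the ranked filings as an HTML list."""
    items = "".join(
        f"<li>{escape(f.get('company'))} {escape(f.get('type'))}</li>"
        for f in filings
    )
    return f"<ul>{items}</ul>"


html_page = publish(evaluate(retrieve(STORE, "filing")))
```

Swapping `publish` for a function that emits an SQL statement or sends a notification would leave the retrieve and evaluate steps untouched, which is the separation the three-step design suggests.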
Just as Builders use a 3-step
process, so do Mixers. Mixers skulk a variety of sources
on the Internet, from real-time feeds to web sites to unstructured
deep wells like the EDGAR database. As part of the skulking process,
nuggets of meta-data are extracted and
transformed into valid XML, then stored in one
or more SpaceServer engines.
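As a rough sketch of what a Mixer might do, the Python below extracts labeled nuggets from a scrap of unstructured text, transforms them into well-formed XML, and stores the result in more than one destination. Everything specific here is an assumption made for illustration: the sample text, the field labels, the `filing` element, the function names, and the plain lists standing in for SpaceServer engines. A real Mixer would send the data over the Blocks protocol, which is not shown.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical raw text; the field labels are assumptions for this example.
RAW = """COMPANY NAME: Acme & Sons
FORM TYPE: 10-K
FILED: 1999-03-31
"""

# One "LABEL: value" pair per line.
FIELD = re.compile(r"^([A-Z ]+):\s*(.+)$", re.MULTILINE)


def extract(raw):
    """Pull (label, value) nuggets out of unstructured text."""
    return [(label.strip(), value.strip()) for label, value in FIELD.findall(raw)]


def transform(nuggets):
    """Turn nuggets into a well-formed XML string, escaping special characters."""
    root = ET.Element("filing")
    for label, value in nuggets:
        child = ET.SubElement(root, label.lower().replace(" ", "-"))
        child.text = value
    return ET.tostring(root, encoding="unicode")


def store(xml_text, stores):
    """Append the XML to every destination; lists stand in for engines."""
    for destination in stores:
        destination.append(xml_text)


primary, mirror = [], []
store(transform(extract(RAW)), [primary, mirror])
```

Building the XML through `ElementTree` rather than string concatenation means characters such as `&` in a company name are escaped automatically, so the stored document stays well-formed.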
We believe this architecture of Mixers, SpaceServer engines, and Builders is suited
to a wide variety of applications on the Internet. But we'd rather show real
results than just tell a good story.
EDGAR is an ideal database in many respects. It is very big and
increasing by some 30 gigabytes per year. We think our software screams, but there
is no better proof than trying it out on several hundred gigabytes of data and
thousands of users. We also like EDGAR because it is a good example of a
deep well of useful information. Our team struggled
for years to get the government to put this database on the Internet. Now that
it is available, it is time to see what kind of value can be added to it.
Over the coming months,
we'll continue to reveal components of the Blocks protocol and how our software
works. We'll also apply our software to other deep wells of information that need
better access. We foresee an Internet that is much smarter about
organizing itself. As powerful as the Internet is today, tomorrow it will be far more
valuable as invisible worlds of information become visible.
Copyright © 1999, 2000 Invisible Worlds. All Rights Reserved.