Mappa.Mundi Magazine - The Importance of Being EDGAR


Carl Malamud is Invisible Worlds' co-founder and board member. He previously founded the Internet Multicasting Service, the nonprofit group that helped pioneer some of the most important early content on the World Wide Web. Internet Multicasting is known for creating the first Internet radio station, for putting the SEC's EDGAR database on-line, and for creating the Internet 1996 World Exposition.




By Carl Malamud, Invisible Worlds



The Importance of Being EDGAR

The goal of our company, Invisible Worlds, is to give you easy, powerful access to information. Today, we're showing what our technology can do with some challenging information sources, beginning with the EDGARspace™ portal. Think of the Internet as a largely unexplored planet; we're building surveying tools to gather and manage the data required to make maps. We see a future Internet that is a lot smarter about organizing itself. Although there is a lot of information available on the big search portals, they index only about 16 percent of the public data on the Web, according to researchers at the NEC Research Institute. They completely miss the “deep wells” of specialized information such as the EDGAR database, because those documents are inaccessible to today's indexing “spiders.”

Invisible Worlds is developing a new protocol and class of distributed Internet servers, aimed at making information about information, or meta-information, easy to use and share. At its simplest, meta-information is what you see in a library card catalog or the properties dialog in Microsoft Office® documents: titles, authors, versions, subjects and so forth. Another basic form of meta-information is structure, which identifies the sections of a document or collection. In EDGAR, this includes the various kinds of filings as well as specific items such as balance sheets and beneficial ownership data. Looking further into the future, meta-information can also include data such as reviewers' opinions, statistics, comments, ratings, and all sorts of highly specific information about information. Meta-information may be as commonplace as suggested recipes for a food item or as esoteric as DNA sequence correlations.
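To make the idea concrete, here is a minimal sketch in Python of what a block of meta-information about one EDGAR filing might look like when written as XML. The element and attribute names (`filing`, `company`, `section`, `form`, `filed`) and the sample values are assumptions made up for illustration; they are not the vocabulary our software actually uses.

```python
import xml.etree.ElementTree as ET


def describe_filing(company, form_type, filed, sections):
    """Build a small XML block of meta-information about one filing.

    All tag and attribute names here are hypothetical.
    """
    block = ET.Element("filing", {"form": form_type, "filed": filed})
    ET.SubElement(block, "company").text = company
    # Structural meta-information: which sections the filing contains.
    for name in sections:
        ET.SubElement(block, "section", {"name": name})
    return block


block = describe_filing(
    "Example Corp",
    "10-K",
    "1999-03-31",
    ["balance-sheet", "beneficial-ownership"],
)
print(ET.tostring(block, encoding="unicode"))
```

The descriptive items (company, form type, date) and the structural items (sections) sit side by side in one record, so either kind can later serve as a search criterion.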

Invisible Worlds' SpaceServer™ engine supports a new meta-information communications protocol developed by the same people who helped invent the Internet's protocols for mail, domain names, and other fundamental services. What can you do with meta-information? Quite a bit. We know that we cannot even imagine all of the kinds of information about information people will access via our own SpaceServer and other vendors' compatible engines. We're beginning by applying this general-purpose tool to specific kinds of information services, starting with the rich financial and company information in the EDGAR database. Our theory is that if we can make this software work on 5 terabytes of data that we “skulk” (our term for getting documents and enhancing them with meta-information) from the Internet, it will probably handle your corporate intranet. EDGAR is thus our testbed.

Overview of the Blocks Architecture

At the core of this system is a new protocol, the Blocks protocol. The SpaceServer engine speaks this protocol both to other SpaceServer engines and to two other kinds of software components we have developed: Mixers and Builders.

Mixers. Mixers are tools that find meta-information from a variety of sources on the Internet. They use the Blocks protocol to send the data into SpaceServer engines. In a broad sense, Mixers are text- or data-mining tools. Some are highly generic, such as tools that find features like names; others are artfully handcrafted to discover features such as the implicit structure in a specific type of EDGAR filing.

The SpaceServer engine. The SpaceServer engine is a general-purpose server that receives, stores and shares meta-information. Our primary design considerations have been speed and global scalability. It includes a full-text search engine and incorporates other data storage systems such as databases and text search engines. The Blocks protocol will allow a very large number of SpaceServer engines to share meta-information on the Internet.

Builders. Builders use the Blocks protocol to retrieve meta-information from SpaceServer engines and to prepare it for display by a Web browser or other tool you might be using to search, view and analyze information.
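The division of labor among the three components can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the class, the function names, and the dictionary fields are all hypothetical, and ordinary function calls stand in for the Blocks protocol, whose details are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class SpaceServer:
    """Toy stand-in for an engine that receives, stores and shares meta-information."""
    blocks: list = field(default_factory=list)

    def store(self, block: dict) -> None:
        self.blocks.append(block)

    def retrieve(self, **criteria) -> list:
        # Return every stored block whose fields match all the criteria.
        return [b for b in self.blocks
                if all(b.get(k) == v for k, v in criteria.items())]


def mixer(documents, server):
    """Find simple meta-information in each document and send it to the server."""
    for doc in documents:
        server.store({
            "title": doc["title"],
            "form": doc["form"],
            "words": len(doc["text"].split()),
        })


def builder(server, **criteria):
    """Pull matching meta-information back out and prepare it for display."""
    return [f'{b["form"]}: {b["title"]} ({b["words"]} words)'
            for b in server.retrieve(**criteria)]


server = SpaceServer()
mixer([
    {"title": "Example Corp annual report", "form": "10-K",
     "text": "Total assets rose this year"},
    {"title": "Example Corp quarterly report", "form": "10-Q",
     "text": "Revenue was flat"},
], server)
lines = builder(server, form="10-K")
print(lines)
```

The point of the sketch is the direction of flow: Mixers only write to the engine, Builders only read from it, and the engine in the middle knows nothing about where the data came from or how it will be displayed.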

Figure: Overview of the Blocks Architecture

When users send queries to the SpaceServer engine, Builders follow three steps to bring the results back to the desktop:

First, a set of retrieve operations specifies which data are to be taken out of the data store. The retrieve operation is able to pull data out using any XML tag or attribute as a search criterion.

Next, the results from the retrieve operation are fed into the evaluate step, where a Tcl script (or a script in another language) looks for relationships among the data. For example, an evaluate script might look for references from one EDGAR filing to another, or use a heuristic formula to determine the relative importance of one document versus another.

Last is the publish step, where the data are formatted for the user interface, which might be an HTML browser, a JavaScript or Java array that feeds an application, or a spreadsheet. HTML and JavaScript are examples of publish operations for the existing browser world, but the publish step can just as easily turn data into an SQL update query, send a message to your pager, or send a triggered request to E-Trade to sell stock when an event happens, such as the CTO of one of your portfolio companies filing an insider trading report to sell half of his or her stock position.
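The three Builder steps described above can be sketched as a small pipeline. This is a hedged illustration, not our implementation: it is written in Python rather than Tcl, the XML vocabulary (`filing`, `ref`, `id`, `target`) is invented for the example, and the "importance" heuristic is simply a count of references among the retrieved filings.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data store; every tag and attribute name is an assumption.
STORE = ET.fromstring("""
<blocks>
  <filing id="a1" form="10-K" company="Example Corp"><ref target="a2"/></filing>
  <filing id="a2" form="8-K" company="Example Corp"/>
  <filing id="a3" form="10-K" company="Sample Inc"><ref target="a2"/></filing>
</blocks>
""")


def retrieve(store, tag, **attrs):
    """Step 1: select elements by tag name and, optionally, attribute values."""
    return [el for el in store.iter(tag)
            if all(el.get(k) == v for k, v in attrs.items())]


def evaluate(filings):
    """Step 2: count how often each retrieved filing is referenced by the others."""
    scores = {f.get("id"): 0 for f in filings}
    for f in filings:
        for ref in f.iter("ref"):
            target = ref.get("target")
            if target in scores:
                scores[target] += 1
    return scores


def publish(filings, scores):
    """Step 3: format the results as an HTML list, most-referenced first."""
    ranked = sorted(filings, key=lambda f: scores[f.get("id")], reverse=True)
    items = []
    for f in ranked:
        label = escape(f"{f.get('company')} {f.get('form')}")
        items.append(f"<li>{label}: referenced {scores[f.get('id')]} time(s)</li>")
    return "<ul>" + "".join(items) + "</ul>"


filings = retrieve(STORE, "filing", company="Example Corp")
scores = evaluate(filings)
page = publish(filings, scores)
print(page)
```

Because each step only consumes the output of the one before it, swapping `publish` for a function that emits an SQL statement or sends a notification would leave the retrieve and evaluate steps untouched.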

Just as Builders use a three-step process, so do Mixers. Mixers skulk a variety of sources on the Internet, from real-time feeds to web sites to unstructured deep wells like the EDGAR database. As part of the skulking process, nuggets of meta-information are extracted, transformed into valid XML, and then stored in one or more SpaceServer engines.
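A rough sketch of the Mixer side, again in Python and under stated assumptions: the sample text is invented and only loosely modeled on a filing header, the regular expression is a deliberately naive way to find "nuggets," the XML tag names are derived mechanically from the field labels, and plain Python lists stand in for SpaceServer engines.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample input; not a real filing.
SAMPLE = """\
CONFORMED SUBMISSION TYPE: 10-K
COMPANY CONFORMED NAME: EXAMPLE CORP
FILED AS OF DATE: 19990331
Some unstructured narrative text follows here.
"""

# An upper-case label, a colon, then a value on the same line.
FIELD = re.compile(r"^([A-Z][A-Z ]+):[ \t]*(.+)$", re.MULTILINE)


def extract(text):
    """Extract label/value nuggets from semi-structured text."""
    return {m.group(1).strip().lower().replace(" ", "-"): m.group(2).strip()
            for m in FIELD.finditer(text)}


def to_xml(nuggets):
    """Transform the nuggets into an XML element (hypothetical vocabulary)."""
    block = ET.Element("filing")
    for name, value in nuggets.items():
        ET.SubElement(block, name).text = value
    return block


def store(block, servers):
    """Send the serialized block to one or more stand-in servers."""
    data = ET.tostring(block, encoding="unicode")
    ET.fromstring(data)  # parse it back as a basic well-formedness check
    for server in servers:
        server.append(data)


nuggets = extract(SAMPLE)
block = to_xml(nuggets)
server_a, server_b = [], []
store(block, [server_a, server_b])
print(server_a[0])
```

A handcrafted Mixer for a specific filing type would replace `extract` with something far more knowledgeable about that document's implicit structure; the transform and store steps could stay the same.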

Why EDGAR?

We believe this architecture of Mixers, SpaceServer engines, and Builders is suited to a wide variety of applications on the Internet. But we'd rather show real results than just tell a good story. EDGAR is an ideal database in many respects. It is very big and increasing by some 30 gigabytes per year. We think our software screams, but there is no better proof than trying it out on several hundred gigabytes of data and thousands of users. We also like EDGAR because it is a good example of a “deep well” of useful information. Our team struggled for years to get the government to put this database on the Internet. Now that it is available, it is time to see what kind of value can be added to it.

Over the coming months, we'll continue to reveal components of the Blocks protocol and how our software works. We'll also apply our software to other deep wells of information that need better access. We foresee an Internet that is much smarter about organizing itself. As powerful as the Internet is today, tomorrow it will be far more valuable as invisible worlds of information become visible.
