INTRODUCTION
In 1965 all data I/O strategies depended on
prior knowledge of user applications.
Many government agencies were collecting large volumes of data,
but had no clue how applications might use the accumulated data.
The Advanced Research Projects Agency, ARPA, reasoned that
information contained in data relationships
was intrinsic to the data itself and should
not be diminished by a computer
representation.[ARPA]
ARPA: Data Relationships Are Independent of Data Representations.
In 1965 ARPA sponsored research to
investigate the feasibility of a
machine-independent data structure.
Since mathematical objects
maintain there relationships independent of their representations,
and since set theory already provided mathematical representations
of relationships, the research effort focused on how sets
could represent data relationships and what set operations
would be needed to support the requirements of data
structures.[Why]
Record Accessing Structures
rely on physical organizations of data representations.
Set Processing Operations
rely only on intrinsic properties of data relationships.
The basic distinction between record-accessing
and set-processing
systems is how data is delivered from storage.
Record-accessing systems use application shared index structures.
Set-processing systems use application independent set operations.
Index structures degrade multiple application performance
by sharing stored data.
Set operations support simultaneous execution of applications,
each having its own data.
Index structures have to be constructed for all new data.
Set operations are installed only once.
Index structures are machine-dependent.
Set operations are machine-independent.
By 1968 an ARPA developed set-theoretic data structure, STDS,
for representing, managing, and accessing all data as mathematically well-defined
sets, had been implemented on an IBM mainframe at the University of Michigan.
STDS supported interactive ad hoc queries
and served as the data access engine for
a commercially available RDBMS from 1971
through 1998.[MICRO]
STDS: Set-Theoretic Data Structure
STDS was developed under ARPA as a data access alternative
to using index structures.
Though index structures are recognized as a superior mechanism for
accessing individual records with known locations,
ARPA's interests were in accessing collections of records
satisfying common properties.
Since information is machine independent before being forced
into a machine representation,
research was initiated to separate those properties of
data structures which were machine independent, the
application environment,
from those properties which were machine dependent,
the
storage environment, then to develop a data structure supporting isolation
and separate control of these
properties.[CONCOMP-D]
The ARPA specification was quite emphatic about separating
application data representations from storage data representations,
but rather vague about how data was to be exchanged between the two.
Sets & Mappings
If both the application environment
and the storage environment are set-theoretically well-defined,
then data embedding and data extraction functions can
be defined to provide machine independent
data exchange between applications and storage.
Since feasibility of a machine-independent data structure
had to be demonstrated, simple array structures
were chosen for proof of concept.
Arrays can easily be represented as
sets of n-tuples. Which is a nice property for
a storage environment.
Unfortunately, an array with just eight domains
has over 40,000 different representations for the
same data relationship.
The research objective now focused on finding
a unique set representation for data relationships
and a mapping mechanism for
coupling application data with appropriate storage data.
"What we lack in rigor, we make up for with notation." -David Maier
Need for Notation
Since a unique n-array relationship may have
n! unique storage representations,
a notation is needed that provides a single application
representation and supports a 1-1 mapping
with any possible storage representation.
A slight perturbation of classical set notation
provides such a solution.
Records as Sets
It is common practice to model records, [a,b,c], using
tuples, ‹a,b,c›.
Records impose an ordering structure.
The record [a,b,c] is not the same as [b,c,a],
even thought the relationship between a, b, and c may be
the same.
Classical set theory does not support the concept of structure
as part of set membership.
The use of the tuple notation ‹a,b,c› is mostly a conceptual
convince to denote an ordered sequence of items.
Since the tuple notation does not represent a specific set,
tuples can not be used as operands for set
operations.[Skolem]
To support tuple operations, an alternate notation was chosen by
assigning superscripts to elements of sets.
The new notation equates tuples with specific structured sets:
‹a,b,c› ≡
{a1, b2, c3}.
Classical sets are structured sets with null structure:
{a,b,c} ≡
{a∅, b∅,
c∅}.
Labeled Arrays
Being able to define records as sets was essential
for modeling stored data where item ordering is relevant.
But it did not provide any structure independence
for modeling application data.
Application data still needs knowledge of how data is structured in storage.
An 8-column application array still has over 40,000 possible representations.
A simple solution was to expand the use of superscripts to include
names along with numbers.
Assume an 8-column array with arbitrary (but unique) names
A, B, C, D, E, F, G, H.
The typical element of a j-row array would be a set of the following form:
{aA,
bB,
cC,
dD,
eE,
fF,
gG,
hH}
.
Depending on how the system decided to physically represent the array in storage,
any one of 40,320 mapping tuples could be used to relate
application-array-set relationships
to storage-array-set relationships.
Mapping Tuples
Mapping tuples are essentially array headers
that define how application column names
map to storage column numbers.
For example:
‹ C,D,H,F,G,A,B,E ›
and
‹ B,A,H,C,G,D,F,E ›
are two system defined mappings of the same application array relationships to
two different storage representations.
Another example uses mapping tuples to select a subset of columns
‹ A,B,C,H › which
returns a sub-array to an application with just the first three and the last column
of an 8-column array.
Operations on Structured Sets
Since the initial objective was to demonstrate the feasibility
of machine independent data operations using set operations,
a necessary and sufficient condition was to implement
all relevant Classical set operations.
The relevant operations were four Boolean and seven
Relational.
Boolean:
Union,
Intersection,
Relative Compliment, and
Symmetric Difference.
Relational:
Image,
Range,
Domain,
Converse,
Restriction,
Relative Product,
and
Cartesian Product.
A total of 25 operations using structured sets as operands
were implemented by 1968.[AFIP]
These operations provided the data access engine for the University of Michigan's
MICRO-STDS RDBMS.
The final report of the CONCOMP Project in 1970 concluded that STDS
established the feasibility and realization of a machine-independent data structure.
MICRO-STDS RDBMS
Interest in the data analysis capabilities of STDS operations
resulted in development of an English language interface,
MICRO, using STDS for data storage organization and access.
Guided by ARPA's observation that data relationships were independent of
their representation, that English was a familiar mechanism
for expressing relationships,
and that set membership was easily expressible in English
it seemed quite natural to couple STDS with an English language
user interface. Which MICRO developers did in 1970.
MICRO Architecture
Since the discovery and manipulation of data relationships from raw
was the project's directive,
index structures were not needed for locating individual records.
Though not appreciated at the time, in retrospect the
requirement for ad hoc querying of bulk data and the
lack of index structures
negated any need for
schemas,
normalization,
destructive updates,
file locking,
or an SQL language.
Even though STDS had proven remarkably fast under research
experiments, it was assumed that the lack of index structures
would hinder overall performance.
Since mining for unknown relationships was the primary objective,
any performance at all was considered a success.
To the surprise of everyone involved, performance
defied expectations.
Access Acceleration
Since the MICRO-STDS architecture completely separated the
English query statements from the set-theoretic organization
of data, it was not obvious why extreme I/O activity
should be fast. The IBM time sharing environment complicated
any direct understanding since many applications were running
simultaneously.
The explanation turned out to be quite simple,
results of set operations could not be distinguished from
previously existing sets. As these sets were generated in
the course of a query execution they became used by the
continuing execution, even though they had not existed at the
start of the execution.
These intermediate sets were generally quite small and
contained highly relevant data.
The overall effect was that the database of relevant data
shrank rapidly as applications executed.
Once this was discovered, set operation sequences were
tuned for informationally dense I/O.
If MICRO were available today, it might be interesting to see
how a set-processing I/O system would compare to
the best available record-accessing I/O systems.
STDS & TPC-H
Though the MICRO-STDS implementation terminated in 1998 when the
mainframe was abandon, STDS research and commercial usage continued.
By 2001 STDS had been used in products for GM, XEROX, FAA, Delco,
Arthur Anderson, and had been incorporated in a backend database
machine developed by the STIS Corp.
STIS was acquired by an affiliate of Hitachi in 1984.
From 1985 trough today STDS has been incorporated in an
interactive data access and manipulation software system, iXSP.
iXSP (interactive STDS)
iXSP is the current incarnation of STDS.
It provides an interactive interface to STDS
that allows ad hoc queries using set operations
accessing available disk files.
iXSP currently recognizes over 90 set operations and is extensible.
iXSP also recognizes XML documents as
sets.[SAG]
Since MICRO-STDS used an English user interface mapping to set operations
to access storage
and since current RDBMS use SQL translations of English
queries to SELECT Statements, a performance comparison
of Set Processing and Record Accessing systems
can be made by
using the SQL Select statement as input to iXSP.
Since the TPC-H
Benchmark is the industry standard for
record-accessing I/O systems, it is reasonable to assume that
TPC-H results reflect the best performance of SQL Selects being
supported by Index Structures.
Since iXSP uses no Index Structures,
one way to compare user productivity
is to run the most challenging TPC-H SQL query on iXSP and
compare with current RDBMSs on the same platform.
This is exactly how Information Builders Inc.
evaluated set-processing I/O with record-accessing
I/O.
Results demonstrated that set-processing I/O provided
well over an order of magnitude improvement in
performance.[IBI]
CONCLUSION
The fundamental result of ARPA's research on defining and implementing a
machine-independent data structure was the discovery that
records could be recognized as sets.
A notation was developed to manipulate records as sets.
An implementation of this notation became STDS.
The ability to manipulate records as sets is what
distinguishes STDS implementations from traditional
DBMS implementations.
STDS is a body of well-defined set operations that are non-proprietary
and machine independent.
The only restriction on such a body of set operations,
being referenced as a STDS,
is that the operations must be
explicitly defined in terms of extended set notation,
publicly available and verifiable under execution.
The key to STDS performance is being able to manipulate records as
sets.[Ellis]
Since any logical or physical representation of data has a
set identity, in terms of extended set notation,
STDS implementations are ideal for supporting
future applications involving:
Data Access,
Data Mining,
Data Sharing,
Data Analysis,
Data Security,
Data Validation,
Data Integration,
Data Reliability,
Big Data, and
Cloud Computing.
.
Data That Can't Be Accessed, Can't Be Processed.
If Data Can Be Accessed, It Has A Set Identity.
If Data Has A Set Identity, It Can Be Processed By Set Operations.
If Data Can Be Processed By Set Operations, Processes Are Limited Only By Imagination.
ARPA Research
STDS Development ARPA research on machine-independent data structure. 1965-1970
Description of STDS: AFIP fall joint computer conference San Francisco CA, December 1968
Extended Sets Data representations for near-optimal I/O performance.
Commercial RDBMS
MICRO-STDS RDBMS from 1971-1998: English Queries - Set Storage
Set I/O vs. Record I/O Performance comparison of STDS with Oracle & IBM - 2005
RDM & SQL A missed opportunity for I/O access.
Sets Not Records?
Why Not Sets? All computer behavior is set processing.
Extended Set Theory A General Model For Very Large, Distributed, Backend Information Systems.
RDM+XML XSP: An Integration Technology for Systems Development and Evolution - 2001
SQL Rejuvenated An Open Source Opportunity
Formal Foundations
Axioms and Models for an Extended Set Theory Tuples uniquely defined as sets. - 2011
Functions as Set Behavior Formal modeling of systems behavior. - 2016
Data as Sets
Twelve RDM tables
R1 - R12 expressed by a single Labeled
set Ri, RDM.
A very simple XML-structure expressed as a labeled
set, XML.
Three extended relations expressed as labeled
sets, Xrel1,
Two complex extended relations expressed as labeled
sets, Xrel2.
More detailed information to assist implementations and I/O optimization strategies using
set operations to Manipulate Data by its Mathematical Identity
can be found
at Extended Set Processing.
Since STDS operations and the foundations for all extended set processing operations
are already available on the web,
it would seem that open source developers might
get a jump on the rest of the industry
by providing systems designed to improve user productivity.
References
-
[IFIP]
Feasibility of a Set-theoretic Data Structure
A general structure based on a reconstituted definition of relation
IFIP Congress, Edinburgh Scotland, August 1968
-
[AFIP]
Description of a Set-theoretic Data Structure:
AFIPS fall joint computer conference San Francisco CA, December 1968
- [ARPA]
Westervelt, F. H.:
CONCOMP: Research in Conversational use of Computers: final report
Office of Research Administration, Ann Arbor, December 1970
-
[CONCOMP]
CONCOMP Project Appendix D: Description of a Set-Theoretic Data Structure
-
[MICRO]
MICRO-STDS RDBMS 1971-1998
-
[VLDB77]
Extended Set Theory:
A General Model For Very Large, Distributed, Backend Information Systems - VLDB77
-
[LLL]
Lawrence Livermore Lab.:
Set Theoretic Data Structures (STDS): a tutorial
Technical Report, 1977.
-
[VLDB84]
1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize
The Design Of Database Management Systems
Proceedings of the Tenth International Conference on Very Large Data Bases.
Singapore, August, 1984
-
[NATO]
A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1986
-
[SAG]
Champion, M.:
XSP: An Integration Technology for Systems Development and Evolution
- Software AG - 2001
-
[IBIl
Stout, R.:
Information Access Accelerator
Information Builders Inc. - 2005
-
[WHY]
Why Not Sets? - 2010
- [Nor10]
North, K.:
Three articles
presenting a short historical perspective on the
role of set theory,
mathematically sound data models,
and the importance of data independence. - 2010
PART I:
Sets, Data Models and Data Independence
PART II:
Laying the Foundation: Revolution, Math for Databases and Big Data
PART III:
Information Density, Mathematical Identity, Set Stores and Big Data
-
[XST]
Blass, A., Childs, D L:
Axioms and Models for an Extended Set Theory - 2011
- [Skolem]
Skolem, T.:
Two Remarks on Set Theory (2. The ordered n-tuples as sets)
MATH. SCAND, 5 (1957) 40-46
-
[Ellis]
Ellis, T.:
Extended Set Theory: A Summary - 2015
Supplement
-
Pebble Piles & Index Structures: A Parable
-
XSP: Extended Set Processing:
Mathematically Managing Data Representations (1 page summary)
-
Set Processing vs. Record Processing Performance:
Dynamic Data Restructuring vs. Prestructured Data Storage
-
Functions As Set Behavior Essential Concepts: Conceptual & Formal
-
SQL Rejuvenated Data, Relationships, RDM, Tables, NoSQL & Set Theory
-
Content-Container Data Access Strategies Content for Functionality - Containers for Performance
-
Set-Accessing I/O for SQL Relational Capabilities without SQL Overhead
-
iXSP Information Access Accelerator Information Builders Inc. STDS performance evaluation results
-
Data Representations as Mathematical Objects Considering Content Compatibility of Relational & XML Data Representations
-
Adaptive Data Restructuring Functions A High Performance Alternative to Indexed Data Access Structures
-
STDS: Glossary Obscura Exposing Key Concepts & Definitions
-
Processes, Functions & Composition Essential Concepts
Copyright © 2017
INTEGRATED INFORMATION SYSTEMS
« Last modified on 10/13/2017 »
|