INTRODUCTION
In 1965 all data I/O strategies depended on
prior knowledge of user applications.
Many government agencies were collecting large volumes of data,
but had no clue how applications might use the accumulated data.
The Advanced Research Projects Agency, ARPA, reasoned that
information contained in data relationships
was intrinsic to the data itself and should
not be diminished by a computer
representation.[ARPA]
ARPA: Data Relationships Are Independent of Data Representations.
In 1965 ARPA sponsored research to
investigate the feasibility of a
machineindependent data structure.
Since mathematical objects
maintain there relationships independent of their representations,
and since set theory already provided mathematical representations
of relationships, the research effort focused on how sets
could represent data relationships and what set operations
would be needed to support the requirements of data
structures.[Why]
Record Accessing Structures
rely on physical organizations of data representations.
Set Processing Operations
rely only on intrinsic properties of data relationships.
The basic distinction between recordaccessing
and setprocessing
systems is how data is delivered from storage.
Recordaccessing systems use application shared index structures.
Setprocessing systems use application independent set operations.
Index structures degrade multiple application performance
by sharing stored data.
Set operations support simultaneous execution of applications,
each having its own data.
Index structures have to be constructed for all new data.
Set operations are installed only once.
Index structures are machinedependent.
Set operations are machineindependent.
By 1968 an ARPA developed settheoretic data structure, STDS,
for representing, managing, and accessing all data as mathematically welldefined
sets, had been implemented on an IBM mainframe at the University of Michigan.
STDS supported interactive ad hoc queries
and served as the data access engine for
a commercially available RDBMS from 1971
through 1998.[MICRO]
STDS: SetTheoretic Data Structure
STDS was developed under ARPA as a data access alternative
to using index structures.
Though index structures are recognized as a superior mechanism for
accessing individual records with known locations,
ARPA's interests were in accessing collections of records
satisfying common properties.
Since information is machine independent before being forced
into a machine representation,
research was initiated to separate those properties of
data structures which were machine independent, the
application environment,
from those properties which were machine dependent,
the
storage environment, then to develop a data structure supporting isolation
and separate control of these
properties.[CONCOMPD]
The ARPA specification was quite emphatic about separating
application data representations from storage data representations,
but rather vague about how data was to be exchanged between the two.
Sets & Mappings
If both the application environment
and the storage environment are settheoretically welldefined,
then data embedding and data extraction functions can
be defined to provide machine independent
data exchange between applications and storage.
Since feasibility of a machineindependent data structure
had to be demonstrated, simple array structures
were chosen for proof of concept.
Arrays can easily be represented as
sets of ntuples. Which is a nice property for
a storage environment.
Unfortunately, an array with just eight domains
has over 40,000 different representations for the
same data relationship.
The research objective now focused on finding
a unique set representation for data relationships
and a mapping mechanism for
coupling application data with appropriate storage data.
"What we lack in rigor, we make up for with notation." David Maier
Need for Notation
Since a unique narray relationship may have
n! unique storage representations,
a notation is needed that provides a single application
representation and supports a 11 mapping
with any possible storage representation.
A slight perturbation of classical set notation
provides such a solution.
Records as Sets
It is common practice to model records, [a,b,c], using
tuples, ‹a,b,c›.
Records impose an ordering structure.
The record [a,b,c] is not the same as [b,c,a],
even thought the relationship between a, b, and c may be
the same.
Classical set theory does not support the concept of structure
as part of set membership.
The use of the tuple notation ‹a,b,c› is mostly a conceptual
convince to denote an ordered sequence of items.
Since the tuple notation does not represent a specific set,
tuples can not be used as operands for set
operations.[Skolem]
To support tuple operations, an alternate notation was chosen by
assigning superscripts to elements of sets.
The new notation equates tuples with specific structured sets:
‹a,b,c› ≡
{a^{1}, b^{2}, c^{3}}.
Classical sets are structured sets with null structure:
{a,b,c} ≡
{a^{∅}, b^{∅},
c^{∅}}.
Labeled Arrays
Being able to define records as sets was essential
for modeling stored data where item ordering is relevant.
But it did not provide any structure independence
for modeling application data.
Application data still needs knowledge of how data is structured in storage.
An 8column application array still has over 40,000 possible representations.
A simple solution was to expand the use of superscripts to include
names along with numbers.
Assume an 8column array with arbitrary (but unique) names
A, B, C, D, E, F, G, H.
The typical element of a jrow array would be a set of the following form:
{a^{A},
b^{B},
c^{C},
d^{D},
e^{E},
f^{F},
g^{G},
h^{H}}
.
Depending on how the system decided to physically represent the array in storage,
any one of 40,320 mapping tuples could be used to relate
applicationarrayset relationships
to storagearrayset relationships.
Mapping Tuples
Mapping tuples are essentially array headers
that define how application column names
map to storage column numbers.
For example:
‹ C,D,H,F,G,A,B,E ›
and
‹ B,A,H,C,G,D,F,E ›
are two system defined mappings of the same application array relationships to
two different storage representations.
Another example uses mapping tuples to select a subset of columns
‹ A,B,C,H › which
returns a subarray to an application with just the first three and the last column
of an 8column array.
Operations on Structured Sets
Since the initial objective was to demonstrate the feasibility
of machine independent data operations using set operations,
a necessary and sufficient condition was to implement
all relevant Classical set operations.
The relevant operations were four Boolean and seven
Relational.
Boolean:
Union,
Intersection,
Relative Compliment, and
Symmetric Difference.
Relational:
Image,
Range,
Domain,
Converse,
Restriction,
Relative Product,
and
Cartesian Product.
A total of 25 operations using structured sets as operands
were implemented by 1968.[AFIP]
These operations provided the data access engine for the University of Michigan's
MICROSTDS RDBMS.
The final report of the CONCOMP Project in 1970 concluded that STDS
established the feasibility and realization of a machineindependent data structure.
MICROSTDS RDBMS
Interest in the data analysis capabilities of STDS operations
resulted in development of an English language interface,
MICRO, using STDS for data storage organization and access.
Guided by ARPA's observation that data relationships were independent of
their representation, that English was a familiar mechanism
for expressing relationships,
and that set membership was easily expressible in English
it seemed quite natural to couple STDS with an English language
user interface. Which MICRO developers did in 1970.
MICRO Architecture
Since the discovery and manipulation of data relationships from raw
was the project's directive,
index structures were not needed for locating individual records.
Though not appreciated at the time, in retrospect the
requirement for ad hoc querying of bulk data and the
lack of index structures
negated any need for
schemas,
normalization,
destructive updates,
file locking,
or an SQL language.
Even though STDS had proven remarkably fast under research
experiments, it was assumed that the lack of index structures
would hinder overall performance.
Since mining for unknown relationships was the primary objective,
any performance at all was considered a success.
To the surprise of everyone involved, performance
defied expectations.
Access Acceleration
Since the MICROSTDS architecture completely separated the
English query statements from the settheoretic organization
of data, it was not obvious why extreme I/O activity
should be fast. The IBM time sharing environment complicated
any direct understanding since many applications were running
simultaneously.
The explanation turned out to be quite simple,
results of set operations could not be distinguished from
previously existing sets. As these sets were generated in
the course of a query execution they became used by the
continuing execution, even though they had not existed at the
start of the execution.
These intermediate sets were generally quite small and
contained highly relevant data.
The overall effect was that the database of relevant data
shrank rapidly as applications executed.
Once this was discovered, set operation sequences were
tuned for informationally dense I/O.
If MICRO were available today, it might be interesting to see
how a setprocessing I/O system would compare to
the best available recordaccessing I/O systems.
STDS & TPCH
Though the MICROSTDS implementation terminated in 1998 when the
mainframe was abandon, STDS research and commercial usage continued.
By 2001 STDS had been used in products for GM, XEROX, FAA, Delco,
Arthur Anderson, and had been incorporated in a backend database
machine developed by the STIS Corp.
STIS was acquired by an affiliate of Hitachi in 1984.
From 1985 trough today STDS has been incorporated in an
interactive data access and manipulation software system, iXSP.
iXSP (interactive STDS)
iXSP is the current incarnation of STDS.
It provides an interactive interface to STDS
that allows ad hoc queries using set operations
accessing available disk files.
iXSP currently recognizes over 90 set operations and is extensible.
iXSP also recognizes XML documents as
sets.[SAG]
Since MICROSTDS used an English user interface mapping to set operations
to access storage
and since current RDBMS use SQL translations of English
queries to SELECT Statements, a performance comparison
of Set Processing and Record Accessing systems
can be made by
using the SQL Select statement as input to iXSP.
Since the TPCH
Benchmark is the industry standard for
recordaccessing I/O systems, it is reasonable to assume that
TPCH results reflect the best performance of SQL Selects being
supported by Index Structures.
Since iXSP uses no Index Structures,
one way to compare user productivity
is to run the most challenging TPCH SQL query on iXSP and
compare with current RDBMSs on the same platform.
This is exactly how Information Builders Inc.
evaluated setprocessing I/O with recordaccessing
I/O.
Results demonstrated that setprocessing I/O provided
well over an order of magnitude improvement in
performance.[IBI]
CONCLUSION
The fundamental result of ARPA's research on defining and implementing a
machineindependent data structure was the discovery that
records could be recognized as sets.
A notation was developed to manipulate records as sets.
An implementation of this notation became STDS.
The ability to manipulate records as sets is what
distinguishes STDS implementations from traditional
DBMS implementations.
STDS is a body of welldefined set operations that are nonproprietary
and machine independent.
The only restriction on such a body of set operations,
being referenced as a STDS,
is that the operations must be
explicitly defined in terms of extended set notation,
publicly available and verifiable under execution.
The key to STDS performance is being able to manipulate records as
sets.[Ellis]
Since any logical or physical representation of data has a
set identity, in terms of extended set notation,
STDS implementations are ideal for supporting
future applications involving:
Data Access,
Data Mining,
Data Sharing,
Data Analysis,
Data Security,
Data Validation,
Data Integration,
Data Reliability,
Big Data, and
Cloud Computing.
.
Data That Can't Be Accessed, Can't Be Processed.
If Data Can Be Accessed, It Has A Set Identity.
If Data Has A Set Identity, It Can Be Processed By Set Operations.
If Data Can Be Processed By Set Operations, Processes Are Limited Only By Imagination.
ARPA Research
STDS Development ARPA research on machineindependent data structure. 19651970
Description of STDS: AFIP fall joint computer conference San Francisco CA, December 1968
Extended Sets Data representations for nearoptimal I/O performance.
Commercial RDBMS
MICROSTDS RDBMS from 19711998: English Queries  Set Storage
Set I/O vs. Record I/O Performance comparison of STDS with Oracle & IBM  2005
RDM & SQL A missed opportunity for I/O access.
Sets Not Records?
Why Not Sets? All computer behavior is set processing.
Extended Set Theory A General Model For Very Large, Distributed, Backend Information Systems.
RDM+XML XSP: An Integration Technology for Systems Development and Evolution  2001
SQL Rejuvenated An Open Source Opportunity
Formal Foundations
Axioms and Models for an Extended Set Theory Tuples uniquely defined as sets.  2011
Functions as Set Behavior Formal modeling of systems behavior.  2016
Data as Sets
Twelve RDM tables
R1  R12 expressed by a single Labeled
set Ri, RDM.
A very simple XMLstructure expressed as a labeled
set, XML.
Three extended relations expressed as labeled
sets, Xrel1,
Two complex extended relations expressed as labeled
sets, Xrel2.
More detailed information to assist implementations and I/O optimization strategies using
set operations to Manipulate Data by its Mathematical Identity
can be found
at Extended Set Processing.
Since STDS operations and the foundations for all extended set processing operations
are already available on the web,
it would seem that open source developers might
get a jump on the rest of the industry
by providing systems designed to improve user productivity.
References

[IFIP]
Feasibility of a Settheoretic Data Structure
A general structure based on a reconstituted definition of relation
IFIP Congress, Edinburgh Scotland, August 1968

[AFIP]
Description of a Settheoretic Data Structure:
AFIPS fall joint computer conference San Francisco CA, December 1968
 [ARPA]
Westervelt, F. H.:
CONCOMP: Research in Conversational use of Computers: final report
Office of Research Administration, Ann Arbor, December 1970

[CONCOMP]
CONCOMP Project Appendix D: Description of a SetTheoretic Data Structure

[MICRO]
MICROSTDS RDBMS 19711998

[VLDB77]
Extended Set Theory:
A General Model For Very Large, Distributed, Backend Information Systems  VLDB77

[LLL]
Lawrence Livermore Lab.:
Set Theoretic Data Structures (STDS): a tutorial
Technical Report, 1977.

[VLDB84]
1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize
The Design Of Database Management Systems
Proceedings of the Tenth International Conference on Very Large Data Bases.
Singapore, August, 1984

[NATO]
A Mathematical Foundation for Systems Development  NATOASI Series, Vol. F24, 1986

[SAG]
Champion, M.:
XSP: An Integration Technology for Systems Development and Evolution
 Software AG  2001

[IBIl
Stout, R.:
Information Access Accelerator
Information Builders Inc.  2005

[WHY]
Why Not Sets?  2010
 [Nor10]
North, K.:
Three articles
presenting a short historical perspective on the
role of set theory,
mathematically sound data models,
and the importance of data independence.  2010
PART I:
Sets, Data Models and Data Independence
PART II:
Laying the Foundation: Revolution, Math for Databases and Big Data
PART III:
Information Density, Mathematical Identity, Set Stores and Big Data

[XST]
Blass, A., Childs, D L:
Axioms and Models for an Extended Set Theory  2011
 [Skolem]
Skolem, T.:
Two Remarks on Set Theory (2. The ordered ntuples as sets)
MATH. SCAND, 5 (1957) 4046

[Ellis]
Ellis, T.:
Extended Set Theory: A Summary  2015
Supplement

Pebble Piles & Index Structures: A Parable

XSP: Extended Set Processing:
Mathematically Managing Data Representations (1 page summary)

Set Processing vs. Record Processing Performance:
Dynamic Data Restructuring vs. Prestructured Data Storage

Functions As Set Behavior Essential Concepts: Conceptual & Formal

SQL Rejuvenated Data, Relationships, RDM, Tables, NoSQL & Set Theory

ContentContainer Data Access Strategies Content for Functionality  Containers for Performance

SetAccessing I/O for SQL Relational Capabilities without SQL Overhead

iXSP Information Access Accelerator Information Builders Inc. STDS performance evaluation results

Data Representations as Mathematical Objects Considering Content Compatibility of Relational & XML Data Representations

Adaptive Data Restructuring Functions A High Performance Alternative to Indexed Data Access Structures

STDS: Glossary Obscura Exposing Key Concepts & Definitions

Processes, Functions & Composition Essential Concepts
Copyright © 2017
INTEGRATED INFORMATION SYSTEMS
« Last modified on 10/13/2017 »
