INTRODUCTION
The critical distinction between
data access structures and
data access functions
is that
access structures rely on physical data objects,
while access functions rely on mathematical data objects.
The difference
for data management is significant.
Physical data objects are concrete objects representing
a meaningful relationship among previously defined objects.
Physical objects are dependent on time and space.
They are mutable.
Mathematical data objects are abstract objects representing
a formal relationship among previously defined objects.
Mathematical objects are independent of time and space.
They are immutable.
This brief comparison of physical objects and mathematical objects
seems to favor physical objects for data management.
If an object is immutable (can't be updated) what value might it have for data management systems?
Answer:
reliability,
consistency,
recovery,
validation,
scalability,
data diversity,
decentralization,
and time spanning analysis.
To the extent that a physical instantiation of
a mathematical object faithfully preserves its mathematical
identity, the behavior of the physical object will
correspond to the behavior of the mathematical
object under like conditions.
This provides motivation for exploring
the integration and access of data expressed as a mathematical object.
Data Integration & Data Access
Data integration and data access are well known
and throughly investigated problem areas.
It may not be immediately obvious why managing data
as a mathematical object would resolve any known concerns.
The argument for data integration is quite simple.
Under a mathematical umbrella all data is homogeneous.
Data content is easily distinguishable from data representation.
Operations can extract content from one representation and
reformat to a totally different representation.
As long as data content is compatible with an application, it can be shared.
The argument for improving data access by manipulating data
as a mathematical object is a little more involved.
First the process of accessing data needs to be examined.
Accessing data is a three stage process: LDR.
1) LOCATE a covering of relevant data.
2) DERIVE relevant data.
3) FORMAT relevant data for application delivery.>
Indexing data is a common practice for quickly locating data.
There are two distinct implementations: Index Structures and Index Functions.
They are very different.
Index structures only locate data,
then return what is located directly to an application.
Index structures have to be freshly constructed for any
data addition or modification.
Index functions (set operations) locate, derive, and format data.
Index functions (set operations) only have to be written once
and are applicable to all new data.
Data as a Mathematical Object
Modeling data as a mathematical object was explored
at the University of Michigan (19651968) in the development of
a SetTheoretic Data Structure, STDS,
and independently pursued at IBM (19691970) in the development
of the Rational Data Model, RDM.
Though both developments were
initiated and developed independent of each other,
they both relied on data being defined as a mathematical object.
The RDM focused on application data being managed using set operations,
while STDS focused on storage data being managed using set operations
Prior to 1970 there was no distinction made
between application data and
storage data.
All systems were WYSIWYP,
'What You Store Is What You Process'.
There was no distinction made between application data management
and storage data management.
In 1970
Codd introduced the
Relational Data ModelRelational
data model is the primary data model, which is used widely around the
world for data storage and processing. This model is simple and it has
all the properties and capabilities required to process data with
storage efficiency.,
RDM.
Codd's approach was to abstract away the physical data structures and
lowlevel algorithms of previous data management systems with
mathematically reliable operations that could be used
to manage application
data. Storage data
organization and access were still managed by physically
dependent routines.
Codd's contribution to data management was modeling application
data as a mathematical object.
RDBMS: RDM Implementations
The need to access large amounts of diverse distributed data
was not a concern when traditional data management was being developed.
The primary concern, forty years ago, was just accessing stored data.
Since all data was managed by routines
using physical properties of data
representations,^{[Cod70]}
users were burdened with having to know the physical organization of
data in
storage.^{[Cod70]}
The Relational Data Model, RDM, is an abstract model presenting a
strategy for modeling data as mathematical objects and manipulating
these objects with mathematically well defined operations.
A Relational Database Management SystemA
relational database management system (RDBMS) is a database management
system (DBMS) that is based on the relational model as invented by E. F.
Codd, of IBM's San Jose Research Laboratory. Many popular databases
currently in use are based on the relational database model.,
RDBMS, is an implementation combining the data management specifications
of the RDM with some data access and storage organization data
management strategy.
In his Turing acceptance paper in 1981,^{[Cod81]}
Codd provided some design objectives for
building future data management systems.
Codd's Data Management Objectives
(1) Data Independence Objective:
Separate logical/physical aspects of data management.
(2) Communicability Objective:
Make the model structurally simple.
(3) Setprocessing Objective:
Provide a foundation for setoriented processing.
(4) Foundation Objective:
Provide a sound theoretical foundation for data management.
In 1981 the foundations of set theory were inadequate to support
the requirements for faithfully modeling data representations,
making it impossible for any implementation to satisfy all four
data management conditions.
Thus actual RDBMs implementations had to compromise the mathematical
reliability of the relational data model and use physically dependent
strategies to assess data from storage.
RDM: mathematically reliable, declarative, setprocessing.
RDBMS: physically dependent, procedural, recordprocessing.
Despite the fundamental mismatch between RDM and RDBMS data management
strategies, RDBMS ability to pragmatically manipulate data
representations has allowed Relational systems to dominate the industry
for over thirty years. Unfortunately the mathematical muscle of the RDM
currently supported by RDBMS installations is not sufficient to support
requirements for Big Data applications as the physical data structures
are more and more distributed across processors and around the world.
The RDM was successful because it provided operations to somewhat
minimize the physical dependency between the user's view of data and the
system's representation of data, and this was sufficient for most data
centers applications. But many of Codd's lessons have been lost in the
transition to cloud computing. Today the physical details of data
storage and management in the cloud make Big Data applications more
difficult to develop, expensive to use, and difficult to design and
maintain than they would if Codd's principles were respected.
Recently, extensions to the foundations of set theory
have been completed and proven
consistent.^{[Bla11]}
Extended set theory, XST, provides a sound theoretical foundation
for data management by supporting a mathematical identity
for all data
representations, (RDM,
XML,
Xrel1,
Xrel2).
Most importantly, XST allows all of Codd's data management objectives to be achieved.
This means that the physical dependencies that distinguish Relational
(SQLOriginally
based upon relational algebra and tuple relational calculus, SQL
consists of a data definition language, data manipulation language, and a
data control language. The scope of SQL includes data insert, query,
update and delete, schema creation and modification, and data access
control.)
from nonRelational
(NoSQLA
NoSQL database provides a mechanism for storage and retrieval of data
that is modeled in means other than the tabular relations used in
relational databases.)
implementations can be isolated and removed.
Big Data Management
Scalability, diverse data, schema development overhead, accessing large
volumes of distributed data are a few of the concerns facing developers
of Big Data management strategies.
Some early pioneering developers are even suggesting the need for
a complete system overhaul.
The DBMS vendors (and the research community) should start with a clean
sheet of paper and design systems for tomorrow’s requirements, not
continue to push code lines and architectures designed for yesterday’s
needs.^{[Sto7]}
A system overhaul is not required. Existing systems can easily be upgraded
and new systems developed to provide optimal performance for all applications.
What is required is evolving physically dependent data management strategies
into mathematically reliable data management strategies.
Physically dependent data management strategies (those that manage data
using physical properties of data representations)
require using indexedrecord structures to access data.
Mathematically reliable data management strategies (those that manage data
using a mathematical identity of data)
use metadata index sets and set operations to access data.
Mathematically reliable data management and analysis software, iXSP, already exists
and is available to augment the performance of current systems
and to assist in the development of future systems.
iXSP is a suite of I/O operations for accessing and
processing disk resident data, providing a 10 to 90 times performance improvement over
IBM and
Oracle.^{[Sto5]}
A summary of iXSP properties are:
* Relationally complete.
*
Supports all four of Codd's data management objectives.
*
Requires no schema design.
*
Allows spontaneous SQL queries.
*
Has no indexedrecord structure overhead.
*
Recognizes all data representations.
*
Both vertically and horizontally scalable.
*
Uses constructive updates instead of destructive updates.
*
I/O performance improvement, up to three orders of magnitude.
*
Parallel I/O access to very large distributed data.
The foundations of iXSP software were initiated under the
CONCOMPThe
everpresent problems of providing rapid, flexible access to complexly
interrelated data resulted in several efforts in the area of data
structures under CONCOMP.,
project sponsored by an ARPA
grant from 1965 to 1970. The project was directed by F. H. Westervelt.
One of his objectives was to "find a mathematical foundation for data structures".
In 1968 two papers presented some success by reconstituting some formal definitions
of classical set theory to better reflect intuitive notions.
Further efforts lead to the development and implementation of a collection of set operations that
worked directly on data stored as arrays.
This software, STDS, was used as the data access engine for the Michigan University
MICRO RDBMSMICRO managed data with a mathematically reliable data storage and
access system (STDS) which used no indexedrecord structures,
required no schema design, provided nondestructive updates,
and supported both structured and semistructured data.,
from 1972 to 1998.
STDS: SetTheoretic Data Structure
In 1968 two papers were published suggesting the potential
advantages of
recognizing data as a mathematical object,
managing data
with set operations,
and organizing data on disks using
a settheoretic data structure,
STDS.^{[A]}^{[B]}
Conceptually, the settheoretic concept of
ntupleIn mathematics, an ntuple is a sequence (or ordered list) of n elements, where n is a nonnegative integer.
seemed like an ideal
model for a data record. It was, and still is, an illusion.
As articulated by
Skolem,^{[Sk57]}
there is currently no suitable way to (settheoretically) define an ntuple.
Intuitively, however, the settheoretic operations defined for ntuples were ideally suited
to manipulate data records modeled as ntuples.
By ignoring the classical set theory, CST, difficulties, an ntuple was considered to be
a sequence of ordered values. A reconstituted definition was defined to be
any set of ntuples, such that "n" was the same. The settheoretic condition of unique elements was also relaxed.
The resulting definition of a "relation" was just a homogeneous array.
Given this mapping of relations to arrays, STDS operations were just a collection of classical set operations
modified to work on homogeneous arrays. All relations were considered to be sets, that is autonomous and immutable.
No conditions were imposed regarding key fields and no values could be modified.
Set operations were required to read data from disks and write results back to disks.
These conditions insured (what was later to be defined as)
ACID
In computer science, ACID
(Atomicity, Consistency, Isolation, Durability) is a set of properties
that guarantee that database transactions are processed reliably. In the
context of databases, a single logical operation on the data is called a
transaction.
protection.
iXSP: Interactive, Extensible, SetProcessing, Data Management Software
Though STDS proved to be a very fast intrinsically reliable data management and data access system,
its limitation to homogeneous array data seemed restrictive.
Unfortunately, if mathematically reliability were to be maintained, arrays were to remain a limitation
or extensions to the foundations of set theory were necessary to provide mathematical identities
for other data representations.
In 1985 Andreas Blass and the author initiated an investigation on how to
legitimately (provide axioms for) extending the membership condition of classical set theory
to provide recognition of set structure. Though the concept of scopes had been
explored earlier and had proven successful for computer programming, the actually axioms
supporting an extended set theory
were not consummated until
2011.^{[Bla11]}
RDBMS vs. iXSP
All RDBMS implementations use a mathematically reliable application
model, the RDM,
but support system data access with physically dependent indexedrecord
structures.
This means that indexedrecord structures must be planned for, designed,
implemented, and maintained in order to be useful. This consumes
considerable nonrecurring engineering resources, as well as recurring
processing and system administration overhead. In a high rate, low
latency data processing system, there may not be time to effectively
execute the necessary index and related statistical updates to
effectively use them. In large database instances with 100's of millions
or billions of records, the indexedrecord structures themselves can
get very large and deep (in btree structures), often requiring
secondary index structures to manage the primary indexes.
In addition, RDBMS do not have a means to avoid transfer of significant
irrelevant data across the I/O barrier. The entire table (or tables)
involved in a query must be considered each time the query is executed.
All columns of a table must be included in any record retrieval before
performing any requested projections to obtain the desired output
columns. In addition, records are often stored in random accessed pages,
which can result in whole records being transferred over the I/O
barrier for no other reason than they share a page with a desired
record. All of these conditions result in a highly inefficient use of
the storage system and exacerbates storage I/O limitations.
In iXSP, processing is done on sets. Since there are no records, there
is no need for indexedrecord structures. Instead, set theoretic
operations are performed on sets as operands. And the resulting sets,
along with their identity and relationship to their parent sets, are
saved into a continually evolving set universe. Each requested set
operation is first reduced algebraically to its lowest cost equivalent,
using already computed sets where possible. This step is extremely fast
as it uses set metadata and requires no I/O accesses. The resulting set
equation is then executed and the results returned and saved.
Typical queries are looking for specific relevant information and result
in sets that are several orders of magnitudes smaller in size than the
original. In addition, it is typical for multiple queries to be related
in one or more attribute as analyses proceeds along a line of
investigation or "drill down". This results in the creation of multiple,
small data sets containing highly relevant data related to the recent
queries. Instead of relying on indexedrecord structures to improve
efficiency in finding requested records, iXSP uses these small set
partitions of the full data set to rapidly and efficiently find the
equivalent result set using minimum data transfer across the storage I/O
boundary.
By exposing only mathematically welldefined data and operations to
users and applications, set processing DBMS such as MICRO maximize
performance by minimizing system overhead. For example, set processing
does not require extensive schema development and indexedrecord
structure design and maintenance such as is required by RDBMS that
expose these implementation details to users.
Furthermore, by exposing only the set theoretic data and operations the
DBMS implementers and administrators, MICRO and other set theoretic DBMS
enabled dramatic performance benefit using dynamically generated set
processing results rather than predefined indexedrecord structures that
add so much overhead to today’s RDBMS
systems.^{[Har08]}
By focusing only on the sets and set operations, the implementers of MICRO anticipated database innovations such as
column storesA
columnoriented DBMS is a database management system (DBMS) that stores
data tables as sections of columns of data rather than as rows of data.
In comparison, most relational DBMSs store data in rows. Column stores
or transposed files have been implemented from the early days of DBMS
development.
,
data sharding
A database shard is a horizontal partition of data in a database or
search engine. Each individual partition is referred to as a shard or
database shard. Each shard is held on a separate database server
instance, to spread load.,
and
data cracking
Database indices provide a nondiscriminative navigational
infrastructure to localize tuples of interest. Their maintenance cost is
taken during database updates. In this paper, we study the
complementary approach, addressing index maintenance as part of query
processing using continuous physical reorganization, i.e., cracking the
database into manageable pieces.
that came along much later as ad hoc solutions to RDBMS performance
bottlenecks.
The performance advantage of set processing over record processing is
not only in what it does by using set operations, but also by what it
does not do by not requiring protracted schema development nor
extraneous indexedrecord structure overhead.
Big Data Challenges (Physically dependent data management)
Big Data applications introduce issues of
scalability,
data diversity, and
distributed data access.
All of which are restricted by
physically dependent data management strategies.
Replacing physically depended data management strategies
with mathematically reliable data management strategies
can resolve the restrictions on
scalability,
data diversity, and
distributed data access.
This can be achieved with the aid of a
Structured Set Data Model.
The definition of a structuredset data model, SSDM,
is deceptively simple:
SSDM: Any collection of data representations and operations
on them that are welldefined under the axioms of
extended set theory, XST.
There is no restriction on how many operations are defined,
nor on what the operations do.
There is no restriction on how many data representations
are defined,
nor on how they are defined.
The only condition is that all operations and
data representations be defined using
extended set notation, XSN.
A structuredset
is a set as defined under the axioms of
extended set theory.
Conceptually an extended set
is just a classical
set with an extended membership
to provide two conditions for
set membership
instead of just one.
The particulars are rather boring, but the utility
of the extension
allows a set theoretic dimension for structure.
The only difference between classical sets
and structuredsets is that classical sets have only
one condition for membership
while structuredsets require two conditions.
The structure component of an extended set membership
can be used to distinguish the container part
of a data representation
from the content part.
Though set theory has been tried many times as a formal data model,
it has always failed to provide the ability
to suitably define data records as unambiguous sets.
Structuredsets provide an additional component to classical set membership
allowing a
formally defined representation of
data that uniquely distinguishes the logical data relationships (content)
from the physical data representation (container).
All Data Representations
ARE^{[a]}
StructuredSets.
Thus, All Data Can Be Managed Using Set Operations.
Since structuredsets can faithfully represent any and all
application and storage data
with the ability to distinguish data content from data
container, structuredset based access strategies can
manipulate data content and data containers independently
to provide nearoptimal access performance for each and every application.
With structuredsets the distinction between content and structure is an innate property
of extended set membership. This property makes structuredsets a natural choice for
modeling representations of data.
Under a structuredset data model
all logical and physical representations of data are structuredsets.
All manipulations of logical and physical data representations
can be modeled by set operations.
For presentation convenience or performance
considerations extended
set operations can be defined that
map the content of one structuredset to another structuredset having a totally deferent structure.
Thus a
structuredset data model
is an ideal choice for modeling data
management systems satisfying all
four of Codd's data management objectives.
Conclusion
In 1970 Codd introduced the advantage of set processing over record processing.
His influence has dominated data management systems for over forty years.
Though Codd advocated using set processing for both application data and
system data, classical set theory, CST, was not suitable for
faithfully modeling physical data representations.
Recent developments in extending the foundations of set theory
now allow all data representations (both abstract and physical) to be faithfully modeled by set theory.
It is now possible to fulfill Codd's data management objectives by developing
mathematically reliable data management systems that provide all of the following:
(1) Data Independence Objective:
Separate logical/physical aspects of data management.
(2) Communicability Objective:
Make the model structurally simple.
(3) Setprocessing Objective:
Provide a foundation for setoriented processing.
(4) Foundation Objective:
Provide a sound theoretical foundation for data management.
Since all data representations have a mathematical identity under extended set theory, XST,
all existing data representations can be mathematically identified and
manipulated by set operations.
This allows existing systems to be upgraded
from physically dependent systems to mathematically reliable systems.
iXSP software (or its equivalent) can be used to facilitate a system upgrade
or to complement existing systems with a mathematically reliable
data management extension
"It's like 1973 for Moving Data Around in the Cloud."
 Vint Cerf  2010
"There is a difficult road ahead for
enterprise database applications."
A commercial evaluation of iXSP, IBM, and Oracle comparing SQL query performance on an industry standard database.
"The DBMS vendors (and the research community) should start with a clean
sheet of paper and design systems for tomorrow’s requirements, not
continue to push code lines and architectures designed for yesterday’s
needs."
The original motivation for this development of XST was to provide an
improved, flexible mathematical foundation for certain constructions,
like tuples and functions, that play important roles in computing.
The original motivation for this development of XST was to provide an
improved, flexible mathematical foundation for certain constructions,
like tuples and functions, that play important roles in computing.
Subsequent developments revealed that XST also provides a smooth way of
handling the setclass distinction that turns up especially in category
theory.
The 1981 ACM Turing Award Lecture:
"The relational model calls not only for relational
structures (which can be thought of as tables), but also
for a particular kind of set processing called relational
processing. Relational processing entails treating whole
relations as operands."
Imagine trying to convince ancient sheepherders (who calculated and
compared their wealth by equating piles of pebbles to the size of herds
of sheep) that a piece of parchment with numbers on it would give the
same results as a pile of pebbles, but that would be more accurate,
easier to use, faster in processing, less resource intensive, and more
portable? Now imagine trying to convince modern database researchers
that the mathematical identity of records can be used to replace the
physical indexing of records
"For a wide range of reasons, designing for and maintaining optimal data
access poses a genuine challenge to even the most sophisticated
enterprises."
DBAs are under greater pressure than ever to optimize database structure for system performance and administration.
Something as simple as improving the table index design has a profound impact on performance.
Some computing professionals currently run their own consulting
businesses doing little else than
helping customers improve their table indexing design.
Their efforts can improve query performance by as much as 50 times.
If A is a structuredset and if
Γ_{A}(x,s)
is true,
then
x is a smember of A.
If Γ_{A}(x,s) is false,
x is not a smember of A.
For example: let A = <a,b,c> =
{a^{1}, b^{2}, c^{3}},
then
Γ_{A}(b,2)
is true, while
Γ_{A}(a,2)
is false.
Classical sets have no structure.
The membership test for any Classical set
A is Γ_{A}(x,∅).
Thus, all Classical sets are structuredsets with null structure.
faithful means isomorphic representation of content, structure, and behavior.
In set theory sets are defined
by their membership and since data representations
are a physical expression of the content and structure of the data
that can be expressed mathematically,
all data representations are, by definition, structuredsets.
Substantial time is spent in logging, latching, locking, Btree, and
buffer management operations. Over 90% of an application process is
indexedrecord structure overhead.
Databases were optimized for computer technology of the late 1970’s.
Advances in modern processors, memories, and networks mean that today’s
computers are vastly different from those of 30 years ago. Yet database
architecture (record accessing I/O) has changed little.
By optimizing I/O traffic with informationally dense data transfers,
using no physical indexes of any kind, lowlevel setprocessing has
demonstrated a substantial, scalable performance improvement over
locationdependent indexedrecord structures.
Traditional rowstore and columnstore architectures rely on imposed physical representations of data.
Setstore architectures rely on a mathematical identity of data
to represent,
organize, manipulate and access data with high performance informationally dense parallel I/O transfers.
Given the advances in hardware platforms, set processing I/O technology
can offer one to three orders of magnitude better system performance
than traditional record accessing I/O technology.
Skolem concludes:
"I shall not pursue these considerations here, but only emphasize that
it is still a problem how the ordered ntuple can be defined in the most
suitable way."
Notes

⋏
In set theory sets are defined
by their membership and since data representations
are a physical expression of the content and structure of the data
that can be expressed mathematically,
all data representations are, by definition, structuredsets.

⋏
faithful means isomorphic representation of content, structure, and behavior.

⋏
MICRO(19721998) used mathematically well defined operations
in a time sharing environment
to manage application data,
storage data,
and all transformations between the two.
References

[Bla11]
^{⋏}^{a}^{b}
Blass, A., Childs, D L:
Axioms and Models for an Extended Set Theory  2011
♦
This paper presents the formal foundation for supporting "structuredsets".
5.1 TUPLES:
Traditional set theory has several ways of coding ordered tuples
< a_{1}, a_{2}, .. , a_{n} >
as sets, none of which is really
canonical^{
[Sk57]}.
XST provides a simple and natural way to represent tuples,
namely to use natural numbers as scopes. Thus, a tuple
< a_{1}, a_{2}, .. , a_{n} >
is identified with the (extended) set
{ a_{1}^{1}, a_{2}^{2}, .. , a_{n}^{n} }.
The natural numbers used here as
scopes can be represented by the traditional von~Neumann coding (with
scope ∅),
5.2 GENERALIZED TUPLES:
Instead of indexing the components of a tuple by (consecutive)
natural numbers, one could index them by arbitrary, distinct labels.
The same XST representation still works; use the labels as scopes.
This provides, for example, a convenient way to deal with what are
often called records in computing. Records have fields, in
which data are inserted. We can represent them settheoretically by
taking the field names as scopes with the data as elements.
Similarly, we can represent relations, in the sense of relational
databases, by sets of generalized tuples, one for each row of the
relation. The scopes for such a generalized tuple would be the
attribute names, while the corresponding elements would be the values,
in that row, of the attributes.
5.3 FUNCTIONS and MULTIFUNCTIONS (functions defined by set behavior):
In general, a set f of generalized tuples, all having the same scope
set D, can be regarded as describing several (possibly multivalued)
operations, called the behaviors of f.

[Boy73]⋏
Boyce, R. F.; Chamberlin, D. D.; King, W. F.; Hammer, M. M.:
Specifying Queries as Relational Expressions: SQUARE
 IBM Technical Report RJ 1291, 1973
♦
This paper presents a data sublanguage called SQUARE, intended for use
in ad hoc, interactive problem solving by noncomputer specialists.
SQUARE is based on the relational model of data, and is shown to be
relationally complete; however, it avoids the quantifiers and bound
variables required by languages based on the relational calculus.
Facilities for query, insertion, deletion, and update on tabular data
bases are described. A syntax is given, and suggestions are made for
alternative syntaxes, including a syntax based on English key words for
users with limited mathematical background.

[Cer10]↑
Cerf, V.:
"It's like 1973 for Moving Data Around in the Cloud"
♦
Using a cloud computing service may sound enticing, but you better
consider how that data can be moved around if you want to switch to a
different provider. It's a big problem that now has the attention of
Vint Cerf, who is calling for standards to define how customer data gets
passed between different cloud service providers.

[Cha74]⋏
Chamberlin, D. D.; Boyce, R. F.:
SEQUEL: A Structured English Query Language  IBM Research Laboratory, 1974
♦
ABSTRACT: In this paper we present the data manipulation facility for a
structured English query language (SEQUEL) which can be used for accessing
data in an integrated relational data base. Without resorting to the concepts
of bound variables and quantifiers SEQUEL identifies a set of simple operations
on tabular structures, which can be shown to be of equivalent power to
the first order predicate calculus. A SEQUEL user is presented with a consistent
set of keyword English templates which reflect how people use tables to
obtain information. Moreover, the SEQUEL user is able to compose these basic
templates in a structured manner in order to form more complex queries.
SEQUEL is intended as a data base sub language for both the professional programmer
and the more infrequent data base user.

[Cha01]⋏
Champion, M.:
XSP: An Integration Technology for Systems Development and Evolution
 Software AG  2001
♦
The mathematics of the relational model is based on classical set theory,
CST, and this is both its strength and its weakness.
An "extended set theory", XST, can be used to model
ordering and containment relationships
that are simply too "messy" to handle in classical set theory and the
formalisms (such as relational algebra) that are based on it.

[Cod70]
^{⋏}^{a}^{b}
Codd, E. F.:
A Relational Model of Data for Large Shared Data Banks CACM 13, No. 6 (June) 1970
♦
Abstract: Future users of large data banks must be protected from having to know
how the data is organized in the machine (the internal representation).

[Cod81]⋏
Codd, E. F.:
A Relational database: a practical foundation for performance CACM 25, No. 6 (June) 1970
♦
The 1981 ACM Turing Award Lecture:
"The relational model calls not only for relational
structures (which can be thought of as tables), but also
for a particular kind of set processing called relational
processing. Relational processing entails treating whole
relations as operands."

[Ell15]⋏
Ellis, T.:
Extended Set Theory: A Summary  2015
♦
Extended Set Theory (XST) was originally developed under an ARPA
contract to address the limitations of Classical Set Theory (CST). At
the root of the problem is the set membership definition of CST, which
defines sets and the results of set operations based on a single
membership condition: content.

[Fay13]⋏
Fayyad, U. M.:
Big Data Everywhere, and No SQL in Sight
SIGKDD Explorations, Volume 14, Issue 2  2013
♦
"The term BigData is not welldefined, but is
generally used to refer to the type of data that breaks the limits of
traditional data storage and management stateoftheart."

[Har76]⋏
Hardgrave, W. T.:
A technique for implementing a set processor
Proceedings of the 1976 conference on Data : Abstraction, definition and structure archive ACM New York, NY, USA 1976
♦
Extended set theory is an effective method for describing the complex
data structures needed to support largescale data base applications.
This paper describes the incorporation of the salient concepts of
extended set theory into a general tool for the design and
implementation of largescale database systems. This tool is called a
set processor. This implementation is based on the representation of
sets of positive integers as bit strings and the application of the
Cauchy/Cantor Diagonal Method.

[Har08]⋏
^{a}
^{b}
Harizopoulos,~S.; Madden,~S.; Abadi,~D.; Stonebraker,~M.:
OLTP Through the Looking Glass, and What We Found There
SIGMOD'08, June 912, 2008
♦
Databases include a suite of features that were optimized for computer
technology of the late 1970’s. Advances in modern processors, memories,
and networks mean that today’s computers are vastly different from those
of 30 years ago. Yet database architecture (record accessing I/O) has
changed little.
Substantial time is spent in logging, latching, locking, Btree, and
buffer management operations.
Over 90% of an application process is indexedaccess overhead.

[Her72]⋏
Hershey, W; Easthope, C.:
A set theoretic data structure and retrieval language
Paper from the Session on Data Structures, Spring Joint Computer
Conference, May 1972 in ACM SIGIR Forum, Volume 7, Issue 4 (December
1972)
♦
This paper describes a data structure and retrieval program for general
large scale information systems. Retrieval operations are done by means
of set theoretic operations. The general structure of the data structure
and retrieval program are outlined here, and the retrieval language is
briefly explained with accompanying examples. The major components of
the data structure are discussed in detail: The directory, the
dictionary, and the data itself. Limitations of the system and plans for
improvement are mentioned.

[Lar09]⋏
Larsen, S. M.:
The Business Value of Intelligent Data Access  March 2009
♦
Article provides an excellent description on how difficult
it is to optimize data access paths.
"For a wide range of reasons, designing for and maintaining optimal data
access poses a genuine challenge to even the most sophisticated
enterprises." p. 2

[Lig07]⋏
^{a}
^{b}
^{c}
Lightstone, S; Teorey, T.; Nadeau, T.:
Physical Database Design
 Morgan Kaufmann, 2007
♦
The rapidly increasing volume of information contained in relational
databases places a strain on databases, performance, and
maintainability: DBAs are under greater pressure than ever to optimize
database structure for system performance and administration.
Physical Database Design discusses the concept of how physical
structures of databases affect performance.
Something as simple as improving the table index design has a profound
impact on performance.
 Book provides a comprehensive analysisss of how complicated the physical database
design process can be
without the guidance of a formal data access model.
Without such formal support physical file data access structures
typically impede system performance.
a) File data access strategies are extremely difficult to optimize.
"Some computing professionals currently run their own consulting
businesses doing little else than
helping customers improve their table indexing design."
Their efforts can improve query performance by as much as 50 times. (p. 2)

[LLL]⋏
Lawerence Livermore Lab.:
Set Theoretic Data Structures (STDS): a tutorial
Technical Report, 1977.
♦
Set Theoretic Data Structures (STDS): a tutorial
Extended Set Theory as a data base management discipline has received
attention in the data base literature. Set Theoretic Information Systems
Corporation has, for some time, marketed a data base system based on
the foundation of Extended Set Theory. This system is called Set
Theoretic Data Structures (STDS). A series of examples shows that STDS
is similar to relational algebraic data base management systems. The
advantages of STDS are its straightforward data base design, compact
data representation, and flexible, powerful data manipulation operators,
while its limitations are its lowlevel primitive user interface and
the partial implementation of the Extended Set Theoretic concepts. To
make STDS very attractive, a ''userfriendly'' interface should be
developed, and some distinctive features of extended Set Theory (such as
sets of sets) should be implemented. 8 figures, 1 table.

[Man14]⋏
Manoochehri, M.:
Data Just Right  Introduction to LargeScale Data & Analytics)
AddisonWesley, 2014.
♦
Largescale data analysis is now vitally important to virtually every
business. Mobile and social technologies are generating massive
datasets; distributed cloud computing offers the resources to store and
analyze them; and professionals have radically new technologies at their
command, including NoSQL databases. .

[MICRO]
⋏
MICRO RDBMS 19721998; User Manual:
MICRO A Relational Database Management System 1992
♦
MICRO supported timesharing commercial applications from 1972 to 1998.
It was the first system to use settheoretic operations to create and
manage stored data.
MICRO managed data with a mathematically reliable data storage and
access system (STDS) which used no indexedrecord structures,
required no schema design, provided nondestructive updates,
and supported both structured and semistructured data.

[MSQL]
^{⋏}
SQL Server Performance Team:
Great New TPCH Results with SQL Server 2008 17 Aug. 2009
♦
"HP also published a 300GB result on their ProLiant DL785 platform with
SQL Server 2008. This publication illustrates the high performance and
solid price/performance using industry standard components from HP and
Microsoft." (Load time: 300GB in 13.33 hours)

[Nor10]⋏
North, K.:
♦
Three articles
presenting a short historical perspective on the
role of set theory,
mathematically sound data models,
and the importance of data independence.  2010
PART I:
Sets, Data Models and Data Independence
PART II:
Laying the Foundation: Revolution, Math for Databases and Big Data
PART III:
Information Density, Mathematical Identity, Set Stores and Big Data

[Sk57]⋏
^{a}
^{b}
^{c}
Skolem, T.:
Two Remarks on Set Theory (The ordered ntuples as sets)
MATH. SCAND, 5 (1957) 4046
♦
Skolem concludes:
"I shall not pursue these considerations here, but only emphasize that
it is still a problem how the ordered ntuple can be defined in the
most suitable way."

[Sto5]⋏
Stout, R.:
Information Access Accelerator
Information Builders Inc.  2005
♦
Slide presentation: An evaluation of iXSP, IBM, and Oracle comparing SQL
query performance on an industry standard database. Performance
improvement varied from 10 to 90 times faster.

[Sto7]⋏
Stonebraker, M.; Madden, S.; Abadi, D.; Harizopoulos, S.; Hachem, N.; Helland, P.:
The End of an Architectural Era (It's Time for a Complete Rewrite)
33rd International Conference on Very Large Data Bases, 2007.

[Sto+14]⋏
Stonebraker, M., et. al.:
Enterprise Data Applications and the Cloud: A Difficult Road Ahead
Proc IEEE IC2E, Boston, Ma., March 2014
♦
Paper succinctly delineates potential growing pains for future DBMS,
if developers continue to rely on physically dependent data access technologies.
"There is considerable interest in moving DBMS
applications from inside enterprise data centers to the cloud,
both to reduce cost and to increase flexibility and elasticity.
In some circumstances, achieving good DBMS performance on
current cloud architectures and future hardware technologies will
be nontrivial.
In summary, there is a difficult road ahead for
enterprise database applications."

[Sto14]⋏
Stonebraker, M.:
Hadoop at a Crossroads?
BLOG@CACM, August, 2014.
♦
Persistent use of file systems perpetuating use of physical data
location access strategies
will present serious challenges for future system developers.
"Hiding the (physical) location of data from the DBMS is death, and the
DBMS will go to great lengths to circumvent this feature."

[Teo11]⋏
Teorey, T.;Lightstone, S; Nadeau, T.Jagadish, H. V.:
Database Modeling and Design
Morgan Kaufmann, 2011, Fifth Edition.
♦
Many in the industry consider this to be the best book available on
classic database design and for explaining how to build database
applications, complemented with objective commentary. For example in
Chapt. 8: ``In short, transferring data between a database and an
application program is an onerous process, because of both difficulty of
programming and performance overhead."

[Wes70]⋏
Westervelt, F. H.:
CONCOMP: Research in Conversational use of Computers : final report
OFFICE OF RESEARCH ADMINISTRATION, Ann Arbor, December 1970
♦
This report describes the final research results of the CONCOMP Project:
Research in the Conversational Use of Computers, which was funded from
1965/70.
2.5 DATA STRUCTURES
The everpresent problems of providing rapid, flexible access to
complexly interrelated data resulted in several efforts in the area of
data structures under CONCOMP.
The work of Childs reported in Technical Reports 3 and 6 and Appendix D
of this report represent the scope of efforts expanded. The
theoretically derived structures have been implemented and employed by
numerous applications users in MTS. The flexibility and versatility
represented deserve greater emphasis than is possible in this brief
summary.
Bibliography

⋏
Feasibility of a Settheoretic Data Structure
A general structure based on a reconstituted definition of relation
IFIP Cong., Edinburgh Scotland, August 1968
♦
This antique paper presented the thesis that mathematical control over
the representation, management, and access of data was critical for the
functional freedom of applications and I/O performance of future
systems.

⋏
Description of a Settheoretic Data Structure:
AFIPS fall joint computer conference San Fransico CA, December 1968
♦
Presents early development of STDS,
a machineindependent settheoretic data structure allowing rapid processing of
data related by arbitrary assignment.

⋏
Extended Set Theory:
A General Model For Very Large, Distributed, Backend Information Systems
VLDB 1977 (Invited paper)
♦
ABSTRACT Three distinct components comprise an Information System:
INFORMATION MANAGEMENT, DATA MANAGEMENT, and STORAGE MANAGEMENT. Until
recently, all three have been subsumed under data management. As
applications become more demanding, as support criteria become more
complex, and as storage capacity becomes very large, the need for
functional independence of these three management areas has become more
apparent. Recognition of this situation has been popularized through the
phrase, "data independence", or more precisely, "data independence from
information" and "data independence from storage".
The difficulty in achieving data independence
arises through the incompatibility of a complex information space being
supported by a simple storage space. The popular, but limiting approach,
has been to force the information space into a restrictive record
space. This achieves a deceptive compatibility allowing only the
appearance of data independence at the user level. This record oriented
approach has become pervasive for small databases even though it
constrains user applications, requires substantial storage overhead, and
imposes inherent processing inefficiencies.
As databases become very large and as
distributed systems become desirable the need for inherent (not
superficial) data independence becomes crucial. This paper is intended
as a tutorial and will describe conditions for data independence and
summaries the concepts of Extended Set Theory as a general model for
expressing information systems embodying data independence. This
generality will be demonstrated by considering some major problems
pertinent to the design and support of very large, distributed, backend
information systems.
It should be emphasized that Extended Set
Theory is a formalism for expressing solutions and is not a specific
solution in itself. Though "redundant membership condition",
"distributed membership condition", and "settheoretic interface" may be
new concepts, Extended Set Theory does not preclude any current DBMS
concepts, data structures, or existing implementations. Rather, Extended
Set Theory embraces them all under a unifying model.

⋏
PEBBLE PILES & INDEX STRUCTURES: A Parable
 2005
♦
Imagine trying to convince ancient sheepherders (who calculated and
compared their wealth by equating piles of pebbles to the size of herds
of sheep) that a piece of parchment with numbers on it would give the
same results as a pile of pebbles, but that would be more accurate,
easier to use, faster in processing, less resource intensive, and more
portable? Now imagine trying to convince modern database researchers
that the mathematical identity of records can be used to replace the
physical indexing of records.

⋏
1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize
The Design Of Database Management Systems
Proceedings of the Tenth International Conference on Very Large Data Bases.
Singapore, August, 1984
♦
As secondary storage devices increase in capacity and decrease in cost,
current DBMS design philosophies become less adequate for addressing
the demands to be imposed by very large database environments. Future
database management systems must be designed to allow dynamic
optimization of the I/O overhead, while providing more sophisticated
applications involving increasingly complex data relationships.

⋏
A Mathematical Foundation for Systems Development  NATOASI Series, Vol. F24, 1986
♦
Paper presents a Hypermodel syntax for precision modeling of arbitrarily complex systems
by providing
a function space continuum with explosive resolution and extended set notation to provide
generality and rigor to the concept of a Hypermodel.

⋏
Managing Data Mathematically:
Data As A Mathematical Object:
♦
"Using Extended Set Theory for High Performance Database Management"
Presentation given at Microsoft Research Labs. with an introduction by Phil Bernstein.
(video: duration 1:10:52)  2006

⋏
Why Not Sets?  2010
♦
Sets are well defined collections of uniquely identifiable items. Data
used by computers are well defined collections of items representing
situations of interest. Computers themselves are just well defined
collections of bits that change value over time. It would seem that all
computer processing is highly set oriented. Why are sets not more widely
used in modeling the behavior and assisting the development of
computing systems?

⋏
Functions as Set Behavior: A Formal Foundation Based On Extended Set Axioms  2011
♦
ABSTRACT: The term function suggests an action or process or behavior
of something applied to something. Within the framework of extended set
theory, XST, the concept of a function will be defined in terms of a
liberal definition of morphism. Which in turn will be equated with the
behavior of how specific sets interact with each other. It will be shown
that all Classical set theory, CST, graph based function behavior can
be expressed in terms of XST function nongraph based behavior; that the
behavior of functions applied to themselves is supported; and that the
concepts of Category theory can be subsumed under XST. A notable
consequence of this approach is that the use of functions need no longer
be constrained by properties of a Cartesian product.

⋏
XSP TECHNOLOGY: Theory & Practice:
Formal Modeling & Practical Implementation of XML & RDM Systems
 2011
♦
INTRODUCTION XSP (extended set processing) Technology introduces: [1] a
formal mathematical foundation, extended set theory, XST, for defining
and mapping user application expectations into internal machine
representations; and [2] practical software implementations based on XST
that assist in the design, development, and use of XML and RDM systems.

⋏
SETPROCESSING AT THE I/O LEVEL:
A Performance Alternative to Traditional Index Structures
 2011
♦
ABSTRACT It is generally believed that index structures are essential
for highperformance information access. This belief is false. For,
though indexing is a venerable, valuable, and mathematically sound
identification mechanism, its logical potential for identifying unique
data items is restricted by structuredependent implementations that are
extremely inefficient, costly, functionally restrictive, information
destructive, resource demanding, and, most importantly, that preclude
data independence. A lowlevel logical data access alternative to
physical indexed data access is setprocessing. System I/O level
setprocessing minimizes the overall I/O workload by more efficiently
locating relevant data to be transferred, and by greatly increasing the
information transfer efficiency over that of traditional indexed record
access strategies. Instead of accessing records through imposed
locations, the setprocessing alternative accesses records by their
intrinsic mathematical identity. By optimizing I/O traffic with
informationally dense data transfers, using no physical indexes of any
kind, lowlevel setprocessing has demonstrated a substantial, scalable
performance improvement over locationdependent index structures.

⋏
SETSTORE DATA ACCESS ARCHITECTURES:
For High Performance Informationally Dense I/O Transfers
 2011
♦
ABSTRACT For high performance analytic processing of vast amounts of
data buried in secondary storage, traditional performance strategies
generally advocate minimizing system I/O. This paper advocates the
converse supported by the use of setstore architectures. Traditional
rowstore and columnstore architectures rely on mechanical data models
(based on an imposed physical representation of data) for accessing and
manipulating system data. Setstore architectures rely on a mathematical
data model (based solely on the intrinsic mathematical identity of
data) to control the representation, organization, manipulation and
access of data for high performance informationally dense parallel I/O
transfers.

⋏
iXSP Interactive Extended Set Processor:
Programmer’s Manual, User’s Tutorial and Mathematical Backgrounds
 v1.7 (44 pages) Draft 2013
♦
This manual has been produced for those interested in an introduction to the Extended
Set Processor from IIS and the commands that the processor supports.

⋏
I/O Technology For Big Data:
Massively Parallel Data Access of Big Data Environments  2011
♦
Traditional I/O technology is based on the storage and retrieval of
records, records that are physically preserved in storage. Set
processing I/O technology is based on the exchange of collections (sets)
of records, records which may or may not physically exist in storage.
Given the advances in hardware platforms, set processing I/O technology
can offer one to three orders of magnitude better system performance
than traditional I/O technology.

⋏
XSP: Extended Set Processing:
Mathematically Managing Data Representations (1 page summary)
♦
This paper presents a high level overview of why a mathematical model of
data representations is necessary, and how an extended set processing
model accomplishes the task.

⋏
Set Processing vs. Record Processing Performance:
Dynamic Data Restructuring vs. Prestructured Data Storage(1 page summary))
♦
Record processing systems and set processing systems are very different.
This paper attempts to clarify the major differences and demonstrate
the performance advantages of set processing.

⋏
Functions As Set Behavior Essential Concepts: Conceptual & Formal  2016
♦
The basic
notion of the term function suggests a process or behavior
that accepts an input and produces an output.
This behavior can be formally defined in terms of set processing.
A practical consequence of defining processes in terms of
set processing allows future
computer systems to dynamically optimize data processing performance.

⋏
SQL Rejuvenated Data, Relationships, RDM, Tables, NoSQL & Set Theory  2016
♦
SQL has always been a Set Query Language.
However, it has never been implemented to fully
exploit the advantage of its settheoretic heritage.
The fault is not with system developers,
it is with the settheoretic foundations.

⋏
ContentContainer Data Access Strategies Content for Functionality  Containers for Performance  2016
♦
Since data represented for processing is not always ideal
for preservation, and since data represented for
preservation is not always ideal for processing,
accessing applicationfriendly data
from storagefriendly data
poses a genuine challenge.
Copyright © 2017
INTEGRATED INFORMATION SYSTEMS
« Last modified on 01/15/2017 »
 CONTACT 
