The obvious distinction between set processing
and record processing systems is that set processing systems
process sets, while record processing systems
process records.
What may not be so obvious is that sets are mathematical objects,
while records are physical objects.
The difference is significant.
Record processing systems access records by knowledge of
their physical location.
Set processing systems access sets by knowledge of
their mathematical identity.
The use of a mathematical identity of data representations
allows replacing physically dependent record processing systems
with mathematically reliable set processing systems.
Examples of such surgery have shown two orders of magnitude performance improvement over
commercial RDBMS implementations.
Though existing and future data management systems can readily benefit
from set processing,
the ability to process all data as a mathematical object
could also provide a potentially powerful analytic tool for
data scientists.
Mathematical Identity of Data
In 1970
Codd introduced the
Relational Data Model,
RDM.
Codd's approach was to abstract away the physical data structures and low-level algorithms of previous data management systems with mathematically reliable operations that could be used at the application level. Storage data and data access were still managed by physically dependent routines, but these physical dependencies were hidden from application users.
Codd's contribution
to data management
was introducing the concept of
a mathematical identity of data,
manipulated by set-theoretic operations.
RDM: Relational Data Model
The need to access large amounts of diverse distributed data
was not a concern when traditional data management was being developed.
The primary concern, forty years ago, was just accessing stored data.
Since all data was managed by routines
using physical properties of data representations,
users were burdened with having to know the physical organization of
data in
storage.^{[Cod70]}
The Relational Data Model, RDM, is an abstract model presenting a strategy for modeling data as mathematical objects and manipulating these objects with mathematically well-defined operations.
A Relational Database Management System,
RDBMS, is an implementation combining the data management specifications of the RDM with some data access and storage organization strategy.
In his Turing acceptance paper in 1981,^{[Cod81]}
Codd provided some design objectives for
building future data management systems.
Codd's Data Management Objectives
(1) Data Independence Objective:
Separate logical/physical aspects of data management.
(2) Communicability Objective:
Make the model structurally simple.
(3) Set-processing Objective:
Provide a foundation for set-oriented processing.
(4) Foundation Objective:
Provide a sound theoretical foundation for data management.
In 1981 the foundations of set theory were inadequate to support
the requirements for faithfully modeling data representations,
making it impossible for any implementation to satisfy all four
data management objectives.
Thus actual RDBMS implementations had to compromise the mathematical reliability of the relational data model and use physically dependent strategies to access data from storage.
RDM: mathematically reliable, declarative, set-processing.
RDBMS: physically dependent, procedural, record-processing.
Despite the fundamental mismatch between RDM and RDBMS data management strategies, the RDBMS's ability to pragmatically manipulate data representations has allowed Relational systems to dominate the industry for over thirty years. Unfortunately the mathematical muscle of the RDM currently supported by RDBMS installations is not sufficient to support the requirements of Big Data applications, as physical data structures become more and more distributed across processors and around the world.
The RDM was successful because it provided operations to somewhat minimize the physical dependency between the user's view of data and the system's representation of data, and this was sufficient for most data center applications. But many of Codd's lessons have been lost in the transition to cloud computing. Today the physical details of data storage and management in the cloud make Big Data applications more difficult to develop, more expensive to use, and more difficult to design and maintain than they would be if Codd's principles were respected.
Recently, extensions to the foundations of set theory
have been completed and proven
consistent.^{[Bla11]}
Extended set theory, XST, provides a sound theoretical foundation
for data management by supporting a mathematical identity
for all data
representations, (RDM,
XML,
Xrel1,
Xrel2).
Most importantly, XST allows all of Codd's data management objectives to be achieved.
This means that the physical dependencies that distinguish Relational
(SQL)
from non-Relational
(NoSQL)
implementations can be isolated and removed.
Big Data Management
Scalability, diverse data, schema development overhead, and access to large volumes of distributed data are a few of the concerns facing developers of Big Data management strategies.
Some early pioneering developers are even suggesting the need for
a complete system overhaul.
The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs.^{[Sto7]}
A system overhaul is not required. Existing systems can easily be upgraded
and new systems developed to provide optimal performance for all applications.
What is required is evolving physically dependent data management strategies
into mathematically reliable data management strategies.
Physically dependent data management strategies (those that manage data
using physical properties of data representations)
require using indexed-record structures to access data.
Mathematically reliable data management strategies (those that manage data
using a mathematical identity of data)
use metadata index sets and set operations to access data.
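The contrast can be sketched in a few lines of Python. The partitioning scheme and the names below are hypothetical, not part of any actual iXSP interface; the point is only that set operations on metadata decide which stored sets are touched at all:

```python
# Hypothetical sketch of "metadata index sets": each stored set carries
# metadata (here, the set of distinct roles it contains), and a query
# consults metadata alone before reading any stored set.
stored_sets = {
    "part1": {("alice", "dev"), ("bob", "ops")},
    "part2": {("carol", "qa"), ("dan", "qa")},
}

# Metadata computed once; later queries need no I/O to consult it.
meta = {name: {role for _, role in s} for name, s in stored_sets.items()}

def find(role):
    # Set operations on metadata select the relevant partitions;
    # only those partitions are read below.
    relevant = {name for name, roles in meta.items() if role in roles}
    return {t for name in relevant for t in stored_sets[name] if t[1] == role}

print(find("qa"))
```

No index structure is built or maintained; the metadata sets are themselves just sets, manipulated by the same operations as the data.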
Mathematically reliable data management and analysis software, iXSP, already exists
and is available to augment the performance of current systems
and to assist in the development of future systems.
iXSP is a suite of I/O operations for accessing and
processing disk resident data, providing a 10 to 90 times performance improvement over
IBM and
Oracle.^{[Sto5]}
A summary of iXSP properties:
* Relationally complete.
* Supports all four of Codd's data management objectives.
* Requires no schema design.
* Allows spontaneous SQL queries.
* Has no indexed-record structure overhead.
* Recognizes all data representations.
* Both vertically and horizontally scalable.
* Uses constructive updates instead of destructive updates.
* I/O performance improvement of up to three orders of magnitude.
* Parallel I/O access to very large distributed data.
The foundations of iXSP software were initiated under the
CONCOMP
project, sponsored by an ARPA
grant from 1965 to 1970. The project was directed by F. H. Westervelt.
One of his objectives was to "find a mathematical foundation for data structures".
In 1968 two papers presented some success by reconstituting some formal definitions
of classical set theory to better reflect intuitive notions.
Further efforts led to the development and implementation of a collection of set operations that
worked directly on data stored as arrays.
This software, STDS, was used as the data access engine for the University of Michigan
MICRO RDBMS
from 1972 to 1998. MICRO managed data with a mathematically reliable data storage and
access system (STDS) that used no indexed-record structures,
required no schema design, provided non-destructive updates,
and supported both structured and semi-structured data.
STDS: Set-Theoretic Data Structure
In 1968 two papers were published suggesting the potential
advantages of
recognizing data as a mathematical object,
managing data
with set operations,
and organizing data on disks using
a set-theoretic data structure,
STDS.^{[A]}^{[B]}
Conceptually, the set-theoretic concept of an
n-tuple (a sequence, or ordered list, of n elements)
seemed like an ideal
model for a data record. It was, and still is, an illusion.
As articulated by
Skolem,^{[Sk57]}
there is currently no suitable way to define an n-tuple set-theoretically.
Intuitively, however, the set-theoretic operations defined for n-tuples were ideally suited
to manipulate data records modeled as n-tuples.
By ignoring the difficulties of classical set theory, CST, an n-tuple was considered to be
a sequence of ordered values. A reconstituted definition of a relation was
any set of n-tuples such that "n" was the same. The set-theoretic condition of unique elements was also relaxed.
The resulting definition of a "relation" was just a homogeneous array.
Given this mapping of relations to arrays, STDS operations were just a collection of classical set operations
modified to work on homogeneous arrays. All relations were considered to be sets, that is, autonomous and immutable.
No conditions were imposed regarding key fields and no values could be modified.
Set operations were required to read data from disks and write results back to disks.
These conditions insured (what was later to be defined as)
ACID (Atomicity, Consistency, Isolation, Durability)
protection.
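Under these definitions a relation is just a set of same-length tuples, and the STDS-style operations are ordinary set operations. A minimal Python sketch, with illustrative names rather than the actual STDS interface:

```python
# A "relation" modeled as a set of same-length tuples (a homogeneous array),
# manipulated only with classical set operations -- no keys, no indexes,
# no in-place modification of values.
employees = {("alice", "dev"), ("bob", "ops"), ("carol", "dev")}
contractors = {("bob", "ops"), ("dan", "qa")}

# Set operations replace record-at-a-time processing:
staff = employees | contractors                   # union
shared = employees & contractors                  # intersection
devs = {t for t in employees if t[1] == "dev"}    # restriction (selection)

print(sorted(staff))
print(shared)
print(devs)
```

Note that every operation produces a new set and leaves its operands untouched, mirroring the autonomous, immutable treatment of relations described above.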
iXSP: Interactive, Extensible, Set-Processing, Data Management Software
Though STDS proved to be a very fast intrinsically reliable data management and data access system,
its limitation to homogeneous array data seemed restrictive.
Unfortunately, if mathematical reliability was to be maintained, either arrays had to remain a limitation
or extensions to the foundations of set theory were necessary to provide mathematical identities
for other data representations.
In 1985 Andreas Blass and the author initiated an investigation of how to
legitimately extend (provide axioms for) the membership condition of classical set theory
to provide recognition of set structure. Though the concept of scopes had been
explored earlier and had proven successful for computer programming, the actual axioms
supporting an extended set theory
were not consummated until
2011.^{[Bla11]}
RDBMS vs. iXSP
All RDBMS implementations use a mathematically reliable application model, the RDM,
but support system data access with physically dependent indexed-record structures.
This means that indexed-record structures must be planned for, designed, implemented, and maintained in order to be useful. This consumes considerable nonrecurring engineering resources, as well as recurring processing and system administration overhead. In a high-rate, low-latency data processing system, there may not be time to execute the index and related statistical updates needed to use them effectively. In large database instances with hundreds of millions or billions of records, the indexed-record structures themselves can get very large and deep (in B-tree structures), often requiring secondary index structures to manage the primary indexes.
In addition, RDBMS have no means to avoid transferring significant irrelevant data across the I/O barrier. The entire table (or tables) involved in a query must be considered each time the query is executed. All columns of a table must be included in any record retrieval before performing any requested projections to obtain the desired output columns. Records are also often stored in randomly accessed pages, which can result in whole records being transferred over the I/O barrier for no other reason than that they share a page with a desired record. All of these conditions result in a highly inefficient use of the storage system and exacerbate storage I/O limitations.
In iXSP, processing is done on sets. Since there are no records, there is no need for indexed-record structures. Instead, set-theoretic operations are performed on sets as operands, and the resulting sets, along with their identity and relationship to their parent sets, are saved into a continually evolving set universe. Each requested set operation is first reduced algebraically to its lowest-cost equivalent, using already computed sets where possible. This step is extremely fast, as it uses set metadata and requires no I/O accesses. The resulting set equation is then executed and the results returned and saved.
Typical queries look for specific relevant information and result in sets that are several orders of magnitude smaller than the original. In addition, multiple queries are typically related in one or more attributes as analysis proceeds along a line of investigation, or "drill down". This results in the creation of multiple small data sets containing data highly relevant to the recent queries. Instead of relying on indexed-record structures to improve the efficiency of finding requested records, iXSP uses these small set partitions of the full data set to rapidly and efficiently find the equivalent result set with minimum data transfer across the storage I/O boundary.
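A toy sketch of this reuse strategy, with hypothetical names rather than the actual iXSP implementation: a drill-down query is answered from the smallest previously computed result whose predicate set is weaker than the new one, instead of rescanning the full data set.

```python
# Hypothetical sketch of set-result reuse: results of earlier queries live
# in an evolving "universe" keyed by their predicate names, and a new
# conjunctive query narrows from the smallest compatible cached result.
def query(universe, full_data, preds):
    key = frozenset(p.__name__ for p in preds)
    base = full_data
    for cached_key, cached_set in universe.items():
        # A cached result for a weaker conjunction contains our result.
        if cached_key <= key and len(cached_set) < len(base):
            base = cached_set
    result = {t for t in base if all(p(t) for p in preds)}
    universe[key] = result           # save into the set universe
    return result

def is_dev(t):    return t[1] == "dev"
def is_senior(t): return t[2] >= 10

data = {("alice", "dev", 12), ("bob", "ops", 3), ("carol", "dev", 2)}
universe = {}
devs = query(universe, data, [is_dev])
senior_devs = query(universe, data, [is_dev, is_senior])  # narrows from devs
print(sorted(devs))
print(senior_devs)
```

The second query never touches the two non-dev tuples; in a disk-resident setting that is the data that never crosses the I/O boundary.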
By exposing only mathematically well-defined data and operations to users and applications, set-processing DBMS such as MICRO maximize performance by minimizing system overhead. For example, set processing does not require the extensive schema development and indexed-record structure design and maintenance required by RDBMS that expose these implementation details to users.
Furthermore, by exposing only set-theoretic data and operations to the DBMS implementers and administrators, MICRO and other set-theoretic DBMS enabled dramatic performance benefits using dynamically generated set-processing results rather than the predefined indexed-record structures that add so much overhead to today's RDBMS
systems.^{[Har08]}
By focusing only on the sets and set operations, the implementers of MICRO anticipated database innovations such as
column stores,
data sharding,
and data cracking
that came along much later as ad hoc solutions to RDBMS performance bottlenecks.
The performance advantage of set processing over record processing lies not only in what it does, by using set operations, but also in what it does not do, by requiring neither protracted schema development nor extraneous indexed-record structure overhead.
Big Data Challenges (Physically dependent data management)
Big Data applications introduce issues of
scalability,
data diversity, and
distributed data access,
all of which are restricted by
physically dependent data management strategies.
Replacing physically dependent data management strategies
with mathematically reliable data management strategies
can resolve the restrictions on
scalability,
data diversity, and
distributed data access.
This can be achieved with the aid of a
Structured Set Data Model.
The definition of a structured-set data model, SSDM,
is deceptively simple:
SSDM: Any collection of data representations and operations
on them that are well-defined under the axioms of
extended set theory, XST.
There is no restriction on how many operations are defined,
nor on what the operations do.
There is no restriction on how many data representations
are defined,
nor on how they are defined.
The only condition is that all operations and
data representations be defined using
extended set notation, XSN.
A structured-set
is a set as defined under the axioms of
extended set theory.
Conceptually an extended set
is just a classical
set with an extended membership
to provide two conditions for
set membership
instead of just one.
The particulars are rather boring, but the utility
of the extension
allows a set-theoretic dimension for structure.
The only difference between classical sets
and structured-sets is that classical sets have only
one condition for membership
while structured-sets require two conditions.
The structure component of an extended set membership
can be used to distinguish the container part
of a data representation
from the content part.
Though set theory has been tried many times as a formal data model,
it has always failed to provide the ability
to suitably define data records as unambiguous sets.
Structured-sets provide an additional component to classical set membership,
allowing a
formally defined representation of
data that uniquely distinguishes the logical data relationships (content)
from the physical data representation (container).
All Data Representations
ARE^{[a]}
Structured-Sets.
Thus, All Data Can Be Managed Using Set Operations.
Since structured-sets can faithfully represent any and all
application and storage data
with the ability to distinguish data content from data
container, structured-set based access strategies can
manipulate data content and data containers independently
to provide near-optimal access performance for each and every application.
With structured-sets the distinction between content and structure is an innate property
of extended set membership. This property makes structured-sets a natural choice for
modeling representations of data.
Under a structured-set data model
all logical and physical representations of data are structured-sets.
All manipulations of logical and physical data representations
can be modeled by set operations.
For presentation convenience or performance
considerations, extended
set operations can be defined that
map the content of one structured-set to another structured-set having a totally different structure.
Thus a
structured-set data model
is an ideal choice for modeling data
management systems satisfying all
four of Codd's data management objectives.
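As an illustration, assuming a simple (value, scope) pair encoding of extended sets that is only a stand-in for XST's formal machinery, restructuring can be modeled as a pure scope transformation that leaves content untouched:

```python
# The same content under two different "containers": a row-major and a
# column-major structuring of a 2x2 table. Each membership is a
# (value, (row, col)) pair; restructuring rewrites scopes only.
table = {("a", (0, 0)), ("b", (0, 1)), ("c", (1, 0)), ("d", (1, 1))}

def transpose(s):
    """Map row-major scopes to column-major scopes; content is unchanged."""
    return {(x, (c, r)) for (x, (r, c)) in s}

cols = transpose(table)
print(("b", (1, 0)) in cols)                      # 'b' now at scope (1, 0)
print({x for x, _ in table} == {x for x, _ in cols})  # identical content
```

This is the sense in which a set-store can reorganize physical containers (row order, column order, partitioning) without disturbing the mathematical identity of the data.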
Conclusion
In 1970 Codd introduced the advantage of set processing over record processing.
His influence has dominated data management systems for over forty years.
Though Codd advocated using set processing for both application data and
system data, classical set theory, CST, was not suitable for
faithfully modeling physical data representations.
Recent developments in extending the foundations of set theory
now allow all data representations (both abstract and physical) to be faithfully modeled by set theory.
It is now possible to fulfill Codd's data management objectives by developing
mathematically reliable data management systems that provide all of the following:
(1) Data Independence Objective:
Separate logical/physical aspects of data management.
(2) Communicability Objective:
Make the model structurally simple.
(3) Set-processing Objective:
Provide a foundation for set-oriented processing.
(4) Foundation Objective:
Provide a sound theoretical foundation for data management.
Since all data representations have a mathematical identity under extended set theory, XST,
all existing data representations can be mathematically identified and
manipulated by set operations.
This allows existing systems to be upgraded
from physically dependent systems to mathematically reliable systems.
iXSP software (or its equivalent) can be used to facilitate a system upgrade
or to complement existing systems with a mathematically reliable
data management extension.
"It's like 1973 for Moving Data Around in the Cloud."
- Vint Cerf, 2010
"There is a difficult road ahead for
enterprise database applications."
A commercial evaluation of iXSP, IBM, and Oracle comparing SQL query performance on an industry standard database.
"The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs."
The original motivation for this development of XST was to provide an improved, flexible mathematical foundation for certain constructions, like tuples and functions, that play important roles in computing. Subsequent developments revealed that XST also provides a smooth way of handling the setclass distinction that turns up especially in category theory.
The 1981 ACM Turing Award Lecture:
"The relational model calls not only for relational
structures (which can be thought of as tables), but also
for a particular kind of set processing called relational
processing. Relational processing entails treating whole
relations as operands."
Imagine trying to convince ancient sheepherders (who calculated and compared their wealth by equating piles of pebbles to the size of herds of sheep) that a piece of parchment with numbers on it would give the same results as a pile of pebbles, but would be more accurate, easier to use, faster in processing, less resource intensive, and more portable. Now imagine trying to convince modern database researchers that the mathematical identity of records can be used to replace the physical indexing of records.
"For a wide range of reasons, designing for and maintaining optimal data
access poses a genuine challenge to even the most sophisticated
enterprises."
DBAs are under greater pressure than ever to optimize database structure for system performance and administration.
Something as simple as improving the table index design has a profound impact on performance.
Some computing professionals currently run their own consulting
businesses doing little else than
helping customers improve their table indexing design.
Their efforts can improve query performance by as much as 50 times.
If A is a structured-set and if
Γ_{A}(x,s)
is true,
then
x is an s-member of A.
If Γ_{A}(x,s) is false,
x is not an s-member of A.
For example: let A = <a,b,c> =
{a^{1}, b^{2}, c^{3}},
then
Γ_{A}(b,2)
is true, while
Γ_{A}(a,2)
is false.
Classical sets have no structure.
The membership test for any Classical set
A is Γ_{A}(x,∅).
Thus, all Classical sets are structured-sets with null structure.
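This membership test is easy to sketch, assuming a simple (element, scope) pair encoding that is only illustrative of XST's two-condition membership:

```python
# Sketch of extended-set membership: a member is tested on two conditions,
# an element x and a scope s. An extended set is modeled as a set of
# (element, scope) pairs; a classical set uses a null scope throughout.
def gamma(A, x, s):
    """Membership test: is x an s-member of A?"""
    return (x, s) in A

# The tuple <a, b, c> as the extended set {a^1, b^2, c^3}:
A = {("a", 1), ("b", 2), ("c", 3)}
print(gamma(A, "b", 2))   # b is the 2nd component
print(gamma(A, "a", 2))   # a is not the 2nd component

# A classical set {a, b} as a structured-set with null structure:
C = {("a", None), ("b", None)}
print(gamma(C, "a", None))
```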
Substantial time is spent in logging, latching, locking, B-tree, and buffer management operations. Over 90% of an application process is indexed-record structure overhead.
Databases were optimized for computer technology of the late 1970’s. Advances in modern processors, memories, and networks mean that today’s computers are vastly different from those of 30 years ago. Yet database architecture (record accessing I/O) has changed little.
By optimizing I/O traffic with informationally dense data transfers, using no physical indexes of any kind, low-level set-processing has demonstrated a substantial, scalable performance improvement over location-dependent indexed-record structures.
Traditional row-store and column-store architectures rely on imposed physical representations of data.
Set-store architectures rely on a mathematical identity of data
to represent,
organize, manipulate and access data with high-performance, informationally dense parallel I/O transfers.
Given the advances in hardware platforms, set processing I/O technology can offer one to three orders of magnitude better system performance than traditional record accessing I/O technology.
Skolem concludes:
"I shall not pursue these considerations here, but only emphasize that it is still a problem how the ordered ntuple can be defined in the most suitable way."
Notes

⋏
In set theory sets are defined
by their membership, and since data representations
are a physical expression of the content and structure of the data
that can be expressed mathematically,
all data representations are, by definition, structured-sets.

⋏
faithful means isomorphic representation of content, structure, and behavior.

⋏
MICRO (1972-1998) used mathematically well-defined operations
in a time sharing environment
to manage application data,
storage data,
and all transformations between the two.
References

[Bla11]
^{⋏}^{a}^{b}
Blass, A., Childs, D L:
Axioms and Models for an Extended Set Theory  2011
♦
This paper presents the formal foundation for supporting "structured-sets".
5.1 TUPLES:
Traditional set theory has several ways of coding ordered tuples
< a_{1}, a_{2}, .. , a_{n} >
as sets, none of which is really
canonical.^{[Sk57]}
XST provides a simple and natural way to represent tuples,
namely to use natural numbers as scopes. Thus, a tuple
< a_{1}, a_{2}, .. , a_{n} >
is identified with the (extended) set
{ a_{1}^{1}, a_{2}^{2}, .. , a_{n}^{n} }.
The natural numbers used here as
scopes can be represented by the traditional von Neumann coding (with
scope ∅).
5.2 GENERALIZED TUPLES:
Instead of indexing the components of a tuple by (consecutive)
natural numbers, one could index them by arbitrary, distinct labels.
The same XST representation still works; use the labels as scopes.
This provides, for example, a convenient way to deal with what are
often called records in computing. Records have fields, in
which data are inserted. We can represent them settheoretically by
taking the field names as scopes with the data as elements.
Similarly, we can represent relations, in the sense of relational
databases, by sets of generalized tuples, one for each row of the
relation. The scopes for such a generalized tuple would be the
attribute names, while the corresponding elements would be the values,
in that row, of the attributes.
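This generalized-tuple encoding can be sketched directly, using an illustrative (value, scope) pair representation:

```python
# A relational row as a generalized tuple: attribute names as scopes,
# attribute values as elements (illustrative encoding of [Bla11] 5.2).
row1 = frozenset({("alice", "name"), (30, "age")})
row2 = frozenset({("bob", "name"), (25, "age")})
relation = {row1, row2}

# Projecting the "name" attribute is just scope-based selection:
names = {x for row in relation for (x, s) in row if s == "name"}
print(sorted(names))
```

No consecutive indices are needed; any distinct labels serve as scopes, which is what lets the same machinery cover tuples, records, and relational rows.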
5.3 FUNCTIONS and MULTIFUNCTIONS (functions defined by set behavior):
In general, a set f of generalized tuples, all having the same scope
set D, can be regarded as describing several (possibly multivalued)
operations, called the behaviors of f.

[Boy73]⋏
Boyce, R. F.; Chamberlin, D. D.; King, W. F.; Hammer, M. M.:
Specifying Queries as Relational Expressions: SQUARE
 IBM Technical Report RJ 1291, 1973
♦
This paper presents a data sublanguage called SQUARE, intended for use
in ad hoc, interactive problem solving by noncomputer specialists.
SQUARE is based on the relational model of data, and is shown to be
relationally complete; however, it avoids the quantifiers and bound
variables required by languages based on the relational calculus.
Facilities for query, insertion, deletion, and update on tabular data
bases are described. A syntax is given, and suggestions are made for
alternative syntaxes, including a syntax based on English key words for
users with limited mathematical background.

[Cer10]↑
Cerf, V.:
"It's like 1973 for Moving Data Around in the Cloud"
♦
Using a cloud computing service may sound enticing, but you better
consider how that data can be moved around if you want to switch to a
different provider. It's a big problem that now has the attention of
Vint Cerf, who is calling for standards to define how customer data gets
passed between different cloud service providers.

[Cha74]⋏
Chamberlin, D. D.; Boyce, R. F.:
SEQUEL: A Structured English Query Language  IBM Research Laboratory, 1974
♦
ABSTRACT: In this paper we present the data manipulation facility for a
structured English query language (SEQUEL) which can be used for accessing
data in an integrated relational data base. Without resorting to the concepts
of bound variables and quantifiers SEQUEL identifies a set of simple operations
on tabular structures, which can be shown to be of equivalent power to
the first order predicate calculus. A SEQUEL user is presented with a consistent
set of keyword English templates which reflect how people use tables to
obtain information. Moreover, the SEQUEL user is able to compose these basic
templates in a structured manner in order to form more complex queries.
SEQUEL is intended as a data base sub language for both the professional programmer
and the more infrequent data base user.

[Cha01]⋏
Champion, M.:
XSP: An Integration Technology for Systems Development and Evolution
 Software AG  2001
♦
The mathematics of the relational model is based on classical set theory,
CST, and this is both its strength and its weakness.
An "extended set theory", XST, can be used to model
ordering and containment relationships
that are simply too "messy" to handle in classical set theory and the
formalisms (such as relational algebra) that are based on it.

[Cod70]
^{⋏}^{a}^{b}
Codd, E. F.:
A Relational Model of Data for Large Shared Data Banks, CACM 13, No. 6 (June) 1970
♦
Abstract: Future users of large data banks must be protected from having to know
how the data is organized in the machine (the internal representation).

[Cod81]⋏
Codd, E. F.:
Relational Database: A Practical Foundation for Productivity, CACM 25, No. 2 (February) 1982
♦
The 1981 ACM Turing Award Lecture:
"The relational model calls not only for relational
structures (which can be thought of as tables), but also
for a particular kind of set processing called relational
processing. Relational processing entails treating whole
relations as operands."

[Ell15]⋏
Ellis, T.:
Extended Set Theory: A Summary  2015
♦
Extended Set Theory (XST) was originally developed under an ARPA contract to address the limitations of Classical Set Theory (CST). At the root of the problem is the set membership definition of CST, which defines sets and the results of set operations based on a single membership condition: content.

[Fay13]⋏
Fayyad, U. M.:
Big Data Everywhere, and No SQL in Sight
SIGKDD Explorations, Volume 14, Issue 2  2013
♦
"The term Big Data is not well-defined, but is
generally used to refer to the type of data that breaks the limits of
traditional data storage and management state-of-the-art."

[Har76]⋏
Hardgrave, W. T.:
A technique for implementing a set processor
Proceedings of the 1976 Conference on Data: Abstraction, Definition and Structure. ACM, New York, NY, USA, 1976
♦
Extended set theory is an effective method for describing the complex data structures needed to support large-scale data base applications. This paper describes the incorporation of the salient concepts of extended set theory into a general tool for the design and implementation of large-scale database systems. This tool is called a set processor. This implementation is based on the representation of sets of positive integers as bit strings and the application of the Cauchy/Cantor Diagonal Method.
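As a minimal sketch (not Hardgrave's actual implementation), the bit-string representation described above makes the basic set operations single logical operations on machine words:

```python
# Illustration: a set of positive integers encoded as a bit string,
# where integer n is present iff bit n is 1 (here using a Python int).
def to_bits(s):
    """Encode a set of positive integers as a single bit string."""
    bits = 0
    for n in s:
        bits |= 1 << n
    return bits

def from_bits(bits):
    """Decode a bit string back into a set of positive integers."""
    return {n for n in range(bits.bit_length()) if bits >> n & 1}

a = to_bits({1, 3, 5})
b = to_bits({3, 5, 7})

# Set operations become single bitwise logical operations:
union        = from_bits(a | b)   # {1, 3, 5, 7}
intersection = from_bits(a & b)   # {3, 5}
difference   = from_bits(a & ~b)  # {1}
```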

[Har08]⋏
Harizopoulos, S.; Madden, S.; Abadi, D.; Stonebraker, M.:
OLTP Through the Looking Glass, and What We Found There
SIGMOD '08, June 9-12, 2008
♦
Databases include a suite of features that were optimized for computer technology of the late 1970s. Advances in modern processors, memories, and networks mean that today's computers are vastly different from those of 30 years ago. Yet database architecture (record-accessing I/O) has changed little.
Substantial time is spent in logging, latching, locking, B-tree, and buffer management operations.
Over 90% of an application process is indexed-access overhead.

[Her72]⋏
Hershey, W.; Easthope, C.:
A set theoretic data structure and retrieval language
Paper from the Session on Data Structures, Spring Joint Computer Conference, May 1972 in ACM SIGIR Forum, Volume 7, Issue 4 (December 1972)
♦
This paper describes a data structure and retrieval program for general large scale information systems. Retrieval operations are done by means of set theoretic operations. The general structure of the data structure and retrieval program are outlined here, and the retrieval language is briefly explained with accompanying examples. The major components of the data structure are discussed in detail: The directory, the dictionary, and the data itself. Limitations of the system and plans for improvement are mentioned.

[Lar09]⋏
Larsen, S. M.:
The Business Value of Intelligent Data Access  March 2009
♦
Article provides an excellent description of how difficult
it is to optimize data access paths.
"For a wide range of reasons, designing for and maintaining optimal data
access poses a genuine challenge to even the most sophisticated
enterprises." p. 2

[Lig07]⋏
Lightstone, S.; Teorey, T.; Nadeau, T.:
Physical Database Design
 Morgan Kaufmann, 2007
♦
The rapidly increasing volume of information contained in relational databases places a strain on databases, performance, and maintainability: DBAs are under greater pressure than ever to optimize database structure for system performance and administration.
Physical Database Design discusses the concept of how physical structures of databases affect performance.
Something as simple as improving the table index design has a profound impact on performance.
Book provides a comprehensive analysis of how complicated the physical database
design process can be
without the guidance of a formal data access model.
Without such formal support, physical file data access structures
typically impede system performance.
a) File data access strategies are extremely difficult to optimize.
"Some computing professionals currently run their own consulting
businesses doing little else than
helping customers improve their table indexing design."
Their efforts can improve query performance by as much as 50 times. (p. 2)

[Man14]⋏
Manoochehri, M.:
Data Just Right  Introduction to Large-Scale Data & Analytics
AddisonWesley, 2014.
♦
Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases.

[MICRO]⋏
MICRO RDBMS 1972-1998; User Manual:
MICRO A Relational Database Management System 1992
♦
MICRO supported time-sharing commercial applications from 1972 to 1998.
It was the first system to use set-theoretic operations to create and
manage stored data.
MICRO managed data with a mathematically reliable data storage and
access system (STDS*) which used no index structures,
required no schema design, provided nondestructive updates,
and supported both structured and semistructured data.

[MSQL]⋏
SQL Server Performance Team:
Great New TPC-H Results with SQL Server 2008. 17 Aug. 2009
♦
"HP also published a 300GB result on their ProLiant DL785 platform with SQL Server 2008. This publication illustrates the high performance and solid price/performance using industry standard components from HP and Microsoft." (Load time: 300GB in 13.33 hours)

[Nor10]⋏
North, K.:
♦
Three articles
presenting a short historical perspective on the
role of set theory,
mathematically sound data models,
and the importance of data independence.  2010
PART I:
Sets, Data Models and Data Independence
PART II:
Laying the Foundation: Revolution, Math for Databases and Big Data
PART III:
Information Density, Mathematical Identity, Set Stores and Big Data

[Sk57]⋏
Skolem, T.:
Two Remarks on Set Theory (The ordered n-tuples as sets)
Math. Scand. 5 (1957), 40-46
♦
Skolem concludes:
"I shall not pursue these considerations here, but only emphasize that
it is still a problem how the ordered ntuple can be defined in the
most suitable way."
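The standard CST encoding Skolem is remarking on is Kuratowski's definition of the ordered pair, with n-tuples built by iterating it:

```latex
% Kuratowski's encoding of the ordered pair in classical set theory:
(a, b) = \{\{a\}, \{a, b\}\}
% n-tuples are then built by iteration, e.g.
(a, b, c) = ((a, b), c) = \{\{\{\{a\},\{a,b\}\}\},\; \{\{\{a\},\{a,b\}\},\, c\}\}
```

The encoding works, but it is one of several inequivalent conventions, and element position is buried in nesting rather than expressed directly. This is the unsuitability Skolem points to, and the gap XST addresses by extending the membership condition beyond content alone.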

[Sto5]⋏
Stout, R.:
Information Access Accelerator
Information Builders Inc.  2005
♦
Slide presentation: An evaluation of iXSP, IBM, and Oracle comparing SQL query performance on an industry-standard database. Performance improvements ranged from 10 to 90 times.

[Sto7]⋏
Stonebraker, M.; Madden, S.; Abadi, D.; Harizopoulos, S.; Hachem, N.; Helland, P.:
The End of an Architectural Era (It's Time for a Complete Rewrite)
33rd International Conference on Very Large Data Bases, 2007.
♦
"The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs."

[Sto+14]⋏
Stonebraker, M., et al.:
Enterprise Data Applications and the Cloud: A Difficult Road Ahead
Proc. IEEE IC2E, Boston, MA, March 2014
♦
Paper succinctly delineates potential growing pains for future DBMS,
if developers continue to rely on physically dependent data access technologies.
"There is considerable interest in moving DBMS
applications from inside enterprise data centers to the cloud,
both to reduce cost and to increase flexibility and elasticity.
In some circumstances, achieving good DBMS performance on
current cloud architectures and future hardware technologies will
be nontrivial.
In summary, there is a difficult road ahead for
enterprise database applications."

[Sto14]⋏
Stonebraker, M.:
Hadoop at a Crossroads?
BLOG@CACM, August, 2014.
♦
Persistent use of file systems, which perpetuates physical-location data access strategies,
will present serious challenges for future system developers.
"Hiding the (physical) location of data from the DBMS is death, and the DBMS will go to great lengths to circumvent this feature."

[Teo11]⋏
Teorey, T.; Lightstone, S.; Nadeau, T.; Jagadish, H. V.:
Database Modeling and Design
Morgan Kaufmann, 2011, Fifth Edition.
♦
Many in the industry consider this the best available book on classic database design and on building database applications, complemented with objective commentary. For example, in Chapter 8: "In short, transferring data between a database and an application program is an onerous process, because of both difficulty of programming and performance overhead."

[Wes70]⋏
Westervelt, F. H.:
CONCOMP: Research in Conversational use of Computers : final report
OFFICE OF RESEARCH ADMINISTRATION, Ann Arbor, December 1970
♦
This report describes the final research results of the CONCOMP Project: Research in the Conversational Use of Computers, which was funded from 1965 to 1970.
2.5 DATA STRUCTURES
The ever-present problems of providing rapid, flexible access to complexly interrelated data resulted in several efforts in the area of data structures under CONCOMP.
The work of Childs, reported in Technical Reports 3 and 6 and Appendix D of this report, represents the scope of the effort expended. The theoretically derived structures have been implemented and employed by numerous applications users in MTS. The flexibility and versatility represented deserve greater emphasis than is possible in this brief summary.
Bibliography

⋏
Feasibility of a Set-theoretic Data Structure
A general structure based on a reconstituted definition of relation
IFIP Congress, Edinburgh, Scotland, August 1968
♦
This antique paper presented the thesis that mathematical control over
the representation, management, and access of data was critical for the
functional freedom of applications and I/O performance of future
systems.

⋏
Description of a Set-theoretic Data Structure:
AFIPS Fall Joint Computer Conference, San Francisco, CA, December 1968
♦
Presents early development of STDS,
a machine-independent set-theoretic data structure allowing rapid processing of
data related by arbitrary assignment.

⋏
Extended Set Theory:
A General Model For Very Large, Distributed, Backend Information Systems
VLDB 1977 (Invited paper)
♦
ABSTRACT Three distinct components comprise an Information System: INFORMATION MANAGEMENT, DATA MANAGEMENT, and STORAGE MANAGEMENT. Until recently, all three have been subsumed under data management. As applications become more demanding, as support criteria become more complex, and as storage capacity becomes very large, the need for functional independence of these three management areas has become more apparent. Recognition of this situation has been popularized through the phrase, "data independence", or more precisely, "data independence from information" and "data independence from storage".
The difficulty in achieving data independence arises through the incompatibility of a complex information space being supported by a simple storage space. The popular, but limiting approach, has been to force the information space into a restrictive record space. This achieves a deceptive compatibility allowing only the appearance of data independence at the user level. This record oriented approach has become pervasive for small databases even though it constrains user applications, requires substantial storage overhead, and imposes inherent processing inefficiencies.
As databases become very large and as distributed systems become desirable, the need for inherent (not superficial) data independence becomes crucial. This paper is intended as a tutorial and will describe conditions for data independence and summarize the concepts of Extended Set Theory as a general model for expressing information systems embodying data independence. This generality will be demonstrated by considering some major problems pertinent to the design and support of very large, distributed, backend information systems.
It should be emphasized that Extended Set Theory is a formalism for expressing solutions and is not a specific solution in itself. Though "redundant membership condition", "distributed membership condition", and "settheoretic interface" may be new concepts, Extended Set Theory does not preclude any current DBMS concepts, data structures, or existing implementations. Rather, Extended Set Theory embraces them all under a unifying model.

⋏
PEBBLE PILES & INDEX STRUCTURES: A Parable
 2005
♦
Imagine trying to convince ancient sheepherders (who calculated and compared their wealth by equating piles of pebbles to the size of herds of sheep) that a piece of parchment with numbers on it would give the same results as a pile of pebbles, but would be more accurate, easier to use, faster in processing, less resource intensive, and more portable. Now imagine trying to convince modern database researchers that the mathematical identity of records can be used to replace the physical indexing of records.

⋏
1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize
The Design Of Database Management Systems
Proceedings of the Tenth International Conference on Very Large Data Bases.
Singapore, August, 1984
♦
As secondary storage devices increase in capacity and decrease in cost,
current DBMS design philosophies become less adequate for addressing
the demands to be imposed by very large database environments. Future
database management systems must be designed to allow dynamic
optimization of the I/O overhead, while providing more sophisticated
applications involving increasingly complex data relationships.

⋏
A Mathematical Foundation for Systems Development  NATO ASI Series, Vol. F24, 1986
♦
Paper presents a Hypermodel syntax for precision modeling of arbitrarily complex systems,
providing
a function space continuum with explosive resolution and extended set notation that give
generality and rigor to the concept of a Hypermodel.

⋏
Managing Data Mathematically:
Data As A Mathematical Object:
♦
"Using Extended Set Theory for High Performance Database Management"
Presentation given at Microsoft Research Labs, with an introduction by Phil Bernstein.
(video: duration 1:10:52)  2006

⋏
Why Not Sets?  2010
♦
Sets are well defined collections of uniquely identifiable items. Data used by computers are well defined collections of items representing situations of interest. Computers themselves are just well defined collections of bits that change value over time. It would seem that all computer processing is highly set oriented. Why are sets not more widely used in modeling the behavior and assisting the development of computing systems?

⋏
Functions as Set Behavior: A Formal Foundation Based On Extended Set Axioms  2011
♦
ABSTRACT: The term function suggests an action or process or behavior of something applied to something. Within the framework of extended set theory, XST, the concept of a function will be defined in terms of a liberal definition of morphism, which in turn will be equated with the behavior of how specific sets interact with each other. It will be shown that all Classical set theory, CST, graph-based function behavior can be expressed in terms of XST non-graph-based function behavior; that the behavior of functions applied to themselves is supported; and that the concepts of Category theory can be subsumed under XST. A notable consequence of this approach is that the use of functions need no longer be constrained by properties of a Cartesian product.

⋏
XSP TECHNOLOGY: Theory & Practice:
Formal Modeling & Practical Implementation of XML & RDM Systems
 2011
♦
INTRODUCTION XSP (extended set processing) Technology introduces: [1] a formal mathematical foundation, extended set theory, XST, for defining and mapping user application expectations into internal machine representations; and [2] practical software implementations based on XST that assist in the design, development, and use of XML and RDM systems.

⋏
SET-PROCESSING AT THE I/O LEVEL:
A Performance Alternative to Traditional Index Structures
 2011
♦
ABSTRACT It is generally believed that index structures are essential for high-performance information access. This belief is false. For, though indexing is a venerable, valuable, and mathematically sound identification mechanism, its logical potential for identifying unique data items is restricted by structure-dependent implementations that are extremely inefficient, costly, functionally restrictive, information destructive, resource demanding, and, most importantly, that preclude data independence. A low-level logical data access alternative to physical indexed data access is set-processing. System I/O level set-processing minimizes the overall I/O workload by more efficiently locating relevant data to be transferred, and by greatly increasing the information transfer efficiency over that of traditional indexed record access strategies. Instead of accessing records through imposed locations, the set-processing alternative accesses records by their intrinsic mathematical identity. By optimizing I/O traffic with informationally dense data transfers, using no physical indexes of any kind, low-level set-processing has demonstrated a substantial, scalable performance improvement over location-dependent index structures.
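As a toy, hypothetical illustration of the contrast the abstract draws (the record contents and names below are invented, not taken from the paper): when a record's identity is its content rather than a storage location, selection is a membership condition evaluated over a whole set, with no index consulted.

```python
# Records modeled as frozensets of (attribute, value) pairs, so a record's
# identity is purely its mathematical content, not a physical location.
records = {
    frozenset({("id", 1), ("city", "Ann Arbor"), ("qty", 10)}),
    frozenset({("id", 2), ("city", "Boston"), ("qty", 25)}),
    frozenset({("id", 3), ("city", "Ann Arbor"), ("qty", 40)}),
}

# Set-processing access: the result is defined by a membership condition
# over the whole set of records -- no pointer chasing, no index lookup.
ann_arbor = {r for r in records if ("city", "Ann Arbor") in r}
```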

⋏
SET-STORE DATA ACCESS ARCHITECTURES:
For High Performance Informationally Dense I/O Transfers
 2011
♦
ABSTRACT For high performance analytic processing of vast amounts of data buried in secondary storage, traditional performance strategies generally advocate minimizing system I/O. This paper advocates the converse, supported by the use of set-store architectures. Traditional row-store and column-store architectures rely on mechanical data models (based on an imposed physical representation of data) for accessing and manipulating system data. Set-store architectures rely on a mathematical data model (based solely on the intrinsic mathematical identity of data) to control the representation, organization, manipulation and access of data for high performance informationally dense parallel I/O transfers.

⋏
iXSP Interactive Extended Set Processor:
Programmer’s Manual, User’s Tutorial and Mathematical Backgrounds
 v1.7 (44 pages) Draft 2013
♦
This manual has been produced for those interested in an introduction to the Extended
Set Processor from IIS and the commands that the processor supports.

⋏
I/O Technology For Big Data:
Massively Parallel Data Access of Big Data Environments  2011
♦
Traditional I/O technology is based on the storage and retrieval of records, records that are physically preserved in storage. Set processing I/O technology is based on the exchange of collections (sets) of records, records which may or may not physically exist in storage. Given the advances in hardware platforms, set processing I/O technology can offer one to three orders of magnitude better system performance than traditional I/O technology.
Copyright © 2015
INTEGRATED INFORMATION SYSTEMS
« Last modified on 10/13/2015 »
