Introduction
A growing interest in
Big Data
"The term BigData is not welldefined, but is
generally used to refer to the type of data that breaks the limits of
traditional data storage and management stateoftheart."
 [Fay13]
and
Enterprise Cloud computing
"There is a difficult road ahead for
enterprise database applications."
 [Sto+14]
is already stress-testing the
limits of traditional (relational) data storage and management systems.
Specific limitations cited against relational implementations
are:
restricted data representation,
protracted application development time,
and rigid storage organization inhibiting scaling.
There is a chorus of other issues, but they are mitigated when
the previous ones are eliminated.
In 1970 the research effort was to free
application data representations
from dependence on storage data representations.
Problems of horizontal scaling and distributed data access
were not even on the horizon.
It now seems prudent
to revisit the foundations of relational systems
to see if any research since 1970 might
be able to augment existing systems
by preserving relational capabilities
while supporting
optimized data access for any combination of applications
with any combination of storage representation and data distribution.
Relational, Scalable & Schemaless
The ability to mathematically manipulate data representations
has allowed Relational systems to dominate the industry for
over thirty years.
Unfortunately the mathematical muscle currently supporting
RDBMS installations is not sufficient to
support requirements for Big Data applications.
The NoSQL movement has provided an alternative approach
by offering separate storage implementations for
specialized data representation access needs.
This approach is not universally appealing
since it lacks the maturity of relational systems,
requires a different system for each desired data
representation, and currently provides
no mathematical (relational) capability
for processing data.
Since 1970 there has been substantial research into the foundations
of set theory.
This research provides a broader mathematical foundation for
supporting and extending visions originally proposed by Codd
for the RDM.
For extended mathematical capabilities to be integrated into existing
and future systems,
it must first be shown how such capabilities address
real, practical implementation issues.
Three problems need to be addressed:
Preserve Relational:
Though the underlying strength of the RDM is its mathematical
foundation in classical set theory, CST,
this limits data representations
to relations as arrays.
A simple (in principle) solution is to broaden the definition
of "relation" to include any and all possible
computer based representations of data.
Provide Scalability:
As long as applications are dependent on
physical properties of storage data representations and organization,
horizontal scaling will be difficult to achieve.
By divorcing applications from ALL physical awareness
of storage data, horizontal scaling is readily achievable.
A physical insulation can be provided
by a mathematical conduit between applications and storage.
Eliminate Schemas:
The need for schemas in current relational systems
is an unfortunate design consequence of
requiring storage organizations to be
precognizant of application data access needs.
With operational support for the
dynamic restructuring of storage data,
this prestructuring requirement is deprecated.
Each of these three concerns can be resolved
by extending the RDM, based on classical set theory,
to a structured set data model, based on structured sets
as defined under extended set theory, XST.
SSDM: Structured Set Data Model
The definition of a structured set data model, SSDM,
is deceptively simple:
SSDM: Any collection of data representations and operations
on them that are well-defined under the axioms of
extended set theory, XST.
There is no restriction on how many operations are defined,
nor on what the operations do.
There is no restriction on how many data representations
are defined,
nor on how they are defined.
The only condition is that all operations and
data representations be defined using
extended set notation, XSN.
Structured Sets
A structured set
is an extended set as defined under the axioms of
extended set
theory [Bla11].
Conceptually an extended set
is just a classical
set with an extended membership
to provide two conditions for
set membership
instead of just one.
The particulars are rather boring, but the utility
of the extension is that it
provides a set-theoretic dimension for structure.
The only difference between classical sets
and structured sets is that classical sets have only
one condition for membership
while structured sets require two conditions.
If A is a structured set and if
Γ_{A}(x,s)
is true, then
x is an s-member of A.
If Γ_{A}(x,s) is false,
x is not an s-member of A.
For example: let A = <a,b,c> =
{a^{1}, b^{2}, c^{3}},
then
Γ_{A}(b,2)
is true, while
Γ_{A}(a,2)
is false.
Classical sets have no structure.
The membership test for any classical set
A is Γ_{A}(x,∅).
Thus, all classical sets are structured sets with null structure.
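The two-condition membership test can be sketched in a few lines of Python. This is an illustrative model only, not the formal XST axiomatization: an extended set is modeled as a set of (element, scope) pairs, and the names gamma, A, and C are chosen for this sketch.

```python
# Illustrative sketch only: an extended set modeled as a frozenset of
# (element, scope) pairs, so membership requires two conditions.
from typing import Any, FrozenSet, Tuple

ExtendedSet = FrozenSet[Tuple[Any, Any]]

def gamma(A: ExtendedSet, x: Any, s: Any) -> bool:
    """Membership test: x is an s-member of A iff the pair (x, s) is in A."""
    return (x, s) in A

# The tuple <a,b,c> as the structured set {a^1, b^2, c^3}:
A: ExtendedSet = frozenset({("a", 1), ("b", 2), ("c", 3)})

print(gamma(A, "b", 2))   # True:  b is the 2-member of A
print(gamma(A, "a", 2))   # False: a is the 1-member, not the 2-member

# A classical set is a structured set whose every scope is null (here: None):
C: ExtendedSet = frozenset({("a", None), ("b", None)})
print(gamma(C, "a", None))  # True
```

The point of the sketch is only that one extra component in the membership relation is enough to encode order and structure set-theoretically.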
The structure component of an extended set membership
can be used to distinguish the container part
of a data representation
from the content part.
Though set theory has been tried many times as a formal data model,
it has always failed to provide the ability
to suitably define data records as unambiguous sets.
Structured sets provide an additional component to classical set membership
allowing a
formally defined representation of
data that uniquely distinguishes the logical data relationships (content)
from the physical data representation (container).
All Data Representations
ARE [8]
Structured Sets.
Thus, All Data Can Be Managed Using Set Operations.
Since structured sets can formally represent any and all
application and storage data
with the ability to distinguish data content from data
container, structured set based access strategies can
manipulate data content and data containers independently
to provide near-optimal access performance for each and every application.
With structured sets the distinction between content and structure is an innate property
of extended set membership. This property makes structured sets a natural choice for
modeling representations of data.
Under a structured set data model
all logical and physical representations of data are structured sets.
All manipulations of logical and physical data representations
can be modeled by set operations.
For presentation convenience or performance
considerations, extended
set operations can be defined that
map the content of one structured set to another structured set having a totally different structure.
Thus a
structured set data model
is an ideal choice for modeling data independent access systems.
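The content-preserving remapping described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (plain dicts stand in for structured sets; the function names are invented for the sketch): the same content is held first in a row-oriented container, then in a column-oriented one, and the round trip loses nothing.

```python
# Illustrative sketch: the same content in two different containers.
# Remapping rows to columns changes only the container, not the content.

rows = {
    1: {"name": "Ada",  "born": 1815},
    2: {"name": "Alan", "born": 1912},
}

def rows_to_columns(rows):
    """Map a row-structured representation to a column-structured one."""
    cols = {}
    for rid, record in rows.items():
        for field, value in record.items():
            cols.setdefault(field, {})[rid] = value
    return cols

def columns_to_rows(cols):
    """Inverse map: the content survives the round trip unchanged."""
    out = {}
    for field, by_row in cols.items():
        for rid, value in by_row.items():
            out.setdefault(rid, {})[field] = value
    return out

cols = rows_to_columns(rows)
assert columns_to_rows(cols) == rows   # same content, different structure
```

An access strategy free to apply such maps can pick whichever container best serves a given application without the application ever knowing.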
SQL & Structured Sets
The SQL SELECT statement was
originally [Cha74]
introduced as a set operation.
The language provided applications the means for specifying the membership of a result set
as derived from a given collection of sets.
There was no need for applications to know how data was actually stored.
Originally the sets were restricted to being arrays, since that was
the only structure reasonably supported by classical set theory.
Now that structured sets are available for modeling
any representation of data, there is
no need to restrict SELECT statements to just arrays.
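The idea of SELECT as a pure set operation can be sketched briefly. This is not any DBMS's implementation; it is a hypothetical illustration in which a relation is a collection of generalized tuples (dicts keyed by attribute name, with lists standing in for sets to keep the sketch readable), and the result membership is specified without reference to storage.

```python
# Illustrative sketch: SELECT expressed as a membership specification
# over generalized tuples, independent of how the relation is stored.

people = [
    {"name": "Ada",  "born": 1815},
    {"name": "Alan", "born": 1912},
    {"name": "Kurt", "born": 1906},
]

def select(relation, predicate, attributes):
    """SELECT attributes FROM relation WHERE predicate, as a set expression."""
    return [{a: t[a] for a in attributes} for t in relation if predicate(t)]

result = select(people, lambda t: t["born"] > 1900, ["name"])
print(result)  # [{'name': 'Alan'}, {'name': 'Kurt'}]
```

Nothing here requires the operand to be an array; any container whose membership can be enumerated would serve.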
Conclusion
All application data representations are structured sets.
All storage data representations are structured sets.
All data transformations between applications and storage
are structured set operations.
An SSDM provides developers complete data management freedom,
limited only by their choice of hardware,
their imagination,
and their tolerance for set theory.
Integrated Information Systems
Integrated Information Systems, IIS, was established in 1985
to provide research and consulting services.
Assistance is available in understanding and applying SSDM techniques
for preserving relational capabilities
while supporting
dynamic storage reorganization to
optimize data access for any combination of applications
with any combination of storage representation and data distribution.
Please email with questions, if interested.
Notes

1. CST sets are defined by axioms.
2. CST axioms impose no structure on the members of sets.
3. "Container" is used instead of "structure" to avoid disruptive associations.
   (The operations were set operations and the operands were relations.)
4. Unexpected results: <a,b> ∩ <a,c> = <a,a>
5. "Implementation of systems to
   support the relational model are not discussed."
   [Cod70 p.377]
6. Codd used the term data independence, but it has since been
   corrupted to include physical file dependent systems.
7. "Faithful" means isomorphic representation of content, structure, and behavior.
8. In set theory sets are defined
   by their membership, and since data representations
   are a physical expression of the content and structure of the data
   that can be expressed mathematically,
   all data representations are, by definition, structured sets.
9. Using Codd's original sense of data independence:
   for data to be disassociated from any machine organization.
10. This early system only supported flat files as mathematical objects.
11. MICRO (1972-1998) used mathematically well-defined operations
    in a time-sharing environment
    to manage application data,
    storage data,
    and all transformations between the two.
References

[Bla11]
Blass, A.; Childs, D L:
Axioms and Models for an Extended Set Theory - 2011
♦
This paper presents the formal foundation for supporting "structured sets".
5.1 TUPLES:
Traditional set theory has several ways of coding ordered tuples
< a_{1}, a_{2}, .. , a_{n} >
as sets, none of which is really
canonical [Sk57].
XST provides a simple and natural way to represent tuples,
namely to use natural numbers as scopes. Thus, a tuple
< a_{1}, a_{2}, .. , a_{n} >
is identified with the (extended) set
{ a_{1}^{1}, a_{2}^{2}, .. , a_{n}^{n} }.
The natural numbers used here as
scopes can be represented by the traditional von Neumann coding (with
scope ∅).
5.2 GENERALIZED TUPLES:
Instead of indexing the components of a tuple by (consecutive)
natural numbers, one could index them by arbitrary, distinct labels.
The same XST representation still works; use the labels as scopes.
This provides, for example, a convenient way to deal with what are
often called records in computing. Records have fields, in
which data are inserted. We can represent them settheoretically by
taking the field names as scopes with the data as elements.
Similarly, we can represent relations, in the sense of relational
databases, by sets of generalized tuples, one for each row of the
relation. The scopes for such a generalized tuple would be the
attribute names, while the corresponding elements would be the values,
in that row, of the attributes.
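The record representation described in 5.2 can be sketched directly. This is an illustrative model only, under the same simplifying assumptions as before (pairs stand in for extended membership; the helper `value` is invented for the sketch): field names serve as scopes, and a relation is a set of such generalized tuples.

```python
# Illustrative sketch of 5.2: a record as a generalized tuple, with field
# names as scopes; a relation is a set of such generalized tuples.

row1 = frozenset({("Ada", "name"), (1815, "born")})    # element^scope pairs
row2 = frozenset({("Alan", "name"), (1912, "born")})
relation = frozenset({row1, row2})

def value(gtuple, scope):
    """Look up the element whose scope matches the given field name."""
    return next(e for (e, s) in gtuple if s == scope)

print(value(row1, "name"))   # Ada
print(value(row2, "born"))   # 1912
```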
5.3 FUNCTIONS and MULTIFUNCTIONS (functions defined by set behavior):
In general, a set f of generalized tuples, all having the same scope
set D, can be regarded as describing several (possibly multivalued)
operations, called the behaviors of f.

[Boy73]
Boyce, R. F.; Chamberlin, D. D.; King, W. F.; Hammer, M. M.:
Specifying Queries as Relational Expressions: SQUARE
- IBM Technical Report RJ 1291, 1973
♦
This paper presents a data sublanguage called SQUARE, intended for use
in ad hoc, interactive problem solving by non-computer specialists.
SQUARE is based on the relational model of data, and is shown to be
relationally complete; however, it avoids the quantifiers and bound
variables required by languages based on the relational calculus.
Facilities for query, insertion, deletion, and update on tabular data
bases are described. A syntax is given, and suggestions are made for
alternative syntaxes, including a syntax based on English key words for
users with limited mathematical background.

[Cer10]
Vint Cerf:
"It's like 1973 for Moving Data Around in the Cloud"
♦
Article calls attention to concerns with moving data in Cloud environments.

[Cha74]
Chamberlin, D. D.; Boyce, R. F.:
SEQUEL: A Structured English Query Language - IBM Research Laboratory, 1974
♦
ABSTRACT: In this paper we present the data manipulation facility for a
structured English query language (SEQUEL) which can be used for accessing
data in an integrated relational data base. Without resorting to the concepts
of bound variables and quantifiers SEQUEL identifies a set of simple operations
on tabular structures, which can be shown to be of equivalent power to
the first order predicate calculus. A SEQUEL user is presented with a consistent
set of keyword English templates which reflect how people use tables to
obtain information. Moreover, the SEQUEL user is able to compose these basic
templates in a structured manner in order to form more complex queries.
SEQUEL is intended as a data base sublanguage for both the professional programmer
and the more infrequent data base user.

[Cha01]
Champion, M.:
XSP: An Integration Technology for Systems Development and Evolution
- Software AG - 2001
♦
The mathematics of the relational model is based on classical set theory,
CST, and this is both its strength and its weakness.
An "extended set theory", XST, can be used to model
ordering and containment relationships
that are simply too "messy" to handle in classical set theory and the
formalisms (such as relational algebra) that are based on it.

[Chi68a]
Childs, D L:
Feasibility of a Set-theoretic Data Structure
A general structure based on a reconstituted definition of relation
IFIP Cong., Edinburgh Scotland, August 1968
♦
This antique paper presented the thesis that mathematical control over
the representation, management, and access of data was critical for the
functional freedom of applications and I/O performance of future
systems.

[Chi68b]
Childs, D L:
Description of a Set-theoretic Data Structure
AFIPS Fall Joint Computer Conference, San Francisco CA, December 1968
♦
Presents early development of STDS,
a machine-independent set-theoretic data structure allowing rapid processing of
data related by arbitrary assignment.

[Chi77]
Childs, D L:
Extended Set Theory
A General Model For Very Large, Distributed, Backend Information Systems
VLDB 1977 (Invited paper, abstract)
♦
Addresses the need for inherent (not superficial) data independence
where applications are agnostic regarding the organization of stored data.
(paper)

[Chi84]
Childs, D L:
VLDB Panel : Inexpensive Large Capacity Storage Will Revolutionize
The Design Of Database Management Systems
Proceedings of the Tenth International Conference on Very Large Data Bases.
Singapore, August, 1984
♦
As secondary storage devices increase in capacity and decrease in cost,
current DBMS design philosophies become less adequate for addressing
the demands to be imposed by very large database environments. Future
database management systems must be designed to allow dynamic
optimization of the I/O overhead, while providing more sophisticated
applications involving increasingly complex data relationships.

[Chi86]
Childs, D L:
A Mathematical Foundation for Systems Development - NATO ASI Series, Vol. F24, 1986
♦
Paper presents a Hypermodel syntax for precision modeling of arbitrarily complex systems
by providing
a function space continuum with explosive resolution and extended set notation to provide
generality and rigor to the concept of a Hypermodel.

[Chi06]
Childs, D L:
Managing Data Mathematically:
Data As A Mathematical Object:
♦
"Using Extended Set Theory for High Performance Database Management"
Presentation given at Microsoft Research Labs, with an introduction by Phil Bernstein.
(video: duration 1:10:52) - 2006

[Chi10]
Childs, D L:
Why Not Sets? - 2010
♦
Why are sets not used in modeling the behavior
and assisting the development of computing systems?

[Chi11]
Childs, D L:
Functions as Set Behavior: A Formal Foundation Based On Extended Set Axioms - 2011
♦
Within the framework of extended set theory, XST, the concept
of a function is defined as a behavior of sets in
terms of how specific sets react subject to their interaction
with other sets.
A notable consequence of this approach is that
the use of functions need no longer be constrained by
properties of a Cartesian product.

[Cod70]
Codd, E. F.:
A Relational Model of Data for Large Shared Data Banks - CACM 13, No. 6 (June) 1970
♦
Abstract: Future users of large data banks must be protected from having to know
how the data is organized in the machine (the internal representation).

[Fay13]
Fayyad, U. M.:
Big Data Everywhere, and No SQL in Sight
SIGKDD Explorations, Volume 14, Issue 2  2013
♦
"The term BigData is not welldefined, but is
generally used to refer to the type of data that breaks the limits of
traditional data storage and management stateoftheart."

[Har08]
Harizopoulos, S.; Madden, S.; Abadi, D.; Stonebraker, M.:
OLTP Through the Looking Glass, and What We Found There
SIGMOD '08, June 9-12, 2008
♦
Over 90% of an application process is indexed-access overhead.

[Lar09]
Larsen, S. M.:
The Business Value of Intelligent Data Access - March 2009
♦
Article provides an excellent description on how difficult
it is to optimize data access paths.
"For a wide range of reasons, designing for and maintaining optimal data
access poses a genuine challenge to even the most sophisticated
enterprises." p. 2

[Lig07]
Lightstone, S.; Teorey, T.; Nadeau, T.:
Physical Database Design
- Morgan Kaufmann, 2007
♦
A comprehensive analysis of how complicated the physical database
design process can be
without the guidance of a formal data access model.
Without such formal support physical file data access structures
typically impede system performance. [excerpts below]
a) File data access strategies are extremely difficult to optimize.
"Some computing professionals currently run their own consulting
businesses doing little else than
helping customers improve their table indexing design."
Their efforts can improve query performance by as much as 50 times. (p. 2)
b) Files are physical representations of data.
Tables are logical representations of data.
"Tables are Files"? (p. 7)
c) An index is data organization set up to speed up the retrieval
of data from tables. In database management systems, indexes can be specified
by database application programmers. (p. 8)
d) "It is important to be able to analyze the different paths for the
quality of the result, in other words, the performance of the system to
get you the correct result and choose the best path to get you there."
(p. 31)
e) A block (or page) has been the basic unit of I/O from disk to fast
memory (RAM), typically 4~KB in size.
In recent years, prefetch buffers (typically 64~KB, as in DB2) have been used to
increase I/O efficiency. (p. 371)
f) The total I/O time for a full table scan is computed simply
as the I/O time for a single block, or prefetch buffer (64~KB),
times the total number of those I/O transfers in the table. (p. 372)

[Man14]
Manoochehri, M.:
Data Just Right: Introduction to Large-Scale Data & Analytics
Addison-Wesley, 2014.
♦
Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases.

[MICRO]
MICRO DBMS 1972-1998
♦
MICRO was the first DBMS to use set-theoretic operations (STDS) to create and
manage stored data. The system
supported time-sharing commercial applications from 1972 to 1998.

[MSQL]
SQL Server Performance Team:
Great New TPC-H Results with SQL Server 2008 - 17 Aug. 2009
♦
"HP also published a 300GB result on their ProLiant DL785 platform with SQL Server 2008. This publication illustrates the high performance and solid price/performance using industry standard components from HP and Microsoft." (Load time: 300GB in 13.33 hours)

[Nor10]
North, K.:
♦
Three articles
presenting a short historical perspective on the
role of set theory,
mathematically sound data models,
and the importance of data independence. - 2010
PART I:
Sets, Data Models and Data Independence
PART II:
Laying the Foundation: Revolution, Math for Databases and Big Data
PART III:
Information Density, Mathematical Identity, Set Stores and Big Data

[Sk57]
Skolem, T.:
Two Remarks on Set Theory (The ordered n-tuples as sets)
MATH. SCAND. 5 (1957), 40-46
♦
Skolem concludes:
"I shall not pursue these considerations here, but only emphasize that
it is still a problem how the ordered n-tuple can be defined in the
most suitable way."

[Sto05]
Stout, R.:
Information Access Accelerator
Information Builders Inc. - 2005
♦
Slide presentation explaining a 40 to 1 performance improvement over commercial DBMSs
by using a structured set access interface between applications and storage.

[Sto07]
Stonebraker, M.; Madden, S.; Abadi, D.; Harizopoulos, S.; Hachem, N.; Helland, P.:
The End of an Architectural Era (It's Time for a Complete Rewrite)
33rd International Conference on Very Large Data Bases, 2007.
♦
Paper presents the position that separate storage organizations are needed for separate applications.

[Sto+14]
Stonebraker, M., et al.:
Enterprise Data Applications and the Cloud: A Difficult Road Ahead
Proc. IEEE IC2E, Boston, MA, March 2014
♦
Paper succinctly delineates potential growing pains for future DBMS,
if developers continue to rely on physically dependent data access technologies.
"There is considerable interest in moving DBMS
applications from inside enterprise data centers to the cloud,
both to reduce cost and to increase flexibility and elasticity.
In some circumstances, achieving good DBMS performance on
current cloud architectures and future hardware technologies will
be nontrivial.
In summary, there is a difficult road ahead for
enterprise database applications."

[Sto14]
Stonebraker, M.:
Hadoop at a Crossroads?
BLOG@CACM, August, 2014.
♦
Persistent use of file systems perpetuating use of physical data location access strategies
will present serious challenges for future system developers.
"Hiding the (physical) location of data from the DBMS is death, and the DBMS will go to great lengths to circumvent this feature."

[Teo11]
Teorey, T.; Lightstone, S.; Nadeau, T.; Jagadish, H. V.:
Database Modeling and Design
Morgan Kaufmann, 2011, Fifth Edition.
♦
Many in the industry consider this to be the best book available
on classic database design and for explaining how to build database applications,
complemented
with objective commentary.
For example in Chapt. 8: "In short, transferring data between a database and
an application program is an onerous process, because of both
difficulty of programming and performance overhead."
Copyright © 2014
INTEGRATED INFORMATION SYSTEMS
« Last modified on 10/17/2014 »
