XSP TECHNOLOGY    
Extended Set Processing Technology

Structured Set Data Management Systems
for
Accessing the Right Data in the Right Form at the Right Time
D L Childs

"Ultimate Database: The always-available, infinitely scalable relational database in the cloud doesn't exist - at least not yet." - M.  Manoochehri [Man14]
Though an Ultimate Database may not currently exist, the capabilities for its construction do. The set-theoretic foundations of Relational systems available in 1970 are not adequate for supporting current database demands. Updating the RDM to a Structured Set Data Model, SSDM, provides the necessary mathematical machinery for preserving relational capabilities while optimizing data access for any combination of applications with any combination of storage representation and data distribution.

Introduction
A growing interest in Big Data and Enterprise Cloud computing is already stress-testing the limits of traditional (relational) data storage and management systems. Specific limitations cited against relational implementations are: restricted data representation, protracted application development time, and rigid storage organization that inhibits scaling. There is a chorus of other issues, but most are mitigated once these three are eliminated.

In 1970 the research effort was to free application data representations from dependence on storage data representations. Problems of horizontal scaling and distributed data access were not even on the horizon. It now seems prudent to revisit the foundations of relational systems to see whether research since 1970 might augment existing systems by preserving relational capabilities while supporting optimized data access for any combination of applications with any combination of storage representation and data distribution.

Relational, Scalable & Schemaless
The ability to mathematically manipulate data representations has allowed Relational systems to dominate the industry for over thirty years. Unfortunately the mathematical muscle currently supporting RDBMS installations is not sufficient to support requirements for Big Data applications.

The NoSQL movement has provided an alternative approach by offering separate storage implementations for specialized data representation access needs. This approach is not universally appealing since it lacks the maturity of relational systems, requires a different system for each desired data representation, and currently provides no mathematical (relational) capability for processing data.

Since 1970 there has been substantial research into the foundations of set theory. This research provides a broader mathematical foundation for supporting and extending visions originally proposed by Codd for the RDM. For extended mathematical capabilities to be integrated into existing and future systems, it first has to be shown how such capabilities address real, practical implementation issues. Three problems need to be addressed:

Preserve Relational: Though the underlying strength of the RDM is its mathematical foundation in classical set theory, CST, this limits data representations to relations as arrays. A simple (in principle) solution is to broaden the definition of "relation" to include any and all possible computer based representations of data.

Provide Scalability: As long as applications are dependent on physical properties of storage data representations and organization, horizontal scaling will be difficult to achieve. By divorcing applications from ALL physical awareness of storage data, horizontal scaling is readily achievable. A physical insulation can be provided by a mathematical conduit between applications and storage.

Eliminate Schemas: The need for schemas in current relational systems is an unfortunate design consequence of requiring storage organizations to be precognizant of application data access needs. With operational support for the dynamic restructuring of storage data, this pre-structuring requirement is obviated.

Each of these three concerns can be resolved by extending the RDM, based on classical set theory, to a structured set data model based on structured sets as defined under extended set theory, XST.

SSDM: Structured Set Data Model
The definition of a structured set data model, SSDM, is deceptively simple:

SSDM: Any collection of data representations and operations on them that are well-defined under the axioms of extended set theory, XST.

There is no restriction on how many operations are defined, nor on what the operations do. There is no restriction on how many data representations are defined, nor on how they are defined. The only condition is that all operations and data representations be defined using extended set notation, XSN.

Structured Sets
A structured set is an extended set as defined under the axioms of extended set theory [Bla11]. Conceptually an extended set is just a classical set whose membership is extended to require two conditions instead of one. The particulars are rather dry, but the extension adds a set-theoretic dimension for structure: classical sets have only one condition for membership, while structured sets require two.

If A is a structured set and if ΓA(x,s) is true, then x is an s-member of A.
If ΓA(x,s) is false, then x is not an s-member of A.
For example: let A = ⟨a,b,c⟩ = {a¹, b², c³}, then ΓA(b,2) is true, while ΓA(a,2) is false.
Classical sets have no structure. The membership test for any classical set A is ΓA(x,∅).
Thus, all classical sets are structured sets with null structure.
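The two-condition membership test above can be sketched in a few lines of Python. This is a minimal illustration, not XSP's implementation: a structured set is modeled as a set of (element, scope) pairs, integer scopes encode tuple positions, and None stands in for the null scope ∅.

```python
# Illustrative sketch only: model a structured set as a frozenset of
# (element, scope) pairs, so membership requires two conditions.

def gamma(A, x, s):
    """Extended membership test: is x an s-member of A?"""
    return (x, s) in A

def tuple_as_xset(*elements):
    """Represent the tuple <a, b, c> as {a¹, b², c³} using integer scopes."""
    return frozenset((x, i) for i, x in enumerate(elements, start=1))

A = tuple_as_xset("a", "b", "c")   # A = <a, b, c>
assert gamma(A, "b", 2)            # b is a 2-member of A
assert not gamma(A, "a", 2)        # a is not a 2-member of A

# A classical set is a structured set with null structure (None plays ∅ here).
B = frozenset((x, None) for x in {"p", "q"})
assert gamma(B, "q", None)
```

Representing members as (element, scope) pairs is one of several possible encodings; it is chosen here only because it makes the two membership conditions explicit.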

The structure component of an extended set membership can be used to distinguish the container part of a data representation from the content part. Though set theory has been tried many times as a formal data model, it has always failed to suitably define data records as unambiguous sets. Structured sets add a component to classical set membership, allowing a formally defined representation of data that uniquely distinguishes the logical data relationships (content) from the physical data representation (container).

All Data Representations ARE[8] Structured Sets.
Thus, All Data Can Be Managed Using Set Operations.

Since structured sets can formally represent any and all application and storage data with the ability to distinguish data content from data container, structured set based access strategies can manipulate data content and data containers independently to provide near-optimal access performance for each and every application.

With structured sets the distinction between content and structure is an innate property of extended set membership. This property makes structured sets a natural choice for modeling representations of data. Under a structured set data model all logical and physical representations of data are structured sets. All manipulations of logical and physical data representations can be modeled by set operations. For presentation convenience or performance considerations, extended set operations can be defined that map the content of one structured set to another structured set having a totally different structure. Thus a structured set data model is an ideal choice for modeling data independent access systems.
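The content/container independence described above can be illustrated with a small sketch. The `restructure` operation and the (element, scope)-pair encoding are assumptions made for this example, not part of any XSP API: the same content appears under two different containers, a field-name structure and a positional structure.

```python
# Hypothetical illustration: one content, two containers. Scopes carry the
# container (field names vs. tuple positions); elements carry the content.

record = frozenset({("Smith", "name"), (42, "id")})   # scopes = field names

def restructure(xset, scope_map):
    """Map each member's scope through scope_map, leaving elements untouched."""
    return frozenset((x, scope_map[s]) for (x, s) in xset)

row = restructure(record, {"name": 1, "id": 2})       # scopes = positions

assert ("Smith", 1) in row and (42, 2) in row
# The content (the elements) is identical in both representations:
assert {x for (x, _) in record} == {x for (x, _) in row}
```

Because the operation touches only scopes, an access strategy built this way can reorganize physical containers without disturbing logical content.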

SQL & Structured Sets
The SQL SELECT statement was originally [Cha74] introduced as a set operation. The language provided applications the means for specifying the membership of a result set as derived from a given collection of sets. There was no need for applications to know how data was actually stored.

Originally the sets were restricted to being arrays, since that was the only structure reasonably supported by classical set theory. Now that structured sets are available for modeling any representation of data, there is no need to restrict SELECT statements to just arrays.
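A SELECT over structured sets rather than arrays can be sketched as follows, using the generalized tuples of [Bla11] 5.2 (attribute names as scopes). The relation, predicate form, and names here are illustrative assumptions, not the SEQUEL/SQL engine:

```python
# Sketch: a relation as a set of generalized tuples, each a frozenset of
# (value, attribute-name) pairs. SELECT is just set comprehension; no array
# layout or row ordering is assumed.

employees = {
    frozenset({("Smith", "name"), ("HR", "dept")}),
    frozenset({("Jones", "name"), ("IT", "dept")}),
}

def select(relation, predicate):
    """Return the subset of generalized tuples satisfying the predicate."""
    return {t for t in relation
            if predicate({s: x for (x, s) in t})}

hr = select(employees, lambda row: row["dept"] == "HR")
assert len(hr) == 1
```

The application specifies only the membership of the result set; how the generalized tuples are physically stored is invisible to it, which is the point made above.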

Conclusion
All application data representations are structured sets. All storage data representations are structured sets. All data transformations between applications and storage are structured set operations. A SSDM provides developers complete data management freedom, limited only by their choice of hardware, their imagination, and their tolerance for set theory.

Integrated Information Systems
Integrated Information Systems, IIS, was established in 1985 to provide research and consulting services. Assistance is available in understanding and applying SSDM techniques for preserving relational capabilities while supporting dynamic storage reorganization to optimize data access for any combination of applications with any combination of storage representation and data distribution.


Notes

  1. CST sets are defined by axioms.
  2. CST axioms impose no structure on the members of sets.
  3. "container" is used instead of "structure" to avoid disruptive associations.
  4. unexpected results: <a,b> ∩ <a,c> = <a,a>
  5. "Implementation of systems to support the relational model are not discussed." [Cod70 p.377].
  6. Codd used the term data independence, but it has since been corrupted to include physical file dependent systems.
  7. "faithful" means isomorphic representation of content, structure, and behavior.
  8. In set theory sets are defined by their membership, and since data representations are a physical expression of the content and structure of the data that can be expressed mathematically, all data representations are, by definition, structured sets.
  9. Using Codd's original sense of data independence for data to be disassociated from any machine organization.
  10. This early system only supported flat files as mathematical objects.
  11. MICRO(1972-1998) used mathematically well defined operations in a time sharing environment to manage application data, storage data, and all transformations between the two.


References

  1. [Bla11] Blass, A., Childs, D L: Axioms and Models for an Extended Set Theory - 2011 ♦ This paper presents the formal foundation for supporting "structured sets".
        5.1 TUPLES: Traditional set theory has several ways of coding ordered tuples ⟨a₁, a₂, …, aₙ⟩ as sets, none of which is really canonical [Sk57]. XST provides a simple and natural way to represent tuples, namely to use natural numbers as scopes. Thus, a tuple ⟨a₁, a₂, …, aₙ⟩ is identified with the (extended) set {a₁¹, a₂², …, aₙⁿ}, where the superscripts are scopes. The natural numbers used here as scopes can be represented by the traditional von Neumann coding (with scope ∅).
        5.2 GENERALIZED TUPLES: Instead of indexing the components of a tuple by (consecutive) natural numbers, one could index them by arbitrary, distinct labels. The same XST representation still works; use the labels as scopes. This provides, for example, a convenient way to deal with what are often called records in computing. Records have fields, in which data are inserted. We can represent them set-theoretically by taking the field names as scopes with the data as elements. Similarly, we can represent relations, in the sense of relational databases, by sets of generalized tuples, one for each row of the relation. The scopes for such a generalized tuple would be the attribute names, while the corresponding elements would be the values, in that row, of the attributes.
        5.3 FUNCTIONS and MULTI-FUNCTIONS (functions defined by set behavior): In general, a set f of generalized tuples, all having the same scope set D, can be regarded as describing several (possibly multi-valued) operations, called the behaviors of f.
  2. [Boy73] Boyce, R. F.; Chamberlin, D. D.; King, W. F.; Hammer, M. M.: Specifying Queries as Relational Expressions: SQUARE - IBM Technical Report RJ 1291, 1973 ♦ This paper presents a data sublanguage called SQUARE, intended for use in ad hoc, interactive problem solving by non-computer specialists. SQUARE is based on the relational model of data, and is shown to be relationally complete; however, it avoids the quantifiers and bound variables required by languages based on the relational calculus. Facilities for query, insertion, deletion, and update on tabular data bases are described. A syntax is given, and suggestions are made for alternative syntaxes, including a syntax based on English key words for users with limited mathematical background.
  3. [Cer10] Cerf, V.: "It's like 1973 for Moving Data Around in the Cloud" - 2010 ♦ Article calls attention to concerns with moving data in Cloud environments.
  4. [Cha74] Chamberlin, D. D.; Boyce, R. F.: SEQUEL: A Structured English Query Language - IBM Research Laboratory, 1974 ♦ ABSTRACT: In this paper we present the data manipulation facility for a structured English query language (SEQUEL) which can be used for accessing data in an integrated relational data base. Without resorting to the concepts of bound variables and quantifiers SEQUEL identifies a set of simple operations on tabular structures, which can be shown to be of equivalent power to the first order predicate calculus. A SEQUEL user is presented with a consistent set of keyword English templates which reflect how people use tables to obtain information. Moreover, the SEQUEL user is able to compose these basic templates in a structured manner in order to form more complex queries. SEQUEL is intended as a data base sublanguage for both the professional programmer and the more infrequent data base user.
  5. [Cha01] Champion, M.: XSP: An Integration Technology for Systems Development and Evolution - Software AG - 2001 ♦ The mathematics of the relational model is based on classical set theory, CST, and this is both its strength and its weakness. An "extended set theory", XST, can be used to model ordering and containment relationships that are simply too "messy" to handle in classical set theory and the formalisms (such as relational algebra) that are based on it.
  6. [Chi68a] Childs, D L: Feasibility of a Set-theoretic Data Structure A general structure based on a reconstituted definition of relation IFIP Cong., Edinburgh Scotland, August 1968 ♦ This antique paper presented the thesis that mathematical control over the representation, management, and access of data was critical for the functional freedom of applications and I/O performance of future systems.
  7. [Chi68b] Childs, D L: Description of a Set-theoretic Data Structure - AFIPS Fall Joint Computer Conference, San Francisco CA, December 1968 ♦ Presents early development of STDS, a machine-independent set-theoretic data structure allowing rapid processing of data related by arbitrary assignment.
  8. [Chi77] Childs, D L: Extended Set Theory A General Model For Very Large, Distributed, Backend Information Systems VLDB 1977 (Invited paper, abstract) ♦ Addresses the need for inherent (not superficial) data independence where applications are agnostic regarding the organization of stored data. (paper)
  9. [Chi84] Childs, D L: VLDB Panel : Inexpensive Large Capacity Storage Will Revolutionize The Design Of Database Management Systems Proceedings of the Tenth International Conference on Very Large Data Bases. Singapore, August, 1984 ♦ As secondary storage devices increase in capacity and decrease in cost, current DBMS design philosophies become less adequate for addressing the demands to be imposed by very large database environments. Future database management systems must be designed to allow dynamic optimization of the I/O overhead, while providing more sophisticated applications involving increasingly complex data relationships.
  10. [Chi86] Childs, D L: A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1986 ♦ Paper presents a Hypermodel syntax for precision modeling of arbitrarily complex systems by providing a function space continuum with explosive resolution and extended set notation to provide generality and rigor to the concept of a Hypermodel.
  11. [Chi06] Childs, D L: Managing Data Mathematically: Data As A Mathematical Object ♦ "Using Extended Set Theory for High Performance Database Management" Presentation given at Microsoft Research Labs, with an introduction by Phil Bernstein. (video: duration 1:10:52) - 2006
  12. [Chi10] Childs, D L: Why Not Sets? - 2010 ♦ Why are sets not used in modeling the behavior and assisting the development of computing systems?
  13. [Chi11] Childs, D L: Functions as Set Behavior A Formal Foundation Based On Extended Set Axioms - 2011 ♦ Within the framework of extended set theory, XST, the concept of a function is defined as a behavior of sets in terms of how specific sets react subject to their interaction with other sets. A notable consequence of this approach is that the use of functions need no longer be constrained by properties of a Cartesian product.
  14. [Cod70] Codd, E. F.: A Relational Model of Data for Large Shared Data Banks - CACM 13, No. 6 (June) 1970 ♦ Abstract: Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).
  15. [Fay13] Fayyad, U. M.: Big Data Everywhere, and No SQL in Sight SIGKDD Explorations, Volume 14, Issue 2 - 2013 ♦ "The term BigData is not well-defined, but is generally used to refer to the type of data that breaks the limits of traditional data storage and management state-of-the-art."
  16. [Har08] Harizopoulos, S.; Madden, S.; Abadi, D.; Stonebraker, M.: OLTP Through the Looking Glass, and What We Found There - SIGMOD'08, June 9-12, 2008 ♦ Over 90% of an application process is indexed-access overhead.
  17. [Lar09] Larsen, S. M.: The Business Value of Intelligent Data Access - March 2009 ♦ Article provides an excellent description on how difficult it is to optimize data access paths. "For a wide range of reasons, designing for and maintaining optimal data access poses a genuine challenge to even the most sophisticated enterprises." p. 2
  18. [Lig07] Lightstone, S.; Teorey, T.; Nadeau, T.: Physical Database Design - Morgan Kaufmann, 2007 ♦ A comprehensive analysis of how complicated the physical database design process can be without the guidance of a formal data access model. Without such formal support physical file data access structures typically impede system performance. [excerpts below]
      a) File data access strategies are extremely difficult to optimize. "Some computing professionals currently run their own consulting businesses doing little else than helping customers improve their table indexing design." Their efforts can improve query performance by as much as 50 times. (p. 2)
      b) Files are physical representations of data. Tables are logical representations of data. "Tables are Files"? (p. 7)
      c) An index is data organization set up to speed up the retrieval of data from tables. In database management systems, indexes can be specified by database application programmers. (p. 8)
      d) "It is important to be able to analyze the different paths for the quality of the result, in other words, the performance of the system to get you the correct result and choose the best path to get you there." (p. 31)
      e) A block (or page) has been the basic unit of I/O from disk to fast memory (RAM), typically 4 KB in size. In recent years, prefetch buffers (typically 64 KB, as in DB2) have been used to increase I/O efficiency. (p. 371)
      f) The total I/O time for a full table scan is computed simply as the I/O time for a single block, or prefetch buffer (64 KB), times the total number of those I/O transfers in the table. (p. 372)
  19. [Man14] Manoochehri, M.: Data Just Right: Introduction to Large-Scale Data & Analytics - Addison-Wesley, 2014. ♦ Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases.
  20. [MICRO] MICRO DBMS 1972-1998 ♦ MICRO was the first DBMS to use set-theoretic operations (STDS) to create and manage stored data. The system supported timesharing commercial applications from 1972 to 1998.
  21. [MSQL] SQL Server Performance Team: Great New TPC-H Results with SQL Server 2008 17 Aug. 2009 ♦ "HP also published a 300GB result on their ProLiant DL785 platform with SQL Server 2008. This publication illustrates the high performance and solid price/performance using industry standard components from HP and Microsoft." (Load time: 300GB in 13.33 hours)
  22. [Nor10] North, K.: ♦ Three articles presenting a short historical perspective on the role of set theory, mathematically sound data models, and the importance of data independence. - 2010
    PART I: Sets, Data Models and Data Independence
    PART II: Laying the Foundation: Revolution, Math for Databases and Big Data
    PART III: Information Density, Mathematical Identity, Set Stores and Big Data
  23. [Sk57] Skolem, T.: Two Remarks on Set Theory (The ordered n-tuples as sets) - MATH. SCAND, 5 (1957) 40-46 ♦ Skolem concludes: "I shall not pursue these considerations here, but only emphasize that it is still a problem how the ordered n-tuple can be defined in the most suitable way."
  24. [Sto05] Stout, R.: Information Access Accelerator - Information Builders Inc. - 2005 ♦ Slide presentation explaining a 40-to-1 performance improvement over commercial DBMSs by using a structured set access interface between applications and storage.
  25. [Sto07] Stonebraker, M.; Madden, S.; Abadi, D.; Harizopoulos, S.; Hachem, N.; Helland, P.: The End of an Architectural Era (It's Time for a Complete Rewrite) 33rd International Conference on Very Large Data Bases, 2007. ♦ Paper presents the position that separate storage organizations are needed for separate applications.
  26. [Sto+14] Stonebraker, M., et. al.: Enterprise Data Applications and the Cloud: A Difficult Road Ahead Proc IEEE IC2E, Boston, Ma., March 2014 ♦ Paper succinctly delineates potential growing pains for future DBMS, if developers continue to rely on physically dependent data access technologies. "There is considerable interest in moving DBMS applications from inside enterprise data centers to the cloud, both to reduce cost and to increase flexibility and elasticity. In some circumstances, achieving good DBMS performance on current cloud architectures and future hardware technologies will be non-trivial. In summary, there is a difficult road ahead for enterprise database applications."
  27. [Sto14] Stonebraker, M.: Hadoop at a Crossroads? BLOG@CACM, August, 2014. ♦ Persistent use of file systems perpetuating use of physical data location access strategies will present serious challenges for future system developers. "Hiding the (physical) location of data from the DBMS is death, and the DBMS will go to great lengths to circumvent this feature."
  28. [Teo11] Teorey, T.; Lightstone, S.; Nadeau, T.; Jagadish, H. V.: Database Modeling and Design - Morgan Kaufmann, 2011, Fifth Edition. ♦ Many in the industry consider this to be the best book available on classic database design and for explaining how to build database applications, complemented with objective commentary. For example, in Chapt. 8: "In short, transferring data between a database and an application program is an onerous process, because of both difficulty of programming and performance overhead."


Copyright 2014   INTEGRATED INFORMATION SYSTEMS   Last modified on 10/17/2014