BIG   DATA
  Access & Analysis

                
  • Data Accessing -
    No shared knowledge of data organizations between processing components.


  • Inviolate Data -
    All data represented as mathematical objects supported by constructive non-destructive updating.


  • Scalability -
    Set-theoretic partitioning of data into application relevant data buffers ensures near-optimal platform dependent performance.


  • Global Data -
    All applications access data as mathematical objects not as physical files allowing disparate applications to share distributed data.


  • Interoperability -
    Since mathematical definitions of data are application and storage independent, disparate applications can easily share distributed data.
                              


DEVELOPMENT

Structured Set Processing routines are available for analyzing and improving access to existing and future systems data.

  • SSP Functions -
    Routines to access, analyze and manipulate physically stored data.


  • iXSP MANUAL -
    An interactive interface for using SSP functions.
                              


VALUE

Much of the value of XSP comes from experience and demonstrations of practical results.

  • XSP Analysis -
    Modeling existing systems in terms of XST to determine optimal behavior given a chosen platform.


  • SSP Functions -
    Derivation of set-theoretic descriptions for new applications.


  • Data Analysis -
    On site use of iXSP for data validation and discovery.
                              


FUTURE

Many years of research results are available and accessible to anyone willing to tolerate set-theoretic notation.

  • Cluster Computing -
    Synergistic data access for disparate applications with highly distributed data sources.


  • Category Theory -
    Modeling systems behavior in terms of function spaces.


  • Intelligent Storage -
    Set processing data access operations embedded in storage devices.
                              

  XSP TECHNOLOGY  
Structured  Set  Processing  Systems

STRUCTURED SET DATA ACCESS
Right Data in the Right Form at the Right Time

"Future users of large data banks must be protected from having to know how the data is organized in the machine." - E. F. Codd[1] - 1970
Accessing large volumes of data is not well served by traditional data access strategies. Though many solutions are being proposed, none addresses the core issue of applications having to be cognizant of the organization of stored data. Moving relational operations to the storage side of the I/O can improve performance by orders of magnitude.

INTRODUCTION
Accessing large volumes of data, whether centralized or distributed, requires a totally different approach than is being pursued by the industry. For small volumes of data, poking at stored data with index structures works fine. For accessing and analyzing large volumes of data, it does not.

Traditional data access strategies are predicated on the assumption that different application needs require different stored data organizations[2]. This is only a valid assumption if the storage organization is requisite knowledge for the operation of the application.

Alternatively, if an optimal data access strategy provides an application with just the right data, in just the right form, at just the right time, then the application can be oblivious as to how the data is organized in the machine. This is exactly the relational vision Codd had in 1970.

Clouding the "optimal data access" issue is that the term data is not very well defined. What is it that is actually being accessed? Is it "data" or a "representation of data"? Without a precise understanding of exactly what is being accessed, it is difficult to judge what has actually been accessed, and even more difficult to determine if it is being accessed "optimally".

In order to develop an "optimal" data access strategy, it is helpful to know what is meant by "data". If data is assumed to be an undefined term, but data representation and data content are well understood terms, then a meaningful analysis of an optimal data access strategy can be pursued.

OPTIMAL DATA ACCESS
An optimal data access strategy supporting multiple applications' access to large volumes of data would deliver just the right data, in just the right form, at just the right time. Though no such access strategy can ever be a reality, a near-optimal data access strategy could come close. A candidate is currently available that evolved from a very early predecessor, STDS[3]. STDS supported application independence from storage organizations and adaptive storage restructuring for time-sharing commercial relational applications from 1972 through 1998. The current incarnation of STDS uses structured sets (ssets) as a medium to exchange data between applications and stored data.

APPLICATION INDEPENDENCE
As influential and productive as the Relational Data Model, RDM, has been for the past thirty years, its full potential has not been approached. The two essential features of the RDM are

1) Set-theoretic representations of data, and
2) Application independence from storage organizations of data.
Neither of these features has yet been exploited by existing commercial systems. There has not been a need to do so, nor has there existed a formal foundation that would have supported an effort if there had been a need. The escalating growth of accumulated data now provides such a need.

Fortunately, there now exists a formal foundation[4] (not available in 1970) allowing RDM operations to be extended to all types of data. With application data access independent of storage organizations, RDM operations on the storage side of the I/O can improve performance by orders of magnitude.

RELATIONAL SYSTEMS
Though the commercial success of the RDM for the last thirty years has validated the practicality of a mathematical model of data, the model is inadequate for supporting data access requirements of future systems. The RDM is mathematically deficient in two ways:

1) Restricted selection of logical data representations at the application level:
The selection of logical data representations is limited to one, Tables.

2) Restricted selection of physical data representations at the storage level:
The selection of physical data representations is zero.
These deficiencies can be easily remedied:
The first is a function of the mathematical support available at the time. This is resolved by using structured sets as a foundation instead of Classical sets.
The second is really not a deficiency with the RDM, since Codd explicitly indicated that the model was not to address storage representations[5].

The storage data representation deficiency was imposed by implementers that equated the logical concept of Table with the physical concept of file. This too can be remedied by the use of ssets. It will be shown that Tables are ssets, and that files are ssets, but that Tables are not files. With these deficiencies resolved the RDM can resume its intended place as a rich mathematically sound application component of a general system model.

FILE vs. SSET
A general system data model requires two cooperating data sub-models: an application data model for productivity, and a storage data model for performance. Current systems use well-defined application data models, but do not use independent and well-defined storage data models. The reason is that no adequate storage data models exist. Without the availability of an independent storage data model, application data models have to assume the role of system data model. This requires application models, intended for users, "to know how data is organized in the machine", which is exactly what Codd warned against in 1970.

  • Physical File Data Access Strategies
    • Very Poor I/O Performance - FILE ACCESS.
    • Endangered Data Integrity - DESTRUCTIVE UPDATES.
    • Structured-Data Access Paths - PHYSICAL DEPENDENCE.
  • Structured Set Data Access Strategies
    • Highly Optimized I/O Performance - SSET ACCESS.
    • Inviolate Data Integrity - CONSTRUCTIVE UPDATES.
    • Relevant-Data Access Operations - LOGICAL DEPENDENCE.
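
The contrast between destructive and constructive updates can be sketched in a few lines of Python. This is an illustration only, not the XSP implementation: the function `constructive_update` and the variables `v1` and `v2` are hypothetical names, and records are modeled as plain tuples held in an immutable `frozenset`.

```python
def constructive_update(current, old_record, new_record):
    """Return a NEW set with old_record replaced; `current` is never modified."""
    return (current - {old_record}) | {new_record}


# Version 1 of a small data set (hypothetical records: name, age).
v1 = frozenset({("Alan", 43), ("Bill", 37)})

# A constructive update derives version 2 and leaves version 1 intact.
v2 = constructive_update(v1, ("Bill", 37), ("Bill", 38))

print(("Bill", 37) in v1)  # True: the earlier state is still available
print(("Bill", 38) in v2)  # True: the new state holds the change
```

Because nothing is overwritten, any reader still holding `v1` continues to see a consistent data set.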

PHYSICAL FILE ACCESS
Systems where data access strategies are based on how data is represented and organized in storage can be called container-dependent systems. Any system where the application directly accesses files in storage is a container-dependent system, by definition. This, in and of itself, is not a condemnation of such systems, only a frame of reference to distinguish capabilities and limitations of different data accessing systems. For many scientific applications container-dependent systems are a necessity.

STRUCTURED SET ACCESS
A system that shares logical data content without requiring the sharing of physical data representation is content-dependent. The key contribution of structured sets is the ability to share the mathematical identity of a data representation without having to share the physical representation. For example, the mathematical objects VI, six, 6, and 0110 all have the same mathematical meaning, but do not have the same physical representation. System data access strategies can take advantage of this by expressing application data requests in terms of the desired result instead of an algorithmic derivation. Figure 5 reflects an SQL query that specifies only what result is required, in terms of sset specifications; the storage management system recruits the necessary stored data and aggregates the relevant data into an application deliverable.
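
The VI / six / 6 / 0110 example can be made concrete with a small Python sketch. It is only an illustration of "same content, different representation": the decoder names (`from_roman`, `DECODERS`) and the tagging of each string with a representation kind are assumptions made for this example, not part of any XSP interface.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}


def from_roman(text):
    """Decode a Roman numeral using the usual subtractive rule."""
    total = 0
    for i, ch in enumerate(text):
        value = ROMAN[ch]
        following = ROMAN[text[i + 1]] if i + 1 < len(text) else 0
        total += -value if following > value else value
    return total


# One decoder per physical representation (hypothetical kinds).
DECODERS = {
    "roman": from_roman,
    "word": WORDS.__getitem__,
    "decimal": int,
    "binary": lambda text: int(text, 2),
}

representations = [("roman", "VI"), ("word", "six"),
                   ("decimal", "6"), ("binary", "0110")]

# Four different physical representations, one mathematical content.
contents = {DECODERS[kind](text) for kind, text in representations}
print(contents)  # {6}
```

An application that asks for the content never needs to know which of the four representations happens to be stored.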

STRUCTURED SETS
A structured set, sset, is an extended set as defined under the axioms of extended set theory[Bla11]. Conceptually an extended set, or sset, is just a classical set with an extended membership: classical sets have only one condition for membership, while structured sets require two. The particulars are rather boring, but the extension provides a set-theoretic dimension for structure.

If A is a sset, then if A(x,s) is true, x is a s-member of A.
If false, x is not a s-member of A.
For example: let A=<a,b,c> then A(b,2) is true, while A(a,2) is false.
Since Classical sets have no structure, the membership test for any Classical set A is A(a,∅). Thus, all Classical sets are structured sets with null structure.
The structure component of sset membership can be used to distinguish the content part of a data representation from the container part. Though set theory has been tried many times as a formal data model, it has always failed to provide the ability to suitably define data records as unambiguous sets. Structured sets, ssets, provide an additional component to classical set membership allowing a formally defined representation of data that uniquely distinguishes the logical data relationships (content) from the physical data representation (container).
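
The two-condition membership test can be sketched in Python by modeling an sset as a set of (element, scope) pairs. This is a minimal illustration under stated assumptions, not the formal definition from [Bla11]: the helper names `ordered`, `classical`, and `is_member` are hypothetical, and `None` stands in for the null scope ∅.

```python
NULL = None  # stands in for the null scope (the ∅ in A(a, ∅))


def ordered(*elements):
    """<a, b, c>: each element is scoped by its 1-based position."""
    return frozenset((x, i) for i, x in enumerate(elements, start=1))


def classical(*elements):
    """A classical set: every element carries the null scope."""
    return frozenset((x, NULL) for x in elements)


def is_member(A, x, s):
    """A(x, s): true when x is an s-member of A."""
    return (x, s) in A


A = ordered("a", "b", "c")
print(is_member(A, "b", 2))     # True
print(is_member(A, "a", 2))     # False

C = classical("a", "b")
print(is_member(C, "a", NULL))  # True: classical membership is the null-scope case
```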

ALL DATA REPRESENTATIONS ARE[8] STRUCTURED SETS.
THUS, ALL DATA CAN BE MANAGED USING SSET OPERATIONS.
Since ssets can formally represent any and all application and storage data with the ability to distinguish data content from data container, sset based access strategies can manipulate data content and data containers independently to provide near-optimal access performance for each and every application.

With ssets the distinction between content and structure is an innate property of extended set membership. This property makes ssets a natural choice for modeling representations of data. Under a structured set data model all logical and physical representations of data are ssets. All manipulations of logical and physical data representations can be modeled by sset operations. For presentation convenience or performance considerations sset operations can be defined that map the content of one sset to another sset having a totally different structure. Thus a structured set data model is an ideal choice for modeling data independent access systems.

STRUCTURED SET STORAGE SYSTEM
In a structured set storage system, SSSS, all representations of data, both logical and physical, are recognized[8] as ssets. Therefore a reasonable first task in supporting this claim might be to show, by example, that RDM Tables can be supported data independently[9] under a SSDM. Doing so also supports an earlier claim that all Tables are ssets and all files are ssets, but that Tables are not files. The files, in this example, are arrays or flat files.

Example 1: T1 and T2 are two structurally distinct, content-equivalent data representations of RDM-Tables representing the same RDM-Relation with data content for husbands under domains Name, Age, and Wife. A1 and A2 are two structurally distinct, content-isomorphic data arrays. Note below that the two RDM-Tables are equal, T1 = T2, but the two data arrays are not, A1 ≠ A2.

T1
  Name  Age  Wife
  Alan  43   Mary
  Bill  37   Jane

A1
  Alan  43   Mary
  Bill  37   Jane

T2
  Wife  Name  Age
  Jane  Bill  37
  Mary  Alan  43

A2
  Jane  Bill  37
  Mary  Alan  43

Written as ssets, where x^s denotes that x is an s-member:

T1 = { { Alan^<Name>, 43^<Age>, Mary^<Wife> },  { Bill^<Name>, 37^<Age>, Jane^<Wife> } }
A1 = { { Alan^1, 43^2, Mary^3 }^1,  { Bill^1, 37^2, Jane^3 }^2 }
T2 = { { Jane^<Wife>, Bill^<Name>, 37^<Age> },  { Mary^<Wife>, Alan^<Name>, 43^<Age> } }
A2 = { { Jane^1, Bill^2, 37^3 }^1,  { Mary^1, Alan^2, 43^3 }^2 }
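
The equality T1 = T2 and the inequality A1 ≠ A2 can be checked mechanically. The Python sketch below models each sset as a `frozenset` of (element, scope) pairs; the constructors `table` and `array` are hypothetical helpers for this example, and `None` stands in for the null scope of a Table row.

```python
def table(*rows):
    """RDM-Table: values scoped by domain name, rows carry the null scope."""
    return frozenset(
        (frozenset((value, domain) for domain, value in r.items()), None)
        for r in rows
    )


def array(*rows):
    """Array: values scoped by column position, rows scoped by row position."""
    return frozenset(
        (frozenset((value, col) for col, value in enumerate(r, start=1)), pos)
        for pos, r in enumerate(rows, start=1)
    )


T1 = table({"Name": "Alan", "Age": 43, "Wife": "Mary"},
           {"Name": "Bill", "Age": 37, "Wife": "Jane"})
T2 = table({"Wife": "Jane", "Name": "Bill", "Age": 37},
           {"Wife": "Mary", "Name": "Alan", "Age": 43})

A1 = array(("Alan", 43, "Mary"), ("Bill", 37, "Jane"))
A2 = array(("Jane", "Bill", 37), ("Mary", "Alan", 43))

print(T1 == T2)  # True: same content, column and row order are irrelevant
print(A1 == A2)  # False: positions are part of the structure
```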

The example above only demonstrates that structured sets can faithfully represent Tables and arrays by preserving the content and structural differences between them. For a SSDM to be of any value the mappings of application data to storage data must be expressed in terms of application ssets and storage ssets.

Let   FA1 = { <Name, Age, Wife>^DOMAIN, A1^ARRAY }   and
let   FA2 = { <Wife, Name, Age>^DOMAIN, A2^ARRAY }.
Then there exists a mapping, M, such that:
      M(FA1) = M(FA2) = T1 = T2.
The mapping, M, above is a well defined sset operation that transforms the data content of files FA1 and FA2 into application Tables represented logically by T1 and T2. Different applications are free to determine which specific physical array structure best suits the application. By using ssets to model data representations and sset operations to map data between applications and storage all data storage and processing systems can be faithfully[7] modeled under a SSDM.
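
One way to picture such a mapping is as a re-scoping operation: each array value loses its column position and gains the domain name found at that position, and each row loses its row position. The Python sketch below is an assumption-laden illustration, not the author's definition of M: a stored file is modeled as a hypothetical (domain tuple, array sset) pair, and `None` stands in for the null scope.

```python
def array(*rows):
    """Array sset: values scoped by column position, rows by row position."""
    return frozenset(
        (frozenset((value, col) for col, value in enumerate(r, start=1)), pos)
        for pos, r in enumerate(rows, start=1)
    )


def M(F):
    """Map a stored file F = (domain, array) to a Table sset by re-scoping."""
    domain, arr = F
    return frozenset(
        (frozenset((value, domain[col - 1]) for value, col in row), None)
        for row, _pos in arr
    )


A1 = array(("Alan", 43, "Mary"), ("Bill", 37, "Jane"))
A2 = array(("Jane", "Bill", 37), ("Mary", "Alan", 43))

FA1 = (("Name", "Age", "Wife"), A1)
FA2 = (("Wife", "Name", "Age"), A2)

print(A1 == A2)          # False: the stored arrays differ
print(M(FA1) == M(FA2))  # True: both map to the same Table content
```

The application sees only the result of `M`; which array layout sits in storage is a storage-side decision.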

SQL Is Not Dead
There has been a recent awakening to the deficiencies of RDM implementations. The blame is usually focused on SQL, but there are just two problems and SQL is not one of them.

Problem 1: The RDM itself is constricted by classical set theory, which only provides a binary condition for membership in a set, in or out. This severely limits the formal support for modeling membership and structure at the same time.

Problem 2: The use of Tables equated to files, though expedient at the time, violates the principal intent of the RDM and has effectively crippled the expansion of RDM capabilities to high-performance distributed applications.
Though both of these problems are easily remedied, it is unlikely that providing customers with improved capabilities at a lower cost is of interest to peddlers of existing systems.

Replace TABLES with SSETS
Though observations of SQL implementation deficiencies seem generally legitimate, the proposed solutions ignore the root problem of manipulating physical data representations. The necessity of tethering application data views to physical data representations is exactly what the RDM intended to eliminate. SQL, in principle, can do just that. The upgrade of SQL implementations to leverage future hardware and new exotic data representations with proven relational capabilities is really quite easy: replace Tables with ssets, remembering that all Tables are[8] already ssets. Since the workhorse of SQL is the SELECT statement and since all Tables are already ssets, the only requirement to upgrade SQL implementations is to provide structured set accessing to properly configured SQL SELECT statements.

SQL with SSETS
The TPC-H benchmark is an industry standard using 22 SQL queries for comparing the performance of physical data accessing systems. Of the 22 queries, Query 9 is the most challenging. It requires a 5-way join. The rules of the benchmark demand transaction style indexes and that the same physical data organization be shared by all 22 queries. If the constraint of using physical data access strategies is removed, and if the focus is switched to providing customers with the best total elapsed time, then the SQL Query 9 performance can be improved by orders of magnitude[St05].

Hadoop, MapReduce, & SSETS
Hadoop[Hado] and MapReduce[MapR] are both ideal subjects for data access performance improvement by using logical sset access strategies to replace the existing physical file accessing strategies. The Hadoop distributed file system already leverages 64 MB blocks, and blocks are just restricted ssets. MapReduce already employs a restricted form of a set-theoretic "partition, process, produce" strategy, which can be expanded to include more application-relevant data packing.
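
The "partition, process, produce" pattern can be sketched in plain Python. This is a single-process illustration of the idea only; it does not use or imitate the Hadoop or MapReduce APIs, and the function names and sample records are hypothetical.

```python
from collections import defaultdict


def partition(records, key):
    """Split records into disjoint, application-relevant buffers."""
    parts = defaultdict(list)
    for record in records:
        parts[key(record)].append(record)
    return dict(parts)


def process(part):
    """Work on one partition in isolation (here: total quantity)."""
    return sum(quantity for _item, quantity in part)


def produce(results):
    """Assemble the per-partition results into one deliverable."""
    return dict(sorted(results.items()))


records = [("bolt", 4), ("nut", 7), ("bolt", 6), ("gear", 1), ("nut", 3)]

parts = partition(records, key=lambda record: record[0])
totals = produce({name: process(part) for name, part in parts.items()})
print(totals)  # {'bolt': 10, 'gear': 1, 'nut': 10}
```

Because the partitions are disjoint, each call to `process` could run on a different node without coordination, which is the property the partitioning step exists to provide.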

CONCLUSION
Given the advances in hardware platforms and the need for accessing large amounts of data, structured set access strategies offer orders of magnitude better system performance than current physical file data access strategies.

  • STRUCTURED SET ACCESS SYSTEMS: Critical Concepts:
    • ALL DATA REPRESENTATIONS ARE SSETS.
    • DATA ACCESSED BY SSETS, NOT BY FILES.
    • STORAGE MANAGEMENT USES SSET OPERATIONS.
The generality, mathematical soundness, and adaptability of a structured set access system allow developers and users an easy migration path from restrictive physical file accessing systems to scalable logical sset access systems. Structured set I/O management and data access software is currently available for analysis, development, and deployment of structured set access systems.



REFERENCES

  1. [Cod70] Codd, E. F.: A Relational Model of Data for Large Shared Data Banks. CACM 13, No. 6 (June) 1970
    Abstract: Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).
  2. [Sto07] Stonebraker, M.; Madden, S.; Abadi, D.; Harizopoulos, S.; Hachem, N.; Helland, P.: The End of an Architectural Era (It's Time for a Complete Rewrite). 33rd International Conference on Very Large Data Bases, 2007.
    Paper presents the position that separate storage organizations are needed for separate applications.
  3. [STDS] MICRO/STDS RDBMS 1972-1998
    MICRO was the first DBMS to use set-theoretic operations (STDS) to create and manage stored data. The system supported timesharing commercial applications from 1972 through 1998.
  4. [Har08] Harizopoulos, S.; Madden, S.; Abadi, D.; Stonebraker, M.: OLTP Through the Looking Glass, and What We Found There. SIGMOD'08, June 9-12, 2008
    Over 90% of an application process is indexed-access overhead.
  5. [Cha01] Champion, M.: XSP: An Integration Technology for Systems Development and Evolution - Software AG - 2001
    The mathematics of the relational model is based on classical set theory, CST, and this is both its strength and its weakness. An "extended set theory", XST, can be used to model ordering and containment relationships that are simply too "messy" to handle in classical set theory and the formalisms (such as relational algebra) that are based on it.
  6. [Bla11] Blass, A.; Childs, D L: Axioms and Models for an Extended Set Theory - 2011
    This paper presents the formal foundation for supporting "structured sets".
  7. [Lar09] Larsen, S. M.: The Business Value of Intelligent Data Access - March 2009
    Article provides an excellent description on how difficult it is to optimize data access paths.
    "For a wide range of reasons, designing for and maintaining optimal data access poses a genuine challenge to even the most sophisticated enterprises." p. 2
  8. [Li07] Lightstone, S.; Teorey, T.; Nadeau, T.: Physical Database Design - Morgan Kaufmann, 2007
    A comprehensive analysis of how complicated the physical database design process can be without the guidance of a formal data access model. Without such formal support physical file data access structures typically impede system performance. [excerpts below]
      a) File data access strategies are extremely difficult to optimize. "Some computing professionals currently run their own consulting businesses doing little else than helping customers improve their table indexing design." Their efforts can improve query performance by as much as 50 times. (p. 2)
      b) Files are physical representations of data. Tables are logical representations of data, yet tables are equated to files. (p. 7)
      c) An index is data organization set up to speed up the retrieval of data from tables. In database management systems, indexes can be specified by database application programmers. (p. 8)
      d) "It is important to be able to analyze the different paths for the quality of the result, in other words, the performance of the system to get you the correct result and choose the best path to get you there." (p. 31)
      e) A block (or page) has been the basic unit of I/O from disk to fast memory (RAM), typically 4 KB in size. In recent years, prefetch buffers (typically 64 KB, as in DB2) have been used to increase I/O efficiency. (p. 371)
      f) The total I/O time for a full table scan is computed simply as the I/O time for a single block, or prefetch buffer (64 KB), times the total number of those I/O transfers in the table. (p. 372)
  9. [Nor10] North, K.: Three articles presenting a short historical perspective on the role of set theory, mathematically sound data models, and the importance of data independence. - 2010
    PART I: Sets, Data Models and Data Independence
    PART II: Laying the Foundation: Revolution, Math for Databases and Big Data
    PART III: Information Density, Mathematical Identity, Set Stores and Big Data
  10. [Teo11] Teorey, T.; Lightstone, S.; Nadeau, T.; Jagadish, H. V.: Database Modeling and Design. Morgan Kaufmann, 2011, Fifth Edition.
    Many in the industry consider this to be the best book available on classic database design and for explaining how to build database applications, complemented with objective commentary. For example in Chapt. 8: "In short, transferring data between a database and an application program is an onerous process, because of both difficulty of programming and performance overhead."
  11. [Chi68a] Childs, D L: Feasibility of a Set-theoretic Data Structure: A general structure based on a reconstituted definition of relation. IFIP Cong., Edinburgh, Scotland, August 1968
    This antique paper presented the thesis that mathematical control over the representation, management, and access of data was critical for the functional freedom of applications and I/O performance of future systems.
  12. [Chi68b] Childs, D L: Description of a Set-theoretic Data Structure. AFIPS Fall Joint Computer Conference, San Francisco, CA, December 1968
    Presents early development of STDS, a machine-independent set-theoretic data structure allowing rapid processing of data related by arbitrary assignment.
  13. [Chi77] Childs, D L: Extended Set Theory: A General Model For Very Large, Distributed, Backend Information Systems. VLDB 1977 (Invited paper, abstract)
    Addresses the need for inherent (not superficial) data independence where applications are agnostic regarding the organization of stored data. (paper)
  14. [Chi84] Childs, D L: VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize The Design Of Database Management Systems. Proceedings of the Tenth International Conference on Very Large Data Bases, Singapore, August 1984
    As secondary storage devices increase in capacity and decrease in cost, current DBMS design philosophies become less adequate for addressing the demands to be imposed by very large database environments. Future database management systems must be designed to allow dynamic optimization of the I/O overhead, while providing more sophisticated applications involving increasingly complex data relationships.
  15. [Chi86] Childs, D L: A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1986
    Paper presents a Hypermodel syntax for precision modeling of arbitrarily complex systems by providing a function space continuum with explosive resolution and extended set notation to provide generality and rigor to the concept of a Hypermodel.
  16. [Chi10] Childs, D L: Why Not Sets? - 2010
    Why are sets not used in modeling the behavior and assisting the development of computing systems?
  17. [Chi11] Childs, D L: Functions Defined by Set Behavior: A Formal Foundation Based On Extended Set Axioms - 2011
    Within the framework of extended set theory, XST, the concept of a function is defined as a behavior of sets in terms of how specific sets react subject to their interaction with other sets. A notable consequence of this approach is that the mathematical properties of functions need no longer be dependent on the mathematical properties of a Cartesian product.
  18. [Fay13] Fayyad, U. M.: Big Data Everywhere, and No SQL in Sight. SIGKDD Explorations, Volume 14, Issue 2 - 2013
    "By moving to more flexible data platforms, many of the companies making the move to NoSQL data systems are taking on a major liability down the road."


SUPPLEMENT MATERIAL
  • XSP TECHNOLOGY Theory & Practice
    Formal Modeling & Practical Implementation of XML & RDM Systems: Every technology must have a sound underlying theory to support the consistency and predictability of the methods promoted by the technology. In this context, the term theory is respected as an articulation of a body of rules governing the relationships and behavior of objects in a specific system of interest.
  • Data Representations as Mathematical Objects
    Considering Content Compatibility of Relational & XML Data Representations: The theme of this paper is to treat all data representations as mathematical objects instead of as physical structures.
  • Set Processing At The I/O Level
    A Performance Alternative to Traditional Index Structures: It is generally believed that index structures are essential for high-performance information access. This belief is false. For, though indexing is a venerable, valuable, and mathematically sound identification mechanism, its logical potential for identifying unique data items is restricted by structure-dependent implementations that are extremely inefficient, costly, functionally restrictive, information destructive, resource demanding, and, most importantly, that preclude data independence. A low-level logical data access alternative to physical indexed data access is set processing. System I/O level set processing minimizes the overall I/O workload by more efficiently locating relevant data to be transferred, and by greatly increasing the information transfer efficiency over that of traditional indexed record access strategies. Instead of accessing records through imposed locations, the set processing alternative accesses records by their intrinsic mathematical identity. By optimizing I/O traffic with informationally dense data transfers, using no physical indexes of any kind, low-level set processing has demonstrated a substantial, scalable performance improvement over location-dependent index structures.
  • SET-STORE Data Access Architectures
    Data Access Architectures for Cloud Computing Environments: Row-store and column-store architectures rely on DATA ACCESS PATHS for accessing and manipulating data by its physical properties. Set-store architectures rely on DATA ACCESS OPERATIONS for accessing and manipulating data by mathematically distinguishing between DATA CONTENT and DATA REPRESENTATION. Traditional architectures link applications and storage physically. Set-Store architectures link applications and storage mathematically. Set-Store architectures provide dynamic restructuring of storage to supply applications with just the right data, in just the right format, at just the right time.
  • Information Access Accelerator XSP Software Performance Evaluation
    A slide presentation of an industry performance comparison of IBM and Oracle row-store based RDBMSs with iXSP, an interactive set-store data access system. Two benchmarks were performed. One showing a "40-FOLD SPEED INCREASE". The other showing a 76-98 fold performance improvement.
  • Managing Data Mathematically:   Data As A Mathematical Object:
    "Using Extended Set Theory for High Performance Database Management" Presentation given at Microsoft Research Labs. with an introduction by Phil Bernstein. (video: duration 1:10:52) - 2006   We introduce the formal foundations of a set-theoretic data model that can model data at both the logical and physical level. To demonstrate its practical value, we show how to use it to dynamically restructure data based on query requirements. Over time, most queries can be answered by retrieving from disk at most a small superset of the data they actually need, thereby yielding higher performance than conventional methods in today's database systems.
  • MORPHISMS as Set Behavior
    For both conceptual convenience and mathematical compatibility, the term morphism will be defined as an abstract modeling symbol that can be algebraically manipulated and combined with other like symbols to reflect the behavior of one set as influenced by another. Not having any mathematical substance, morphisms do not exist in any formal set theories and thus can not be contained in sets. However, since the notation for a morphism can be defined in terms of legitimate sets, f and σ, sets of morphisms can be simulated by sets of the form ( f σ }.
  • Sherlock Holmes Meets the Wright Brothers
    The point to be made is that successive generations of computer systems can not survive the design philosophy that has dictated prior system implementations. This major failing has to be shared by both industry and the academic community. Neither has concentrated on what the user needs most: PERFORMANCE.

Copyright 2014   INTEGRATED INFORMATION SYSTEMS   Last modified on 03/11/2014