XSP TECHNOLOGY    
Extended Set Processing Technology

Machine-Independent Data Structure Management
Accessing Data by Mathematical Identity, Not by Physical Location

XSP, extended set processing, represents data as a mathematical object allowing data access to be both application and platform independent. Over fifty years ago an ARPA funded research project developed data access operations that allowed any machine-readable data to be accessed and processed in real time. These access operations were implemented as a set-theoretic data structure, STDS. ARPA envisioned mathematically sound data structures that were interrupt-immune. Implementation of these operations for reliable, optimal data access from any platform by any application is the subject of the following paper.

Introduction
Data can be viewed as a collection of associated items. These associations are interpreted as meaningful relationships. For accessing and processing these relationships they need to be represented on a computer. The data relationships are machine-independent. Their computer representations are not.

In 1965 ARPA initiated a research project to explore the existence of a mathematical foundation for defining and manipulating data representations on computers. ARPA considered current data structure development too application and platform dependent, thus be unable to support optimal function and performance capabilities of future systems. It was ARPA's contention that machine representations of data should be transparent to all applications no matter the platform, i.e. machine-independent.

Data relationships and mathematical sets have something in common, they are both devoid of any physical dependencies. Set operations already exist for processing relationships mathematically defined as relations. Set operations are defined in terms of a set's mathematical identity, not in terms of a set's physical representation. Set theory also supports a wealth of operations for processing sets by their mathematical identities. If data relationships had mathematical identities, then set operations could be used to access and process data. This observation was used in 1965 to initiate research on the feasibility of developing a mathematically well-defined set-theoretic data structure. STDS was later used as the storage management system to support an Information Management System developed at the University of Michigan and commercially available from 1971 through 1998.(see STDS-IMS) Set processing in a network environment was implemented in 1975. (see Hardgrave)

Data as a Mathematical Object
Data is essentially a collection of meaningful relationships of interest. Exploiting this interest with the aid of a computer requires representing data relationships in a machine-friendly format. Thus, by definition, data representations are machine-dependent.

A mathematical foundation for defining and manipulating data relationships requires that data relationships have mathematical identities. Since any actual manipulation is done by a computer, the machine representations of data relationships must also have mathematical identities. Binary relationships already have an identity as binary relations defined by set theory. The wealth of set operations already available for well-defined relations made set theory an attractive candidate as a mathematical foundation for developing future data processing systems. Unfortunately, providing data relationships with a unique mathematical identity was not feasible with any then available set theory.

Records as n-Tuples
No matter what data model is chosen for a specific data processing application, the data has to be physically structured in order to be accessed. To support transparency for all user data models, machine-independence was required at the I/O level. At the I/O level all data structures are defined in terms of records, fields, and files. The structural similarity between computer records and set-theoretic n-tuples is difficult to ignore. To model data structures at the I/O level STDS was developed in terms of tuples, domains, and sets.

Since records are a basic representation of data and n-tuples are mathematical objects for representing item relationships, formally identifying the two seemed to meet ARPA's research objective for a mathematical foundation for supporting data structures. And in principle it did, but in practice it did not. The fundamental problem was that manipulating records relied on programming item access to the contents of records. Even though n-tuples have a set-theoretic identity it is declared instead of being defined by set-membership. Specifically, the tuple < a, b, c > does not contain a, b, or c as an item. While the record [ c, b, a ] contains all three.

Though the set { a, b, c } contains exactly the three items, so does { c, b, a }. Not surprising since it is exactly the same set. But < a, b, c > and < c, b, a > are not the same set. Though they belong to the same equivalence class, they are inverses of each other. This is not easily verified using classical set notation.

Determining that two records are equal is simple. If for all i, the i-th items of record r and record s are the same, the records are equal. Determining if two n-tuples are equal is not so simple. The i-th item of an n-tuple is not well-defined in set theory. Without a formal mechanism to determine if two n-tuples are equal, representing records by n-tuples is not of much value .

Record Reordering
In set theory, operations on sets are defined in terms of set membership. To define operations on n-tuples that mimic record operations, n-tuples need a well-defined membership condition. Assume that n-tuples have a suitably defined membership condition, then define ρi ( < x1, x2, .. , xn > ) =  x i.
             Item Valueρi ( < x1, x2, .. , xn > ) =  xi
Given the i-th item value operation defied by ρi ( r ) =  ri. determining if two records, possibly on separate platforms, are equal can now be defined set theoretically.
             Theorem:  Record r equals record s iff (∀i) ( ρi ( r )  =  ρi ( s ) ).
Being able to determine whether two records on separate platforms are equal is necessary, but nowhere near sufficient. Field ordering does not affect the represented relationship, but it does impact system data accessing and performance dictated by resulting data structures. Since a record of cardinality n, |n|, provides n! different representations for the same relationship. For relationships to be shared across platforms some mechanism is needed to detect relationship equivalence. A re-mapping operation can be defined that allows any tuple in an equivalence class to be re-mapped to any other tuple in the same equivalence class.
             Re-map:  Given σ with |σ| = k and record r, let ρσ(r)  =  < ρσ1(r),  ρσ2(r),.., ρσk(r)  >.
                           Example:    ρ< 3, 2, 4, 1 > (< a, b, c, d >)  =  < c, b, d, a >
Since an equivalence class of n degree records has n! different representations, and since mapping directive σ has exactly n!, all possible mappings are covered. This provides platform independence. Actually a little too much independence since platforms have no way to recognize relationship equivalence. Some commonality needs to be shared between platforms. By sharing appropriate mapping directive, σs, relevant relationships can be easily shared across platforms.

Relevant Reordering
Fortunately, records come equipped with item Fields. Field designations are required to be unique, even if they represent the same set of item values. Field orderings determine item order and provide a basis for equating relationship equivalence between files of the same equivalence class. By extension physically defined data Fields equate to set-theoretically well-defined Domains. Just as Field ordering determined record ordering, Domain ordering and selection dictate n-tuple orderings.

Since not all applications require access to all relationship Domains, I/O and processing performance can be dramatically improved by transferring only those Domains actually required by an application. Re-map provides a real-time ability to provide a record reduction and re-ordering during data access. This Domain selection capability was included in the 1968 version of STDS.(see AFIP)

In 2005 Information Builders Inc, compared the performance of an STDS supported Information Access Accelerator with two of the current leading DBMSs.(see IAA) The real-time application of set-theoretically defined Domain selection gave typical performances improvements between 40 to 90 times that of commercial DBMSs.

            The XST architecture preserves membership across environments while making it possible for
                    the data representation to vary from one environment to the next. Thus, performance at the
                    process level can remain independent of how relationships are expressed at the conceptual
                    level and how data is represented at the storage level.

                    This is the case for the XST approach in a nutshell:
                    (1) Set membership determines functionality.
                    (2) Data representation at the storage level determines performance.
                    (3) Strong data independence separates functionality concerns from performance concerns(IAA-13)

Since extended set operations are application independent, program language independent, platform independent, and storage organization independent the only performance dependencies are the program blueprint and the programmer's imagination.( IAA-12, IAA-14)

Though the industry has long believed that index structures are essential for best performance, the exact opposite is true. Research has shown that over 90% of an application process can be indexed-access overhead.(SIGMOD) By eliminating index structures in favor of direct data access using extended set operations performance can be improved by orders of magnitude.(IAA-30) For more extensive performance results of two industry benchmark studies see IAA report pages 23-39.

RELATIONS (set theoretic)
In set theorey, relations are well-defined as subsets of the Cartesian product. The Cartesian product is well-defined in terms of ordered pairs. The ordered pair is not well-defined. It is declaired into existence by an individual. Not just one individual, but by several. Each with a different agenda. None of which allow the Cartesian product to be associative. Under the axioms of extended set theory, a definition of ordered pair was well-defined that allowed the Cartesian product to be associative. This in turn allowed n-ary relations to be well-defined subsets of a well-defined extended Cartesian product.

XSP: Smoke & Mirrors?
To some, extended set processing, XSP, seems to be more "smoke & mirrors" than "sets & mathematics". There is no such thing as an "extended set". This is not a totally unjustified view. Without a deep understanding of set membership condition and the difficulties imposed by the Kuratowski definition of an ordered pair, it would not be obvious what role an extended set membership condition could have on relations defined as subsets of the Cartesian Product.(see Berztiss)

XML, Categories, λ-Calculus
The 1968 implementation of STDS only supported access to homogeneous data arrays. This was far short of ARPA's vision of a formal foundation for accessing any and all possible computer representations of data. By 1998 an implementation of STDS was equipped to recognize and process XML structures as mathematical objects. It demonstrated XSP ability to establish content equivalence between RDM table-structures and XML computer representations. (see [T&P])

A mathematical foundation for developing executable systems is of little value without capable tools for exploiting its potential. Category theory and the Lambda calculus provide just such tools. Both have known difficulty associating with Classical set theory. Extended set theory proved more accommodating. (see Blass, Kats)

Summary
Mathematically defined data structures access data by its content. Physically defined data structures access data by its location. Mathematically defined data structures are intrinsically stable. Physically defined data structures are intrinsically fragile. Mathematically defined data structures are global. Physically defined data structures are local. Mathematically defined data structures are machine-independent. Physically defined data structures are machine-dependent.

Without a unifying mathematical foundation data accessing is a disjoint universe of physically intractable data structures. With a unifying mathematical foundation data accessing is a unified universe of mathematically synergistic data structures.


References

  1. [Skolem] Two Remarks on Set Theory, Skolem, T.: 2. The ordered n-tuples as sets, MATH. SCAND, 5 1957, "It is still a problem how the ordered n-tuple can be defined in the most suitable way."
  2. [Suppes] Axiomatic Set Theory, Suppes, P.:Van Nostrand, 1960, "All relations are binary."[p. 57]
  3. [AFIP] Description of a Set-theoretic Data Structure: - 1968
  4. [ARPA] CONCOMP: Research in Conversational use of Computers: final report, December 1970
  5. [Berztiss] DATA STRUCTURES: Theory and Practice, Berztiss, A.T., Academic Press, New York, 1971, "We have to develop a completely new theory of ordered objects."[p. 27]
  6. [STDS-IMS] UoM Information Management System, STDS supported, 1971-1998
  7. [Hardgrave75] Set Processing in a Network Environment Extended set theory is an effective method for describing the complex data structures needed to support large-scale data base applications. [Hardgrave, W.T. 1975]
  8. [Hardgrave76] A technique for implementing a set processor, Hardgrave, W. T. 1976
  9. [VLDB77] 1977 VLDB Extended Set Theory: A General Model For Very Large, Distributed, Backend Information Systems.
  10. [LLL] Set Theoretic Data Structures (STDS), Lawrence Livermore Lab. Technical Report, 1977.
  11. [NATO] A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1985
  12. [VLDB84] 1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize Future Systems.
  13. [Champion] XSP: An Integration Technology for Systems Development, Champion, M. - 2001
  14. [SIGMOD] OLTP Through the Looking Glass, and What We Found There Harizopoulos,~S.; Madden,~S.; Abadi,~D.; Stonebraker,~M.: SIGMOD'08, 2008 ♦ Substantial time is spent in logging, latching, locking, B-tree, and buffer management operations. Over 90% of an application process is indexed-access overhead.
  15. [IAA] Information Access Accelerator iXSP Performance evaluation, IBI, Stout, R. - 2005
  16. [MSLab] Managing Data Mathematically:   Data As A Mathematical Object: ♦ "Using Extended Set Theory for High Performance Database Management" Presentation given at Microsoft Research Labs. with an introduction by Phil Bernstein. (video: duration 1:10:52) - 2006
  17. [North] Sets, Data Models and Data Independence Ken North a Dr. Dobbs''s Blogger, March 10, 2010
  18. [SETS?] Why Not Sets? All computer processing is set-theoretic in nature. - 2013
  19. [C&C] Content-Container Data Access Strategies Content for Functionality - Containers for Performance - 2016 ♦ Since data represented for processing is not always ideal for preservation, and since data represented for preservation is not always ideal for processing, accessing application-friendly data from storage-friendly data poses a genuine challenge.
  20. [DDRvsPDS] Set Processing vs. Record Processing Performance: Dynamic Data Restructuring vs. Prestructured Data Storage - 2016 ♦ Record processing systems and set processing systems are very different. This paper attempts to clarify the major differences and demonstrate the performance advantages of set processing.
  21. [XSP] XSP: Extended Set Processing: Mathematically Managing Data Representations - 2016 ♦ This paper presents a high level overview of why a mathematical model of data representations is necessary, and how an extended set processing model accomplishes the task.


  Copyright 2023   INTEGRATED INFORMATION SYSTEMS   Version: 01/15/2023 v03

EXTENDED SET THEORY: Historical Lineage
Functions, Relations, Processes, & Categories

Extended set theory, XST, complements classical set theory, CST, in the following ways: arbitrarily large sets, fixed rank for n-tuples of constant rank items, associative Cartesian product, self-application, and support of category theory.

References

  1. [2T] Two Theoremsf: CST vs XST Functions - 2019
  2. [PFAC] Processes, Functions, Applications & Composition ♦ Lambda application & category composition of processes. - 2018
  3. [CSTasXST] XST Definitions, Operations, & Properties -Tuples, Graphs, Functions - ♦ CST operations as XST operations. - 2018
  4. [Kats] Kategories & Morfisims: - 2017
  5. [FUN] Functions As Set Behavior Essential Concepts: Conceptual & Formal Modeling Foundations. - 2016
  6. [Ellis] Extended Set Theory: A Summary A clarification of extended set notation. - 2015
  7. [Blass] Axioms and Models for an Extended Set Theory Blass, A., Childs, D L - 2011 ♦ This paper presents the formal foundation for "extending" the axioms of ZFC, which only addresses the property of membership, to include a "scope" property for addressing the property of "structure".
  8. [T&P] XSP TECHNOLOGY: Theory & Practice Formal Modeling & Practical Implementation of XML & RDM Systems - 2007 ♦ The RDM works in spite of set theory, not because of it. XML-Structures and operations on these structures are set-theoretically sound under XST. In this paper the formal foundations of RDM and XML systems are examined in the light of XST to provide practical relevance of XSP Technology to the field of software systems development.
  9. [X-AXIOMS] Summary of Axioms for Extended Set Theory: - 2005
  10. [XSP-XML] XSP for XML Systems Design & Development: - 2000
  11. [XST] Extended Axiomatic Set Theory: A Focus on Functions (early draft) - 1999
  12. [IFIP] Feasibility of a Set-theoretic Data Structure: A Reconstituted Definition of Relation, - 1968