XSP TECHNOLOGY    
Extended Set Processing Technology


Big Data Management
Putting Data Management on a Mathematical Foundation
D L Childs

Big Data management objectives (scaling, large data volumes, diverse and distributed data) are hindered by the physical dependencies required to access data. In 1970 Codd showed how to improve data access by giving data a mathematical identity (tables) and by replacing physically dependent routines with mathematically reliable operations. The same replacement of physically dependent routines with mathematically reliable operations can dramatically improve data access in today's Big Data management systems.

In 1970 Codd introduced the Relational Data Model, RDM. Codd's approach was to abstract away the physical data structures and low-level algorithms of previous data management systems with mathematically reliable operations that could be used at the application level. Storage data and data access were still managed by physically dependent routines, but these physical dependencies were hidden from application users.

Codd's contribution to data management was introducing the concept of a mathematical identity of data, manipulated by set-theoretic operations.

RDM: Relational Data Model
The need to access large amounts of diverse, distributed data was not a concern when traditional data management was being developed. The primary concern, forty years ago, was simply accessing stored data. Since all data were managed by routines that relied on physical properties of data representations, users were burdened with having to know the physical organization of data in storage.[Cod70]

The Relational Data Model, RDM, is an abstract model presenting a strategy for modeling data as mathematical objects and manipulating those objects with mathematically well-defined operations. A Relational Database Management System, RDBMS, is an implementation combining the data management specifications of the RDM with some data access and storage organization strategy.

In his 1981 Turing Award lecture,[Cod81] Codd provided some design objectives for building future data management systems.

Codd's Data Management Objectives
     (1) Data Independence Objective: Separate logical/physical aspects of data management.
     (2) Communicability Objective: Make the model structurally simple.
     (3) Set-processing Objective: Provide a foundation for set-oriented processing.
     (4) Foundation Objective: Provide a sound theoretical foundation for data management.

In 1981 the foundations of set theory were inadequate to support the requirements for faithfully modeling data representations, making it impossible for any implementation to satisfy all four data management objectives. Thus actual RDBMS implementations had to compromise the mathematical reliability of the relational data model and use physically dependent strategies to access data from storage.

RDM: mathematically reliable, declarative, set-processing.
RDBMS: physically dependent, procedural, record-processing.

Despite the fundamental mismatch between RDM and RDBMS data management strategies, the ability of RDBMS implementations to pragmatically manipulate data representations has allowed Relational systems to dominate the industry for over thirty years. Unfortunately, the mathematical muscle of the RDM as currently supported by RDBMS installations is not sufficient to meet the requirements of Big Data applications as physical data structures become more and more distributed across processors and around the world.

The RDM was successful because it provided operations that somewhat minimized the physical dependency between the user's view of data and the system's representation of data, and this was sufficient for most data center applications. But many of Codd's lessons have been lost in the transition to cloud computing. Today the physical details of data storage and management in the cloud make Big Data applications more difficult to develop, more expensive to use, and harder to design and maintain than they would be if Codd's principles were respected.

Recently, extensions to the foundations of set theory have been completed and proven consistent.[Bla11] Extended set theory, XST, provides a sound theoretical foundation for data management by supporting a mathematical identity for all data representations (RDM, XML, Xrel1, Xrel2).

Most importantly, XST allows all of Codd's data management objectives to be achieved. This means that the physical dependencies that distinguish Relational (SQL) from non-Relational (NoSQL) implementations can be isolated and removed.

Big Data Management
Scalability, diverse data, schema development overhead, and access to large volumes of distributed data are a few of the concerns facing developers of Big Data management strategies. Some early pioneering developers are even suggesting the need for a complete system overhaul:

The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs.[Sto7]

A system overhaul is not required. Existing systems can easily be upgraded and new systems developed to provide optimal performance for all applications. What is required is evolving physically dependent data management strategies into mathematically reliable data management strategies.

Physically dependent data management strategies (those that manage data using physical properties of data representations) require using index structures to access data.

Mathematically reliable data management strategies (those that manage data using a mathematical identity of data) use meta-data index sets and set operations to access data.
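
To make the contrast concrete, the following sketch (a toy Python illustration, not iXSP itself; the relation, restrict, and project names are assumptions introduced here) shows data access defined purely by set membership, with no reference to index structures or storage locations:

    # Toy sketch (not iXSP): a relation as a set of records, each record a
    # set of (attribute, value) pairs.
    employees = {
        frozenset({("name", "Ada"), ("dept", "R&D"), ("salary", 120)}),
        frozenset({("name", "Bob"), ("dept", "Sales"), ("salary", 90)}),
        frozenset({("name", "Cyd"), ("dept", "R&D"), ("salary", 100)}),
    }

    def restrict(relation, attr, value):
        # Mathematically reliable access: the result is defined purely by set
        # membership, independent of where or how the records are stored.
        return {rec for rec in relation if (attr, value) in rec}

    def project(relation, attrs):
        # Projection as a set operation on the records themselves.
        return {frozenset((a, v) for a, v in rec if a in attrs) for rec in relation}

    rd = restrict(employees, "dept", "R&D")     # a set, reusable as an operand
    print(project(rd, {"name"}))

A physically dependent strategy would answer the same request by navigating an index structure to record locations; here the answer is characterized entirely by membership conditions.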

Mathematically reliable data management and analysis software, iXSP, already exists and is available to augment the performance of current systems and to assist in the development of future systems. iXSP is a suite of I/O operations for accessing and processing disk-resident data, providing a 10- to 90-times performance improvement over IBM and Oracle.[Sto5] A summary of iXSP properties:

*  Relationally complete.
*  Supports all four of Codd's data management objectives.
*  Requires no schema design.
*  Allows spontaneous SQL queries.
*  Has no index structure overhead.
*  Recognizes all data representations.
*  Both vertically and horizontally scalable.
*  Uses constructive updates instead of destructive updates.
*  I/O performance improvement, up to three orders of magnitude.
*  Parallel I/O access to very large distributed data.

The foundations of iXSP software were initiated under the CONCOMP project, sponsored by an ARPA grant from 1965 to 1970. The project was directed by F. H. Westervelt. One of his objectives was to "find a mathematical foundation for data structures". In 1968 two papers reported some success by reconstituting some formal definitions of classical set theory to better reflect intuitive notions. Further efforts led to the development and implementation of a collection of set operations that worked directly on data stored as arrays. This software, STDS, was used as the data access engine for the University of Michigan MICRO RDBMS from 1972 to 1998.

STDS: Set-Theoretic Data Structure
In 1968 two papers were published suggesting the potential advantages of recognizing data as a mathematical object, managing data with set operations, and organizing data on disks using a set-theoretic data structure, STDS.[A][B]

Conceptually, the set-theoretic notion of an n-tuple seemed like an ideal model for a data record. It was, and still is, an illusion. As articulated by Skolem,[Sk57] there is currently no suitable way to define an n-tuple set-theoretically. Intuitively, however, the set-theoretic operations defined for n-tuples were ideally suited to manipulate data records modeled as n-tuples. By ignoring the difficulties of classical set theory, CST, an n-tuple was treated as a sequence of ordered values, and a "relation" was reconstituted as any set of n-tuples all having the same n. The set-theoretic condition of unique elements was also relaxed. The resulting definition of a relation was just a homogeneous array.

Given this mapping of relations to arrays, STDS operations were just a collection of classical set operations modified to work on homogeneous arrays. All relations were treated as sets, that is, as autonomous and immutable: no conditions were imposed regarding key fields and no values could be modified. Set operations were required to read data from disk and write results back to disk. These conditions ensured (what was later to be defined as) ACID protection.
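
A minimal sketch of the STDS view, under the assumptions stated above: a relation is a homogeneous array of n-tuples, and the operations are classical set operations adapted to such arrays (the function names below are illustrative, not the original STDS interface):

    # Illustrative sketch of the STDS view, not the original implementation.
    def union(a, b):         return sorted(set(a) | set(b))
    def intersection(a, b):  return sorted(set(a) & set(b))
    def difference(a, b):    return sorted(set(a) - set(b))

    # Two homogeneous arrays of 3-tuples; duplicates and order carry no meaning.
    r = [(1, "Ada", "R&D"), (2, "Bob", "Sales")]
    s = [(2, "Bob", "Sales"), (3, "Cyd", "R&D")]

    print(intersection(r, s))   # [(2, 'Bob', 'Sales')]
    print(difference(r, s))     # [(1, 'Ada', 'R&D')]

In the actual system the operands were read from disk and the results written back to disk as new immutable arrays, which is what gave the scheme its (later-named) ACID character.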

iXSP: Interactive, Extensible, Set-Processing Data Management Software
Though STDS proved to be a very fast, intrinsically reliable data management and data access system, its limitation to homogeneous array data seemed restrictive. Unfortunately, if mathematical reliability was to be maintained, either arrays had to remain a limitation or the foundations of set theory had to be extended to provide mathematical identities for other data representations. In 1985 Andreas Blass and the author initiated an investigation into how to legitimately extend (that is, provide axioms for) the membership condition of classical set theory so as to recognize set structure. Though the concept of scopes had been explored earlier and had proven successful for computer programming, the actual axioms supporting an extended set theory were not consummated until 2011.[Bla11]

RDBMS vs. iXSP
All RDBMS implementations use a mathematically reliable application model, the RDM, but support system data access with physically dependent indexes. This means that indexes must be planned for, designed, implemented, and maintained in order to be useful. This consumes considerable non-recurring engineering resources, as well as recurring processing and system administration overhead. In a high-rate, low-latency data processing system, there may not be time to execute the index and related statistical updates needed to use the indexes effectively. In large database instances with hundreds of millions or billions of records, the indexes themselves can become very large and deep (in B-tree structures), often requiring secondary indexing structures to manage the primary indexes.

In addition, an RDBMS has no means of avoiding the transfer of significant amounts of irrelevant data across the I/O barrier. The entire table (or tables) involved in a query must be considered each time the query is executed. All columns of a table must be included in any record retrieval before any requested projections are performed to obtain the desired output columns. Records are also often stored in randomly accessed pages, which can result in whole records being transferred across the I/O barrier for no other reason than that they share a page with a desired record. All of these conditions result in a highly inefficient use of the storage system and exacerbate storage I/O limitations.
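
A back-of-the-envelope calculation (illustrative numbers only, not measurements from any particular system) shows how large the irrelevant transfer can be when whole rows must cross the I/O barrier:

    # Illustrative numbers only: a query needing 2 columns of 0.1% of the rows.
    rows, cols, bytes_per_value = 100_000_000, 20, 8
    selectivity, needed_cols = 0.001, 2

    needed = rows * selectivity * needed_cols * bytes_per_value   # data the query needs
    moved  = rows * selectivity * cols * bytes_per_value          # whole matching rows
    print(needed / 2**20, "MiB needed vs", moved / 2**20, "MiB moved")
    # ~1.5 MiB needed vs ~15 MiB moved, before counting records that merely
    # share a page with a matching record.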

In iXSP, processing is done at the set level, not the record level, so there is no need for record-level indexes. Instead, set-theoretic operations are performed on sets as operands, and the resulting sets, along with their identity and relationship to their parent sets, are saved into a continually evolving set universe. Each requested set operation is first reduced algebraically to its lowest-cost equivalent, using already-computed sets where possible. This step is extremely fast, as it uses set metadata and requires no I/O accesses. The resulting set equation is then executed and the results returned and saved.
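
The following sketch illustrates the set-universe idea under stated assumptions (it is not the iXSP implementation; the expression encoding and names are invented for illustration): every computed result is kept, keyed by the expression that defines it, so a later request can be reduced to operations over already-computed sets before any base data is touched.

    # Toy sketch of an evolving set universe, not the iXSP implementation.
    from collections import namedtuple

    Emp = namedtuple("Emp", "name dept salary")
    base = {Emp("Ada", "R&D", 120), Emp("Bob", "Sales", 90), Emp("Cyd", "R&D", 100)}

    universe = {}   # expression -> previously computed result set

    def evaluate(expr):
        if expr in universe:                 # algebraic reuse: no recomputation
            return universe[expr]
        op = expr[0]
        if op == "eq":
            result = frozenset(r for r in base if getattr(r, expr[1]) == expr[2])
        elif op == "gt":
            result = frozenset(r for r in base if getattr(r, expr[1]) > expr[2])
        elif op == "and":
            result = evaluate(expr[1]) & evaluate(expr[2])
        universe[expr] = result              # the set universe evolves with every query
        return result

    q1 = evaluate(("eq", "dept", "R&D"))                                   # scans base
    q2 = evaluate(("and", ("eq", "dept", "R&D"), ("gt", "salary", 95)))    # reuses q1

The second query reuses the stored result of the first restriction rather than revisiting the base data, which is the same effect the drill-down scenario described next relies on.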

Typical queries look for specific relevant information and result in sets that are several orders of magnitude smaller than the original. In addition, multiple queries are typically related by one or more attributes as analysis proceeds along a line of investigation or "drill down". This results in the creation of multiple small data sets containing highly relevant data related to the recent queries. Instead of relying on indexes to improve the efficiency of finding requested records, iXSP uses these small set partitions of the full data set to rapidly and efficiently find the equivalent result set with minimum data transfer across the storage I/O boundary.

By exposing only mathematically well-defined data and operations to users and applications, set-processing DBMS such as MICRO maximize performance by minimizing system overhead. For example, set processing does not require the extensive schema development and index structure design and maintenance required by RDBMS that expose these implementation details to users. Furthermore, by exposing only set-theoretic data and operations to DBMS implementers and administrators, MICRO and other set-theoretic DBMS achieved dramatic performance benefits using dynamically generated set-processing results rather than the predefined index structures that add so much overhead to today's RDBMS systems.[Har08] By focusing only on sets and set operations, the implementers of MICRO anticipated database innovations such as column stores, data sharding, and data cracking that came along much later as ad hoc solutions to RDBMS performance bottlenecks. The performance advantage of set processing over record processing lies not only in what it does, using set operations, but also in what it does not do: it requires neither protracted schema development nor extraneous index structure overhead.

Big Data Challenges (Physically dependent data management)
Big Data applications introduce issues of scalability, data diversity, and distributed data access, all of which are restricted by physically dependent data management strategies. Replacing physically dependent data management strategies with mathematically reliable ones can resolve these restrictions on scalability, data diversity, and distributed data access. This can be achieved with the aid of a Structured Set Data Model.

The definition of a structured-set data model, SSDM, is deceptively simple:

SSDM: Any collection of data representations and operations on them that are well-defined under the axioms of extended set theory, XST.

There is no restriction on how many operations are defined, nor on what the operations do. There is no restriction on how many data representations are defined, nor on how they are defined. The only condition is that all operations and data representations be defined using extended set notation, XSN.

A structured-set is a set as defined under the axioms of extended set theory.

Conceptually, an extended set is just a classical set whose membership has been extended to provide two conditions for set membership instead of one. The particulars are rather boring, but the utility of the extension is a set-theoretic dimension for structure: classical sets have a single condition for membership, while structured-sets require two.

The structure component of an extended set membership can be used to distinguish the container part of a data representation from the content part. Though set theory has been tried many times as a formal data model, it has always failed to provide the ability to suitably define data records as unambiguous sets. Structured-sets provide an additional component to classical set membership allowing a formally defined representation of data that uniquely distinguishes the logical data relationships (content) from the physical data representation (container).
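
As a concrete (and deliberately simplified) rendering of the idea, the sketch below encodes each extended member as an (element, scope) pair: the scope carries the container/structure information and the elements carry the content. This encoding is an illustrative assumption, not the XST formalism itself, though it mirrors the scope-based tuple and record encodings described in [Bla11].

    # Toy rendering of extended membership: each member is (element, scope).
    # The record  name=Ada, dept=R&D  with field names as scopes:
    by_field    = frozenset({("Ada", "name"), ("R&D", "dept")})
    # The same content laid out positionally (a 2-tuple), positions as scopes:
    by_position = frozenset({("Ada", 1), ("R&D", 2)})

    def content(xset):
        # Strip the structural component, leaving a classical set of elements.
        return frozenset(elem for elem, scope in xset)

    print(by_field == by_position)                       # False: different containers
    print(content(by_field) == content(by_position))     # True: same content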

All Data Representations ARE[a] Structured-Sets.
Thus, All Data Can Be Managed Using Set Operations.

Since structured-sets can faithfully represent any and all application and storage data with the ability to distinguish data content from data container, structured-set based access strategies can manipulate data content and data containers independently to provide near-optimal access performance for each and every application.

With structured-sets the distinction between content and structure is an innate property of extended set membership. This property makes structured-sets a natural choice for modeling representations of data. Under a structured-set data model all logical and physical representations of data are structured-sets, and all manipulations of logical and physical data representations can be modeled by set operations. For presentation convenience or performance considerations, extended set operations can be defined that map the content of one structured-set to another structured-set having a totally different structure. Thus a structured-set data model is an ideal choice for modeling data management systems satisfying all four of Codd's data management objectives.
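
Continuing the toy (element, scope) rendering used above, such a content-preserving restructuring can be expressed as an operation that rewrites scopes only; the schema mapping and restructure function below are assumed examples, not a defined XSP operation:

    # Toy example: map a field-scoped structured-set to a position-scoped one.
    schema = {"name": 1, "dept": 2, "salary": 3}    # field name -> position scope

    def restructure(xset, scope_map):
        # Rewrite each member's scope; the elements (the content) are untouched.
        return frozenset((elem, scope_map[scope]) for elem, scope in xset)

    logical  = frozenset({("Ada", "name"), ("R&D", "dept"), (120, "salary")})
    physical = restructure(logical, schema)
    # physical == frozenset({("Ada", 1), ("R&D", 2), (120, 3)})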

Conclusion
In 1970 Codd demonstrated the advantage of set processing over record processing. His influence has dominated data management systems for over forty years. Though Codd advocated set processing for both application data and system data, classical set theory, CST, was not suitable for faithfully modeling physical data representations.

Recent developments in extending the foundations of set theory now allow all data representations (both abstract and physical) to be faithfully modeled by set theory. It is now possible to fulfill Codd's data management objectives by developing mathematically reliable data management systems that provide all of the following:
          (1) Data Independence Objective: Separate logical/physical aspects of data management.
          (2) Communicability Objective: Make the model structurally simple.
          (3) Set-processing Objective: Provide a foundation for set-oriented processing.
          (4) Foundation Objective: Provide a sound theoretical foundation for data management.

Since all data representations have a mathematical identity under extended set theory, XST, all existing data representations can be mathematically identified and manipulated by set operations. This allows existing systems to be upgraded from physically dependent systems to mathematically reliable systems. iXSP software (or its equivalent) can be used to facilitate a system upgrade or to complement existing systems with a mathematically reliable data management extension.


Notes

  1. In set theory, sets are defined by their membership, and since data representations are a physical expression of the content and structure of data that can be expressed mathematically, all data representations are, by definition, structured-sets.
  2. Faithful means an isomorphic representation of content, structure, and behavior.
  3. MICRO (1972-1998) used mathematically well-defined operations in a time-sharing environment to manage application data, storage data, and all transformations between the two.


References

  1. [Bla11] Blass, A.; Childs, D L: Axioms and Models for an Extended Set Theory - 2011 ♦ This paper presents the formal foundation for supporting "structured-sets".
        5.1 TUPLES: Traditional set theory has several ways of coding ordered tuples < a1, a2, .. , an > as sets, none of which is really canonical [Sk57]. XST provides a simple and natural way to represent tuples, namely to use natural numbers as scopes. Thus, a tuple < a1, a2, .. , an > is identified with the (extended) set { a1^1, a2^2, .. , an^n }, where each superscript is the scope of the corresponding element. The natural numbers used here as scopes can be represented by the traditional von Neumann coding (with scope ∅).
        5.2 GENERALIZED TUPLES: Instead of indexing the components of a tuple by (consecutive) natural numbers, one could index them by arbitrary, distinct labels. The same XST representation still works; use the labels as scopes. This provides, for example, a convenient way to deal with what are often called records in computing. Records have fields, in which data are inserted. We can represent them set-theoretically by taking the field names as scopes with the data as elements. Similarly, we can represent relations, in the sense of relational databases, by sets of generalized tuples, one for each row of the relation. The scopes for such a generalized tuple would be the attribute names, while the corresponding elements would be the values, in that row, of the attributes.
        5.3 FUNCTIONS and MULTI-FUNCTIONS (functions defined by set behavior): In general, a set f of generalized tuples, all having the same scope set D, can be regarded as describing several (possibly multi-valued) operations, called the behaviors of f.
  2. [Boy73] Boyce, R. F.; Chamberlin, D. D.; King, W. F.; Hammer, M. M.: Specifying Queries as Relational Expressions: SQUARE - IBM Technical Report RJ 1291, 1973 ♦ This paper presents a data sub-language called SQUARE, intended for use in ad hoc, interactive problem solving by non-computer specialists. SQUARE is based on the relational model of data, and is shown to be relationally complete; however, it avoids the quantifiers and bound variables required by languages based on the relational calculus. Facilities for query, insertion, deletion, and update on tabular data bases are described. A syntax is given, and suggestions are made for alternative syntaxes, including a syntax based on English key words for users with limited mathematical background.
  3. [Cer10] Cerf, V.: "It's like 1973 for Moving Data Around in the Cloud" - 2010 ♦ Using a cloud computing service may sound enticing, but you better consider how that data can be moved around if you want to switch to a different provider. It's a big problem that now has the attention of Vint Cerf, who is calling for standards to define how customer data gets passed between different cloud service providers.
  4. [Cha74] Chamberlin, D. D.; Boyce, R. F.: SEQUEL: A Structured English Query Language - IBM Research Laboratory, 1974 ♦ ABSTRACT: In this paper we present the data manipulation facility for a structured English query language (SEQUEL) which can be used for accessing data in an integrated relational data base. Without resorting to the concepts of bound variables and quantifiers SEQUEL identifies a set of simple operations on tabular structures, which can be shown to be of equivalent power to the first order predicate calculus. A SEQUEL user is presented with a consistent set of keyword English templates which reflect how people use tables to obtain information. Moreover, the SEQUEL user is able to compose these basic templates in a structured manner in order to form more complex queries. SEQUEL is intended as a data base sub language for both the professional programmer and the more infrequent data base user.
  5. [Cha01] Champion, M.: XSP: An Integration Technology for Systems Development and Evolution - Software AG - 2001 ♦ The mathematics of the relational model is based on classical set theory, CST, and this is both its strength and its weakness. An "extended set theory", XST, can be used to model ordering and containment relationships that are simply too "messy" to handle in classical set theory and the formalisms (such as relational algebra) that are based on it.
  6. [Cod70] Codd, E. F.: A Relational Model of Data for Large Shared Data Banks CACM 13, No. 6 (June) 1970 ♦ Abstract: Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).
  7. [Cod81] Codd, E. F.: Relational Database: A Practical Foundation for Productivity CACM 25, No. 2 (February) 1982 ♦ The 1981 ACM Turing Award Lecture: "The relational model calls not only for relational structures (which can be thought of as tables), but also for a particular kind of set processing called relational processing. Relational processing entails treating whole relations as operands."
  8. [Ell15] Ellis, T.: Extended Set Theory: A Summary - 2015 ♦ Extended Set Theory (XST) was originally developed under an ARPA contract to address the limitations of Classical Set Theory (CST). At the root of the problem is the set membership definition of CST, which defines sets and the results of set operations based on a single membership condition: content.
  9. [Fay13] Fayyad, U. M.: Big Data Everywhere, and No SQL in Sight SIGKDD Explorations, Volume 14, Issue 2 - 2013 ♦ "The term BigData is not well-defined, but is generally used to refer to the type of data that breaks the limits of traditional data storage and management state-of-the-art."
  10. [Har76] Hardgrave, W. T.: A technique for implementing a set processor Proceedings of the 1976 conference on Data : Abstraction, definition and structure archive ACM New York, NY, USA 1976 ♦ Extended set theory is an effective method for describing the complex data structures needed to support large-scale data base applications. This paper describes the incorporation of the salient concepts of extended set theory into a general tool for the design and implementation of large-scale database systems. This tool is called a set processor. This implementation is based on the representation of sets of positive integers as bit strings and the application of the Cauchy/Cantor Diagonal Method.
  11. [Har08] Harizopoulos, S.; Madden, S.; Abadi, D.; Stonebraker, M.: OLTP Through the Looking Glass, and What We Found There SIGMOD'08, June 9-12, 2008 ♦ Databases include a suite of features that were optimized for computer technology of the late 1970’s. Advances in modern processors, memories, and networks mean that today’s computers are vastly different from those of 30 years ago. Yet database architecture (record accessing I/O) has changed little. Substantial time is spent in logging, latching, locking, B-tree, and buffer management operations. Over 90% of an application process is indexed-access overhead.
  12. [Her72] Hershey, W; Easthope, C.: A set theoretic data structure and retrieval language Paper from the Session on Data Structures, Spring Joint Computer Conference, May 1972 in ACM SIGIR Forum, Volume 7, Issue 4 (December 1972) ♦ This paper describes a data structure and retrieval program for general large scale information systems. Retrieval operations are done by means of set theoretic operations. The general structure of the data structure and retrieval program are outlined here, and the retrieval language is briefly explained with accompanying examples. The major components of the data structure are discussed in detail: The directory, the dictionary, and the data itself. Limitations of the system and plans for improvement are mentioned.
  13. [Lar09] Larsen, S. M.: The Business Value of Intelligent Data Access - March 2009 ♦ Article provides an excellent description on how difficult it is to optimize data access paths. "For a wide range of reasons, designing for and maintaining optimal data access poses a genuine challenge to even the most sophisticated enterprises." p. 2
  14. [Lig07] Lightstone, S.; Teorey, T.; Nadeau, T.: Physical Database Design - Morgan Kaufmann, 2007 ♦ The rapidly increasing volume of information contained in relational databases places a strain on databases, performance, and maintainability: DBAs are under greater pressure than ever to optimize database structure for system performance and administration. Physical Database Design discusses the concept of how physical structures of databases affect performance. Something as simple as improving the table index design has a profound impact on performance.
    - Book provides a comprehensive analysis of how complicated the physical database design process can be without the guidance of a formal data access model. Without such formal support, physical file data access structures typically impede system performance.
      a) File data access strategies are extremely difficult to optimize. "Some computing professionals currently run their own consulting businesses doing little else than helping customers improve their table indexing design." Their efforts can improve query performance by as much as 50 times. (p. 2)
  15. [Man14] Manoochehri, M.: Data Just Right - Introduction to Large-Scale Data & Analytics - Addison-Wesley, 2014. ♦ Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases.
  16. [MICRO] MICRO RDBMS 1972-1998; User Manual: MICRO A Relational Database Management System 1992 ♦ MICRO supported timesharing commercial applications from 1972 to 1998. It was the first system to use set-theoretic operations to create and manage stored data. MICRO managed data with a mathematically reliable data storage and access system (STDS*) which used no index structures, required no schema design, provided non-destructive updates, and supported both structured and semi-structured data.
  17. [MSQL] SQL Server Performance Team: Great New TPC-H Results with SQL Server 2008 17 Aug. 2009 ♦ "HP also published a 300GB result on their ProLiant DL785 platform with SQL Server 2008. This publication illustrates the high performance and solid price/performance using industry standard components from HP and Microsoft." (Load time: 300GB in 13.33 hours)
  18. [Nor10] North, K.: ♦ Three articles presenting a short historical perspective on the role of set theory, mathematically sound data models, and the importance of data independence. - 2010
    PART I: Sets, Data Models and Data Independence
    PART II: Laying the Foundation: Revolution, Math for Databases and Big Data
    PART III: Information Density, Mathematical Identity, Set Stores and Big Data
  19. [Sk57] Skolem, T.: Two Remarks on Set Theory (The ordered n-tuples as sets) MATH. SCAND, 5 (1957) 40-46 ♦ Skolem concludes: "I shall not pursue these considerations here, but only emphasize that it is still a problem how the ordered n-tuple can be defined in the most suitable way."
  20. [Sto5] Stout, R.: Information Access Accelerator Information Builders Inc. - 2005 ♦ Slide presentation: An evaluation of iXSP, IBM, and Oracle comparing SQL query performance on an industry standard database. Performance improvement varied from 10 to 90 times faster.
  21. [Sto7] Stonebraker, M.; Madden, S.; Abadi, D.; Harizopoulos, S.; Hachem, N.; Helland, P.: The End of an Architectural Era (It's Time for a Complete Rewrite) 33rd International Conference on Very Large Data Bases, 2007. ♦ "The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs."
  22. [Sto+14] Stonebraker, M., et. al.: Enterprise Data Applications and the Cloud: A Difficult Road Ahead Proc IEEE IC2E, Boston, Ma., March 2014 ♦ Paper succinctly delineates potential growing pains for future DBMS, if developers continue to rely on physically dependent data access technologies. "There is considerable interest in moving DBMS applications from inside enterprise data centers to the cloud, both to reduce cost and to increase flexibility and elasticity. In some circumstances, achieving good DBMS performance on current cloud architectures and future hardware technologies will be non-trivial. In summary, there is a difficult road ahead for enterprise database applications."
  23. [Sto14] Stonebraker, M.: Hadoop at a Crossroads? BLOG@CACM, August, 2014. ♦ Persistent use of file systems perpetuating use of physical data location access strategies will present serious challenges for future system developers. "Hiding the (physical) location of data from the DBMS is death, and the DBMS will go to great lengths to circumvent this feature."
  24. [Teo11] Teorey, T.; Lightstone, S.; Nadeau, T.; Jagadish, H. V.: Database Modeling and Design Morgan Kaufmann, 2011, Fifth Edition. ♦ Many in the industry consider this to be the best book available on classic database design and for explaining how to build database applications, complemented with objective commentary. For example, in Chapter 8: "In short, transferring data between a database and an application program is an onerous process, because of both difficulty of programming and performance overhead."
  25. [Wes70] Westervelt, F. H.: CONCOMP: Research in Conversational use of Computers : final report OFFICE OF RESEARCH ADMINISTRATION, Ann Arbor, December 1970 ♦ This report describes the final research results of the CONCOMP Project: Research in the Conversational Use of Computers, which was funded from 1965/70.
    2.5 DATA STRUCTURES The ever-present problems of providing rapid, flexible access to complexly interrelated data resulted in several efforts in the area of data structures under CONCOMP. The work of Childs reported in Technical Reports 3 and 6 and Appendix D of this report represent the scope of efforts expanded. The theoretically derived structures have been implemented and employed by numerous applications users in MTS. The flexibility and versatility represented deserve greater emphasis than is possible in this brief summary.


Bibliography

  1. Feasibility of a Set-theoretic Data Structure A general structure based on a reconstituted definition of relation IFIP Cong., Edinburgh Scotland, August 1968 ♦ This antique paper presented the thesis that mathematical control over the representation, management, and access of data was critical for the functional freedom of applications and I/O performance of future systems.
  2. Description of a Set-theoretic Data Structure: AFIPS fall joint computer conference, San Francisco CA, December 1968 ♦ Presents early development of STDS, a machine-independent set-theoretic data structure allowing rapid processing of data related by arbitrary assignment.
  3. Extended Set Theory: A General Model For Very Large, Distributed, Backend Information Systems VLDB 1977 (Invited paper) ♦ ABSTRACT Three distinct components comprise an Information System: INFORMATION MANAGEMENT, DATA MANAGEMENT, and STORAGE MANAGEMENT. Until recently, all three have been subsumed under data management. As applications become more demanding, as support criteria become more complex, and as storage capacity becomes very large, the need for functional independence of these three management areas has become more apparent. Recognition of this situation has been popularized through the phrase, "data independence", or more precisely, "data independence from information" and "data independence from storage".
         The difficulty in achieving data independence arises through the incompatibility of a complex information space being supported by a simple storage space. The popular, but limiting approach, has been to force the information space into a restrictive record space. This achieves a deceptive compatibility allowing only the appearance of data independence at the user level. This record oriented approach has become pervasive for small databases even though it constrains user applications, requires substantial storage overhead, and imposes inherent processing inefficiencies.
         As databases become very large and as distributed systems become desirable, the need for inherent (not superficial) data independence becomes crucial. This paper is intended as a tutorial and will describe conditions for data independence and summarize the concepts of Extended Set Theory as a general model for expressing information systems embodying data independence. This generality will be demonstrated by considering some major problems pertinent to the design and support of very large, distributed, backend information systems.
         It should be emphasized that Extended Set Theory is a formalism for expressing solutions and is not a specific solution in itself. Though "redundant membership condition", "distributed membership condition", and "set-theoretic interface" may be new concepts, Extended Set Theory does not preclude any current DBMS concepts, data structures, or existing implementations. Rather, Extended Set Theory embraces them all under a unifying model.
  4. PEBBLE PILES & INDEX STRUCTURES: A Parable - 2005 ♦ Imagine trying to convince ancient sheepherders (who calculated and compared their wealth by equating piles of pebbles to the size of herds of sheep) that a piece of parchment with numbers on it would give the same results as a pile of pebbles, but that would be more accurate, easier to use, faster in processing, less resource intensive, and more portable? Now imagine trying to convince modern database researchers that the mathematical identity of records can be used to replace the physical indexing of records.
  5. 1984 VLDB Panel: Inexpensive Large Capacity Storage Will Revolutionize The Design Of Database Management Systems Proceedings of the Tenth International Conference on Very Large Data Bases. Singapore, August, 1984 ♦ As secondary storage devices increase in capacity and decrease in cost, current DBMS design philosophies become less adequate for addressing the demands to be imposed by very large database environments. Future database management systems must be designed to allow dynamic optimization of the I/O overhead, while providing more sophisticated applications involving increasingly complex data relationships.
  6. A Mathematical Foundation for Systems Development - NATO-ASI Series, Vol. F24, 1986 ♦ Paper presents a Hypermodel syntax for precision modeling of arbitrarily complex systems by providing a function space continuum with explosive resolution and extended set notation to provide generality and rigor to the concept of a Hypermodel.
  7. Managing Data Mathematically:   Data As A Mathematical Object: ♦ "Using Extended Set Theory for High Performance Database Management" Presentation given at Microsoft Research Labs. with an introduction by Phil Bernstein. (video: duration 1:10:52) - 2006
  8. Why Not Sets? - 2010 ♦ Sets are well defined collections of uniquely identifiable items. Data used by computers are well defined collections of items representing situations of interest. Computers themselves are just well defined collections of bits that change value over time. It would seem that all computer processing is highly set oriented. Why are sets not more widely used in modeling the behavior and assisting the development of computing systems?
  9. Functions as Set Behavior: A Formal Foundation Based On Extended Set Axioms - 2011 ♦ ABSTRACT: The term function suggests an action or process or behavior of something applied to something. Within the framework of extended set theory, XST, the concept of a function will be defined in terms of a liberal definition of morphism. Which in turn will be equated with the behavior of how specific sets interact with each other. It will be shown that all Classical set theory, CST, graph based function behavior can be expressed in terms of XST function non-graph based behavior; that the behavior of functions applied to themselves is supported; and that the concepts of Category theory can be subsumed under XST. A notable consequence of this approach is that the use of functions need no longer be constrained by properties of a Cartesian product.
  10. XSP TECHNOLOGY: Theory & Practice: Formal Modeling & Practical Implementation of XML & RDM Systems - 2011 ♦ INTRODUCTION XSP (extended set processing) Technology introduces: [1] a formal mathematical foundation, extended set theory, XST, for defining and mapping user application expectations into internal machine representations; and [2] practical software implementations based on XST that assist in the design, development, and use of XML and RDM systems.
  11. SET-PROCESSING AT THE I/O LEVEL: A Performance Alternative to Traditional Index Structures - 2011 ♦ ABSTRACT It is generally believed that index structures are essential for high-performance information access. This belief is false. For, though indexing is a venerable, valuable, and mathematically sound identification mechanism, its logical potential for identifying unique data items is restricted by structure-dependent implementations that are extremely inefficient, costly, functionally restrictive, information destructive, resource demanding, and, most importantly, that preclude data independence. A low-level logical data access alternative to physical indexed data access is set-processing. System I/O level set-processing minimizes the overall I/O workload by more efficiently locating relevant data to be transferred, and by greatly increasing the information transfer efficiency over that of traditional indexed record access strategies. Instead of accessing records through imposed locations, the set-processing alternative accesses records by their intrinsic mathematical identity. By optimizing I/O traffic with informationally dense data transfers, using no physical indexes of any kind, low-level set-processing has demonstrated a substantial, scalable performance improvement over location-dependent index structures.
  12. SET-STORE DATA ACCESS ARCHITECTURES: For High Performance Informationally Dense I/O Transfers - 2011 ♦ ABSTRACT For high performance analytic processing of vast amounts of data buried in secondary storage, traditional performance strategies generally advocate minimizing system I/O. This paper advocates the converse supported by the use of set-store architectures. Traditional row-store and column-store architectures rely on mechanical data models (based on an imposed physical representation of data) for accessing and manipulating system data. Set-store architectures rely on a mathematical data model (based solely on the intrinsic mathematical identity of data) to control the representation, organization, manipulation and access of data for high performance informationally dense parallel I/O transfers.
  13. iXSP Interactive Extended Set Processor: Programmer’s Manual, User’s Tutorial and Mathematical Backgrounds - v1.7 (44 pages) Draft 2013 ♦ This manual has been produced for those interested in an introduction to the Extended Set Processor from IIS and the commands that the processor supports.
  14. I/O Technology For Big Data: Massively Parallel Data Access of Big Data Environments - 2011 ♦ Traditional I/O technology is based on the storage and retrieval of records, records that are physically preserved in storage. Set processing I/O technology is based on the exchange of collections (sets) of records, records which may or may not physically exist in storage. Given the advances in hardware platforms, set processing I/O technology can offer one to three orders of magnitude better system performance than traditional I/O technology.


Copyright © 2015   INTEGRATED INFORMATION SYSTEMS