Wednesday, February 4, 2015

On Analysis of Post-Relational Data. #1 - Master-Detail Hierarchy

image/svg+xml Name Manager Budget First Name Last Name * one and one and one Manager Budget Employees, and some Name Department Department Employee one and one and one Hired Date Salary First Name Hire Date Salary one Second Name Each has Department Digging Slate 147 HQ Granite 49 Name Manager Budget First Name Last Name Employee Hire Date Salary Department Digging Barney Rubble 02/02/02 16 Digging Fred Flintstone 01/01/01 17 HQ Rock Quarry 11/11/01 19 First Name Last Name Hire Date Salary Department Digging Barney Rubble 02/02/02 16 Digging Fred Flintstone 01/01/01 17 HQ Rock Quarry 11/11/01 19 Department - Employee <joined> Digging Slate 147 HQ Granite 49 Name Manager Budget Digging Slate 147 <company> <Department> <Name>HQ</Name> <Manager>Granite</Manager> <Budget>123</Budget> <Employee> <FirstName>Rock</FirstName> <LastName>Quarry</LastName> <HireDate>11/11/01</HireDate> <Salary>19</Salary> </Employee> </Department> <Department> <Name>Digging</Name> <Manager>Slate</Manager> <Budget>147</Budget> <Employee> <FirstName>Fred</FirstName> <LastName>Flintstone</LastName> <HireDate>01/01/1001</HireDate> <Salary>17</Salary> </Employee> <Employee> <FirstName>Barney</FirstName> <LastName>Rubble</LastName> <HireDate>02/02/02</HireDate> <Salary>16</Salary> </Employee> </Department></company> { "company": { "Department": [ { "Name": "HQ", "Manager": "Granite", "Budget": "123", "Employee": { "FirstName": "Rock", "LastName": "Quarry", "HireDate": "11/11/01", "Salary": "19" } }, { "Name": "Digging", "Manager": "Slate", "Budget": "147", "Employee": [ { "FirstName": "Fred", "LastName": "Flintstone", "HireDate": "01/01/1001", "Salary": "17" }, { "FirstName": "Barney", "LastName": "Rubble", "HireDate": "02/02/02", "Salary": "16" } ] } ] }} Name Manager Budget one one and one Department Each has each of which has Employees First Name Last Name Hire Date Salary one one and one one Department Each also has some some atomic attributes: Consider an ordinary company. It has a standard (simple) organizational structure, where employees work in Departments managed by managers. Straightforward enough.Here's a simple diagram showing the information structure: Here's a normal tabular representation of the company's Departments and Employee data.This is how data has been commonly represented for the past quarter century. Here's the tablular form of the data joined by Department Name.There are a number of ways to accomplish this, but they end up resolving to same single table that Tableau sees, or creates via data blending. Tableau is designed to access and analyze tabular data.It has no trouble connecting to these individual tables, and its intrinsic default aggregation scheme will return the values that mean what people think they mean, e.g.- showing the SUM(Budget) is correct whether it's for one or both Departments or Managers;- summing salary will similarly work for the different Employee dimensional contexts. This is good, but people frequently want to analyse the data for the whole organization, to see things like the sums of Salaries for Departments or Managers. In order to do this, the tables need to be related, in this casebased upon the Department Name. While this is technically correct, in the decomposed relational database view of the world it creates real and substantial analytical difficulties that manifest themselves as technical considerations the human analyst needsto be aware of and accommodate while analyzing the data or else risk achieving results that are arithmetically and technically correct in a narrow sense, but invisibly wrong in the human sense.The obvious case using Tableau, is: if the user puts Manager on the Rows shelf and Budget and Salary on the Colums shelf, Tableau will generate this: Clearly, this is a nonsensical result.The value '294' doesn't make sense for the Digging Department. Department 0 100 200 300 Budget 0 10 20 30 Salary Digging HQ 294 49 19 33 Sum Budget, Salary per Department 294 is shown as Digging's Budget sum because Tableau's default aggregation is SUM and 294 -is- the sum of the Budget values in the two records in the record set Tableau's looking at.In Tableau's world, the analyst is responsible for knowing the technical details of the data's organizationincluding the differing table granularities, how they're joined or blended, and the analytical context in order tochoose the right aggregation to achieve sensible results.Here's how Tableau can be coaxed into showing the correct analytic: Note: MIN(Budget) Chaining the Budget aggregation to MIN(), or any of the other aggregations that result in the identity value works.But it's a poor solution.Tableau should be smart enough to recognize the hierarchical relationship between Departments and Employees and implement the aggregation correctly according to the analytical context.It might seem like this is a bit of a pipe dream, that it's asking a lot, perhaps too much, of Tableau. That it's outside of the world of "real" data that Tableau was designed to work with.Except that it's not quite right.Hierarchical data was the normal way of storing data before everything was 'relationalized' and decomposed intonormalized tables that required recomposition before data with real world semantics could be available for analysis.And here's the best part: the problem of how to interpret an analytical request like "sum the Budget and Salary per Department" correctly in different analytical contexts was solved over thirty years ago, and formed the basis for entire families of data analytical tools. Modeling the Company's data to match the human view. This is an abstraction of the technique of using a structural description (a master file descriptor (or MFD)- and a prize to the first commentor to identify the reference) to identify the data contents and relationships. * Department Employee Name Manager Budget First Name Last Name Hire Date Salary Digging Slate 147 Fred Flintstone 01/01/1001 17 Barney Rubble 02/02/02 16 HQ Granite 49 Rock Quarry 11/11/01 19 MFD, schema, etc. Data Modern Data XML JSON This is how it used to be done. The structural description was in a human- and machine-readablefile and the data could be, if desired, kept in an optimzed dedicated file type designed for rapid access to its content. Really handy for retrieval during analysis.In this scheme, this is a legitmate analytical request: from DepartmentData sum Budget, Salaryand this would be the result: I've proposed this idea, that Tableau needs to recognize and accommodate sructural data relationshipsand sometimes hear something like: "well, that's all very nice and all, but modern data is stored in tables, and that's just the way things are".I don't think so.Here are a couple of examples of the Bedrock Mining Company data stored in modern forms. XML and JSON are excellent examples of approaches that are moving away from relational/tablular data modelingand storage in order to accommodate modern needs. The noSQL movement is increasing its presence in mainstream business data environemnts. Tableau technology partners aren't limited to the traditional relationaldatabase vendors; one of the major enterprise noSQL advertises that they store data internally as XML, eventhough the only visibilty to it through Tableau is via slices that resolve into single tables.It's worth noting that JSON is at the heart of the aggressive evolution occuring in data visualization, particularly theweb-based javascript approaches that are capable of creating exquisite data visualizations, albeit with a modicumof programming effort, and the increasing sophistication of the technological abstractions are leading to thedevelopment ot end-used based tools that will handle complex data easily, simply, and with very little friction. Beyond master-detail, the future beckons. The example above is the first step beyond flat tables. A two-level master-detail structure will enable an entire new world of analytical opportunities and abilities. But it's just a baby step.There are other horizons to reach, a whole new universe to explore.Beyond simple hierarchies are multi-level, multi-path hierarchies. Imagine if each Bedrock Mining Company department had, in addition to employees, vehicles and buildings they were responsible. It's easy to model this in XML, JSON, or any number of other data formats, but it's not easy to analyze the data, even though it might be interesting to see if there was any differential brake wear, comparing those departments with employees hired on or before the median date of hire fof all employees. When Tableau makes this a simple UI operation it'll really be on to something.But wait, there's more.Suppose you had an XML file containing data relating to a wide variety of different things, with many hierarchical levels, many paths, and this is where it really gets interesting, non-parallet semantic relationships between elements and cross-linking across paths with references in one location identifying elements in a different structural path and at different levels. If only Tableau had the ability to allow access to something like this we could really work some magic. I've been waiting for this for over five years, no telling how much longer to go.Admittedly, these aren't trivial matters. There are difficult problems to solve, technically and human/usability-wise.Someone's going to solve them and create the next great tool that makes it straightforward to do. And that will be good. DepartmentDiggingHQ Budget14749 Salary3319 DepartmentDiggingHQ Budget Salary 3319 14749 - rendered as Table - rendered Graphically This demonstration only shows the structural properties, the decorative elements are left out deliberately. each of which has


  1. Still remains the analyst's responsibility to make it happen, with LOD calculations coming out.

    1. Yes, John, it's still the analyst's responsibility. But it should not be - Tableau should provide simple, straightforward mechanisms to identify the information structure/hierarchy and automatically provide the appropriate aggregation in context.

      About LOD calculations. Although they really are a good thing, their mere presence reinforces the way Tableau works, and they add more mass to the body of things that conform and reinforce how Tableau works, each increment of which entrenches its shortcomings and makes it harder to adopt a different operational model.