Wednesday, February 12, 2014

A New Horizon

I've been sitting on this post for quite a while thinking I'd improve it enough to make it suitable for publishing. But that's not going to happen in the near term, so here it is in all its roughness.
Note: it's in SVG (I use Inkscape for sketching) which all modern browsers should render.

Tableau is superbly designed for accessing and analyzing tabular data. It's nearly as good at this as can be, notwithstanding the oddity here and there. It's almost trivially easy to connect to a table, or to more than one table using joins, custom SQL, and/or data blending, and to analyze the resulting flat record set that Tableau eventually sees.

It's simple and straightforward to organize the data, and to generate quantitative visualizations of it, and everything works smoothly. As long as the data falls within Tableau's data conceptual envelope. Once the data lies outside Tableau's horizon working with it isn't so easy, and in fact the same analytical operations that work well on simple tabular data may produce precise, arithmetically correct, and wrong results.

It would be very, very handy if Tableau grew to accommodate data that it doesn't now understand and provide an elegant user interface for analyzing. It need not, but it should, if it wants to keep it's leadership position.

This post covers the next step in expanding Tableau's data horizon: the simple, straightforward master-detail data structure.

image/svg+xml Name Manager Budget First Name Last Name * Department Employee Hire Date Salary Department Digging Slate 147 HQ Granite 49 Name Manager Budget First Name Last Name Employee Hire Date Salary Department Digging Barney Rubble 02/02/02 16 Digging Fred Flintstone 01/01/01 17 HQ Rock Quarry 11/11/01 19 First Name Last Name Hire Date Salary Department Digging Barney Rubble 02/02/02 16 Digging Fred Flintstone 01/01/01 17 HQ Rock Quarry 11/11/01 19 Department - Employee <joined> Digging Slate 147 HQ Granite 49 Name Manager Budget Digging Slate 147 First Name Last Name Hire Date Salary one one and one one each of which has Employees Department Each also has some Name Manager Budget one one and one Department Each has some atomic attributes: Consider an ordinary company. It has a standard (simple) organizational structure, where employees work in Departments, each of which has a manager and distinct properties. That's easy enough to fix, here's a diagram showing the information structure:This is a typical presentation of hierarchical information,of the sort that's been around seemingly forever, at leastin the history of computer-assisted data processing.This structure is almost universally intuited, and hierarchicaldatabases structured to match the information of the businessdomain they served were the norm until table-based databasesbecame the new standard during the late '89s and 90s.(why this happened is an interesting story for another time, but we've been suffering because of it ever since) Here's a normal tabular representation of the company's Departments and Employee data.This is how data has been commonly represented for the past quarter century. Here's the tabular form of the data joined by Department Name.There are a number of ways to accomplish this, but they end up resolving to same single table that Tableau sees,or creates via data blending. Tableau is designed to access tabular data.It has no trouble connecting to these individual tables, and it's intrinsic default aggregation schem will return the values that mean what people think they mean,e.g. - showing the SUM(Budget) is correct whether it's for one or both Departments or Managers;- summing salary will similarly work for the different Employee dimensional contexts. This is good, but people frequently want to analyse the data for the whole organization, to see things like the sums of Salaries for Departments or Managers. In order to do this, the tables need to be related, in this case based upon the Department Name. While this is technically correct, in the degenerate relational database view of the world, it creates real and substantial analytical difficulties that manifest themselves as technical considerations the human analyst needsto be aware of and accommodate while analyzing the data or else risk achieving results that are arithmetically and technically correct in a narrow sense, but invisibly wrong in the human sense.The obvious case, using Tableau, is: if the user puts Manager on the Rows shelf and Budget and Salary on the Colums shelf, Tableau will generate this: Clearly, this is a nonsensical result.The value '294' doesn't make sense for the Digging Department. Department 0 100 200 300 Budget 0 10 20 30 Salary Digging HQ 294 49 19 33 Sum Budget, Salary per Department 294 is shown as the Budget sum because Tableau's default aggregation is SUM and 294 -is- the sum of the Budget values in the two records in the table Tableau's looking at.In Tableau's world, the analyst is responsible for knowing the technical details of the data's organizationincluding the differing table granularities, how they're joined or blended, and the analytical context in order tochoose the right aggregation to achieve sensible results.Here's how Tableau can be coaxed into showing the correct analytic: Note: MIN(Budget) Changing the Budget aggregation to MIN(), or any of the other aggregations that result in the identity value works.But it's a poor solution.Tableau should be smart enough to recognize the hierarchical relationship between Departments and Employees and implement the aggregation correctly according to the analytical context.It might seem like this is a bit of a pipe dream, that it's asking a lot, perhaps too much, of Tableau. That it's outsde of the world of "real" data that Tableau was designed to work with.Balderdash.Hierarchical data was the normal way of storing data before everything was 'relationalized" and decomposed into normalized tables that required recomposition before data with real world semantics could be available for analysis.And here's the best part: the problem of how to interpret an analytical request like "sum the Budget and Salary per Department" correctly in different analytical contexts was solved over thirty years ago, and formed the basis for entire families of data analytical tools. Modeling the Company's data as it should be. This is an abstraction of the technique of using a structural description (a master file descriptor (or MFD), and there's a prize to the first commentor to identify the reference) to identify the data contents and relationships. * Department Employee Name Manager Budget First Name Last Name Hire Date Salary Digging Slate 147 Fred Flintstone 01/01/1001 17 Barney Rubble 02/02/02 16 HQ Granite 49 Rock Quarry 11/11/01 19 MFD, schema, etc. Data Modern Data This is how we did it in the old days. The structural description was in a human- and machine-readable file and the data was kept in an optimzed dedicated file type designed for rapid access to its content. This kept the data really handy for retrieval during analysis. (say...) I've proposed this idea, that Tableau needs to recognize and accommodate sructural data relationships and sometimes hear something like: "well, that's all very nice and all, but modern data is stored in tables, and that's just the way things are".I don't think so.Here are a couple of examples of the Bedrock Mining Company data stored in modern forms. <company> <Department> <Name>HQ</Name> <Manager>Granite</Manager> <Budget>123</Budget> <Employee> <FirstName>Rock</FirstName> <LastName>Quarry</LastName> <HireDate>11/11/01</HireDate> <Salary>19</Salary> </Employee> </Department> <Department> <Name>Digging</Name> <Manager>Slate</Manager> <Budget>147</Budget> <Employee> <FirstName>Fred</FirstName> <LastName>Flintstone</LastName> <HireDate>01/01/1001</HireDate> <Salary>17</Salary> </Employee> <Employee> <FirstName>Barney</FirstName> <LastName>Rubble</LastName> <HireDate>02/02/02</HireDate> <Salary>16</Salary> </Employee> </Department></company> { "company": { "Department": [ { "Name": "HQ", "Manager": "Granite", "Budget": "123", "Employee": { "FirstName": "Rock", "LastName": "Quarry", "HireDate": "11/11/01", "Salary": "19" } }, { "Name": "Digging", "Manager": "Slate", "Budget": "147", "Employee": [ { "FirstName": "Fred", "LastName": "Flintstone", "HireDate": "01/01/1001", "Salary": "17" }, { "FirstName": "Barney", "LastName": "Rubble", "HireDate": "02/02/02", "Salary": "16" } ] } ] }} XML JSON XML and JSON are excellent examples of approaches that are moving away from relational/tablular data modeling and storage in order to accommodate modern needs. The NoSQL movement is increasing its presence in mainstream business data environments. Tableau technology partners aren't limited to the traditional relational database vendors; one of the major enterprise noSQL advertises that they store data internally as XML, even though the only visibilty to it through Tableau is via slices that resolve into single tables.It's worth noting that JSON is at the heart of the aggressive evolution occuring in data visualization, particularly the web-based javascript approaches that are capable of creating exquisite data visualizations, albeit with a modicum of programming effort, and the increasing sophistication of the technological abstractions are leading to the development of end-used based tools that will handle complex data easily, simply, and with very little friction. Beyond master-detail, the future beckons. The example above is the first step beyond flat tables. A two-level master-detail structure will enable an entire new world of analytical opportunities and abilities. But it's just a baby step.There are other horizons to reach, a whole new universe to explore.Beyond simple hierarchies are multi-level, multi-path hierarchies. Imagine if each Bedrock Mining Company department had, in addition to employees, vehicles and buildings they were responsible. It's easy to model this in XML, JSON, or any number of other data formats, but it's not easy to analyze the data, even though it might be interesting to see if there was any differential brake wear, comparing those departments with employees hired on or before the median date of hire fof all employees. When Tableau makes this a simple UI operation it'll really be on to something.But wait, there's more.Suppose you had an XML file containing data covering a diverse set of properties, with many hierarchical levels, many paths, and this is where it really gets interesting, non-parallet semantic relationships between elements and cross-linking across paths with references in one location identifying elements in a different structural path and at different levels. If only Tableau had the ability to allow access to something like this we could really work some magic. I've been waiting for this for five years; this is the complexity boundary that made creating TWIS so difficult. Tableau workbooks are complex in this manner, making it extremely difficult to distill from them a reasonable, finite set of data sources that surface all the information one is likely to be interested in finding out about their workbooks. If Tableau provides a simple, elegant design that makes accessing and anlayzing this Tableau workbooks this problem will be solved, and the future will be wide open. But there's no telling if it's in the works, or if it is how much longer it's going to be. Admittedly, these aren't trivial matters. There are very difficult problems to solve, technically and from the human/usability perspective.But someone's going to solve them. And that will be good. Using blocks this way, the properties are straightforward and easy enough to understand, but something's missing ; it doesn't capture anything of the essence of the relationship betwen Departments and Employees.