What in the world were you thinking???

I can tell you that as the father of two daughters, the grandfather of 7 and a 20-year veteran coach/instructor for thousands of adolescent female athletes, I’ve probably said “What in the world were you thinking” at least a thousand times. You know what I mean … children so often do things that just completely defy all logic or known thought processes.

The irony is that as adults we say this mostly in jest as we roll our eyes. All the while knowing full well the problem wasn’t what they thought but the fact that they didn’t think. They simply allowed themselves to be distracted by something else.

Two years ago I began this blogging journey and I’ve greatly enjoyed every minute of research, every post and every conversation that was sparked about Data Visualization topics. But I have to be honest: watching the battle of hype versus hope unravel right before my eyes on the Data Science and Big Data fronts has kind of driven me crazy. So as this blogging journey is about me, I find that I need to begin at least intermixing what I’m learning and how I feel about Data Science and Big Data in with my posts on Data Visualization.

The American Recovery and Reinvestment Act of 2009 pushed $20 Billion into data producing factories in the form of EHR systems. Contrary to the common myth, data storage isn’t cheap. You need bigger data centers, with more racks of disks, which require more power, more cooling, more backups, more network bandwidth both internally and externally for redundancy, and more staff to manage the infrastructure. Ugh!

Not really sure what they were thinking. To my knowledge real factories don’t produce goods that can’t be consumed. Yet here many of you sit 7 years later with data centers full of unused 0’s and 1’s. Producing them at a frantic pace, but doing nothing with them. Because the push was to collect data but there was no plan on how to utilize the data.

Data Science


Over the past several years I have spent a great many hours consuming free training about Data Science via Coursera. Why would I read “Data Science for dummies” when geniuses like Roger Peng and Jeffrey Leek of Johns Hopkins are teaching Data Science courses? Free courses! Free courses that I can take from the comfort of my own sofa, I should add. When they recently authored Executive Data Science – A guide to training and managing the best Data Scientists, I figured I could afford to pay for their book since I had already MOOChed off of their expertise so much. I bring up their book because they had a profound concept that you may want to write in permanent ink on your monitor … “The key word in data science is not data, it is science. Data science is only useful when the data are used to answer a question. That is the science part of the equation.”

No wonder these guys are professors at Johns Hopkins. Seriously as I start this series on Big Data and Data Science I wanted to ensure that we are all on the same footing. As I refer to the term “Data Science” it’s always, always, always going to be in regards to applying science to data to answer some business question.

Data Science, like anything new, has been greatly overhyped for sure. Many businesses jumped in with both feet and lots of money, praying that they would magically uncover a “Beer and Diapers” or “predicting pregnancy” story of their own that would help their company make a billion dollars in the following quarter. What in the world were they thinking? Data science isn’t black magic that you just conjure up answers with … it’s science. It follows scientific principles. It takes discipline.

Unfortunately, because some of those predictable failures came from a lack of reasoning, many, many more businesses are sitting on the sidelines watching their business lose money hand over fist, ignoring the fact that data science is available. They don’t understand how data science works so they simply ignore it instead. What in the world are they thinking?

Is data science for everyone? Of course not. But tucking your head in the sand while other companies use it as a competitive asset just isn’t a good business practice. If you want to separate “hype” from “hope” so you know whether it is right for you, then start with “What is the question I am trying to answer with data?” Follow that up with “Do I even have the data I need to answer it?” If you have a question and you have the data, then allow the science to lead you to the answers that are hidden in your data.

Big Data

One of the reasons for so many dashed hopes and dreams is that some organizations started building massive data lakes thinking “the more data I have, the better the answers I can get.” They had no business questions in mind; they just figured if they assembled enough data files together on disk drives that problems would somehow solve themselves. Quite simply they ignored the science and focused on the data. I don’t want you to make the same mistake.

If you are going to undertake anything new like Data Science or Big Data you have to understand that major changes like this require organizational change as well. They aren’t just a technical matter. If you are going to go with a Big Data solution then for goodness sakes please start by following sound advice like that found in Benjamin Bowen’s book titled Hadoop Operations. He makes it clear that organizations must combine three facets of strategy: Technical, Organizational and Cultural.

The difficulty for many who have succeeded in Analytics but are afraid to jump into Big Data is the simple fact that it’s hard for many to truly understand what Big Data really is. I can’t blame someone for not wanting to invest in something that they can’t understand. At least “science” is a word that people can relate to and that’s why Peng/Leek focused on their phrase immediately as they began their book. It gives you a point of reference.

Unfortunately Big Data is an entirely different beast. I wish I could write something profound like “The most important word in Big Data is big” or “The most important word in Big Data is data” to help you focus. But the truth is the most important word in “Big Data” is neither big, nor data. The best way to describe it is actually a set of 3 words: Volume, Velocity and Variety. However the hard part, even for the Qlik Dork, to explain is that none of them alone explains the concept; you need to refer to them in combination, and here is why:

Volume – Just because your organization has Gajigbytes of data doesn’t mean you need to turn to Big Data. Relational database systems, especially Teradata, can be grown to be as large as you will ever need so it’s not just volume that forces the issue.

Velocity – Simply means the speed with which the data is coming. There are all sorts of interfaces that handle rapidly moving data traffic so again, that alone doesn’t constitute a need for Big Data.

Variety – In the context of the Big Data field it is most often used to refer to the differences between structured and unstructured data. Unstructured data would be things like documents, videos, sound recordings etc. Don’t let me shock you when I say this, but I was storing those things in SQL Server 20 years ago as BLOBs (binary large objects). So guess what, again this “variety” by itself isn’t what big data is about.

So what then is Big Data? It is a combination of all 3 of those things and, oh by the way, you also need to include business components like time and money. Big Data is centered around the fact that you can use commodity hardware, including much cheaper disks than you would typically use for large Storage Area Network (SAN) disk infrastructure. The reason that it is typically considered “faster” in terms of storage is that it doesn’t deal with transactions and rows; it simply deals with big old blocks of data, so massive files are a breeze to store. The fact that it is block/file oriented means it doesn’t really matter what you throw at it. A stack of CSV or XLS or XML files, a bunch of streaming video or HL7 or sound, no problem. You throw and go.

So you can store a wide variety of data, quicker and at less cost than you would using a traditional RDBMS type system. A bonus is the time savings, because nobody in IT really needs to be involved in the process once the infrastructure is put in place. You can have data available and within no time your analysts or your data scientists can begin consuming the data. No requirements documents. No prioritization process. No planning meetings. Very little overhead. And oh by the way it allows the business to actually own the process of solving the problems that the business has. Crazy concept I know.


Enough of my musing, let’s just get down to a few practical examples.

Vaccinations and Side Effects

This week I met two of the most wonderful young Data Scientists. Liam Watson and Misti Vogt just graduated from Cal State Fullerton and delivered a presentation at the Teradata Conference in Atlanta, Georgia on a phenomenal use for data regarding the side effects of vaccinations. In the coming weeks I will be presenting their research and application, but I wanted to quickly plant a seed regarding their work that I think makes an excellent pitch for those of you who may be on the fence about proceeding with Data Science or Big Data.

Much of the “science” of what they did revolved around forms that parents completed to report side effects after getting their child vaccinated. The form, like so many in healthcare and other industries, is a typical check this box for this condition, check that box for that condition … Other (Please type in) kind of thing. The check boxes would be considered structured data. The “other” would certainly be considered unstructured 0’s and 1’s that get manufactured in our EHR factories and left to accumulate dust.


If these two used Static Reporting they would have had no choice but to simply ignore the “other” category and count up how many of A, B, C, D or E were checked. But let’s face it if these two were ordinary I wouldn’t be talking about them. Instead they chose the path of using Data Science (which says you can’t leave data behind just because it doesn’t fit your simple report query model and isn’t clean) and they needed to use Big Data because it provides them with so many wonderful text analytics functions.

What they uncovered was that White Blood Cell Disorder, which came from the hand input “Other” text box, was the third highest side effect. To me that’s like gold. It’s a discovery that quite simply would be overlooked in a traditional environment because it didn’t fit the “we can only deal with structured data” mold.

There is a lot of time and effort expended in tracking physicians and beating them over the heads if they don’t sign off on documentation in a timely manner. I certainly understand that without their signature the organization doesn’t get paid. But I can’t help but wonder what gold may be lying in the textual notes that physicians dictate daily. Don’t believe your organization is ready for Data Science and Big Data to mine for that gold? Not sure what you are thinking.


I recently recorded a video showcasing a stunning use of Data Science and Big Data that was created by two of Qlik’s partners, Bardess Group and Cloudera. The application demonstrates the impact that accumulating data quickly from a wide variety of sources like weather, flights, mosquito populations, suspected and reported Zika infections and supply chain data could have when brought to bear on a problem like Zika.

Right now most organizations are still struggling to understand their own costs and understand their own clinical variances. Move to a population health model? Unthinkable for them as they can’t produce the static reports nor consume them fast enough to understand their own patients, let alone begin consuming data from payers, the census bureau etc.

As you watch the video and you hear the variety of data sources involved in the Zika demo, imagine the time and energy that would have to go into a project to do the same thing in a traditional way. As much as I “like” the work they’ve done to help with the Zika virus issue (and the work is continuing with aid agencies and hospitals), I “love, love, love” the use case it makes for the healthcare world that we need to embrace Data Science and Big Data not run from it because neither fits our current working models.


Blaise Pascal, the 17th century mathematician, once wrote “People almost invariably arrive at their beliefs not on the basis of proof, but on the basis of what they find attractive.” We have science that can help us find truth in data and yet we continue to perpetuate treatment plans based on myths and hearsay.

We know our current organizational structures are failing to keep pace with the onslaught of changes and the amounts of data we are generating. But instead of changing to grow cultures that are more data fluent, organizations are converting employees to 2×2 cubes so that they can “collaborate” more. No more data is being consumed, but at least the status quo is maintained and employees now get to hear endless conversations with spouses and children.

Would I be wrong if I guessed that your organization has a backlog of hundreds of reports, while the previous 10,000 are seldom if ever read? What if I guessed that the morale of the report writers is at an all time low because new requests are far outpacing their ability to generate them?

In his book Big Data for Executives, author David Macfie puts it pretty eloquently: “In a traditional system the data is always getting to you after the event. With Data Science/Big Data the goal is to get the information into your hands before the event occurs.” Put simply, static reporting and traditional processes aren’t designed to handle the crisis of overrun data centers. I’m not sure what in the world organizations that are doubling down on static reports are thinking.

To be honest I’m not entirely sure what in the world I was thinking taking so long to write this as my thoughts have been bubbling up for so long. If you have yet to actually begin researching or are among those burying your head in the sand and ignoring Data Science and Big Data then you know what is coming … What in the world are you thinking?

Posted in Data Science / Big Data

Visualizing Data that does not exist … aka Readmissions Dashboard

Many who make requests seem to have a belief that Business Intelligence is magic. They lose their ability to listen to logic and reason and simply ask you to do the impossible.


Pulling data from 18 different sources, many of which you don’t even have access to. Child’s play, like pulling a rabbit from a hat.

Turning bad into good and interpreting the meaning of the data. A little tougher, kind of like making your stunning assistant float in midair.

Creating a readmissions dashboard. Hey we aren’t Houdini.

That data doesn’t even really exist. Oh sure it exists in the minds of the people who want you to produce it out of thin air, but I’ve yet to see a single Electronic Health Record that stored readmission data. They only store admission data, not RE-admission data.

Patient Name | Admission Date | Discharge Date
John Doe     | 1/1/2016       | 1/4/2016
John Doe     | 1/7/2016       | 1/10/2016
John Doe     | 1/30/2016      | 2/4/2016

Those who want dashboards for Readmissions look at data like the above and talk to you like you are insane, because in their minds it is clear as day that John Doe was readmitted on 1/7, 3 days after his first visit, and was then readmitted again on 1/30, 20 days after his second visit.

You try to explain to them that there is nothing in any of those rows of data that says that. They have filled in the missing data in their minds but in reality it doesn’t exist in the EHR. They respond with all you need to do is have the “report” do the same thing and compare the admission date to the discharge date for subsequent visits. You respond with “Let’s say I could make SQL which is a row based tool magically compare rows, what should I do about the following which is more like the real data?”

Patient Name | Admission Date | Discharge Date | Patient Type
John Doe     | 1/1/2016       | 1/4/2016       | Inpatient
John Doe     | 1/7/2016       | 1/10/2016      | Outpatient
John Doe     | 1/30/2016      | 2/4/2016       | Inpatient

They say “Oh that’s easy, when you get to the visit on 1/30 just skip the visit from 1/7 because it’s an outpatient row and we don’t really care about those and compare the 1/30 admission to the 1/4 discharge.” To which you respond “Well that’s easy enough now I’ll not only somehow make SQL which can’t compare rows magically try to compare rows and if it is an outpatient row I’ll tell SQL to skip it and compare it to something 2 rows above, or maybe 3 rows above or 10 rows above.”

Just then you remember the reality is more complicated than that. In reality you aren’t just comparing all inpatient visits (other than for fun); what you really care about is whether the visits were for the same core diagnosis or not.

Enc ID | Patient Name | Admission Date | Discharge Date | Patient Type | Diagnosis
1      | John Doe     | 1/1/2016       | 1/4/2016       | Inpatient    | COPD
2      | John Doe     | 1/7/2016       | 1/10/2016      | Outpatient   | Stubbed toe
3      | John Doe     | 1/30/2016      | 2/4/2016       | Inpatient    | Heart Failure
4      | John Doe     | 2/6/2016       | 2/10/2016      | Inpatient    | COPD
5      | John Doe     | 2/11/2016      | 2/16/2016      | Inpatient    | Heart Failure

You don’t want to compare the 1/30 visit to the 1/4 discharge because the diagnoses aren’t the same. You only want to compare the 2/6 visit to the 1/4 discharge (both COPD), and you need to compare the 2/11 visit with the 2/4 discharge (both Heart Failure).

If you think this is like making a 747 disappear before a crowd of people on all sides, just wait it gets worse.

Not only does the EHR not include the “readmission” flags, it doesn’t really tell you what core diagnosis the visit should count as. Instead what they really store is a table of 15-25 diagnosis codes:

Enc ID | ICD9_1 | ICD9_2 | ICD9_3 | ICD9_4 | ICD9_… | ICD9_25
1      | 491.1  | 023.2  | 33.5   | V16.9  | …      | 37.52

Good thing for your company you used to be a medical coder so you actually understand what the mysterious ICD9 or ICD10 codes stand for. You know for instance that the 491.1 really means “Mucopurulent chronic bronchitis.” It would be nice if that correlated directly to saying “This patient visit is for COPD.” But since we are uncovering magic why not explain the whole trick. You see if the primary diagnosis code is any of the following:

491.1, 491.20, 491.21, 491.22, 491.8, 491.9, 492.0, 492.8, 493.20, 493.21, 493.22, 494.0, 494.1, 496

Then the visit may be the result of COPD, but you also have to check all of the other diagnosis codes and ensure that none of them contain any of the following codes:

33.51, 33.52, 37.51, 37.52, 37.53, 37.54, 37.62, 37.63, 33.50, 33.6, 50.51, 50.59, 52.80, 52.82, 55.69, 196.0, 196.1, 196.2, 196.3, 196.5, 196.6, 196.8, 196.9, 197.0, 197.1, 197.2, 197.3, 197.4, 197.5, 197.6, 197.7, 197.8, 198.0, 198.1, 198.2, 198.3, 198.4, 198.5, 198.6, 198.7, 198.81, 198.82, 198.89, 203.02, 203.12, 203.82, 204.02, 204.12, 204.22, 204.82, 204.92, 205.02, 205.12, 205.22, 205.82, 205.92, 206.02, 206.12, 206.22, 206.82, 206.92, 207.02, 207.12, 207.22, 207.82, 208.02, 208.12, 208.22, 208.82, 208.92, 480.3, 480.8, 996.80, 996.81, 996.82, 996.83, 996.84, 996.85, 996.86, 996.87, 996.89, V42.0, V42.1, V42.4, V42.6, V42.7, V42.81, V42.82, V42.83, V42.84, V42.89, V42.9, V43.21, V46.11
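To make the two-part rule concrete outside of Qlik, here is a minimal Python sketch of it. It is only an illustration of the logic described above, not the script from this post: the function name `core_diagnosis` is made up, the exclusion set is deliberately truncated to a handful of the codes listed above, and it compares whole codes rather than using wildcards.

```python
# Primary diagnosis codes that may indicate COPD (full list from the text above).
COPD_PRIMARY = {
    "491.1", "491.20", "491.21", "491.22", "491.8", "491.9", "492.0",
    "492.8", "493.20", "493.21", "493.22", "494.0", "494.1", "496",
}

# Truncated for illustration only; the real exclusion list is much longer.
COPD_EXCLUSIONS = {"33.51", "33.52", "37.51", "37.52", "V43.21", "V46.11"}


def core_diagnosis(codes):
    """Classify one encounter from its diagnosis codes, primary code first.

    Returns 'COPD' when the primary code is on the COPD list and none of
    the remaining codes is on the exclusion list, otherwise 'Nothing'.
    """
    if not codes:
        return "Nothing"
    primary, others = codes[0], codes[1:]
    if primary in COPD_PRIMARY and not COPD_EXCLUSIONS.intersection(others):
        return "COPD"
    return "Nothing"


# The sample row from the table above carries 37.52, which is an exclusion.
sample_row = ["491.1", "023.2", "33.5", "V16.9", "37.52"]
print(core_diagnosis(sample_row))          # Nothing
print(core_diagnosis(["491.1", "023.2"]))  # COPD
```

Notice that the sample encounter shown earlier would not count as COPD under this rule, because one of its secondary codes is on the exclusion list. That is exactly the kind of detail the people asking for the dashboard never see.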

If you have ever been asked to produce a Readmissions Dashboard you probably understand why I’ve correlated this to magic. Every time you think you know how to grab the rabbit by the ears to accomplish the trick, the rabbit changes into an elephant.

Fortunately your assistant isn’t the traditional 6 foot blonde, your assistant is Qlik. I’m going to explain how to make the 747 disappear in three easy steps that any of you will be able to reproduce:

Step 1

The heavy lifting for this trick actually involves the ICD9/10 codes. If you combine the 15-25 diagnosis codes into 1 field, then you can use it to more easily compare the values to determine what core diagnosis you need to assign to each encounter. Qlik helps you accomplish that with simple concatenation as you are loading your encounter diagnosis data:

ICD9_Diagnoses_1 & ', ' & ICD9_Diagnoses_2 & ', ' & ICD9_Diagnoses_3 & ', ' &
ICD9_Diagnoses_4 & ', ' & ICD9_Diagnoses_5 & ', ' & ICD9_Diagnoses_6 & ', ' &
ICD9_Diagnoses_7 & ', ' & ICD9_Diagnoses_8 & ', ' & ICD9_Diagnoses_9 & ', ' &
ICD9_Diagnoses_10 & ', ' & ICD9_Diagnoses_11 & ', ' & ICD9_Diagnoses_12 & ', ' &
ICD9_Diagnoses_13 & ', ' & ICD9_Diagnoses_14 & ', ' & ICD9_Diagnoses_15 as [All Diagnosis]

Step 2

One of the really nifty tricks that Qlik can perform in data loading is a preceding load. A preceding load simply means you have the ability to write code to refer to fields that don’t exist yet and won’t exist until the code is actually run. The following code is abbreviated slightly so that it’s easier to follow logically, but the entire set of code is attached to the post so that you can download it. The “Load *” right below Encounters tells Qlik to load all of the other fields from the second load statement first, then come back and do the code below. This way we can construct the [All Diagnosis] field and refer to it within this code. You could repeat all of the logic for concatenating all of the fields for all 5-10 of the core diagnoses you want to track, or you could load the encounters and simply do a subsequent join load, but you don’t have to. The preceding load makes your life easy and works super fast.


Encounters:
// This is the preceding load
Load *,
    // If the primary matches then it's possibly COPD, and if none of the other 14 is one of the values listed then it definitely is COPD
    IF ( Match([ICD9 Diagnoses 1], '491.1', '491.20' … '493.21', '493.22', '494.0', '494.1', '496') > 0
        And WildMatch([All Diagnosis], '*33.51*', '*33.52*', '*37.51*' … '*V43.21*', '*V46.11*') = 0, 'COPD',
    // If we found COPD great, otherwise we need to check for Sepsis
    IF ( Match([ICD9 Diagnoses 1], '003.1', '027.0', … '785.52') > 0
        And WildMatch([All Diagnosis], '*33.50*', '*33.51*' … '*V43.21*', '*205.32*') = 0, 'Sepsis',
    'Nothing')) as [Core Diagnosis];

// This is the regular load from the database or file
[ICD9 Diagnoses 1],
[ICD9 Diagnoses 2] …..

Step 3

The final step, which many believe to be the hardest, is actually the easiest to do within Qlik. In fact, truth be told, when I was a young whippersnapper starting out on my Qlik journey I tried to do everything in SQL because I knew it so well, and did minimal ETL within Qlik itself until I found out about this Qlik ETL function. The function is simply called “Previous.” It does exactly what it sounds like … it allows you to look at the previous row of data. Seriously, while you are on row 2 you can check the value of a field on row 1. In practice it works just like this:

IF(MRN = Previous(MRN) …..

How cool is that? How do I use it for solving this readmissions magic trick? Just like this:

IF(MRN = Previous(MRN), 'Yes', 'No') as [Inpatient IsReadmission Flag],

If the MRN of the row I’m on now is the same as the MRN of the previous row, then yes this is a readmission; otherwise no, this is not a readmission, it is a new patient’s first admission. Actually that’s the simplified version of my code.

My code actually thinks through how the results would need to be visualized. Besides an easy human language Yes/No flag, someone is going to want to get a count of the readmissions, right? Does the Qlik Dork want to have charts or expressions that would have to use IF statements to say if the flag = Yes? Of course not. I want the ability to have a field that is both human readable as Yes/No and computer readable for counting as 1/0. That’s where the magic of the DUAL function comes into play. It gives me a single field that can be used for both needs.

IF(MRN = Previous(MRN), Dual('Yes', 1), Dual('No', 0)) as [Inpatient IsReadmission Flag],

Using the Dual data type allows me to provide the end user with a list box while also allowing me to provide very fast performing expressions:

Sum([Inpatient IsReadmission Flag])
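If the idea of one value wearing two hats feels abstract, here is a tiny Python sketch of the same concept. To be clear, this is not how Qlik implements Dual; the `Dual` class below is a made-up illustration of a value that displays as text but adds up as a number.

```python
class Dual(int):
    """An integer that also carries a display text (illustrative only)."""

    def __new__(cls, text, number):
        obj = super().__new__(cls, number)
        obj.text = text
        return obj

    def __str__(self):
        # Show the human readable text when the value is displayed.
        return self.text


flags = [Dual("Yes", 1), Dual("No", 0), Dual("Yes", 1)]

print([str(f) for f in flags])  # ['Yes', 'No', 'Yes']
print(sum(flags))               # 2
```

The list of flags reads as Yes/No for a person, while a plain sum gives the count of readmissions with no IF logic at all.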

How does the entire Readmissions load work? After loading the encounters, and allowing the preceding load to qualify the encounters into core diagnosis types, I simply do a self-join to the encounter table referring only to the inpatient records and ordering the data by the MRN and the Admission date and time.

Left Join (Encounters)
Load
    EncounterID, // key field so the join lines up with the right encounter (assumed; the abbreviated listing omits it)
    IF(MRN = Previous(MRN), Dual('Yes', 1), Dual('No', 0)) as [Inpatient IsReadmission Flag],
    IF(MRN = Previous(MRN), Previous([Discharge Dt/Tm])) as [Inpatient Previous Discharge Date],
    IF(MRN = Previous(MRN), Previous(EncounterID)) as [Inpatient Previous EncounterID],
    IF(MRN = Previous(MRN), NUM(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])), '#,##0.00')) as [Inpatient Readmission Difference],
    IF(MRN = Previous(MRN), IF(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])) <= 30.0, Dual('Yes', 1), Dual('No', 0)), Dual('No', 0)) as [Inpatient IsReadmission within 30]
Resident Encounters
Where [Patient Type] = 'Inpatient'
Order by MRN, [Admit Dt/Tm];
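For anyone who wants to see the “look at the previous row” idea without Qlik in front of them, below is a rough Python sketch of the same steps: keep only inpatient rows, sort by patient and admission date, then compare each row with the one before it. The MRN value "A" and the function name `flag_readmissions` are assumptions made for the example; the dates come from the John Doe table earlier in the post.

```python
from datetime import date

# (enc_id, mrn, admit, discharge, patient_type); the MRN "A" is hypothetical.
encounters = [
    (1, "A", date(2016, 1, 1), date(2016, 1, 4), "Inpatient"),
    (2, "A", date(2016, 1, 7), date(2016, 1, 10), "Outpatient"),
    (3, "A", date(2016, 1, 30), date(2016, 2, 4), "Inpatient"),
    (4, "A", date(2016, 2, 6), date(2016, 2, 10), "Inpatient"),
    (5, "A", date(2016, 2, 11), date(2016, 2, 16), "Inpatient"),
]


def flag_readmissions(rows, window_days=30):
    """Return {enc_id: info} for inpatient rows, comparing each to the prior row."""
    inpatient = sorted(
        (r for r in rows if r[4] == "Inpatient"),
        key=lambda r: (r[1], r[2]),  # order by MRN, then admission date
    )
    results = {}
    previous = None
    for row in inpatient:
        enc_id, mrn, admit, _discharge, _ptype = row
        if previous is not None and previous[1] == mrn:
            gap = (admit - previous[3]).days  # days since the prior discharge
            results[enc_id] = {
                "is_readmission": True,
                "previous_enc": previous[0],
                "days_since_discharge": gap,
                "within_window": gap <= window_days,
            }
        else:
            results[enc_id] = {
                "is_readmission": False,
                "previous_enc": None,
                "days_since_discharge": None,
                "within_window": False,
            }
        previous = row
    return results


flags = flag_readmissions(encounters)
print(flags[3]["days_since_discharge"])  # 26
```

The outpatient stubbed toe never enters the comparison, so the 1/30 admission is measured against the 1/4 discharge, which is what the requesters were picturing in their heads all along.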

If you are paying attention you’ll notice that the above is simply our “for fun” counts to show all inpatient readmissions and has nothing to do with any of the core diagnosis. In order to perform that trick I do the same basic steps but I enhance my where clause to only look for encounters that have a core diagnosis of COPD and I simply name my flags and other fields differently.

Left Join (Encounters)
Load
    EncounterID, // key field so the join lines up with the right encounter (assumed; the abbreviated listing omits it)
    IF(MRN = Previous(MRN), Dual('Yes', 1), Dual('No', 0)) as [COPD IsReadmission Flag],
    IF(MRN = Previous(MRN), Previous([Discharge Dt/Tm])) as [COPD Previous Discharge Date],
    IF(MRN = Previous(MRN), Previous(EncounterID)) as [COPD Previous EncounterID],
    IF(MRN = Previous(MRN), NUM(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])), '#,##0.00')) as [COPD Readmission Difference],
    IF(MRN = Previous(MRN), IF(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])) <= 30.0, Dual('Yes', 1), Dual('No', 0)), Dual('No', 0)) as [COPD IsReadmission within 30],
    IF(MRN = Previous(MRN), IF(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])) <= 90.0, 'Yes', 'No'), 'No') as [COPD IsReadmission within 90]
Resident Encounters
Where [Patient Type] = 'Inpatient' and [Core Diagnosis] = 'COPD'
Order by MRN, [Admit Dt/Tm];

And just when you think I’ve pulled as much handkerchief out of my sleeve as it can possibly hold, I do the same steps for Sepsis this time.

Left Join (Encounters)
Load
    EncounterID, // key field so the join lines up with the right encounter (assumed; the abbreviated listing omits it)
    IF(MRN = Previous(MRN), Dual('Yes', 1), Dual('No', 0)) as [Sepsis IsReadmission Flag],
    IF(MRN = Previous(MRN), Previous([Discharge Dt/Tm])) as [Sepsis Previous Discharge Date],
    IF(MRN = Previous(MRN), Previous(EncounterID)) as [Sepsis Previous EncounterID],
    IF(MRN = Previous(MRN), NUM(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])), '#,##0.00')) as [Sepsis Readmission Difference],
    IF(MRN = Previous(MRN), IF(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])) <= 30.0, Dual('Yes', 1), Dual('No', 0)), Dual('No', 0)) as [Sepsis IsReadmission within 30],
    IF(MRN = Previous(MRN), IF(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])) <= 90.0, 'Yes', 'No'), 'No') as [Sepsis IsReadmission within 90],
    IF(MRN = Previous(MRN), IF(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])) <= 120.0, 'Yes', 'No'), 'No') as [Sepsis IsReadmission within 120],
    IF(MRN = Previous(MRN), IF(Interval([Admit Dt/Tm]-Previous([Discharge Dt/Tm])) > 120.0, 'Yes', 'No'), 'No') as [Sepsis IsReadmission > 120]
Resident Encounters
Where [Patient Type] = 'Inpatient' and [Core Diagnosis] = 'Sepsis'
Order By MRN, [Admit Dt/Tm];

And then for AMI. And then for CHF. And then for … Oh you know the handkerchief can go on forever and eventually we end up with a data model that includes all of these awesome fields that didn’t exist when we began so that we can actually do our work.
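Since the handkerchief is the same trick repeated once per diagnosis, the repetition can be pictured as a loop. The Python sketch below is only a way to see the pattern, not a replacement for the Qlik loads above; the field names, the MRN "A" and the helper `readmission_gaps` are all assumptions, and the data is the John Doe table from earlier.

```python
from datetime import date

# Inpatient and outpatient rows from the John Doe example; MRN "A" is hypothetical.
encounters = [
    {"enc": 1, "mrn": "A", "admit": date(2016, 1, 1), "discharge": date(2016, 1, 4),
     "type": "Inpatient", "core": "COPD"},
    {"enc": 2, "mrn": "A", "admit": date(2016, 1, 7), "discharge": date(2016, 1, 10),
     "type": "Outpatient", "core": "Stubbed toe"},
    {"enc": 3, "mrn": "A", "admit": date(2016, 1, 30), "discharge": date(2016, 2, 4),
     "type": "Inpatient", "core": "Heart Failure"},
    {"enc": 4, "mrn": "A", "admit": date(2016, 2, 6), "discharge": date(2016, 2, 10),
     "type": "Inpatient", "core": "COPD"},
    {"enc": 5, "mrn": "A", "admit": date(2016, 2, 11), "discharge": date(2016, 2, 16),
     "type": "Inpatient", "core": "Heart Failure"},
]


def readmission_gaps(rows, diagnosis):
    """Days between discharge and next admission for one core diagnosis.

    Returns {enc_id: gap_in_days} for every inpatient encounter that
    follows an earlier inpatient encounter with the same diagnosis.
    """
    subset = sorted(
        (r for r in rows if r["type"] == "Inpatient" and r["core"] == diagnosis),
        key=lambda r: (r["mrn"], r["admit"]),
    )
    gaps = {}
    for prev, cur in zip(subset, subset[1:]):
        if prev["mrn"] == cur["mrn"]:
            gaps[cur["enc"]] = (cur["admit"] - prev["discharge"]).days
    return gaps


# One pass per core diagnosis, mirroring one Left Join load per diagnosis.
by_diagnosis = {d: readmission_gaps(encounters, d) for d in ("COPD", "Heart Failure")}
print(by_diagnosis)  # {'COPD': {4: 33}, 'Heart Failure': {5: 7}}
```

With this sample, the COPD return visit falls 33 days after the earlier COPD discharge, so it would miss a 30 day flag but count for a 90 day one, while the Heart Failure return is only 7 days out.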


Voila a Readmissions Dashboard

Not only can we then provide a really nice looking dashboard which includes accurate statistics we can do it using very simple expressions that are incredibly fast.



Click this link to get the entire Readmissions Code start script: ReadmissionsCodeScript

Posted in Visualization

Have you ever wondered …

Have you ever wondered what events happen to patients after a particular surgery is performed?


Well I did. Like I seriously can’t sleep when I start wondering about things like that. I start believing crazy things like we can change the world by using analytics. What do you do when you get crazy analytical questions in your head? Do you just let them go or do you dig and scratch and claw until you pull the data together and solve the puzzle?

In this case even though it’s just a hypothetical example for a blog post I still worked crazy hours setting up the data, building the application, filming the video and writing this post. Why? Because I think there is huge value in tracking not just the variances in costs and timing for individual procedures but in analyzing an entire series of events as well.

Notice I used the word “events” and not just “procedures.” Certainly it would be nice to know if having 1 procedure leads to another procedure in 75% of the cases for a physician. But wouldn’t it also be nice to know how often a procedure leads to a patient having a Code Blue? Or having to have a tube placed? You know … KEY MEDICAL EVENTS in a patient’s stay. Or … even their return after a stay?

OK, now that we are all agreed that me working crazy hours to set this up is a valuable exercise, let’s examine what I will demonstrate in my video.

  1. I use an Aster NPath SQL-MR query just like in a previous post to process a set of surgical event data that I’ve loaded.
  2. I also take advantage of Qlik’s ability to do some cool ETL things on the fly and I capture the First Event and the Last Event so that in the UI I can choose which procedure I want to start with. Likewise, in your world you could select the last event to occur and find the various paths that preceded that event’s occurrence.
  3. While I was at it I also load in some sample patient demographic information to demonstrate that the advanced analytics you can do with Teradata Aster doesn’t have to be visualized in a vacuum. Of course you will want to take advantage of the Qlik Associative model and load data from as many sources as needed.
  4. The application consists of two basic screens. The first is a blah-blah-blah you can filter the data using demographic information and see the results of the NPath query visualized in a Sankey Diagram just like you would expect. The second screen is more a “Are you kidding me I didn’t know you could do Alternate States in Qlik Sense like you can in QlikView” kind of thing you would expect from a Qlik Dork. I demonstrate the ability to compare the event paths between different patient sets thanks to the great extensions built by Svetlin Simeonov.

I could have just shared the video, but where is the fun in that? I had to do a little creative setup so that you would understand what you were watching.

Do I think you are going to run right out and start building an application like this to analyze surgical events?

Of course I do. I’m a dreamer. I wouldn’t put this kind of effort into something if I didn’t believe it would spark an interest in at least a few of the readers to really start putting advanced analytics to work. Perhaps not for this specific situation but certainly there is some other big problem you’ve wanted to tackle that is like this. You have all of the pieces you need at your fingertips … so GET GOING!


Posted in Data Science / Big Data, Visualization

Thousands, and Millions and Billions … Oh My!!!!!

When most people think of Qlik they think of our patented Qlik Indexing Engine having all of your data in memory. I love demonstrating the lightning fast speeds and responsiveness with hundreds of millions of rows of data. More and more recently though I’m getting the smiles mixed with “That’s awesome but can you handle billions of rows of data?”

C’mon really????? Billions of rows of data? Gosh that’s an awful lot of data. I’m afraid.

Just kidding, even that much data doesn’t scare me.

In fact it thrills me.

Gives me goose bumps to think about the kind of decisions that can be made when that much data is made available to the analysts and the decision makers. It also provides an opportunity for me to discuss one of the least known features that Qlik offers. It’s called Direct Discovery and it allows you to consume even billions of rows of data.

Direct Discovery

Direct Discovery is a two-step process. In step 1 Qlik reads just enough information to allow the end user to select a cohort. Step 2 then uses the primary key information for that cohort to go back to the massive data store and read all of the details live.

Oh wait you want an example? More details? Well since you asked so nicely.

Typically with Qlik you would read all of your data from the source with a command like:

SQL Select {my fields} from {some table};

It would bring all of the data back, perform our Qlik magic on it to compress it in memory and you would be off to the races. With Direct Discovery the query is different and uses a different syntax. You start with something like this:






When the data load encounters that, Qlik actually issues 2 separate commands to the source:

  1. Select distinct record_id
  2. Select distinct procedure

Why? Because it’s easier and faster of course. The data source only has to prepare a minimal number of records. Your network only has to transmit a minimal amount of data. Finally Qlik only has to read a minimal amount of data.

The final part of the syntax would be something like:









from surgery_events;

The fields that you identify in the DETAIL section of the command are usable immediately within Qlik despite the fact that it doesn’t actually retrieve the data for them. You can see the field names in the data model viewer; they just show as having 0 rows of data. You can see the fields in a field list. You can add the fields to charts. There just isn’t any real data for them. Yet anyway.

Your application is then designed to allow the end user to select a cohort using the DIMENSION fields in some way and then Qlik will go and retrieve the data live from the data source for that cohort.
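If it helps to see the two steps outside of Qlik, here is a minimal Python sketch of the same idea using an in-memory SQLite table as a stand-in for the big data store. To be clear about what is assumed: only the surgery_events table and the record_id and procedure fields come from this post. The event_name and event_minute columns, the sample rows and the helper names load_dimensions and fetch_details are all made up for illustration, and the SQL that Qlik really generates against your source will look different.

```python
import sqlite3

# Hypothetical stand-in for the massive source. Only surgery_events, record_id
# and procedure come from the post; the other columns and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE surgery_events ('
    'record_id INTEGER, "procedure" TEXT, event_name TEXT, event_minute INTEGER)'
)
conn.executemany(
    "INSERT INTO surgery_events VALUES (?, ?, ?, ?)",
    [
        (1, "Knee Replacement", "Incision", 12),
        (1, "Knee Replacement", "Closure", 95),
        (2, "Hip Replacement", "Incision", 10),
        (3, "Knee Replacement", "Incision", 15),
    ],
)


def load_dimensions(conn):
    """Step 1: read only the distinct values of each dimension field."""
    record_ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT record_id FROM surgery_events ORDER BY record_id")]
    procedures = [row[0] for row in conn.execute(
        'SELECT DISTINCT "procedure" FROM surgery_events ORDER BY "procedure"')]
    return record_ids, procedures


def fetch_details(conn, selected_procedures):
    """Step 2: go back to the source for the detail rows of the chosen cohort."""
    placeholders = ", ".join("?" for _ in selected_procedures)
    query = (
        "SELECT record_id, event_name, event_minute FROM surgery_events "
        f'WHERE "procedure" IN ({placeholders}) '
        "ORDER BY record_id, event_minute"
    )
    return conn.execute(query, list(selected_procedures)).fetchall()


record_ids, procedures = load_dimensions(conn)
cohort_details = fetch_details(conn, ["Knee Replacement"])
```

The point of the split is that step 1 only ever moves the distinct dimension values, while the heavy detail rows are fetched later and only for whatever cohort was selected.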

I’ve had so much fun working with Teradata Aster lately that it only made sense for me to use my Teradata database as a data source. It provides a robust, high performance and highly reliable storage mechanism for those with massive amounts of data. In the video I use the command above to extract the dimensions, select a cohort of patients, then allow Qlik to extract the data live. Just for fun I also utilize the Aster Management Console to show you the commands that Teradata processes from Qlik to further solidify how it all works. Kind of the extra step you’d expect from me.

You want more don’t you?

The ferocious appetite in you to consume massive amounts of data wants more information doesn’t it? You can check out all of the details on the Qlik Sense help page for Direct Discovery:


The following post contains a fantastic PDF document explaining even more including some nifty variables you can use like the one I documented in the video:


Yes you can even use the Direct Discovery feature for cases where you want closer to real time information from smaller sets of data. You know those situations where you only have a few hundred million rows of data but you still need the functionality of pulling live rather than having pre-loaded all of the detailed data.

Posted in Data Science / Big Data, Training | 1 Comment

Visualizing Advanced Analytics

Advanced Analytics with Aster

I recently stumbled upon Teradata’s Aster and I’m pretty fired up. It turns out there is an entire community dedicated to helping data visualization people like myself learn how to implement advanced analytic functions. The site includes a link to download Aster Express free of charge and includes a slew of great training videos.

Click here to see the Teradata Aster Community

I can almost hear the Data Scientists reading this post laughing at me for just discovering that. Meanwhile all of the Data Visualization people stopped reading and have already clicked the link and started downloading.

Visualization with Qlik Sense

Well if you Data Scientists are so cool did you know that there is likewise an online community site dedicated to helping you learn how to visualize your super cool analytic results? Well did you? The Qlik Sense Community offers similar free downloads for the product as well as a slew of great training videos.

Click here to see the Qlik Sense Community

Guess me and the other Data Visualization peeps get the last laugh after all.

Kidding, and sharing of links aside, this is a serious post about how Data Science and Data Visualization can be married through the partnership of Qlik Sense and Teradata Aster. They are an easy and natural fit. Why?

Because Aster uses an SQL’ish syntax they call SQL-MR. Qlik Sense can easily fire any native SQL-MR directly against Aster, retrieve the results and then visualize them. No need to build out views. No need to save the results into tables. Simply fire the SQL-MR queries directly as written.

By offering a complete set of Open API’s Qlik Sense provides developers around the world the ability to construct visualizations to enhance what is available natively in the product. Like what you ask? Well a Sankey for one thing so you can visualize paths. Network/Graphing objects for another so you can visualize networks. Like … oh go see for yourself at:

Click here to see the Qlik Sense Community for Extensions

For your viewing pleasure

I could write and write and write and bore you to tears … or … I could take advantage of this chance to show off my cool new Qlik Dork video stinger and demonstrate the functionality … visually.

In a mere 3:57 I take the pure NPath SQL-MR query that John Thuma demonstrated in the Aster training video series for bank web clicks data and I implement it inside of Qlik Sense. I then take the results and display them both in raw form and in a Sankey.

Wowed yet? Don’t be, that’s just me getting warmed up. In a paltry 3:05 this second video demonstrates how you can modify the NPath query so that the results aren’t aggregated. Why wouldn’t I allow it to aggregate the million plus paths? So that I can tie the raw paths together with customer demographics information. Allowing you to then discover the paths for selected cohorts. No way!!!

Yes way. C’mon I’m the Qlik Dork of course I would go the extra step for you. I even utilize a mapping object to select customers from selected states. All while the Sankey diagram is being updated to show the paths that were returned from Aster based on the selections.
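For the curious, the reason the un-aggregated paths matter can be sketched in a few lines of Python. This is not the NPath query or the Sankey extension itself, just a hedged illustration of the idea: keep one path per customer, filter by a demographic selection, and only then roll the paths up into the source/target link counts that a Sankey draws. The page names, customer ids, states and the sankey_links helper are all hypothetical.

```python
from collections import Counter

# Hypothetical un-aggregated results: one click path per customer.
paths = [
    {"customer_id": 1, "path": ["HOME", "ACCOUNT SUMMARY", "BILL PAY"]},
    {"customer_id": 2, "path": ["HOME", "FAQ"]},
    {"customer_id": 3, "path": ["HOME", "ACCOUNT SUMMARY", "FAQ"]},
]

# Hypothetical demographics keyed by the same customer id.
customer_state = {1: "FL", 2: "GA", 3: "FL"}


def sankey_links(paths, customer_state, selected_states):
    """Count source -> target steps for customers in the selected states."""
    links = Counter()
    for row in paths:
        if customer_state[row["customer_id"]] not in selected_states:
            continue  # customer is outside the selected cohort
        steps = row["path"]
        for source, target in zip(steps, steps[1:]):
            links[(source, target)] += 1
    return links


florida_links = sankey_links(paths, customer_state, {"FL"})
all_links = sankey_links(paths, customer_state, {"FL", "GA"})
```

Had the paths been aggregated before they left the database, there would be no customer id left to join against, and the cohort filter in the middle of that loop would have nothing to work with.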

But wait! There’s more.

I know you are now fired up and you want more. Don’t worry my friends I’m just getting started down this path of marrying Data Science and Data Visualization. What can you expect next? Keep it a secret but given my background in healthcare it may just have something to do with utilizing an NPath SQL-MR query in Aster to analyze the events for surgical patients but you didn’t hear it from me. After all it’s not like I’m trying to actually help people do real world stuff like that.


Posted in Data Science / Big Data, Visualization | 2 Comments

To achieve, or not to achieve action

Portrait of William Shakespeare

That is the question.

At least it’s the question that we in the business intelligence community should be focusing on. Why weave my title so closely to one of the most famous lines by William Shakespeare?

Simple. Our ability to drive actionable intelligence relies heavily on our ability to weave a story around the data insights that we have discovered.

Discovering that we have 10 serious issues in our company and having $5 in your pocket will get you a cup of coffee at Starbucks. But being able to share the information about even 1 of those issues in a way that leads to actual change will put such a spring in your step that coffee will be unneeded.

In her fantastic book “Storytelling with Data” author Cole Nussbaumer Knaflic introduces two great phrases which really brought about great clarification to me. Exploratory Analysis vs Explanatory Analysis.

Exploratory Analysis covers the actions that we take to do data discovery. It’s the drilling around. Poking under the hood. Using our human intuition to question the data. And the lights that dawn as a result.

Explanatory Analysis on the other hand is the art of being able to use the data to communicate a story that helps induce actions from those that have the power to make them. It involves our ability to use one of the oldest forms of human communication, storytelling, that has sadly become a lost art.

Emotional Call to Action

Storytelling can involve some very in your face kind of messages as a way to ensure that leadership has a call to action. For example imagine that we’ve spent a few days consuming clinical and financial data using a dashboard similar to the following that has multiple linked screens that we utilized to find an issue with a particular set of selections.


We could hold a meeting and put leadership to sleep showing them how cool our ability to navigate is or we can simply lead with a slide like the following that grabs attention.


You probably don’t want to use humorous sarcasm in your presentation to point the finger at a group but I think it works for this post as you kind of expect it from me. The slide includes enough details to incite some action, and by all means include the actions you want to see taken. Of course you may have to prove your details, and that’s exactly why the Storytelling feature in Qlik Sense is so valuable: you can jump in and out of your story to demonstrate the exploratory analysis you have done to support the explanatory analysis you are using in the meeting.


Perhaps your data doesn’t really require such an emotional tug to ensure action is taken. Perhaps all you are trying to do is provide some narration to draw attention and help explain the data.

Consider the following chart before and after a few narrative elements are added to help the audience focus on the important things:





As I share on my About page I am far from an expert on any of the things I write about. I’m reading. Learning. Growing. Every single day just like you with the help of many others in the industry. Data is my thing and I own that. But I will be honest and tell you that providing narration for my stories is not something that comes naturally to me.

In fact the key points above … yeah I stole them. Well not actually stole them so much as I copied them to the clipboard and pasted them into my storyboard from what I think is one of the coolest new elements of technology that I’ve seen in a long time. It’s a narration extension for Qlik Sense that you simply tell which chart you want it to consider and it does the narration for you. That is a serious help to someone like me who is trying to learn how to help my audience understand the data that I’m presenting to them.

The fact that Qlik chose to construct its architecture using an Open API, and the fact that anyone who can code can gain access to the patented Qlik technology while adding value through their secret sauce, is what makes it possible for a group like Narrative Science, who is blazing trails in the field of natural language, to build such an awesome extension.

The following video will let you see the Narrative Science extension in action. If you are a Qlik customer you will get all of the instructions you need and can download this exciting new object from this download location that includes instructions on how to install and has its own video that demonstrates its powerful capabilities.

To achieve, or not to achieve action

There was a day when all we had to do in our field was surface data. Yeah those days have long since passed. Our jobs now entail not only finding the needles in the data hay stacks but helping our leadership teams understand them so that they can take action. I challenge you today to grow not only in the field of Exploratory Analysis but also in the emerging field of Explanatory Analysis.

Become a storyteller.

Add narration to your charts rather than just pasting them into presentations because you think they look pretty.

Use your newly developed skills to “incite action” and effect real change in your organizations.

Finally quit being selfish and keeping my tips to yourselves. For crying out loud start sharing these pages with others.

Posted in Data Literacy | 2 Comments

A Bunch of Whiny Brats

Ever have one of those days where you feel like you are surrounded by a bunch of whiny brats? No I’m not talking about your children (or grand children in my case.) I’m talking about your leadership team. You’ve written thousands of reports. Labored over hundreds of applications. Yet they keep whining about wanting Actionable Intelligence.

Your Reaction

You beat your head against the wall to surface data from a cocktail napkin and merge it with 147 other data sources from database systems, Excel sheets and external data sources on the web and you make it work. You put all of the data into an amazing analytical application that is truly Functional Art that even Alberto Cairo would give you two thumbs up for. But without even so much as a pat on the back for the great job the first response is “We want something simpler. We already have Executive Portal can’t you just embed those charts into the site we already have a link to?”

A bunch of whiny brats, right? It’s just one more link to save to your favorites. It’s just one more application to learn. But noooooo they want to press the easy button because, unlike you who has to learn 189 things per day to stay current, they don’t want to change their delicate little processes.

Embedded Analytics

Well don’t be dismayed my friend, there are whiny brats like that all over the world and the Qlik platform enables you to support them. I’m not joking. The Qlik API’s enable you to take the gorgeous work you’ve done and embed the KPI’s or charts directly into your existing portal, and in this quick 6 and a half minute video I show you exactly how to do that.

Ok now how could anyone complain about this, right? You can embed your genius analytical solutions right into the portal they use every day. You can embed Finance related data right into their SharePoint page and it relates and allows interaction.

C’mon even your leadership team has to stand back in awe. Amazed at your skill and the innovation of Qlik’s platform to support that kind of functionality. Right?


These are whiny little brats you are dealing with. Their first reactions are “That’s pretty nice but I don’t want to see the same 5 charts that Bob sees. I need to control my own dashboard because I’m the center of my universe.”

Are you kidding me??? They have access to key information on their mobile device from their executive portal and that isn’t enough?

No it’s not enough.

The reality is that your leadership team aren’t whiny little brats; they are savvy business people who need to constantly push the threshold. They need access to the company data that has been kept from them for years. For crying out loud their mothers use Pinterest every day to “pin” recipes and come back to them whenever they want. Yet there you stand telling them that every time they want something added/removed from the portal they have to fill out a ticket request and wait for you to be the bottleneck in their accessing the information they need to do their job?

Self Service Dashboards

C’mon this is Qlik we are talking about. A company named by Forbes as one of the Top 10 Innovative Growth companies. Of course they can provide Self Service Dashboard capabilities. What do you think they are doing just helping you visualize data on your own workstation?

How simple can they make it? You know that Pinterest site that has had “pins” pressed over 50 Billion times … yeah … they’ve made it that simple. In this short 4 minute video I demonstrate how to do the same thing.

An Innovative Platform

“There are no dreams too large, no innovation unimaginable and no frontiers beyond our reach.” – John S Herrington.

“There’s a way to do it better – find it.” – Thomas Edison

Unless your leaders can consume it, your company’s data is not an asset, it is a very expensive liability. Qlik is providing you a platform that allows only your mind to limit how you surface it. You have right now at your disposal the tools to surface your data via embedded analytics on your existing portals as well as allowing your staff to surface only the data they are actually interested in via their own personalized dashboards by simply “pinning” objects.

Actionable Intelligence

Just building data visualizations isn’t the answer. Presenting Actionable Intelligence in a way that can be consumed and acted upon is the goal. Now that you know what’s available it’s just a matter of whether you want to innovate the way data is consumed within your company or not.

Posted in Self Service, User Adoption | 4 Comments

Visualizing How to Improve

Besides helping customers by day and being an all around Qlik Dork at other times I happen to have a very strong passion for helping fastpitch softball players elevate their game.

When I say elevate their game I mean getting over their greatest fears so that they can play the game like they OWN IT.

I have zero interest in spending hours of my life working with players on how to improve the minutia of their game (foot work for a double play, where to go to receive a cutoff, etc), that’s where their coaches and hours and hours and hours of practice come in. What I teach them to do is dive. Head first. All out. No fear. Diving aggressively with no fear.

Every aspect of their game improves so astronomically once they overcome that fear that the rest of their game falls into place. Click this link and watch the intro to one of my instructional videos to see what all out speed and a lack of fear looks like exploding through the air.

You are still reading because you know me well enough by now to realize that there is a solid point to why I brought up what I do in softball. If you are going to set goals to improve it seems only reasonable that you figure out how to make the biggest impact with your time. Whether it is in the lives of young ball players, whether it is with your own actions or whether you are trying to improve quality at a health facility to help improve the health of your patients.

Clinical Quality Measurements

I recently had the pleasure of working with a large health system who wanted to focus on analysis of their Clinical Quality Measurement data. To set the stage they had 62.5 million quality measurement records covering 35 different measures across 8 systems and involving 511 practice groups and covering 2,241 providers.

Naturally we needed to illustrate some “dashboardy” type deal to reflect their starting point. They happened to be at 56.01%. Is that good or bad?

That my friends is a trick question. Starting points are neither good nor bad they are simply starting points. So as you consider your improvement efforts don’t judge yourself based on some myth in your head of where you should have been. Simply measure and report where you are. Only then can we look at going forward.

Too much data to see the problem

The next logical step of course is to begin analyzing the data. In a traditional report driven world we would ask for some details based on the different quality measures. So we did that. We created a very simple chart that showed the name of the quality measure, the # of members involved (patients), the number of quality measurements taken for the measure as well as the % of the measurements that were compliant.

You know the typical stuff that emulates what you could get out of any $9.95 report writing tool. Then we sorted the chart in order of the % of records that were compliant. It’s where they were.

Naturally we also added the ability to change the dimension (Measure Name) to System, Practice or Provider so we could see the same details as we drilled in.

Visualizing How to Improve

The purpose of the project was to improve their compliance percentages as an organization. So here is where my opening point comes in. What should they spend their time on? Who should they speak to?

The natural inclination for folks is to start with the worst on the chart and go from there. The compliance for “Seizure – New Onset” was at 30.18%. Again neither good nor bad, just where it was. I said the natural inclination is to start there, but that would be wrong.  My friends the biggest bang for the buck isn’t to charge down the halls trying to improve the compliance of a measurement that only has 45,255 out of the 62.5 Million overall records. So if we’re going to help them determine how to best spend the time of their valuable human resources I better create a visualization that actually does that.

Visualizing what Isn’t Right

The visualization that I believe helps most with where to spend time is a Pareto Chart. Instead of looking at compliance percentages a Pareto Chart does the opposite: it looks at what isn’t compliant. More to the point a Pareto Chart looks at each Quality Measure (or any dimension) and looks into how many non-compliant measurements it has versus all of the non-compliant measurements in the entire data set.

It also visualizes the cumulative effect as you proceed through the list. Sometimes graphics in posts are simply to add a little excitement like the girl diving, but in this case a picture is needed to really understand the tremendous impact of what a Pareto Chart can do for you.

In the image below you will see that “Colorectal Cancer Screening” by itself as a measure makes up 37% of all of the non-compliant measurements. Why? If you look at the detail chart above closely it has over 15 million measurements for its members. It’s the biggest piece of the pie by a long shot. Followed by “Cervical Cancer Screening” and then “Diabetes Management.” The red line indicates the cumulative effect and you’ll see that if you focused your time on simply the top 6 of the 35 measures you would be affecting a cumulative 80% of all non-compliant measurements.


How did you do that?

That Pareto dealio is pretty powerful isn’t it? Kind of illustrates how to improve in a slick and easy method so I know you want to jump right into your systems and add it so the question you may be asking is “How did you do that?”

I start by creating a combo chart. The bars simply represent the total number of items that are not compliant divided by the overall total of non-compliant items, which is handled using simple SUM functions and the wonderful key word “TOTAL” that tells the system to ignore the dimension that the current row may represent. (The IsNotCompliant field is simply a bit field with a 0 or 1 value indicating whether the measurement was non-compliant.)

 SUM(IsNotCompliant)  / SUM(TOTAL IsNotCompliant)

The cumulative line is simply the exact same expression and the clicking of the radio button that says “Full Accumulation.”
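If you want to sanity check the arithmetic outside of Qlik, here is a small Python sketch of what the bar and the accumulated line are computing. The measure names and counts are placeholders I made up, not the client’s data, and the pareto helper is hypothetical. Each bar is a measure’s non-compliant count divided by the grand total (the TOTAL part), and the line is just a running sum of those shares once the measures are sorted from biggest to smallest.

```python
def pareto(non_compliant_by_measure):
    """Return (measure, share, cumulative share) rows, biggest offender first."""
    # Equivalent of SUM(TOTAL IsNotCompliant): the total ignoring the dimension.
    total = sum(non_compliant_by_measure.values())
    ranked = sorted(
        non_compliant_by_measure.items(), key=lambda item: item[1], reverse=True
    )
    rows = []
    running = 0
    for measure, count in ranked:
        running += count  # the "Full Accumulation" part
        rows.append((measure, count / total, running / total))
    return rows


# Placeholder counts of non-compliant measurements per measure.
sample = {"Measure C": 15, "Measure A": 50, "Measure D": 5, "Measure B": 30}
rows = pareto(sample)

for measure, share, cumulative in rows:
    print(f"{measure}: {share:.0%} of non-compliant, {cumulative:.0%} cumulative")
```

With these placeholder numbers the first two measures already cover 80% of everything that is non-compliant, which is exactly the kind of “start here” signal the chart is meant to give.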

It’s really that easy.

It’s really that powerful.

The question now is simply “Where can you use a Pareto Chart to help your organization?”

Meaningful Use? Absolutely!

CPOE? Absolutely!

What I love about this profession is that we have the tools that make visualizing how to improve so easy. What would take weeks of old-fashioned report writing and hours and hours of old-fashioned human analytic skills can be created in a single chart that instantly identifies where people need to spend their time if they want to improve the numbers and not just measure the numbers.

Posted in User Adoption, Visualization | 1 Comment

Avoiding a Data Tornado

You know I love to go out on a limb using data metaphors. Sometimes they are my own and sometimes I flat out steal them from others. (Imitation is the sincerest form of flattery you know.)

I’ve wanted to continue my series on The Data Consumption Continuum for a few weeks now. But just writing my thoughts? That’s crazy. I’ve had to show great patience in waiting for just the right metaphor to come along to catch your attention and draw you in.  The “what in the world is Qlik Dork up to now” kind of lead. Recently inspiration struck as I came across this beautiful data metaphor “Data Tornado” from Tyler Bell.

In his post “Big Data: An opportunity in search of a metaphor” he introduces the concept as one of the major thought processes that surrounds data consumption in this great big data world we now find ourselves in. He frames data as a problem of near-biblical scale, with subtle undertones of assured disaster if proper and timely preparations are not considered. (Don’t worry it’s not all doom and gloom he also introduces several positive metaphors but hey read those on your own time I’m trying to make a point here.)

We are at an age in the history of information where many analysts and businesses are begging for Self Service. Screaming if you will at IT “Just give me access to the data it belongs to the company I’m tired of waiting for you to write a report.” They are savvy and they know full well that the data is just sitting in a database or on a file share somewhere so why can’t they have access to it?

So why doesn’t IT want to just turn over the data and stop listening to the griping? Because the IT leadership team is worried about the Data Tornado that will ensue from all of these yahoos just randomly grabbing data and reporting 18 versions of the truth. You wondered where I was going with it didn’t you? And who can really blame them. You immediately understood the term “18 versions of the truth” because you’ve been burned by it in the past … multiple times.

You can’t get any more succinct than Zach and Chris Gemignani in their book “Data Fluency”: “You can’t dump data into an organization and expect it to be useful. Creating value from data is a complex puzzle; one that few organizations have solved.” The answer to why not is found partly in another of their excerpts “The goal of a data fluent culture, in part, is to ensure that everyone knows what is meant by a term like customer satisfaction. A data fluency culture breaks down when people spend more time debating terminology, calculations, and validity of data sources rather than discussing what action to take based on the results.”

Enter Governed Self Service

Rest easy my friend. My post isn’t about the widespread panic currently surrounding “self service” and that term’s association with a “data tornado.” It’s about how to AVOID it. It’s about a new phrase you should repeat to yourself in the mirror a few dozen times until you begin believing your own facial expressions when you say it “Governed Self Service.”

The word “governed” seems to have negative connotations for many and those thoughts need to change. It doesn’t (have to) mean that IT is restricting you from accessing data. It can and should mean that IT is adding value to the data to ensure that the right data is used by the right people at the right times. They don’t want to be storm chasers or fire fighters dealing with the carnage after a data tornado has struck. Data Governance is a way for them to prevent the tornado in the first place by ensuring that you fully understand what you are surfacing.

Enter Qlik Sense

Self Service is a technology agnostic term. Many high quality tools are in the market that allow you to display data. Qlik Sense goes beyond the ability to display data and allows you to build in the governance that is so desperately needed to avoid data tornadoes and satisfy the well phrased concerns needed for a truly data fluent organization through the use of pre-defined Dimensions and Measures.

Imagine that we have a set of data that surrounds customers and the analyst needs to display a count of the customers. Easy enough … after we define what the term “customer count” means. If we are just looking at a table that has customer demographics the count is obvious. But what if we are looking at a table of data that is all of the customer orders? Is the count the literal count that 100 customer (orders) were placed or should we display the unique count of customers so that we know we only had 76 different customers that placed those 100 orders?
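Here is that ambiguity boiled down to a tiny Python sketch. The orders and customer names are made up, and customer_counts is a hypothetical helper, but it shows how the same table gives two perfectly defensible answers to “how many customers?” depending on whether you count rows or count distinct customers.

```python
# Hypothetical order rows: (customer, order_id).
orders = [
    ("Acme", 1001),
    ("Acme", 1002),
    ("Bolt", 1003),
    ("Crane", 1004),
    ("Bolt", 1005),
]


def customer_counts(orders):
    """Return (row count, distinct customer count) for an orders table."""
    order_rows = len(orders)
    distinct_customers = len({customer for customer, _ in orders})
    return order_rows, distinct_customers


order_rows, distinct_customers = customer_counts(orders)
print(f"{order_rows} orders placed by {distinct_customers} different customers")
```

Neither number is wrong. The governance problem is making sure everyone who says “customer count” means the same one of them.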

Dimensions and Measures allow IT to build a framework of understanding to help analysts surface data in a way that avoids confusion. This screen shot illustrates how much metadata IT can add to a measure that can be used by an analyst in a way that ensures they use something as simple as a count correctly. You will see that the measure can contain a name, it shows the expression, it contains a description and holy cow it can even have tags associated that analysts can search for desired measures in a world where there might be thousands.


Enter Architeqt

As I’ve literally crisscrossed the country this year presenting to potential (and existing) Qlik customers they love this concept. But many in IT have begged for even more governance. “Dalton that’s great but Dimensions and Measures are only defined within single applications. What happens if we make changes? How can we apply changes across all of the applications? What if we need to add more as we develop more sources of data?” After 30 years in the IT trenches I can do nothing but wholeheartedly agree with them because maintenance is one of those things that IT considers but many analysts don’t.

No problem because that’s where Architeqt comes in. Architeqt is the framework for providing serious data governance across all of your Qlik Sense applications and is the brain child of Alexander Karlsson. It provides you the ability to create what he calls “Blueprints” which are the dimensions/measures/visuals that you need to share across all of your applications and then … oh this is so cool … use those blueprints in any of your Qlik Sense applications. And keep them in synch when you make changes.


There are many very small incremental steps that I’ve seen in my career. But my hat goes off to him because Architeqt isn’t one of those things. To me what Alexander has created provides the infrastructure that IT has been clamoring for. It provides them the assurance that they can maintain all of those vital formulas across all of the applications while still allowing analysts to freely access data. Combined with the ease of use Qlik Sense provides analysts to grab data and go forth consuming it, it finally provides a framework for … say it with me … Governed Self Service.

Exit Stage Right

While I would love to go on and on with lots of additional information I know this is the right time for me to step off the stage and allow you to dig into Architeqt for yourself. Simply click this link and it will take you directly to this phenomenal new extension. The site will contain all of the information you need to download and configure this Qlik Sense Extension as well as a nifty YouTube video where you can see it all in action.

Posted in Data Literacy, Self Service, User Adoption | Leave a comment

Flipping Homes or just Flipping Out?

I’ve enjoyed my career in Business Intelligence but after seeing the following visualization which shows the amazing potential for earning profit in the home flipping business I think it’s time I became a real estate mogul.

Unless you’ve been under a rock you are probably aware of the blitz of television shows dedicated entirely to showing us how easy it is. The underlying needs for house flipping are the startup capital to make purchases with, and the keen eye of a designer to help you choose the right colors to slap on the walls. I’ve got like $12 saved up which is probably more than enough to get started and fortunately I’m blessed with a wife that has a great eye for design. If you aren’t as fortunate as me you may need to find a business partner and a designer who you will more than likely have to pay.

Getting started

As business intelligence professionals I think it’s only good common sense for us to get started by playing to our strengths … use analytics to help us make our home purchases. After all as advocates of actionable intelligence certainly we would trust our own life savings in our analytical hands. Right???

The first thing we would want to do is figure out what aspects of a home are most responsible for attracting the highest price. Those data science types call what we are trying to do a “multiple regression.” In real estate mogul language it means: “Hey dingbat, before throwing all $12 down on the table to buy a home you probably need to know whether it’s the home’s square footage or the lot size or the number of bathrooms or the number of bedrooms or the amount of taxes or the proximity to schools that has the most impact on the sales price.”

Multiple Regression

Not too hard to understand the importance that knowledge would have on our ability to turn a profit. But how does that data science multiple regression stuff work? It’s simple: you fire up R, load your data, run the lm function and let it give you the answers.

Seriously it’s that easy. Here is how we would load our previous home sales data:

Housing = read.table("C:/RealEstateMogul/housing.txt", header=TRUE)

Then if we want R to tell us how the Price of the home relates to the Size (of home) and the Lot (size), we simply type the following:

Results = lm(Price ~ Size + Lot, data=Housing)
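One thing worth knowing: that line only stores the fitted model, and you still have to ask R to show you the answers. Here is a minimal sketch of that step. I don’t have your housing.txt, so the data frame below is a made-up stand-in with invented numbers (an assumption purely so the code runs on its own), not real sales data.

```r
# Hypothetical stand-in for housing.txt: invented numbers, not real sales.
set.seed(42)
n <- 50
Housing <- data.frame(
  Size = round(runif(n, 900, 3500)),    # square feet of the home
  Lot  = round(runif(n, 0.1, 2.0), 2)   # lot size in acres
)
# Invented pricing rule plus noise, so the regression has something to find.
Housing$Price <- 50000 + 110 * Housing$Size + 15000 * Housing$Lot +
  rnorm(n, sd = 20000)

Results <- lm(Price ~ Size + Lot, data = Housing)

summary(Results)            # coefficients, p-values and R-squared in one printout
coef(Results)               # estimated dollars per square foot and per acre
summary(Results)$r.squared  # share of the price variation the model explains
```

The coefficients tell you roughly how many dollars each extra square foot or acre adds, and R-squared tells you how much of the price movement those two variables account for together.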

Iterating combinations

R very well may tell you that there is a really strong correlation between the home size, lot size and the price. But unless you are lazy you would probably also want to know if there is an even stronger correlation. In other words, are the size of the home and the number of bathrooms more important? Or perhaps lot size and number of bedrooms? In our case all we would have to do is go through every possibility of 2 variables. Then all combinations of 3 variables. Then all combinations of 4 variables. Then all combinations of … you get the idea.
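For the curious, that grunt work can be sketched in a few lines of R. This is only an illustration under some assumptions of mine: the data is invented, the column names Bedrooms and Bathrooms are placeholders, and I’m using adjusted R-squared as the yardstick for “stronger,” which is one reasonable choice rather than the only one. It also includes the single-variable models for completeness.

```r
# Hypothetical stand-in data: invented numbers so the sketch runs on its own.
set.seed(1)
n <- 80
Housing <- data.frame(
  Size      = round(runif(n, 900, 3500)),
  Lot       = round(runif(n, 0.1, 2.0), 2),
  Bedrooms  = sample(2:5, n, replace = TRUE),
  Bathrooms = sample(1:4, n, replace = TRUE)
)
Housing$Price <- 40000 + 100 * Housing$Size + 12000 * Housing$Bathrooms +
  rnorm(n, sd = 25000)

predictors <- setdiff(names(Housing), "Price")

# Every combination of 1, 2, 3 ... predictors, as a list of character vectors.
combos <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per combination and record its adjusted R-squared.
fits <- data.frame(
  variables = sapply(combos, paste, collapse = " + "),
  adj_r2    = sapply(combos, function(vars) {
    model <- lm(reformulate(vars, response = "Price"), data = Housing)
    summary(model)$adj.r.squared
  })
)

# Best-scoring combinations first.
fits[order(-fits$adj_r2), ]
```

With 4 candidate variables that is already 15 models, and the count roughly doubles with every variable you add, which is exactly why nobody wants to type them by hand.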

As you can imagine, it’s this manual coding of all of the combinations, this grunt work, that those data scientists don’t really enjoy. Fortunately as a budding tycoon I’m also a Qlik Dork, and I fully intend to use QlikView as well as R.

QlikView and R Integration

You see this is kind of the perfect use case for the QlikView and R integration. Not only do I want to be able to simply check whatever combination of variables I want to use, I also want to be able to filter the data and choose what is passed to R. That way I can verify the best combination of variables as well as confirm that the correlation holds true across time periods, across zip code ranges etc. Or I may determine the variables that are best suited to 30542 versus 90210.


Behind the scenes there are only a handful of lines of VBScript code behind the button that says “Run in R.” Basically it outputs the data from a table so that whatever you have filtered is put into a CSV type file. Then it calls R and tells it to read the file it just output, run the lm function using the variables you’ve checked, and write the results to a file. Finally it reads that data back into QlikView so you can see the results, including a scatter plot showing the relationships between all of the variables.
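To make that round trip a little more concrete, here is a rough sketch of what the R side of such a hand-off could look like. To be clear, this is not the script from my application: the file names, the temp folder and the “checked” list are all hypothetical placeholders, and the first few lines fake the QlikView export with invented data so the sketch runs by itself.

```r
# All names and paths here are hypothetical placeholders.
export_file <- file.path(tempdir(), "qv_export.csv")   # would be written by QlikView
result_file <- file.path(tempdir(), "lm_results.csv")  # would be read back by QlikView
plot_file   <- file.path(tempdir(), "lm_pairs.pdf")
checked     <- c("Size", "Lot")                        # variables ticked on screen

# Stand-in for the QlikView export (invented data, not real sales).
set.seed(7)
demo <- data.frame(Size = round(runif(40, 900, 3500)),
                   Lot  = round(runif(40, 0.1, 2.0), 2))
demo$Price <- 50000 + 105 * demo$Size + 14000 * demo$Lot + rnorm(40, sd = 20000)
write.csv(demo, export_file, row.names = FALSE)

# 1. Read whatever rows the user left selected.
Housing <- read.csv(export_file)

# 2. Build the formula from the checked variables and run lm.
Results <- lm(reformulate(checked, response = "Price"), data = Housing)

# 3. Write the coefficient table to a file the front end can load.
coef_table <- as.data.frame(summary(Results)$coefficients)
coef_table <- cbind(Term = rownames(coef_table), coef_table)
write.csv(coef_table, result_file, row.names = FALSE)

# 4. Save a scatter plot matrix of price against the chosen variables.
pdf(plot_file)
pairs(Housing[, c("Price", checked)])
dev.off()
```

The point of building the formula with reformulate is that the button never needs to know in advance which boxes you will check.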



Some aren’t even aware that QlikView integrates with R. Others who do know figure, “I’m going to do the modeling in R anyway, so there really isn’t much that the QlikView integration can do for me.” Hopefully both types of people end up stumbling on this post. Feel free to nudge them by passing on the link. You see, the beauty isn’t just that QlikView can call R. It isn’t just that you can check variables on a screen. You are more than free to write additional code that would literally iterate through every potential combination and instruct R to write the results to filenames that match the combinations, so that in 1 button press you get all of the results for all combinations.
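If you want a feel for that last idea, here is a small sketch of the “one file per combination” trick on the R side. As before, the data is invented and the output folder and naming scheme are my own placeholders, not something the downloadable application is guaranteed to use.

```r
# Hypothetical sketch: one result file per variable combination.
set.seed(3)
n <- 60
Housing <- data.frame(
  Size     = round(runif(n, 900, 3500)),
  Lot      = round(runif(n, 0.1, 2.0), 2),
  Bedrooms = sample(2:5, n, replace = TRUE)
)
Housing$Price <- 45000 + 100 * Housing$Size + 10000 * Housing$Lot +
  rnorm(n, sd = 20000)

out_dir <- file.path(tempdir(), "lm_combinations")  # placeholder folder
dir.create(out_dir, showWarnings = FALSE)
predictors <- setdiff(names(Housing), "Price")

for (k in 2:length(predictors)) {
  for (vars in combn(predictors, k, simplify = FALSE)) {
    model <- lm(reformulate(vars, response = "Price"), data = Housing)
    # The file name mirrors the combination, e.g. Size_Lot.csv
    out_file <- file.path(out_dir, paste0(paste(vars, collapse = "_"), ".csv"))
    write.csv(summary(model)$coefficients, out_file)
  }
}

list.files(out_dir)
```

Because each file is named after its variables, whoever reviews the output can tell at a glance which model they are looking at.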

So what? So what!!! The “so what” here is that so many of you out there are thinking “data scientists are seriously expensive and we can’t afford them in our company.” You are so right. You can’t afford to pay a data scientist full time to sit and iterate through every combination of your data. After all, housing variables are mere child’s play compared to the massive number of variables in healthcare, for instance.

But you can afford to consult with one. You could have them build a model and then you simply use QlikView to iterate through all of the variables and then send them the output to review. Or what about that grad student in data science who has a few days in which to get some “real world experience”? Would QlikView’s integration with R allow you to take advantage of them?

Predictive Analytics is an important part of the overall data consumption continuum. The integration and what QlikView offers you sitting on top of R may be just what your organization needs to jump start your ability to reap the huge rewards that predictive analytics offers.

As for me, it was fun using house flipping as a great use case to help me convey how to use predictive analytics. As you guessed though it turns out that $12 isn’t even enough to buy a gallon of paint to slap on walls. So I guess I’ll just have to continue doing what I love … helping others consume data.

Resources for those hungry for more

You know how this blogging stuff works. If I write too much then I lose my audience. But in this case I know that flipping homes is really on the minds of a lot of you. So I’ve tried to predict some of your questions and provide you with links to more detailed answers and opportunities because that’s just the kind of dork I am.

“I want to see more so I can get a better idea of just how cool this stuff is”

The following YouTube video is a Qlik Dork exclusive and will probably not go viral so you shouldn’t have any problems at all viewing it. https://youtu.be/jwZ1K6invPI

“No fair having all of the fun yourself. I want to be a house flipping phenom as well. How do I get my hands on this stuff?”

Great question. You can download the QlikView application used, as well as an implementation guide to help you configure R on your machine, by clicking this link. When prompted, the password is “PredictiveAnalytics”.

“I am somewhat familiar with R and I really do have an interest in house flipping. How can I get more information about the subject?”

I’m not a data scientist, I don’t play one on TV and I haven’t even stayed at a Holiday Inn Express recently. However, the following links will give you all of the information you need about how to do a multiple regression on home sales data and how to read the results. They are from the serious data science minds at Columbia.

Summary version to wet your whistle

Really complex document that will blow your mind if you aren’t really into statistics

