Draining the Data Swamp

By now you have at least one "big data" project, and chances are good you have a "data lake."

This lake (technically a reservoir) was created by damming up the data stream that was feeding your data warehouse. But what do you have now, really? Executives, frustrated with the cost and time required to add another source to the ETL/data warehouse workflow, gave in to the mostly unsubstantiated claims that "data scientists" could just plumb this "lake" and find "insights."

But this is, prima facie, pretty absurd. One of our clients just "populated" their data lake with 28,000 tables' worth of data. It's no wonder that data scientists spend 60% of their time "data wrangling" and only a small part of their time doing analytics.

We can't turn back the hands of Gartner. This hype cycle has left the station. You're going to have a data lake whether you want it or not. What we can do is use the inevitable chaos as an opportunity.

In this talk, we will describe how to build a layer of meaning atop the data lake that is rapidly taking on data in your organization, and thereby salvage some of its value.

The data lake has been promoted along with the idea of "schema on read," which is at least an intriguing idea. We are so used to the idea that all our schemas must be in place, complete, and correct before we write any data. The new paradigm challenges this. "We don't need no stinking schemas." There are germs of truth in both camps. Having to have all schemas defined and agreed upon ahead of time promotes delay and disappointment. But the alternative is anarchy.
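To make "schema on read" concrete, here is a minimal sketch in Python. The records, field names, and the `read_with_schema` helper are all hypothetical, invented for illustration; the point is only that the raw data is stored as it arrived and the reader decides, at query time, which fields matter and how to interpret them.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
# Nothing was validated or reshaped when they were written.
RAW_LINES = [
    '{"cust_id": "17", "name": "Acme", "signup": "2015-03-01"}',
    '{"cust_id": "18", "name": "Globex"}',
    '{"id": 19, "name": "Initech", "region": "EMEA"}',
]

# The schema lives with the reader, not the writer:
# output field -> (candidate source keys, converter).
CUSTOMER_SCHEMA = {
    "customer_id": (("cust_id", "id"), int),
    "name": (("name",), str),
    "signup": (("signup",), str),
}

def read_with_schema(line, schema):
    """Parse one raw line and project it onto the reader's schema."""
    record = json.loads(line)
    out = {}
    for field, (keys, convert) in schema.items():
        # Take the first source key that is present in this record.
        value = next((record[k] for k in keys if k in record), None)
        out[field] = convert(value) if value is not None else None
    return out

customers = [read_with_schema(line, CUSTOMER_SCHEMA) for line in RAW_LINES]
```

Each reader can bring a different schema to the same raw lines, which is the appeal. The cost is also visible: every reader has to rediscover that `cust_id` and `id` mean the same thing, which is the anarchy described above.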

Semantic technology provides a means to lay down the data lake with minimal commitment to schema, and then add understanding, in the form of schema additives, as you work with the data.
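The idea of adding schema after the data is loaded can be sketched with a toy triple store in plain Python. This is a stand-in, not a real RDF store or SPARQL engine: the `emp:` and `ex:` identifiers and the `instances_of` helper are assumptions made up for illustration. Only the `rdf:type` and `rdfs:subClassOf` terms are borrowed from the RDF and RDFS vocabularies.

```python
# Data loaded with no schema beyond (subject, predicate, object).
triples = {
    ("emp:1", "rdf:type", "ex:Contractor"),
    ("emp:2", "rdf:type", "ex:Employee"),
    ("emp:3", "rdf:type", "ex:Vendor"),
}

def instances_of(cls, graph):
    """Return subjects typed as cls or as any (transitive) subclass of cls."""
    classes = {cls}
    changed = True
    while changed:
        # Keep expanding until no new subclass is discovered.
        changed = False
        for s, p, o in graph:
            if p == "rdfs:subClassOf" and o in classes and s not in classes:
                classes.add(s)
                changed = True
    return sorted(s for s, p, o in graph if p == "rdf:type" and o in classes)

# Before any schema exists, nothing is known to be a Worker.
before = instances_of("ex:Worker", triples)

# Later, schema is added as more triples. The data triples are untouched.
triples |= {
    ("ex:Contractor", "rdfs:subClassOf", "ex:Worker"),
    ("ex:Employee", "rdfs:subClassOf", "ex:Worker"),
}

after = instances_of("ex:Worker", triples)
```

Here `before` is empty and `after` contains `emp:1` and `emp:2`. The same question returns more once the schema grows, and no data was converted or reloaded to get there. In a real deployment the data would presumably sit in an RDF store, with the question asked in SPARQL.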

This presentation will summarize what it is about graph databases and semantic technology that makes this possible. We will show some examples of starting with a simple schema and elaborating on it in place. We will show, by demonstration, how to execute a semantic query (in SPARQL) and retrieve conforming data from a data lake.

After attending this talk, attendees will:

Be aware of the danger of naive adoption of "Data Lakes"
Understand how to grow a schema in place without converting the data
Be prepared with a prescription of what to do to make the data in the data lake economically accessible

Dave McComb has over 30 years of experience with enterprise-level systems and enterprise architecture. He has built enterprise ontologies for over a dozen major enterprises.

Dan Carey is an ontologist and data architect with 30 years of consulting experience, 25 of it designing databases, data models, and data strategies with major IT service firms. He has worked primarily for government clients at the federal, state, and local levels. Most recently, he has designed semantic technology products in OWL and RDF to assist in military human resources management, and data exchange standards in OWL and XSD. He holds a Bachelor's degree in Applied Physics from Georgia Tech.