
Transforming and Accessing Data through Custom Built Pipelines

One of the biggest hurdles in data analysis is simply getting access to the data in the first place. At Data Tapestry, we offer end-to-end analytics services, from data acquisition through analysis to end-user products. Keith Shook walks us through how he maintains data security and integrity across a wide variety of situations.

Tell us a little about your background and your role at Data Tapestry.

Currently, I’m a senior data engineer, but I actually started off as an intern ingesting data into Postgres and SQL databases. I then shifted into visualization using D3, a JavaScript library, but we found that Tableau was much more efficient. Since then, I’ve gained a variety of experience with Scala, Hive, AWS, and building clusters.


Can you walk us through a project you’ve worked on?

Data engineering is pretty straightforward as far as the process goes. You get the data, ingest it into the database, and then hand it off to the data scientist. You have to be flexible with how you approach the process, because you can get data in a variety of different formats. Sometimes you know what the data looks like or what it should look like. Other times there’s a lot of cleaning involved.


During one project, I actually received the data on a physical flash drive. The client has a fleet of trucks that install electric dog fences, and the data included sensitive geo-location information showing where the drivers stopped and even where they lived. It was housed in Microsoft SQL Server, which was not the best structure for their needs. So I exported all of the data to CSVs, imported it into Postgres, and then granted access to it.
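That CSV-to-Postgres step can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual script: the table name `truck_stops` and its columns are hypothetical, and the final load into Postgres (shown only in a comment) would use a driver such as psycopg2, which is not exercised here.

```python
import csv
import io


def rows_to_copy_buffer(rows):
    """Serialize rows into an in-memory CSV buffer suitable for COPY FROM STDIN."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)
    return buf


def copy_sql(table, columns):
    """Build a Postgres COPY statement for the given (hypothetical) table and columns."""
    cols = ", ".join(columns)
    return f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT csv)"


# With a live connection, the buffer would be streamed in roughly like this:
#   cur.copy_expert(copy_sql("truck_stops", ["truck_id", "lat", "lon"]), buf)
```

Using COPY rather than row-by-row INSERTs is the usual choice for bulk-loading exported CSVs, since Postgres ingests the whole stream in one round trip.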

What steps do you take to protect sensitive information like that?

Data security is very important to me and my client. We follow best practices to ensure the data is secure at all times. As an additional measure, I treat all data like personally identifiable information (PII) so that there are no mishaps.


How do you interface with the client?

It depends on the project. Sometimes I serve in a small capacity, or I’ll be an embedded part of the team. Other times, I may be leading and consulting on the project.


What types of software and technology do you use for data ingestion?

There’s not really one toolset that fits every job; I write a custom Python script for each project. For many projects, I’d visualize the data in Tableau, which let me see what was actually in the data. Sometimes I had to create calculated fields and join different tables together in order to produce a dataset that made sense.
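The joining and calculated-field work described above can be sketched with plain Python. This is an illustrative example only: the field names (`truck_id`, `stop`, `region`) are hypothetical stand-ins, not taken from any actual client dataset.

```python
def join_on(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def add_calculated(rows, name, fn):
    """Append a calculated field to every row, computed from the row itself."""
    return [{**row, name: fn(row)} for row in rows]
```

A real ingestion script would read each table from CSV (e.g. with `csv.DictReader`) before joining; in tools like Tableau the same derived column would be defined as a calculated field instead.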


What’s been the most interesting engineering challenge you’ve come across?

I’ve had to work with HL7 data. It’s a stream of data that hospitals produce by recording various message types, such as a discharge on June 8th at 12:00 pm. They are highly specialized data streams that require custom-built pipelines to deal with the unusual formats.
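To give a sense of the format, here is a toy parser for an HL7 v2 message. In HL7 v2, segments are separated by carriage returns and fields by pipes; `ADT^A03` is the discharge message type. This sketch is heavily simplified, with an invented sample message, and ignores real-world HL7 details such as escape sequences, repetitions, and component sub-delimiters, which is exactly why production pipelines for this data end up custom-built.

```python
# Hypothetical ADT^A03 (patient discharge) message; sending app, patient, and
# timestamps are invented for illustration.
SAMPLE = (
    "MSH|^~\\&|FLEETAPP|HOSP|RCV|RCVFAC|20230608120000||ADT^A03|MSG001|P|2.3\r"
    "PID|1||12345||DOE^JOHN\r"
    "PV1|1|I"
)


def parse_hl7(message):
    """Split an HL7 v2 message into segments keyed by segment ID (MSH, PID, ...)."""
    segments = {}
    for line in message.strip().split("\r"):
        fields = line.split("|")
        # A segment type can repeat, so keep a list per segment ID.
        segments.setdefault(fields[0], []).append(fields)
    return segments
```

Downstream, a pipeline would route on the message type (field MSH-9, here `ADT^A03`) to decide how each event, like that June 8th discharge, gets recorded.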


