Warehouse or Lake?

Data lakes store data in its native format, so structured, semi-structured, and unstructured data can be compared and analyzed in the same place. They also double as a sandbox where relationships in the data can be explored. Data lakes are great if you’re new to storing and analyzing data, and they allow for greater agility, flexibility, and ease when it comes to identifying trends.

“What about data warehouses?”

While warehouses are the most common method for storing data, they only work for data that is well-structured. This makes them more expensive and time-consuming up front, but they are designed to deliver higher-quality results in the long term, enabling you to perform predictable tasks. Warehouses exist to store and aggregate data with a designated purpose – meaning that if you change your mind about what you want to use the data for, you’re out of luck.
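For readers who think in code, the difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the flight IDs, field names, and helper functions below are all hypothetical. The "warehouse" loader rejects any record that does not match its fixed columns, while the "lake" keeps the raw CSV and JSON as-is and applies structure only when a question is asked.

```python
import csv
import io
import json

# Hypothetical raw inputs, kept in their native formats as a lake would keep them.
RAW_CSV = "flight,fuel_kg\nF100,5200\nF200,4800\n"
RAW_JSON_LINES = (
    '{"flight": "F100", "sensor": "wing_temp", "value": -41.5}\n'
    '{"flight": "F200", "sensor": "wing_temp", "value": -39.0, "note": "recalibrated"}\n'
)

# Warehouse style (schema-on-write): the columns are decided up front.
WAREHOUSE_COLUMNS = ("flight", "fuel_kg")


def load_into_warehouse(record: dict) -> tuple:
    """Accept a record only if it matches the fixed columns exactly."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not fit the warehouse schema: {sorted(record)}")
    return (record["flight"], int(record["fuel_kg"]))


# Lake style (schema-on-read): structure is applied when the data is queried.
def read_fuel(raw_csv: str) -> dict:
    """Parse the raw CSV into {flight: fuel_kg}."""
    return {row["flight"]: int(row["fuel_kg"]) for row in csv.DictReader(io.StringIO(raw_csv))}


def read_wing_temps(raw_json_lines: str) -> dict:
    """Parse raw JSON lines into {flight: temperature}, ignoring extra fields."""
    records = (json.loads(line) for line in raw_json_lines.splitlines())
    return {r["flight"]: r["value"] for r in records if r.get("sensor") == "wing_temp"}


def fuel_with_temperature(raw_csv: str, raw_json_lines: str) -> dict:
    """Combine two differently shaped sources to explore a new question."""
    fuel = read_fuel(raw_csv)
    temps = read_wing_temps(raw_json_lines)
    return {flight: (fuel[flight], temps.get(flight)) for flight in fuel}


if __name__ == "__main__":
    print(fuel_with_temperature(RAW_CSV, RAW_JSON_LINES))
```

In this sketch the sensor readings never had to fit the warehouse columns before they could be stored, so a question nobody planned for (does wing temperature relate to fuel burn?) can still be asked later.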

Put simply, if a data lake were a city, the data warehouse would be a suburb – still a great place to live, but meticulously planned, predictable, and with limited room for change. While you get cookie-cutter houses in a suburb, sometimes you want the unpredictability, adaptability, and vibrancy that city life provides. But how can this adaptability be leveraged?

There are roughly 100 000 scheduled flights every day. Commercial planes from companies like Airbus have around 10 000 sensors in each wing alone, and one commercial flight can generate terabytes of data. We’d do the math, but we think you get the idea – that’s a lot of data.

Airlines are discovering that data lakes can go beyond archiving data and be used to explore what that data can help companies do. Want to figure out how to save on fuel? AirAsia did it with a data lake, and managed to save 1% of its fuel costs per year. To put that in perspective, 1% savings per year for the aviation industry translates into a whopping $30 billion saved.

“How do I get started?”

The most important takeaway is that your data lake needs to be geared towards the people who use it the most. Tools such as SQL, Spark, Hortonworks, and Cloudera are most valuable in the hands of those who know how to use them. There are a number of powerful tools that can be leveraged, and which ones you choose depends on whether you’re hosting with a cloud service or in-house.

A cloud-based data lake is likely the easiest and best option for most businesses, especially those that don’t have staff with the specialized skills needed to maintain in-house systems. Pre-made solutions such as HDInsight (Azure) or Elastic MapReduce (Amazon) work out of the box, and can provide a solid foundation to build upon without the need for in-house infrastructure. The cloud also lets you increase or decrease the size of your distributed computing cluster based on what you need it to do. Processing Big Data requires a lot of time and power, and the cloud gives you the flexibility to use more as you need it.

For years, companies have been collecting and storing the data they generate, on the promise that someday, somehow, they’ll be able to use it in a beneficial way. Data lakes allow businesses that have waited patiently to get a return on that investment. With the right team and the right tools for your business, a data lake can be the solution needed to extract value, identify correlations and trends, and create the visuals to determine what story your data is telling you.

To find out how Lixar can drive results from your data, contact:

data@lixar.com

Career opportunities:

Data Team Lead | Data Scientist | Data Developer

Data Scientist | Engineer | Visualist | Data Translator | Data Architect

We’re always working on innovative projects.

If you are dynamic and love data, please send a message to our Data or HR team today.