What is Big Data?

What’s the story Rory?

Table of contents

  1. What exactly is Big Data?
  2. What can Big Data do?
  3. Why Has It Become So Popular?
  4. Why Should Businesses Care?
  5. Big data and analytics
  6. What Could That Data Be, Exactly?
  7. IT infrastructure to support big data
  8. Big data skills
  9. What Are the Most Commonly Held Misconceptions About Big Data?
  10. Big Data and Cloud Computing
  11. Book
  12. Influencers List
  13. Courses
  14. Links

1. What exactly is Big Data?

Big Data started appearing in many of my conversations with tech friends. So when I met this “Mr. Know It All Consultant”, I asked him, ‘What is Big Data?’ He went on to explain why Big Data is the next ‘in thing’ and why everyone should know about it, but he never directly answered my question.

At first glance, the term seems rather vague, referring to something that is large and full of information. That description does indeed fit the bill, yet it provides no information on what Big Data really is.

Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools. Searching the Web for clues reveals an almost universal definition, shared by the majority of those promoting the ideology of Big Data, that can be condensed into something like this: Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set. In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it. The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.

I kept asking that question to various other folks, and I never got the same answer twice. ‘Oh, it’s a lot of data.’ ‘It’s the variety of data.’ ‘It’s how fast the data is piling up.’ Really? I thought to myself, but was afraid to ask more questions. As none of it made much sense to me, I decided to dig into it myself. Obviously, my first stop was Google.

When I typed ‘Big Data’ at that time, this showed up:
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it …” Dan Ariely

One popular interpretation of big data refers to extremely large data sets, but I particularly prefer the 3 Vs definition (volume, variety, and velocity).
Doug Laney of Gartner is credited with the 3 ‘V’s of Big Data. Gartner’s definition is: ‘high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.’ Gartner is referring to the size of the data (large volume), the speed with which the data is being generated (velocity), and the different types of data (variety).

Mike Gualtieri of Forrester said that the 3 ‘V’s mentioned by Gartner are just measures of data, and insisted that Forrester’s definition is more actionable: ‘Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.’ Forrester seems to be saying that any data beyond the current reach (i.e. frontier) of a firm to store (large volumes of data), process (needs innovative processing), and access (new ways of accessing that data) is Big Data. So the question is: what is the ‘frontier’? Who defines the frontier?
I kept searching for answers. I looked at McKinsey’s definition: “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” Similar to all the above, but still not specific enough for me to decide when data becomes Big Data.
IBM added ‘Veracity’, referring to the quality of data. And then several people started to add even more Vs to the Big Data definition.

I found two famous ones: the “10 Vs of Big Data” and the “42 Vs of Big Data”. The latter, I think, is too much and a little bit funny, especially “23 – Version Control” and “40 – Voodoo”.

Wikipedia says: ‘Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.’ Wikipedia’s definition focuses on the ‘volume of data’ and the ‘complexity of processing that data’.

O’Reilly Media says: “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” To some extent, the Wikipedia and O’Reilly definitions are similar in that both refer to ‘processing capacity’ and ‘conventional database systems’, but O’Reilly adds a new twist by mentioning data that is ‘too big’ and ‘moves too fast’.

I couldn’t find a clear answer to the question of what volume threshold makes data Big Data. A few lesser-known articles claim that “some have defined big data as an amount that exceeds ten petabytes”, yet Big Data techniques can be applied to much smaller amounts of data as well.

Even though there is no single, universally accepted definition of Big Data, there are some common concepts that almost all definitions converge on:
data of large volume; data that is not of a single type, i.e. a mix of structured, semi-structured, and unstructured data; and data that requires newer ways to store, process, analyze, visualize, and integrate.
The truth is that data is being generated at a much faster rate than in the past, from all kinds of sources including social media, and we need new ways to handle it.

What is Big data by Oracle

The popular use of big data can be traced to a single research paper published in 2004:
“MapReduce: Simplified Data Processing on Large Clusters”, by Jeffrey Dean and Sanjay Ghemawat.

2. What can Big Data do?

A list of some use cases:

  • Recommendation engines
  • Fraud detection
  • Predictive analytics
  • Customer segmentation
  • Customer churn prevention
  • Product development
  • Price optimization
  • Customer sentiment analysis
  • Real-time analytics

Big Data is not just limited to software or application development. Big Data development is used in many other sectors like:

  • Fintech
  • Robotics
  • Meteorology
  • Medicine
  • Environmental research
  • Informatics and cybersecurity

3. Why Has It Become So Popular?

Big Data’s recent popularity has been due in large part to new advances in technology and infrastructure that allow for the processing, storing and analysis of so much data. Computing power has increased considerably in the past five years while at the same time dropping in price – making it more accessible to small and midsize companies. In the same vein, the infrastructure and tools for large-scale data analysis have gotten more powerful, less expensive and easier to use.
As the technology has gotten more powerful and less expensive, numerous companies have emerged to take advantage of it by creating products and services that help businesses to take advantage of all Big Data has to offer.

4. Why Should Businesses Care?

Data has always been used by businesses to gain insights through analysis. The emergence of Big Data means that they can now do this on an even greater scale, taking into account more and more factors. By analyzing greater volumes from a more varied set of data, businesses can derive new insights with a greater degree of accuracy. This directly contributes to improved performance and decision making within an organization.
Big Data is fast becoming a crucial way for companies to outperform their peers. Good data analysis can highlight new growth opportunities, identify and even predict market trends, be used for competitor analysis, generate new leads and much more. Learning to use this data effectively will give businesses greater transparency into their operations, better predictions, faster sales and bigger profits.

5. Big data and analytics

What really delivers value from all the big data that organizations are gathering is the analytics applied to that data. Without analytics, it’s just a bunch of data with limited business use.
By applying analytics to big data, companies can see benefits such as increased sales, improved customer service, greater efficiency, and an overall boost in competitiveness.
Data analytics involves examining data sets to gain insights or draw conclusions about what they contain, such as trends and predictions about future activity.
Analytics can refer to basic business intelligence applications or more advanced, predictive analytics such as those used by scientific organizations. Among the most advanced types of data analytics is data mining, where analysts evaluate large data sets to identify relationships, patterns, and trends.

6. What Could That Data Be, Exactly?

It could be all the point-of-sale data for Best Buy. That’s a huge data set: everything that goes through a cash register. For us, it’s all of the activity on a website, so a ton of people coming through, doing a bunch of different things. It’s not really cohesive or structured.
With point of sale, for example, you’re looking at what people are purchasing and what they’ve done historically. You’re looking at what they’ve clicked on in email newsletters, loyalty program data, and coupons that you’ve sent them in direct mail — have those been redeemed? All these things come together to form a data set around purchasing behavior. You can look at what “like” customers do in order to predict what similar customers will buy as well.
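The “like customers” idea above can be sketched in a few lines. This is a toy example with hypothetical customer data, not a real recommendation engine: it finds the most similar customer by purchase overlap and suggests what they bought that the target customer has not.

```python
# Toy "like customers" recommendation sketch (hypothetical data).
purchases = {
    "ann": {"tv", "hdmi cable", "soundbar"},
    "bob": {"tv", "hdmi cable"},
    "cal": {"laptop", "mouse"},
}

def recommend(target):
    # Find the most similar customer by purchase-history overlap...
    others = [c for c in purchases if c != target]
    best = max(others, key=lambda c: len(purchases[c] & purchases[target]))
    # ...and suggest what they bought that the target has not.
    return purchases[best] - purchases[target]

print(recommend("bob"))  # {'soundbar'}
```

Real systems use far richer similarity measures (loyalty data, clicks, demographics), but the principle is the same: similar customers predict each other’s purchases.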

7. IT infrastructure to support big data

For the concept of big data to work, organizations need to have the infrastructure in place to gather and house the data, provide access to it, and secure the information while it’s in storage and in transit.
At a high level, these include storage systems and servers designed for big data, data management and integration software, business intelligence and data analytics software, and big data applications.

8. Big data skills

Big data and big data analytics endeavors require specific skills, whether they come from inside the organization or through outside experts.
Many of these skills are related to the key big data technology components, such as Hadoop, Spark, NoSQL databases, in-memory databases, and analytics software.
Others are specific to disciplines such as data science, data mining, statistical and quantitative analysis, data visualization, general-purpose programming, and data structure and algorithms. There is also a need for people with overall management skills to see big data projects through to completion.

Under the umbrella of Big Data, there are many technologies and concepts. This is not an exhaustive list!

Google File System – GFS – is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware.

Distributed File System – In computing, a distributed file system (DFS) or network file system is any file system that allows access to files from multiple hosts over a computer network. The DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. This makes it possible for multiple users on multiple machines to share files and storage resources.

Hadoop – Hadoop is a framework for distributed parallel processing of huge amounts of data, and it provides a distributed file system for that data.
Hadoop is composed of the distributed file system (HDFS), MapReduce, and YARN.

MapReduce – a programming model that makes combining the data from various hard drives a much easier task. There are two parts to the programming model – the map phase and the reduce phase—and it’s the interface between the two where the “combining” of data occurs. MapReduce enables anyone with access to the cluster to perform large-scale data analysis.
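The two phases and the “combining” step between them can be sketched in plain Python. This is a single-process toy version of the model, assuming a word-count job; real frameworks such as Hadoop run the same phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key -- the "combining" between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse each key's list of values into one result.
    return (key, sum(values))

documents = ["big data is big", "data moves fast"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

Because each map call touches only one document and each reduce call only one key, the phases can be spread across many machines without changing the logic.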

HDFS – Hadoop Distributed File System, the storage layer of Hadoop.

YARN – Yet Another Resource Negotiator – Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. One of Apache Hadoop’s core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

Hadoop Ecosystem – Hadoop Ecosystem is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining) inside it.

Spark – is an open-source, distributed, general-purpose cluster computing framework with a (mostly) in-memory data processing engine. It can do ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs for languages such as Scala, Java, Python, and R.
The four main components are: Spark SQL, Spark Streaming, MLlib, and GraphX.

Pig – Apache Pig is a platform for analyzing large data sets that consist of a high-level language for creating MapReduce programs. These programs can then be run in parallel on large-scale Hadoop clusters. Complex tasks are broken into small data flow sequences, which make them easier to write, maintain, and understand. Users are able to focus more on semantics rather than efficiency with Pig Latin, because tasks are encoded in a way that allows the system to automatically optimize the execution. By utilizing user-defined functions, users are also able to extend Pig Latin. These functions can be written in many popular programming languages such as Java, Python, JavaScript, Ruby, or Groovy and then called directly using Pig Latin.

Hive – Apache Hive is a data warehouse system for data summarization, analysis, and querying of large data sets on the open-source Hadoop platform. It converts SQL-like queries into MapReduce jobs for easy execution and processing of extremely large volumes of data.

Hbase – Apache HBase is a highly distributed, NoSQL database solution that scales to store large amounts of sparse data. In the scheme of Big Data, it fits into the storage category and is simply an alternative or additional data store option. It is a column-oriented, key-value store that has been modeled after Google’s BigTable.

Oozie – is a server-based workflow scheduling system to manage Hadoop jobs. Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow (start, end, and failure nodes) as well as a mechanism to control the workflow execution path (decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task.
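The control-flow ideas above (a DAG of actions, with forks and joins) can be sketched in a few lines of Python. This is a toy scheduler with hypothetical job names, not Oozie itself; real Oozie workflows are defined in XML and run on a Hadoop cluster.

```python
# Toy DAG workflow: each action node lists the nodes it depends on.
workflow = {
    "ingest": [],
    "clean": ["ingest"],
    "aggregate": ["ingest"],           # runs alongside "clean" (a fork)
    "report": ["clean", "aggregate"],  # a join: waits for both branches
}

def run(workflow):
    done, order = set(), []
    while len(done) < len(workflow):
        for job, deps in workflow.items():
            # An action may execute only after all its dependencies.
            if job not in done and all(d in done for d in deps):
                order.append(job)  # "execute" the action node
                done.add(job)
    return order

print(run(workflow))  # ['ingest', 'clean', 'aggregate', 'report']
```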

Kafka – is an open-source stream-processing software platform. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue designed as a distributed transaction log,” making it highly valuable for enterprise infrastructures to process streaming data.

Flume – is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Solr – an open-source search platform built on Apache Lucene.

Cloudera – is one of the most famous Hadoop distributions.

Hortonworks – is another famous Hadoop distribution.

MapR – is another famous Hadoop distribution.

ETL – is short for extract, transform and load.

ELT – is short for extract, load and transform.
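A minimal ETL pipeline can be sketched with the Python standard library. The CSV data, table name, and cleaning rule here are all hypothetical; in practice the source could be an API, log files, or another database.

```python
import csv
import io
import sqlite3

raw_csv = "name,amount\nalice,10\nbob,\ncarol,5\n"

# Extract: read rows from the CSV source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: drop incomplete rows and cast the amount to an integer.
clean = [(r["name"], int(r["amount"])) for r in rows if r["amount"]]

# Load: write the cleaned rows into the target store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 15
```

In ELT, the same raw rows would be loaded into the target system first and transformed there, using the target’s own processing power.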

Algorithm – A programming algorithm is a procedure, much like a recipe, that tells your computer precisely what steps to take to solve a problem or reach a goal.

Analytics – Analytics often involves studying historical data to research potential trends, to analyze the effects of certain decisions or events, or to evaluate the performance of a given tool or scenario. The goal of analytics is to improve the business by gaining knowledge that can be used to make improvements or changes.

Descriptive Analytics – is a preliminary stage of data processing that creates a summary of historical data to yield useful information and possibly prepare the data for further analysis.

Predictive Analytics – is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends.
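A toy predictive-analytics example: fit a straight line to past monthly sales (hypothetical numbers) with ordinary least squares and extrapolate the next month. Real predictive models are far more sophisticated, but the idea of learning a pattern from existing data to predict a future value is the same.

```python
# Hypothetical historical data: month number -> sales.
months = [1, 2, 3, 4]
sales = [100, 120, 140, 160]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Ordinary least squares: slope and intercept of the best-fit line.
slope = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(months, sales)) / \
        sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

# Predict month 5 by extrapolating the fitted line.
predicted = slope * 5 + intercept
print(predicted)  # 180.0
```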

Prescriptive Analytics – is the area of business analytics (BA) dedicated to finding the best course of action for a given situation. Prescriptive analytics is related to both descriptive and predictive analytics.

Batch processing – is a general term used for frequently used programs that are executed with minimum human interaction. Batch process jobs can run without any end-user interaction or can be scheduled to start up on their own as resources permit.

Dark Data – is data which is acquired through various computer network operations but not used in any manner to derive insights or for decision making. The ability of an organisation to collect data can exceed the throughput at which it can analyse the data.

Data lake – is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Data warehouse – is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.

Data mining – is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Data Science – DS – is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining; a data scientist is a practitioner of that field.

Data analytics – DA – is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software.

Data Engineers – are the designers, builders, and managers of the information or “big data” infrastructure. They develop the architecture that helps analyze and process data in the way the organization needs it, and they make sure those systems are performing smoothly.

Data modeling – This is a conceptual application of analytics in which multiple “what-if” scenarios can be applied via algorithms to multiple data sets. Ideally, the modeled information changes based on the information made available to the algorithms, which then provide insight to the effects of the change on the data sets. Data modeling works hand in hand with data visualization, in which uncovering information can help with a particular business endeavor.

AI – Artificial intelligence – is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction.

Machine learning – is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

NoSQL – is a data model that addresses several issues the relational model is not designed to address: large volumes of rapidly changing structured, semi-structured, and unstructured data.

Stream processing – is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing.
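The dataflow idea can be sketched with Python generators: each stage consumes events one at a time instead of waiting for the whole data set. This is a toy, single-process stand-in for systems such as Kafka Streams or Spark Streaming, with hypothetical event values.

```python
def source():
    # In a real system this would be an unbounded feed of events.
    for event in [3, -1, 4, -1, 5, 9]:
        yield event

def keep_positive(events):
    # A filtering stage: pass only positive events downstream.
    for e in events:
        if e > 0:
            yield e

def running_total(events):
    # A stateful stage: emit the cumulative sum so far.
    total = 0
    for e in events:
        total += e
        yield total

totals = list(running_total(keep_positive(source())))
print(totals)  # [3, 7, 12, 21]
```

Each stage processes an event as soon as it arrives, which is exactly what distinguishes stream processing from batch processing.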

Structured data – is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis. A data structure is a kind of repository that organizes information for that purpose.

Unstructured Data – or unstructured information is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Java – is a high-level programming language developed by Sun Microsystems. Java syntax is similar to C++, but Java is strictly an object-oriented programming language. For example, most Java programs contain classes, which are used to define objects, and methods, which are assigned to individual classes.

Scala – is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. Scala smoothly integrates the features of object-oriented and functional languages.

Python – is an interpreted, high-level, general-purpose programming language.

R – is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

Traditional business intelligence – BI – This consists of a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data. BI delivers actionable information, which helps enterprise users make better business decisions using fact-based support systems. BI works by using an in-depth analysis of detailed business data, provided by databases, application data, and other tangible data sources. In some circles, BI can provide historical, current, and predictive views of business operations.

Statistical applications – These look at data using algorithms based on statistical principles and normally concentrate on data sets related to polls, census, and other static data sets. Statistical applications ideally deliver sample observations that can be used to study populated data sets for the purpose of estimating, testing, and predictive analysis. Empirical data, such as surveys and experimental reporting, are the primary sources for analyzable information.

9. What Are the Most Commonly Held Misconceptions About Big Data?

In my opinion, people think it’s this magical thing. They think, “We’ll just turn that on and now things will just work and we’ll know all this stuff.” But it’s just not that simple — it’s actually really complicated and you need the right equipment and people that understand how to analyze and work with big data.
Increasingly, simplified tools are coming out for non-technical users to create dashboards and get some of the information they’re looking for, but it is a really specialized skillset. It’s not something you can just turn on and have. There’s an investment in people, time, and hard costs to make this stuff work.

10. Big Data and Cloud Computing


11. Book


12. Influencers List


13. Courses


14. Links

What can Big Data do – https://www.bernardmarr.com/default.asp?contentID=1076
What is Big Data? A Complete Guide – https://learn.g2crowd.com/big-data
Hadoop Ecosystem table – https://hadoopecosystemtable.github.io/

To sum up: even though there is no single, universally accepted definition of Big Data, there are common concepts that almost all definitions converge on.
This post is a simple and brief overview of Big Data and the ecosystem around it.
