LED Race – Another Dublin Maker

Hey you!

Last week, on the 20th of July, we had Dublin Maker 2019.

So, let’s start with an explanation of what Dublin Maker is.

Dublin Maker is a free-to-attend, community-run event, which was held on Saturday, July 20th, 2019 in Merrion Square. Dublin Maker takes the form of a “show and tell” experience where inventors/makers, sourced through an open call, have an opportunity to showcase their creations in a carnival atmosphere. It is a family-friendly showcase of invention, creativity and resourcefulness, and a celebration of the maker movement. It’s a place where people show what they are making and share what they are learning.

If you have kids interested in science, or you’re a kid yourself, this is a great event with lots of interesting, open people and things, at least some of which you’ll probably not get the chance to see again.

I have been attending the event for a long time now, and today I want to explain the project that I presented this year: the Open LED Race. Back in March I was on holidays and I saw a tweet from Arduino about the project, and I said to myself: this is exactly what I’m going to show at Dublin Maker.

My first concern was the arcade button. Because the idea is to let a lot of kids play, I was afraid they would break the button quickly, and I would need to replace it just as quickly because a lot of people come to see the project.
After a few days looking at the instructions and the components, I had the idea of replacing the arcade button with a MakeyMakey, and because of that I decided to build the project around a Raspberry Pi to simplify it.

Apparently, I’m the first one to do it.

The project is really simple. I’m using a WS2813 LED strip and an API that I found on the internet.

A Python wrapper for the rpi-ws281x library.

You can check my GitHub to see the full code.

It’s basically Python code running on the Raspberry Pi that controls the LED strip, plus a MakeyMakey that I used to simulate a keyboard. On every click, I move the LED 3 positions forward.
The MakeyMakey part is just aluminium foil and Play-Doh.

I used GPIO 10 (pin 19) and GPIO 18 (pin 12) for LED_PIN and LED_DMA, and GPIO 9 (pin 6) for ground.
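
To make that concrete, here is a minimal sketch of the race loop, assuming the rpi_ws281x Python package and a MakeyMakey that shows up as a normal keyboard. The LED count, colours and step size are illustrative; this is not the exact code from my GitHub.

```python
# Minimal sketch of the LED race loop (illustrative, not the exact project code).
# Assumes the rpi_ws281x Python package and a MakeyMakey that acts as a normal
# keyboard, so every "click" arrives as a key press on stdin.
import sys
import termios
import tty

from rpi_ws281x import PixelStrip, Color

LED_COUNT = 300       # number of LEDs on the WS2813 strip (illustrative)
LED_PIN = 18          # GPIO 18 (physical pin 12) drives the data line
LED_FREQ_HZ = 800000
LED_DMA = 10          # DMA channel used to generate the signal
LED_BRIGHTNESS = 64
STEP = 3              # every click moves the "car" 3 LEDs forward

strip = PixelStrip(LED_COUNT, LED_PIN, LED_FREQ_HZ, LED_DMA, False, LED_BRIGHTNESS)
strip.begin()

def draw_car(position):
    """Clear the strip and light the car at its current position."""
    for i in range(strip.numPixels()):
        strip.setPixelColor(i, Color(0, 0, 0))
    strip.setPixelColor(position % strip.numPixels(), Color(255, 0, 0))
    strip.show()

position = 0
draw_car(position)

# Read single key presses without waiting for Enter.
fd = sys.stdin.fileno()
old_settings = termios.tcgetattr(fd)
try:
    tty.setcbreak(fd)
    while True:
        sys.stdin.read(1)      # one MakeyMakey "click"
        position += STEP       # move the car 3 positions forward
        draw_car(position)
finally:
    termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
```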

I created a simple version of the LED Race, but there is a lot of room for improvement.

I came up with some ideas that I’ll implement one day:

  • Add a monitor where I can show a timer and the best lap time;
  • Show a speedometer or something like the number of pushes per second;
  • Display the best lap overall;
  • Four players, with four cars on the track at the same time.

Other ideas are things that I saw on the Open LED Race Arduino website and in the comments, like:

  • Add velocity;
  • Add some physics: when the car goes uphill more pushes are needed, and the speed increases when going downhill;
  • Add a second LED strip so the cars can move left and right and leave some kind of weapon on the track; a car that hits one gets stuck there until it is pushed free.

I want to add a big thanks here to Elaine Akemi, who helped me with the project. She is also my official partner for hackathons and events, and she was with me at the last two Dublin Maker editions.

Hadoop Ecosystem & Hadoop Distributions

Alright boss?

The complexity of the Hadoop ecosystem comes from the many tools that can be used with it, not from using Hadoop itself.

The objective of this Apache Hadoop ecosystem components tutorial is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful, of some Hadoop-related products, and of some of the Hadoop distributions on the market. I did some research and this is what I know or found; it is probably not an exhaustive list!

If you’re not familiar with Big Data or data lakes, I suggest you have a look at my previous posts “What is Big Data?” and “What is Data Lake?” first.
This post is a collection of links, videos, tutorials, blogs and books that I found, mixed with my opinion.

Hadoop Ecosystem

What does Hadoop Ecosystem mean?

The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining) inside it. The Hadoop ecosystem includes both official Apache open source projects and a wide range of commercial tools and solutions. Most of the solutions available in the Hadoop ecosystem are intended to supplement one or two of Hadoop’s four core elements (HDFS, MapReduce, YARN, and Common). However, the commercially available framework solutions provide more comprehensive functionality.

The Hadoop Ecosystem Table https://hadoopecosystemtable.github.io/

Hortonworks https://hortonworks.com/ecosystems/

Hive
Hive is data warehousing software that addresses how data is structured and queried in distributed Hadoop clusters. Hive is also a popular development environment that is used to write queries for data in the Hadoop environment. It provides tools for ETL operations and brings some SQL-like capabilities to the environment. Hive offers a declarative, SQL-like language (HiveQL) that is used to develop applications for the Hadoop environment; however, it does not support real-time queries.
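
Just to make this tangible, here is a hedged sketch of querying Hive from Python using the PyHive package; the host, port, table and query are made up for illustration.

```python
# Hedged sketch: querying Hive from Python with the PyHive package.
# Host, port, username and table name are illustrative, not from the post.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL is declarative and SQL-like; queries compile down to batch jobs,
# so don't expect real-time answers.
cursor.execute("SELECT country, COUNT(*) FROM web_logs GROUP BY country")
for country, hits in cursor.fetchall():
    print(country, hits)

cursor.close()
conn.close()
```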

Pig
Pig is a procedural language for developing parallel processing applications for large data sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce, and automatically generates MapReduce functions. Pig includes Pig Latin, which is a scripting language. Pig translates Pig Latin scripts into MapReduce, which can then run on YARN and process data in the HDFS cluster. Pig is popular because it automates some of the complexity in MapReduce development.

HBase
HBase is a scalable, distributed, NoSQL database that sits atop HDFS. It was designed to store structured data in tables that could have billions of rows and millions of columns. It has been deployed to power historical searches through large data sets, especially when the desired data is contained within a large amount of unimportant or irrelevant data (also known as sparse data sets). It is also an underlying technology behind several large messaging applications, including Facebook’s.
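
As a hedged illustration, this is roughly what talking to HBase from Python looks like with the happybase package (which goes through HBase’s Thrift gateway); the table, column family and row key are invented.

```python
# Hedged sketch: reading and writing HBase from Python via happybase.
# Table, column family and row key are illustrative.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("messages")

# Column qualifiers are free-form, which suits sparse data sets:
# most rows only populate a handful of the possible columns.
table.put(b"user42-2019-07-20", {b"msg:body": b"hello", b"msg:sender": b"user7"})

row = table.row(b"user42-2019-07-20")
print(row[b"msg:body"])

connection.close()
```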

Oozie
Oozie is the workflow scheduler that was developed as part of the Apache Hadoop project. It manages how workflows start and execute, and also controls the execution path. Oozie is a server-based Java web application that uses workflow definitions written in hPDL, which is an XML Process Definition Language similar to JBOSS JBPM jPDL. Oozie only supports specific workflow types, so other workload schedulers are commonly used instead of or in addition to Oozie in Hadoop environments.

Sqoop
Think of Sqoop as a front-end loader for big data. Sqoop is a command-line interface that facilitates moving bulk data between Hadoop and relational databases and other structured data stores. Using Sqoop replaces the need to develop scripts to export and import data. One common use case is to move data from an enterprise data warehouse to a Hadoop cluster for ETL processing. Performing ETL on the commodity Hadoop cluster is resource efficient, while Sqoop provides a practical transfer method.

HCatalog
HCatalog is a table and storage management layer for Hadoop. HCatalog allows different components of the Hadoop ecosystem, like MapReduce, Hive, and Pig, to easily read and write data from the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile and ORC file formats.

Avro
Avro is part of the Hadoop ecosystem and is one of the most popular data serialization systems. Avro is an open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Using Avro, big data can be exchanged between programs written in different languages.

Thrift
Thrift is a software framework for scalable cross-language service development. Thrift is an interface definition language for RPC (remote procedure call) communication. Hadoop does a lot of RPC calls, so there is the possibility of using the Hadoop ecosystem component Apache Thrift for performance or other reasons.

Drill
The main purpose of this Hadoop ecosystem component is large-scale data processing, including structured and semi-structured data. Drill is a low-latency distributed query engine that is designed to scale to several thousands of nodes and query petabytes of data. Drill is the first distributed SQL query engine that has a schema-free model.

Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.

Flume
Flume efficiently collects, aggregates and moves large amounts of data from their origin into HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment. It uses a simple extensible data model that allows for online analytic applications. Using Flume, we can move data from multiple servers into Hadoop immediately.

Ambari
Ambari, another Hadoop ecosystem component, is a management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Hadoop management gets simpler as Ambari provides a consistent, secure platform for operational control.

Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper manages and coordinates a large cluster of machines.
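
A hedged sketch of the kind of thing ZooKeeper is used for, here via the kazoo Python client; the znode path and value are illustrative.

```python
# Hedged sketch: using ZooKeeper for shared configuration from Python
# with the kazoo package. The znode path and value are made up.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# Store a small piece of cluster-wide configuration under a znode.
zk.ensure_path("/myapp/config")
zk.set("/myapp/config", b"max_workers=8")

# Any other process in the cluster can read (or watch) the same znode.
value, stat = zk.get("/myapp/config")
print(value.decode(), "version:", stat.version)

zk.stop()
```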

Lucene
Apache Lucene is a full-text search engine which can be used from various programming languages, is a free and open-source information retrieval software library, originally written completely in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.

Solr
Solr is an open-source enterprise-search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling.

Phoenix
Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store.

Presto
Presto is a high performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB. One can even query data from multiple data sources within a single query.

Zeppelin
Apache Zeppelin is a multi-purpose web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark.

Storm
Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter.

Flink
Apache Flink is an open-source stream-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner.

Samza
Apache Samza is an open-source near-realtime, asynchronous computational framework for stream processing developed by the Apache Software Foundation in Scala and Java.

Arrow
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Airflow
Airflow is a platform to programmatically author, schedule and monitor workflows. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
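
Since Airflow workflows are plain Python, here is a hedged sketch of a tiny DAG; the DAG id, tasks and schedule are made up, and the BashOperator import path can differ between Airflow versions.

```python
# Hedged sketch: a tiny Airflow DAG. Task names and schedule are illustrative;
# the point is just workflows-as-Python-code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # path differs in newer Airflow

with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The scheduler runs "load" only after "extract" has succeeded.
    extract >> load
```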

Blazingdb
BlazingSQL is a distributed GPU-accelerated SQL engine with data lake integration, where data lakes are huge quantities of raw data that are stored in a flat architecture. It is ACID-compliant. BlazingSQL targets ETL workloads and aims to perform efficient read IO and OLAP querying. BlazingDB refers to the company and BlazingSQL refers to the product.

Chukwa
A data collection system for managing large distributed systems.

Impala
The open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Get started with Hadoop: From evaluation to your first production cluster
https://www.oreilly.com/ideas/getting-started-with-hadoop

Hadoop distributions

When we talk about Hadoop distributions, the top three most famous are Cloudera, Hortonworks and MapR.

In October 2018 Cloudera and Hortonworks announced a merger, which was completed in January 2019.

Cloudera and Hortonworks Announce Merger to Create World’s Leading Next Generation Data Platform and Deliver Industry’s First Enterprise Data Cloud https://www.cloudera.com/about/news-and-blogs/press-releases/2018-10-03-cloudera-and-hortonworks-announce-merger-to-create-worlds-leading-next-generation-data-platform-and-deliver-industrys-first-enterprise-data-cloud.html

Cloudera and Hortonworks Complete Planned Merger https://www.cloudera.com/about/news-and-blogs/press-releases/2019-01-03-cloudera-and-hortonworks-complete-planned-merger.html

  • Cloudera offers the highest performance and lowest cost platform for using data to drive better business outcomes. Cloudera has a track record of bringing new open source solutions into its platform (such as Apache Spark, Apache HBase, and Apache Parquet) that are eventually adopted by the community at large. Cloudera Navigator provides everything your organization needs to keep sensitive data safe and secure while still meeting compliance requirements. Cloudera Manager is the easiest way to administer Hadoop in any environment, with advanced features like intelligent configuration defaults, customized monitoring, and robust troubleshooting. Cloudera delivers the modern data management and analytics…
  • Hortonworks Sandbox is a personal, portable Apache Hadoop environment that comes with dozens of interactive tutorials for Hadoop and its ecosystem, plus the most exciting developments from the latest HDP distribution. Hortonworks Sandbox provides performance gains of up to 10 times for applications that store large datasets, such as state management, through a revamped Spark Streaming state tracking API. It provides seamless data access to achieve higher performance with Spark. It also provides Dynamic Executor Allocation to utilize cluster resources efficiently, automatically expanding and shrinking resources based on utilization.
  • MapR Converged Data Platform integrates the power of Hadoop and Spark with global event streaming, real-time database capabilities, and enterprise storage for developing and running innovative data applications. Modules include MapR-FS, MapR-DB, and MapR Streams. Its enterprise-friendly design provides a familiar set of file and data management services, including a global namespace, high availability, data protection, self-healing clusters, access control, real-time performance, secure multi-tenancy, and management and monitoring. MapR tests and integrates open source ecosystem projects such as Hive, Pig, Apache HBase and Mahout, among others.

Commercial Hadoop Vendors

1) Amazon Elastic MapReduce
2) Microsoft Azure’s HDInsight – Cloud based Hadoop Distribution
3) IBM Open Platform
4) Pivotal Big Data Suite
5) Datameer Professional
6) Datastax Enterprise Analytics
7) Dell – Cloudera Apache Hadoop Solution.
8) Oracle

Top Hadoop Appliances

Hadoop appliance providers offer hardware optimised for Apache Hadoop or for enterprise Hadoop versions.

Dell provides PowerEdge servers with Intel Xeon processors, Dell Networking, Cloudera Enterprise Basic Edition, Dell Professional Services, and the Dell In-Memory Appliance for Cloudera Enterprise with Apache Spark.

EMC provides Greenplum HD and Greenplum MR. EMC provides Pivotal HD, which is an Apache Hadoop distribution that natively integrates EMC Greenplum massively parallel processing (MPP) database technology with the Apache Hadoop framework.

Teradata Appliance for Hadoop provides optimized hardware, flexible configurations, high-speed connectors, enhanced software usability features, proactive systems monitoring, intuitive management portals, continuous availability, and linear scalability.

HP AppSystem for Apache Hadoop is an enterprise-ready Apache Hadoop platform and provides RHEL v6.1, Cloudera Enterprise Core (the market-leading Apache Hadoop software), HP Insight CMU v7.0 and a sandbox that includes HP Vertica Community Edition v6.1.

NetApp Open Solution for Hadoop provides a ready-to-deploy, enterprise-class infrastructure for the Hadoop platform to control and gain insights from big data.

Oracle Big Data Appliance X6-2 Starter Rack contains six Oracle Sun x86 servers within a full-sized rack with redundant InfiniBand switches and power distribution units. It includes all Cloudera Enterprise Technology software, including Cloudera CDH, Cloudera Manager, and Cloudera RTQ (Impala).

Top Hadoop Managed Services

Amazon EMR
Amazon EMR simplifies big data processing, providing a managed Hadoop framework that makes it easy, fast, and cost-effective to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances.

Microsoft HDInsight
HDInsight is a managed Apache Hadoop, Spark, R, HBase, and Storm cloud service made easy. It provides a data lake service, scales to petabytes on demand, crunches all data (structured, semi-structured and unstructured) and lets you develop in Java, .NET, and more. It provides Apache Hadoop, Spark, and R clusters in the cloud.

Google Cloud Platform
Google offers Apache Spark and Apache Hadoop clusters easily on Google Cloud Platform.

Qubole
Qubole Data Service (QDS) offers Hadoop as a Service: a cloud computing solution that makes medium and large-scale data processing accessible, easy and fast.

IBM BigInsights
IBM BigInsights on Cloud provides Hadoop-as-a-service on IBM’s SoftLayer global cloud infrastructure. It offers the performance and security of an on-premises deployment.

Teradata Cloud for Hadoop
Teradata Cloud for Hadoop includes Teradata developed software components that make Hadoop ready for the enterprise: high availability, performance, scalability, monitoring, manageability, data transformation, data security, and a full range of tools and utilities.

Altiscale Data Cloud
Altiscale Data Cloud is a fully managed Big Data platform, delivering instant access to production ready Apache Hadoop and Apache Spark on the world’s best Big Data infrastructure.

Rackspace
Rackspace’s Apache Hadoop distribution includes common tools like MapReduce, HDFS, Pig, Hive, YARN, and Tez. Rackspace provides root access to the application itself, allowing users to interact directly with the core platform.

Oracle
Oracle offers a Cloudera solution on the top of the Oracle cloud infrastructure.

Links

Hadoop Ecosystem and Their Components – A Complete Tutorial

Big Data & Data Lake a complete overview

What’s the crack jack?

If you ever wanted to know what Big Data really is (and not what you think it is), or what a Data Lake really is (and not what you think it is), you should check this out.

I just finished a series of blog posts where I give an overview of Big Data, Data Lake, Hadoop, Apache Spark and Apache Kafka.

The idea here is a complete post with a good overview and a good starting point for discovering these areas and technologies.

What is Big Data?

What is Data Lake?

What is Hadoop?

What is Apache Spark?

What is Apache Kafka?

All posts are based on a collection of links, videos, tutorials, blogs and books that I found, mixed with my opinion.

There’s enough content to spend two hours reading, so happy studying!

Thank you for taking the time to read this post.

What is Kafka?

How’s the form?

If you’re not familiar with Big Data or data lakes, I suggest you have a look at my previous posts “What is Big Data?” and “What is Data Lake?” first.
This post is a collection of links, videos, tutorials, blogs and books that I found, mixed with my opinion.

Table of contents

1. What is Kafka?
2. Architecture
3. History
4. Courses
5. Books
6. Influencers List
7. Link

1. What is Kafka?

In simple terms, Kafka is a messaging system that is designed to be fast, scalable, and durable. It is an open-source stream processing platform. Kafka is a distributed publish-subscribe messaging system that maintains feeds of messages in partitioned and replicated topics.

Wikipedia definition: Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue designed as a distributed transaction log”, making it highly valuable for enterprise infrastructures to process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library.

I Googled around and found…
Kafka is designed for distributed high throughput systems. Kafka tends to work very well as a replacement for a more traditional message broker. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication and inherent fault-tolerance, which makes it a good fit for large-scale message processing applications.

Another one…
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on the disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.

Producers

Producers publish messages to a topic of their choice. It is possible to attach a key to each message, in which case the producer guarantees that all messages with the same key will arrive at the same partition.
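
As a hedged sketch with the kafka-python package (topic name, broker address and key are illustrative), a producer looks roughly like this:

```python
# Hedged sketch: a producer using the kafka-python package.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Messages with the same key always land in the same partition,
# so per-key ordering is preserved.
producer.send("page-views", key=b"user42", value=b'{"page": "/home"}')
producer.flush()
producer.close()
```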

Consumers

Consumers read the messages from a set of partitions of a topic of their choice, at their own pace. If a consumer is part of a consumer group, i.e. a group of consumers subscribed to the same topic, it can commit its offset. This is important if you want to consume a topic in parallel with different consumers.
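
And the matching consumer side, again a hedged kafka-python sketch with made-up topic and group names:

```python
# Hedged sketch: a consumer in a consumer group, reading from the
# earliest available offset with kafka-python.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # consumers in one group share the partitions
    auto_offset_reset="earliest",  # start from the beginning if no committed offset
    enable_auto_commit=True,       # commit offsets periodically
)

for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```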

Topics and Logs

A topic is a feed name or category to which records are published. Topics in Kafka are always multi-subscriber — that is, a topic can have zero, one, or many consumers that subscribe to the data written to it. For each topic, the Kafka cluster maintains a partitioned log.

Topics are logs that receive data from the producers and store them across their partitions. Producers always write new messages at the end of the log.

Partitions

A topic may have many partitions so that it can handle an arbitrary amount of data. As an example, imagine a topic configured with three partitions (partitions 0, 1 and 2), where Partition 0 has 13 offsets, Partition 1 has 10 offsets, and Partition 2 has 13 offsets.

Partition Offset

Each partitioned message has a unique sequence ID called an offset. For example, in Partition 1, the offset is marked from 0 to 9. The offset is the position in the log where the consumer last consumed or read a message.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

Geo-Replication

Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery; or in active/active scenarios to place data closer to your users, or support data locality requirements.

Replicas

Replicas are nothing but backups of a partition. If the replication factor of the above topic is set to 4, then Kafka will create four identical replicas of each partition and place them in the cluster to make them available for all its operations. Replicas are never used to read or write data. They are used to prevent data loss.

Messaging System

A messaging system is a system that is used for transferring data from one application to another, so that the applications can focus on the data and not on how to share it. Kafka is a distributed publish-subscribe messaging system. In a publish-subscribe system, messages are persisted in a topic. Message producers are called publishers and message consumers are called subscribers. Consumers can subscribe to one or more topics and consume all the messages in those topics.

Two types of messaging patterns are available: one is point-to-point and the other is publish-subscribe (pub-sub) messaging. Most messaging systems follow the pub-sub pattern.

  • Point-to-Point Messaging System – In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by at most one consumer. Once a consumer reads a message in the queue, it disappears from that queue. The typical example of this system is an order processing system, where each order is processed by one order processor, but multiple order processors can work at the same time.
  • Publish-Subscribe Messaging System – In a publish-subscribe system, messages are persisted in a topic. Unlike a point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In a publish-subscribe system, message producers are called publishers and message consumers are called subscribers. A real-life example is Dish TV, which publishes different channels like sports, movies, music, etc., and anyone can subscribe to their own set of channels and get them whenever their subscribed channels are available.

Brokers

Brokers are simple systems responsible for maintaining published data. Kafka brokers are stateless, so they use ZooKeeper for maintaining their cluster state. Each broker may have zero or more partitions per topic. For example, if there are 10 partitions on a topic and 10 brokers, then each broker will have one partition. But if there are 10 partitions and 15 brokers, then the starting 10 brokers will have one partition each and the remaining five won’t have any partition for that particular topic. However, if partitions are 15 but brokers are 10, then brokers would be sharing one or more partitions among them, leading to unequal load distribution among the brokers. Try to avoid this scenario.

Cluster

When Kafka has more than one broker, it is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters are used to manage the persistence and replication of message data.

Multi-tenancy

You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients. For more information, see the security documentation.

Zookeeper

ZooKeeper is used for managing and coordinating Kafka brokers. ZooKeeper is mainly used to notify producers and consumers about the presence of any new broker in the Kafka system, or about the failure of any broker. Based on these notifications, producers and consumers make decisions and start coordinating their tasks with another broker.

2. Architecture

Kafka has four core APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

Apache describes Kafka as a distributed streaming platform that lets us:

  • Publish and subscribe to streams of records.
  • Store streams of records in a fault-tolerant way.
  • Process streams of records as they occur.

Apache.org states that:

  • Kafka runs as a cluster on one or more servers.
  • The Kafka cluster stores a stream of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.

3. History

Kafka was developed around 2010 at LinkedIn by a team that included Jay Kreps, Jun Rao, and Neha Narkhede. The problem they originally set out to solve was low-latency ingestion of large amounts of event data from the LinkedIn website and infrastructure into a lambda architecture that harnessed Hadoop and real-time event processing systems. The key was the “real-time” processing. At the time, there weren’t any solutions for this type of ingress for real-time applications.

There were good solutions for ingesting data into offline batch systems, but they exposed implementation details to downstream users and used a push model that could easily overwhelm a consumer. Also, they were not designed for the real-time use case.

Kafka was developed to be the ingestion backbone for this type of use case. Back in 2011, Kafka was ingesting more than 1 billion events a day. Recently, LinkedIn has reported ingestion rates of 1 trillion messages a day.

https://www.confluent.io/blog/apache-kafka-hits-1-1-trillion-messages-per-day-joins-the-4-comma-club/

Why Kafka?

In Big Data, an enormous volume of data is used. But how are we going to collect this large volume of data and analyze that data? To overcome this, we need a messaging system. That is why we need Kafka. The functionalities that it provides are well-suited for our requirements, and thus we use Kafka for:

  • Building real-time streaming data pipelines that can get data between systems and applications.
  • Building real-time streaming applications to react to the stream of data.

Kafka can work with Flume/Flafka, Spark Streaming, Storm, HBase, Flink, and Spark for real-time ingestion, analysis and processing of streaming data. Kafka is often the data stream used to feed Hadoop Big Data lakes. Kafka brokers support massive message streams for low-latency follow-up analysis in Hadoop or Spark. Also, Kafka Streams (a subproject) can be used for real-time analytics.

Why is it so popular?

RedMonk.com published an article in February 2016 documenting some interesting stats around the “rise and rise” of a powerful asynchronous messaging technology called Apache Kafka.
https://redmonk.com/fryan/2016/02/04/the-rise-and-rise-of-apache-kafka/

Kafka has operational simplicity. Kafka is easy to set up and use, and it is easy to figure out how Kafka works. However, the main reason Kafka is so popular is its excellent performance. It is stable, provides reliable durability, has a flexible publish-subscribe/queue model that scales well with N consumer groups, has robust replication, provides producers with tunable consistency guarantees, and preserves ordering at the shard level (i.e. the Kafka topic partition). In addition, Kafka works well with systems that have data streams to process and enables those systems to aggregate, transform, and load into other stores. But none of those characteristics would matter if Kafka was slow: the most important reason Kafka is popular is its exceptional performance.

Who Uses Kafka?

A lot of large companies that handle a lot of data use Kafka. LinkedIn, where it originated, uses it to track activity data and operational metrics. Twitter uses it as part of Storm to provide a stream processing infrastructure. Square uses Kafka as a bus to move all system events to various Square data centers (logs, custom events, metrics, and so on), with outputs to Splunk and Graphite (dashboards), and to implement Esper-like/CEP alerting systems. It’s also used by other companies like Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, CloudFlare, and Netflix.

Why Is Kafka So Fast?

Kafka relies heavily on the OS kernel to move data around quickly, building on the principle of zero copy. Kafka enables you to batch data records into chunks. These batches of data can be seen end-to-end from the producer, to the file system (the Kafka topic log), to the consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes its immutable commit log to disk sequentially, thus avoiding random disk access and slow disk seeks. Kafka provides horizontal scale through sharding: it shards a topic log into hundreds (potentially thousands) of partitions spread across thousands of servers. This sharding allows Kafka to handle massive load.
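
To make the batching point concrete, here is a hedged kafka-python sketch of producer settings that lean on batching and compression; the values are illustrative, not tuning advice.

```python
# Hedged sketch: producer settings that exercise Kafka's batching and
# compression (values are illustrative, not recommendations).
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=64 * 1024,     # group records into 64 KB batches per partition
    linger_ms=20,             # wait up to 20 ms to fill a batch before sending
    compression_type="gzip",  # compress whole batches, not single records
    acks="all",               # wait for the in-sync replicas before confirming
)

for i in range(10_000):
    producer.send("metrics", value=b"cpu=42")
producer.flush()
```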

Benefits of Kafka

Four main benefits of Kafka are:

  • Reliability. Kafka is distributed, partitioned, replicated, and fault tolerant. Kafka replicates data and is able to support multiple subscribers. Additionally, it automatically balances consumers in the event of failure.
  • Scalability. Kafka is a distributed system that scales quickly and easily without incurring any downtime.
  • Durability. Kafka uses a distributed commit log, which means messages are persisted on disk as fast as possible and replicated within the cluster, hence it is durable.
  • Performance. Kafka has high throughput for both publishing and subscribing to messages. It maintains stable performance even when dealing with many terabytes of stored messages.

Use Cases

Kafka can be used in many use cases. Some of them are listed below:

  • Metrics − Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
  • Log Aggregation Solution − Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
  • Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic, where it becomes available for users and applications. Kafka’s strong durability is also very useful in the context of stream processing.

4. Courses

https://www.udemy.com/apache-kafka-tutorial-for-beginners/

5. Books

Kafka: The Definitive Guide is the best option to start.

oreilly
pdf
github

6. Influencers List

@nehanarkhede

@rmoff

@tlberglund

7. Link

Confluent

Apache Kafka

Thorough Introduction to Apache Kafka

A good Kafka explanation

What is Kafka

Kafka Architecture and Its Fundamental Concepts

Apache Kafka Tutorial — Kafka For Beginners

What to consider for painless Apache Kafka integration

What is Data Lake?

How’s it going there?

If you’re not familiar with Big Data, I suggest you have a look at my post “What is Big Data?” first.
This post is a collection of links, videos, tutorials, blogs and books that I found, mixed with my opinion.

Table of contents

1. What is Data Lake?
2. History
3. Courses
4. Books
5. Influencers List
6. Link

1. What is Data Lake?

Like Big Data, a data lake is not straightforward to explain, and there is no single answer. Even though there is no universally accepted definition of Data Lake, there are some common concepts, which I’ll try to cover in this post.

I like the simple definition:
A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources.

I Googled around and found a different answer.
A data lake is a massive, easily accessible, centralized repository of large volumes of structured and unstructured data. The data lake architecture is a store-everything approach to big data. Data are not classified when they are stored in the repository, as the value of the data is not clear at the outset. As a result, data preparation is eliminated. A data lake is thus less structured compared to a conventional data warehouse. When the data are accessed, only then are they classified, organized or analyzed.

Another answer:
A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).

According to Nick Heudecker at Gartner:
Data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.

Forbes tries to explain Data Lake by making a comparison with Data Warehouse:
https://www.forbes.com/sites/bernardmarr/2018/08/27/what-is-a-data-lake-a-super-simple-explanation-for-anyone/#3905f53776e0

Martin Fowler.
The idea is to have a single store for all of the raw data that anyone in an organisation might need to analyse. Commonly people use Hadoop to work on the data in the lake, but the concept is broader than just Hadoop.
https://martinfowler.com/bliki/DataLake.html

Data Lakes by Oracle
https://blogs.oracle.com/bigdata/data-lakes

Data lakes are becoming increasingly important as people, especially in business and technology, want to perform broad data exploration and discovery. Bringing all, or most, of the data together in a single place is useful for that.
Most data lake implementations are probably based on the Hadoop ecosystem, a set of tools that makes it easy to use MapReduce and other computation models.
All data lakes have some distributed file system. Data should be persisted in raw form, because it is often not possible to structure it on ingestion; it can be structured later with transformation processes. As you can see, there is a need for a dedicated layer which allows unstructured data to be persisted efficiently. In Hadoop, HDFS fulfils this role.

To build ingestion and transformation processes, we need to use a computation system that is fault-tolerant, easily scalable, and efficient at processing large data sets. Nowadays, streaming systems are gaining in popularity: Spark, Storm, Flink… At the beginning of Big Data, only MapReduce was available, which was (and still is) used as a bulk-processing framework.
Scalability in a computation system requires resource management. In a data lake, we have huge amounts of data, requiring thousands of nodes. Prioritization is achieved by allocating resources and queuing tasks: some transformations require more resources, some require less, and major tasks get more resources. This resource allocation role in Hadoop is performed by YARN.

https://40uu5c99f3a2ja7s7miveqgqu-wpengine.netdna-ssl.com/wp-content/uploads/2017/02/Understanding-data-lakes-EMC.pdf

What is Streaming?

Streaming is data that is continuously generated by different sources. Such data should be processed incrementally using Stream Processing techniques without having access to all of the data. In addition, it should be considered that concept drift may happen in the data which means that the properties of the stream may change over time.
It is usually used in the context of big data in which it is generated by many different sources at high speed.
Data streaming can also be explained as a technology used to deliver content to devices over the internet, and it allows users to access the content immediately, rather than having to wait for it to be downloaded. Big data is forcing many organizations to focus on storage costs, which brings interest to data lakes and data streams.

In Big Data management, data streaming is the continuous high-speed transfer of large amounts of data from a source system to a target. By efficiently processing and analyzing real-time data streams to glean business insight, data streaming can provide up-to-the-second analytics that enable businesses to react quickly to changing conditions.

What can Data Lake do?

This is not an exhaustive list!

  • Ingestion of semi-structured and unstructured data sources (aka big data) such as equipment readings, telemetry data, logs, streaming data, and so forth. A data lake is a great solution for storing IoT (Internet of Things) type of data which has traditionally been more difficult to store, and can support near real-time analysis. Optionally, you can also add structured data (i.e., extracted from a relational data source) to a data lake if your objective is a single repository of all data to be available via the lake.
  • Experimental analysis of data before its value or purpose has been fully defined. Agility is important for every business these days, so a data lake can play an important role in “proof of value” type of situations because of the “ELT” approach discussed above.
  • Advanced analytics support. A data lake is useful for data scientists and analysts to provision and experiment with data.
  • Archival and historical data storage. Sometimes data is used infrequently, but does need to be available for analysis. A data lake strategy can be very valuable to support an active archive strategy.
  • Support for Lambda architecture which includes a speed layer, batch layer, and serving layer.
  • Preparation for data warehousing. Using a data lake as a staging area of a data warehouse is one way to utilize the lake, particularly if you are getting started.
  • Augment a data warehouse. A data lake may contain data that isn’t easily stored in a data warehouse, or isn’t queried frequently. The data lake might be accessed via federated queries which make its separation from the DW transparent to end users via a data virtualization layer.
  • Distributed processing capabilities associated with a logical data warehouse.
  • Storage of all organizational data to support downstream reporting & analysis activities. Some organizations wish to achieve a single storage repository for all types of data. Frequently, the goal is to store as much data as possible to support any type of analysis that might yield valuable findings.
  • Application support. In addition to analysis by people, a data lake can be a data source for a front-end application. The data lake might also act as a publisher for a downstream application (though ingestion of data into the data lake for purposes of analytics remains the most frequently cited use).

Under the umbrella of Data Lake there are many technologies and concepts. This is not an exhaustive list!

Data ingestion

Data ingestion is the process of flowing data from its origin to one or more data stores, such as a data lake, though this can also include databases and search engines.
Sources can be clickstreams, data center logs, sensors, APIs or even databases. They use various data formats (structured, unstructured, semi-structured, multi-structured), can make data available in a stream or batches, and support various protocols for data movement.

There are two processing frameworks which “ingest” data into data lakes:

  • Batch processing – Millions of blocks of data processed over long periods of time (hours-to-days).
  • Stream processing – Small batches of data processed in real-time. Stream processing is becoming increasingly valuable for businesses that harness real-time analytics.

Data Marts

A Data Mart is an archive of stored, normally structured data, typically used and controlled by a specific community or department. It is normally smaller and more focused than a Data Warehouse and, currently, is often a subdivision of a Data Warehouse. Data Marts were the first evolutionary step in the physical reality of Data Warehouses and Data Lakes.

Data Silos

Data Silos are part of a Data Warehouse and similar to Data Marts, but much more isolated. Data Silos are insulated management systems that cannot work with other systems. A Data Silo contains fixed data that is controlled by one department and is cut off from other parts of the organization. They tend to form within large organizations due to the different goals and priorities of individual departments. Data Silos also form when departments compete with one another instead of working as a team toward common business goals.

Data Warehouses

Data Warehouses are centralized repositories of information that can be researched for purposes of making better informed decisions. The data comes from a wide range of sources and is often unstructured. Data is accessed through the use of business intelligence tools, SQL clients, and other Analytics applications.

Data swamp

In a data lake there are many things going on, and it is not possible to manage them all manually. Without constraints and a thoughtful approach to processes, a data lake will degenerate very quickly. If ingested data does not contain business information, then we can’t find the right context for it. If everyone generates anonymous data without lineage, then we will have tons of useless data. No one will know what is going on. Who is the author of the changes? Where did the data come from? Everything starts to look like a data swamp.

Data Mining

Data mining is defined as “knowledge discovery in databases,” and is how data scientists uncover previously unseen patterns and truths through various models. Data mining is about sifting through large sets of data to uncover patterns, trends, and other truths that may not have been previously visible.

Data Cleansing

The goal of data cleansing is to improve data quality and utility by catching and correcting errors before it is transferred to a target database or data warehouse. Manual data cleansing may or may not be realistic, depending on the amount of data and number of data sources your company has. Regardless of the methodology, data cleansing presents a handful of challenges, such as correcting mismatches, ensuring that columns are in the same order, and checking that data (such as date or currency) is in the same format.

Data Quality

People try to describe data quality using terms like complete, accurate, accessible, and de-duped. And while each of these words describes a specific element of data quality, the larger concept of data quality is really about whether or not that data fulfills the purpose or purposes you want to use it for.

ETL

ETL stands for “Extract, Transform, Load”, and is the common paradigm by which data from multiple systems is combined into a single database, data store, or warehouse for legacy storage or analytics.

ELT

ELT is a process that involves extracting the data, loading it to the target warehouse, and then transforming it after it is loaded. In this case, the work of transforming the data is completed by the target database.
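
A hedged, toy-sized sketch of the difference, using plain Python and sqlite3 standing in for a real source system and warehouse:

```python
# Hedged sketch contrasting ETL and ELT; sqlite3 stands in for the
# source system and the target warehouse, and the tables are invented.
import sqlite3

# Illustrative "source system" with a few raw rows (amounts in cents).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                   [(1, 1999), (2, None), (3, 500)])

# Illustrative "warehouse" target.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (id INTEGER, amount_eur REAL)")
warehouse.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")

rows = source.execute("SELECT id, amount_cents FROM raw_orders").fetchall()  # Extract

# ETL: Transform in the pipeline, then Load only the cleaned result.
cleaned = [(i, cents / 100.0) for i, cents in rows if cents is not None]
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)

# ELT: Load the raw rows as-is, then Transform inside the target database.
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
warehouse.execute("DELETE FROM orders")  # reset so we only keep the ELT result
warehouse.execute(
    "INSERT INTO orders "
    "SELECT id, amount_cents / 100.0 FROM raw_orders WHERE amount_cents IS NOT NULL"
)

print(warehouse.execute("SELECT * FROM orders").fetchall())
```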

Data extraction

Data extraction is a process that involves retrieval of data from various sources. Frequently, companies extract data in order to process it further, migrate the data to a data repository (such as a data warehouse or a data lake) or to further analyze it. It’s common to transform the data as a part of this process.

Data loading

Data loading refers to the “load” component of ETL. After data is retrieved and combined from multiple sources (extracted), cleaned and formatted (transformed), it is then loaded into a storage system, such as a cloud data warehouse.

Data Transformation

Data transformation is the process of converting data from one format or structure into another format or structure. Data transformation is critical to activities such as data integration and data management. Data transformation can include a range of activities: you might convert data types, cleanse data by removing nulls or duplicate data, enrich the data, or perform aggregations, depending on the needs of your project.

Data Migration

Data migration is simply the process of moving data from a source system to a target system. Companies have many different reasons for migrating data. You may want to migrate data when you acquire another company and you need to integrate that company’s data. Or, you may want to integrate data from different departments within your company so the data is available across your entire business. You may want to move your data from an on-premise platform to a cloud platform. Or, perhaps you’re moving from an outdated data storage system to a new database or data storage system. The concept of data migration is simple, but it can sometimes be a complex process.

Lambda architecture

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.

Kappa Architecture

Kappa Architecture is a software architecture pattern. Rather than using a relational DB like SQL or a key-value store like Cassandra, the canonical data store in a Kappa Architecture system is an append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving.
Kappa Architecture is a simplification of Lambda Architecture. A Kappa Architecture system is like a Lambda Architecture system with the batch processing system removed. To replace batch processing, data is simply fed through the streaming system quickly.

SIEM Security information and event management

The underlying principle of every SIEM system is to aggregate relevant data from multiple sources, identify deviations from the norm and take appropriate action. For example, when a potential issue is detected, a SIEM might log additional information, generate an alert and instruct other security controls to stop an activity’s progress.

Security information and event management (SIEM) is an approach to security management that combines SIM (security information management) and SEM (security event management) functions into one security management system. The acronym SIEM is pronounced “sim” with a silent e.

SIEM software collects and aggregates log data generated throughout the organization’s technology infrastructure, from host systems and applications to network and security devices such as firewalls and antivirus filters. The software then identifies and categorizes incidents and events, as well as analyzing them.

SNA social network analysis

Social network analysis is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them.

Data lake vs data warehouse

Data warehouses rely on structure and clean data, whereas data lakes allow data to be in its most natural form. This is because advanced analytic tools and mining software intake raw data and transform it into useful insight.

Both data lakes and data warehouses are repositories for data. That’s about the only similarity between the two. Now, let’s touch on some of the key differences:

  • Data lakes are designed to support all types of data, whereas data warehouses make use of highly structured data – in most cases.
  • Data lakes store all data that may or may not be analyzed at some point in the future. This principle doesn’t apply to data warehouses since irrelevant data is typically eliminated due to limited storage.
  • The scale between data lakes and data warehouses is drastically different due to our previous points. Supporting all types of data and storing that data (even if it’s not immediately useful) means data lakes need to be highly scalable.
  • Thanks to metadata (data about data), users working with a data lake can gain basic insight about the data quickly. In data warehouses, it often requires a member of the development team to access the data – which could create a bottleneck.
  • Lastly, the intense data management required for data warehouses means they’re typically more expensive to maintain compared to data lakes.

I’m a believer that modern data warehousing is still very important. Therefore, a data lake by itself doesn’t entirely replace the need for a data warehouse (or data marts), which contain cleansed data in a user-friendly format. The data warehouse doesn’t absolutely have to be in a relational database anymore, but it does need an easy-to-use semantic layer that most business users can access for the most common reporting needs.

There are always tradeoffs between performing analytics on a data lake versus on a cleansed data warehouse: query performance, data load performance, scalability, reusability, data quality, and so forth. Therefore, I believe that a data lake and a data warehouse are complementary technologies that can offer balance. For fast, especially exploratory, analysis by a highly qualified analyst or data scientist, the data lake is appealing. For delivering cleansed, user-friendly data to a large number of users across the organization, the data warehouse still rules.

2. History

Data Lakes allow Data Scientists to mine and analyze large amounts of Big Data. Big Data, which was used for years without an official name, was labeled by Roger Magoulas in 2005. He was describing a large amount of data that seemed impossible to manage or research using the traditional SQL tools available at the time. Hadoop (2008) provided the search engine needed for locating and processing unstructured data on a massive scale, opening the door for Big Data research.

In October of 2010, James Dixon, founder and former CTO of Pentaho, came up with the term “Data Lake”.

The term Data Lake came from the idea that data is dropped in one place, and this place becomes a lake.
Visiting a large lake is always a very pleasant feeling. The water in the lake is in its purest form, and different people perform different activities on the lake: some people are fishing, some are enjoying a boat ride, and the lake also supplies drinking water. In short, the same lake is used for multiple purposes. Like the water in the lake, data in a data lake is in its purest possible form, and just as the lake caters to different people (those who want to fish, those who want a boat ride, those who want drinking water), a data lake architecture caters to multiple personas.

This is not a official history about how the name came from. If you know please leave one comment.

3. Courses


https://www.mooc-list.com/course/data-lakes-big-data-edcast

4. Books

The Enterprise Big Data Lake (O’Reilly)

Data Lake for Enterprises (O’Reilly)

5. Influencers List

https://www.whizlabs.com/blog/top-big-data-influencers/

6. Link

Data lakes and the promise of unsiloed data

Every software engineer should know about real-time data

Questioning the lambda architecture

What’s a data lake

Data Lake architecture

What’s the Difference Between a Data Lake, Data Warehouse and Database?

What is data streaming

What is Apache Spark?

How heya?

If you’re not familiar with Big Data, I suggest you have a look at my post “What is Big Data?” first.
This post is a collection of links, videos, tutorials, blogs and books that I found, mixed with my opinion.

Table of contents

1. What is Apache Spark?
2. Architecture
3. History
4. Courses
5. Books
6. Influencers List
7. Link

1. What is Apache Spark?

Apache Spark is an open source parallel processing framework for processing Big Data across clustered computers. Spark can perform computations much faster than Hadoop, and Hadoop and Spark can also be used together efficiently. Spark is written in Scala, which executes inside a Java Virtual Machine (JVM) and is considered the primary language for interacting with the Spark Core engine, but Spark doesn’t require developers to know Scala. APIs for Java, Python, R, and Scala ensure Spark is within reach of a wide audience of developers, and they have embraced the software.

Apache Spark is the most actively developed open source project in big data and probably the most widely used as well. Spark is a general-purpose execution engine that can perform cluster computing on Big Data very quickly and efficiently, and it is rapidly gaining features and capabilities, like libraries to perform different types of analytics.

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the Hadoop MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP.
Spark is designed for data science and its abstraction makes data science easier.

Spark also includes MLlib, a library that provides a growing set of machine learning algorithms for common data science techniques: classification, regression, collaborative filtering, clustering and dimensionality reduction.

The Driver and the Executor

Spark uses a master-slave architecture. A driver coordinates many distributed workers in order to execute tasks in a distributed manner while a resource manager deals with the resource allocation to get the tasks done.

Driver

Think of it as the “Orchestrator”. The driver is where the main method runs. It converts the program into tasks and then schedules the tasks to the executors. The driver has at its disposal 3 different ways of communicating with the executors: Broadcast, Take and DAG. It controls the execution of a Spark application and maintains all of the states of the Spark cluster, which includes the state and tasks of the executors. The driver must interface with the cluster manager in order to get physical resources and launch executors. To put this in simple terms, the driver is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster.

  • Broadcast Action: The driver transmits the necessary data to each executor. This action is optimal for data sets under a million records, roughly 1 GB of data. This action can become a very expensive task.
  • Take Action: The driver takes data from all executors. This can be a very expensive and dangerous action, as the driver might run out of memory and the network could become overwhelmed.
  • DAG Action: (Directed Acyclic Graph) This is by far the least expensive action of the three. It transmits control flow logic from the driver to the executors.

Executor – “Workers”

Executors execute the tasks delegated by the driver within a JVM instance. Executors are launched at the beginning of a Spark application and normally run for the whole life span of the application. This allows data to persist in memory while different tasks are loaded in and out of the executor throughout the application’s lifespan.
In stark contrast, the JVM worker environment in Hadoop MapReduce powers down and powers up for each task. The consequence of this is that Hadoop must perform reads and writes on disk at the start and end of every task.

Cluster manager

This is responsible for allocating resources across the Spark application. The Spark context is capable of connecting to several types of cluster managers, like Mesos, YARN or Kubernetes, apart from Spark’s standalone cluster manager.
The cluster manager is responsible for maintaining a cluster of machines that will run your Spark application. Cluster managers have their own ‘driver’ and ‘worker’ abstractions, but the difference is that these are tied to physical machines rather than processes.

Spark Context

It holds a connection with the Spark cluster manager. Every Spark application runs as an independent set of processes, coordinated by a SparkContext in the driver program.
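
As a minimal sketch (not from the original post), this is how a SparkContext is usually created in PySpark; the app name and the local master URL are illustrative, not anything the post prescribes:

    from pyspark import SparkConf, SparkContext

    # Build a configuration and connect; "my-first-app" and "local[2]" are made up.
    conf = SparkConf().setAppName("my-first-app").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    print(sc.version)   # the Spark version the context is connected to
    sc.stop()           # release the resources held by the context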

Spark has three data representations: RDD, DataFrame and Dataset. For each data representation, Spark has a different API. A DataFrame is much faster than an RDD because it has metadata (some information about the data) associated with it, which allows Spark to optimize the query plan.
A nice place to understand more about RDD, DataFrame and Dataset is this article: “A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets” https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

RDD

A Resilient Distributed Dataset (RDD) is the primary data abstraction in Apache Spark and the core of Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel, and the RDD class contains the basic operations available on all RDDs, such as map, filter, and persist. In other words, an RDD is a fault-tolerant, immutable, distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.

We can apply 2 types of operations on RDDs:

Transformation: an operation applied to an RDD to create a new RDD.
Action: an operation applied to an RDD that performs a computation and sends the result back to the driver.

Example: Map (a transformation) performs an operation on each element of an RDD and returns a new RDD, while Reduce (an action) aggregates the output of a map by applying some function (for example, reduce by key).
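
A minimal PySpark sketch of this (not from the original post; the app name and the sample words are made up):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rdd-example")

    words = sc.parallelize(["hadoop", "big", "data", "hadoop", "spark"])

    # Transformations are lazy: nothing runs yet.
    pairs = words.map(lambda w: (w, 1))             # map each word to (word, 1)
    counts = pairs.reduceByKey(lambda a, b: a + b)  # aggregate the 1s per word

    # The action triggers execution and sends the result back to the driver.
    print(counts.collect())   # e.g. [('hadoop', 2), ('big', 1), ('data', 1), ('spark', 1)]
    sc.stop()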

RDDs use Shared Variables

The parallel operations in Apache Spark use shared variables. Whenever a task is sent by the driver to the executors in the cluster, a copy of the shared variable is sent to each node, so that they can use this variable while performing the task. Accumulator and Broadcast are the two types of shared variables supported by Apache Spark.
Broadcast: we can use a Broadcast variable to save a copy of the data across all nodes.
Accumulator: Accumulator variables are used for aggregating information.
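
A small sketch of both shared variables in PySpark; the lookup table, the country codes and the app name are invented for the example:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "shared-variables")

    # Broadcast: a read-only lookup table shipped once to every executor.
    countries = sc.broadcast({"IE": "Ireland", "BR": "Brazil"})

    # Accumulator: executors only add to it; the driver reads the total.
    unknown = sc.accumulator(0)

    def resolve(code):
        if code not in countries.value:
            unknown.add(1)
            return "unknown"
        return countries.value[code]

    print(sc.parallelize(["IE", "BR", "XX"]).map(resolve).collect())  # ['Ireland', 'Brazil', 'unknown']
    print(unknown.value)                                              # 1
    sc.stop()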

How to Create RDD in Apache Spark

Existing storage: when we want to create an RDD from existing data in the driver program (which we would like to parallelize). For example, converting a list that was already created in the driver program into an RDD.

External sources: when we want to create an RDD from external sources such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.
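
Both ways in a short PySpark sketch (the file path is hypothetical and only illustrates the idea):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "create-rdd")

    # 1. From existing storage in the driver program: parallelize a local list.
    numbers = sc.parallelize([1, 2, 3, 4, 5])

    # 2. From an external source: a text file. The path is hypothetical; on a
    #    real cluster an hdfs:// URI would work the same way.
    lines = sc.textFile("/tmp/sample.txt")

    print(numbers.count())
    sc.stop()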

The features of RDDs:

  • Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to recompute missing or damaged partitions due to node failures.
  • Distributed with data residing on multiple nodes in a cluster.
  • Dataset is a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects (that represent records of the data you work with).

The key reason RDDs are an abstraction that works better for distributed data processing is that they don’t have some of the issues of MapReduce, the older paradigm for data processing (which Spark is increasingly replacing). Chiefly, these are:

1. Replication: Replication of data on different parts of a cluster is a feature of HDFS that enables data to be stored in a fault-tolerant manner. Spark’s RDDs address fault tolerance by using a lineage graph instead. The different name (resilient, as opposed to replicated) indicates this difference of implementation in the core functionality of Spark.

2. Serialization: Serialization in MapReduce bogs it down, speed-wise, in operations like shuffling and sorting.

Dataframe

It is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with a lot more going on under the hood. A DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstractions. Since Spark 2.0, the DataFrame API has been merged with the Dataset API; a DataFrame is now a Dataset organized into named columns.
DataFrames can be created from RDDs.
DataFrames also address a lot of the performance issues that Spark had with non-JVM languages like Python and R. Historically, using RDDs in Python was much slower than in Scala. With DataFrames, code written in any of the languages performs roughly the same, with some exceptions.
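
A minimal PySpark sketch of building a DataFrame from an RDD (the column names, rows and app name are made up):

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

    # An RDD of Rows; a DataFrame could equally be read from CSV, JSON or Parquet.
    rdd = spark.sparkContext.parallelize([
        Row(name="Ana", age=34),
        Row(name="Joao", age=28),
    ])
    df = spark.createDataFrame(rdd)

    df.printSchema()               # the schema Spark inferred from the Rows
    df.filter(df.age > 30).show()  # a higher-level operation Spark can optimize
    spark.stop()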

Dataset

It is a collection of strongly-typed, domain-specific objects that can be transformed in parallel using functional or relational operations. A logical plan is created and updated for each transformation, and the final logical plan is converted to a physical plan when an action is invoked. Spark’s Catalyst optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. Further, Encoders generate an optimized, lower-memory-footprint binary structure. Encoders know the schema of the records, which is how they offer significantly faster serialization and deserialization (compared to the default Java or Kryo serializers).

The main advantage of Datasets is type safety. When using Datasets, we are assured that both syntax errors and analysis errors are caught at compile time. This contrasts with DataFrames, where a syntax error can be caught at compile time but an analysis error, such as referring to a non-existent column name, is caught only when you run the job. Those runs can be quite expensive, and as a developer it is nice to have the compiler and the IDE do these jobs for you.

2. Architecture

Spark Core

Spark Core is the base engine for large-scale parallel and distributed data processing. Further, additional libraries which are built on the top of the core allows diverse workloads for streaming, SQL, and machine learning. It is responsible for memory management and fault recovery, scheduling, distributing and monitoring jobs on a cluster & interacting with storage systems.

BlinkDB

BlinkDB (Blink Database) is also known as an Approximate SQL database. If there is a huge amount of data pouring in and you are not really interested in accuracy, or in exact results, but just want a rough or approximate picture, BlinkDB gets you that. Firing a query, doing some sort of sampling, and giving out an approximate output is called Approximate SQL. Isn’t it a new and interesting concept? Many a time, when you do not require accurate results, sampling will certainly do.

Spark SQL

Spark SQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools where you can extend the boundaries of traditional relational data processing.
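
A small PySpark sketch of the idea; the view name, columns and rows are invented for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    df = spark.createDataFrame(
        [("hadoop", 2006), ("spark", 2009), ("kafka", 2011)],
        ["project", "year"],
    )

    # Register the DataFrame as a temporary view so it can be queried with SQL.
    df.createOrReplaceTempView("projects")
    spark.sql("SELECT project FROM projects WHERE year >= 2009").show()
    spark.stop()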

Spark Streaming

Spark Streaming is one of those unique features that have empowered Spark to potentially take over the role of Apache Storm. Spark Streaming mainly enables you to create analytical and interactive applications for live streaming data. You can stream the data in, and Spark can run its operations on the streamed data itself.

Structured Streaming

Structured Streaming (added in Spark 2.x) is to Spark Streaming what Spark SQL was to the Spark Core APIs: a higher-level API and easier abstraction for writing applications. In the case of Structured Streaming, the higher-level API essentially allows developers to create infinite streaming DataFrames and Datasets. It also solves some very real pain points that users have struggled with in the earlier framework, especially concerning event-time aggregations and late delivery of messages. All queries on structured streams go through the Catalyst query optimizer, and can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data.
Structured Streaming is still a rather new part of Apache Spark, having been marked as production-ready in the Spark 2.2 release. However, Structured Streaming is the future of streaming applications with the platform, so if you’re building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
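
As a minimal sketch loosely following the programming guide linked above (not code from the original post): an unbounded streaming DataFrame read from a socket source. The host and port are illustrative; locally, `nc -lk 9999` can feed it lines.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("structured-streaming-example").getOrCreate()

    # An unbounded (infinite) streaming DataFrame read from a TCP socket.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # A continuously updated count of each distinct line seen so far.
    counts = lines.groupBy("value").count()

    # Write the running result to the console until the query is stopped.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()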

MLLib

MLlib is a machine learning library like Mahout. It is built on top of Spark and has the provision to support many machine learning algorithms. The point of difference with Mahout is that it runs almost 100 times faster than MapReduce. It is not yet as rich as Mahout, but it is coming along pretty well, even though it is still at an early stage of growth.
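
As a sketch (not from the original post), here is the newer DataFrame-based ML API in PySpark fitting a tiny, made-up training set; MLlib also still ships an older RDD-based API:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("mllib-example").getOrCreate()

    # A tiny, invented training set of (label, features) rows.
    training = spark.createDataFrame([
        (0.0, Vectors.dense([0.0, 1.1])),
        (1.0, Vectors.dense([2.0, 1.0])),
        (1.0, Vectors.dense([2.2, 1.5])),
    ], ["label", "features"])

    lr = LogisticRegression(maxIter=10)   # a classification algorithm from the library
    model = lr.fit(training)              # training runs on the cluster
    print(model.coefficients)
    spark.stop()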

GraphX

For graphs and graph computations, Spark has its own Graph Computation Engine, called GraphX. It is similar to other widely used graph processing tools or databases, like Neo4j, Apache Giraph, and many other distributed graph databases.

3. History

Apache Spark is about to turn 10 years old.
Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.

Soon after its creation it was already 10–20× faster than MapReduce for certain jobs.
Some of Spark’s first users were other groups inside UC Berkeley, including machine learning researchers such as the Mobile Millennium project, which used Spark to monitor and predict traffic congestion in the San Francisco Bay Area.

Spark was first open sourced in March 2010, and was transferred to the Apache Software Foundation in June 2013, where it is now a top-level project. It is an open source project that has been built and is maintained by a thriving and diverse community of developers. In addition to UC Berkeley, major contributors to Spark include Databricks, Yahoo!, and Intel.

Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in big data, with over 1000 contributors from 250+ organizations.

In-memory computation

The biggest advantage of Apache Spark comes from the fact that it saves and loads data in and from RAM rather than from disk (the hard drive). In the memory hierarchy, RAM is much faster to access than a hard drive. Since the price of memory has come down significantly in the last few years, in-memory computation has gained a lot of momentum.
Spark uses in-memory computation to run up to 100 times faster than the Hadoop MapReduce framework for some workloads.
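
A short PySpark sketch of keeping a dataset in memory between actions; in a real job the cached RDD would typically come from a file on HDFS, but a generated RDD keeps the example self-contained:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "cache-example")

    numbers = sc.parallelize(range(1_000_000))
    evens = numbers.filter(lambda n: n % 2 == 0)

    # cache() asks Spark to keep the filtered RDD in executor memory, so every
    # action after the first reuses it instead of recomputing it from scratch.
    evens.cache()

    print(evens.count())   # first action: computes the partitions and caches them
    print(evens.sum())     # second action: served from memory
    sc.stop()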

Spark VS Hadoop

Speed

  • Apache Spark — it’s a lightning-fast cluster computing tool. Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop by reducing the number of read-write cycles to disk and storing intermediate data in-memory.
  • Hadoop MapReduce — MapReduce reads and writes from disk, which slows down the processing speed and overall efficiency.

Ease of Use

  • Apache Spark — Spark’s many libraries facilitate the execution of lots of major high-level operators with RDD (Resilient Distributed Dataset).
  • Hadoop — In MapReduce, developers need to hand-code every operation, which can make it more difficult to use for complex projects at scale.

Handling Large Sets of Data

  • Apache Spark — since Spark is optimized for speed and computational efficiency by storing most of the data in memory and not on disk, it can underperform Hadoop MapReduce when the size of the data becomes so large that insufficient RAM becomes an issue.
  • Hadoop — Hadoop MapReduce allows parallel processing of huge amounts of data. It breaks a large chunk into smaller ones to be processed separately on different data nodes. In case the resulting dataset is larger than available RAM, Hadoop MapReduce may outperform Spark. It’s a good solution if the speed of processing is not critical and tasks can be left running overnight to generate results in the morning.

Hadoop vs Spark
How do Hadoop and Spark Stack Up?

4. Courses

https://mapr.com/company/press-releases/free-apache-spark-courses-and-discounted-certification-exam/

5. Books

Spark: The Definitive Guide is the best option to start.
(O’Reilly; PDF and companion code on GitHub)

Learning Spark: Lightning-Fast Big Data Analysis
(O’Reilly; PDF and companion code on GitHub)

Advanced Analytics with Spark: Patterns for Learning from Data at Scale
(O’Reilly)

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
(O’Reilly)

6. Influencers List

@SparkAISummit
@BigData_LDN
@caroljmcdonald
@marklit82
@holdenkarau
@schmarzo

7. Link

Practical apache spark 10 minutes

Apache Spark Architecture

Structured streaming

Churn prediction with Spark

Introduction to Spark Graphframe

Spark Tutorial

What is Hadoop?

Alright, Boyo?

If you’re not familiar with Big Data, I suggest you have a look at my post “What is Big Data?” first.
This post is a collection of links, videos, tutorials, blogs and books that I found, mixed with my opinion.

Table of contents

1. What is Hadoop?
2. Architecture
3. History
4. Courses
5. Books
6. Influencers List
7. Podcasts
8. Newsletters
9. Links

1. What is Hadoop?

Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that you can process it in parallel.
Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.

Going back to the 3 Vs definition of big data, Apache Hadoop came to solve exactly these problems.

The first problem is storing Big data.

HDFS provides a distributed way to store Big Data. Your data is stored in blocks across the DataNodes, and you can specify the size of the blocks. Basically, if you have 512 MB of data and you have configured HDFS with a block size of 128 MB, HDFS will divide the data into 4 blocks (512/128 = 4) and store them across different DataNodes; it will also replicate the data blocks on different DataNodes. Since we are using commodity hardware, storing the data is not a challenge.
It also solves the scaling problem. It focuses on horizontal scaling instead of vertical scaling: you can always add extra DataNodes to the HDFS cluster as and when required, instead of scaling up the resources of your existing DataNodes.
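
The block arithmetic from the paragraph above as a tiny back-of-the-envelope sketch; the numbers are the ones from the text, and the replication factor of 3 is the HDFS default:

    import math

    file_size_mb = 512
    block_size_mb = 128
    replication = 3          # HDFS default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)
    print(blocks)                 # 4 blocks (512 / 128)
    print(blocks * replication)   # 12 block replicas spread across the DataNodes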

The second problem was storing a variety of data.

With HDFS you can store all kinds of data, whether it is structured, semi-structured or unstructured, since in HDFS there is no schema validation before dumping the data. HDFS also follows a write-once, read-many model: you write the data once and you can read it multiple times to find insights.

The third challenge was accessing & processing the data faster.

This is one of the major challenges with Big Data. In order to solve it, we move the processing to the data and not the data to the processing. What does that mean? Instead of moving the data to the master node and then processing it, in MapReduce the processing logic is sent to the various slave nodes and the data is processed in parallel across them. The processed results are then sent to the master node, where the results are merged and the response is sent back to the client.
In YARN architecture, we have ResourceManager and NodeManager. ResourceManager might or might not be configured on the same machine as NameNode. But, NodeManagers should be configured on the same machine where DataNodes are present.

2. Architecture

We can divide Hadoop into some modules:

A. Hadoop Common: contains libraries and utilities needed by other Hadoop modules
B. Hadoop Distributed File System (HDFS): a distributed file-system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster
C. Hadoop MapReduce: a programming model for large scale data processing
D. Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications

A. Hadoop Common
Refers to the collection of common utilities and libraries that support other Hadoop modules. It is an essential part or module of the Apache Hadoop Framework, along with the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce.

B. How the Hadoop Distributed File System (HDFS) works
Hadoop has a file system that is much like the one on your desktop computer, but it allows us to distribute files across many machines. HDFS organizes information into a consistent set of file blocks and storage blocks for each node. In the Apache distribution, the file blocks are 64MB and the storage blocks are 512 KB. Most of the nodes are data nodes, and there are also copies of the data. Name nodes exist to keep track of where all the file blocks reside.

A Hadoop cluster nominally has a single NameNode, plus a cluster of DataNodes that form the HDFS cluster; not every node needs to run a DataNode. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to talk to the NameNode and DataNodes. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.

The HDFS file system includes a so-called secondary namenode, which misleads some people into thinking that when the primary namenode goes offline, the secondary namenode takes over. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode’s directory information, which the system then saves to local or remote directories.
File access can be achieved through the native Java API, through the Thrift API (which can generate a client in the language of the user’s choosing: Java, Python, Scala, …), through the command-line interface, or by browsing through the HDFS-UI web app over HTTP.

NameNode
Maintains and Manages DataNodes.
Records Metadata i.e. information about data blocks e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc.
Receives status and block report from all the DataNodes.

DataNode
A slave daemon. It sends heartbeat signals to the NameNode.
Stores the actual data in data blocks.
Serves read and write requests from the clients.

C. How MapReduce works

Map Reduce is a really powerful programming model that was built by some smart guys at Google. It helps to process really large sets of data on a cluster using a parallel distributed algorithm.

As the name suggests, there are two steps in the MapReduce process—map and reduce. Let’s say you start with a file containing all the blog entries about big data in the past 24 hours and want to count how many times the words Hadoop, Big Data, and Greenplum are mentioned. First, the file gets split up on HDFS. Then, all participating nodes go through the same map computation for their local dataset—they count the number of times these words show up. When the map step is complete, each node outputs a list of key-value pairs.
Once mapping is complete, the output is sent to other nodes as input for the reduce step. Before reduce runs, the key-value pairs are typically sorted and shuffled. The reduce phase then sums the lists into single entries for each word.
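
MapReduce jobs are normally written in Java, but as a rough sketch of the same two steps, here is a generic word-count mapper and reducer written as Hadoop Streaming scripts in Python. The file names mapper.py and reducer.py are my own; Hadoop Streaming feeds them lines on stdin and sorts the map output by key before the reduce step. Counting only the mentions of Hadoop, Big Data, and Greenplum would just mean filtering the words inside the mapper.

    # mapper.py - emits a (word, 1) pair for every word it reads on stdin
    import sys

    for line in sys.stdin:
        for word in line.lower().split():
            print(f"{word}\t1")

    # reducer.py - sums the counts per word (its input arrives sorted by key)
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")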

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.

With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network.

If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.

Some of the terminologies in the MapReduce process are:

MasterNode – Place where JobTracker runs and which accepts job requests from clients
SlaveNode – It is the place where the mapping and reducing programs are run
JobTracker – it is the entity that schedules the jobs and tracks the jobs assigned using Task Tracker
TaskTracker – It is the entity that actually tracks the tasks and provides the report status to the JobTracker
Job – A MapReduce job is the execution of the Mapper & Reducer program across a dataset
Task – the execution of the Mapper & Reducer program on a specific data section
TaskAttempt – A particular task execution attempt on a SlaveNode
Map Function – The map function takes an input and produces a set of intermediate key value pairs.
Reduce Function – The reduce function accepts an Intermediate key and a set of values for that key. It merges together these values to form a smaller set of values. The intermediate values are supplied to user’s reduce function via an iterator.
Thus MapReduce converts each job into a group of map and reduce tasks, and each map and reduce task can be performed by a different machine. The results can then be merged back to produce the required output.

D. How YARN works: Yet Another Resource Negotiator

MapReduce has undergone a complete overhaul in Hadoop 0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.
Apache Hadoop YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.

YARN enhances the power of a Hadoop compute cluster in the following ways:

  • Scalability: The processing power in data centers continues to grow quickly. Because YARN ResourceManager focuses exclusively on scheduling, it can manage those larger clusters much more easily.
  • Compatibility with MapReduce: Existing MapReduce applications and users can run on top of YARN without disruption to their existing processes.
  • Improved cluster utilization: The ResourceManager is a pure scheduler that optimizes cluster utilization according to criteria such as capacity guarantees, fairness, and SLAs. Also, unlike before, there are no named map and reduce slots, which helps to better utilize cluster resources.
  • Support for workloads other than MapReduce: Additional programming models such as graph processing and iterative modeling are now possible for data processing. These added models allow enterprises to realize near real-time processing and increased ROI on their Hadoop investments.
  • Agility: With MapReduce becoming a user-land library, it can evolve independently of the underlying resource manager layer and in a much more agile manner.

The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:

  • a global ResourceManager
  • a per-application ApplicationMaster
  • a per-node slave NodeManager
  • a per-application container running on a NodeManager

The ResourceManager and the NodeManager form the new, and generic, system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks. The ResourceManager has a scheduler, which is responsible for allocating resources to the various running applications, according to constraints such as queue capacities, user-limits, etc. The scheduler performs its scheduling function based on the resource requirements of the applications. The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager. Each ApplicationMaster has the responsibility of negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system perspective, the ApplicationMaster runs as a normal container.

TEZ

Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.

Apache Tez provides a developer API and framework to write native YARN applications that bridge the spectrum of interactive and batch workloads. It allows those data access applications to work with petabytes of data over thousands of nodes. The Apache Tez component library allows developers to create Hadoop applications that integrate natively with Apache Hadoop YARN and perform well within mixed workload clusters.

Since Tez is extensible and embeddable, it provides the fit-to-purpose freedom to express highly optimized data processing applications, giving them an advantage over end-user-facing engines such as MapReduce and Apache Spark. Tez also offers a customizable execution architecture that allows users to express complex computations as dataflow graphs, permitting dynamic performance optimizations based on real information about the data and the resources required to process it.

What is Tez?

MPP

MPP stands for Massively Parallel Processing; this is the approach in grid computing where all the separate nodes of your grid participate in coordinated computations. MPP DBMSs are the database management systems built on top of this approach. In these systems, each query you start is split into a set of coordinated processes executed by the nodes of your MPP grid in parallel, splitting the computations so that they run many times faster than in traditional SMP RDBMS systems. One additional advantage this architecture delivers is scalability, because you can easily scale the grid by adding new nodes. To be able to handle huge amounts of data, the data in these solutions is usually split between nodes (sharded) so that each node processes only its local data. This further speeds up the processing of the data, because using shared storage for this kind of design would be huge overkill: more complex, more expensive, less scalable, higher network utilization, less parallelism. This is why most MPP DBMS solutions are shared-nothing and work on DAS storage or on sets of storage shelves shared between small groups of servers.

When to use Hadoop?

Hadoop is used for: (This is not an exhaustive list!)

  • Log processing – Facebook, Yahoo
  • Data Warehouse – Facebook, AOL
  • Video and Image Analysis – New York Times, Eyealike

Till now, we have seen how Hadoop has made Big Data handling possible. But there are some scenarios where Hadoop implementation is not recommended.

When not to use Hadoop?

Following are some of those scenarios : (This is not an exhaustive list!)

  • Low-latency data access: quick access to small parts of the data.
  • Multiple data modifications: Hadoop is a better fit only if we are primarily concerned with reading data, not modifying it.
  • Lots of small files: Hadoop is suitable for scenarios where we have a few large files rather than many small ones.

After knowing the best suitable use cases, let us move on and look at a case study where Hadoop has done wonders.

Hate Hadoop? Then You’re Doing It Wrong

3. History

In 2003, Doug Cutting launched the Nutch project to handle billions of searches and index millions of web pages. In October 2003, Google released its paper on GFS (the Google File System). In December 2004, Google released its MapReduce paper. In 2005, Nutch used GFS and MapReduce to perform its operations.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug was working at Yahoo! at the time and is now Chief Architect of Cloudera.
The name Hadoop came from his son’s toy elephant. Cutting’s son was 2 years old at the time and just beginning to talk. He called his beloved stuffed yellow elephant “Hadoop” (with the stress on the first syllable). Now 12, Doug’s son often exclaims, “Why don’t you say my name, and why don’t I get royalties? I deserve to be famous for this!”
There are some similar stories about the name.

Later, in January 2008, Yahoo released Hadoop as an open source project to the Apache Software Foundation. In July 2008, Apache successfully tested a 4000-node cluster with Hadoop. In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours. Moving ahead, in December 2011 Apache Hadoop released version 1.0. Later, in August 2013, version 2.0.6 was available; in September 2016, version 3.0.0-alpha was available; and in December 2017, version 3.0.0 was released.

Is Hadoop dying?

Nowadays a lot of people are starting to say that Hadoop is dying and that Spark is the future. But what exactly does this mean?
Hadoop itself is not dying, but MapReduce, which is batch-oriented, is being replaced by Spark, because Spark can run in memory and is therefore faster. The other factor is the rise of the cloud: it is now possible to use cloud storage to replace HDFS, and it is entirely possible to use tools like Spark without Hadoop. On the other hand, Hadoop 3 supports integration with object storage systems and has already changed YARN to run with GPUs.

Hadoop Slowdown

Hadoop 3

With Hadoop 3.0 YARN will enable running all types of clusters and mix CPU and GPU intensive processes. For example, by integrating TensorFlow with YARN an end-user can seamlessly manage resources across multiple Deep Learning applications and other workloads, such as Spark, Hive or MapReduce.

The major changes are:

  • Minimum Required Java Version in Hadoop 3 is 8
  • Support for Erasure Coding in HDFS
  • YARN Timeline Service v.2
  • Shell Script Rewrite
  • Shaded Client Jars
  • Support for Opportunistic Containers
  • MapReduce Task-Level Native Optimization
  • Support for More than 2 NameNodes
  • Default Ports of Multiple Services have been Changed
  • Support for Filesystem Connector
  • Intra-DataNode Balancer
  • Reworked Daemon and Task Heap Management

Link1
Link2

4. Courses

https://hackernoon.com/top-5-hadoop-courses-for-big-data-professionals-best-of-lot-7998f593d138

https://dzone.com/articles/top-5-hadoop-courses-to-learn-online

https://www.datasciencecentral.com/profiles/blogs/hadoop-tutorials

5. Books

Hadoop: The Definitive Guide is the best option to start.
http://grut-computing.com/HadoopBook.pdf
https://github.com/tomwhite/hadoop-book

15+ Great Books for Hadoop

6. Influencers List

https://www.kdnuggets.com/2015/05/greycampus-150-most-influential-people-big-data-hadoop.html

7. Podcasts

https://player.fm/podcasts/Hadoop

8. Newsletters

https://blog.feedspot.com/hadoop_blogs/

9. Links

Hadoop Docs

Hadoop Architecture Overview

Hadoop complete Overview

What is MapReduce?

What is MapReduce?

MapReduce in easy way

MapReduce Concepts

Understanding Hadoop

What is Big Data?

What’s the story Rory?

Table of contents

  1. What exactly is Big Data?
  2. What can Big Data do?
  3. Why Has It Become So Popular?
  4. Why Should Businesses Care?
  5. Big data and analytics
  6. What Could That Data Be, Exactly?
  7. IT infrastructure to support big data
  8. Big data skills
  9. What Are the Most Commonly Held Misconceptions About Big Data?
  10. Big Data and Cloud Computing
  11. Book
  12. Influencers List
  13. Courses
  14. Links

1. What exactly is Big Data?

Big Data started appearing in many of my conversations with many of my tech friends. So when I met this “Mr. Know It All Consultant”, I asked him ‘What is Big Data?’. He went on to explain why Big Data is the next ‘in thing’ and why everyone should know about Big Data but never directly answered my question.

At first glance, the term seems rather vague, referring to something that is large and full of information. That description does indeed fit the bill, yet it provides no information on what Big Data really is.

Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools. Searching the Web for clues reveals an almost universal definition, shared by the majority of those promoting the ideology of Big Data, that can be condensed into something like this: Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set. In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it. The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.

I kept asking that question to various other folks and I did not get the same answer twice from any number of people. ‘Oh, it’s a lot of data’. ‘It’s variety of data’. ‘It’s how fast the data is piling up’. Really? I thought to myself but was afraid to ask more questions. As none of it made much sense to me, I decided to dig into it myself. Obviously, my first stop was Google.

When I typed ‘Big Data’ at that time, this showed up:
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it …” Dan Ariely

One popular interpretation of big data refers to extremely large data sets, but I particularly prefer the 3 Vs definition (volume, variety and velocity).
Doug Laney from Gartner was credited with the 3 Vs of Big Data. Gartner’s definition of Big Data is ‘high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.’ Gartner is referring to the size of the data (large volume), the speed with which the data is being generated (velocity), and the different types of data (variety).

Mike Gualtieri of Forrester said that the 3 Vs mentioned by Gartner are just measures of data, and insisted that Forrester’s definition is more actionable: ‘Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.’ Forrester seems to be saying that any data that is beyond the current reach (i.e. the frontier) of a firm to store (i.e. large volumes of data), process (i.e. needs innovative processing), and access (new ways of accessing that data) is Big Data. So the question is: what is the ‘frontier’? Who defines the frontier?
I kept searching for those answers. I looked at McKinsey’s definition: “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” Well, similar to all the above, but still not specific enough for me to decide when data becomes Big Data.
IBM added ‘Veracity’, referring to the quality of data. And then several people started to add more Vs to the Big Data definition.

I found two famous ones:
10 Vs of Big data

42 V’s of Big Data
I think that is too much and a little bit funny, especially ‘23 – Version Control’ and ‘40 – Voodoo’.

Wikipedia says ‘Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.’ So Wikipedia’s definition focuses on the ‘volume of data’ and the ‘complexity of processing that data’.

O’Reilly Media says “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” To some extent, Wikipedia’s and O’Reilly’s definitions are similar in that both refer to ‘processing capacity’ and ‘conventional database systems’, but O’Reilly adds a new twist by mentioning ‘too big’ and ‘moves too fast’.

I couldn’t find a clear answer to the question of what volume threshold makes data Big Data, but I found some small, not very famous articles saying that “some have defined big data as an amount that exceeds ten petabytes”, and there’s also the fact that you can use Big Data tools for smaller amounts of data as well.

Even though there is no single definition of Big Data that is universally accepted, there are some common concepts that almost all seem to converge on:
data that is of large volume; data that is not of a single type, i.e. a variety of structured, semi-structured and unstructured data; and data that requires newer ways to store, process, analyze, visualize, and integrate.
The truth is that data is being generated at a much faster rate than in the past, from all kinds of sources including social media, and we need a way to handle that.

What is Big data by Oracle

The popular use of big data can be traced to a single research paper published in 2004:
“MapReduce: Simplified Data Processing on Large Clusters”, by Jeffrey Dean and Sanjay Ghemawat.

2. What can Big Data do?

A list of some use cases:

  • Recommendation engines
  • Fraud detection
  • Predictive analytics
  • Customer segmentation
  • Customer churn prevention
  • Product development
  • Price optimization
  • Customer sentiment analysis
  • Real-time analytics

Big Data is not just limited to software or application development. Big Data development is used in many other sectors like:

  • Fintech
  • Robotics
  • Meteorology
  • Medicine
  • Environmental research
  • Informatics and cybersecurity

3. Why Has It Become So Popular?

Big Data’s recent popularity has been due in large part to new advances in technology and infrastructure that allow for the processing, storing and analysis of so much data. Computing power has increased considerably in the past five years while at the same time dropping in price – making it more accessible to small and midsize companies. In the same vein, the infrastructure and tools for large-scale data analysis have gotten more powerful, less expensive and easier to use.
As the technology has gotten more powerful and less expensive, numerous companies have emerged to take advantage of it by creating products and services that help businesses to take advantage of all Big Data has to offer.

4. Why Should Businesses Care?

Data has always been used by businesses to gain insights through analysis. The emergence of Big Data means that they can now do this on an even greater scale, taking into account more and more factors. By analyzing greater volumes from a more varied set of data, businesses can derive new insights with a greater degree of accuracy. This directly contributes to improved performance and decision making within an organization.
Big Data is fast becoming a crucial way for companies to outperform their peers. Good data analysis can highlight new growth opportunities, identify and even predict market trends, be used for competitor analysis, generate new leads and much more. Learning to use this data effectively will give businesses greater transparency into their operations, better predictions, faster sales and bigger profits.

5. Big data and analytics

What really delivers value from all the big data organizations are gathering is the analytics applied to the data. Without analytics, it’s just a bunch of data with limited business use.
By applying analytics to big data, companies can see benefits such as increased sales, improved customer service, greater efficiency, and an overall boost in competitiveness.
Data analytics involves examining data sets to gain insights or draw conclusions about what they contain, such as trends and predictions about future activity.
Analytics can refer to basic business intelligence applications or more advanced, predictive analytics such as those used by scientific organizations. Among the most advanced types of data analytics is data mining, where analysts evaluate large data sets to identify relationships, patterns, and trends.

6. What Could That Data Be, Exactly?

It could be all the point of sale data for Best Buy. That’s a huge data set: everything that goes through a cash register. For us, it’s all of the activity on a website, so a ton of people coming through, doing a bunch of different things. It’s not really cohesive and structured.
With point of sale, for example, you’re looking at what people are purchasing and what they’ve done historically. You’re looking at what they’ve clicked on in email newsletters, loyalty program data, and coupons that you’ve sent them in direct mail — have those been redeemed? All these things come together to form a data set around purchasing behavior. You can look at what “like” customers do in order to predict what similar customers will buy as well.

7. IT infrastructure to support big data

For the concept of big data to work, organizations need to have the infrastructure in place to gather and house the data, provide access to it, and secure the information while it’s in storage and in transit.
At a high level, these include storage systems and servers designed for big data, data management and integration software, business intelligence and data analytics software, and big data applications.

8. Big data skills

Big data and big data analytics endeavors require specific skills, whether they come from inside the organization or through outside experts.
Many of these skills are related to the key big data technology components, such as Hadoop, Spark, NoSQL databases, in-memory databases, and analytics software.
Others are specific to disciplines such as data science, data mining, statistical and quantitative analysis, data visualization, general-purpose programming, and data structure and algorithms. There is also a need for people with overall management skills to see big data projects through to completion.

Under the umbrella of Big Data, there are many technologies and concepts. This is not an exhaustive list!

Google File System – GFS – is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware.

Distributed File System – In computing, a distributed file system (DFS) or network file system is any file system that allows access to files from multiple hosts sharing via a computer network. The DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. This makes it possible for multiple users on multiple machines to share files and storage resources.

Hadoop – Hadoop is a massive system for distributed parallel processing of huge amounts of data, and it provides a distributed file system for that.
Hadoop is composed of the distributed file system (HDFS), MapReduce and YARN.

MapReduce – a programming model that makes combining the data from various hard drives a much easier task. There are two parts to the programming model – the map phase and the reduce phase—and it’s the interface between the two where the “combining” of data occurs. MapReduce enables anyone with access to the cluster to perform large-scale data analysis.

HDFS – Hadoop Distributed File System.

YARN – Yet Another Resource Negotiator – Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. One of Apache Hadoop’s core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

Hadoop Ecosystem – Hadoop Ecosystem is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining) inside it.

Spark – is an open-source, distributed, general-purpose cluster computing framework with a (mostly) in-memory data processing engine that can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs for several programming languages.
The 4 main parts are: Spark SQL, Spark Streaming, MLlib and GraphX.

Pig – Apache Pig is a platform for analyzing large data sets that consist of a high-level language for creating MapReduce programs. These programs can then be run in parallel on large-scale Hadoop clusters. Complex tasks are broken into small data flow sequences, which make them easier to write, maintain, and understand. Users are able to focus more on semantics rather than efficiency with Pig Latin, because tasks are encoded in a way that allows the system to automatically optimize the execution. By utilizing user-defined functions, users are also able to extend Pig Latin. These functions can be written in many popular programming languages such as Java, Python, JavaScript, Ruby, or Groovy and then called directly using Pig Latin.

Hive – Apache Hive is a data warehouse system for data summarization, analysis and querying of large data systems in open source Hadoop platform. It converts SQL-like queries into MapReduce jobs for easy execution and processing of extremely large volumes of data.

HBase – Apache HBase is a highly distributed, NoSQL database solution that scales to store large amounts of sparse data. In the scheme of Big Data, it fits into the storage category and is simply an alternative or additional data store option. It is a column-oriented, key-value store that has been modeled after Google’s BigTable.

Oozie – is a server-based workflow scheduling system to manage Hadoop jobs. Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow (start, end, and failure nodes) as well as a mechanism to control the workflow execution path (decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task.

Kafka – is an open-source stream-processing software platform. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue designed as a distributed transaction log,” making it highly valuable for enterprise infrastructures to process streaming data.

Flume – is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Solr – open source search engine.

Cloudera – is one of the most famous Hadoop distributions.

Hortonworks – is another famous Hadoop distribution.

MapR – is another famous Hadoop distribution.

ETL – is short for extract, transform and load.

ELT – is short for extract, load and transform.

Algorithm – A programming algorithm is a computer procedure that is a lot like a recipe (called a procedure) and tells your computer precisely what steps to take to solve a problem or reach a goal.

Analytics – Analytics often involves studying past historical data to research potential trends, to analyze the effects of certain decisions or events, or to evaluate the performance of a given tool or scenario. The goal of analytics is to improve the business by gaining knowledge which can be used to make improvements or changes.

Descriptive Analytics – is a preliminary stage of data processing that creates a summary of historical data to yield useful information and possibly prepare the data for further analysis.

Predictive Analytics – is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends.

Prescriptive Analytics – is the area of business analytics (BA) dedicated to finding the best course of action for a given situation. Prescriptive analytics is related to both descriptive and predictive analytics.

Batch processing – is a general term used for frequently used programs that are executed with minimum human interaction. Batch process jobs can run without any end-user interaction or can be scheduled to start up on their own as resources permit.

Dark Data – is data which is acquired through various computer network operations but not used in any manner to derive insights or for decision making. The ability of an organisation to collect data can exceed the throughput at which it can analyse the data.

Data lake – is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Data warehouse – is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.

Data mining – is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Data Scientist – DS – works in data science, a multidisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

Data analytics – DA – is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software.

Data Engineer – data engineers are the designers, builders and managers of the information or “big data” infrastructure. They develop the architecture that helps analyze and process data in the way the organization needs it, and they make sure those systems are performing smoothly.

Data modeling – This is a conceptual application of analytics in which multiple “what-if” scenarios can be applied via algorithms to multiple data sets. Ideally, the modeled information changes based on the information made available to the algorithms, which then provide insight into the effects of the change on the data sets. Data modeling works hand in hand with data visualization, in which uncovering information can help with a particular business endeavor.

AI – Artificial intelligence – is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction.

Machine learning – is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
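
A minimal sketch of "learning from experience" with scikit-learn; the tiny data set (hours studied versus exam score) is invented for illustration:

from sklearn.linear_model import LinearRegression

# Training examples: hours studied -> exam score (made-up numbers)
X = [[1], [2], [3], [4], [5]]
y = [52, 57, 61, 68, 73]

# The program is not given an explicit formula; it fits one from the examples
model = LinearRegression()
model.fit(X, y)

# Predict the score for 6 hours of study
print(model.predict([[6]]))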

NoSQL – is a data model that addresses several issues the relational model was not designed to address: large volumes of rapidly changing structured, semi-structured, and unstructured data.

Stream processing – is a computer programming paradigm, related to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing.

Structured data – is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis. A data structure is a kind of repository that organizes information for that purpose.

Unstructured Data – or unstructured information is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Java – is a high-level programming language developed by Sun Microsystems. The Java syntax is similar to C++, but Java is strictly an object-oriented programming language. For example, most Java programs contain classes, which are used to define objects, and methods, which belong to individual classes.

Scala – is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. Scala smoothly integrates the features of object-oriented and functional languages.

Python – is an interpreted, high-level, general-purpose programming language.

R – is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

Traditional business intelligence – BI – This consists of a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data. BI delivers actionable information, which helps enterprise users make better business decisions using fact-based support systems. BI works by using an in-depth analysis of detailed business data, provided by databases, application data, and other tangible data sources. In some circles, BI can provide historical, current, and predictive views of business operations.

Statistical applications – These look at data using algorithms based on statistical principles and normally concentrate on data sets related to polls, census, and other static data sets. Statistical applications ideally deliver sample observations that can be used to study population data sets for the purpose of estimation, testing, and predictive analysis. Empirical data, such as surveys and experimental reporting, are the primary sources for analyzable information.

9. What Are the Most Commonly Held Misconceptions About Big Data?

In my opinion, people think it’s this magical thing. They think, “We’ll just turn that on and now things will just work and we’ll know all this stuff.” But it’s just not that simple — it’s actually really complicated and you need the right equipment and people that understand how to analyze and work with big data.
Increasingly, simplified tools are coming out for non-technical users to create dashboards and get some of the information they’re looking for, but it is a really specialized skillset. It’s not something you can just turn on and have. There’s an investment in people, time, and hard costs to make this stuff work.

10. Big Data and Cloud Computing

https://www.whizlabs.com/blog/big-data-and-cloud-computing/
https://bigdata-madesimple.com/big-data-and-cloud-computing-challenges-and-opportunities

11. Book

https://solutionsreview.com/data-management/top-25-best-big-data-books-you-should-read/

12. Influencers List

https://www.whizlabs.com/blog/top-big-data-influencers/

13. Courses

https://www.kdnuggets.com/2016/08/simplilearn-5-big-data-courses.html

14. Links

What can Big Data do – https://www.bernardmarr.com/default.asp?contentID=1076
What is Big Data? A Complete Guide – https://learn.g2crowd.com/big-data
Hadoop Ecosystem table – https://hadoopecosystemtable.github.io/

Conclusion
Even though there is no single definition for Big Data that is universally accepted, there are some common concepts that almost all seem to converge on.
This post is a simple and brief overview of Big Data and the ecosystem around it.

Colaboratory

How’s the craic?

We all know that deep learning algorithms improve the accuracy of AI applications to a great extent. But this accuracy comes at the cost of heavy computational requirements, typically a GPU, for developing deep learning models. Many machine learning developers cannot afford a GPU, as they are very costly, and see this as a roadblock for learning and developing deep learning applications. To help AI and machine learning developers, Google has released a free cloud-based service, Google Colaboratory – a Jupyter notebook environment with free GPU processing and no strings attached. It is a ready-to-use service that requires no setup at all.

Any AI developer can use this free service to develop deep learning applications using popular libraries like TensorFlow, PyTorch, Keras, etc.

The Colaboratory is a new service that lets you edit and run IPython notebooks right from Google Drive for free! It’s similar to Databricks – give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.

Setting up Colab:

Go to Google Drive → New item → More → Colaboratory.

This opens up a Python Jupyter notebook in the browser.
By default, the notebook runs on Python 2.7 and a CPU. You can change the Python version to 3.6 and the processing hardware to GPU by changing the settings as shown below:

Go to Runtime → Change runtime type.

This opens a Notebook settings pop-up where we can change the runtime type to Python 3.6 and the hardware accelerator to GPU.
Your Python environment, now backed by the processing power of a GPU, is ready to use.
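
To double-check that the GPU runtime was actually picked up, you can run something like this in a cell (assuming TensorFlow, which Colab ships pre-installed):

import tensorflow as tf

# Prints something like '/device:GPU:0' when a GPU runtime is active,
# or an empty string if you are still on CPU
print(tf.test.gpu_device_name())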

Google has published some tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory. You can have a look and play around, but for fun, let’s check how to add Spark to this environment.

Add Apache Spark on Colab:

Under the hood, there’s a full Ubuntu container running on Colaboratory and you’re given root access. This container seems to be recreated once the notebook is idle for a while (maybe a few hours). In any case, this means we can just install Java and Spark and run a local Spark session. Do that by running:

# Run these in a notebook cell; the leading "!" tells Colab to execute them as shell commands.
# Note: this Spark download link may age; older releases remain available on archive.apache.org.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
!tar xf spark-2.2.1-bin-hadoop2.7.tgz
!pip install -q findspark

Now that Spark is installed, we have to tell Colaboratory where to find it:

import os

# Point Java and Spark environment variables at the locations installed above
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"

Finally (only three steps!), start Spark with:

import findspark
findspark.init()  # make the Spark installation visible to PySpark

from pyspark.sql import SparkSession
# Start a local Spark session using all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

And that’s all – Spark is now running in local mode on a free cloud instance. It’s not very powerful, but it’s a really easy way to get familiar with Spark without installing it locally.
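
As a quick sanity check, you can build a tiny DataFrame and query it; the data below is made up:

from pyspark.sql import SparkSession

# getOrCreate() returns the session started above
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.filter(df.age > 30).show()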

You can have a look at my post about Jupyter Notebook for more examples.

Important things to remember:

  • The supported browsers are Chrome and Firefox.
  • Currently, only Python is supported.
  • You can use up to 12 hours of processing time in one go.

Web Summit 2018

How’s it goin’, horse?

Takeaways

My first time at Web Summit, and my first impression: it is massive, and I say massive regarding the number of stages (24), speakers (1200), partners and startups (1800)… but especially regarding the number of people (70.000). We can say “it is crowded” to the point that it could frustrate visitors.

With nearly 70.000 attendees from over 160 countries, more than 1.200 speakers, over 1.800 promising startups, Fortune 500 companies, 24 tracks to follow and over 200 employees working really hard to prepare it, Web Summit grew to be the largest technology conference in the world, and has been called the best one on the planet (Forbes), Davos for geeks (Bloomberg) and Glastonbury for geeks (The Guardian). The 2018 edition in Lisbon is over and it’s time for takeaways.

But what is Web Summit?

Well, in a nutshell, it’s something that started as a “simple idea” in 2010, to connect people in the tech community, and evolved into the biggest technology conference in the world.
Or, as Web Summit co-founder Paddy Cosgrave puts it, the “world’s largest gathering of entrepreneurs.”

websummit.com

Planning is everything

It was my first time at the Web Summit and the most important lesson I took was that planning is key. I downloaded the official app and added some items to my schedule, but even then I couldn’t keep up with it all …

It’s not a very techie tech conference

The Web Summit might be the biggest technology conference in the world, but it isn’t the most techie one for sure. The idea of this conference was to bring the tech people and industries together — and this objective is being realized year after year with great success. The Summit saw an enormous growth in attendance — from 400 people in 2010 to 70.000 this year, gathering founders, CEOs of technology companies, policymakers, heads of state and startups.

Women are more and more present in tech
This year at the Web Summit 44% of attendees were female, versus 24% five years ago – this is undoubtedly an impressive and very positive change. The organizers’ attempt to make women visible as speakers and moderators was also clear, which I absolutely applaud. But does it mean almost half of the people working in tech are women? Sadly, still no. As mentioned above, the Web Summit is not the most techie conference in the world. There are attendees from fields such as web and mobile development, artificial intelligence and augmented reality, but also marketing, PR, user experience, health, tourism and project management. Some of these fields are doing much better in the parity game than tech is.

From Blockchain to AI and Shared Mobility

The range of topics was very broad: from blockchain and cryptocurrencies to artificial intelligence and machine learning, from virtual reality to autonomous vehicles and shared mobility. The conference hosted the inventor of the World Wide Web, Tim Berners-Lee, but also top managers from global tech companies, such as Apple, Google, and Netflix. Even famous politicians, such as EU Commissioner Margrethe Vestager and the United Nations Secretary-General António Guterres, were amongst the speakers. I was impressed by the many inspiring talks and great conversations at the Web Summit.

So many crazy ideas were presented at the conference. Astonishingly, everything looked like it could become reality. For instance, electric aircraft, such as the Volocopter and the Lilium aircraft, which simply bypass traffic jams. Another example is robots with artificial intelligence, such as “Furhat” from Furhat Robotics and “Sophia” from Hanson Robotics, which are becoming more and more human-like. Such AI-powered robots can express an increasing number of emotions and can even sense the emotions of another person. A bit scary.
Another point was the discussion about digital human rights, of which we should not lose sight. From our present point of view, we already have human rights and many countries accept them. However, with new technologies like cloud, artificial intelligence and autonomous systems, we not only gain lots of advantages; threats might arise as well. How can we ensure that all these technologies are used only for the good of all and that each person’s rights are respected in a digital ecosystem? Should everyone have the right to access the Internet to educate themselves and have the same opportunities? What impact does this topic have on our daily work?

Startups

Some of the world’s most influential companies joined Web Summit at the beginning of their startup journey. Over the years the event has welcomed OnePlus, Stripe, Nest, Uber, Careem, and GitLab when they were still early-stage startups, looking for funding, partnerships or figuring out their next move.

I created my own list of AI and machine learning startups that I had the opportunity to see there.

LabelBox

Labelbox is an enterprise-ready training data creation and management platform designed to rapidly build artificial intelligence applications.

Engineer.ai

Engineer.ai is a human-assisted AI engineering team that builds and operates technology projects, from new apps to managing cloud spend.

Ultimate.ai

Ultimate.ai is a platform that gives customer service agents the AI tools they need to provide faster, smarter responses.

DigitalGenius

DigitalGenius brings practical applications of deep learning and AI to customer service operations of leading companies.

Mobacar

Mobacar uses AI and machine learning to predict what mode of transport travelers want, instead of giving them an endless stream of transfer options from an airport.

Textio

Textio, a Seattle-based startup, invented augmented writing, which is writing supported by outcomes data in real time.

Unbabel

Unbabel is an artificial intelligence powered human translation platform, with a focus on the translation of customer service communications.

But the main topic, in my opinion, was Blockchain

I list some of the notable crypto-industry related developments during the Web Summit 2018.

Blockchain Partners with Stellar Development Foundation

The world’s leading cryptocurrency wallet and Bitcoin block explorer platform, Blockchain, announced its partnership with the Stellar Development Foundation – the organization that issues Stellar lumens – to give away $125 million worth of XLM through an airdrop to event attendees. In order to receive the giveaway, attendees were required to have a Blockchain wallet. The XLM-fueled Stellar blockchain platform has gained a strong reputation for its speed and low-cost transactions.

eToro Partners with BTC.com

Yet another digital currency platform, BTC.com, and the social trading platform eToro announced a partnership to drive cryptocurrency adoption at the event. The partnership was marked by BTC.com giving away pre-loaded cards with 0.0030 BCH (roughly $2) to attendees visiting their stall. Apart from the giveaway, BTC.com’s stall acted as an information center where people could get all their questions regarding cryptocurrencies answered. eToro also announced the launch of its eToroX wallet, which will soon get additional functionalities like support for more coins and fiat tokens, crypto-to-crypto conversion, fiat deposits, and in-store payment capabilities. The first 100,000 people who download the new eToroX wallet will receive 0.1 ETH (approx. $21) in their respective accounts.

Bitstamp’s Giveaway

Not to be left behind, Bitstamp also promoted cryptocurrencies by educating first-time traders on how to use them. The Luxembourg-based company launched ‘Cryptomyths’, a quiz designed to make the world of cryptocurrencies clearer for first-time traders, so that they can gain in-depth knowledge of how and what to trade. Summit attendees who took part in the quiz earned a $10 trading credit that will be activated when they make their first trade on the Bitstamp platform.

And the best … All of the talks are available online

Not everybody knows that all of the talks from the 24 tracks become available online after the conference. At the moment there are almost 300 videos from last week accessible on the Web Summit’s YouTube channel.

youtube