What is Big Data?

What’s the story Rory?

Table of contents

  1. What exactly is Big Data?
  2. What can Big Data do?
  3. Why Has It Become So Popular?
  4. Why Should Businesses Care?
  5. Big data and analytics
  6. What Could That Data Be, Exactly?
  7. IT infrastructure to support big data
  8. Big data skills
  9. What Are the Most Commonly Held Misconceptions About Big Data?
  10. Big Data and Cloud Computing
  11. Book
  12. Influencers List
  13. Courses
  14. Links

1. What exactly is Big Data?

Big Data started appearing in many of my conversations with my tech friends. So when I met this “Mr. Know It All” consultant, I asked him, ‘What is Big Data?’ He went on to explain why Big Data is the next ‘in thing’ and why everyone should know about it, but he never directly answered my question.

At first glance, the term seems rather vague, referring to something that is large and full of information. That description does indeed fit the bill, yet it provides no information on what Big Data really is.

Big Data is often described as extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools. Searching the Web for clues reveals an almost universal definition, shared by the majority of those promoting the ideology of Big Data, that can be condensed into something like this: Big Data defines a situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or the scale and growth of the data set. In other words, the data set has grown so large that it is difficult to manage and even harder to garner value out of it. The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data.

I kept asking that question of various other folks, and I never got the same answer twice. ‘Oh, it’s a lot of data’. ‘It’s a variety of data’. ‘It’s how fast the data is piling up’. Really? I thought to myself, but was afraid to ask more questions. As none of it made much sense to me, I decided to dig into it myself. Obviously, my first stop was Google.

When I typed ‘Big Data’ at that time, this showed up:
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it …” Dan Ariely

One popular interpretation of Big Data refers to extremely large data sets, but I particularly prefer the 3 Vs definition: volume, variety and velocity.
Doug Laney of Gartner is credited with the 3 Vs of Big Data. Gartner’s definition reads: ‘high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.’ Gartner is referring to the size of the data (volume), the speed with which the data is being generated (velocity), and the different types of data (variety).

Mike Gualtieri of Forrester said that the 3 Vs mentioned by Gartner are just measures of data, and insisted that Forrester’s definition is more actionable: ‘Big Data is the frontier of a firm’s ability to store, process, and access (SPA) all the data it needs to operate effectively, make decisions, reduce risks, and serve customers.’ Forrester seems to be saying that any data beyond the current reach (i.e. the frontier) of a firm to store (large volumes of data), process (innovative processing), and access (new ways of accessing that data) is Big Data. So the question is: what is the ‘frontier’? And who defines it?
I kept searching for those answers. I looked at McKinsey’s definition: “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” Similar to all the above, but still not specific enough for me to decide when data becomes Big Data.
IBM added ‘Veracity’, referring to the quality of data. And then several people started to add even more Vs to the Big Data definition.

I found two famous ones:
The 10 Vs of Big Data

The 42 Vs of Big Data
That one, I think, is too much and a little bit funny, especially “23 – Version Control” and “40 – Voodoo”.

Wikipedia says: ‘Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.’ So Wikipedia’s definition focuses on the ‘volume of data’ and the ‘complexity of processing that data’.

O’Reilly Media says: “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” To some extent, the Wikipedia and O’Reilly definitions are similar in that both refer to ‘processing capacity’ and ‘conventional database systems’, but O’Reilly adds a new twist by mentioning ‘too big’ and ‘moves too fast’.

I couldn’t find a clear answer to the question of what volume threshold makes data Big Data, but I did find some small, lesser-known articles saying that “some have defined big data as an amount that exceeds ten petabytes”. And then there’s the fact that you can apply Big Data techniques to small amounts of data as well.

Even though there is no single definition for Big Data that is universally accepted, there are some common concepts that almost all seem to converge on:
data of large volume; data that is not of a single type, i.e. a mix of structured, semi-structured and unstructured data; and data that requires newer ways to store, process, analyze, visualize, and integrate.
The truth is that data is being generated at a much faster rate than in the past, from all kinds of sources including social media, and we need a way to handle it.

What is Big Data, by Oracle

The popular use of big data can be traced to a single research paper published in 2004:
“MapReduce: Simplified Data Processing on Large Clusters”, by Jeffrey Dean and Sanjay Ghemawat.

2. What can Big Data do?

A list of some use cases:

  • Recommendation engines
  • Fraud detection
  • Predictive analytics
  • Customer segmentation
  • Customer churn prevention
  • Product development
  • Price optimization
  • Customer sentiment analysis
  • Real-time analytics

Big Data is not limited to software or application development. It is used in many other sectors, like:

  • Fintech
  • Robotics
  • Meteorology
  • Medicine
  • Environmental research
  • Informatics and cybersecurity

3. Why Has It Become So Popular?

Big Data’s recent popularity has been due in large part to new advances in technology and infrastructure that allow for the processing, storing and analysis of so much data. Computing power has increased considerably in the past five years while at the same time dropping in price – making it more accessible to small and midsize companies. In the same vein, the infrastructure and tools for large-scale data analysis have gotten more powerful, less expensive and easier to use.
As the technology has gotten more powerful and less expensive, numerous companies have emerged to take advantage of it by creating products and services that help businesses to take advantage of all Big Data has to offer.

4. Why Should Businesses Care?

Data has always been used by businesses to gain insights through analysis. The emergence of Big Data means that they can now do this on an even greater scale, taking into account more and more factors. By analyzing greater volumes from a more varied set of data, businesses can derive new insights with a greater degree of accuracy. This directly contributes to improved performance and decision making within an organization.
Big Data is fast becoming a crucial way for companies to outperform their peers. Good data analysis can highlight new growth opportunities, identify and even predict market trends, be used for competitor analysis, generate new leads and much more. Learning to use this data effectively will give businesses greater transparency into their operations, better predictions, faster sales and bigger profits.

5. Big data and analytics

What really delivers value from all the big data organizations are gathering is the analytics applied to the data. Without analytics, it’s just a bunch of data with limited business use.
By applying analytics to big data, companies can see benefits such as increased sales, improved customer service, greater efficiency, and an overall boost in competitiveness.
Data analytics involves examining data sets to gain insights or draw conclusions about what they contain, such as trends and predictions about future activity.
Analytics can refer to basic business intelligence applications or more advanced, predictive analytics such as those used by scientific organizations. Among the most advanced types of data analytics is data mining, where analysts evaluate large data sets to identify relationships, patterns, and trends.

6. What Could That Data Be, Exactly?

It could be all the point of sale data for Best Buy. That’s a huge data set — everything that goes through a cash register. For us, it’s all of the activity on a website: a ton of people coming through, doing a bunch of different things. It’s not exactly cohesive and structured.
With point of sale, for example, you’re looking at what people are purchasing and what they’ve done historically. You’re looking at what they’ve clicked on in email newsletters, loyalty program data, and coupons that you’ve sent them in direct mail — have those been redeemed? All these things come together to form a data set around purchasing behavior. You can look at what “like” customers do in order to predict what similar customers will buy as well.

7. IT infrastructure to support big data

For the concept of big data to work, organizations need to have the infrastructure in place to gather and house the data, provide access to it, and secure the information while it’s in storage and in transit.
At a high level, these include storage systems and servers designed for big data, data management and integration software, business intelligence and data analytics software, and big data applications.

8. Big data skills

Big data and big data analytics endeavors require specific skills, whether they come from inside the organization or through outside experts.
Many of these skills are related to the key big data technology components, such as Hadoop, Spark, NoSQL databases, in-memory databases, and analytics software.
Others are specific to disciplines such as data science, data mining, statistical and quantitative analysis, data visualization, general-purpose programming, and data structure and algorithms. There is also a need for people with overall management skills to see big data projects through to completion.

Under the umbrella of Big Data, there are many technologies and concepts. This is not an exhaustive list!

Google File System – GFS – is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware.

Distributed File System – In computing, a distributed file system (DFS) or network file system is any file system that allows access to files from multiple hosts sharing via a computer network. The DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. This makes it possible for multiple users on multiple machines to share files and storage resources.

Hadoop – Hadoop is a system for distributed, parallel processing of huge amounts of data, and it provides a distributed file system for that data.
Hadoop is composed of the distributed file system (HDFS), MapReduce and YARN.

MapReduce – a programming model that makes combining the data from various hard drives a much easier task. There are two parts to the programming model – the map phase and the reduce phase—and it’s the interface between the two where the “combining” of data occurs. MapReduce enables anyone with access to the cluster to perform large-scale data analysis.
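The model is easy to sketch in plain Python: a toy word count in which the map phase emits (word, 1) pairs and the reduce phase sums them per key. The function names and input documents below are just illustrative, and the simple grouping stands in for the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts per key; this grouping stands in for
    # the shuffle step that routes equal keys to the same reducer.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(pairs)
# word_counts == {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real MapReduce job the map calls run on many machines at once, and the framework guarantees that all pairs with the same key end up at the same reducer.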

HDFS – Hadoop Distributed File System

Yarn – Yet Another Resource Negotiator – Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. One of Apache Hadoop’s core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

Hadoop Ecosystem – Hadoop Ecosystem is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services (ingesting, storing, analyzing and maintaining) inside it.

Spark – is an open-source, distributed, general-purpose cluster computing framework with a (mostly) in-memory data processing engine that can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise high-level APIs for several programming languages, including Scala, Python, Java and R.
Its four main components are: Spark SQL, Spark Streaming, MLlib and GraphX.

Pig – Apache Pig is a platform for analyzing large data sets that consists of a high-level language, Pig Latin, for creating MapReduce programs. These programs can then be run in parallel on large-scale Hadoop clusters. Complex tasks are broken into small data flow sequences, which makes them easier to write, maintain, and understand. Users are able to focus more on semantics than on efficiency with Pig Latin, because tasks are encoded in a way that allows the system to automatically optimize the execution. Users can also extend Pig Latin with user-defined functions, which can be written in many popular programming languages such as Java, Python, JavaScript, Ruby, or Groovy and then called directly from Pig Latin.

Hive – Apache Hive is a data warehouse system for data summarization, analysis and querying of large data systems in open source Hadoop platform. It converts SQL-like queries into MapReduce jobs for easy execution and processing of extremely large volumes of data.

Hbase – Apache HBase is a highly distributed, NoSQL database solution that scales to store large amounts of sparse data. In the scheme of Big Data, it fits into the storage category and is simply an alternative or additional data store option. It is a column-oriented, key-value store that has been modeled after Google’s BigTable.

Oozie – is a server-based workflow scheduling system to manage Hadoop jobs. Workflows in Oozie are defined as a collection of control flow and action nodes in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow (start, end, and failure nodes) as well as a mechanism to control the workflow execution path (decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task.

Kafka – is an open-source stream-processing software platform. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue designed as a distributed transaction log,” making it highly valuable for enterprise infrastructures to process streaming data.
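The “distributed transaction log” idea can be illustrated with a toy in-memory sketch. This is not Kafka’s real API, just the core concept: topics are append-only logs, and each consumer keeps its own read offset into the log.

```python
from collections import defaultdict

class MiniLog:
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only message log
        self.offsets = defaultdict(int)   # (consumer, topic) -> next unread position

    def publish(self, topic, message):
        # Producers only ever append; existing messages are never modified.
        self.topics[topic].append(message)

    def consume(self, consumer, topic):
        # Return the messages this consumer has not seen yet and advance its offset.
        offset = self.offsets[(consumer, topic)]
        messages = self.topics[topic][offset:]
        self.offsets[(consumer, topic)] = len(self.topics[topic])
        return messages

log = MiniLog()
log.publish("clicks", {"user": "ana", "page": "/home"})
log.publish("clicks", {"user": "ben", "page": "/cart"})
first_batch = log.consume("analytics", "clicks")   # both messages
second_batch = log.consume("analytics", "clicks")  # nothing new yet
```

Because the log is retained and offsets belong to consumers, many independent consumers can read the same topic at their own pace, which is what makes the pattern so useful for streaming pipelines.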

Flume – is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Solr – an open-source enterprise search platform built on Apache Lucene.

Cloudera – one of the most popular Hadoop distributions.

Hortonworks – another popular Hadoop distribution.

MapR – another popular Hadoop distribution.

ETL – is short for extract, transform and load.

ELT – is short for extract, load and transform.
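Both patterns use the same three steps, only in a different order. Here is a minimal sketch of ETL in Python, with made-up sales data standing in for a real source system:

```python
import csv
import io

# Hypothetical raw extract: sales records arriving as CSV text.
raw = "product,price\nlaptop,999.90\nphone,499.50\n"

def extract(source):
    # Extract: read rows out of the source system (here, a CSV string).
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    # Transform: cast types and derive a new field.
    return [{"product": r["product"],
             "price": float(r["price"]),
             "price_with_vat": round(float(r["price"]) * 1.2, 2)}
            for r in rows]

def load(rows, warehouse):
    # Load: write the cleaned rows into the target store (here, a list).
    warehouse.extend(rows)

warehouse = []
load(transform(extract(raw)), warehouse)
```

In ELT the raw rows are loaded first and transformed afterwards, inside the warehouse itself, which is common when the target system is powerful enough to do the heavy lifting.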

Algorithm – A programming algorithm is a computer procedure, a lot like a recipe, that tells your computer precisely what steps to take to solve a problem or reach a goal.
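Binary search is a classic example of such a recipe: a short, precise sequence of steps for finding an item in a sorted list.

```python
def binary_search(items, target):
    # Repeatedly halve the sorted list until the target is found.
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid        # found: return its position
        elif items[mid] < target:
            low = mid + 1     # target must be in the upper half
        else:
            high = mid - 1    # target must be in the lower half
    return -1                 # not found

binary_search([1, 3, 5, 7, 9], 7)   # -> 3
```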

Analytics – Analytics often involves studying historical data to research potential trends, analyze the effects of certain decisions or events, or evaluate the performance of a given tool or scenario. The goal of analytics is to improve the business by gaining knowledge that can be used to make improvements or changes.

Descriptive Analytics – is a preliminary stage of data processing that creates a summary of historical data to yield useful information and possibly prepare the data for further analysis.

Predictive Analytics – is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends.

Prescriptive Analytics – is the area of business analytics (BA) dedicated to finding the best course of action for a given situation. Prescriptive analytics is related to both descriptive and predictive analytics.

Batch processing – is a general term used for frequently used programs that are executed with minimum human interaction. Batch process jobs can run without any end-user interaction or can be scheduled to start up on their own as resources permit.

Dark Data – is data which is acquired through various computer network operations but not used in any manner to derive insights or for decision making. The ability of an organisation to collect data can exceed the throughput at which it can analyse the data.

Data lake – is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Data warehouse – is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.

Data mining – is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Data Science – DS – is a multidisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured; it is closely related to data mining. A data scientist is a practitioner of this field.

Data analytics – DA – is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software.

Data Engineer – Data engineers are the designers, builders and managers of the information or “big data” infrastructure. They develop the architecture that helps analyze and process data in the way the organization needs it, and they make sure those systems are performing smoothly.

Data modeling – This is a conceptual application of analytics in which multiple “what-if” scenarios can be applied via algorithms to multiple data sets. Ideally, the modeled information changes based on the information made available to the algorithms, which then provide insight to the effects of the change on the data sets. Data modeling works hand in hand with data visualization, in which uncovering information can help with a particular business endeavor.

AI – Artificial intelligence – is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction.

Machine learning – is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

NoSQL – is a data model that addresses several issues the relational model is not designed to address: large volumes of rapidly changing structured, semi-structured, and unstructured data.

Stream processing – is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing.
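The flavor of this can be hinted at with Python generators, where each stage handles events one at a time as they flow past, never holding the whole stream in memory. The sensor values below are made up for the example.

```python
def sensor_readings():
    # A stand-in for an unbounded event source, modeled as a generator.
    for value in [21.0, 21.5, 35.2, 22.1]:
        yield value

def above_threshold(stream, limit):
    # One processing stage: filter events as they arrive.
    for value in stream:
        if value > limit:
            yield value

# Stages are chained like a pipeline; each reading is processed
# as it arrives rather than after the whole stream is collected.
alerts = list(above_threshold(sensor_readings(), 30.0))
# alerts == [35.2]
```

Real stream processors (Spark Streaming, Kafka Streams, Flink) apply the same stage-by-stage idea, but distributed across machines and with windowing and fault tolerance built in.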

Structured data – is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis. A data structure is a kind of repository that organizes information for that purpose.

Unstructured Data – or unstructured information is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Java – is a high-level programming language originally developed by Sun Microsystems. The Java syntax is similar to C++, but Java is strictly an object-oriented programming language. For example, most Java programs contain classes, which are used to define objects, and methods, which are assigned to individual classes.

Scala – is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. Scala smoothly integrates the features of object-oriented and functional languages.

Python – is an interpreted, high-level, general-purpose programming language.

R – is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

Traditional business intelligence – BI – This consists of a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data. BI delivers actionable information, which helps enterprise users make better business decisions using fact-based support systems. BI works by using an in-depth analysis of detailed business data, provided by databases, application data, and other tangible data sources. In some circles, BI can provide historical, current, and predictive views of business operations.

Statistical applications – These look at data using algorithms based on statistical principles and normally concentrate on data sets related to polls, census, and other static data sets. Statistical applications ideally deliver sample observations that can be used to study populated data sets for the purpose of estimating, testing, and predictive analysis. Empirical data, such as surveys and experimental reporting, are the primary sources for analyzable information.

9. What Are the Most Commonly Held Misconceptions About Big Data?

In my opinion, people think it’s this magical thing. They think, “We’ll just turn that on and now things will just work and we’ll know all this stuff.” But it’s just not that simple — it’s actually really complicated and you need the right equipment and people that understand how to analyze and work with big data.
Increasingly, simplified tools are coming out for non-technical users to create dashboards and get some of the information they’re looking for, but it is a really specialized skillset. It’s not something you can just turn on and have. There’s an investment in people, time, and hard costs to make this stuff work.

10. Big Data and Cloud Computing


11. Book


12. Influencers List


13. Courses


14. Links

What can Big Data do – https://www.bernardmarr.com/default.asp?contentID=1076
What is Big Data? A Complete Guide – https://learn.g2crowd.com/big-data
Hadoop Ecosystem table – https://hadoopecosystemtable.github.io/

This post is a simple and brief overview of Big Data and the ecosystem around it.


How’s the craic?

We all know that deep learning algorithms greatly improve the accuracy of AI applications. But this accuracy comes at a cost: developing deep learning models requires heavy computational hardware, such as GPUs. Many machine learning developers cannot afford a GPU, as they are very costly, and find this a roadblock for learning and developing deep learning applications. To help AI and machine learning developers, Google has released a free cloud-based service, Google Colaboratory: a Jupyter notebook environment with free GPU processing capabilities, with no strings attached. It is a ready-to-use service that requires no setup at all.

Any AI developer can use this free service to develop deep learning applications using popular AI libraries like TensorFlow, PyTorch, Keras, etc.

The Colaboratory is a new service that lets you edit and run IPython notebooks right from Google Drive for free! It’s similar to Databricks – give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.

Setting up colab:

Go to Google Drive → New → More → Colaboratory.

This opens up a Python Jupyter notebook in the browser.
By default, the notebook runs Python 2.7 on a CPU. You can change the Python version to 3.6 and the processing hardware to a GPU by changing the settings as shown below:

Go to Runtime → Change runtime type

This opens up a Notebook settings pop-up where we can change the Runtime type to Python 3.6 and the hardware accelerator to GPU.
Then your Python environment, backed by the processing power of a GPU, is ready to use.

Google has published some tutorials showing how to use Tensorflow and various other Google APIs and tools on Colaboratory. You can have a look and play around, but for fun, let’s check how to add Spark on this environment.

Add Apache Spark on colab:

Under the hood, there’s a full Ubuntu container running on Colaboratory and you’re given root access. This container seems to be recreated once the notebook is idle for a while (maybe a few hours). In any case, this means we can just install Java and Spark and run a local Spark session. Do that by running:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
!tar xf spark-2.2.1-bin-hadoop2.7.tgz
!pip install -q findspark

Now that Spark is installed, we have to tell Colaboratory where to find it:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"

Finally (only three steps!), start Spark with:

import findspark
findspark.init()  # adds the SPARK_HOME installation to sys.path so pyspark can be imported

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

And that’s all – Spark is now running in local mode on a free cloud instance. It’s not very powerful, but it’s a really easy way to get familiar with Spark without installing it locally.

You can have a look at my post about Jupyter Notebook for more examples.

Important things to remember:

  • The supported browsers are Chrome and Firefox.
  • Currently, only Python is supported.
  • You can use up to 12 hours of processing time in one go.

Web Summit 2018

How’s it goin’, horse?


My first time at Web Summit, and my first impression: it is massive. And I say massive regarding the number of stages (24), speakers (1.200), partners and startups (1.800), but especially regarding the number of people (70.000). We can say “it is crowded”, to the point that it could frustrate visitors.

With nearly 70.000 attendees from over 160 countries, more than 1.200 speakers, over 1.800 promising startups, Fortune 500 companies, 24 tracks to follow and over 200 employees working really hard to prepare it, Web Summit has grown to be the largest technology conference in the world, and has been called the best one on the planet (Forbes), Davos for geeks (Bloomberg) and Glastonbury for geeks (The Guardian). The 2018 edition in Lisbon is over and it’s time for takeaways.

But what is Web Summit?

Well, in a nutshell, it’s something that started as a “simple idea” in 2010—to connect people in the tech community—and evolved into the biggest technology conference in the world.
Or, as Web Summit co-founder Paddy Cosgrave puts it, the “world’s largest gathering of entrepreneurs.”


Planning is everything

It was my first time at the Web Summit, and the most important lesson I took away was that planning is key. I downloaded the official app and added some items to my schedule, but even so I couldn’t keep up with it …

It’s not a very techie tech conference

The Web Summit might be the biggest technology conference in the world, but it isn’t the most techie one for sure. The idea of this conference was to bring the tech people and industries together — and this objective is being realized year after year with great success. The Summit saw an enormous growth in attendance — from 400 people in 2010 to 70.000 this year, gathering founders, CEOs of technology companies, policymakers, heads of state and startups.

Women are more and more present in tech
This year at the Web Summit, 44% of attendees were female, versus 24% five years ago — this is undoubtedly an impressive and very positive change. The organizers’ effort to make women visible as speakers and moderators was also clear, which I absolutely applaud. But does it mean almost half of the people working in tech are women? Sadly, still no. As mentioned above, the Web Summit is not the most techie conference in the world. There are attendees from fields such as web and mobile development, artificial intelligence and augmented reality, but also marketing, PR, user experience, health, tourism and project management. Some of these fields are doing much better in the parity game than tech is.

From Blockchain to AI and Shared Mobility

The range of topics was very broad: from blockchain and cryptocurrencies to artificial intelligence and machine learning, from virtual reality to autonomous vehicles and shared mobility. The conference hosted the inventor of the World Wide Web, Tim Berners-Lee, but also top managers from global tech companies, such as Apple, Google, and Netflix. Even famous politicians, such as EU Commissioner Margrethe Vestager and United Nations Secretary-General António Guterres, were among the speakers. I was impressed by the many inspiring talks and great conversations at the Web Summit.

So many crazy ideas were presented at the conference. Astonishingly, everything looked like it could become reality. For instance, electric aircraft, such as the Volocopter and the Lilium aircraft, which simply bypass traffic jams. Another example is robots with artificial intelligence, such as “Furhat” from Furhat Robotics and “Sophia” from Hanson Robotics, which are becoming more and more human. Such AI-powered robots can express an increasing number of emotions and can even sense the emotions of another person. A bit scary.
Another point was the discussion about digital human rights, which we should not lose sight of. From our present point of view, we already have human rights, and many countries accept them. However, with new technologies like cloud, artificial intelligence and autonomous systems, we not only gain lots of advantages; threats may arise as well. How can we ensure that all these technologies are used only for the good of all, and that each person’s rights are respected in a digital ecosystem? Should everyone have the right to access the Internet, to educate themselves and have the same opportunities? And what impact does this topic have on our daily work?


Some of the world’s most influential companies have attended Web Summit at the beginning of their startup journey. Over the years the event has welcomed OnePlus, Stripe, Nest, Uber, Careem, and GitLab when they were still early-stage startups, looking for funding, partnerships, or figuring out their next move.

I created a list of the AI and machine learning startups that I had the opportunity to see there.


Labelbox is an enterprise ready training data creation and management platform designed to rapidly build artificial intelligence applications.


Engineer.ai is a human-assisted AI engineering team that builds and operates technology projects; from new apps to managing cloud spend.


Ultimate.ai is a platform that gives customer service agents the AI tools they need to provide faster, smarter responses.


DigitalGenius brings practical applications of deep learning and AI to customer service operations of leading companies.


Mobacar uses AI and machine learning to predict which mode of transport travelers want, instead of giving them an endless stream of transfer options from an airport.


Textio, a Seattle-based startup, invented augmented writing, which is writing supported by outcomes data in real time.


Unbabel is an artificial intelligence powered human translation platform, with a focus on the translation of customer service communications.

But the main topic, in my opinion, was Blockchain

Here I list some of the notable crypto-industry developments from Web Summit 2018.

Blockchain Partners with Stellar Development Foundation

Blockchain, the world’s leading cryptocurrency wallet and Bitcoin block explorer platform, announced its partnership with the Stellar Development Foundation, the organization that issues Stellar lumens, to give away $125 million worth of XLM through an airdrop to event attendees. In order to receive the giveaway, attendees were required to have a Blockchain wallet. The XLM-fueled Stellar blockchain platform has gained a huge reputation for its speed and low-cost transactions.

eToro Partners with BTC.com

Yet another digital currency platform, BTC.com, and the social trading platform eToro announced a partnership at the event to drive cryptocurrency adoption. The partnership was marked by BTC.com giving away cards pre-loaded with 0.0030 BCH (roughly $2) to attendees visiting their stall. Apart from the giveaway, BTC.com’s stall acted as an information center where people could get all their questions about cryptocurrencies answered. eToro also announced the launch of its eToroX wallet, which will soon get additional functionality such as support for more coins and fiat tokens, crypto-to-crypto conversion, fiat deposits, and in-store payment capabilities. The first 100,000 people to download the new eToroX wallet will receive 0.1 ETH (approx. $21) in their accounts.

Bitstamp’s Giveaway

Not to be left behind, Bitstamp also promoted cryptocurrencies by educating first-time traders on how to use them. The Luxembourg-based company launched ‘Cryptomyths’, a quiz designed to make the world of cryptocurrencies clearer for first-time traders, so that they can gain in-depth knowledge of how and what to trade. Summit attendees who took part in the quiz earned a $10 trading bonus that will be activated when they make their first trade on the Bitstamp platform.

And the best … All of the talks are available online

Not everybody knows that all of the talks from the 24 paths become available online after the conference. At the moment there are almost 300 videos from last week available on the Web Summit’s YouTube channel.


ODC Appreciation Day 2018 : Tweet analysis

How’s the craic?

Last Thursday was the Oracle Developer Community (ODC) Appreciation Day 2018: #ThanksODC.

I came across the #ThanksODC idea on Thursday afternoon and, of course, I joined in with a blog post as well. Here

And you can check the result of #ThanksODC here

Today I decided to do my tweet analysis here and see some nice insights.

But just to clarify.

  • This is my blog, so the views expressed here are my own and do not necessarily reflect the views of Oracle, and I’m not contesting the result given by oracle-base.
  • The idea here is just fun.
  • I think I’m the only Brazilian and the only person in Dublin to participate (two points that I couldn’t validate).

You can check my old post for an explanation of the idea and the maths behind it:

Six Nations

Let’s check the data.

I ran my script to get the tweets at 12/10/2018 14:00 Dublin time and got 51 tweets.

13 tweets on 12/10 and 38 on 11/10.
The first tweet on 11/10 was at 08:14:18 and the last one at 23:31:44.
The last tweet that I got was on 12/10 at 12:31:34.

I found 46 links to blog posts.
I found 8 tweets containing “ODC Appreciation Day 2018”.

Counting Terms

Top 10

[(‘Appreciation’, 31), (‘ODC’, 27), (‘Day’, 25), (‘the’, 19), (‘to’, 17), (‘for’, 14), (‘I’, 11), (‘a’, 11), (‘and’, 10), (‘in’, 9)]

without Stop-words

Top 10

[(‘Appreciation’, 32), (‘ODC’, 28), (‘Day’, 26), (‘Oracle’, 9), (‘2018’, 9), (‘Wrap’, 6), (‘posts’, 6), (‘ORDS’, 5), (‘may’, 5), (‘blog’, 5)]


All tweets are in English, but I found six blog posts in Spanish.

Tweets: 100% English
Blog posts: 40 English, 6 Spanish

Sentiment Analysis

Here the API gives a value between -1 and 1: -1 is negative, 0 is neutral, and 1 is positive.
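Turning the raw score into the labels used in the percentages below is just a threshold check. A minimal sketch, with invented example scores (the `label` helper is hypothetical, not part of the API):

```python
# Hypothetical helper: maps a [-1, 1] sentiment score to a label
def label(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

scores = [0.8, 0.0, -0.2]  # invented example scores
labels = [label(s) for s in scores]
print(labels)  # → ['positive', 'neutral', 'negative']
```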

The result is 88% positive, 11% neutral and 1% negative.


Top Retweets -> 12
@oraclebase – ODC Appreciation Day 2018 : It’s a Wrap (#ThanksODC) : https://t.co/RYOeFzyHAs https://t.co/cQsj0N5xVB


Top Likes -> 28
@HeliFromFinland – Oracle SQL Developer Data Modeler is my favourite tool #ThanksODC https://t.co/ZXr3GI2LXV


If we look at the most common hashtags, we need to consider that #ThanksODC appears in all tweets because I used it as a filter, so we can exclude it from the list. This leaves us with:

2 #ThanksOTN
2 #oow18
1 #oracle
1 #SOASuite
1 #ThanksOracleBase
1 #OrclDB
1 #ThanksTim
1 #oracleace
1 #ODCAppreciationDay
1 #orclepm
1 #orclbi
1 #Terraform
1 #ThanksTim
1 #plsql
1 #sq
1 #SeeYouAtOpenWorld

number of hashtags per tweet

1 tweet with 4 hashtags
4 tweets with 3 hashtags
3 tweets with 2 hashtags

number of @

8 @oracleace
7 @oraclebase
2 @OracleDevsLA
2 @connor_mc_d
1 @rickProdManager
1 @odtug
1 @OracleDevs
1 @joelkallman
1 @OraPubInc
1 @oravirt
1 @odevcommunity
1 @floo_bar
1 @oracle
1 @gwronald
1 @dhamijarohit
1 @wordpressdotcom
1 @FTisiot

number of @ per tweet

2 tweets with 4 @
3 tweets with 3 @
1 tweet with 2 @

Spelling mistakes

Here I removed all hashtags and found just one mistake: “mornging”.
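Stripping hashtags (and mentions) before spell-checking can be done with a small regular expression. A minimal sketch; the tweet text below is invented:

```python
import re

# Invented tweet; remove hashtags and @mentions before spell-checking
tweet = "Happy #ThanksODC mornging everyone @oraclebase"
words = re.sub(r"[@#]\w+", "", tweet).split()
print(words)  # → ['Happy', 'mornging', 'everyone']
```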


23 Twitter Web Client
06 TweetDeck
05 WordPress.com
03 Twitter Lite
03 Twitter for Android
03 Twitter for iPhone
02 Tweetbot for Mac
01 Hootsuite Inc.
01 Twibble.io
01 dlvr.it
01 Tweetbot for iOS
01 Twitter Ads Composer

Number of tweets per user

4 LucasBednarek
3 oraculix
2 debralilley
2 RobbieDatabee
2 Makker_nl
2 kibeha
2 oraclebase
2 ZedDBA
2 Igfasouza
2 amitzil
1 orana
1 daanbakboord
1 orclDBblogs
1 ITProjectsToday
1 EdelweissK
1 signal006
1 ritan2000
1 Nikitas_Xenakis
1 mathiasmag
1 simon_haslam
1 PeterRaganitsch
1 HeliFromFinland
1 gassenmj
1 FranckPachot
1 SvenWOracle
1 opal_EPM
1 reguchi_br
1 rodrigojorgedba
1 svilmune
1 swesley_perth
1 dw_pete
1 oraesque
1 RonEkins
1 connor_mc_d
1 Addidici
1 dan_ekb
1 rittmanmead
1 FTisiot

My preferred tweet:
I really should go to sleep 💤, but I’m too busy reading everyone’s #ThanksODC blog posts

ODC Appreciation Day 2018: Oracle Visual Builder Cloud and Oracle JET

How’s it going horse?

Today is #ThanksODC day and I decided to join in with a post about VBCS (Oracle Visual Builder Cloud Service).

This post is going to be divided into three parts:

  1. A VBCS step-by-step Dublin Bus app
  2. An Oracle JET step-by-step Dublin Bus app
  3. A comparison of the two projects

Let’s start.

Create a Mobile Application

  1. In the web browser, log in to Oracle Visual Builder Cloud Service.
  2. On the Visual Applications page, click the New button.
  3. In the Create Application dialog box, enter DublinBus in the Application Name field and Tutorial application in the Description field.
  4. The Application ID text field is automatically populated as you type based on the value you enter in Application Name.
  5. Click Finish.
  6. Click + Mobile Application (or click the + sign at the top of the tab).
  7. In the Application name field enter DublinBus and for the Navigation Style choose None.
  8. Go to Service Connections and create a new one; choose “Define by Endpoint”.

    In the URL field, add https://data.smartdublin.ie/cgi-bin/rtpi/realtimebusinformation
    and click Next.
    Go to Request > URL Parameters and add one query parameter:
    name: stopid, type: string, default value: 1071; mark it as required, then click Test and Send,
    and copy the result to the response body.

  9. Create a variable “stopid” of type string.
  10. Add an Input Text component, go to its Data tab, and bind it to stopid.
  11. Add a button and change the label to “Search”.
  12. Add a table and choose Add Data:
    choose Service Connections and select the GET endpoint of your REST API;

    click Next and choose route, duetime, and origin for the primary key;
    then Next and Finish.

  13. Create a variable bus and change its type to Array Data Provider.
  14. Select the table, go to its Data tab, and change it to the variable bus.
  15. Click on the Search button and create an event:

    Add “Call REST Endpoint” and choose the GET REST API;
    Add “Reset Variables” and choose bus;
    Add “Assign Variables” and assign the REST result to the bus variable.

I put everything on my GitHub, so you can get the code and play with it.
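For a quick sanity check of the service connection outside VBCS, the response can be parsed with plain Python. The payload below is an invented sample shaped like the real-time bus response, not actual API output:

```python
import json

# Invented payload shaped like the realtimebusinformation response;
# a real response carries many more fields per arrival.
sample = json.loads("""
{"results": [
  {"route": "46A", "duetime": "5", "origin": "Phoenix Park"},
  {"route": "145", "duetime": "12", "origin": "Heuston Station"}
]}
""")

# Pull out the three columns the table binds to
rows = [(r["route"], r["duetime"], r["origin"]) for r in sample["results"]]
for route, due, origin in rows:
    print(route, due, origin)
```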

Check Tweets Spelling

Alright Boyo?

Donald Trump has been forced to correct his tweet boasting about his writing ability after it was filled with spelling mistakes.

The US president posted on the social media website to defend his writing style and criticise the “Fake News” media for searching for mistakes in his tweets.

The tweet itself had a few errors: Instead of “pore over” Mr Trump wrote “pour over” and instead of “bestselling” he wrote “best selling”.

There are also question marks over how many books the former businessman has actually written.

The tweet received a number of mocking responses before it was deleted and reposted with the “pour over” error corrected.

With this in mind, and because I already have several projects using his tweets, I came up with the idea of analyzing all of Donald Trump's tweets and checking all the spelling mistakes.

I work in a data analytics company, so I decided to ask for suggestions about the idea.
After talking with colleagues, I decided to write this blog post.

A big thanks to my colleagues who helped me with this analysis:
Brian Sullivan and Aishwarya Mundalik.

I have been collecting all of his tweets on the fly for a while. You can check my GitHub for a Python script that gets all the tweets from a user account.

Just a small change in the code to save a CSV file with tweet_id and word columns:

  with open('%s_tweets.csv' % screen_name, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    for items in outtweets:
      index = items[0]
      # word_array holds the words of this tweet's text
      for word in word_array:
        writer.writerow((index, word))

And the R code

The code uses the hunspell R package to analyze the words.

First I checked Donald Trump’s tweets, and then I decided to compare them against some others.
I chose Leo Varadkar and Fintan O’Toole because I’m in Ireland, and I chose J.K. Rowling because she is a writer and, according to the news, she was one of the people making a lot of jokes about the case.

Output ----->

> mean(TT_Final$correct/TT_Final$n)   (Trump tweets)
[1] 0.9645933
> mean(LT_Final$correct/LT_Final$n)   (Leo Varadkar tweets)
[1] 0.9187365
> mean(RT_Final$correct/RT_Final$n)   (J.K. Rowling tweets)
[1] 0.9338411
> mean(FT_Final$correct/FT_Final$n)   (Fintan O'Toole tweets)
[1] 0.9212394

The result was impressive: Donald Trump has the best value, meaning he makes fewer mistakes than the others. The API just states whether the spelling of each word is correct or not.
Unfortunately, it checks only words, not grammar or syntax, and for some values like single characters the result is ‘true’ even when the word may not make sense.
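The ratio reported by `mean(...$correct/...$n)` is simply the mean of per-tweet correct/total fractions. A minimal sketch of the same calculation in Python, with made-up word flags standing in for the checker's output:

```python
# Each tweet as a list of (word, is_correct) pairs, as a spell checker returns
tweets = [
    [("the", True), ("mornging", False), ("sun", True)],
    [("covfefe", False), ("is", True)],
]

# Per-tweet fraction of correctly spelled words, then the overall mean
ratios = [sum(ok for _, ok in t) / len(t) for t in tweets]
mean_ratio = sum(ratios) / len(ratios)
print(round(mean_ratio, 4))  # → 0.5833
```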

Here we can see a sample result.

1017190186269184000   but   TRUE
1017190186269184000   it    TRUE
1017190186269184000   isn   FALSE
1017190186269184000   t    TRUE
1017190186269184000   nearly    TRUE
1017190186269184000   U    TRUE
1017190186269184000   S    TRUE

This is just a basic analysis and the results are completely dependent on the API.
It would be really nice to do a grammar or syntax analysis as well.

The website politwoops.eu follows some politicians on Twitter and shows a list of deleted tweets for each one. I managed to get the last 60 deleted tweets from Donald Trump, and the result is:

> mean(DD_Final$correct/DD_Final$n)   (Deleted Donald Trump tweets)
[1] 0.958324

This confirms that he actually deleted the tweet and posted it again, and that he does make some mistakes.

I analysed only the last 3,000 tweets for each account.

For fun, I had a look at my own tweets as well:

> mean(IG_Final$correct/IG_Final$n)
[1] 0.6915032

And here is my defence … hehehe. Apparently, IT words are not recognised as correct.

956884676173553664   GDG   FALSE
956884676173553664   Hackathon   FALSE
950817131503013888   Hacktoberfest   FALSE
947550776913735680   Flume   FALSE
947550776913735680   Spark   FALSE
943640519502106624   sudo   FALSE
943640519502106624   init   FALSE
943640519502106624   Brazil2018   FALSE
936178221153910784   O'Reilly's   FALSE
936178221153910784   Hadoop   FALSE

I put everything on my GitHub, so you can get the code and play with it. Just change the Twitter account and check for yourself.

Six Nations

How heya?


For a long time I have been collecting Twitter data, with the idea of selling it in the future or something similar, so I decided to do some basic data analysis.
I have been thinking about doing something like this for a while.
The idea of this post is not to be technical and show all the code details, but just to tell the story around the data.
Everything is based on the fact that you can easily find the code to do this on Google.
The goal here is to do some basic data analysis on a JSON file with sample tweets.


For those of you unfamiliar with Twitter, it’s a social network where people post short, 140-character, status messages called tweets. Because tweets are sent out continuously, Twitter is a great way to figure out how people feel about current events.

Creating a Twitter App

First, we need to access our Twitter account and create an app. The website to do this is https://apps.twitter.com

The Twitter Streaming API

In order to make it easy to work with real-time tweets, Twitter provides the Twitter Streaming API.
There are a variety of clients for the Twitter Streaming API across all major programming languages. For Python, there are quite a few, which you can find here. The most popular is tweepy, which allows you to connect to the streaming API and handle errors properly.

Use Case

I used the Tweepy API to download all the tweets containing the string #rbs6nations during the Saturday of the last championship day. Obviously, not all the tweets about the event contained the hashtag, and this API gets only a portion of all tweets that contain it, but it is a good baseline. The time frame for the download was from around 12:15 PM to 7:15 PM GMT, that is, from about 15 minutes before the first match to about 15 minutes after the last match was over. In the end, about 20,000 tweets were downloaded in JSON format, making about 78 MB of data. This was enough to cause some performance problems on my Raspberry Pi Zero. I think it is a good size to observe something possibly interesting.

Six Nations

As the name suggests, six teams are involved in the competition: England, Ireland, Wales, Scotland, France and Italy. Six Nations Championship.

Extracting information

There are a few fields that will be interesting to us:

  • The user’s location (status.user.location). This is the location the user who created the tweet wrote in their biography.
  • The screen name of the user (status.user.screen_name).
  • The text of the tweet (status.text).
  • The unique id that Twitter assigned to the tweet (status.id_str).
  • When the tweet was sent (status.created_at).
  • How many times the tweet has been retweeted (status.retweet_count).
  • The tweet’s coordinates (status.coordinates). The geographic coordinates from where the tweet was sent.
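Those fields can be pulled out of the raw JSON with nothing more than the standard library. The tweet below is invented and much smaller than a real Streaming API object:

```python
import json

# Invented single tweet; real Streaming API objects carry many more fields
raw = """{"id_str": "974", "text": "Ireland win! #rbs6nations",
          "created_at": "Sat Mar 17 17:00:00 +0000 2018",
          "retweet_count": 42, "coordinates": null,
          "user": {"screen_name": "fan1", "location": "Dublin"}}"""

status = json.loads(raw)
record = {
    "user_location": status["user"]["location"],
    "screen_name": status["user"]["screen_name"],
    "text": status["text"],
    "id": status["id_str"],
    "created_at": status["created_at"],
    "retweets": status["retweet_count"],
    "coordinates": status["coordinates"],  # None when the user hides location
}
print(record["screen_name"], record["retweets"])
```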

Counting Terms

The first exploratory analysis that we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the dataset.
We can easily split the tweets into a list of terms by splitting on spaces.
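A minimal version of that word count can be sketched with `collections.Counter`; the tweet texts here are invented stand-ins for the real dataset:

```python
from collections import Counter

# Invented tweet texts standing in for the real dataset
tweets = [
    "ireland win the grand slam",
    "ireland beat england on the day",
]

# Split each tweet on spaces and count every term
terms = Counter(word for tweet in tweets for word in tweet.split())
print(terms.most_common(2))
```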

This is the list of top 10 most frequent terms

[(‘ireland’, 3163), (‘england’, 2584), (‘on’, 2271), (‘…’, 2068), (‘day’, 1479), (‘france’, 1380), (‘win’, 1338), (‘to’, 1253), (‘Grand’, 1221), (‘and’, 1180)]


In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words: “to”, “and” and “on”. Stop-word removal is an important step that should be considered during the pre-processing stages. You can easily find a list of stop-words for almost every language in the world.
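Once you have a stop-word list, removal is a one-line filter. The list below is a tiny illustrative subset; real lists run to hundreds of entries:

```python
# Tiny illustrative stop-word list; real lists are much longer
stop_words = {"to", "and", "on", "the", "a"}

terms = ["ireland", "and", "england", "on", "the", "day"]
filtered = [t for t in terms if t not in stop_words]
print(filtered)  # → ['ireland', 'england', 'day']
```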

After removing the list of stop-words, we have a new result:

[(‘ireland’, 3163), (‘england’, 2584), (‘wales’, 2271), (‘day’, 1479), (‘Grand’, 1380), (‘win’, 1338), (‘rugby’, 1253), (‘points’, 1221), (‘title’, 1180), (‘#x2618;’, 1154)]

The #x2618; is a text-based emoji symbol that represents the Irish shamrock. ☘

If we look at the most common hashtags, we need to consider that #rbs6nations appears in all tweets because I used it as a filter, so we can exclude it from the list. This leaves us with:

[(‘#engvirl’, 1701), (‘#walvfran’, 927), (‘#rugby’, 880), (‘#itavsco’, 692), (‘#ireland’, 686), (‘#champions’, 554), (‘#ireland’, 508), (‘#rbs’, 500), (‘#6nation’, 446), (‘#england’, 406)]

We can observe that the most common hashtags, apart from #rugby, are related to the individual matches. In particular England v Ireland has received the highest number of mentions, probably being the last match of the day.

Detect languages

Here the idea is to try to detect which language is used in the tweet.

I divided the tweets by language and found some Irish, Italian, and French tweets.
Something interesting to notice is that a fair amount of tweets also contained terms in French.
Apparently, the API did not recognize anything in Welsh or Scottish Gaelic.


Sometimes we are interested in the terms that occur together. This is mainly because the context gives us better insight into the meaning of a term, supporting applications such as word disambiguation or semantic similarity.

I built a co-occurrence matrix such that com[x][y] contains the number of times the term x has been seen in the same tweet as the term y:
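A minimal sketch of such a matrix, using a nested `defaultdict` and invented tokenised tweets:

```python
from collections import defaultdict
from itertools import combinations

# Invented tokenised tweets standing in for the real data
tweets = [
    ["grand", "slam", "ireland"],
    ["ireland", "grand", "slam", "day"],
]

com = defaultdict(lambda: defaultdict(int))
for terms in tweets:
    # count each unordered pair of distinct terms once per tweet
    for x, y in combinations(sorted(set(terms)), 2):
        com[x][y] += 1

print(com["grand"]["slam"])  # → 2
```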

The results:

[((‘6’, ‘nations’), 845), ((‘champions’, ‘ireland’), 760), ((‘nations’, ‘rbs’), 742), ((‘day’, ‘ireland’), 731), ((‘grand’, ‘slam’), 674)]

We could also look for a specific term and extract its most frequent co-occurrences.

The outcome for “ireland”:

[(‘champions’, 756), (‘day’, 727), (‘nations’, 659), (‘england’, 654), (‘2018’, 638), (‘6’, 613), (‘rbs’, 585), (‘ireland’, 559), (‘#x2618;’, 526), (‘game’, 522), (‘win’, 377), (‘grand’, 377), (‘slam’, 377), (’26’, 360), (‘points’, 356), (‘wales’, 355), (‘ire’, 355), (‘title’, 346), (’15’, 301), (‘turn’, 295)]

The outcome for “rugby”:

[(‘day’, 476), (‘game’, 160), (‘ireland’, 143), (‘england’, 132), (‘great’, 105), (‘today’, 104), (‘best’, 97), (‘well’, 90), (‘ever’, 89), (‘incredible’, 87), (‘amazing’, 84), (‘done’, 82), (‘amp’, 71), (‘games’, 66), (‘points’, 64), (‘monumental’, 58), (‘strap’, 56), (‘world’, 55), (‘team’, 55), (‘wales’, 53)]

Sentiment Analysis

In order to add another layer to the analysis, we can perform sentiment analysis on the tweets.
Here the API gives a value between -1 and 1:
-1 is negative, 0 is neutral, and 1 is positive.

The result is 68% positive, 12% neutral and 20% negative.


Using status.retweet_count, I was able to create a list of the top 10 retweets and get the first one.
I decided not to include it here because it was an advertisement for an alcoholic beverage. I’ll let you guess.

And with status.coordinates I got the location of the first one:
Some tweets don’t contain coordinates, because users can choose not to share their location, so I took the most retweeted tweet that had a location and, with Google Maps, discovered that the location was the London stadium. This tweet came from an Android mobile.

Extract key phrases

Extract the keywords or phrases that convey the gist of the meaning of the tweet.

The result was:
‘Rugby’, ‘Championship’ and ‘Nations’.


Here you can find a Jupyter notebook with almost everything you need to do something similar.
You can have a look at my post on how to start your Jupyter environment.

Under the Hood

  • All the code I used for this post I found on Google; I basically just made small changes to fit my scenario.
  • With just a little bit of patience, I was able to build some insights about the Six Nations.
  • I had the idea to do this blog post on the weekend of Storm Emma in Ireland, but I had power and internet problems.
  • I did everything using a Raspberry Pi Zero, and I had performance problems because of the size of the JSON file.
  • I used Microsoft Cognitive Services API examples for the sentiment analysis, language detection, and key phrase extraction.
  • There are several similar blog posts like this with the same and/or different scenarios.


Here I showed a simple example of text mining on Twitter, using some realistic data taken during a sports event. Using just Google examples, we downloaded some data with the Streaming API, pre-processed the data in JSON format, and extracted some interesting terms and hashtags from the tweets.

I also introduced the concepts of counting terms, language detection, co-occurrence, stop-words, and sentiment analysis, and created some interesting insights.

Now I challenge you to do something similar with your own scenario.
I might accept my own challenge and do something like this again; I really enjoyed it.
I might add all the code to GitHub and create a Jupyter notebook as well.
I might also make the JSON file available, but because of its size I don’t know of a free hosting option!

Jupyter Notebook

How’s it going there?

Jupyter Notebook is a popular application that enables you to edit, run and share Python code in a web view. It allows you to modify and re-execute parts of your code in a very flexible way. That’s why Jupyter is a great tool to test and prototype programs.

Apache Spark is a fast and powerful framework that provides an API to perform massively distributed processing over resilient sets of data.

Get Started with Spark and Jupyter together.

Install Spark

Visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.
Unzip it and move it to your /opt folder:

$ tar -xzf spark-1.2.0-bin-hadoop2.4.tgz
$ mv spark-1.2.0-bin-hadoop2.4 /opt/spark-1.2.0

Create a symbolic link:

$ ln -s /opt/spark-1.2.0 /opt/spark

This way, you will be able to download and use multiple Spark versions.

Finally, tell your bash (or zsh, etc.) where to find Spark. To do so, configure your $PATH variable by adding the following lines to your ~/.bashrc (or ~/.zshrc) file:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

Install Jupyter

$ pip install jupyter

You can run a regular jupyter notebook by typing:

$ jupyter notebook

There are two ways to get PySpark available in a Jupyter Notebook:

1 – Configure the PySpark driver to use Jupyter Notebook: running pyspark will automatically open a Jupyter Notebook.
2 – Load a regular Jupyter Notebook and load PySpark using the findspark package.

Option 1:

Update the PySpark driver environment variables by adding these lines to your ~/.bashrc (or ~/.zshrc) file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Restart your terminal and launch PySpark again:

$ pyspark

Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.

Option 2:

Use the findspark package to make a Spark context available in your code.

The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.
To install findspark:

$ pip install findspark

Launch a regular Jupyter Notebook:

$ jupyter notebook

In your Python code you need to add:

import findspark
findspark.init()

Now you can try it out and see. I hope this guide helps you get started easily with Jupyter and Spark.

Here is a Python code example to test:

import findspark
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

Apache Toree is a kernel for the Jupyter Notebook platform providing interactive access to Apache Spark.

Install Toree.

$ sudo pip install toree


Set SPARK_HOME to point to the directory where you downloaded and expanded the Spark binaries.

$ SPARK_HOME=$HOME/Downloads/spark-x.x.x-bin-hadoopx.x

$ jupyter toree install --spark_home=$SPARK_HOME

Start notebook.

$ jupyter notebook


Point browser to http://localhost:8888.
Then open a new notebook using New > Toree.

Test notebook with simple Spark Scala code.

sc.parallelize(1 to 100).
  filter(x => x % 2 == 0).
  map(x => x * x).

Here you can use tab for auto-complete.

To run Jupyter with R
Install IRkernel

$ conda install -c r ipython-notebook r-irkernel

You can now open R and install some necessary packages used by the R kernel in Jupyter:

install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = 'http://irkernel.github.io/', type = 'source')

After the packages are successfully downloaded and installed, type this in the R console and quit:

IRkernel::installspec()


Start the notebook and check new -> R

You can install Jupyter on a Raspberry Pi:

$ sudo apt-get install python3-matplotlib
$ sudo apt-get install python3-scipy
$ sudo pip3 install --upgrade pip
$ sudo reboot
$ sudo pip3 install jupyter

To start

$ jupyter-notebook

Simple Python example:

import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)

Simple R example:


library(SparkR)
sparkR.session(master = "local[*]")

# do something to prove it works
df <- as.DataFrame(iris)
head(filter(df, df$Petal_Width > 0.2))

Simple Scala example:

val rdd = sc.parallelize(0 to 999)
rdd.takeSample(false, 5)

Use the pre-configured SparkContext in variable sc.


Apache Toree

R on Jupyter

Raspberry Pi with InfluxDB and Grafana

Alright, boss?

Grafana is an open source metric analytics & visualization suite. It is most commonly used for visualizing time series data for infrastructure and application analytics but many use it in other domains including industrial sensors, home automation, weather, and process control.


InfluxDB is an open-source time series database developed by InfluxData.
It is written in Go and optimized for fast, high-availability storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data, and real-time analytics.


Install InfluxDB on Raspberry Pi

You can follow the link for more details, but basically I just ran this:

sudo apt-get update && sudo apt install apt-transport-https curl

curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -

echo "deb https://repos.influxdata.com/debian jessie stable" | sudo tee /etc/apt/sources.list.d/influxdb.list

sudo apt-get update && sudo apt-get install influxdb

For this example we don’t care about the web-based Admin user interface.

To start the server

sudo service influxdb start

Install Grafana on Raspberry Pi

You can follow the link for more details, but basically I just ran this:

sudo apt-get install apt-transport-https curl

curl https://bintray.com/user/downloadSubjectPublicKey?username=bintray | sudo apt-key add -

echo "deb https://dl.bintray.com/fg2it/deb jessie main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

sudo apt-get update

sudo apt-get install grafana

To start the server

sudo service grafana-server start

With everything installed you are ready to start doing some awesome things.

I’m going to show two examples here:
one using the Sense HAT to get temperature, pressure, and humidity values,
and the other using an HS110 smart plug to get energy consumption.

Here is a good InfluxDB getting started guide. You can test your installation with the CLI:

influx -precision rfc3339

To run the code you need to install the Python package first:

sudo pip install influxdb

Both examples are quite similar; just change what is written to the table.
This code is a hack of some Google examples.

import argparse
import time
import datetime
import sys
from influxdb import InfluxDBClient
from sense_hat import SenseHat

# Set required InfluxDB parameters.
host = "localhost"  # could also be a local IP address
port = 8086
user = "root"
password = "root"
# How frequently we will write sensor data from the SenseHat to the database.
sampling_period = 5

def get_args():
    '''This function parses and returns arguments passed in'''
    # Assign description to the help doc
    parser = argparse.ArgumentParser(description='Program writes measurement data from the SenseHat to the specified InfluxDB database.')
    # Add arguments
    parser.add_argument('-db', '--database', type=str, help='Database name', required=True)
    parser.add_argument('-sn', '--session', type=str, help='Session', required=True)
    now = datetime.datetime.now()
    parser.add_argument('-rn', '--run', type=str, help='Run number', required=False, default=now.strftime("%Y%m%d%H%M"))
    # Array of all arguments passed to script
    args = parser.parse_args()
    # Assign args to variables
    dbname = args.database
    session = args.session
    runNo = args.run
    return dbname, session, runNo

def get_data_points():
    # Get the three measurement values from the SenseHat sensors
    temperature = sense.get_temperature()
    pressure = sense.get_pressure()
    humidity = sense.get_humidity()
    # Get a local timestamp
    timestamp = datetime.datetime.utcnow().isoformat()
    print("{0} {1} Temperature: {2}C Pressure: {3}mb Humidity: {4}%".format(session, runNo, temperature, pressure, humidity))
    # Create InfluxDB datapoints (using line protocol as of InfluxDB >1.1)
    datapoints = [{
        "measurement": session,
        "tags": {"runNum": runNo},
        "time": timestamp,
        "fields": {
            "temperature": temperature,
            "pressure": pressure,
            "humidity": humidity
        }
    }]
    return datapoints

# Match return values from get_args()
# and assign to their respective variables
dbname, session, runNo = get_args()
print("Session: ", session)
print("Run No: ", runNo)
print("DB name: ", dbname)

# Initialize the SenseHat and the InfluxDB client
sense = SenseHat()
client = InfluxDBClient(host, port, user, password, dbname)

try:
    while True:
        # Write datapoints to InfluxDB
        datapoints = get_data_points()
        bResult = client.write_points(datapoints)
        print("Write points {0} Bresult:{1}".format(datapoints, bResult))
        # Wait for next sample
        time.sleep(sampling_period)
        # Run until keyboard ctrl-c
except KeyboardInterrupt:
    print("Program stopped by keyboard interrupt [CTRL_C] by user.")

For the HS110 example I just changed a few lines:

consumption = plug.get_emeter_realtime()["power"]

and, inside get_data_points():

    # Create InfluxDB datapoints (using line protocol as of InfluxDB >1.1)
    datapoints = [{
        "measurement": session,
        "tags": {"runNum": runNo},
        "time": timestamp,
        "fields": {
            "consumption": consumption
        }
    }]
    return datapoints

To run:

python igor.py -db=logger -sn=test1

To set up Grafana:

Go to Datasource -> Add New and fill in your database details.
The user and password should be “root” and “root” by default.

Create a new dashboard, edit it, and go to the Metrics tab.

Choose the database and then configure the query.

If you succeed in creating your dashboard, leave a comment below about what you are doing.

Smart Plug TP-Link

How’s the form?

A smart device is an electronic device, generally connected to other devices or networks via different wireless protocols such as Bluetooth, NFC, Wi-Fi, 3G, etc., that can operate to some extent interactively and autonomously.

A smart plug gives you several degrees of added control over just about any electrical appliance in your home. For one thing, it gives you remote access to switch a device on and off. Some you can also program to do so on a set schedule. Many models go a step further to give you in-depth insights about the way you use specific devices and the power they consume.

While many people buy smart plugs for their own convenience, there are plenty of ways they can make the world a better place, as well. The main one is by limiting energy waste. Sure, some models simply allow you to turn a device on or off without physically unplugging it, but others allow you to closely monitor your power usage and make positive changes toward conservation.

I have two models from TP-Link:



I found these APIs for Python and Node.js on Google:



from pyHS100 import Discover

for dev in Discover.discover().values():
    print(dev)


from pyHS100 import SmartPlug, SmartBulb
from pprint import pformat as pf

plug = SmartPlug("192.168.XXX.XXX")
print("Current consumption: %s" % plug.get_emeter_realtime())



const { Client } = require('tplink-smarthome-api');

const client = new Client();
client.getDevice({host: '192.168.XXX.XXX'}).then((device) => {
  device.getSysInfo().then(console.log);
  device.setPowerState(true);
});

// Look for devices, log to console, and turn them on
client.startDiscovery().on('device-new', (device) => {
  device.setPowerState(true);
});


I managed to turn my smart plug on and off from a tweet.

Now I can turn something in my house on or off with a simple tweet!
