Last December, I was back in my beautiful Sri Lanka to spend Christmas and New Year with my family and friends. During this time, I was invited to give a guest lecture on Data Science at my alma mater, the University of Peradeniya.
The University of Peradeniya is one of the prettiest universities in South Asia, equipped with world-class faculty and resources that create the perfect environment for academic curiosity.
Having been lucky enough to work at the bleeding edge of the fast-changing data landscape for a few years, I felt greatly passionate about sharing my views and opinions on a data-driven future with the academic community of Sri Lanka.
I decided this would be a great opportunity to share the knowledge I have acquired working in London and to inspire academics' curiosity about data science.
Because the audience for this lecture was mainly Economics, Business Studies, and Statistics students, I titled the talk “Data Revolution, Big Data, Data Science and the future”, focusing on the value and competitive edge data science provides to a business. I structured the talk to take the audience on a journey through the following questions:
- What has changed in businesses?
- Why have businesses changed this way?
- What methods have let us adapt to this change?
- What is big data?
- What is data science?
- How do we employ these methods for organizational growth?
In the following sections, I will explore my talk along the lines of the questions above. Due to the quantity of content, I am splitting this article into two blog posts.
- Post 1: Motivation for Data Science and the emergence of Big Data
- Post 2: What is Data Science, and how to use it in your organization
Change of dynamics in how businesses run
With the advancement of technology and society, we as humankind have moved forward, adapting our way through the industrial revolution and the digital revolution. During the industrial revolution, humans enhanced ways to produce goods efficiently. The process was very “product” focused.
Then came a series of advances in computer science that led to more software improving our processes. Analog machines became digital, and the ways to use digitization to improve quality of life multiplied. The Internet came into being! Digital systems enabled services such as reservations, shopping, and numerous others, letting humans innovate new ways of creating value without having to produce physical goods.
Over the past decade, business culture has gone through a drastic paradigm shift. With digitization, we have learned that we can create value by understanding recurring patterns in data. We have learned that data is a good representation of the underlying process, and we have invested time and effort in innovating ways to create value from this understanding.
The digital data footprint has grown drastically in the last few decades. As figure 1 shows, the amount of digital data that has the “potential” to be processed by computers has exploded.
This presents a huge opportunity: we can use this data to sharpen our understanding of the world around us, and use that understanding to create value.
Drivers of change
Now let us understand the drivers of this change. Over the last few decades, the cost of computation has dropped dramatically.
Figure 2, referenced from O’Reilly Radar, clearly shows how the cost of CPU, storage, and network bandwidth has dropped over the last few decades. The top-right plot shows how the Internet has grown from 1 node to over 1 billion nodes.
One of the main factors in deciding whether to invest in technology projects is financial feasibility. These plots give solid evidence that the financial viability of data processing has improved over the years. It has now reached a stage where businesses are actively investing in this vertical.
In addition, recent technological advancements have also helped by enabling:
- Better data sensors (logical and physical)
- Easier ways to set up operational systems
The Internet Happened
Looking at figure 2, you can clearly see how the Internet has grown into a huge network over the last few decades. It has become such a vital part of our lives that our reliance on it has even produced several crises in recent years (Y2K, the dot-com bubble).
At the inception of the Internet, the Advanced Research Projects Agency Network (ARPANET) was built to share research between sites across the USA, growing to around 15 sites by the early 1970s. It later became the first network to adopt the TCP/IP communication protocols that we rely on so heavily in the present day.
A few decades later, we rely on the Internet to disseminate this very blog post. All the tech giants in the world build their businesses around the Internet. Trends such as mobile computing have positioned all sorts of hardware and software providers to create value in different layers of the Internet platform. Many people design very smart sensors (hardware and software) to capture valuable data around mobile computing and social networking. In 2016, 7 billion people (95% of the world population) lived in areas with mobile network coverage; 47.1% of individuals worldwide, and 40.1% of people in the developing world, had Internet access.
Infrastructure as a Service (IaaS) / Cloud services
As the scale of users increases, businesses need to be able to serve users at that scale. In the early days of the Internet, the cost of entry to the Internet market was extraordinary. I remember the days when one had to
- Buy a good uplink Internet line from an ISP
- Reserve a Static IP and a domain name through the ISP
- Buy an expensive server machine to host the site
- Purchase all the software to run a web server
- and … Maintain the server so that it won’t fail!
JUST to host a personal website…
Infrastructure-as-a-Service (IaaS) completely changed the game by creating a platform to set up virtual servers on demand through a software interface. Amazon was one of the first companies to provide these services at large scale. The idea behind IaaS is that a large corporation invests billions of dollars to build multiple data centers, then builds a virtualization layer on top of the enterprise hardware to create virtual infrastructure abstracted from the hardware running it. The user issues commands through an interface that creates virtual systems running on the data center hardware underneath, without that complexity ever being exposed to the user.
The main advantage of IaaS/cloud for the user is that there is very little risk, as there is no initial investment in purchasing hardware. Most of these systems are built for elasticity, so you pay for what you consume. The user does not have to allocate resources to maintaining servers or data center space, and can stop worrying about concerns such as disaster recovery, replication, and defending the systems against cyber attacks. The IaaS provider takes care of these things at data center scale, with thousands of experts working on solving these problems. This is a creative way of enabling users to take advantage of economies of scale.
There are disadvantages, of course. Your systems can only use the features the IaaS provider offers, so the provider's capabilities may limit yours and force your business into technological compromises. Another main downside is vendor lock-in: the tendency of a business to bind itself to provider-specific features, which makes it heavily dependent on that IaaS provider.
The emergence of Big Data
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy.
With the technological and social trends mentioned in the sections above, we have reached a stage where collecting data is both cheap and useful.
The growth of the digital footprint shown in figure 1 and the falling cost of storage shown in figure 2 make a convincing case for storing as much data as we can. As figure 3 on the left shows, there are more than 280 exabytes (280 million terabytes) of digital data in today’s world. With data at this scale at hand, we have coined a term for it: Big Data.
But with storing big data come big problems. No data is valuable if there is no way to extract the information it carries. Moving and processing data at this scale is not easy, and although the cost of computation has dropped, that makes little difference if the data cannot be processed within acceptable timelines. Traditional data management technology companies such as Oracle, SAP, and Sun Microsystems tried to solve this problem by increasing the capacity of a single data processing unit. But this approach becomes exponentially more expensive with a linear increase in data scale. So companies like Google and Facebook, which had “web scale” data, set out to find more effective solutions, and they invented technologies such as Hadoop. Dr. Ted Dunning from MapR gave an excellent explanation of this phenomenon at dotScale 2016 in Paris.
The video explains very well how using a network of compute units, rather than one big compute unit, lets us rationalize cost against return value. Google engineers came up with the core concept behind Hadoop, MapReduce; the research paper outlining its design initially appeared at the OSDI ’04 conference. The MapReduce framework relies on the concept of processing data as (key, value) units.
The map phase runs computations on each line of the big file independently. The reduce phase then aggregates the results by key. To get a better understanding of how the MapReduce framework works, I highly recommend Chapter 2, “Map-Reduce and the New Software Stack”, of the book Mining Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
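To make the (key, value) idea concrete, here is a minimal word-count sketch in plain Python that mimics the two phases. This is an illustration of the paradigm only, not Hadoop's actual API; the function names are my own.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word on every line, independently."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key (the word) and sum the counts."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big value", "data science"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'big': 2, 'data': 2, 'value': 1, 'science': 1}
```

Because every map call touches only one line and every reduce call touches only one key, a real framework can spread both phases across thousands of machines; this single-process version just makes the contract visible.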
With the emergence of MapReduce came Apache Hadoop, a Java implementation of the MapReduce framework developed with major contributions from Yahoo's research lab. A lot of technologies, such as Apache HBase, Cassandra, and Ignite, came into being inspired by key-value pairs and the MapReduce paradigm itself. A more recent project, Apache Spark, simplifies working in the MapReduce paradigm by hiding the complexities of the thought pattern behind a more functional API for running massively parallel data pipelines. You can refer to my pySpark API Tutorial for a quick glimpse of how to use MapReduce in data processing.
We have now seen how businesses have changed, what this change is rooted in, and how technology has adapted to it with solutions that can tame this big data.
In my next post, I will discuss the methods invented to make sense of this data, aka Data Science, and how to use them to create value and a competitive edge in a business organization.
Please subscribe to my blog to stay in the loop on my latest posts about Big Data, Data Science, applications of Machine Learning, and building high-scale data pipelines.