The term Big Data has recently been making headlines almost everywhere. However, not many people actually understand it beyond the fact that it refers to a large collection of data. In this blog we will look at this term and try to decipher its meaning in the context of its implications for business and industry.
Big data is actually a combination of three separate technologies: edge devices, cloud computing, and data analytics.
Edge devices are sensors that collect data. These sensors come in various shapes, sizes, and forms:
a temperature sensor providing a data reading every five minutes is a sensor;
a surveillance camera installed at a road intersection is also a sensor;
a smart meter sending electricity consumption information is another type of sensor;
the smartphone you carry is also a sensor;
the applications on your smartphone are also collecting data through your texts, images, videos, GPS, and more.
The proliferation of all these edge devices and applications is collecting data at a scale never before seen in history.
Together, this collection of data is doubling every two years. The availability of this data has opened up new vistas of scholarship that were not possible even a decade ago.
The second important aspect of big data is cloud computing. The collection of data is so large that no single computer could process it efficiently. Therefore, parallel processing and distributed computing have created a very large virtual computer, giving a person at a remote location access to large-scale data processing power. Of course, for both cloud computing and edge devices, high-speed connectivity is a must.
Fiber backbones and recent developments in high-speed connectivity have made 3G/4G accessible to even the remotest places on the planet.
The third aspect of big data is the analytics tools and techniques that have made it possible to analyze and interpret these vast quantities of data. Techniques like linear regression, classification, boosting, random forests, and others provide newer ways of looking at data using the processing power of cloud computing.
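To make one of the techniques above concrete, here is a minimal sketch of linear regression using scikit-learn. The data is entirely invented for illustration: a handful of daily temperatures and corresponding electricity demand values.

```python
# A minimal linear regression sketch using scikit-learn.
# The temperature/demand figures below are toy values, not real data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: daily temperature (C) vs. electricity demand (MW)
temperature = np.array([[15], [20], [25], [30], [35]])
demand = np.array([100, 120, 145, 170, 200])

# Fit a straight line demand = a * temperature + b
model = LinearRegression()
model.fit(temperature, demand)

# Use the fitted line to predict demand at a new temperature
prediction = model.predict(np.array([[28]]))
print(f"predicted demand at 28 C: {prediction[0]:.1f} MW")
```

The same few lines of code scale from this toy example to millions of rows when run on cloud infrastructure, which is exactly the combination the article describes.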
Beyond data analytics, newer artificial intelligence methods such as machine learning are creating a whole new generation of learning from data. This learning is so powerful that machine learning builds models for prediction and forecasting that humans cannot compete with.
Big data is changing the world we live in, and soon businesses that do not adopt big data will have a hard time competing with those that do. The impact of big data can be seen in almost every field, including health, environment, agriculture, urban planning, astronomy, geology, and more.
Big data has transformed the electric power grid into the smart grid and is transforming the traditional manufacturing industry as well. Not only has it transformed many fields, it has also created new jobs, such as data scientist and data engineer, at the intersection of mathematics, statistics, and computer science.
It is estimated that by 2020 the market for big data and cloud services will reach US$210 billion. Further, it is estimated that only 0.5% of all data collected is ever analyzed for any purpose. This leaves enormous room to create new types of businesses and entrepreneurship opportunities for large and small, new and traditional businesses alike.
Use Of Big Data For Business Expansion
Many traditional businesses sometimes find it hard to use big data for their business expansion.
First Step: Understanding The Five Vs Of Big Data
The first step in using big data for any such expansion is to know the five Vs of big data. These five Vs are variety, value, velocity, volume, and veracity.
a) Variety is the type of data that should be collected. Some data may already be available in business databases. Some is available but not recorded. Some is available but may be on printed forms and sheets. Some data still needs to be collected. One has to look at all sources of data and categorize them into three levels.
The first level is data that is readily available in soft form, in a database or in raw files.
The second level is data that could be collected easily by installing a few edge devices.
The third level is a wishlist of data that requires some investment in edge devices or software.
Moreover, data can be structured or unstructured. It is much easier to process structured data, like ledger tables, but much harder to process unstructured data such as natural language, tweets, or Facebook posts.
b) The second V, Value, culls the data that is needed for the core business of the organization. This is the data that defines the business and must be available for any data analytics exercise later. This step requires out-of-the-box thinking, as newer types of unstructured data, such as YouTube comments, Facebook posts, and others, may give new insight into the business.
c) The third V, Velocity, entails the data collection speed. A smart electricity meter, for example, is capable of sending data once a month, once a week, once a day, once every hour, once every half hour, once every fifteen minutes, once every minute, once every second, and so on. Of course, a higher velocity of data requires an adequate communication channel. The communication cost and availability for data transfer must be taken into account when deciding on the velocity of data.
d) The fourth V, Volume, is a factor of Velocity. The higher the velocity, the higher the volume. For example, a smart meter sending one reading per day for a year for one million customers requires 1.82 TB of disk space, while at one reading every fifteen minutes the disk requirement grows to 2,920 TB. If one decides to collect this much data, then the storage requirement cannot be handled by a single database. Special databases, such as time-series databases, may be needed to scale up the system in a modular way.
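The velocity-to-volume relationship can be sketched with a back-of-the-envelope calculation. Note that the per-reading storage size used here (5 KB) is an assumption chosen for illustration; actual storage depends on schema, indexing, and replication, so the numbers produced are rough estimates rather than the article's exact figures.

```python
# Back-of-the-envelope sketch: how reading velocity drives data volume.
# BYTES_PER_READING is an assumed figure for illustration only.

CUSTOMERS = 1_000_000
DAYS = 365
BYTES_PER_READING = 5_000  # assumption: ~5 KB stored per reading

def yearly_storage_tb(readings_per_day: int) -> float:
    """Total storage in terabytes for one year of readings."""
    readings = CUSTOMERS * DAYS * readings_per_day
    return readings * BYTES_PER_READING / 1e12

daily = yearly_storage_tb(1)            # one reading per day
quarter_hourly = yearly_storage_tb(96)  # one reading every 15 minutes

print(f"daily readings:      {daily:.2f} TB/year")
print(f"15-minute readings:  {quarter_hourly:.1f} TB/year")
```

Whatever the per-record size, moving from daily to fifteen-minute readings multiplies the number of readings, and hence the storage, by 96, which is why the choice of velocity must be made before sizing the storage system.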
e) The final V, Veracity, refers to the trustworthiness of the data. Edge devices may send values that go missing in communication, or a wrong value may be recorded in the first place. Measures must be in place to check the quality of the collected data, as any data analytics exercise requires quality data as input.
Second Step: Data Pre-Processing
The second step is data pre-processing. Pre-processing is further divided into data integration, data cleaning, and data transformation. Data integration techniques combine data from disparate sources into a unified view. For example, weather data and energy usage data may be integrated using the same timescale. Data cleaning is the process of cleansing the data of abnormal values. For example, a temperature value of 100 Celsius for a city is likely erroneous and needs to be corrected or interpolated from the preceding and succeeding values. The data transformation step may be needed if the data must be put into a standard form.
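The three pre-processing sub-steps can be sketched with pandas. All column names, timestamps, and thresholds below are invented for illustration; the weather-plus-energy pairing mirrors the integration example in the text.

```python
# A sketch of the three pre-processing sub-steps using pandas.
# All values, column names, and thresholds are invented for illustration.
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    "timestamp": pd.date_range("2023-07-01", periods=4, freq="h"),
    "temp_c": [31.0, 100.0, 33.0, 34.0],  # 100 C is an obvious error
})
energy = pd.DataFrame({
    "timestamp": pd.date_range("2023-07-01", periods=4, freq="h"),
    "kwh": [120.0, 135.0, 150.0, 160.0],
})

# 1. Integration: align both sources on the same timescale
df = pd.merge(weather, energy, on="timestamp")

# 2. Cleaning: mark implausible temperatures and interpolate
#    from the preceding and succeeding values
df.loc[df["temp_c"] > 55, "temp_c"] = np.nan
df["temp_c"] = df["temp_c"].interpolate()

# 3. Transformation: rescale kWh into a standard 0-1 range
df["kwh_scaled"] = (df["kwh"] - df["kwh"].min()) / (df["kwh"].max() - df["kwh"].min())

print(df)
```

After cleaning, the erroneous 100 C reading becomes 32 C (halfway between its neighbors), and the scaled consumption column is ready to feed into the analytics step that follows.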
Third Step: Selecting Data Analytics Techniques
The third step is the process of selecting data analytics techniques. Various techniques are available, such as
supervised learning, which includes decision trees, naive Bayes, support vector machines, random forests, and many others. Similarly, unsupervised techniques like K-means, hierarchical clustering, etc., may be used to find patterns or clusters in the data. Another category of techniques mines associations in the data, such as FP-growth.
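The contrast between the supervised and unsupervised techniques listed above can be sketched in a few lines of scikit-learn. The tiny dataset (temperature and hour of day versus high/low consumption) is entirely made up for illustration.

```python
# Supervised vs. unsupervised learning on toy data (all values invented).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Features: [temperature C, hour of day]; labels: 1 = high consumption
X = np.array([[30, 14], [32, 15], [10, 3], [12, 4], [31, 16], [11, 2]])
y = np.array([1, 1, 0, 0, 1, 0])

# Supervised: a decision tree learns the label from examples
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict(np.array([[29, 13]])))  # classify a new observation

# Unsupervised: K-means finds two groups with no labels at all
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each row
```

The decision tree needs labeled examples to learn from, while K-means discovers the warm-afternoon and cold-night groups on its own, which is the essential difference between the two families of techniques.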
Big data and its associated analytics require a new kind of expertise in any organization. This new role is a combination of mathematics and statistics, computer science, and business domain knowledge. As big data is a growing field, data science experts are in short supply, and it is genuinely hard to find them, as every type of business is looking for them these days. However, one can learn the fundamentals of data science and its associated analytics through many online resources.