Big Data: Simple Definition

Simply put, Big Data is the collection and storage of massive amounts of data. IDC defines Big Data as projects collecting 100 terabytes of data, comprising two or more data formats.

Now comes the most important and hardest part: finding meaning in that geyser of data that companies can act on and, crucially, convert into revenue.

By analyzing all the information, sales managers can quickly understand sales reps’ results, view new contracts lost or signed and react to how actual performance compares to plans set months earlier. Help-desk staff can see how individual customers affect sales and profit, showing them which customers to focus on and which cost too much to support.

Business Intelligence

1.    Business Intelligence and Data Mining

Volume, velocity and variety (the 3Vs) are commonly used to characterize different aspects of big data. Volume: the ability to process large amounts of information. Velocity: the increasing rate at which data flows into an organization. Variety: a common theme in big data systems is that the source data is diverse and doesn’t fall into neat relational structures. A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application.

1.1    Challenge


Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, we must choose an alternative way to process it.


Most of the Big Data surge is unstructured data that is not hosted in traditional databases. It’s a mix of words, images and videos on the web, and streams of data.


Organizations today find themselves with a new set of challenges. Business executives need to exercise real control over business operations, even when those operations are distributed and complex. They also need to empower and support their people, ensuring that the staff they have can do their job and do it well. They must do all this while delivering more personalized, more customized products and services to meet the increasing demands of consumers.


1.2    Opportunity


Big Data can be used to create statistical models. These models are useful for understanding, but they might spot a correlation and draw a statistical inference that is unfair or discriminatory, affecting the outcome of an analysis or a decision about the products, bank loans or health insurance offered to a person. Nevertheless, data is the driver of a continuous improvement process. Research studies show that data-driven decision making achieved productivity gains 5 to 6 percent higher than intuition-based approaches.


The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports.



Enterprise applications generate a lot of marketing, sales and inventory data. They handle further types of data when they interact with the systems of vendors, suppliers and distributors. This, along with data from social interactions, perceptions and general mobile and web applications, produces a huge data set comprising structured and unstructured data.


The explosion of data generated by web and mobile interactions, along with a company’s data from its ERP, SCM, CRM and WFP systems, is creating an opportunity. Retailers analyze sales, pricing, economic, demographic and weather data to determine product selections and the timing of markdowns at particular stores. Shipping companies fine-tune routes based on truck delivery times and traffic patterns. Social sites refine their algorithms to analyze personal characteristics, reactions and communications.


Global Pulse, a United Nations initiative, is designed to utilize Big Data for global development. It applies sentiment analysis, using natural-language processing, to messages in social networks and text messages to help predict job losses, spending reductions or disease outbreaks in a given region.


The emergence of big data into the enterprise brings with it a necessary counterpart: agility. Successfully exploiting the value in big data requires experimentation and exploration. Whether creating new products or looking for ways to gain competitive advantage, the job calls for curiosity and an entrepreneurial outlook.


1.3    Creating the right model


Data is not only becoming more available but also more understandable to computers. At the forefront are the rapidly advancing techniques of artificial intelligence like natural language processing, pattern recognition and machine learning. The wealth of new data, in turn, accelerates advances in computing: a virtuous circle of Big Data. Machine-learning algorithms, for example, learn from data, and the more data, the more the machines learn. Today, social-network research involves mining huge digital data sets of collective behavior online. Researchers can see patterns of influence and peaks in communication on a subject, for example by following trending hashtags on Twitter. Big data has become viable, as cost-effective approaches have emerged to streamline the volume, velocity and variability of massive data. Within this data lie valuable patterns and information, previously hidden because of the amount of work required to extract them.

2.    Data Mining

2.1    Data, Information and Knowledge


Today, data mining is used primarily by companies with a strong consumer focus: retail, financial, communication and marketing organizations. It enables these companies to determine relationships among “internal” factors such as price, product positioning, or staff skills, and “external” factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to “drill down” into summary information to view detailed transactional data.

2.1.1    Data

  • Operational, transactional data – sales, cost, inventory, payroll, and accounting
  • Nonoperational data, such as industry sales, forecast data, and macroeconomic data
  • Financial data such as earnings per share
  • Meta data – data about the data itself, such as logical database design or data dictionary definitions

Today it can also include any of the following:

  • Perception data – gathered by polling or surveying mechanism
  • Sentiment data – gathered by user interactions with social networking and blogging sites.

2.1.2    Information

Patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling; now we may also want to see correlations in the data. Traditionally, this meant finding correlations and patterns among fields in large relational databases. Big Data makes the task more complex, as organizations are accumulating both a greater volume and a greater variety of data.
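
As a minimal sketch of what looking for a correlation between two fields can mean in practice, the following computes a Pearson correlation coefficient in plain Python. The function name `pearson` and the figures in `rainfall_mm` and `umbrella_sales` are invented for illustration; they are not data from any study mentioned in this paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily figures for one store (illustrative values only).
rainfall_mm = [0, 2, 5, 9, 14]
umbrella_sales = [3, 5, 11, 19, 29]

r = pearson(rainfall_mm, umbrella_sales)
print(f"correlation = {r:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship between the two fields, but it says nothing about cause and effect.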


2.1.3    Knowledge

It may also be possible to make predictions from this information. Knowledge about historical patterns and future trends can be created to assist the decision-making process. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine demand and promotional prices.

2.1.4    Classes, Clusters, Associations, Sequential patterns

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups, for example by customer sentiment.

Clusters: Data items are grouped according to logical relationships or consumer preferences, for example market segments or consumer affinities.

Associations: Data can be mined to identify associations.

Sequential patterns: Data is mined to anticipate behavior patterns and trends.
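
To make the "associations" relationship more concrete, here is a small sketch that counts how often pairs of items appear together in transactions and derives the usual support and confidence measures. It is one simple way to look for associations, not the method of any particular data mining product; the function `pair_rules`, the threshold and the sample `baskets` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules "a implies b" over item pairs.

    support    = share of transactions containing both a and b
    confidence = share of transactions containing a that also contain b
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical point-of-sale baskets (illustrative only).
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]

rules = pair_rules(baskets, min_support=0.5)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket containing milk also contains butter, so the rule from milk to butter has a confidence of 1.0, while the reverse rule is weaker.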

3.    Service Oriented Architecture: Wrapping Legacy Applications with Web Services

3.1    A simple definition for SOA

Microsoft’s definition:

A loosely coupled architecture designed to meet the business needs of the organization.

In the past, loosely coupled architectures have relied upon not just SOAP or web services but various other technologies for transport and remote method invocation. Many of these technologies are still in widespread use and are being augmented, replaced or extended with Web Services. SOA borrows some of the concepts of remote object invocation (services, discovery and late binding) as well as OOA/OOD techniques based upon encapsulation, abstraction and clearly defined interfaces. The metaphor that helps achieve these benefits has changed. Instead of using method invocation on an object reference, service orientation shifts the conversation to that of message passing, a proven metaphor for scalable distributed software integration.

Service Orientation does not necessarily require rewriting functionality from the ground up. A business process can be designed by reusing existing IT assets: wrapping them into modular services and connecting those services into a business process management and collaborative workflow layer, with reporting on top of the existing IT assets.

3.2    Leverage existing applications

Extend and evolve what we already have – Create IT support for new cross-functional business processes that extend beyond the boundaries of what the existing applications were designed to do.

Fundamental to the service model is the separation between the interface and the implementation. A resource is used only through its published interface, never by directly invoking the implementation behind it. A consistent, managed interface can have many alternative implementations of the same service without any modification to the requesting application. Service consumer and provider do not have to use the same technologies for the implementation, interface, or integration. The architectural concepts associated with SOA enable loose coupling. Loose coupling is the fundamental principle behind SOA, enabling us to sum up the benefit of SOA in a single word: agility.
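
The separation between interface and implementation can be sketched in a few lines. The example below is only an illustration under assumed names: `InventoryService`, `LegacyInventory`, `WarehouseFeedInventory` and `needs_reorder` are hypothetical and do not come from any real system. The point is that the consumer function is written against the published interface and keeps working when the implementation behind it is swapped.

```python
from typing import Protocol


class InventoryService(Protocol):
    """Published interface: the only thing a consumer depends on."""

    def stock_level(self, sku: str) -> int: ...


class LegacyInventory:
    """Hypothetical implementation backed by records from an existing system."""

    def __init__(self, records: dict[str, int]) -> None:
        self._records = records

    def stock_level(self, sku: str) -> int:
        return self._records.get(sku, 0)


class WarehouseFeedInventory:
    """Hypothetical alternative implementation fed by (sku, quantity) events."""

    def __init__(self, events: list[tuple[str, int]]) -> None:
        self._events = events

    def stock_level(self, sku: str) -> int:
        return sum(qty for event_sku, qty in self._events if event_sku == sku)


def needs_reorder(service: InventoryService, sku: str, threshold: int) -> bool:
    """Consumer code: uses the interface and never sees the implementation."""
    return service.stock_level(sku) < threshold


legacy = LegacyInventory({"A-100": 4})
feed = WarehouseFeedInventory([("A-100", 10), ("A-100", -6)])

# The same consumer call works against either implementation.
print(needs_reorder(legacy, "A-100", threshold=5))
print(needs_reorder(feed, "A-100", threshold=5))
```

In a real SOA the interface would be a service contract reached over a network rather than a Python protocol, but the dependency direction is the same.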

3.3    Services Model

The service model is “fractal” in nature. A Service in SOA is exposed by its interface; the implementation can evolve over time without disturbing the clients of the service. This abstraction of the capability from how the capability is delivered is key to the concept. We can extend OO principles into the world of Web Services by further understanding the frequently cited “four tenets” of Service Orientation:

Services interact through explicit message passing over well-defined boundaries; thus Service Boundaries are Explicit. The Service Oriented Integration Pattern takes into account the trust model, security, network latency and distributed system failures involved in each boundary crossing. Implementation details should not be exposed outside of a service boundary, to avoid tighter coupling between the service and the service’s consumers. While the boundaries of a service are fairly stable, the service’s deployment options regarding policy, physical location or network topology are likely to change.

Services Are Autonomous entities that are independently deployed, versioned, and managed. The keys to realizing autonomous services are isolation and decoupling. Services are designed and deployed independently of one another and may only communicate using contract-driven messages and policies.


Services share schema and contract, so service interaction should be based upon a service’s policies, schema, and contract-based behaviors.

Lastly, service compatibility is based upon policy assertions that are as explicit as possible regarding service expectations and service semantic compatibilities.
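
The tenets of explicit boundaries and shared schema and contract can be illustrated with a toy service that accepts and returns only serialized messages and checks each request against a declared schema. This is a sketch under stated assumptions, not a web-service stack: `ORDER_SCHEMA`, `validate` and `order_service` are hypothetical names, and JSON stands in for whatever message format a real contract would specify.

```python
import json

# Hypothetical contract: the fields a request must carry and their types.
ORDER_SCHEMA = {"order_id": str, "sku": str, "quantity": int}


def validate(message: dict, schema: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the message conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors


def order_service(raw_request: str) -> str:
    """Service boundary: text in, text out. No objects are shared with the caller."""
    try:
        message = json.loads(raw_request)
    except json.JSONDecodeError:
        return json.dumps({"status": "rejected", "errors": ["malformed JSON"]})
    if not isinstance(message, dict):
        return json.dumps({"status": "rejected", "errors": ["message must be an object"]})

    errors = validate(message, ORDER_SCHEMA)
    if errors:
        return json.dumps({"status": "rejected", "errors": errors})
    return json.dumps({"status": "accepted", "order_id": message["order_id"]})


request = json.dumps({"order_id": "o-1", "sku": "A-100", "quantity": 2})
print(order_service(request))
print(order_service('{"order_id": "o-2", "quantity": "two"}'))
```

Because the caller only ever sees messages that follow the contract, the code behind `order_service` can change freely as long as the schema and replies stay the same.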


3.3.1    An Abstract SOA Reference Model

A holistic approach to understanding SOA is as follows. SOA is fractal in nature. Services are the fundamental building blocks of SOA, although services do not necessarily need to be web services. Ideally services should follow the four service design principles, which describe a set of best practices for service scopes, dependencies, communications and policy-based configuration. An abstract SOA reference model, borrowed here, provides three fundamental concepts (Expose, Compose and Consume) to define the role that services play in existing architectures.

3.3.2    Logical Architecture

As stated earlier, the SOA architectural model is fractal and not a layered model. This means that a service can be used to Expose IT infrastructure and be Composed into workflows or Business Processes. The resulting service then can be Consumed by end users, systems or other services. The model is a set of independent architectural initiatives referred to as a Service Implementation Architecture (Expose), a Service Integration Architecture (Compose) and an Application Architecture (Consume). While each of these architectures is designed to be independent of one another, they share a set of five common capabilities.

Each aspect of the Expose / Compose / Consume abstract reference model encompasses a set of five recurring architectural capabilities: Messaging and Services, Workflow and Processes, Data, User Experience and Identity and Access. The five architectural capabilities serve as a set of views to better understand the challenges associated with Exposing existing IT investments as services, Composing the services into business processes and Consuming these processes across the organization.
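
The Expose / Compose / Consume idea can be sketched as follows. Everything here is an assumption made for illustration: the two `legacy_*` functions stand in for existing IT assets, the `SERVICES` registry and `call` helper stand in for a service bus or registry, and `approve_order` stands in for a composed business process.

```python
# Expose: existing functions (stand-ins for legacy systems) are published
# under service names instead of being called directly.
def legacy_credit_check(customer_id: str) -> bool:
    # Placeholder rule for the sketch: customers flagged with "X" fail the check.
    return not customer_id.startswith("X")


def legacy_reserve_stock(sku: str, quantity: int) -> bool:
    # Placeholder rule for the sketch: at most 10 units can be reserved.
    return quantity <= 10


SERVICES = {
    "credit.check": legacy_credit_check,
    "stock.reserve": legacy_reserve_stock,
}


def call(service_name: str, **payload):
    """Invoke an exposed service by name with a keyword payload."""
    return SERVICES[service_name](**payload)


# Compose: a business process built only from exposed services.
def approve_order(customer_id: str, sku: str, quantity: int) -> dict:
    if not call("credit.check", customer_id=customer_id):
        return {"approved": False, "reason": "credit"}
    if not call("stock.reserve", sku=sku, quantity=quantity):
        return {"approved": False, "reason": "stock"}
    return {"approved": True, "reason": None}


# Consume: an end user, system or other service uses the composed process.
result = approve_order("C-42", "A-100", 3)
print(result)
```

Because `approve_order` is itself callable in the same way as the services it uses, it could be registered in `SERVICES` and composed again, which is one way to read the claim that the model is fractal.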


4.    WBPWS, CRM, SCM and ERP

Today’s business platforms can benefit from integration with the proposed decision-oriented data-platform, which is designed to help analysis by relying on an extensible architecture (SOA) and enhanced data mining techniques (Big Data Analysis).

  • Workflow and Business Process Management Systems
  • Customer Relationship Management Systems
  • Supply Chain Management Systems
  • Enterprise Resource Planning Systems

Data Platform

Professor Brynjolfsson says:

“Decisions will increasingly be based on data and analysis rather than on experience and intuition. ‘We can start being a lot more scientific,’ he observes.”


Mr. Smolan, an enthusiast, says that:

“Big Data has the potential to be “humanity’s dashboard,” an intelligent tool that can help combat poverty, crime and pollution.”


This paper proposes the creation of a SOA and Big-Data enabled data-platform. The platform will allow us to achieve continuous improvement through regulating persistent and predictive statistics associated with activities. Strategize, align, govern, execute, and optimize define the governing objectives. The selection of these activities is determined by the theory of cause and effect, to improve the drivers behind the governing objectives. This assumes that business-governing objectives are closely tied to the allocation of capital across activities. A data-platform that regulates the governing objectives by reevaluating selected statistics is claimed to increase productivity by 5 to 6%.


Creation of data-centric and extensible architectures based on SOA to expose existing infrastructure is key to creating a data-platform. Traditional data mining and business analytics techniques can be improved upon using concepts such as Big Data and SOA to create an agile data-platform. The predictive power of Big Data needs to be explored and considered for economic development and economic forecasting.


This agile data-platform can be key in monitoring performance indicators, including sales growth and earnings per share, as well as non-financial measures such as customer loyalty, perceptions and product quality, which can be used to measure and manage results and to implement a continuous improvement process.



It’s important to recognize that not all high-performance computing systems work the same way. Indeed, choosing between a cloud-based or in-house HPC solution may well depend on the kind of processing work that needs to be done. Dennis Gannon, director of cloud research strategy for the Microsoft Research Connections team, analyzed the work performed by about 90 research groups that were given access to Microsoft Azure cloud resources over the last two years. He concluded that four major architectural differences between cloud clusters and supercomputers (machines running thousands, even tens of thousands, of processors) determine which types of high-performance computing should be done where:

  • Each server in a data center hosts virtual machines, and the cloud runs a fabric scheduler, which manages sets of VMs across the servers. This means that if a VM fails, it can be started up again elsewhere. But it can be inefficient and time-consuming to deploy VMs for each server when setting up batch applications common to HPC.
  • Data in data centers is stored and distributed over many, many disks. Data is not stored on the local disks of supercomputers, but on network storage.

Clouds are perfect for large-data collaboration and data analytics like MapReduce (a strategy for dividing a problem into hundreds or thousands of smaller problems that are processed in parallel and then gathered, or reduced, into one answer to the original question).
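
The divide, process in parallel, then gather pattern described above can be sketched on a single machine. This is a toy illustration of the MapReduce idea using Python's standard library, not the Hadoop API; the function names and the sample `chunks` are hypothetical, and a thread pool stands in for the many machines a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(text):
    """Map step: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(mapped):
    """Group the values emitted by all map tasks by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is an independent sub-problem, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


# Hypothetical input, already split into chunks.
chunks = ["big data big", "data moves fast", "big clusters"]
counts = word_count(chunks)
print(counts)
```

The map and reduce functions never share state, which is what lets a real framework spread them over many servers and rerun a failed task elsewhere.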

The growth in the volume of the world’s data is currently outpacing Moore’s Law.

A Hadoop distribution dubbed the Cloudera Distribution Including Apache Hadoop (CDH) is an example of a data-management platform that combines a number of components, including support for the Hive and Pig languages; the HBase database for random, real-time read/write access; the Apache ZooKeeper coordination service; the Flume service for collecting and aggregating log and event data; Sqoop for relational database integration; the Mahout library of machine learning algorithms; and the Oozie server-based workflow engine, among others.

The sheer volume of data is not why most customers turn to Hadoop. Instead, it’s the flexibility the platform provides.

Hadoop is just one of the technologies emerging to support Big Data analytics, according to James Kobielus, IBM’s Big Data evangelist. NoSQL, which is a class of non-relational database-management systems, is often used to characterize key value stores and other approaches to analytics, much of it focused on unstructured content. New social graph analysis tools are used on many of the new event-based sources to analyze relationships and enable customer segmentation by degrees of influence. And so-called semantic web analysis (which leverages the Resource Description Framework specification) is critical for many text analytics applications.


Moore’s Law holds that the number of transistors on integrated circuits doubles approximately every two years. If that is an indication of the pace of computer chip innovation, then chip innovation is not keeping up with the rate at which data is being created.

At its core, Hadoop is a combination of Google’s MapReduce and the Hadoop Distributed File System (HDFS). MapReduce is a programming model for processing and generating large data sets. It supports parallel computations on so-called unreliable computer clusters. HDFS is designed to scale to petabytes of storage and to run on top of the file systems of the underlying operating system. Yahoo released to developers the source code for its internal distribution of Hadoop in 2009.

“It was essentially a storage engine and a data-processing engine combined,” explains Zedlewski. “But Hadoop today is really a constellation of about 16 to 17 open source projects, all building on top of that original project, extending its usefulness in all kinds of different directions.”