  June 15, 2019
What is Spark –

Apache Spark is an open source framework for big data processing, built for speed, ease of use, and sophisticated analytics. It was originally developed in 2009 at UC Berkeley’s AMPLab, was open sourced in 2010, and was donated to the Apache Software Foundation in 2013. Apache Spark has several features and advantages compared to other big data technologies such as Hadoop, MapReduce, Sqoop, and Flume.

Spark is a comprehensive framework for handling a variety of big data requirements, including real-time data processing. The data involved is diverse in nature (structured, semi-structured, or unstructured) and comes from many different kinds of sources. Spark supports both batch and streaming data, for example real-time logs arriving from different data-generating sources.

With Spark, applications on a Hadoop cluster can run up to 100 times faster in memory and up to 10 times faster on disk than with Hadoop MapReduce. Spark achieves this with an in-memory computing engine that works in a parallel and distributed manner.

Spark applications can be written in programming languages such as Java, Scala, and Python. The framework itself is based on Scala, the language in which it was originally developed.

Spark additionally lets you write streaming jobs, SQL queries, and DataFrame operations in code in a more optimized way. It also has libraries for graph processing and machine learning.

Big Data Hadoop and Spark –

Hadoop is an open source software framework designed for storing and processing large-scale, varied data on clusters of commodity hardware.

The Apache Hadoop software library is a framework that allows distributed processing of data across clusters of computers using a simple programming model called MapReduce. It is designed to scale up from single servers to clusters of machines, each offering local computation and storage.

Hadoop solutions normally involve clusters that are hard to manage and maintain. In many scenarios they also require integration with other tools such as MySQL and Mahout.

Hadoop works as a series of MapReduce jobs. Each job has high latency and depends on the one before it, so no job can start until the previous job has completed successfully.
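
To make the chained-job idea concrete, here is a minimal pure-Python sketch of two dependent jobs. It does not use Hadoop's API at all. The function names (word_count_job, top_words_job) and the sample lines are made up for illustration. The first job is a complete map, shuffle, and reduce pass, and the second job can only start once the first has returned its finished output.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count_job(lines):
    """Job 1 (hypothetical name): one full map -> shuffle -> reduce pass."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def top_words_job(counts, n):
    """Job 2 (hypothetical name): needs the finished output of job 1."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample input, used only for this illustration.
lines = ["spark and hadoop", "hadoop runs map reduce", "spark runs in memory"]

counts = word_count_job(lines)   # job 2 cannot begin before this returns
top = top_words_job(counts, 2)
print(top)  # prints [('hadoop', 2), ('runs', 2)]
```

In a real Hadoop deployment each job would also write its output to disk before the next one reads it, which is a large part of the latency described above.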

Apache Spark allows developers to build complex, multi-step data pipelines. It also supports in-memory data sharing across applications structured as a DAG (Directed Acyclic Graph), so that different jobs can work with the same shared data.
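
The idea of jobs sharing data through a DAG can be sketched in a few lines of plain Python. This is only a toy model and not how Spark is implemented: run_dag, the tasks dictionary, and the sample numbers are all assumptions made for this example. Each node's result is kept in an in-memory cache, so two downstream jobs reuse the same loaded data instead of each loading it again.

```python
def run_dag(tasks, targets):
    """Evaluate the target nodes of a task graph, computing each node once.

    tasks maps a node name to (function, [names of upstream nodes]).
    """
    cache = {}  # in-memory results shared by every downstream node

    def evaluate(name):
        if name not in cache:
            fn, deps = tasks[name]
            cache[name] = fn(*[evaluate(dep) for dep in deps])
        return cache[name]

    return {target: evaluate(target) for target in targets}


runs = []  # records how many times the shared step really executes


def load():
    """Shared upstream step; the values are made-up sample data."""
    runs.append("load")
    return [10, 25, 7, 40]


tasks = {
    "load": (load, []),
    "total": (lambda rows: sum(rows), ["load"]),
    "largest": (lambda rows: max(rows), ["load"]),
}

results = run_dag(tasks, ["total", "largest"])
print(results)  # prints {'total': 82, 'largest': 40}
print(runs)     # prints ['load'] because the shared step ran only once
```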

Spark can run on top of the Hadoop Distributed File System (HDFS) and extend what a Hadoop cluster can do. Spark has no storage layer of its own, so it relies on supported external storage such as HDFS.


Spark Features

With in-memory data storage and processing, Spark applications can run many times faster than those built on other big data technologies.

Spark uses lazy evaluation, which helps it optimize the steps in a data processing workflow. It also provides a higher-level API that improves productivity and consistency.
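
Lazy evaluation is easier to see in a small example. The sketch below is plain Python and only mimics the idea. LazyDataset is a hypothetical class written for this post, not Spark's API, although its method names loosely echo the map, filter, and collect operations on Spark RDDs. Transformations only record what should happen, and nothing runs until an action asks for the result.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step, do no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: record the step, do no work yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: run every recorded step in a single pass over the data.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []  # records every call to square so we can see when work happens


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # 0: the pipeline is only a plan so far
result = pipeline.collect()       # the action triggers the actual work
print(calls_before_action, result)  # prints 0 [1, 9, 25]
```

Because the whole plan is known before anything runs, an engine has the chance to reorder or combine steps, which is where the optimization benefit comes from.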

Spark is designed to be a fast, real-time execution engine that works both in memory and on disk.

Spark was originally written in Scala and runs on the Java Virtual Machine (JVM). Applications can currently be written in Java, Scala, Clojure, R, Python, and SQL.

Spark Ecosystem


Spark Components:

Spark Core – The general execution engine and platform on which all other functionality is built.

Spark SQL – Runs on top of Spark Core. It can define schemas for RDDs and run SQL queries against them.

Spark Streaming – Ingests data in real time. It reads data in mini-batches and performs RDD transformations on those batches (an RDD, or Resilient Distributed Dataset, is the basic data unit of Spark), which enables streaming analytics.

MLlib – Spark's machine learning library, which provides machine learning algorithms.
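
The mini-batch approach described under Spark Streaming can be sketched in plain Python. This is a simplified illustration and not Spark's API. The helper names (micro_batches, count_errors) and the log events are invented for the example, and the batches here are cut by event count, whereas Spark Streaming groups incoming data by a time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size events from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_errors(batch):
    """Per-batch transformation: count the events marked as errors."""
    return sum(1 for event in batch if event.startswith("ERROR"))


# Made-up log events standing in for a live stream.
events = ["INFO start", "ERROR disk", "ERROR net", "ERROR cpu", "INFO ok"]

errors_per_batch = [count_errors(b) for b in micro_batches(events, 2)]
print(errors_per_batch)  # prints [1, 2, 0]
```

Each small batch is processed with the same kind of transformation you would apply to a static dataset, which is why batch and streaming code can look so similar.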

