DataFrame: The DataFrame API was introduced in Spark 1.3.0. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations...
This is the first part of a series of posts that will outline the importance of Spark in solving Machine Learning problems. The series covers the complete set of steps necessary for a Data Science project. These steps, commonly known in the industry as a data pipeline, consist of...
In this post, we will be discussing the new features of Spark 2.0.0 and its installation on Hadoop 2.7. We highly recommend that readers go through the posts below on Spark to get a clear idea of what Spark is and the reasons behind its popularity. Beginner’s Guide...
In this post, we will be looking at a case study that calculates the average number of friends by user age on a social media website, using Apache Flink in Scala. Our previous post gave a brief introduction to Flink. Hence, we request you to go...
In this post, we will be discussing Apache Flink, its installation on a single-node cluster, and how it competes with the present Big Data frameworks. Let’s begin with the basics. What is Apache Flink? Apache Flink is an open-source platform for distributed stream and batch data...
In this post, we will be discussing how to stream Twitter data using Spark Streaming. Let’s begin with what Spark Streaming is. Before moving on to Spark Streaming, we recommend that readers get some familiarity with Spark Core and RDDs. Spark RDD’s in Scala part-1 Spark RDD’s in Scala...
In this post, we will work on a case study to find the list of most popular movies. We will perform various transformations and actions to display the movies that occur most often in the given data set. Let’s start our discussion with the data definition by considering...
In this post, we will work on a case study to find the minimum temperature observed at a given weather station in a particular year. Let’s begin by considering a sample of four records. Data Definition: Column 1: Weather Station Column 2: Date (Year/Month/Day) Column 3: Observation Type Column 4:...
In this post, we will work on a case study to calculate the average number of friends by user age on a social media website. Let’s begin by considering a sample of four records. Column 1: User ID Column 2: User Name Column 3: Age of the User...
In this post, we will be discussing how to implement a custom input format in Spark. We will do this by using Hadoop’s custom input format. You can refer to our previous post to get an idea of how a custom input format has been implemented...