We have all heard the term ‘cache’ and are familiar with its use. But what is its importance when it is operated collectively in a distributed system? In this blog, we will answer that question and also walk through a sample MapReduce code execution.
In computing, a distributed cache is an extension of the traditional concept of a cache used in a single location. A distributed cache may span multiple servers so that it can grow in transactional capacity and in size. It is mainly used to store application data residing in a database, as well as web session data.
The idea of distributed caching has become feasible because main memory has become very cheap and network cards have become very fast. 1 Gbit is now the standard everywhere, and 10 Gbit is gaining popularity. Also, a distributed cache works well on the lower-cost machines usually employed for web servers, as opposed to database servers, which require expensive hardware.
The diagram below gives an architectural view for distributed cache.
Terminologies
SDS: Small Data Set
LDS: Large Data Set
DN1, DN2, DN3, DN4: Data Node1, Data Node2, Data Node3, Data Node4 respectively.
First, the LDS and SDS are copied to the Distributed File System. Later, during MapReduce execution, the setup method loads the SDS into memory, indicated by the star in the diagram above.
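The loading step can be sketched in plain Java, outside Hadoop: the small dataset is parsed into an in-memory map keyed by Cust ID, which is what a mapper's setup method effectively does with a file from the distributed cache. The class name, method names, and record layout below are illustrative assumptions, not the linked source code:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheLoader {
    // Parse one SDS line of the assumed form "custId,name,city"
    // into the lookup map (key: Cust ID, value: Name).
    static void addRecord(Map<String, String> cache, String line) {
        String[] fields = line.split(",");
        cache.put(fields[0], fields[1]);
    }

    public static void main(String[] args) {
        Map<String, String> custCache = new HashMap<>();
        // In a real job, these lines would be read from the cached SDS file.
        addRecord(custCache, "101,Stephen,Portland");
        System.out.println(custCache.get("101")); // prints "Stephen"
    }
}
```

Because the whole SDS fits in memory, every mapper can do lookups locally without any network round trip during the map phase.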
What if a server goes down, and it is a cache server? In that case, the data must be fetched again from its original source for execution to continue.
A cache is usually a memory-based database designed for quick retrieval. Data in the cache sticks around only as long as it is being used regularly, and is eventually purged. But for distributed systems that need persistence, a common technique is to keep multiple copies.
Example: you have servers U, V, W, X, Y, and Z. Data 1 would be placed on U, with copies on V and W. Data 2 could be placed on V, with copies on W and X. This way, if any one server goes down, you still have two copies.
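The placement scheme above can be sketched as a simple ring: each data item lands on one server, and the copies go to the next servers around the ring. This is a minimal illustration of the idea, not a production replication strategy; the class and method names are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacement {
    // Place item i on server (i mod N) and the next (copies - 1)
    // servers, wrapping around, for "copies" replicas in total.
    static List<String> replicasFor(int itemIndex, String[] servers, int copies) {
        List<String> placed = new ArrayList<>();
        for (int k = 0; k < copies; k++) {
            placed.add(servers[(itemIndex + k) % servers.length]);
        }
        return placed;
    }

    public static void main(String[] args) {
        String[] servers = {"U", "V", "W", "X", "Y", "Z"};
        System.out.println(replicasFor(0, servers, 3)); // data 1 -> [U, V, W]
        System.out.println(replicasFor(1, servers, 3)); // data 2 -> [V, W, X]
    }
}
```

With three replicas, losing any single server still leaves two live copies of every item, matching the example above.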
Let’s consider an example code with Map-Reduce and see how the distributed caching concept is implemented.
Data Definition for SDS
Cust ID : 101
Name : Stephen
City : Portland
Data Definition for LDS
Cust ID : 101
TransactionID : 501
Date : 07/17/15
No. of transactions and amounts : [93;46;36;96]
We will now perform the map-side join on these two input files, where the SDS is stored in the cache and the LDS is placed in HDFS.
In this example, we use Hadoop MapReduce as the reference code, where only the mapper does the work, not the reducer. Hence, it is also called a map-side join using the distributed cache.
The source code performs a join on Name and No. of transactions and amounts from the two datasets using Cust ID, emitting the result as Name, No. of Transactions, and Total Amount.
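The core of that join can be sketched in plain Java: with the SDS already in memory, each LDS record needs only a map lookup on Cust ID, a count, and a sum. This is a simplified stand-in for the mapper's map method, with assumed names and an in-memory map in place of the distributed-cache file:

```java
import java.util.Map;

public class MapSideJoinSketch {
    // Look up the customer name in the cached SDS, count the
    // transactions, and sum the amounts from the LDS record.
    static String join(Map<String, String> cache, String custId, int[] amounts) {
        int total = 0;
        for (int a : amounts) total += a;
        return cache.get(custId) + "\t" + amounts.length + "\t" + total;
    }

    public static void main(String[] args) {
        // SDS held in memory, as the distributed cache would provide it.
        Map<String, String> custCache = Map.of("101", "Stephen");
        // LDS record: Cust ID 101, amounts [93;46;36;96].
        System.out.println(join(custCache, "101", new int[]{93, 46, 36, 96}));
        // prints "Stephen	4	271"
    }
}
```

For the sample record above, this yields Name Stephen, 4 transactions, and a total amount of 271, which is exactly the output schema the job produces.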
For the source code and dataset, please follow the link below.
Steps for Execution
Refer to the screenshots for the step-by-step execution of this example.
- First, make sure all your daemons are up and running.
Also, put your datasets into HDFS for further operation.
- Once loaded, check the data inside for confirmation.
We use the cat command on the Transactions.csv and Customers.csv files.
Transactions.csv
Customers.csv
Now, execute the JAR exported from Eclipse on Hadoop to perform the distributed-cache (map-side) join.
In the above screenshot, note that the counters show byte values, which confirms that our code executed successfully.
Now, using the cat command, check the contents of the output file.
We have now executed the map-side join using the distributed cache.
In our next blog, we will discuss the internal working of the map-side join covered here.
For other trending Big Data blogs and use cases, please visit ACADGILD.