In this first part of our Hadoop interview questions series, we discuss various questions related to the Big Data Hadoop ecosystem.
We have linked relevant posts for most of the questions, which you can refer to for practical implementation.
1. What are the different types of file formats in Hive?
Ans. The different file formats that Hive can handle are:
- TEXTFILE
- SEQUENCEFILE
- RCFILE
- ORCFILE
For more detailed explanation, click here
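The file format is chosen at table-creation time with a STORED AS clause. A minimal sketch (the table and column names below are illustrative, not from the original post):

```sql
-- Hypothetical table stored in the ORC format.
CREATE TABLE employee_orc (
  id   INT,
  name STRING
)
STORED AS ORC;

-- The same table could instead be declared with
-- STORED AS TEXTFILE, SEQUENCEFILE, or RCFILE.
```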
2. Explain Indexing in Hive.
Ans. An index acts as a reference to the records. Instead of scanning all the records, Hive can consult the index to locate a particular record, so a record can be found with minimal overhead and data searches are faster.
For more detailed explanation, click here
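A minimal sketch of creating a compact index (the table and index names are illustrative; this syntax applies to older Hive releases, since indexing was removed in Hive 3.0):

```sql
-- Hypothetical compact index on the id column of an employee table.
CREATE INDEX idx_employee_id
ON TABLE employee (id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build the index data before it can be used.
ALTER INDEX idx_employee_id ON employee REBUILD;
```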
3. Explain the Avro file format in Hadoop.
Ans. Avro is one of the preferred data serialization systems because of its language neutrality. Since Hadoop's Writable classes lack language portability, Avro is a natural choice: it can handle data that is written and read by multiple languages, which makes it the preferred way to serialize data in Hadoop.
It uses JSON for defining data types and protocols. It serializes data in a compact binary format.
Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
Avro can therefore be described as a file format introduced with Hadoop to store data in a predefined format. This file format can be used with any of Hadoop's tools, such as Pig and Hive.
For more detailed explanation, click here
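Hive can also create Avro-backed tables directly (in Hive 0.14 and later). A minimal sketch, with an illustrative table name:

```sql
-- Hypothetical table whose data files are written in the Avro format.
CREATE TABLE clicks_avro (
  user_id BIGINT,
  url     STRING
)
STORED AS AVRO;
```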
4. Does Hive support transactions?
Ans. Yes, Hive supports transactions from Hive 0.13 onwards, with some restrictions.
For detailed information, click here
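A minimal sketch of a transactional table, assuming ACID support is enabled in the Hive configuration (the table and column names are illustrative; in these Hive versions an ACID table must be bucketed and stored as ORC):

```sql
-- Hypothetical ACID table: ORC format, bucketed, transactions enabled.
CREATE TABLE orders_txn (
  order_id INT,
  status   STRING
)
CLUSTERED BY (order_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```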
5. Explain the Top-k MapReduce design pattern.
Ans. The Top-k MapReduce design pattern is used to find the top k records from a given dataset.
This design pattern achieves this by defining a ranking function or comparison function between two records that determines whether one is higher than the other. We can apply this pattern to use MapReduce to find the records with the highest value across the entire data set.
For more detailed explanation, click here
6. Explain Hive Storage Handlers.
Ans. Storage Handlers are a combination of an InputFormat, an OutputFormat, a SerDe, and specific code that Hive uses to identify an external entity as a Hive table. This allows the user to issue SQL queries seamlessly, whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage Handlers are not limited to NoSQL databases; a storage handler can be designed for several different kinds of data stores.
For practical implementation of this concept, click here
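For example, an HBase-backed Hive table is declared with a STORED BY clause. A minimal sketch, assuming the Hive-HBase integration jars are on the classpath (the Hive table, columns, and HBase table name are illustrative):

```sql
-- Hypothetical Hive table mapped onto an existing HBase table named 'users'.
CREATE EXTERNAL TABLE hbase_users (
  rowkey STRING,
  name   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
TBLPROPERTIES ('hbase.table.name' = 'users');
```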
7. Explain partitioning in Hive.
Ans. Table partitioning means dividing table data into parts based on the values of particular columns, thus segregating input records into different directories based on those column values.
For practical implementation of partitioning in Hive, click here
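A minimal sketch of a partitioned table (the table, columns, and file path are illustrative); each distinct value of the partition column becomes its own directory under the table's location:

```sql
-- Hypothetical table partitioned by country.
CREATE TABLE sales (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING);

-- Load data into one specific partition (creates the directory country=US).
LOAD DATA INPATH '/data/sales_us.csv' INTO TABLE sales PARTITION (country = 'US');
```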
8. What is the use of Impala?
Ans. Cloudera Impala is a massively parallel processing (MPP) SQL query engine that allows users to execute low-latency SQL queries on data stored in HDFS and HBase, without any data transformation or movement.
The main goal of Impala is to make SQL-on-Hadoop operations fast and efficient, to appeal to new categories of users and open up Hadoop to new types of use cases. Impala makes SQL queries accessible to analysts who are familiar with SQL and to those using business intelligence tools that run on Hadoop.
For more detailed explanation on Impala, click here
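As a usage illustration, this kind of query can be issued interactively from impala-shell. The table and query below are illustrative, assuming the table is already defined in the shared metastore:

```sql
-- Hypothetical low-latency aggregation over data in HDFS, run from impala-shell.
SELECT country, COUNT(*) AS orders
FROM sales
GROUP BY country
ORDER BY orders DESC
LIMIT 10;
```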
9. Explain how to choose between Managed & External tables in Hive.
Ans. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
- The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
- Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set, or if you are iterating through various possible schemas.
Use INTERNAL tables when:
- The data is temporary.
- You want Hive to completely manage the life cycle of the table and data.
For more detailed information, click here
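A minimal sketch contrasting the two (the table names, columns, and path are illustrative): dropping the EXTERNAL table leaves the files under /data/events untouched, while dropping the managed table deletes its data as well.

```sql
-- Hypothetical EXTERNAL table: Hive tracks only metadata; data survives DROP TABLE.
CREATE EXTERNAL TABLE events_ext (
  event_id BIGINT,
  payload  STRING
)
LOCATION '/data/events';

-- Hypothetical managed (INTERNAL) table: Hive owns both the metadata and the data.
CREATE TABLE events_managed (
  event_id BIGINT,
  payload  STRING
);
```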
10. What are the different methods in the Mapper class, and in what order are they invoked?
Ans. There are three methods in the Mapper class:
- map() – executes once for each record of the input split (each line, in the case of TextInputFormat)
- setup() – executes once per input split, at the beginning of the map task
- cleanup() – executes once per input split, at the end of the map task
Order of invocation: setup() first, then map() for every record, and finally cleanup().
11. What is the purpose of the RecordReader in Hadoop?
Ans. In MapReduce, data is divided into input splits. The RecordReader typically converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper and Reducer tasks for processing. It thus assumes the responsibility of handling record boundaries and presenting the tasks with keys and values.
12. What details are present in the FsImage?
Ans. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in the FsImage. The FsImage is stored as a file in the NameNode's local file system.
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is sufficient to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk.
It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up.
13. Why do we need bucketing in Hive?
Ans. Bucketing is a simple idea: you create a fixed number of buckets, read each record, and place it into one of the buckets based on some logic, usually a hashing algorithm applied to a column. This lets you organize your data by decomposing it into multiple parts. You might wonder, if we can achieve the same thing with partitioning, why bother with bucketing. The difference is that partitioning creates a partition for every unique value of the column, which can give rise to thousands of tiny partitions, whereas bucketing lets you limit the data to a number of buckets that you choose. In Hive, a partition is a directory, but a bucket is a file.
For more detailed explanation, click here
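A minimal sketch of a bucketed table (the table and column names are illustrative); records are distributed into a fixed number of files by hashing the bucketing column:

```sql
-- Hypothetical table bucketed on user_id into 16 files.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
CLUSTERED BY (user_id) INTO 16 BUCKETS;

-- On older Hive versions, enforce bucketing when inserting data.
SET hive.enforce.bucketing = true;
```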
14. What is a SequenceFile in Hadoop?
Ans. In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, the SequenceFile is a Hadoop-specific file format that stores serialized key/value pairs.
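One easy way to see the format in use is a Hive table stored as a SequenceFile; a minimal sketch with an illustrative table name:

```sql
-- Hypothetical table whose underlying data files are Hadoop SequenceFiles.
CREATE TABLE logs_seq (
  log_ts  STRING,
  message STRING
)
STORED AS SEQUENCEFILE;
```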
15. How do you copy files from one cluster to another?
Ans. With the help of the DistCp command, we can copy files from one cluster to another.
The most common invocation of DistCp is an inter-cluster copy:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
We will be coming up with more questions and detailed explanations in the next posts.
Keep visiting our website acadgild.com for more blogs and posts on trending Big Data topics.
To learn more about Big Data Hadoop click here.