1.What is Big Data?
Big data is a collection of huge, complex data sets from various sources that are difficult to capture, store, manage, process, retrieve and analyze with traditional data processing techniques or on-hand database management tools.
2.What do the four V’s stand for in Big Data?
According to IBM, the four V’s are the four critical features of big data:
Volume – Scale of data
Velocity – Analysis of streaming data
Variety – Different forms of data
Veracity – Uncertainty of data
3.Name some Top companies that use Hadoop?
Facebook, Adobe, eBay, Twitter, Yahoo, Netflix, Rubikloud, Hulu, Amazon, Spotify
4.How can analyzing big data increase business revenue? Give an example.
Big data analysis helps businesses differentiate themselves. For example, Rubikloud provides software that empowers retailers to make data-driven decisions, and it increased its revenue by $7 million in 2015. Many other top companies use big data to boost their revenue, such as Facebook, Twitter, LinkedIn, JPMorgan Chase, Pandora and Bank of America.
5.What is Hadoop?
Hadoop is an open-source software framework written in Java. It consists of two main components for storing and processing huge data sets: the Hadoop Distributed File System (HDFS), which is modeled on the Google File System, and MapReduce.
6.Which platforms and Java version does Hadoop require to run?
Hadoop requires Java 1.6.x or higher, preferably from Sun. Linux and Windows are the supported operating systems, but Mac OS X, BSD and Solaris are also known to work.
7.What are the input formats supported by Hadoop?
The input formats supported by Hadoop are:
Key Value Input Format
Text Input Format
Sequence File Input Format
Text Input Format is the default input format.
8.What is the purpose of RecordReader in Hadoop?
An InputSplit describes a unit of work but does not know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into (key, value) pairs suitable for reading by the mapper. The RecordReader instance is created by the Input Format.
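The conversion from raw bytes to (key, value) pairs can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's own code: the function name read_records is hypothetical, and the sketch assumes a whole split is already in memory. It mimics the behavior of the default Text Input Format, where the key is the byte offset of a line and the value is the line's text.

```python
def read_records(split_bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the split.

    Hypothetical helper that imitates what a line-oriented record reader
    hands to the mapper: the key is the offset, the value is the line.
    """
    offset = 0
    for raw_line in split_bytes.splitlines(keepends=True):
        # Strip the line terminator from the value, but count it in the offset.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


for key, value in read_records(b"hello world\nfoo\n"):
    print(key, value)
```

A real record reader also has to deal with lines that straddle split boundaries, which this sketch leaves out.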
9.What is Heartbeat in HDFS?
The Hadoop Distributed File System (HDFS) is responsible for storing huge data in Hadoop. A heartbeat is a signal sent periodically from a Data Node to the Name Node, and from a Task Tracker to the Job Tracker. If the Name Node or Job Tracker stops receiving the signal, it assumes that there is a problem with the Data Node or Task Tracker.
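The following Python sketch shows the general idea of heartbeat-based failure detection. It is a simplified model under stated assumptions, not how Hadoop implements it: the class HeartbeatMonitor, its method names and the 10-second timeout are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Toy model of a master node tracking heartbeats from worker nodes."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # hypothetical threshold
        self.last_seen = {}                     # node id -> last heartbeat time

    def record_heartbeat(self, node_id, now):
        """Remember when a node last reported in."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("datanode-1", now=0)
monitor.record_heartbeat("datanode-2", now=8)
print(monitor.dead_nodes(now=12))  # datanode-1 has been silent for 12 seconds
```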
10.What are the main elements/components of a Hadoop Application?
Hadoop supports a wide range of technologies for storing and processing huge, complex data.
The core components of Hadoop are HDFS for storage and MapReduce for processing. Other components that help complete the task are:
Hadoop Common and YARN
Data access – Pig and Hive
Data integration – Apache Flume, Chukwa and Sqoop
Data storage – HBase
Data serialization – Thrift and Avro
Data management and monitoring – Ambari, ZooKeeper and Oozie
Data intelligence – Apache Mahout and Drill
11.What is Hadoop streaming?
The Hadoop distribution includes a generic API for writing Map and Reduce jobs in a language of your choice, such as Python, Ruby or Perl. This is called Hadoop Streaming. It lets users create and run jobs with any executable or shell script as the Mapper or Reducer.
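As a rough illustration, a word count for Hadoop Streaming could be written in Python along the following lines. Streaming exchanges records with the script as text lines, by default with the key and value separated by a tab, and the reducer receives its input sorted by key. The function names mapper and reducer are just this sketch's choice, and the logic works on any iterable of lines so it can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() stands in for the shuffle and sort that Hadoop does.
mapped = list(mapper(["to be or not", "to be"]))
for line in reducer(sorted(mapped)):
    print(line)
```

To run this as a real streaming job, each function would be applied to sys.stdin in its own script, with the results printed to standard output, and the scripts would be passed to the streaming jar through its -mapper and -reducer options.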
12.What happens when a Data node fails?
When a Data Node fails:
The Name Node and Job Tracker detect the failure
All tasks that were running on the failed node are re-scheduled on other nodes
The Name Node replicates the user data to another node
13.Explain what WebDAV is in Hadoop.
WebDAV is a set of extensions to HTTP that supports editing and updating files. WebDAV shares can be mounted as file systems on most operating systems, so it is possible to access HDFS as a standard file system by exposing HDFS over WebDAV.
14.Explain how JobTracker schedules a task?
The Task Tracker sends regular heartbeat messages to the Job Tracker to assure it that the Task Tracker is still alive and functioning. Each message also reports the number of available slots, so the Job Tracker stays up to date on where in the cluster work can be delegated.
15.How to debug Hadoop code?
Hadoop code can be debugged in different ways, but the most popular are:
Using the web interface provided by the Hadoop framework
Using Counters
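As one possible illustration of counters, a Hadoop Streaming script can update a counter by writing a line of the form reporter:counter:<group>,<counter>,<amount> to standard error. The sketch below counts malformed input records that way. The group name Debug, the counter name MalformedRecords and the helper functions are all made up for this example.

```python
import sys


def counter_line(group, counter, amount=1):
    """Build the line that Hadoop Streaming interprets as a counter update."""
    return f"reporter:counter:{group},{counter},{amount}"


def parse_record(line):
    """Return (key, value), or None for a malformed line.

    A malformed line bumps a (hypothetical) Debug/MalformedRecords counter,
    whose total then appears alongside the job's other counters.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        print(counter_line("Debug", "MalformedRecords"), file=sys.stderr)
        return None
    return parts[0], parts[1]
```

Checking the counter total after a run is a cheap way to learn how many records took an unexpected path without reading through task logs.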
16.Can we use Non Java Code for writing jobs for Hadoop?
Yes, Hadoop Streaming allows any shell command to be used as a map or reduce function.
17.Explain what Sqoop is in Hadoop.
Sqoop is a tool for transferring data back and forth between an RDBMS and Hadoop HDFS.
18.What is distributed cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files needed by a job at execution time. The framework copies the relevant files to a slave node before any task runs on that node.
19.What is rack awareness?
Rack awareness is the way the Name Node decides where to place blocks based on the rack definitions. Hadoop tries to keep network traffic between Data Nodes within the same rack and contacts remote racks only when necessary.
20.What is the relationship between a Task and a Job in Hadoop?
A job is divided into multiple tasks; together, all the tasks make up the job.
21.List some Hadoop tools that are required to work on Big data.
HBase, Hive, Ambari, Apache Sqoop, Mahout, Lucene, Avro, Spark, Apache Drill, ZooKeeper, etc.
22.What causes the error “file could only be replicated to 0 nodes, instead of 1”?
The Namenode does not have any available DataNodes.
23.How do you switch off safe mode in HDFS?
Use the command: hadoop dfsadmin -safemode leave (in Hadoop 2, hdfs dfsadmin -safemode leave).
24.Discuss the indexing process in HDFS.
HDFS splits the data according to the block size and stores it sequentially. The last part of each stored piece of data indicates where the next part of the data is located.
25.What is speculative execution in Hadoop?
If a node runs a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that completes first is accepted and the other is killed. This procedure is known as “Speculative Execution”.
26.Mention the ways to achieve High Availability in a Hadoop Cluster?
HA can be set up in two different ways: using NFS for the shared storage, or using the Quorum Journal Manager.
27.Mention the three modes in which Hadoop can run.
The three modes in which Hadoop can run are:
Pseudo distributed mode
Standalone or Local mode
Fully distributed mode
28.What is the role of ZooKeeper in a Hadoop Cluster?
The main goal of ZooKeeper is cluster management. It helps coordinate Hadoop nodes, and it also helps to manage configuration across nodes, implement redundant services, implement reliable messaging and synchronize process execution.
29.What is Commodity Hardware? Does it include RAM?
Commodity hardware is inexpensive hardware that does not guarantee high quality or high availability. Hadoop can be installed on any average commodity hardware; it does not require supercomputers or high-end hardware. Yes, commodity hardware includes RAM, because some services still need to run in RAM.
30.What is the difference between an Input Split and an HDFS Block?
An Input Split is a logical division of the data, while an HDFS Block is a physical division of the data.
31.Point out the difference between Hadoop 1 and Hadoop 2.
Hadoop 2.x has two Namenodes, active and passive. If the active Namenode fails, the passive one takes charge, which achieves high availability. Hadoop 1.x has only one Namenode, which is a single point of failure.
Hadoop 2.x has YARN, which provides a central resource manager. With YARN, all applications share a common pool of resources, so multiple applications can run in Hadoop. The MapReduce framework runs on top of YARN as a particular type of distributed application called MR2. Other tools can also process data through YARN, which was a problem in Hadoop 1.x.
32.What are Active and Passive Namenodes?
Hadoop 2.x has two Namenodes, Active and Passive. The Active Namenode is the one that works and runs in the cluster. The Passive Namenode is a standby Namenode that holds data similar to the Active Namenode. Whenever the Active Namenode fails, the Passive Namenode takes charge and replaces it in the cluster. Hence the cluster is never without a Namenode, and a Namenode failure does not bring the cluster down.
33.How to add or remove nodes in a Hadoop Cluster?
The Hadoop framework uses commodity hardware, which leads to frequent DataNode crashes in a Hadoop Cluster. The framework also scales easily to keep up with rapid growth in data volume. Because of these two features, one of the most common tasks of a Hadoop administrator is to commission and decommission Data Nodes in a Hadoop Cluster.
34.What happens when two clients try to use the same file in HDFS?
HDFS supports exclusive writes only. When the first client contacts the Namenode to open the file for writing, the Namenode grants it access. When a second client then asks the Namenode to open the same file for writing, the Namenode rejects the open request.
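The single-writer rule can be modeled with a small Python sketch. This is a simplified illustration rather than the Namenode's real bookkeeping: the class WriteLeases and its methods are hypothetical, and details such as lease expiry and recovery are left out.

```python
class WriteLeases:
    """Toy model of exclusive write access: one writer per file at a time."""

    def __init__(self):
        self.writers = {}  # file path -> client currently writing it

    def open_for_write(self, path, client):
        """Grant write access unless another client already holds the file."""
        holder = self.writers.get(path)
        if holder is not None and holder != client:
            return False  # the second writer's open request is rejected
        self.writers[path] = client
        return True

    def close(self, path, client):
        """Release the file so that another client may write to it."""
        if self.writers.get(path) == client:
            del self.writers[path]


leases = WriteLeases()
print(leases.open_for_write("/data/log.txt", "client-1"))  # granted
print(leases.open_for_write("/data/log.txt", "client-2"))  # rejected
```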
35.Define a block in HDFS. Specify the block sizes in Hadoop 1 and Hadoop 2. Can it be changed?
A block is the minimum amount of data that can be read or written. HDFS files are divided into block-sized chunks, which are stored as independent units.
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size: 128 MB
Yes, the block size can be changed by setting the dfs.block.size property in the hdfs-site.xml file (Hadoop 2 names this property dfs.blocksize).
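To see what the block size means in practice, the short Python calculation below works out how many blocks a file occupies under each default. The 500 MB file size is an arbitrary example and the helper names are this sketch's own. With the defaults above, such a file needs 8 blocks of 64 MB or 4 blocks of 128 MB, and the final block holds only the leftover bytes.

```python
MB = 1024 * 1024


def block_count(file_size, block_size):
    """Number of blocks needed for a file (ceiling division on integers)."""
    return -(-file_size // block_size)


def last_block_bytes(file_size, block_size):
    """Bytes held by the final block, which may be smaller than block_size."""
    if file_size == 0:
        return 0
    return file_size % block_size or block_size


file_size = 500 * MB  # arbitrary example file
for label, block_size in (("Hadoop 1", 64 * MB), ("Hadoop 2", 128 * MB)):
    blocks = block_count(file_size, block_size)
    tail_mb = last_block_bytes(file_size, block_size) // MB
    print(f"{label}: {blocks} blocks, last block holds {tail_mb} MB")
```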
36.Why is HDFS used for applications with large data sets rather than for many small files?
HDFS is better suited to a large amount of data in a single file than to a small amount of data spread across many files. The Namenode is a costly, high-performance system, and every file adds metadata to it, so it is wasteful to fill the Namenode with the avoidable metadata generated by many small files. The Namenode uses less space when a large amount of data is stored in a single file. For optimized performance, HDFS therefore favors large data sets over many small files.
37.Discuss the indexing process in HDFS.
Hadoop has its own way of indexing data. HDFS splits the data according to the block size, and the last part of each stored piece of data indicates where the next part of the data is located.
38.Why is reading performed in parallel in HDFS, but not writing?
With a MapReduce program, a file can be read in parallel by splitting it into blocks. Writing cannot be parallelized in the same way, because the incoming values are not yet known to the system, so HDFS offers no option for parallel writes.
39.What causes the error “connection refused java exception” when accessing HDFS or its files?
The Namenode may not be running on your VM, it may be in safe mode, or its IP address may have changed.
40.Describe the features of Fully Distributed mode.
Fully distributed mode is generally used in production environments. In this mode, n machines form a Hadoop Cluster and the Hadoop daemons run across them. The NameNode runs on its own node, separate from the nodes that run a DataNode together with a Task Tracker or NodeManager. This kind of deployment has separate masters and slaves.
41.What is MapReduce?
MapReduce is a programming model and framework for processing large data sets across clusters of computers using distributed programming.
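The model can be imitated on a single machine with a few lines of Python. The sketch below is an in-memory illustration only, with no distribution involved: the function names and the sample "year,temperature" records are made up. It runs a map step, groups the intermediate pairs by key (the shuffle), and then reduces each group.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce over an in-memory list of records."""
    # Map: every record may produce any number of (key, value) pairs.
    pairs = [pair for record in records for pair in map_fn(record)]

    # Shuffle: collect all the values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: collapse each key's values into a single result.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def parse_reading(record):
    """Map function: turn 'year,temperature' into one (year, int) pair."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def hottest(year, temperatures):
    """Reduce function: keep the maximum temperature seen for the year."""
    return max(temperatures)


readings = ["1950,0", "1950,22", "1949,111", "1949,78"]
print(run_job(readings, parse_reading, hottest))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, but the three stages are the same.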
42.How to debug Hadoop code?
Many methods are available for debugging the code, but the most popular methods are:
Using the web interface presented by the Hadoop framework.
Using Counters.
43.List the significant configuration parameters in a MapReduce program.
Users of the MapReduce framework need to specify these parameters:
Input format
Job’s input locations in the distributed file system
Output format
Job’s output location in the distributed file system
Class containing the “reduce” function
Class containing the “map” function
44.What is the default input format in MapReduce?
By default, the input format in MapReduce is Text.
45.Why can’t aggregation be performed in a mapper? Why do we need a reducer for it?
Sorting occurs only on the reducer side, not on the mapper side. Mapper initialization depends on the input split, and the split is fed to the mapper row by row. During aggregation, the mapper handles each row on its own and does not retain the value from the previous row, so we cannot keep track of previous row values.
46.Explain Distributed cache in a MapReduce Framework.
Distributed cache is a significant feature of the MapReduce framework. It is used when you need to share files across multiple nodes in a Hadoop cluster. The files can be simple properties files or executable jar files.
47.How do reducers communicate with each other?
This is a tricky question. The MapReduce programming model does not allow reducers to communicate with each other; they run in isolation.
48.What is the role of the MapReduce Partitioner?
The MapReduce partitioner ensures that all the values of a single key go to the same reducer, which helps distribute the map output evenly over the reducers. It routes each mapper output record to a reducer by determining which reducer is responsible for the specific key.
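A hash-based partitioner can be sketched in Python as follows. Hadoop's default partitioner works on the same principle, taking the key's hash code modulo the number of reduce tasks. This sketch substitutes zlib.crc32 as the hash function so that results are stable between runs, and the function names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def assign(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]
for index, bucket in enumerate(assign(map_output, num_reducers=3)):
    print(index, bucket)
```

Because the reducer is chosen from the key alone, both "apple" records end up in the same bucket. How evenly the buckets fill depends on how well the keys spread under the hash function.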
49.What is a Combiner?
A Combiner is a mini reducer that performs the local reduce task. It receives the input from the mapper on a specific node and sends its output to the reducer. It increases the efficiency of MapReduce by reducing the amount of data that needs to be sent to the reducers.
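The saving can be illustrated with a small Python sketch. The function combine is a hypothetical stand-in for a combiner that sums counts on the mapper's node, and the sample map output is made up. Local pre-aggregation like this is only safe for operations such as sums, where combining partial results does not change the final answer.

```python
from collections import Counter


def combine(map_output):
    """Sum the counts per key locally, before anything is sent to a reducer."""
    totals = Counter()
    for key, count in map_output:
        totals[key] += count
    return sorted(totals.items())


# Output of one mapper: four records would cross the network without a combiner.
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(map_output)
print(len(map_output), "records before,", len(combined), "records after")
print(combined)
```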
50.List the different relational operations in Pig Latin.
The different relational operations in Pig Latin are:
foreach
filter
order by
distinct
group
limit
join