You are probably attending your first-ever interview as a big data professional after completing your data engineer certification program and practicing a bit in internship and volunteer roles. You have a humble beginner’s portfolio to show as proof of your experience but are not sure which questions you will be asked in the interview room.
It is vital to have an idea of how the interview questions are framed and the kind of answers expected by your panelists. Below we highlight common big data questions that are likely to come up in an interview session.
What is big data?
The term big data refers to very large, complex datasets, whether structured, semi-structured, or unstructured, that require special management tools and techniques because traditional relational database systems have limited capacity to handle them. Big data analysis matters to businesses because the data holds insights and trends that help owners understand their business and make better data-driven decisions.
What are the core dimensions of big data?
Big data is defined by core characteristics known as the Vs of big data. These are:
- Volume refers to the amount of data, usually measured in petabytes and exabytes.
- Velocity refers to the ever-increasing speed at which data is generated, for instance in conversations on social media platforms, forums, and other sites.
- Variety refers to the various forms in which data comes, for example audio, video, and text.
- Veracity refers to the level of accuracy of the available data.
- Value refers to the business-relevant insights present in the collected data.
What is Hadoop?
Hadoop is an open-source framework written in Java and developed by the Apache Software Foundation. It enables distributed storage and processing of large datasets across clusters of computer nodes using simple programming models.
What is the connection between Hadoop and big data?
The Hadoop framework is a widely used solution for storing and processing big data in order to discover the hidden trends, patterns, customer behavior, and insights needed to make informed data-driven decisions. It provides distributed storage and parallel data processing, both of which are well suited to big data.
What are the core features of Hadoop?
Some important features of Hadoop are:
Hadoop is available at no cost. Also, its source code is open to users across the globe for modification to meet the user’s data analytics requirements.
Runs on commodity hardware.
Hadoop does not need specialized machines. A node with roughly 64-512 GB of RAM, which is ordinary commodity hardware, is enough to run computations, and this is quite economical compared to other options like ***. Commodity hardware refers to inexpensive, widely available machines and components rather than specialized high-end servers.
Hadoop can be scaled up easily to increase performance and take care of increased demand by simply adding up to thousands of nodes to clusters.
Hadoop supports the distributed processing of data. Data is stored in the Hadoop Distributed File System (HDFS) in which data is distributed across clusters of computer nodes to allow for faster parallel processing by MapReduce.
By default, Hadoop automatically creates three replicas of each block of data across nodes. If one of the nodes suffers a hardware failure, the data can be recovered automatically from the other nodes, so data is highly available in Hadoop.
Simple user interface.
Hadoop’s user interface is easy to use as the framework handles all distributed computing processes without much involvement of the users.
Rather than moving data to the computation, as is common in other systems, Hadoop moves the computation (the MapReduce tasks) to the nodes in the cluster where the data is stored. This principle, known as data locality, avoids shipping large datasets across the network and makes it easier for Hadoop to process big data.
What are the components of the Hadoop ecosystem?
The components of a Hadoop ecosystem include:
HDFS, the Hadoop Distributed File System, is the schemaless storage layer of Hadoop. In HDFS a data file is split into blocks, each 64 MB in size by default in Hadoop 1.x and 128 MB in Hadoop 2.x and later. Each block is replicated across different nodes, which makes the system fault tolerant. An HDFS cluster consists of a NameNode and several DataNodes.
YARN, meaning Yet Another Resource Negotiator, is the resource management layer of Hadoop. It manages cluster resources and schedules jobs, which allows multiple data processing engines, such as real-time streaming and batch processing, to run on the same cluster.
MapReduce is the programming model of Hadoop. Developers use it to write applications that divide the processing of structured and unstructured data in HDFS into tasks, which then run in parallel in two phases, Map and Reduce, making Hadoop computing fast and reliable.
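The two MapReduce phases can be illustrated with a word count. The sketch below is plain Python rather than Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration. It only mimics the flow on one machine: map emits key-value pairs, the framework groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big insights", "data drives decisions"]))
```

In a real cluster each call to the map function would run on the node that holds the corresponding block, and the reduce calls would run in parallel across nodes.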
Explain the functions of the two HDFS components.
HDFS is made up of the following components.
- NameNode. This is the master node that holds the metadata information for all data blocks in the HDFS.
- DataNodes. Also known as the slave nodes, DataNodes are responsible for storing actual data.
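To make the split between metadata and data concrete, here is a toy Python model. It assumes a 128 MB block size and a replication factor of 3, both passed as parameters so other values can be tried, and it places replicas round-robin, which is a simplification: real HDFS placement is rack-aware. The function names and the DataNode names are hypothetical.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data it holds.
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, datanodes, replication=3):
    """Toy NameNode metadata: map each block index to the DataNodes holding it.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300)
    print(blocks)
    print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

The dictionary returned by `place_replicas` plays the role of the NameNode's metadata, while the DataNodes named in it would hold the actual bytes.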
Name and explain the roles of YARN components.
YARN is made up of two main components which are:
- ResourceManager. The ResourceManager receives processing requests and allocates cluster resources to applications, through the NodeManagers, based on their requirements.
- NodeManager. NodeManagers are responsible for the execution of tasks on the DataNodes.
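The division of labor between the two components can be sketched with a toy model in Python. This is not the YARN API: the classes below are hypothetical, resources are reduced to memory only, and the first-fit placement rule is an assumption chosen for brevity, whereas real YARN schedulers are considerably more involved.

```python
class NodeManager:
    """Toy NodeManager: tracks free memory and the containers it runs."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, app_id, memory_mb):
        """Reserve memory and record a running container for the application."""
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))


class ResourceManager:
    """Toy ResourceManager: decides which node runs each requested container."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        """First-fit allocation: return the chosen node's name, or None if none fits."""
        for node in self.node_managers:
            if node.free_mb >= memory_mb:
                node.launch(app_id, memory_mb)
                return node.name
        return None


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 4096)])
    for app_id, memory in [("app-1", 1536), ("app-2", 1024), ("app-3", 4096)]:
        print(app_id, "->", rm.allocate(app_id, memory))
```

The point of the sketch is only that one central component makes the allocation decision while the per-node components carry out the work.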
What are the steps taken when deploying a big data solution?
- Ingestion. This refers to the extraction of data from various sources such as RDBMS, SAP, CRM systems, social media feeds, documents, and other logs. This data can either be ingested as streaming or batch data.
- Data storage. After ingestion, the data is stored, either in HDFS or in a NoSQL database system like HBase.
- Data processing. The stored data is then processed using frameworks like MapReduce and Spark.
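Differentiate between structured, semi-structured, and unstructured data
- Structured data follows a fixed schema and fits into the rows and columns of tables, as in an RDBMS.
- Semi-structured data has no fixed schema but carries tags or markers that describe its elements, as in JSON and XML files.
- Unstructured data, derived from various sources and coming in different forms, cannot be organized into tables. Popular storage solutions for unstructured data include MongoDB, Cassandra, and HBase.
How is a NameNode recovered when it is down?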
The FsImage, the file system metadata replica of the NameNode, is used to start a new NameNode. The DataNodes and clients are then configured to acknowledge the new NameNode. The new NameNode starts serving clients once it has finished loading the last checkpoint from the FsImage and has received enough block reports from the DataNodes.
What is FSCK and what does it do?
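FSCK, short for File System Check, is a command used to check HDFS for inconsistencies in files, such as missing, corrupt, or under-replicated blocks. It produces a report of the problems it finds but, unlike fsck on a native file system, it does not correct them.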
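A health check of this kind is usually started from the command line. The Python sketch below builds an `hdfs fsck` invocation with the `-files`, `-blocks`, and `-locations` options and runs it only when the `hdfs` client is found on the PATH. The helper names and the example path are hypothetical.

```python
import shutil
import subprocess


def build_fsck_command(path="/"):
    """Build the argument list for an HDFS file system check of the given path."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]


def run_fsck(path="/"):
    """Run the check and return its report text, or None if hdfs is not installed."""
    if shutil.which("hdfs") is None:
        return None
    result = subprocess.run(
        build_fsck_command(path), capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    # "/user/data" is a placeholder path used only for illustration.
    print(" ".join(build_fsck_command("/user/data")))
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact command before anything is run against a cluster.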
What are the differences between Hadoop and RDBMS?
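Hadoop applies schema on read, meaning structure is imposed when the data is read, while an RDBMS applies schema on write, meaning data must fit the schema when it is stored.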
Hadoop is designed to handle structured, semi-structured, and unstructured data types.
RDBMS only handles structured data.
Hadoop has faster writes, since data is stored without schema validation.
An RDBMS performs relatively fast reads, since the schema of the data is already known.
Hadoop is an open-source framework thus available at no cost to users across the globe.
Many RDBMS products, on the other hand, are licensed software that users have to purchase.
Given that Hadoop is an open-source framework, its source code is available for customization to meet users’ requirements.
A proprietary RDBMS generally cannot be customized at the source-code level.
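The schema difference in this comparison is the one that most often needs an example. The following Python sketch contrasts the two approaches with two hypothetical classes: one validates every row when it is written, as an RDBMS does, and the other stores raw lines untouched and applies a schema only when the data is read, as is typical with files in Hadoop. The schema and the sample records are invented for illustration.

```python
import json

# Hypothetical schema used by both stores: column name -> expected Python type.
SCHEMA = {"id": int, "name": str}


class SchemaOnWriteTable:
    """Validates each row against the schema before storing it."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, row):
        if set(row) != set(self.schema):
            raise ValueError("row columns do not match the schema")
        for column, expected_type in self.schema.items():
            if not isinstance(row[column], expected_type):
                raise TypeError(f"column {column!r} has the wrong type")
        self.rows.append(row)


class SchemaOnReadStore:
    """Stores raw lines as-is and applies a schema only when reading."""

    def __init__(self):
        self.raw_lines = []

    def append(self, raw_line):
        # No validation here, which is why writes are cheap.
        self.raw_lines.append(raw_line)

    def read(self, schema):
        """Yield only the records that can be interpreted with the given schema."""
        for line in self.raw_lines:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if isinstance(record, dict) and all(
                isinstance(record.get(column), expected_type)
                for column, expected_type in schema.items()
            ):
                yield {column: record[column] for column in schema}


if __name__ == "__main__":
    store = SchemaOnReadStore()
    store.append('{"id": 1, "name": "ann", "extra": true}')
    store.append("not json at all")
    print(list(store.read(SCHEMA)))
```

The trade-off shows up in where the cost lands: the first class rejects bad data early and makes reads simple, while the second accepts anything quickly and pays for interpretation on every read.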
What are the different modes in which Hadoop runs?
Standalone (local) mode. This is the default Hadoop mode and uses the local file system for input and output operations. It is mainly used for debugging. It does not use HDFS and requires no custom configuration in the mapred-site.xml, core-site.xml, and hdfs-site.xml files.
Pseudo-distributed (single-node cluster) mode. All Hadoop services, including the NameNode and DataNode, run on a single node in this mode, so the master and the slave nodes are the same machine.
Fully distributed (multi-node cluster) mode. In the fully distributed mode, the master and slave nodes run independently on separate nodes.
What are the common input formats in Hadoop?
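Text input format, which is the default input format and treats each line of a text file as a record.
Key-value input format, which reads plain text files and splits each line into a key and a value.
Sequence file input format, which reads files stored in Hadoop's binary sequence file format.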
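The difference between the first two formats lies in what each record's key and value contain. The Python sketch below mimics that behavior outside Hadoop, assuming the usual conventions: the text format uses the byte offset of each line as the key and the line itself as the value, while the key-value format splits each line at the first tab. The function names are hypothetical, and the binary sequence file format is not covered.

```python
def text_input_records(data):
    """Mimic the text input format: key = byte offset of the line, value = line text."""
    records = []
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        records.append((offset, raw_line.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw_line)
    return records


def key_value_records(data, separator="\t"):
    """Mimic the key-value input format: split each line at the first separator.

    A line without the separator becomes the key, with an empty value.
    """
    records = []
    for raw_line in data.splitlines():
        key, _, value = raw_line.decode("utf-8").partition(separator)
        records.append((key, value))
    return records


if __name__ == "__main__":
    sample = b"alice\t30\nbob\t25\n"
    print(text_input_records(sample))
    print(key_value_records(sample))
```

The same input file therefore produces different records for a mapper depending only on which input format the job is configured with.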
This list of interview questions is not exhaustive. The topic of big data is broad, and the more time you take to familiarize yourself with the interview questions, the better prepared you will be to tackle them in front of a panel. Further, don’t forget to rehearse the more personal questions that touch on your experience and background.
Follow Techdee for more informative articles.