# Data Science Interview Questions with Answers

• June 14, 2021
• Data Science

With high demand and a limited supply of qualified professionals, Data Scientists are among the highest-paid IT professionals. This Data Science interview preparation blog covers the questions most often asked in Data Science job interviews.

1. ## What is linear regression?

Linear regression is a supervised learning algorithm that models the linear relationship between a dependent variable and one or more independent variables. The independent variable is also called the predictor or explanatory variable, and the dependent variable is also called the response variable. In linear regression, we try to understand how the dependent variable changes with respect to the independent variable. If there’s just one independent variable, it’s called simple linear regression; if there’s more than one, it’s called multiple linear regression.
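As a minimal sketch of simple linear regression, the snippet below fits a least-squares line in plain Python. The function name `fit_simple_linear_regression` and the hours/scores data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Unnormalized covariance of x and y, and unnormalized variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (independent) vs. exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_simple_linear_regression(hours, scores)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
```

The slope tells us how much the dependent variable changes for each unit change in the independent variable, which is the relationship described above.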

1. ## What is logistic regression?

Logistic regression is a classification algorithm that can be used when the dependent variable is binary. Let’s take an example. Suppose the x-axis represents the runs scored by Sachin Tendulkar and the y-axis represents the probability of team India winning the match. On such a graph, if Sachin Tendulkar scores more than 75 runs, the probability of team India winning the match is greater than 50 percent. Similarly, if he scores fewer than 75 runs, the probability of team India winning the match is less than 50 percent. So, in logistic regression, the y value always lies between 0 and 1.
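The cricket example can be sketched with the sigmoid function that logistic regression uses to squeeze any number into the range 0 to 1. The coefficients `WEIGHT` and `BIAS` below are hypothetical, picked by hand so that the probability crosses 50 percent at 75 runs; a real model would learn them from match data.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (not learned from data): the linear term
# WEIGHT * runs + BIAS is zero at 75 runs, so the probability there is 0.5.
WEIGHT = 0.1
BIAS = -7.5


def win_probability(runs):
    """Estimated probability of a win given the runs scored."""
    return sigmoid(WEIGHT * runs + BIAS)


for runs in (50, 75, 100):
    print(runs, round(win_probability(runs), 3))
```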

1. ## What is a confusion matrix?

A confusion matrix is a table used to evaluate the performance of a classification model. It tabulates the actual values against the predicted values; for a binary classifier, this forms a 2×2 matrix.

True Positive (d): This denotes all of the records where the actual values are true and the predicted values are also true.

False Negative (c): This denotes all of the records where the actual values are true, but the predicted values are false.

False Positive (b): Here, the actual values are false, but the predicted values are true.

True Negative (a): Here, the actual values are false and the predicted values are also false. So, if you want the number of correct predictions, it is the sum of the true positives and the true negatives.
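The four cells can be counted directly from paired actual and predicted labels. This is a small sketch in plain Python; the helper name `confusion_counts` and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count the four cells of a 2x2 confusion matrix from boolean labels."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1  # actual true, predicted true
        elif a and not p:
            fn += 1  # actual true, predicted false
        elif not a and p:
            fp += 1  # actual false, predicted true
        else:
            tn += 1  # actual false, predicted false
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical labels for eight records.
actual = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

counts = confusion_counts(actual, predicted)
# Correct predictions are the true positives plus the true negatives.
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts, accuracy)
```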

1. ## What do you understand by true positive rate and false positive rate?

True positive rate: In Machine Learning, the true positive rate, also referred to as sensitivity or recall, measures the percentage of actual positives that are correctly identified.

Formula: True Positive Rate = True Positives/Positives, where Positives = True Positives + False Negatives.

False positive rate: The false positive rate is the probability of falsely rejecting the null hypothesis for a specific test. It is calculated as the ratio of the number of negative events wrongly categorized as positive (false positives) to the total number of actual negative events.

Formula: False Positive Rate = False Positives/Negatives, where Negatives = False Positives + True Negatives.
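Both formulas translate directly into code. The sketch below assumes the confusion-matrix counts are already known; the function names and the example counts are hypothetical.

```python
def true_positive_rate(tp, fn):
    """Share of actual positives that were predicted positive (sensitivity, recall)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of actual negatives that were wrongly predicted positive."""
    return fp / (fp + tn)


# Hypothetical counts: 50 actual positives and 50 actual negatives.
tpr = true_positive_rate(tp=40, fn=10)
fpr = false_positive_rate(fp=5, tn=45)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Note that each rate is divided by the number of actual positives or actual negatives, not by the total number of records.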

1. ## What is Data Science?

Data Science is a field of computer science that deals explicitly with turning data into information and extracting meaningful insights from it. Data Science is so popular because the kind of insights it lets us draw from the available data has led to major innovations in several products and companies. Using these insights, we are able to determine the taste of a particular customer, the likelihood of a product succeeding in a particular market, etc.

1. ## How is Data Science different from traditional application programming?

Data Science takes a fundamentally different approach to building systems that provide value than traditional application development.

In traditional programming paradigms, we used to analyze the input, figure out the expected output, and write code containing the rules and statements needed to transform the provided input into the expected output. As we can imagine, these rules were not easy to write, especially for data that even computers had a hard time understanding, e.g., images, videos, etc.

Data Science shifts this process a little bit. Here, we need access to large volumes of data that contain the required inputs and their mappings to the expected outputs. Then, we use Data Science algorithms, which use mathematical analysis to generate rules that map the given inputs to outputs. This process of rule generation is called training. After training, we use some data that was set aside before the training phase to test and check the system’s accuracy. The generated rules are a kind of black box, and we cannot always understand how the inputs are being transformed into outputs. However, if the accuracy is good enough, then we can use the system (also called a model).

As described above, in traditional programming, we had to write the rules to map the input to the output, but in Data Science, the rules are automatically generated or learned from the given data. This helped solve some really difficult challenges that were being faced by several companies.
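The contrast can be illustrated with a toy problem. In the sketch below, `hand_written_rule` stands for the traditional approach, while `train_threshold` learns its rule (a single cut-off value) from labeled examples and is then checked on held-out data. All names and data are hypothetical, and a one-number threshold is far simpler than a real model.

```python
def hand_written_rule(value):
    """Traditional programming: the developer hard-codes the rule."""
    return value >= 30


def train_threshold(examples):
    """Training: pick the cut-off that classifies the most training examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(value for value, _ in examples):
        correct = sum((value >= candidate) == label for value, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def accuracy(threshold, examples):
    """Fraction of examples where the learned rule matches the label."""
    return sum((value >= threshold) == label for value, label in examples) / len(examples)


# Hypothetical labeled data: (input value, expected output).
training_data = [(12, False), (18, False), (22, False), (27, True), (31, True), (35, True)]
# Data set aside before training, used only to check the learned rule.
test_data = [(15, False), (20, False), (29, True), (33, True)]

threshold = train_threshold(training_data)
test_accuracy = accuracy(threshold, test_data)
print(threshold, test_accuracy)
```

Here the hand-written cut-off of 30 would misclassify the test input 29, whereas the learned cut-off comes from the data itself.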

1. ## Explain the differences between supervised and unsupervised learning.

Supervised and unsupervised learning are two types of Machine Learning techniques. Both allow us to build models. However, they’re used for solving different kinds of problems.

Supervised Learning works on data that contains both the inputs and the expected outputs, i.e., labeled data.

Unsupervised Learning works on data that contains no mappings from input to output, i.e., unlabeled data.

Commonly used supervised learning algorithms: linear regression, decision tree, etc. Commonly used unsupervised learning algorithms: K-means clustering, Apriori algorithm, etc.
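To make the unsupervised side concrete, here is a minimal one-dimensional K-means sketch in plain Python. Unlike a supervised method, it is never given a label for any value; it only groups values that lie close together. The function name `k_means_1d`, the starting centroids, and the data are assumptions made for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids (no labels are used)."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster,
        # or leave it in place if the cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


# Hypothetical unlabeled data with two visible groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = k_means_1d(values, centroids=[0.0, 5.0])
print(centroids)
```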

1. ## What is bias in Data Science?

Bias is a type of error that occurs in a Data Science model when the algorithm used is not strong enough to capture the underlying patterns or trends that exist in the data. Such a model is too simple for the data and tends to underfit.
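A rough way to see bias is to fit a model that is too simple and look at its error on the very data it was trained on. In the sketch below, a constant model (always predict the mean) is compared with a straight line on data that follows a linear trend. The data and helper names are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data following the trend y = 3x + 1.
x = [0, 1, 2, 3, 4]
y = [1, 4, 7, 10, 13]

# High-bias model: ignores x entirely and always predicts the mean of y.
mean_y = sum(y) / len(y)
constant_mse = mean_squared_error(y, [mean_y] * len(y))

# More flexible model: a straight line can capture the trend.
slope, intercept = fit_line(x, y)
line_mse = mean_squared_error(y, [slope * xi + intercept for xi in x])

print(constant_mse, line_mse)
```

The constant model has a large error even on its own training data because it cannot represent the trend at all, which is the kind of error the answer above describes.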