Data Analytics: Interview Questions and Answers
Hello Readers, this is Ruby Jain, technical trainer at Sevenmentor Pvt Ltd. We are in the era of big data, where everyone is talking about managing and handling such data. Some terms you must have heard are data profiling, data mining, data analysis, business analysis, data science, and so on. These terms are closely related to each other, and this blog will help you understand what each of them means. Data analytics is booming nowadays. The types of analytics available and the skill sets and techniques a data analyst must have are discussed here. After covering the basics, this blog also presents 25 frequently asked interview questions, along with suitable answers, for aspiring data analysts.
A Glance at Analytics:
Analytics brings together theory and practice to identify and communicate data-driven insights that allow managers, stakeholders, and other executives in a business to make more informed decisions.
Experienced data analysts consider their work in a larger context, within their business and in light of a range of external factors. Analysts are also able to account for the competitive environment, internal and external business interests, and the absence of certain data sets in the data-based recommendations that they make to stakeholders.
Data Analysis vs. Data Science vs. Business Analysis
To understand the differences, let's look at them one after the other.
The data analyst works as a gatekeeper for an organization's data so stakeholders can understand the data and use it to make strategic business decisions. It is a technical position that requires an undergraduate or master's degree in analytics, computer modeling, science, or math.
The business analyst serves in a strategic position focused on using the information that a data analyst uncovers to identify problems and recommend solutions. These analysts typically hold a degree in a major such as business administration, economics, or finance.
The data scientist takes the data visualizations created by data analysts a step further, sifting through the data to identify weaknesses, trends, or opportunities for an organization. This position also requires a background in math or computer science, along with some study of or insight into human behavior to help make informed predictions.
I hope all three terms are clear now. Let's look at the data analyst role in detail.
Who is a Data Analyst?
Data analysts deliver value to their companies by taking information about specific topics and then interpreting, analyzing, and presenting their findings in comprehensive reports. As experts, data analysts are frequently called on to use their skills and tools to provide competitive analysis and identify trends within industries.
Types of Data Analytics:
- Descriptive analytics examines what happened in the past: monthly revenue, quarterly sales, yearly website traffic, and so on. These findings allow a business to spot trends.
- Diagnostic analytics considers why something happened by comparing descriptive data sets to identify dependencies and patterns. This helps a business determine the cause of a positive or negative outcome.
- Predictive analytics seeks to determine likely outcomes by detecting tendencies in descriptive and diagnostic analyses. This allows a business to take proactive action, for example reaching out to a customer who is unlikely to renew a contract.
- Prescriptive analytics attempts to identify what business action to take. While this type of analysis brings considerable value in the ability to deal with potential problems or stay ahead of industry trends, it often requires the use of complex algorithms and advanced technology such as machine learning.
Key requirements for becoming a data analyst:
- Be able to analyze, organize, collect and disseminate Big Data efficiently.
- Have substantial technical knowledge in fields like database design, data mining, and segmentation techniques.
- Have a sound knowledge of statistical packages for analyzing massive datasets such as SAS, Excel, and SPSS, to name a few.
Next, I will go through the questions most frequently asked in interviews.
1. What are the challenges that you face as a data analyst?
The biggest challenge is the data itself: if the data is insufficient or not in a proper format, the resulting decisions may not be sound. Sometimes the cleaning process itself also makes the data less usable.
2. What are the various steps in an analytics project?
Problem definition, data exploration, data preparation, modeling, validation, and implementation and tracking are the various steps in an analytics project.
3. What are the data validation methods used in data analytics?
- Form Level Validation – In this method, validation is performed once the user completes the form, before the information is saved.
- Search Criteria Validation – This type of validation matches what the user is looking for to a certain degree. It ensures that relevant results are actually returned.
- Field Level Validation – Validation is done in each field as the user enters the data, to avoid errors caused by human interaction.
- Data Saving Validation – This type of validation is performed during the saving process of the actual file or database record. It is typically done when there are multiple data entry forms.
4. What skills are good to have for an individual to be a value-added data analyst in an organization?
- Predictive analysis: This is the most important turning point in process improvement.
- Presentation skills: These are crucial for putting a face on the data analysis, which can be done using various reporting tools.
- Database knowledge: This is necessary since databases are used widely in a data analyst's daily operational tasks.
5. What qualities should a good data model have?
- It should be intuitive.
- The model should have predictable outcomes.
- Its information should be easy to consume.
- It should scale with changes in business requirements.
- It should be able to support new business cases as they are developed.
6. How frequently should you retrain a data model?
A good data analyst understands how changing business dynamics affect the efficiency of a predictive model. The structure of the data often remains unchanged even though the data itself changes daily. Nevertheless, I would refresh or retrain a model when the company enters a new market, completes an acquisition, or faces emerging competition. As a data analyst, I would retrain the model as quickly as possible to adjust to changes in customer behavior or market conditions.
7. Differentiate between univariate, bivariate and multivariate analysis.
These are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analysing sales volume against spending is bivariate analysis.
Analysis that studies more than two variables to understand their effect on the responses is referred to as multivariate analysis.
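To make the distinction concrete, here is a minimal sketch in plain Python. The sample figures and the helper names (`univariate_summary`, `pearson_correlation`, `correlation_matrix`) are made up for illustration; a pairwise correlation matrix is only one simple way of looking at more than two variables at once.

```python
from statistics import mean, stdev

# Hypothetical monthly figures, invented for this example.
data = {
    "spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sales": [10.0, 12.0, 15.0, 19.0, 24.0],
    "visits": [100.0, 90.0, 120.0, 150.0, 160.0],
}

def univariate_summary(values):
    """Univariate: describe a single variable on its own."""
    return {"mean": mean(values), "stdev": stdev(values)}

def pearson_correlation(xs, ys):
    """Bivariate: strength of the linear relationship between two variables."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def correlation_matrix(columns):
    """Multivariate (simplest form): correlations for every pair of variables."""
    return {
        (a, b): pearson_correlation(columns[a], columns[b])
        for a in columns
        for b in columns
    }

summary = univariate_summary(data["sales"])
spend_vs_sales = pearson_correlation(data["spend"], data["sales"])
matrix = correlation_matrix(data)
```

In an interview answer, the point to stress is the number of variables examined together, not the particular statistic used.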
8. Consider the old story of the boy who cried wolf, and summarize this “wolf-prediction” model using a 2×2 confusion matrix that depicts all four possible outcomes:
A shepherd boy gets bored tending the town’s flock. To have some fun, he cries out, “Wolf!” even though no wolf is in sight. The villagers run to protect the flock, but then get really mad when they realize the boy was playing a joke on them. [Iterate previous paragraph N times.]
One night, the shepherd boy sees a real wolf approaching the flock and calls out, “Wolf!” The villagers refuse to be fooled again and stay in their houses. The hungry wolf turns the flock into lamb chops. The town goes hungry. Panic ensues.
True Positive (TP):
- Reality: A wolf threatened.
- Shepherd said: “Wolf.”
- Outcome: Shepherd is a hero.
False Positive (FP):
- Reality: No wolf threatened.
- Shepherd said: “Wolf.”
- Outcome: Villagers are angry at shepherd for waking them up.
False Negative (FN):
- Reality: A wolf threatened.
- Shepherd said: “No wolf.”
- Outcome: The wolf ate all the sheep.
True Negative (TN):
- Reality: No wolf threatened.
- Shepherd said: “No wolf.”
- Outcome: Everyone is fine.
9. What is the difference between true positive rate and recall?
There is no difference; they are two names for the same quantity, given by the formula:
(true positives) / (true positives + false negatives)
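As a rough illustration of the wolf story and the recall formula, the sketch below counts the four outcomes from two lists of booleans. The nightly observations and the function names are hypothetical and were invented for this example.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for boolean labels (True means 'wolf')."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for real, said in zip(actual, predicted):
        if real and said:
            counts["TP"] += 1
        elif said:
            counts["FP"] += 1   # cried wolf, but there was no wolf
        elif real:
            counts["FN"] += 1   # stayed silent, but there was a wolf
        else:
            counts["TN"] += 1
    return counts

def recall(counts):
    """Recall (true positive rate) = TP / (TP + FN)."""
    denominator = counts["TP"] + counts["FN"]
    return counts["TP"] / denominator if denominator else 0.0

# Hypothetical record of five nights.
wolf_present = [False, False, False, True, True]
boy_cried = [True, True, False, True, False]

counts = confusion_counts(wolf_present, boy_cried)
wolf_recall = recall(counts)
```

With this made-up data the boy raises two false alarms, catches one real wolf and misses one, so recall comes out at 0.5.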
10. What are some best practices for data cleaning?
- Sort the data by different attributes.
- For huge datasets, clean the data step by step and improve it with each step until you reach good data quality.
- For huge datasets, split them into small chunks. Working with less data will increase your iteration speed.
- Create a set of utility functions/tools/scripts to handle frequent cleansing tasks.
- For data cleanliness issues, arrange them by estimated frequency and attack the most frequent problems first.
- Examine the summary statistics for each column (standard deviation, mean, number of missing values).
- Keep track of every data cleaning operation, so you can alter changes or remove operations if required.
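One possible way to combine reusable cleaning utilities with a record of every operation is sketched below. The step functions, the `run_cleaning` helper and the sample rows are all assumptions made for illustration, not a standard API.

```python
def strip_whitespace(rows):
    """Trim leading and trailing spaces from every string value."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]

def drop_rows_missing(field):
    """Build a step that removes rows where `field` is empty or None."""
    def step(rows):
        return [row for row in rows if row.get(field) not in (None, "")]
    step.__name__ = f"drop_rows_missing_{field}"
    return step

def run_cleaning(rows, steps):
    """Apply each step in order and log (step name, rows before, rows after)."""
    log = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        log.append((step.__name__, before, len(rows)))
    return rows, log

# Hypothetical raw records.
raw = [
    {"name": " Asha ", "city": "Pune"},
    {"name": "", "city": "Mumbai "},
    {"name": "Ravi", "city": None},
]

cleaned, log = run_cleaning(raw, [strip_whitespace, drop_rows_missing("name")])
```

Because every step is a plain function and the log records what each one did, a step can be removed or reordered later without guessing at its effect.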
11. What is data profiling?
Data profiling is the process of validating or examining the data that is already present in an existing data source, which can be a database or a file. Its main purpose is to understand the data and decide whether it is ready to be used for other purposes.
12. What is the difference between data profiling and data mining?
Data profiling focuses on analyzing individual attributes of the data, providing valuable information on each attribute such as data type, frequency, and length, along with its discrete values and value ranges. In contrast, data mining aims to identify unusual records, analyze data clusters, and discover patterns, to name a few.
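To show what profiling a single attribute might look like, here is a minimal sketch that reports the properties mentioned above (type, frequency, length, distinct values and range). The `profile_column` name and the sample values are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Summarize one attribute; None is treated as a missing value."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1)[0] if present else None,
    }
    # Value range only makes sense for numeric attributes.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile["min"], profile["max"] = min(present), max(present)
    # Length only makes sense for text attributes.
    if present and all(isinstance(v, str) for v in present):
        profile["min_length"] = min(len(v) for v in present)
        profile["max_length"] = max(len(v) for v in present)
    return profile

# Hypothetical columns.
quantity_profile = profile_column([3, 7, None, 7])
state_profile = profile_column(["MH", "GJ", "MH"])
```

Data mining would start where this ends: instead of describing each column separately, it would look for patterns across records.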
13. What is the difference between Data Analysis and Data Mining?
In data analysis, analysts must formulate their own equations based on a hypothesis, whereas in data mining, algorithms develop these equations automatically. The data analysis process begins with a hypothesis, but data mining does not.
14. Name the best tools used for data analysis.
The most useful tools for data analysis are:
• Google Fusion Tables
• Google Search Operators
15. What are various steps involved in an analytics project?
• Understand the business problem statement.
• Explore the data and become familiar with it.
• Prepare the data for modeling by pre-processing it, for example: detecting outliers, treating missing values, transforming variables, etc.
• After data preparation, run the model, analyze the outcome, and tweak the approach. This step is repeated until the best possible outcome is achieved.
• Validate the model using a new data set.
• Start implementing the model and track the results to analyze the performance of the model over time.
16. Define “Outlier”.
An outlier is a term commonly used by data analysts for a value that appears to be far removed from, and inconsistent with, the overall pattern in a sample. There are two kinds of outliers: univariate and multivariate.
17. What are the two main methods to detect outliers?
1. Box plot method: if a value is more than 1.5*IQR (interquartile range) above the upper quartile (Q3) or below the lower quartile (Q1), it is considered an outlier.
2. Standard deviation method: if a value is higher or lower than mean ± (3*standard deviation), it is considered an outlier.
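Both rules can be sketched with Python's standard `statistics` module. The sample data and function names are made up; note also that quartile conventions differ between tools, so the exact cut-offs from `statistics.quantiles` may not match a given spreadsheet or package.

```python
from statistics import mean, quantiles, stdev

def iqr_outliers(values, factor=1.5):
    """Box plot rule: flag values beyond factor * IQR outside Q1 and Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

def stdev_outliers(values, k=3):
    """Standard deviation rule: flag values more than k deviations from the mean."""
    centre, spread = mean(values), stdev(values)
    return [v for v in values if abs(v - centre) > k * spread]

# Hypothetical measurements: twenty ordinary readings and one extreme value.
readings = [10, 11, 12, 13] * 5 + [95]

by_iqr = iqr_outliers(readings)
by_stdev = stdev_outliers(readings)
```

One caveat worth mentioning in an interview: the standard deviation rule is itself distorted by extreme values, so in very small samples it can fail to flag anything, whereas the box plot rule relies on quartiles and is less affected.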
18. Give some situations where you will use an SVM over a Random Forest Machine Learning algorithm and vice-versa.
SVM and Random Forest are both used for classification problems.
a) SVM is a good choice when the data is clean and free of outliers. Random Forest can be used if your data might contain outliers.
b) Random Forest generally needs less memory and computation than SVM, so it is the safer choice when resources are constrained.
c) Random Forest gives you a very good idea of variable importance in your data, so if you need variable importance, choose Random Forest.
d) Multiclass problems are preferably handled by Random Forest, whereas SVM is preferred for high-dimensional problems such as text classification.
19. What is the difference between linear regression and logistic regression?
Linear regression:
• It requires the dependent variable to be continuous.
• It is based on least-squares estimation.
• It requires at least 5 cases per independent variable.
• It aims to find the best-fitting straight line, where the distances between the points and the regression line are the errors.
Logistic regression:
• It can have dependent variables with more than two categories.
• It is based on maximum likelihood estimation.
• It requires at least 10 events per independent variable.
• It is typically used to predict a binary outcome, so the resulting graph is an S-shaped curve.
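The difference in shape can be illustrated with a short sketch: a least-squares line fitted in closed form, next to the logistic (sigmoid) function that produces the S-curve. The function names and data points are hypothetical, and the logistic coefficients are simply passed in rather than estimated, since maximum likelihood fitting needs an iterative solver.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

def linear_prediction(x, slope, intercept):
    """Linear regression output: an unbounded continuous value."""
    return slope * x + intercept

def logistic_prediction(x, slope, intercept):
    """Logistic regression output: a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))

# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# The sigmoid rises smoothly from near 0 to near 1 as x grows.
curve = [logistic_prediction(x, 1.0, 0.0) for x in range(-6, 7)]
```

The linear prediction keeps growing with x, while the logistic output flattens out at both ends, which is why it suits yes/no outcomes.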
20. What is the difference between R-squared and adjusted R-squared?
R-squared measures the proportion of variation in the dependent variable that is explained by the independent variables.
Adjusted R-squared measures the proportion of variation explained by only those independent variables that actually affect the dependent variable.
R-squared never decreases when a column is added to the training data set, even if that column does not affect y. Adjusted R-squared behaves differently: when we add an insignificant variable that does not affect y, adjusted R-squared decreases.
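The penalty for extra predictors is easy to see in the usual formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. Below is a small sketch with hypothetical numbers; the function names are my own.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_y = sum(actual) / len(actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_residual / ss_total

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical model: R-squared of 0.90 on 21 observations.
with_two_predictors = adjusted_r_squared(0.90, 21, 2)
# Adding a third predictor that leaves R-squared unchanged lowers the adjusted value.
with_three_predictors = adjusted_r_squared(0.90, 21, 3)
```

So if a new column barely moves R-squared, adjusted R-squared goes down, signalling that the variable is not pulling its weight.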
21. What should you do with missing or suspected data?
In such a case, a data analyst needs to:
• Use strategies such as single imputation methods, the deletion method, and model-based methods to handle missing data.
• Prepare a validation report containing all information about the suspected or missing data.
• Examine the suspicious data to assess its validity.
• Replace all invalid data (if any) with a proper validation code.
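Two of the strategies named above, deletion and single imputation, can be sketched in a few lines. The records and function names are hypothetical, and mean imputation is only one of several single imputation methods.

```python
from statistics import mean

def listwise_deletion(rows, field):
    """Deletion method: drop every row where `field` is missing (None)."""
    return [row for row in rows if row[field] is not None]

def mean_imputation(rows, field):
    """Single imputation: replace missing values with the mean of observed ones."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill_value = mean(observed)
    return [
        {**row, field: fill_value if row[field] is None else row[field]}
        for row in rows
    ]

# Hypothetical customer records with one missing age.
customers = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
]

after_deletion = listwise_deletion(customers, "age")
after_imputation = mean_imputation(customers, "age")
```

Deletion keeps only fully observed rows at the cost of sample size, while imputation keeps every row but can understate the variability of the data.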
22. Name the data validation methods used by data analysts.
• Data screening – Inspecting the data for any possible errors and removing them before conducting data analysis.
• Data verification – After data migration is complete, data verification is performed to ensure the accuracy of the data and remove any inconsistencies.
23. What is “Clustering?” Name the properties of clustering algorithms.
Clustering is a technique in which data is divided into clusters or groups. A clustering algorithm has the following properties:
• Hard and soft
• Hierarchical or flat
24. What is K-means clustering?
K-means is a partitioning technique in which objects are categorized into K groups. In this algorithm, the clusters are assumed to be spherical, with the data points centered around each cluster, and the variances of the clusters are similar to one another.
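For readers who want to see the partitioning idea in action, here is a bare-bones sketch of the standard iterative (Lloyd's) procedure on 2-D points. The points and starting centroids are hypothetical, and the starting centroids are supplied by hand to keep the run deterministic; real libraries choose them more carefully.

```python
def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move centroids to cluster means."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
                for c in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            (sum(p[0] for p in members) / len(members),
             sum(p[1] for p in members) / len(members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical points forming two obvious groups, with K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(points, [(0, 0), (10, 10)])
```

Each centroid ends up at the average position of its group, which is why the method works best when the clusters are roughly round and similar in spread.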
25. Name the statistical methods that are highly beneficial for data analysts.
The statistical methods most commonly used by data analysts are:
• Simplex algorithm
• Bayesian method
• Markov process
• Rank statistics, percentiles, outlier detection
• Mathematical optimization
• Spatial and cluster processes
Name – Ruby Jain
Designation – Data Analytics Trainer
© Copyright 2019 | Sevenmentor Pvt Ltd.