Data Analysis

What are the various steps involved in any analytics project?
The following steps are involved in a typical analytics project (a minimal code sketch follows the list):
a) Understanding the problem- Recognise the issue facing the company, specify its objectives, and plan a successful resolution.
b) Gathering data- Collect the pertinent data from a variety of sources, depending on your priorities.
c) Cleaning data- Make the data suitable for analysis by removing unnecessary, redundant, and missing values.
d) Data exploration and analysis- Evaluate the data using data mining methods, business intelligence tools, and predictive modelling methodologies.
e) Interpreting the results- Analyse the results to uncover underlying trends and patterns and obtain new perspectives.
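
To make the flow concrete, here is a minimal pandas sketch of steps b) to e); the file name sales.csv and the columns region and units_sold are hypothetical placeholders, not part of the original answer.

```python
import pandas as pd

# b) Gather: load data from a source (hypothetical file and columns)
df = pd.read_csv("sales.csv")

# c) Clean: drop duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()

# d) Explore and analyse: summary statistics and a simple aggregation
print(df.describe())
print(df.groupby("region")["units_sold"].sum())

# e) Interpret: flag regions selling above the overall average
avg = df["units_sold"].mean()
print(df.groupby("region")["units_sold"].mean().gt(avg))
```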

What are the common problems that data analysts encounter during analysis?
Data analysts commonly run into the following problems during analysis:
-Dealing with duplicate data
-Gathering valuable data at the appropriate time and place
-Addressing storage and data erasure issues
-Securing data and addressing compliance challenges

What is the significance of Exploratory Data Analysis (EDA)?
-EDA (exploratory data analysis) aids in better understanding the data.
-It helps build confidence in your data, to the point where you are ready to apply a machine learning algorithm.
-It lets you refine the selection of feature variables that will later be used in your model.
-It can reveal hidden trends and insights in the data (see the sketch after this list).
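
As a rough illustration, the sketch below runs a few standard EDA calls in pandas on a toy dataset invented for the example.

```python
import pandas as pd

# Toy dataset invented for illustration
df = pd.DataFrame({
    "height_cm": [198, 206, 191, 185, 213, 201],
    "position": ["G", "C", "F", "G", "C", "F"],
})

df.info()                              # column types and non-null counts
print(df.describe())                   # central tendency and dispersion
print(df["position"].value_counts())   # frequency distribution
print(df.corr(numeric_only=True))      # pairwise correlations
```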

Explain descriptive, predictive, and prescriptive analytics.

Descriptive analytics:
-Provides insights into the past to answer "what has happened".
-Uses data aggregation and data mining techniques.
-Example: an ice cream company can analyse how much ice cream was sold, which flavours were sold, and whether more or less ice cream was sold than the day before.

Predictive analytics:
-Understands the future to answer "what could happen".
-Uses statistical models and forecasting techniques.
-Example: an ice cream company can forecast how much ice cream is likely to be sold tomorrow based on past sales and the weather.

Prescriptive analytics:
-Suggests various courses of action to answer "what should you do".
-Uses simulation algorithms and optimisation techniques to advise possible outcomes.
-Example: lower prices to increase the sale of ice creams, or produce more/fewer quantities of a specific flavour of ice cream.

What are the different types of sampling techniques used by data analysts?
There are five main types of sampling methods (three are sketched in code after the list):
-Simple random sampling
-Systematic sampling
-Cluster sampling
-Stratified sampling
-Judgmental or purposive sampling
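
Three of these methods can be shown in a few lines of pandas; the DataFrame and its stratum column below are invented for illustration.

```python
import pandas as pd

# Invented population: 100 records split into two strata
df = pd.DataFrame({
    "value": range(100),
    "stratum": ["A"] * 70 + ["B"] * 30,
})

# Simple random sampling: every record has an equal chance of selection
simple = df.sample(n=10, random_state=42)

# Systematic sampling: take every k-th record
k = 10
systematic = df.iloc[::k]

# Stratified sampling: sample proportionally within each stratum
stratified = df.groupby("stratum").sample(frac=0.1, random_state=42)

print(len(simple), len(systematic), len(stratified))
```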

Describe univariate, bivariate, and multivariate analysis.
Univariate analysis is the simplest and easiest form of data analysis where the data being analysed contains only one variable. Example – Studying the heights of players in the NBA.
Univariate analysis can be described using Central Tendency, Dispersion, Quartiles, Bar charts, Histograms, Pie charts, and Frequency distribution tables.
Bivariate analysis involves the analysis of two variables to find causes, relationships, and correlations between the variables. Example – analysing the sale of ice creams based on the temperature outside.
Bivariate analysis can be performed using Correlation coefficients, Linear regression, Logistic regression, Scatter plots, and Box plots.
Multivariate analysis involves the analysis of three or more variables to understand the relationship of each variable with the other variables. Example – analysing revenue based on expenditure.
Multivariate analysis can be performed using Multiple regression, Factor analysis, Classification & regression trees, Cluster analysis, Principal component analysis, Dual-axis charts, etc.
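
A small illustrative sketch of all three levels, using invented temperature, price, and sales figures:

```python
import numpy as np
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({
    "temperature": [20, 25, 30, 35, 40],
    "price":       [3.0, 3.0, 2.5, 2.5, 2.0],
    "sales":       [100, 150, 230, 310, 420],
})

# Univariate: describe a single variable
print(df["sales"].describe())

# Bivariate: correlation between two variables
print(df["temperature"].corr(df["sales"]))

# Multivariate: multiple regression of sales on temperature and price
X = np.column_stack([np.ones(len(df)), df["temperature"], df["price"]])
coeffs, *_ = np.linalg.lstsq(X, df["sales"], rcond=None)
print(coeffs)  # intercept and slopes
```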

How should missing values be handled in a dataset?
Listwise Deletion- Excludes the entire record from the analysis if even a single value is missing.
Average Imputation- Fills in the missing value with the average of the other participants' responses.
Regression Substitution- Uses multiple-regression analysis to estimate the missing value from the other variables.
Multiple Imputation- Creates plausible values for the missing data based on the correlations in the observed data, adds random error to reflect uncertainty, and averages the results across several simulated datasets (see the sketch below).
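
As a rough sketch, two of these strategies are shown below with scikit-learn on a tiny invented array; IterativeImputer is used here as one common regression-style imputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Invented numeric data with missing entries
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Average imputation: replace NaNs with the column mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# Regression-based (iterative) imputation: model each feature with
# missing values as a function of the other features
print(IterativeImputer(random_state=0).fit_transform(X))
```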

How is Overfitting different from Underfitting?

Overfitting:
-The model fits the training data very well.
-Performance drops considerably on the test set.
-Happens when the model learns the random fluctuations and noise in the training dataset in detail.

Underfitting:
-The model neither fits the training data well nor generalises to new data.
-Performs poorly on both the training and the test set.
-Happens when there is too little data to build an accurate model, or when a linear model is fitted to non-linear data.
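
The contrast can be demonstrated by fitting polynomials of increasing degree to noisy non-linear data, as in this sketch (all values invented): the low-degree fit underfits, the very high-degree fit overfits.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
x_train, y_train = x[::2], y[::2]     # half for training
x_test, y_test = x[1::2], y[1::2]     # half for testing

for degree in (1, 4, 12):  # underfit, reasonable, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
```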

How do you treat outliers in a dataset?
The following four techniques can be used to deal with outliers (a detection-and-capping sketch follows the list):
-Delete the outlier records.
-Assign a new value to the outlier observations.
-Try a different transformation of the data.
-Cap the outlier data.
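
A minimal sketch of detecting outliers with the common 1.5 × IQR rule of thumb and then capping them, on invented data:

```python
import pandas as pd

# Invented data containing two obvious outliers
s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10, -40, 12])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])  # detect the outliers
capped = s.clip(lower, upper)        # cap (winsorise) them
print(capped)
```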

What is Data Analysis?
The process of studying, modelling, and interpreting data to derive insights or conclusions is known as data analysis. The insights gathered support decision-making. Every business makes use of it, which explains why data analysts are in high demand. A data analyst's core duty is to work with enormous amounts of data and look for undiscovered insights. By analysing a variety of data, data analysts help organisations understand the current condition of their businesses.

Tell some key skills usually required for a data analyst.
-Knowledge of reporting tools (such as Business Objects), scripting and data-handling technologies (such as XML, JavaScript, and ETL tools), and databases (such as SQL, SQLite, etc.) is essential.
-The capacity to accurately and effectively acquire, organise, and communicate large volumes of data.
-The capacity to design databases, build data models, carry out data mining, and segment data.
-A working knowledge of statistical packages for analysing massive datasets (SAS, SPSS, Microsoft Excel, etc.).
-Teamwork, effective problem-solving, and verbal and written communication abilities.
-Proficiency in drafting reports, presentations, and queries.
-Knowledge of data visualisation software such as Tableau and Qlik.
-The capacity to design and apply the most accurate algorithms to datasets in order to find answers.

What do you mean by data visualisation?
A graphical depiction of information and data is referred to as data visualisation. By using visual elements like charts, graphs, and maps, data visualisation tools let users quickly identify trends, outliers, and patterns in data. This makes it possible to examine and process data more intelligently and to present it as diagrams and charts.
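
A minimal matplotlib sketch, with invented monthly sales figures:

```python
import matplotlib.pyplot as plt

# Invented monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 160, 150]

plt.plot(months, sales, marker="o")  # trend over time
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```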

Write the difference between variance and covariance.
Variance: In statistics, variance measures the spread of a data set around its mean (average) value. When the variance is larger, the values in the data set are farther from the mean; when it is smaller, they are closer to the mean.
Covariance: Covariance is another common concept in statistics, like variance. It measures how two random variables change together.
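
Both quantities can be computed directly with NumPy; the temperature and sales figures below are invented:

```python
import numpy as np

temp = np.array([20, 25, 30, 35, 40])
sales = np.array([100, 150, 230, 310, 420])

print(np.var(temp, ddof=1))       # sample variance of one variable
print(np.cov(temp, sales)[0, 1])  # covariance between the two variables
```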

How often should a data model be retrained?
A good data analyst understands changing market dynamics and retrains a working data model accordingly, so that it adjusts to the new environment.

Explain the essential steps in the data validation process.
Data screening and data verification are the two processes that make up the data validation process.
Data screening: Various algorithms are employed at this stage to screen the entire data set for incorrect values. It involves making sure the data is clean and ready for analysis.
Data verification: Before the data is used, the correctness and quality of the source data are examined. Every suspect value is assessed against a number of use cases before a final judgement is made on whether it should be included in the data. Data validation is also part of data cleansing.
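
As a toy illustration of the screening step, the sketch below flags values that fall outside a plausible range; the age column and the 0 to 120 rule are invented for the example.

```python
import pandas as pd

# Invented data with two implausible ages
df = pd.DataFrame({"age": [34, 29, -5, 41, 280, 53]})

# Screen: flag values outside the plausible range
suspect = df[(df["age"] < 0) | (df["age"] > 120)]
print(suspect)  # these rows would then be verified against use cases
```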

Mention some problems that data analysts face while performing analysis.
When conducting data analysis, data analysts encounter the following issues:
-Data that is inconsistent and lacking
-Spelling errors and duplicate entries
-A data file with poor formatting
-Inaccurate data classification and various value representations
-Conflicting data

What is imputation?
Imputation is the process of replacing the missing data with substituted values.
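
The simplest illustration is replacing NaNs with the column median in pandas (data invented):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0])
print(s.fillna(s.median()))  # NaN replaced with the median, 3.0
```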

What is the K-means algorithm?
The K-means algorithm partitions a data set into clusters such that each cluster formed is homogeneous and the points within a cluster are close to each other. The algorithm tries to maintain sufficient separation between the clusters. Because the algorithm is unsupervised, the clusters carry no labels.
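
A minimal scikit-learn sketch on invented 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented points forming two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # centroid of each cluster
```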

What are the most popular statistical methods used when analysing data?
The most popular statistical methods used in data analytics are –
-Linear Regression
-Classification
-Resampling Methods
-Subset Selection
-Dimension Reduction
-Nonlinear Models
-Tree-Based Methods
-Support Vector Machines
-Unsupervised Learning

What is the difference between factor analysis and principal component analysis?
The aim of principal component analysis is to explain as much of the total variance in the variables as possible, while the aim of factor analysis is to explain the covariances (the shared variance) between variables via latent factors.
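
Both techniques can be run side by side in scikit-learn; the data below is synthetic, generated from a single latent factor plus noise, purely to show the difference in what each method models.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

# Synthetic data: one latent factor drives four observed variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))
X = latent @ rng.normal(size=(1, 4)) + rng.normal(0, 0.3, size=(100, 4))

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)  # share of total variance explained

fa = FactorAnalysis(n_components=1).fit(X)
print(fa.components_)      # factor loadings modelling the covariances
print(fa.noise_variance_)  # variable-specific (unique) variance
```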