WHAT IS R PROGRAMMING?
R is a language and environment for statistical computing and graphics. It provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. It is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
DIFFERENCE BETWEEN VECTOR, LIST, MATRIX AND DATAFRAME.
A vector is a sequence of data elements of the same basic type. The members of a vector are known as components.
A list is an R object that can contain elements of different types, such as numbers, strings, vectors, or even another list.
A matrix is a two-dimensional data structure, typically built by binding vectors of the same length. All elements of a matrix have the same type.
A data frame is a more general form of a matrix: it is a list of equal-length columns, and different columns can hold different data types.
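As a minimal sketch of the four structures described above (all object names here are illustrative only), the following R snippet creates one of each and shows how a vector coerces mixed values to a single type:

```r
# A vector holds elements of one basic type.
v <- c(1.5, 2, 3.25)

# Mixing types in a vector coerces everything to a common type (character here).
mixed <- c(1, "a")

# A list can hold elements of different types, including other vectors.
l <- list(1, "a", c(TRUE, FALSE))

# A matrix is two-dimensional and holds a single type.
m <- matrix(1:6, nrow = 2)

# A data frame has equal-length columns that may differ in type.
df <- data.frame(id = 1:3, name = c("x", "y", "z"), stringsAsFactors = FALSE)

str(df)
```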
GIVE ANY 5 FEATURES OF R.
5 features of R are:
a) It is a simple and effective programming language.
b) It is data analysis software.
c) It provides effective data handling and storage facilities.
d) It provides highly extensible graphical techniques.
e) It is an interpreted language.
WHAT ARE THE ADVANTAGES AND DISADVANTAGES OF R?
Advantages of R are:
a) Open Source
b) Data Wrangling
c) Array of Packages
d) Platform Independent
e) Machine Learning Operations
Disadvantages of R are:
a) Weak origin
b) Data Handling
c) Basic Security
d) Complicated Language
e) Lesser Speed
WHAT ARE THE STEPS TO BUILD AND EVALUATE A LINEAR REGRESSION MODEL IN R?
When creating a linear regression model, the following successive actions must be taken:
In order to develop the model on the train set and assess its performance on the test set, you must first divide the data into train and test sets.
This can be done with the sample.split() function from the “caTools” package. The function has a SplitRatio argument that you can adjust to your requirements.
You can now proceed to building the model on the training set once you have finished dividing the data into the training and test sets.
The model is built with the “lm()” function.
Next, you can predict the values on the test set using the “predict()” function.
The final step is to compute the RMSE (root mean squared error); the lower the RMSE, the better the predictions.
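The steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in mtcars dataset, an arbitrary 70/30 split, an arbitrary choice of predictors (wt and hp), and base R's sample() for the split instead of a package helper.

```r
set.seed(42)  # make the random split reproducible

# Step 1: split the data into train and test sets (70/30 is an assumption).
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Step 2: build the model on the training set.
model <- lm(mpg ~ wt + hp, data = train)

# Step 3: predict the values on the test set.
pred <- predict(model, newdata = test)

# Step 4: compute the RMSE; lower values mean better predictions.
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```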
WHAT IS THE CONFUSION MATRIX IN R?
The accuracy of a classification model can be assessed using a confusion matrix, which is a cross-tabulation of the observed and predicted classes. The “confusionMatrix()” function from the “caret” package can be used to compute it.
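To show what such a cross-tabulation looks like without installing any package, here is a small sketch that uses only base R's table(). The observed and predicted labels are made-up illustrative data, not output from a real model.

```r
# Made-up observed and predicted class labels for eight cases.
observed <- factor(c("yes", "no", "yes", "yes", "no", "no", "yes", "no"),
                   levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "yes", "no", "yes", "yes", "no"),
                    levels = c("no", "yes"))

# Cross-tabulate predicted against observed classes.
cm <- table(Predicted = predicted, Observed = observed)
print(cm)

# Accuracy is the share of cases on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 0.75
```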
HOW WOULD YOU WRITE A CUSTOM FUNCTION IN R? GIVE AN EXAMPLE.
This is the syntax to write a custom function in R:
<object-name> <- function(x) {
  # function body
}
Let’s look at an example to create a custom function in R ->
fun1<-function(x){ ifelse(x>5,100,0) }
z<-c(1,2,3,4,5,6,7,8,9,10)
fun1(z)->z
WHAT PACKAGES ARE USED FOR DATA MINING IN R?
Some packages used for data mining in R:
data.table- provides fast reading of large files.
rpart and caret- for machine learning models.
ggplot2- provides various data visualisation plots.
tm- to perform text mining.
forecast- provides functions for time series analysis.
HOW WOULD YOU MAKE MULTIPLE PLOTS ONTO A SINGLE PAGE IN R?
Plotting multiple plots onto a single page using base graphics is quite easy:
For example, if you want to plot 4 graphs on the same page in a 2-by-2 grid, you can use the command below:
par(mfrow=c(2,2))
GIVEN A VECTOR OF VALUES, HOW WOULD YOU CONVERT IT INTO A TIME SERIES OBJECT?
Let’s say this is our vector->
a<-c(1,3,5,7,9)
To convert this into a time series object->
as.ts(a)->a
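If the observations have a known start date and frequency, ts() lets you state them explicitly. The sketch below assumes, purely for illustration, that the five values are monthly observations starting in January 2020.

```r
a <- c(1, 3, 5, 7, 9)

# Assumed for illustration: monthly data beginning in January 2020.
a_ts <- ts(a, start = c(2020, 1), frequency = 12)

print(a_ts)
is.ts(a_ts)
```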
WHAT IS A WHITE NOISE MODEL AND HOW CAN YOU SIMULATE IT USING R?
A fundamental time series model is the white noise (WN) model. It is the simplest example of a stationary process.
A white noise model has:
a) a fixed constant mean
b) a fixed constant variance
c) no correlation over time
Simulating a white noise model in R:
arima.sim(model=list(order=c(0,0,0)),n=50)->wn
ts.plot(wn)
WHAT IS A RANDOM WALK MODEL AND HOW CAN YOU SIMULATE IT USING R?
A random walk is a simple example of a non-stationary process.
A random walk has:
a) No specified mean or variance
b) Strong dependence over time
c) Its changes or increments are white noise
Simulating random walk in R:
arima.sim(model=list(order=c(0,1,0)),n=50)->rw
ts.plot(rw)
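The link between the two models can be checked by hand. As a sketch that builds the walk manually instead of calling arima.sim(), and assuming standard normal increments, a random walk is the cumulative sum of white noise, and differencing it recovers the noise:

```r
set.seed(1)

# White noise: independent draws with fixed mean and variance.
wn <- rnorm(50)

# A random walk is the running total of the white noise increments.
rw <- cumsum(wn)

# Differencing the walk gives back the increments (all but the first value).
increments <- diff(rw)
all.equal(increments, wn[-1])
```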
GIVE THE COMMAND TO CREATE A HISTOGRAM AND TO REMOVE A VECTOR FROM THE R WORKSPACE.
hist() is the command to create a histogram, where you can specify the details by typing hist(v,main,xlab,xlim,ylim,breaks,col,border).
– v is a vector containing numeric values used in histogram.
– main indicates the title of the chart.
– col is used to set the color of the bars.
– border is used to set the border color of each bar.
– xlab is used to give a description of x-axis.
– xlim is used to specify the range of values on the x-axis.
– ylim is used to specify the range of values on the y-axis.
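– breaks is used to specify the breakpoints between bars (or the approximate number of bars), which determines the width of each bar.
rm() is used to remove a vector, or any other object, from the R workspace.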
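A short sketch combining both commands is given below. The data values, title, labels and colours are arbitrary choices for illustration; explicit breakpoints are used so that the bar boundaries are unambiguous.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Draw the histogram with explicit breakpoints every 10 units.
h <- hist(v,
          main = "Example histogram",
          xlab = "Value",
          xlim = c(0, 50),
          ylim = c(0, 5),
          breaks = seq(0, 50, by = 10),
          col = "lightblue",
          border = "black")

# hist() also returns the bin counts invisibly.
print(h$counts)  # 2 3 2 3 1

# Remove the vector from the workspace.
rm(v)
exists("v", inherits = FALSE)  # FALSE
```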
WHY DO WE USE APPLY() FUNCTION IN R?
apply() is used to apply the same function over the rows or the columns of a matrix or array. For example, it can be used to find the mean of every row.
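As a brief illustration with a made-up 2-by-3 matrix, the second argument of apply() selects the margin: 1 for rows and 2 for columns.

```r
m <- matrix(1:6, nrow = 2)  # rows are (1, 3, 5) and (2, 4, 6)

# Margin 1: apply mean() to each row.
row_means <- apply(m, 1, mean)
print(row_means)  # 3 4

# Margin 2: apply sum() to each column.
col_sums <- apply(m, 2, sum)
print(col_sums)   # 3 7 11
```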
HOW DO YOU CREATE A VECTOR IN R?
To create a vector in R, use the c() function to combine the values and the <- operator to assign the result to a name. For example, to store the values 4, 5, 8 and 14 as a vector in x, type the command: x<-c(4,5,8,14)
EXPLAIN THE DIFFERENT FUNCTIONS THAT CAN BE APPLIED FOR NORMAL DISTRIBUTION IN R.
The different functions that can be applied for normal distribution in R are as follows:
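a) dnorm(x, mean, sd) gives the height of the probability density at x.
b) pnorm(x, mean, sd) gives the cumulative probability P(X ≤ x).
c) qnorm(p, mean, sd) gives the quantile whose cumulative probability is p.
d) rnorm(n, mean, sd) generates n random values from the distribution.
Following is the description of the parameters used in the above functions:
a) x is a vector of numbers.
b) p is a vector of probabilities.
c) n is the number of observations (sample size).
d) mean is the mean of the distribution. Its default value is 0.
e) sd is the standard deviation of the distribution. Its default value is 1.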
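The four functions can be tried out as follows. The inputs (0, 1.96, 0.975, and a mean of 10 with standard deviation 2) are arbitrary illustrative choices.

```r
# Density of the standard normal at 0, which equals 1 / sqrt(2 * pi).
d <- dnorm(0)

# Cumulative probability P(X <= 1.96) for the standard normal, about 0.975.
p <- pnorm(1.96)

# Quantile with cumulative probability 0.975, about 1.96 (the inverse of pnorm).
q <- qnorm(0.975)

# Five random draws from a normal distribution with mean 10 and sd 2.
set.seed(123)
r <- rnorm(5, mean = 10, sd = 2)

print(c(d, p, q))
print(r)
```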
EXPLAIN THE DIFFERENT FUNCTIONS THAT CAN BE APPLIED FOR BINOMIAL DISTRIBUTION IN R.
The different functions that can be applied for Binomial distribution in R are as follows:
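a) dbinom(x, size, prob) gives the probability of exactly x successes.
b) pbinom(x, size, prob) gives the cumulative probability of x or fewer successes.
c) qbinom(p, size, prob) gives the smallest number of successes whose cumulative probability is at least p.
d) rbinom(n, size, prob) generates n random counts of successes.
Following is the description of the parameters used:
a) x is a vector of numbers.
b) p is a vector of probabilities.
c) n is the number of observations.
d) size is the number of trials.
e) prob is the probability of success of each trial.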
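As a worked illustration, take 10 tosses of a fair coin (an arbitrary choice of size = 10 and prob = 0.5):

```r
# Probability of exactly 3 heads: choose(10, 3) / 2^10.
d <- dbinom(3, size = 10, prob = 0.5)
print(d)  # 0.1171875

# Probability of 3 or fewer heads: (1 + 10 + 45 + 120) / 2^10.
p <- pbinom(3, size = 10, prob = 0.5)
print(p)  # 0.171875

# Smallest number of heads whose cumulative probability reaches 0.5.
q <- qbinom(0.5, size = 10, prob = 0.5)
print(q)  # 5

# Four simulated experiments, each counting heads in 10 tosses.
set.seed(1)
r <- rbinom(4, size = 10, prob = 0.5)
print(r)
```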
WHAT IS THE MAIN DIFFERENCE BETWEEN AN ARRAY AND A MATRIX?
A matrix is always two-dimensional, as it has only rows and columns. An array can have any number of dimensions, and a matrix is simply a two-dimensional array. For example, a 3×3×2 array can be viewed as 2 matrices, each of dimension 3×3.
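The three-dimensional example can be reproduced directly. In this sketch the values 1 to 18 are just filler:

```r
# A 3 x 3 x 2 array filled with the numbers 1 to 18.
arr <- array(1:18, dim = c(3, 3, 2))
dim(arr)

# Fixing the third index extracts one 3 x 3 matrix.
second <- arr[, , 2]
is.matrix(second)
dim(second)

# The second slice starts where the first nine values end.
arr[1, 1, 2]  # 10
```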
HOW CAN YOU LOAD AND USE A CSV FILE IN R?
A CSV file can be loaded using the read.csv function. R creates a data frame on reading the CSV files using this function.
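A self-contained sketch is shown below. Because no real file is available here, it first writes a small made-up table to a temporary CSV file and then reads it back; the column names id and score are illustrative only.

```r
# Create a small CSV file in a temporary location (illustrative data).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(90.5, 82, 77.25)),
          tmp, row.names = FALSE)

# Load the CSV file; the result is a data frame.
df <- read.csv(tmp)
class(df)

# Use the loaded data like any other data frame.
mean(df$score)  # 83.25

# Clean up the temporary file.
unlink(tmp)
```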
HOW DO YOU GET THE NAME OF THE CURRENT WORKING DIRECTORY IN R?
The command getwd() gives the name of the current working directory in R.
HOW DO YOU INSTALL A PACKAGE IN R?
To install a package in R, you need to give the following command:
install.packages("package name")
WHAT IS THE OUTPUT OF RUNIF(6)?
runif(6) generates 6 random numbers from a uniform distribution between 0 and 1.
GIVE THE R COMMAND TO GET THE PROBABILITY OF GETTING 26 OR FEWER HEADS FROM 51 TOSSES OF A COIN USING PBINOM.
The R command to get the probability of getting 26 or fewer heads from 51 tosses of a fair coin using pbinom is:
x<-pbinom(26,51,0.5)
print(x)
The first command obtains the required probability and stores the value in x. The second command, i.e. print(x), prints the value of x.
GIVE THE COMMANDS TO OBTAIN THE MEAN, MEDIAN AND MODE OF A DATASET.
The command for obtaining the mean of a dataset is: mean(…)
The command for obtaining the median of a dataset is: median(…)
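R has no built-in command for the statistical mode of a dataset: mode(…) returns the storage mode of an object (such as "numeric"), not its most frequent value. The mode can instead be found from a frequency table, for example with names(which.max(table(…))).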
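The three measures can be compared on a small made-up dataset. Since base R's mode() reports the storage mode of an object, the sketch defines a small helper, hypothetically named stat_mode, to find the most frequent value.

```r
x <- c(2, 4, 4, 5, 7, 4, 9)

mean(x)    # 5
median(x)  # 4

# Base R's mode() describes how the object is stored, not its most common value.
mode(x)    # "numeric"

# Hypothetical helper: return the most frequent value (first one on ties).
stat_mode <- function(values) {
  uniq <- unique(values)
  uniq[which.max(tabulate(match(values, uniq)))]
}

stat_mode(x)  # 4
```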
HOW ARE R COMMENTS WRITTEN?
Comments are written by placing # before the comment text; everything from # to the end of the line is ignored by R. For example: #division
WHAT IS T.TEST() IN R?
The t.test() function performs a t-test, which is used to determine whether the means of two groups are equal.
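A minimal sketch of a two-sample test follows. The two groups are made-up measurements, and the call uses the function's defaults (a two-sided test that does not assume equal variances).

```r
# Made-up measurements for two groups.
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)
group_b <- c(6.2, 6.8, 6.5, 7.1, 6.9, 6.4)

# Test the null hypothesis that both groups have the same mean.
res <- t.test(group_a, group_b)
print(res)

# A small p-value is evidence against equal means.
res$p.value
```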
WHAT IS THE USE OF SUBSET() AND SAMPLE() FUNCTIONS IN R?
subset() is used to select variables and observations from a dataset, and sample() is used to draw a random sample of size n from a dataset.
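Both functions can be tried on the built-in mtcars dataset. The threshold of 25 mpg, the chosen columns and the sample sizes are arbitrary illustrative choices.

```r
# Keep only cars with mpg above 25, and only the mpg and cyl columns.
efficient <- subset(mtcars, mpg > 25, select = c(mpg, cyl))
print(efficient)

set.seed(7)

# Draw 3 values at random, without replacement, from 1 to 10.
picked <- sample(1:10, size = 3)
print(picked)

# Draw 5 random rows from the data frame.
random_rows <- mtcars[sample(nrow(mtcars), 5), ]
print(random_rows)
```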
HOW CAN YOU PRODUCE CORRELATIONS AND COVARIANCES?
Correlations are produced by the cor() function and covariances by the cov() function.
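As a quick check on made-up data where y is an exact linear function of x, the correlation is 1 and the covariance is twice the variance of x:

```r
x <- c(1, 2, 3, 4, 5)
y <- 2 * x + 1  # perfectly linear in x

cor(x, y)  # 1
cov(x, y)  # 5, which is 2 * var(x)
```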
WHAT IS THE WORKSPACE IN R?
The workspace is the current R working environment, which includes any user-defined objects such as vectors and lists.
WHAT IS THE FITDISTR() FUNCTION?
It is used to provide the maximum likelihood fitting of univariate distributions. It is defined under the MASS package.
WHY IS THE LIBRARY() FUNCTION USED?
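The library() function is used to load an installed package into the current session, for example library(MASS). When called with no arguments, it lists the packages that are installed.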
ON WHICH TYPES OF DATA DO BINARY OPERATORS WORK?
Binary operators work on matrices, vectors and scalars.
WHICH FUNCTION IS USED TO CREATE A FREQUENCY TABLE?
A frequency table is created with the table() function.
HOW CAN YOU IDENTIFY THE DATA TYPE OF AN OBJECT?
Using the functions class() or typeof(), you can identify the data type of an object in R. The class() function returns the high-level class of the object (for example, data.frame), whereas typeof() returns its low-level internal storage type (for example, list).
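The difference shows up most clearly on objects whose class is built on top of a simpler storage type. The objects below are illustrative examples only.

```r
n <- 1:3
class(n)   # "integer"
typeof(n)  # "integer"

# A data frame is stored internally as a list.
df <- data.frame(a = 1:2)
class(df)   # "data.frame"
typeof(df)  # "list"

# A Date is stored internally as a double.
d <- as.Date("2024-01-01")
class(d)   # "Date"
typeof(d)  # "double"
```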