USING STATISTICS IN MACHINE LEARNING

Statistics is a subfield of mathematics. Statistical modeling formalizes the relationships between variables in the data as mathematical equations. There are two major schools of thought, frequentist and Bayesian, and both are built on probability, another subfield of mathematics that deals with quantifying the likelihood of events. ISS coaching in Lucknow further explains that statistics is usually applied to low-dimensional problems when you need to know more about the data and the properties of estimators. Common quantities used to describe data and estimators include the p-value, the standard deviation, the confidence interval, and whether an estimator is unbiased.
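To make two of these quantities concrete, here is a minimal sketch that computes a sample standard deviation and an approximate 95% confidence interval for a mean. The function name and the height values are made up for illustration, and the interval uses a normal approximation, which is only a rough choice for small samples.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, sample SD, (low, high)) for a list of numbers.

    Assumption: a normal approximation is good enough. For small samples
    a t-distribution would give a slightly wider interval.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)            # sample SD (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    # Two-sided critical value, roughly 1.96 for a 95% interval
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, sd, (mean - z * standard_error, mean + z * standard_error)

# Hypothetical measurements, invented for this example
heights_cm = [168.0, 172.5, 165.2, 180.1, 175.3, 169.8, 171.4, 177.0]
mean, sd, (low, high) = mean_confidence_interval(heights_cm)
print(f"mean={mean:.2f} sd={sd:.2f} 95% CI=({low:.2f}, {high:.2f})")
```

The interval narrows as the sample grows, because the standard error shrinks with the square root of the sample size.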



Machine Learning (ML) is a subfield of computer science and artificial intelligence. ML deals with building systems (algorithms, models) that can learn from data and observations, instead of being explicitly programmed.

Machine learning focuses on algorithms, and a subset of these aim to predict some outcome based on a set of inputs (or predictors, as we might call them in statistics). In contrast to parametric statistical models, these algorithms typically do not make rigid assumptions about the relationships between the inputs and the outcome, and therefore can perform well when the dependence of the outcome on the predictors is complex or non-linear. The potential to capture such complex relationships is, however, not unique to machine learning. Statistics also offers flexible parametric and semiparametric models, and even fully non-parametric methods such as non-parametric regression.
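The contrast between a rigid parametric model and a flexible non-parametric one can be sketched on a toy problem. The example below is an illustration under stated assumptions: the data are a noise-free curve y = x^2 invented for the purpose, the parametric model is a straight line fitted by least squares, and the non-parametric model is a simple k-nearest-neighbours average. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def knn_predict(xs, ys, x_new, k=3):
    """Non-parametric prediction: average y over the k nearest training x."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(predictions, truth):
    return sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth)

# Toy non-linear relationship (assumed for illustration): y = x^2
train_x = [i * 0.25 for i in range(-12, 13)]          # -3.0 to 3.0
train_y = [x ** 2 for x in train_x]
test_x = [i * 0.25 + 0.125 for i in range(-12, 12)]   # points between the training x
test_y = [x ** 2 for x in test_x]

slope, intercept = fit_line(train_x, train_y)
linear_mse = mean_squared_error([slope * x + intercept for x in test_x], test_y)
knn_mse = mean_squared_error([knn_predict(train_x, train_y, x) for x in test_x], test_y)
print(f"linear MSE={linear_mse:.3f}  k-NN MSE={knn_mse:.3f}")
```

The straight line cannot follow the curve, so its error on the held-out points is much larger. Adding a squared term to the linear model would close the gap, which is the sense in which flexible statistical models can capture the same relationships.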

Machine learning and statistics are closely related fields in terms of methods, but distinct in their principal goal: statistics draws population inferences from a sample, while machine learning finds generalizable predictive patterns.

You need statistics for machine learning because, with a decent understanding of statistical methods, you can convert raw observations into information that is easy to understand, digest, and share. This helps you build machine learning models that deliver results more consistently.

Machine learning allows computers to learn and discern patterns without being explicitly programmed. Combined, statistical techniques and machine learning are a powerful tool for analysing various kinds of data in many areas of computer science and engineering, including image processing, speech processing, natural language processing, and robot control, as well as in fundamental sciences such as biology, medicine, astronomy, physics, and materials science.

 

In machine learning, data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision making. It is used in many interdisciplinary fields such as artificial intelligence, pattern recognition, and neural networks.

The major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables.

The machine learning pipeline is the workflow of the machine learning process, from defining the business problem to deploying the model. Within the pipeline, data preparation is usually the most difficult and time-consuming part, because the data often arrives in an unstructured format and needs cleaning.



Data collection in ML

The 21st century is often called the "Age of Data Abundance". Collecting data is like collecting the pieces of a mosaic. Machine learning gives us a way to arrange those pieces to get useful insights.

You need statistics for machine learning. The two fields are highly intertwined, to the point that some statisticians refer to machine learning as statistical learning or applied statistics, instead of the name that is designed to sound a bit more computer-centric.

When you get started with machine learning, most texts assume that you already have some statistics foundation, which shows how hard it is to build a sound foundation in machine learning without it.

These are just some of the signs that you need a basic understanding of statistics to properly understand machine learning. Almost anyone can apply an algorithm copied from various sources to a dataset and claim proficiency in machine learning.

However, without adequate knowledge of statistics, you’ll find out that you can’t interpret logistic regression results. You’ll also see a poor performance from your models because you’ve failed to normalize predictors, and you’re likely using the incorrect splitting criterion with your tree-based models. You need a proper background in statistics to avoid these problems.
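Two of the points above can be sketched in a few lines: normalizing predictors that sit on very different scales, and reading a logistic regression coefficient as an odds ratio. This is an illustration only. The income and age values and the coefficient 0.7 are hypothetical, not results from any fitted model.

```python
import math
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # population SD of the given values
    return [(v - mean) / sd for v in values]

def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)

# Hypothetical predictors on very different scales
income = [32000.0, 45000.0, 58000.0, 120000.0]
age = [25.0, 40.0, 31.0, 52.0]
income_z = standardize(income)
age_z = standardize(age)
print("standardized income:", [round(z, 2) for z in income_z])
print("standardized age:   ", [round(z, 2) for z in age_z])

# Hypothetical coefficient for standardized age in a logistic regression
coef_age = 0.7
print(f"odds ratio for a one-SD increase in age: {odds_ratio(coef_age):.2f}")
```

After standardization both predictors are on the same scale, so neither dominates a distance-based or regularized model purely because of its units. A coefficient of 0.7 on a standardized predictor means that a one standard deviation increase multiplies the odds of the outcome by about 2, with the other predictors held fixed.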

 

Raw observations are just data. They are not yet information or knowledge. With every dataset, there are a few questions that have to be answered: What does the data look like? Are there any limits on the observations? Which observation is most common?
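These questions map directly onto descriptive statistics. A minimal sketch, using made-up daily order counts and a hypothetical summarize helper:

```python
import statistics

def summarize(observations):
    """Answer the basic questions: shape, limits, and most common value."""
    return {
        "count": len(observations),
        "min": min(observations),                    # lower limit seen in the data
        "max": max(observations),                    # upper limit seen in the data
        "mean": statistics.mean(observations),
        "median": statistics.median(observations),
        "mode": statistics.mode(observations),       # most common observation
    }

# Hypothetical raw observations, invented for this example
daily_orders = [12, 15, 12, 18, 22, 12, 15, 30, 17, 12]
summary = summarize(daily_orders)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean that sits well above the median, as it does here, is a quick hint that a few large values are pulling the average up.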

Beyond existing raw data, you may need to design an experiment to collect observations. The results will raise more questions, such as how the outcomes of two experiments differ and whether those differences are real or just noise in the data. You'll also need to know which variables in the experiment are most relevant.
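One way to ask whether a difference between two experiments is real or noise is a permutation test. The sketch below is a simple illustration, not the only suitable method: it shuffles the group labels many times and counts how often a difference in means at least as large as the observed one appears by chance. The accuracy scores are invented, and the function name is hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed absolute difference, estimated p-value).
    """
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)            # reassign group labels at random
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical accuracy scores from two experiments
control = [0.71, 0.68, 0.73, 0.70, 0.69, 0.72, 0.70, 0.71]
treatment = [0.76, 0.78, 0.74, 0.77, 0.79, 0.75, 0.77, 0.76]
difference, p_value = permutation_test(control, treatment)
print(f"observed difference={difference:.3f}  p-value={p_value:.4f}")
```

A small p-value says that random relabelling rarely produces a gap this large, which is evidence that the difference is not just noise. It does not say how large or how important the effect is.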

By answering these questions, you can turn raw observations into usable information. The results will be vital to the project. They will also matter to your stakeholders, because the information supports better decision making overall.

So, to understand the data used in training a machine learning model and properly interpret the results, you’ll need statistics. Every step in a typical predictive modeling project will involve some use of a statistical method.

Many machine learning techniques are drawn from statistics (e.g., linear regression and logistic regression), in addition to other disciplines like calculus, linear algebra, and computer science. But it is this association with underlying statistical techniques that causes many people to conflate the disciplines.  

Interestingly, newer machine learning engineers and data scientists who use machine learning packages like scikit-learn in Python may be unaware of the underlying relationship between machine learning and statistics. 

Because libraries abstract machine learning away from statistics, some people argue that knowledge of statistics is not necessary to do machine learning. While this may be true for more basic tasks, experienced data scientists and machine learning engineers draw on their knowledge of probability and statistics to develop models.

 

 
