USING STATISTICS IN MACHINE LEARNING
Statistics is a subfield of mathematics. Statistical modeling is a formalization of the relationships between variables in the data in the form of mathematical equations. There are two major schools of thought, frequentist and Bayesian, both built on probability, another subfield of mathematics that deals with quantifying the likelihood of events. ISS coaching in Lucknow further explains that statistics is usually applied to low-dimensional problems when you need to know more about the data and the properties of estimators. Common examples of the quantities involved include the p-value, the standard deviation, the confidence interval, and whether an estimator is unbiased.
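As a rough illustration of the quantities named above, the sketch below computes a sample mean, a sample standard deviation, and an approximate 95% confidence interval for the mean using only Python's standard library. The function name and the sample values are made up for illustration, and the interval relies on a normal approximation, which is an assumption that becomes shaky for very small samples.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, sample_std, (low, high)) using a normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Sample standard deviation (divides by n - 1).
    std = statistics.stdev(sample)
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std / math.sqrt(n)
    return mean, std, (mean - margin, mean + margin)


# Hypothetical measurements, used only to show the call.
heights_cm = [162.0, 170.5, 168.2, 175.1, 159.8, 181.3, 166.4, 172.9]
mean, std, (low, high) = mean_confidence_interval(heights_cm)
print(f"mean={mean:.2f} std={std:.2f} 95% CI=({low:.2f}, {high:.2f})")
```

For small samples, a t-distribution critical value would usually be preferred over the normal one used here.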
Machine Learning (ML) is a subfield of computer science and artificial
intelligence. ML deals with building systems (algorithms, models) that can
learn from data and observations, instead of being explicitly programmed.
Machine learning focuses on algorithms, and a subset of
these have as their objective to predict some outcome based on a set of
inputs (or predictors, as we might call them in statistics). In contrast to parametric
statistical models, these algorithms typically do not make rigid assumptions
about the relationships between the inputs and the outcome, and therefore can
perform well when the dependence of the outcome on the predictors is complex or
non-linear. The potential to capture such complex relationships is, however, not
unique to machine learning: within statistical models we have flexible
parametric, semiparametric, and even non-parametric methods such as non-parametric regression.
Machine learning and statistics are closely related fields in terms of
methods, but distinct in their principal goal: statistics draws population inferences from a sample,
while machine learning finds generalizable predictive patterns.
You need statistics for machine learning because, with a decent understanding
of statistical methods, you can convert raw observations into information that
is easy to understand, digest, and share. This helps you build machine
learning models that deliver results more consistently.
Machine learning allows computers to learn and discern
patterns without being explicitly programmed. When statistical techniques and
machine learning are combined, they are a powerful tool for analysing
various kinds of data in many computer science and engineering areas, including
image processing, speech processing, natural language processing, and robot
control, as well as in fundamental sciences such as biology, medicine,
astronomy, physics, and materials science.
In machine learning, data analysis is
the process of inspecting, cleansing, transforming, and modeling data with the
goal of discovering useful information, informing conclusions, and supporting
decision making. It
is used in many interdisciplinary fields such as artificial intelligence,
pattern recognition, and neural networks.
The major difference between machine learning and statistics is their purpose.
Machine learning models are designed to make the most accurate predictions
possible. Statistical models are designed for inference about the relationships
between variables.
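The contrast between inference and prediction can be sketched with a single straight-line fit. In the hedged example below, the same least-squares line is examined in two ways: the slope and its standard error speak to the relationship between the variables (inference), while the error on held-out points speaks to predictive accuracy. The helper names and the data are invented for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def slope_standard_error(xs, ys, slope, intercept):
    """Standard error of the slope, the inferential view of the fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(residual_ss / (n - 2) / sxx)


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared prediction error, the predictive view of the fit."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training and held-out data.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x, test_y = [7, 8], [14.1, 15.9]

slope, intercept = fit_line(train_x, train_y)
se = slope_standard_error(train_x, train_y, slope, intercept)
mse = mean_squared_error(test_x, test_y, slope, intercept)
print(f"slope={slope:.3f} (standard error {se:.3f}), held-out MSE={mse:.3f}")
```

A statistician would typically report the slope with its uncertainty, while a machine learning practitioner would typically report the held-out error.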
The machine learning pipeline is the workflow of the machine learning process,
from defining the business problem to deploying the model. In this
pipeline, data preparation is often the most difficult and time-consuming part,
because the data frequently arrives in an unstructured format and needs cleaning.
Data collection in ML
As we all know, the 21st century is known as the "Age of Data Abundance". The
collection of data is like the collection of mosaic pieces. How we arrange this data
to get useful insights is what machine learning provides us.
You need statistics for machine learning. Both fields of study are highly intertwined,
to the point that some statisticians refer to machine learning as statistical
learning or applied statistics, instead of the name that is designed to sound a
bit more computer-centric.
When getting
started with machine learning, the bulk of the texts assume that you already
have some statistics foundation, highlighting how it’s hard to have a sound
foundation in machine learning without it.
These are just
some examples showing that you need some basic understanding of statistics to
properly understand machine learning. Almost anyone can apply an algorithm
lifted from various sources to a dataset and claim proficiency in machine
learning.
However,
without adequate knowledge of statistics, you’ll find that you can’t
interpret logistic regression results. You’ll also see poor performance from
your models because you’ve failed to normalize predictors, and you’re likely
using the incorrect splitting criterion with your tree-based models. You need a
proper background in statistics to avoid these problems.
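The three pitfalls mentioned here each rest on a small piece of statistics. The sketch below, using only the standard library, shows one common form of each: z-score standardization of a predictor, converting a logistic regression coefficient (a log-odds value) into an odds ratio, and Gini impurity as one possible splitting criterion for classification trees. The function names, the coefficient, and the data are hypothetical, and real projects would normally use a library implementation.

```python
import math
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores).

    Assumes the values are not all identical, otherwise the
    standard deviation is zero and the division fails.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def odds_ratio(coefficient):
    """Turn a logistic regression coefficient (log-odds) into an odds ratio."""
    return math.exp(coefficient)


def gini_impurity(labels):
    """Gini impurity of a collection of class labels (0 means a pure node)."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical inputs, for illustration only.
incomes = [30000, 45000, 60000, 75000, 90000]
print(standardize(incomes))
print(odds_ratio(0.7))  # multiplicative change in the odds per unit increase
print(gini_impurity(["spam", "spam", "ham", "ham"]))
```

Gini impurity suits classification targets; a regression tree would typically split on a variance-based criterion instead, which is the kind of distinction the paragraph above warns about.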
Raw
observations are just data. They are not pieces of information or knowledge.
With every dataset, there are a few questions that have to be answered: What
does the data look like? Are there any limits on the observation? What
observation is most common?
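These three questions map onto basic descriptive statistics. As a minimal sketch, assuming a one-dimensional list of numeric observations, the hypothetical helper below summarizes what the data looks like (count, mean, median, spread), its limits (minimum and maximum), and the most common observation (mode). The sample data is made up.

```python
import statistics


def describe(observations):
    """Summarize a list of numeric observations with descriptive statistics."""
    return {
        "count": len(observations),
        "mean": statistics.mean(observations),
        "median": statistics.median(observations),
        "stdev": statistics.stdev(observations),  # sample standard deviation
        "min": min(observations),                 # lower limit seen in the data
        "max": max(observations),                 # upper limit seen in the data
        "mode": statistics.mode(observations),    # most common observation
    }


# Hypothetical daily order counts.
orders_per_day = [12, 15, 15, 9, 22, 15, 18, 11]
for name, value in describe(orders_per_day).items():
    print(f"{name}: {value}")
```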
Beyond raw
data, you may need to design an experiment that will help you to collect
observations. The results will raise more questions, such as how the
outcomes of two experiments differ and whether those differences
are real or just noise in the data. You’ll also need to know which variables in the
experiment are most relevant.
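One way to ask whether a difference between two experiments is real or just noise is a permutation test. The sketch below is an illustration under stated assumptions, not a prescribed method: it shuffles the pooled observations many times and counts how often a difference in means at least as large as the observed one arises by chance. The function name, the number of permutations, the seed, and the data are all hypothetical choices.

```python
import random


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up outcomes from two experimental conditions.
control = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
treatment = [13.0, 13.4, 12.9, 13.2, 13.5, 13.1]
p_value = permutation_test(control, treatment)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the difference would rarely appear if the group labels were arbitrary; a large one suggests the difference is compatible with noise.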
By answering
these questions, you can turn raw observations into usable information. The
results generated will be vital to the project. They will also matter to your
stakeholders because the information generated will support better decision
making overall.
So, to
understand the data used in training a machine learning model and properly
interpret the results, you’ll need statistics. Every step in a typical
predictive modeling project will involve some use of a statistical method.
Many machine
learning techniques are drawn from statistics (e.g., linear regression and
logistic regression), in addition to other disciplines like calculus, linear
algebra, and computer science. But it is this association with underlying
statistical techniques that causes many people to conflate the
disciplines.
Interestingly,
newer machine learning engineers and data scientists who use machine learning
packages like scikit-learn in Python may be unaware of the underlying
relationship between machine learning and statistics.
This
abstraction of machine learning from statistics through the use of libraries is
often why some individuals argue that knowledge of statistics is
not necessary to do machine learning. While this may be true for more basic
tasks, experienced data scientists and machine learning engineers draw on their
knowledge of probability and statistics to develop models.