The doubt was why machine learning models don’t perform well on data whose distribution is not Gaussian. Earlier we told you that real-world data has a Gaussian distribution, which is wrong. A variable can have any distribution; the real question is why the data we collect so often ends up normally distributed. Notice the two different words: the data we have, and the variable. The other piece of information was that ML models were built with that fact in mind, and that is why they don’t perform well on data that is not Gaussian distributed. I will clarify why that is. For this, we need to know about the Central Limit Theorem, distributions and samples.
Distributions- A function that shows the possible values of a variable and how often they occur. Simply put, it shows how many times a specific value of a variable occurs.
The thing to keep in mind is that the distribution contains all the possible values, which may contain values not in our data set.
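To make this concrete, here is a minimal sketch of an empirical distribution. The die rolls are made-up illustration data, not from any real data set, and the names `rolls`, `possible_values` and `empirical` are just for this example. Notice that the values 4 and 5 are possible for the variable but never show up in the data.

```python
from collections import Counter

# Hypothetical data set: outcomes of ten rolls of a six-sided die.
rolls = [1, 3, 3, 6, 2, 3, 6, 1, 2, 3]

# Count how often each observed value occurs.
counts = Counter(rolls)

# Relative frequency of every possible value (1 to 6), including
# values that never appear in the data set (Counter returns 0 for those).
possible_values = range(1, 7)
empirical = {v: counts[v] / len(rolls) for v in possible_values}

for value, freq in empirical.items():
    print(value, freq)
```

The frequencies computed from the data are only an estimate of the variable's true distribution; with a fair die, every value would have the same probability.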
Samples- A subset of values drawn from the population, where the population is all the possible values of our variable.
Now we will try to understand why so much real-world data looks Gaussian.
For that, we need to know and understand the Central Limit Theorem.
Central Limit Theorem
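The distribution of the sample means approximates a normal distribution as the sample size becomes large, provided that all samples are of the same size and are taken randomly, irrespective of the distribution of our variable (as long as that distribution has a finite variance). Note that this is a statement about the means of the samples, not about the raw values inside a sample, which keep the distribution of the variable.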
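A quick simulation can illustrate the theorem. This is only a sketch under assumptions of my own: the variable follows an exponential distribution (which is heavily skewed), each sample has 50 values, and we draw 2000 samples. The `skewness` helper is written just for this example; a value near 0 suggests a symmetric, bell-like shape.

```python
import random
import statistics

def skewness(values):
    """Skewness of a list of values: roughly 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

sample_size = 50     # number of values in each sample (assumed)
num_samples = 2000   # number of random samples we draw (assumed)

# Raw values from an exponential distribution with mean 1 (heavily skewed).
raw_values = [rng.expovariate(1.0) for _ in range(10_000)]

# Mean of each random sample, all samples having the same size.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

print("skewness of raw values:  ", skewness(raw_values))
print("skewness of sample means:", skewness(sample_means))
```

The raw values stay strongly skewed, while the sample means come out much closer to symmetric. Increasing `sample_size` should push the skewness of the means further toward 0.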
Sample size - The sample size is the number of values we pick from our data for each sample. For example, let our distribution be