The central limit theorem is one of the greatest hits in the history of statistics. I wrote a little Shiny app to visualize it and to illustrate its infamous “counterexample”, the Cauchy distribution: https://datatrigger.shinyapps.io/CLT_Visualization/. [Read More]
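For readers who want a quick feel for the contrast before opening the app, here is a minimal sketch using only the Python standard library. It is not the code behind the Shiny app; the helper names (`draw_uniform`, `draw_cauchy`, `iqr_of_means`) and the choice of the interquartile range as a measure of spread are assumptions made for illustration. The idea: the spread of the sample mean shrinks as the sample size grows for a uniform variable, but stays put for a standard Cauchy variable, whose mean of n draws is again standard Cauchy.

```python
import math
import random
import statistics


def draw_uniform(rng):
    # One draw from Uniform(0, 1): finite variance, so the CLT applies.
    return rng.random()


def draw_cauchy(rng):
    # One draw from the standard Cauchy via the inverse CDF: no finite mean or variance.
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr_of_means(draw, n, reps=2000, seed=0):
    # Simulate `reps` sample means of size n and return their interquartile range.
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (1, 10, 100):
    print(n, round(iqr_of_means(draw_uniform, n), 3), round(iqr_of_means(draw_cauchy, n), 3))
```

The uniform column should shrink roughly like 1/sqrt(n), while the Cauchy column should hover around 2, the interquartile range of a standard Cauchy variable.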
Visualizing the sum of two random variables
Take two independent, identically distributed random variables. Question: if their sum is large, are they both likely to be large? Let us examine this question with contour plots. [Read More]
Adding totals and subtotals rows with pandas or the tidyverse
When dealing with a dataframe, generating aggregate data is a very common task. In my experience, presenting the summary statistics for the whole population or for subgroups directly in the dataframe can be useful, if not necessary. Today, I present my recipe to achieve this with the pandas and tidyverse packages. [Read More]
Back to basics: Scaling train and test samples.
Splitting and scaling a dataset seems easy. Admittedly, it is not that hard, but it can be tricky. Today we will see how to properly split and scale a dataset, as this step is often necessary before any ML wizardry. Let us do this with a few R & Python packages/modules. [Read More]
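As a taste of where the trickiness usually lies, here is a minimal sketch in plain Python, assuming the common convention that scaling parameters are estimated on the training sample only and then reused on the test sample. It deliberately avoids any package, so the helpers `train_test_split`, `fit_scaler` and `transform` are hypothetical names written for this illustration, not the functions used in the full post.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    # Shuffle a copy of the data, then cut it into train and test parts.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    # Estimate the scaling parameters (mean, population standard deviation).
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    # Standardize values with parameters that were estimated elsewhere.
    return [(v - mean) / std for v in values]


# Toy data, made up for the example.
rng = random.Random(42)
data = [rng.gauss(50, 10) for _ in range(200)]

train, test = train_test_split(data)
mean, std = fit_scaler(train)                # fit on the training sample only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)     # reuse the training parameters
```

With this approach the scaled training sample has mean 0 and standard deviation 1 by construction, while the scaled test sample is only approximately centered, which is expected: no information from the test sample has leaked into the scaling step.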
In this post, I try to define what an outlier is and present several ways to approach the problem of anomaly detection. Then, I present the Local Outlier Factor algorithm and apply it to a specific dataset to show its power, using both Python and R. I also compare its performance with the Isolation Forest method. [Read More]