Data Trigger

A blog on MLOps, Data Engineering & Data Science

Build a real-time stream of air quality data with Apache Kafka

 Posted on July 20, 2022

Let’s build a data stream with Kafka today! We will retrieve air quality data using the World Air Quality Index project’s API, then push it to a Kafka cluster. [Read More]
kafka  stream  API  data engineering  distributed  http  request  air  cluster 

An asymmetric loss for regression models

 Posted on January 9, 2022

Drive regression models towards under/overestimation while keeping accurate outputs with the linear-exponential loss. [Read More]
loss  custom  asymmetric  underestimation  overestimation  regression  python 

NLP with 🤗 Hugging Face

 Posted on July 12, 2021

Zero-shot classification is basically text classification with no training at all. How does it compare with transfer learning/fine-tuning? We’ll see using the beloved 🤗 transformers library. [Read More]
NLP  zero-shot classification  text classification  distilbert  transformers  hugging face 

Interpretable machine learning with SHAP

 Posted on January 24, 2021

In this post, we predict health insurance costs with an efficient black box model, namely random forest. Then we interpret individual predictions as well as the global behavior of the estimator using SHapley Additive exPlanations. [Read More]
interpretability  explainability  Shapley  SHAP  correlation  multicollinearity  python  black box  insurance 

Image recognition with PyTorch and fastai

 Posted on December 22, 2020

Computer vision is one of the most fascinating domains in Machine Learning. Libraries like PyTorch and, more recently, fastai have made these kinds of models extraordinarily accessible. In this post, we build an aircraft classifier, from data gathering to training and deployment. [Read More]
computer vision  transfer learning  pre-trained models  deployment  pytorch  torchvision  fastai  fast.ai  python 

Shiny Central Limit Theorem

 Posted on November 28, 2020

The central limit theorem is one of the greatest hits in the history of statistics. I wrote a little Shiny app to visualize it and to illustrate its infamous “counterexample”, the Cauchy distribution: https://datatrigger.shinyapps.io/CLT_Visualization/. [Read More]
r  shiny  data visualization  central limit theorem 

Gradient tree boosting in the cloud

 Posted on November 13, 2020

A cloud computing experiment with two slightly different implementations of gradient boosted trees: LightGBM and XGBoost. Let us evaluate how these two algorithms perform on a moderately large dataset, in terms of both accuracy and speed. [Read More]
python  xgboost  lightgbm  gradient boosted trees  cloud computing  machine learning  superconductors  paperspace 

Visualizing the sum of two random variables

 Posted on October 31, 2020

Take two independent, identically distributed random variables. Question: if their sum is large, are they both likely to be large? Let us examine this question with contour plots. [Read More]
r  ggplot2  contour  dataviz  data visualization  statistics 

Adding totals and subtotals rows with pandas or the tidyverse

 Posted on October 18, 2020

When dealing with a dataframe, generating aggregate data is a very common task. In my experience, presenting the summary statistics for the whole population or for subgroups directly in the dataframe can be useful, if not necessary. Today, I present my recipe to achieve this with the pandas and tidyverse packages. [Read More]
python  r  pandas  tidyverse  total row  subtotals  aggregate  groupby 

Back to basics: Scaling train and test samples.

 Posted on October 12, 2020

Splitting and scaling a dataset seems easy. Admittedly, it is not that hard, but it can be tricky. Today we will see how to properly split and scale a dataset, as this step is often necessary before any ML wizardry. Let us do this with a few R & Python packages/modules. [Read More]
scaling  normalize  standardize  spark  pyspark  python  r  dplyr  caret 

Vincent Le Goualher  • © 2022  •  Data Trigger
