Introduction
The topic of the present post originates from the superconductivity dataset, available on the UCI Machine Learning Repository, and the associated paper [1]. The problem consists of predicting the critical temperature $T_c$ of superconductors. At or below this temperature, such a compound conducts current with zero resistance, hence eliminating power loss due to Joule heating. Surprisingly enough, the scientific community has not yet produced a general model of $T_c$, more than a century after the discovery of superconductivity by Dutch physicist Heike Kamerlingh Onnes in 1911.
In his study [1], Kam Hamidieh builds a gradient boosted trees model using the XGBoost algorithm. He ends up with an out-of-sample RMSE of 9.5 K. Let us see how the LightGBM framework performs on this specific task.
Data
The dataset contains 21,263 superconductors described by 81 explanatory variables, plus one response variable, $T_c$. We are cosily ensconced in the supervised learning framework, and more specifically, regression. The explanatory variables are all continuous and positive. They are related to the following eight physical properties of each element making up the superconductor: atomic mass, first ionization energy, atomic radius, density, electron affinity, fusion heat, thermal conductivity and valence. For each of these characteristics, ten summary statistics are computed: mean, weighted mean, geometric mean, entropy, range, etc. These quantities are the actual values of the independent variables. As a consequence, we get 8 × 10 = 80 highly correlated features. In this context, leaning towards tree-based ensemble methods seems reasonable. The eighty-first input variable is the number of elements in the superconductor.
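To make the feature construction more concrete, here is a minimal sketch that computes four of the summary statistics named above for a single property of a two-element compound. The function name, the property values and the proportions are invented for the illustration; the exact definitions used for the dataset are those given in [1].

```python
import math

def summary_statistics(values, proportions):
    """Summarise one atomic property over the elements of a compound.

    values      -- property value of each element (all positive)
    proportions -- share of each element in the formula, summing to 1
    """
    n = len(values)
    return {
        "mean": sum(values) / n,
        "weighted_mean": sum(p * v for p, v in zip(proportions, values)),
        "geometric_mean": math.exp(sum(math.log(v) for v in values) / n),
        "range": max(values) - min(values),
    }

# Hypothetical compound: two elements in a 7:1 ratio, with made-up
# property values of 48 and 23.
stats = summary_statistics([48.0, 23.0], [7 / 8, 1 / 8])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Since every statistic is derived from the same handful of values, the resulting columns are bound to be strongly correlated, which is the situation described above.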


More information on data preparation and feature extraction, as well as a descriptive analysis of the dataset, is available in the original paper [1]. The dataset contains no missing values.
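A check for missing values can be sketched with pandas as follows. The helper name and the tiny demonstration frame are illustrative only; with the real data, one would pass the frame loaded from the repository's CSV file instead.

```python
import numpy as np
import pandas as pd

def count_missing(df: pd.DataFrame) -> int:
    """Return the total number of missing cells in the frame."""
    return int(df.isna().sum().sum())

# Tiny made-up frame standing in for the real dataset.
demo = pd.DataFrame(
    {"some_feature": [88.9, 92.7], "target": [29.0, np.nan]}
)
print(count_missing(demo))           # counts the single NaN cell
print(count_missing(demo.dropna()))  # nothing left once that row is dropped
```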


XGBoost & LightGBM
XGBoost [2] and LightGBM [3] are slightly different implementations of gradient boosted trees. LightGBM is often considered one of the fastest, most accurate and most efficient implementations. The authors of the LightGBM documentation stress this point to a great extent. Although this may be correct in some situations, this Kaggle discussion shows that it depends on various factors, including the number of features, memory limitations, convergence and GPU vs CPU.
Let us see what happens with the superconductivity dataset. We will use the native Python API in both cases to stay as objective as possible. Nonetheless, there are scikit-learn wrappers for both libraries. They are particularly useful because they integrate with the grid search tools of the sklearn.model_selection module (formerly sklearn.grid_search).
Hardware
As my personal machine is getting quite old, the following computations were carried out in the cloud, on a P4000 instance on Paperspace Gradient. We only used the CPU, although both LightGBM and XGBoost can also run on a GPU.
First test
In this part, we train a LightGBM model using the parameters provided by Kam Hamidieh in his paper. They were found with a grid search with cross-validation. We relied heavily on the LightGBM documentation to translate these XGBoost parameters appropriately. To make an honest comparison with the original work, we use the same method to estimate the out-of-sample RMSE, that is:
 Split the data into random train and test subsets with 2/3 of the rows for training.
 Fit the model using the train data and compute the RMSE on the test sample.
 Repeat 25 times and retain the average RMSE on the test data.
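The three steps above can be sketched as a small, model-agnostic helper. The function names are hypothetical, and an ordinary least squares fit on synthetic data stands in for the boosted model so that the example runs without any boosting library.

```python
import numpy as np

def repeated_holdout_rmse(X, y, fit_predict, n_repeats=25, train_fraction=2 / 3, seed=0):
    """Average test RMSE over several random train/test splits.

    fit_predict(X_train, y_train, X_test) must return predictions for X_test.
    """
    rng = np.random.default_rng(seed)
    n_rows = len(y)
    n_train = int(round(train_fraction * n_rows))
    rmses = []
    for _ in range(n_repeats):
        order = rng.permutation(n_rows)
        train, test = order[:n_train], order[n_train:]
        predictions = fit_predict(X[train], y[train], X[test])
        rmses.append(float(np.sqrt(np.mean((y[test] - predictions) ** 2))))
    return float(np.mean(rmses)), rmses

# Stand-in model: a LightGBM or XGBoost fit would be plugged in here instead.
def least_squares(X_train, y_train, X_test):
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=300)

mean_rmse, all_rmses = repeated_holdout_rmse(X, y, least_squares)
print(f"mean RMSE over {len(all_rmses)} splits: {mean_rmse:.3f}")
```

Keeping the individual RMSEs alongside their mean makes it easy to report their spread afterwards.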
LightGBM
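As a rough guide to the translation step, the sketch below maps common XGBoost parameter names to their LightGBM counterparts, which LightGBM documents as aliases. The helper and the numeric values are placeholders chosen for the example; they are not the parameters from [1]. The two adjustments at the end reflect assumptions about what a faithful translation needs: LightGBM ignores the row subsampling fraction unless `bagging_freq` is positive, and it limits tree size through `num_leaves` rather than depth alone.

```python
# XGBoost parameter name -> LightGBM parameter name.
XGB_TO_LGB = {
    "eta": "learning_rate",
    "max_depth": "max_depth",
    "min_child_weight": "min_sum_hessian_in_leaf",
    "colsample_bytree": "feature_fraction",
    "subsample": "bagging_fraction",
}

def translate_xgb_params(xgb_params):
    """Rename XGBoost parameters and add the LightGBM settings they imply."""
    lgb_params = {XGB_TO_LGB.get(name, name): value for name, value in xgb_params.items()}
    # Row subsampling is only active when bagging_freq > 0.
    if lgb_params.get("bagging_fraction", 1.0) < 1.0:
        lgb_params.setdefault("bagging_freq", 1)
    # A tree of depth d has at most 2**d leaves; without raising num_leaves,
    # the default leaf limit would cap tree size well below that.
    # The 131072 cap is the upper bound stated in the LightGBM documentation.
    depth = lgb_params.get("max_depth", -1)
    if depth > 0:
        lgb_params.setdefault("num_leaves", min(2 ** depth, 131072))
    return lgb_params

# Placeholder values, not those of the original study.
example = translate_xgb_params(
    {"eta": 0.1, "max_depth": 6, "min_child_weight": 1,
     "colsample_bytree": 0.8, "subsample": 0.8}
)
print(example)
```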


XGBoost


Using the same set of parameters in both libraries, LightGBM was approximately 12 times faster than XGBoost. However, its out-of-sample RMSE is 1.20 K worse than XGBoost’s.
A homemade grid search for appropriate LightGBM parameters
Let us try to improve the LightGBM model. Could we get a lower RMSE and a shorter execution time than with XGBoost? In this part, we do not dive into a meticulous grid search, but we make an attempt based on the following recommendations from the LightGBM documentation to increase accuracy:
 Use large max_bin (may be slower)
 Use small learning_rate with large num_iterations
 Use large num_leaves (may cause overfitting)
For each combination of all the other parameters, the number of iterations/trees, i.e. num_boost_round, is selected through 5-fold cross-validation. In the end, we select the set of parameters that does best on the test sample.
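The structure of such a search can be sketched as follows. Both callables are hypothetical stand-ins: with LightGBM, `select_rounds` would typically wrap a 5-fold `lightgbm.cv` run and `test_rmse` would train on the training sample and score the test sample. Toy functions are used here so that the example runs on its own.

```python
from itertools import product

def grid_search(grid, select_rounds, test_rmse):
    """Try every combination in `grid` and return (best RMSE, params, rounds).

    select_rounds(params)     -> number of boosting rounds picked by cross-validation
    test_rmse(params, rounds) -> RMSE on the held-out test sample
    """
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        rounds = select_rounds(params)
        score = test_rmse(params, rounds)
        if best is None or score < best[0]:
            best = (score, params, rounds)
    return best

# Made-up grid and toy scoring functions, for illustration only.
grid = {
    "max_bin": [255, 511],
    "learning_rate": [0.01, 0.05],
    "num_leaves": [32, 64, 128],
}
toy_rounds = lambda p: round(10 / p["learning_rate"])
toy_rmse = lambda p, r: abs(p["num_leaves"] - 64) / 64 + p["learning_rate"]

best_rmse, best_parameters, best_num_rounds = grid_search(grid, toy_rounds, toy_rmse)
print(best_parameters, best_num_rounds, best_rmse)
```

Because ties are broken in favour of the first combination visited, the order of the values in each list matters when several settings score equally.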


Again, we build this model 25 times on different random test-train splits (same code as in the “First test” part, but with best_parameters and best_num_rounds).
To give some perspective on this result, let us note that the standard deviation of the 25 RMSEs is approximately 0.2 K, and their range is about 0.7 K.
Conclusion
In the end, our LightGBM model has achieved similar performance to XGBoost regarding both speed and accuracy. This does not mean that we could not reach better performance if we went deeper into hyperparameter tuning and grid searching. However, we can note that for this specific superconductivity problem and dataset, LightGBM did not trivially provide a better model than XGBoost. The source code of this experiment is entirely available on GitHub.
References
[1] HAMIDIEH, Kam: A Data-Driven Statistical Model for Predicting the Critical Temperature of a Superconductor, University of Pennsylvania, Wharton, Statistics Department, 2018.
[2] CHEN, T., GUESTRIN, C.: XGBoost: A Scalable Tree Boosting System, University of Washington, 2016.
[3] KE, Guolin, MENG, Qi, FINLEY, Thomas, WANG, Taifeng, CHEN, Wei, MA, Weidong, YE, Qiwei, LIU, Tie-Yan: LightGBM: A Highly Efficient Gradient Boosting Decision Tree, Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.