Eniola Olaleye
6 min read · Dec 24, 2018


TEAM SOLUTION: DEALING WITH REAL-WORLD DATA AT THE DSN/ACCESS/AFF HACKATHON

TEAM SOLUTION FINISHED 2ND ON THE KAGGLE LEADERBOARD.

THE TEAM CONSISTS OF FOUR MEMBERS AND AN INSTRUCTOR FROM ACCESS BANK.

First, we have Olaleye Eniola:

OLALEYE ENIOLA

From my prior experience with LOAN DEFAULT PREDICTION, I've come to appreciate that this is one of the most complex problems to apply machine learning to. Data in this domain tends to be very heterogeneous, collected over different time frames, and drawn from many different sources that may change midway through the data collection process. Coming up with a proper target variable is also a very tricky process that requires deep domain knowledge and refined business analysis skills. I want, once again, to commend DATA SCIENCE NIGERIA and ACCESS BANK for putting together such a great dataset, one that was very amenable to machine learning techniques.

Based on what is known about credit underwriting, and about these kinds of machine learning problems in general, it was clear all along that two things would be crucial for building a good model in this competition: 1. A good set of smart features. 2. A diverse set of base algorithms.

The data was so large that my system malfunctioned for a while, so I switched to a Kaggle kernel, which made the work much easier. The munging and merging of the data was done in SQL by our instructor, Mr Adewuyi Afeez.

We discovered how efficient and easy it was to merge data with SQL, which got everyone in the group interested in SQL.

We also have Chinonso Ani

CHINONSO ANI

Chinonso helped in handling the imbalanced dataset.

How to handle imbalanced classes

He said:

It is also important to look at the distribution of how many customers churn. If 95% of customers don’t churn, we can achieve 95% accuracy by building a model that simply predicts that all customers won’t churn. But this isn’t a very useful model, because it will never tell us when a customer will churn, which is what we are really interested in.
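To make that concrete, here is a minimal sketch of the trap he describes, using a majority-class baseline on synthetic data (all names and numbers here are illustrative, not from our actual solution):

```python
# A "model" that always predicts the majority class scores ~95%
# accuracy on a 95/5 split, yet never catches a single churner.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # 1,000 samples, 5 dummy features
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positive (churn) class

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(accuracy_score(y, y_pred))  # ~0.95: looks great...
print(recall_score(y, y_pred))    # 0.0: ...but never flags a churner
```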

The churn class is only around 25% of the total sample population. There is a real risk that a model trained on this data will be biased toward the majority class. There are a number of techniques for handling imbalanced classes:

Method 1: Up-sampling the minority class

To balance the dataset, we can randomly duplicate observations from the minority class. This is also known as resampling with replacement.
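As a minimal sketch, assuming a pandas DataFrame `df` with a binary target column named `churn` (hypothetical names, not the competition's actual schema), up-sampling can be done with scikit-learn's `resample`:

```python
import pandas as pd
from sklearn.utils import resample

majority = df[df["churn"] == 0]
minority = df[df["churn"] == 1]

# Randomly duplicate minority rows (sampling WITH replacement)
# until both classes are the same size.
minority_upsampled = resample(
    minority,
    replace=True,                 # resample with replacement
    n_samples=len(majority),      # grow to the majority class size
    random_state=42,              # for reproducibility
)

df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["churn"].value_counts())  # classes now 50/50
```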

Method 2: Down-sampling the majority class

Similar to the method above, except that we reduce the number of samples in the majority class to equal the number of samples in the minority class.
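Continuing the sketch above (reusing the hypothetical `majority` and `minority` DataFrames), down-sampling simply flips the direction:

```python
import pandas as pd
from sklearn.utils import resample

# Shrink the majority class, this time sampling WITHOUT replacement.
majority_downsampled = resample(
    majority,
    replace=False,                # sample without replacement
    n_samples=len(minority),      # shrink to the minority class size
    random_state=42,
)

df_balanced = pd.concat([majority_downsampled, minority])
print(df_balanced["churn"].value_counts())  # classes now 50/50
```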

We also have Chuka Dean:

CHUKA DEAN

Chuka Dean helped by using Azure ML to demystify the problem.

Azure ML is built on top of the machine learning capabilities of several Microsoft products and services. It shares many of the real-time predictive analytics of the new personal assistant in Windows Phone called Cortana. Azure ML also uses proven solutions from Xbox and Bing. Outshining Nate Silver’s lauded FiveThirtyEight blog, Bing Predicts recently astonished many by correctly forecasting the results of more than 95% of the US mid-term elections. Thus, it might be worth checking out Azure ML to see what its powerful cloud-based predictive analytics can do for you.

We also have Adesiji Blessing:

ADESIJI BLESSING

Adesiji Blessing did a really good job visualizing the datasets.

CLASSIFICATION REPORT

He was able to draw meaningful insights from the data by visualizing it to identify the important features.
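As an illustration of that kind of plot, here is a minimal sketch of a feature-importance bar chart from a fitted tree model, on synthetic data with made-up feature names; it is not his actual code:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Bar chart of the model's learned feature importances.
names = [f"feature_{i}" for i in range(X.shape[1])]
plt.bar(names, model.feature_importances_)
plt.ylabel("importance")
plt.title("Feature importances")
plt.show()
```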

The classification report and confusion matrix from his work are as follows:

CONFUSION MATRIX
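Since the original images are not reproduced here, the following minimal sketch shows how such a report and matrix are generated with scikit-learn, assuming true labels `y_test` and model predictions `y_pred` already exist:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_test, y_pred))

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_test, y_pred))
```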

Data visualization refers to presenting data in a graphical or pictorial form, such as a pie chart. This allows audiences to recognize patterns more quickly. Interactive visualizations let decision-makers drill down through layers of detail, changing perspective so that users can review the facts behind the analysis.

Here are ways that data visualization affects decision-making and changes organizations.

1. Faster Action

The human brain tends to process visual information far more easily than written information. Use of a chart or graph to summarize complex data ensures faster comprehension of relationships than cluttered reports or spreadsheets.

This provides a very clear form of communication, allowing business leaders to interpret and act upon their information more rapidly. Big data visualization tools can provide real-time information that's easier for stakeholders to evaluate across the enterprise. Faster responses to market changes and quick identification of new opportunities are a competitive advantage in any industry.

2. Communicate Findings in Constructive Ways

Many business reports submitted to senior management are formalized documents that are often inflated with static tables and a variety of chart types. They become so elaborate that they fail to make information vibrant and memorable for those whose opinions matter most.

Reports coming from big data visualization tools, however, make it possible to encapsulate complex information on operational and market conditions in a brief series of graphics, or even a single one. Decision-makers can easily interpret wide and varying data sources through interactive elements and new visualizations such as heat maps and fever charts. Rich but meaningful graphics help engage and inform busy executives and business partners on problems and pending initiatives.

And finally, our great instructor:

MR ADEWUYI AFEEZ

Mr Adewuyi Afeez helped in merging the datasets with SQL, which simplified our work.

SQL stands for Structured Query Language and is supported by a set of standards, although every database vendor seems to implement them slightly differently. Even though SQL is always a little different depending on whether you are using MySQL, Oracle, DB2, or whatever vendor tool you have, if you are good at writing SQL and know the database model, you can adapt quickly to get whatever data you need.
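As a minimal sketch of the kind of merge our instructor performed, the following runs a SQL JOIN through Python's built-in sqlite3 module so it can run anywhere; the tables and columns (`customers`, `loans`, `customer_id`) are hypothetical, not the competition's actual schema:

```python
import sqlite3

# Build two tiny stand-in tables in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, age INTEGER);
    CREATE TABLE loans (customer_id INTEGER, amount REAL, defaulted INTEGER);
    INSERT INTO customers VALUES (1, 34), (2, 51);
    INSERT INTO loans VALUES (1, 5000.0, 0), (2, 12000.0, 1);
""")

# Merge the two datasets on their shared key with a JOIN
# (the SQL equivalent of a pandas merge).
rows = conn.execute("""
    SELECT c.customer_id, c.age, l.amount, l.defaulted
    FROM customers AS c
    JOIN loans AS l ON l.customer_id = c.customer_id
""").fetchall()

for row in rows:
    print(row)
```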

Machine learning is not new, and SQL databases are not new, so what has changed? Three things.

  • Distributed Systems: Today, SQL databases are distributed, allowing you to use more cores and achieve better parallelism. This lets more hardware address a single query at once, improving performance. Distributed systems can scale to hundreds of servers or cloud instances, a different playing field of performance from single-node systems.
  • New Hardware: Modern CPU hardware is increasingly sophisticated. For example, the Intel Advanced Vector Extensions, an implementation of Single Instruction, Multiple Data (SIMD), are widely available in the AVX2 iteration, and newer high-end processors support the more advanced AVX-512. In short, SIMD allows processors to compute on multiple pieces of data simultaneously, increasing parallelism at the processor level. Additionally, GPUs can achieve a similar type of parallelism by coordinating among many cores simultaneously.
  • Code Generation: Code generation helps optimize queries and custom functions in a database by converting the original requests into machine code, allowing them to run much faster. The true power of code generation comes from generating machine code optimized for a specific query, avoiding the overhead of an interpreter. An interpreter is a piece of code that can run any query, and as such contains more generalized instructions, which are usually not optimal for the performance of a given specific query.

TOGETHER, WE ARE TEAM SOLUTION. THANK YOU.
