20 Integral AI Terms for 2020

Despite what all the media say, AI is not all that intelligent. Sure, modern AI can solve all sorts of problems, play chess, and create the semblance of a conversation. In reality, much of this is just pattern detection and, in many cases, brute-force rapid calculation.

Artificial Intelligence can be defined as giving a machine the ability to mimic human intelligence, or the ability to react to changes in its environment with the goal of accomplishing some task. In developing technologies that are considered AI, many fields are involved – from the obvious (computer science) to the somewhat obvious (cognitive neuroscience) to the less obvious (linguistics) – and all contribute to the development of new automated and intelligent systems increasingly filling our world. 

As we enter 2020, understanding AI and implementing it across businesses is more urgent than ever. To expedite AI adoption, we have picked 20 technical terms that every business person needs to know.

  1. Data Science
  2. Machine Learning / Automated Machine Learning (AutoML)
  3. Supervised Learning / Unsupervised Learning
  4. Natural Language Processing (NLP)
  5. Neural Network / Deep Learning
  6. Reinforcement Learning
  7. Outlier
  8. Hyperparameter / Hyperparameter Optimization / Hyperparameter Tuning
  9. Modeling
  10. Training
  11. Training Dataset / Validation Dataset / Test Dataset
  12. Cross-Validation
  13. Ensemble
  14. Model Explainability
  15. Confusion Matrix
  16. Accuracy / Precision / Recall
  17. ROC / ROC AUC
  18. Data Leakage
  19. Dimensionality Reduction
  20. Overfitting / Underfitting

1. Data Science

When a business leader is tasked with implementing AI, it’s not just about installing robots to carry out repeated tasks. It is about analyzing data and leveraging the new insights to learn more about customers, find new opportunities, reduce risks, and more.

Data Science is a field of study that uses programming, statistics, and math to discover trends and insights in data. It is a very wide field that can be applied to almost any industry that uses data or collects data. The breadth of this field makes pinpointing an exact definition difficult: the term data science is sometimes used to describe everything from standard data analysis and charting approaches that utilize computers, all the way to machine learning, smart robotics, image classification, and natural language processing. 

Basically, anything that uses data and computers in tandem nowadays can be referred to as Data Science. However, Data Science is sometimes limited to just Machine Learning, Deep Learning, and Big Data. No matter what the definition is, at its core, the goal of Data Science is to learn from data, which is also the goal of implementing AI in business.

2. Machine Learning / Automated Machine Learning (AutoML)

Machine Learning is an application of AI that enables machines and systems to learn on their own to predict real-life outcomes. The key ingredients here are self-learning, access to more data, and iterative learning on the go. The machine learns without explicit programming: it uses data to find the patterns within it and builds self-learning algorithms to maximize performance.

*Note that Machine Learning is often used interchangeably with Artificial Intelligence.

While Machine Learning implies self-learning, human guidance is still needed in many places, and that guidance is often time-consuming and resource-intensive. As a result, Automated Machine Learning, or AutoML, was developed. It automates many parts of building machine learning models and makes the job simpler, faster, and easier.
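
As a rough sketch, here is what "learning from data rather than from hand-written rules" can look like in Python with scikit-learn. The library, the synthetic data, and the column meanings are our own choices for illustration, not part of any particular AutoML product.

```python
# Minimal sketch: the model learns the relationship from examples,
# not from explicitly programmed rules. The data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))              # e.g., years of experience
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, 200)    # e.g., salary, with noise

model = LinearRegression().fit(X, y)   # the "learning" step
print(model.coef_, model.intercept_)   # coefficients learned from the data
print(model.predict([[7.5]]))          # prediction for an unseen input
```

AutoML tools essentially automate the choices around this step: which algorithm to use, how to prepare the data, and which settings to try.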

3. Supervised Learning / Unsupervised Learning

When we have the output values, we can teach the machine to learn the relationship or find the pattern between the input and the output values, allowing it to find the best way or build a model to predict an output when it’s given a new set of input values. This is Supervised Learning.

An example would be asking a machine to predict the chances of a person developing diabetes. If you’re looking at different key parameters – such as height, weight, BMI, age, blood sugar levels, genetic markers, etc. – you can see that there’s no way to write a simple rule-based program: the number of possible combinations of those parameters is enormous. Now, if we ask a machine to use Machine Learning to predict whether an individual has a high risk of developing diabetes, we give it whatever data we have and let it learn the rules. Then, when you input a new case, it will use what it built internally and make a prediction: “This is what I think the probability is of this person developing diabetes.”

In technical terms, Supervised Learning works well with classification and regression, and some well-known algorithms include logistic regression, decision trees, and support vector machines. 
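
Here is a hedged sketch of that supervised workflow in Python with scikit-learn. The "diabetes" features and the labels below are entirely synthetic, invented only to show the fit-then-predict pattern.

```python
# Supervised learning sketch: features plus known labels -> a model that
# predicts the label for new inputs. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.normal(27, 5, n),    # BMI
    rng.normal(45, 12, n),   # age
    rng.normal(100, 20, n),  # blood sugar level
])
y = ((X[:, 0] > 30) & (X[:, 2] > 110)).astype(int)  # made-up labeling rule for the demo

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability estimate for a new patient: [BMI, age, blood sugar]
print(clf.predict_proba([[33, 50, 125]])[0, 1])
```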

Unsupervised Learning is for when you do not have a specific set of outputs, often called labels or labeled data, but are trying to learn from the structure of the data. One common use of unsupervised learning is clustering: grouping data points that are similar to each other without labeling them.

We can use the same diabetes example with one twist: we do not know which patients have diabetes. However, we can ask the machine to find clusters of patients that are similar. As a result, the machine would produce different groups of patients, but we would not know which group is the one with diabetes.
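
A correspondingly minimal unsupervised sketch, again with synthetic data and scikit-learn as an illustrative choice:

```python
# Unsupervised sketch: no labels, just grouping similar patients by their features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.normal(27, 5, 500),    # BMI
    rng.normal(45, 12, 500),   # age
    rng.normal(100, 20, 500),  # blood sugar level
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for the first 10 patients
print(kmeans.cluster_centers_)   # average feature profile of each cluster
```

Note that the algorithm only tells you which patients look alike; deciding what each cluster means is still up to you.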

4. Natural Language Processing (NLP)

Have you ever wondered how a chatbot handles the various nuances of language? At its core, this is the machine’s ability to understand language as it is naturally spoken or written: natural language.

Many computing systems that interact with humans are designed to handle and understand speech (grammar, semantics, phonetics) in text and audio formats, and to process it to produce the desired output. In other words, they process your words as they come out naturally to give you the results you want. So a banking chatbot can address your queries, make recommendations, search through your previous conversations or customer history with the bank, and help you resolve your problem.
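
Real chatbots rely on far richer language models, but as a small illustration, here is the very first step most NLP pipelines share: turning free text into numbers a model can work with. The example assumes a recent version of scikit-learn; the queries are invented.

```python
# Minimal NLP sketch: converting free text into numeric features (bag of words).
from sklearn.feature_extraction.text import CountVectorizer

queries = [
    "what is my account balance",
    "I want to reset my password",
    "show my recent transactions",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(queries)

print(vectorizer.get_feature_names_out())  # the vocabulary learned from the text
print(X.toarray())                         # word counts per query
```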

5. Neural Network / Deep Learning

A Neural Network takes inspiration from the human brain and nervous system, which are controlled by the body’s command center, the brain. It is a system of hardware and software patterned after the operation of neurons (the brain cells that control responses) in the human brain. The artificial neurons are arranged in tiers and work in parallel to process information and produce the desired output.

The first tier receives the raw information, and each following tier receives the output from the tier preceding it, simulating the way neurons function in the brain. The optic nerves receive visual information and process it, passing the output (what is it?) to the next tier, which processes it further to give a direction (what to do?) to the next tier, and so on, until the last tier produces the final output (the response to the image).

In machines trained with Neural Networks, the system is trained to be adaptive and modify on its own by processing subsequent data, just as the human brain processes information on its own and learns from experiences.

Deep Learning is a type of learning that uses the layered approach of Neural Networks. Each layer processes the information it receives before sending it on to deeper layers for further classification and transformation of the original data.

Essentially, the “deep” in Deep Learning just refers to the number of layers, or how deep the layers go, and more layers (and consequently more depth) generally give the network a better ability to analyze the data.

For example, if you want to classify photos of different things and find only the pictures of cars, then there would need to be enough layers detecting car characteristics (and potentially eliminating pictures with non-car characteristics) to be able to successfully say that a Lamborghini is a car but a motorcycle is not.
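
As a toy sketch of the layered idea, here is a small neural network with two hidden layers, built with scikit-learn's MLPClassifier on a synthetic dataset. Real image classifiers use convolutional networks and far more data; this only shows a "deep-ish" model learning a pattern a straight line could not capture.

```python
# Small neural network sketch: two hidden layers learning a nonlinear boundary.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32, 16),  # two layers of "neurons"
                    max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```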

6. Reinforcement Learning

You can think of Reinforcement Learning like training a dog. If the dog comes when you call it, you give it a treat to reinforce what it has learned. Similarly, Reinforcement Learning rewards a Machine Learning algorithm when it makes correct decisions and penalizes it when it makes incorrect ones.

Some common words used in Reinforcement Learning are agent, environment, state, action, and reward.

An agent is the one carrying out actions, bounded by the environment. Depending on the state, the agent will evaluate the possible actions and choose the one with the greatest reward. To encourage the agent to weigh what is best now against what pays off later, future rewards are often discounted.

Imagine you are at a bank, trying to pick the next investment. Here, you would be the agent and the state would be the current interest rates and other financial information. The rewards would be the expected outcomes of different investments, discounted to the present value for you to evaluate each one. The actions would be the decisions you make, to invest or not, and the environment would be the regulations or agreements that you need to make with the bank.

In terms of machine learning, the agent would be the algorithm, the actions would be the available decisions, the environment would be the data, the state would be the current situation the algorithm finds itself in, and the reward would be a signal of whether its decision was right or wrong.
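
To make agent, state, action, and reward concrete, here is a toy Q-learning sketch in Python. The environment (a short track the agent must walk to the right), the rewards, and all the hyperparameters are invented for illustration; real reinforcement learning problems are far larger.

```python
# Toy Q-learning: an agent on a 5-cell track learns to walk right toward a
# reward at the last cell. Everything here is made up for illustration.
import numpy as np

n_states, n_actions = 5, 2           # actions: 0 = step left, 1 = step right
goal = n_states - 1
Q = np.zeros((n_states, n_actions))  # the agent's table of expected rewards
alpha, gamma, epsilon = 0.5, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(300):
    state = 0
    while state != goal:
        # epsilon-greedy: mostly exploit what we know, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: the discounted future reward pulls the agent forward
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.argmax(axis=1))  # learned policy per state (1 = step right); the goal state is terminal
```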

7. Outlier

Outliers are noisy data points that skew the results of an otherwise sound machine learning model.

Look at the standard normal curve, where mu (µ) is the mean and sigma (σ) is the standard deviation.

[Figure: standard normal distribution with bands at one, two, and three standard deviations from the mean]

In this normally distributed data, the outliers lie in the green area and beyond, two or more standard deviations from the mean. What you are interested in is the area covered by red and blue, which includes 13.59% + 34.13% + 34.13% + 13.59% = 95.44% of observations. Anything outside of that range is likely to be a fluke, a statistically unusual event, an anomaly, or what we call an outlier.

Outliers could be good or bad in different situations. For example, if your cybersecurity analysts are spending time chasing down outlier events, they could be wasting time and money. On the other hand, if your air quality monitor keeps finding outliers, it might be a signal that the system needs calibration.
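
A minimal sketch of flagging outliers by how many standard deviations a value sits from the mean, using NumPy on synthetic sensor readings (the threshold of 3 is a common convention, not a universal rule):

```python
# Flag values more than 3 standard deviations from the mean as outliers.
import numpy as np

rng = np.random.default_rng(0)
readings = rng.normal(50, 5, 1000)   # e.g., synthetic sensor readings
readings[::250] += 40                # inject a few anomalies

z_scores = (readings - readings.mean()) / readings.std()
outliers = readings[np.abs(z_scores) > 3]
print(len(outliers), "outliers found:", outliers)
```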

8. Hyperparameter / Hyperparameter Optimization / Hyperparameter Tuning

To understand Hyperparameters, we first need to know what a parameter is. A parameter is a value that the model learns during training; parameters are the core building blocks of a model. When you have a model and want to evaluate how well it works, you look at the parameters, the factors that determine the model’s behavior. A simple example of parameters is the coefficients in a linear regression model, y = mx1 + nx2 + b: the model learns the best coefficients m and n (and the intercept b) from the whole dataset.

Hyperparameters, in contrast, are usually model-specific properties that are fixed before you train and test your model on the data. Ideally, you set different values for the hyperparameters before training and decide which ones work best by testing them. This testing phase is Hyperparameter Optimization.

Hyperparameter optimization is part of the Hyperparameter Tuning process, which is adjusting possible hyperparameters to find the best working model. Hyperparameter Tuning is a recommended process, as it affects the performance of the model and the quality of results obtained.
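
A hedged sketch of tuning in practice, using scikit-learn's grid search on a synthetic dataset. The model and the grid of hyperparameter values are arbitrary choices made for illustration; in a real project you would pick ranges suited to your data.

```python
# Hyperparameter tuning sketch: try a grid of settings, keep the best one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [50, 100],   # hyperparameter: number of trees
    "max_depth": [3, 5, None],   # hyperparameter: how deep each tree may grow
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the hyperparameter values that worked best
print(search.best_score_)   # cross-validated accuracy for that setting
```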

9. Modeling

In Data Science, there are several types of Modeling.

  1. Data modeling documents the flow of data in a complex software system in an easy-to-understand diagram, using text and symbols. It serves as a draft or prototype of the proposed software while ensuring the efficient use of data.
  2. Statistical modeling uses mathematical relationships between variables in the data and statistical assumptions to establish a scenario.
  3. Predictive modeling uses statistical formulas and predictor variables to predict future outcomes. The purpose is for the model to perform in similar situations with unseen or new data.
  4. In Machine Learning, modeling refers to the process of training a machine to learn from the data without relying on rules-based programming. Different learning algorithms can be used to build a model to predict the labels from available features.

10. Training

In simple terms, training means calculating the coefficients of a machine learning model. For simple linear models, this means finding m and b in y = mx + b.

Training a model can take minutes or days, depending on the volume and complexity of data. Regardless of the algorithm used, the approach is basically the same. The computer guesses at the initial value, then it calculates the error, then it tries again with a more informed guess.

You train a model using training and test datasets. By convention, that usually means taking some historical data and splitting it in some arbitrary manner, say 80:20, and feeding the training data into the model. After the model is built, you use the test data to calculate how accurate it is. After each training run, you can compare the hyperparameters and choose the ones that give the best performance.
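
Here is what that 80:20 convention might look like in code, sketched with scikit-learn on synthetic data:

```python
# Training sketch: split the data 80:20, fit on the 80%, evaluate on the 20%.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=3, noise=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)              # the training step
print("R^2 on held-out data:", model.score(X_test, y_test))   # the evaluation step
```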

11. Training Dataset / Validation Dataset / Test Dataset

There are three different types of datasets in Machine Learning. As the name suggests, a Training Dataset is the dataset you use to train a model. A Validation Dataset is a held-out dataset you feed into your candidate Machine Learning models to measure their accuracy; it is not used for training, but to find the most accurate model. A Test Dataset is the dataset you make predictions on using the final model.

[Figure: training, validation, and test datasets]

12. Cross-Validation

Cross-Validation is a method for determining how accurate a model is by dividing the data into a number of smaller sets. For example, we could divide the data into five sets, use four sets to build a model and one set to validate it. This process is repeated five times because we have five sets. The important practice is never to mix up the training and validation sets. By keeping them separated, the accuracy of the model on the held-out set gives a better idea of how the model performs on data it has never seen before. In other words, Cross-Validation helps identify how generalizable the model is.

[Figure: cross-validation folds]
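
A minimal sketch of five-fold cross-validation with scikit-learn (synthetic data, arbitrary model choice):

```python
# 5-fold cross-validation: five train/validate splits, five accuracy scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # accuracy on each of the five held-out folds
print(scores.mean())  # a more stable estimate of how well the model generalizes
```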

13. Ensemble

An Ensemble combines two or more Machine Learning models, which usually gives a better result than any of them would have produced separately. It’s like the Wisdom of the Crowd concept: “it is possible that the many, though not individually good men, yet when they come together may be better, not individually but collectively, than those who are so, just as public dinners to which many contribute are better than those supplied at one man’s cost” – Aristotle* (Aristotle: Politics III.1281b. Translated by H. Rackham, Loeb Classical Library). In mathematical form, if y1 = mx + b and y2 = nx + c, then the ensemble would be y3 = w1(mx + b) + w2(nx + c) where w1 + w2 = 1.
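
The weighted-average formula above translates directly into code. In this sketch the two models and the weights w1 = 0.6, w2 = 0.4 are arbitrary choices on synthetic data; in practice the weights are themselves something you would tune.

```python
# Ensemble sketch: blend two models' predictions with weights that sum to 1.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=5, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y1 = LinearRegression().fit(X_train, y_train).predict(X_test)
y2 = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train).predict(X_test)

w1, w2 = 0.6, 0.4          # weights must sum to 1
y3 = w1 * y1 + w2 * y2     # the ensemble prediction

for name, pred in [("linear", y1), ("tree", y2), ("ensemble", y3)]:
    print(name, mean_squared_error(y_test, pred))
```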

14. Model Explainability

Model Explainability is the ability to explain a model’s behavior in human terms. It can answer questions like:

  • What features in the data were considered most important by the model?
  • How did each feature in the data affect a particular prediction in the model?

Model explainability also helps you apply the insights obtained to debugging, feature engineering, planning future data collection, and decision making, and it builds trust in the model so that non-technical people are more comfortable using it.
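
One common explainability tool is permutation importance, which asks how much a model's score drops when each feature is shuffled. Here is a hedged sketch with scikit-learn; the data is synthetic and the feature names are invented for readability.

```python
# Explainability sketch: which features matter most to this model?
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=4, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

feature_names = ["age", "income", "credit_score", "region"]  # invented labels
for name, score in zip(feature_names, result.importances_mean):
    print(f"{name}: {score:.3f}")  # bigger score drop = more important feature
```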

15. Confusion Matrix

A Confusion Matrix is a table that is used to determine how sensitive, precise, and specific (among other things) a given classification model is. This is done by setting predicted values against actual values to see how correct and incorrect our approach was. It gets a bit more complicated, but at a basic level, a confusion matrix helps to give a data scientist an idea of how well a classification model they built actually classifies things.
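
In code, a confusion matrix is one function call once you have actual and predicted labels. A tiny sketch with made-up labels:

```python
# Confusion matrix sketch: rows are actual classes, columns are predictions.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = positive, 0 = negative
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Output layout:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_actual, y_predicted))
```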

16. Accuracy / Precision / Recall

Let’s look at the example below to understand the next set of terms easily.

Suppose we have 100 patients in a hospital, where 60 of them actually have cancer (positive) while 40 do not (negative). How accurate is our doctor in his diagnosis? There are 4 possible outcomes.

  1. TP (True Positive) – The doctor will say the patient has cancer when they do.
  2. TN (True Negative) – The doctor will say the patient does not have cancer when they do not. 
  3. FP (False Positive) – The doctor will say the patient has cancer when they do not.
  4. FN (False Negative) – The doctor will say the patient does not have cancer when they do.

Suppose the doctor’s diagnosis of 100 patients is: 

                               Doctor’s diagnosis: has cancer    Doctor’s diagnosis: no cancer
Reality: has cancer            True Positive = 40                False Negative = 20
Reality: doesn’t have cancer   False Positive = 10               True Negative = 30

Accuracy asks, “What proportion of the predictions were correctly identified?” To find the accuracy for the example above, we would look at the proportion of diagnoses the doctor identified correctly, which would be: (TP + TN) / (TP + FP + TN + FN) = 70 / 100 = 70%.

Although our doctor predicted the patients’ conditions with 70% accuracy, that does not tell the whole story. Moreover, each of the four boxes has a different impact on the patient receiving the news. For example, if the doctor diagnosed a patient as cancer-free when he or she actually does have cancer, what happens when that diagnosis is later corrected? To better understand each situation, we can look at precision and recall.

Precision only looks at the positive predictions: “What proportion of the positive predictions were correctly identified?” In our example, we would find the proportion of the patients diagnosed with cancer who actually have cancer: TP / (TP + FP) = 40 / (40 + 10) = 80%.

Lastly, Recall, sometimes called sensitivity, answers “What proportion of the actual positive instances was correctly predicted?” Following our example, we would look at the proportion of patients with cancer who were correctly identified by the doctor: TP / (TP + FN) = 40 / (40 + 20) ≈ 67%.
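
For completeness, here is the same doctor example recomputed with scikit-learn's metric functions (the 100 diagnoses are rebuilt from the counts in the table above):

```python
# Accuracy, precision, and recall for the doctor example: 40 TP, 10 FP, 20 FN, 30 TN.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_actual    = [1] * 40 + [0] * 10 + [1] * 20 + [0] * 30   # 1 = patient has cancer
y_predicted = [1] * 40 + [1] * 10 + [0] * 20 + [0] * 30   # the doctor's diagnosis

print(accuracy_score(y_actual, y_predicted))    # 0.70
print(precision_score(y_actual, y_predicted))   # 0.80
print(recall_score(y_actual, y_predicted))      # ~0.67
```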

17. ROC / ROC AUC

A ROC (Receiver Operating Characteristic) curve is a plot of the rate of true positives and false positives at different thresholds. To better grasp the concept, let’s use a cybersecurity example.

The goal of a cybersecurity system is to tell you with some degree of accuracy whether you are being hacked or not. Its outcome would be discrete, meaning either “yes, you are being hacked” (positive) or “no, you are not being hacked” (negative).

To illustrate, look at the table below.

                                    Prediction: you are being hacked    Prediction: you are not being hacked
Reality: you are being hacked       True Positive (correct) = a         False Negative (wrong) = b
Reality: you are not being hacked   False Positive (wrong) = c          True Negative (correct) = d

Now, let’s introduce some definitions. 

  1. Sensitivity is the True Positive Rate, or the probability of correctly identifying you are being hacked
    –> a / (a + b). 
  2. Specificity is the True Negative Rate, or the probability of correctly identifying you are not being hacked
    –> d / (c + d).

The ROC curve is a graph of the TPR on the vertical (y) axis and the FPR (1 – TNR) on the horizontal (x) axis at different threshold levels.

[Figure: ROC curves]

Let’s assume that the cybersecurity software uses logistic regression to calculate the probability of a positive or negative event. By default, logistic regression uses a threshold of 50% probability to flag an outcome as positive. But if your cybersecurity system is giving false alarms all the time, you could raise that threshold, which moves the operating point along the curve toward fewer flagged positives, and therefore fewer false alarms.

The ROC AUC, or just AUC (Area Under the Curve), is simply a measure of how accurate your model is, calculated as the area under the ROC curve across all thresholds.

If the model is perfectly identifying each class, then the AUC would be 1.

In most cases, we would see something in between. If the AUC is 0.7, it means there is a 70% chance that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one.

If it cannot distinguish positive versus negative, then AUC would be 0.5.

If the model is predicting each class in reverse, we would have AUC = 0.
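
A short sketch of computing a ROC curve and its AUC with scikit-learn on synthetic data (the logistic regression model is an arbitrary choice for the demo):

```python
# ROC / AUC sketch: score predicted probabilities against the true labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, probs))      # 1.0 = perfect, 0.5 = random guessing
```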

18. Data Leakage

Data Leakage is typically an unintended escape of data from where it is supposed to be to where it is not supposed to be. One example of Data Leakage can happen in the process of dividing a dataset into training and test sets to perform Cross-Validation.

As described in Cross-Validation, the training and test set should not have any overlapping data points. If there are some data in both sets, the model could perform better than it should because it already has the correct answers.

So basically, any time a model has access to data it’s not supposed to have, you have a data leak. And data leaks frequently make a model appear to work better than it really does.
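
One everyday leak happens when preprocessing (such as feature scaling) is fitted on the full dataset before splitting, letting test-set statistics seep into training. A hedged sketch of the safer pattern, using a scikit-learn pipeline on synthetic data:

```python
# Avoiding a common leak: fit preprocessing on the training data only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler inside the training step, so the test data
# never influences it. Fitting the scaler on all of X first would be a leak.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```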

19. Dimensionality Reduction

Dimensionality Reduction is the process of reducing the number of independent variables in a given dataset to improve processing time and make data visualization easier. 

By decreasing the complexity of the dataset, less computational power is required to do the calculations, and as a side benefit, the data becomes easier to visualize. The cost of dimensionality reduction is that some portion of the information in the data is lost, and if that lost information is important, we may see a drop in performance.
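
A brief sketch using PCA, one common dimensionality reduction technique, to compress 10 synthetic features down to 2 with scikit-learn:

```python
# Dimensionality reduction sketch: 10 features -> 2 components via PCA.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # (500, 10) -> (500, 2)
print(pca.explained_variance_ratio_.sum())  # share of the variance that survives
```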

20. Overfitting / Underfitting

Overfitting means that the model is capturing noise or following the training data so closely that it cannot work well on new data. Underfitting is when the model is neglecting important information in the data.

Let’s assume a model that says:

Probability of default (p) = a (education level) + b (credit score) + c (income) + d (favorite color) + e (car parked in lot number 4)

You made some loans, and many borrowers turned out to be good at repaying them. The good borrowers happened to state purple as their favorite color. Moreover, there was always a car parked in lot 4 when you made the loans. These two facts have nothing to do with the borrowers not defaulting, but the model is capturing them, so you’ve overfitted. The usual way to fix overfitting is to chop off those extraneous variables, as they are leading you in the wrong direction.

It’s important to understand how much of the data you need to capture. To better understand the balance, let’s look at the trade-off between bias and variance.

There can be many ways to define Model Complexity. For the purposes of understanding overfitting, let’s assume that the more information from the data you use, the more complex the model becomes. As you increase model complexity, the bias is reduced because the model uses more and more information, but past a point this results in overfitting: the model is no longer general enough, so the variance increases. Plotting both against complexity reveals a sweet spot where the total error (Total Error = Bias² + Variance + Irreducible Error) is at its minimum. That is the optimal level of complexity we would like to achieve when building a model.
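
To see the trade-off numerically rather than on a graph, here is a hedged sketch that fits polynomials of increasing degree to noisy synthetic data. Training scores typically keep improving with complexity, while the validation score peaks and then falls off as the model starts to overfit.

```python
# Over- vs underfitting sketch: compare training and validation R^2 as the
# polynomial degree (our stand-in for model complexity) grows.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 100)   # noisy nonlinear relationship

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in [1, 3, 15]:                        # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_val, y_val))
```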