The data demands of most businesses are growing by the day—and the latest figures average 163 terabytes (163,000 gigabytes) of information per company. Storing this information is one thing. Analyzing it to derive accurate trends and business metrics is another goal altogether.
That's where data mining comes into play. Through the application of machine learning (ML) algorithms, data mining helps you understand what your customers want, detect fraud, improve workforce efficiency, find opportunities, and forecast the future.
There are many examples of data mining successes and failures like these:
- An insurance company boosted workplace efficiency by up to 22% by implementing a work-from-home policy after ML algorithms learned that the company’s employees worked better at home.
- Researchers discovered that using data mining to find terrorists produces 100,000 false positives for every real terrorist it finds.
To boost your chances of data mining success (and prevent a disaster), we’ve prepared this guide to the data mining process, its main techniques, and its current applications.
The Data Mining Process
The data mining process involves five stages:
1. Understanding the Goals of Your Data Mining Project
The first stage of data mining defines how the process will support your business goals. For example, what areas of business do you want to improve through data mining?
Do you want to make your product recommendation systems better like Netflix did?
Do you want to understand your customers better through personas and segmentation?
After codifying your data mining goals, you can develop a project timeline, define key actions, and assign roles for completing the project.
2. Understanding Your Data
In the next stage, you’ll assess your data sources. Data visualization tools like Google Data Studio, Tableau, or Grapher allow you to explore the properties of your data to decide which information will be useful to achieve your goals. Understanding your data also helps you determine which data mining strategies will produce the insights you want.
3. Preparing the Data (ETL)
In the data preparation stage, you'll use ETL (extract, transform, load) strategies to prepare your data for analysis. You can use an automated, cloud-based ETL solution like Xplenty to extract your data from different business applications, cloud-based SaaS platforms, and other sources—then transform the information and optimize it for high-speed analysis. Ultimately, the ETL process cleanses the data, addresses missing information, and makes sure your data mining applications can analyze the information as a whole.
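To make the three ETL steps concrete, here is a minimal sketch in plain Python. It is not how Xplenty or any other ETL platform works internally. The sample data, the table name, and the choice to fill a missing spend value with 0.0 are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a business application (illustrative data only).
RAW_CSV = """customer_id,email,monthly_spend
1,ANA@EXAMPLE.COM,120.50
2,bob@example.com,
3,  cy@example.com ,87
"""

def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize emails and fill missing spend with 0.0 (an assumption)."""
    cleaned = []
    for row in rows:
        spend = row["monthly_spend"].strip()
        cleaned.append((
            int(row["customer_id"]),
            row["email"].strip().lower(),
            float(spend) if spend else 0.0,
        ))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into one analysis-ready table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, email TEXT, monthly_spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT * FROM customers ORDER BY customer_id").fetchall()
```

After the load step, every row has a consistent email format and a numeric spend value, so an analysis tool can query the table as a whole. How to handle missing values in practice depends on your data and goals.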
4. Analyzing, Mining, and Modeling the Data
At the heart of the data mining process, you’ll introduce the prepared data to business intelligence (BI) tools—like Tableau Server, Looker, InsightSquared, Amazon QuickSight, or Microsoft Power BI. These tools will use different machine learning algorithms to mine the data for patterns and forecast future trends. More on this below!
5. Reviewing and Sharing the Findings Across the Organization
In the last stage of data mining, your data team and key decision-makers will study the results to decide:
- If the findings are accurate
- If they support your goals
- How to act on them
- How to share the findings with your team
When it comes to sharing your data-mining results, most enterprise-level BI platforms allow you to distribute key findings across your organization quickly and efficiently.
Techniques and Strategies of Data Mining
In this section, we'll discuss the two types of machine learning algorithms used in data mining: supervised learning and unsupervised learning.
1. Supervised Learning in Data Mining
Supervised learning algorithms analyze “labeled data”—where the appropriate classifications are identified—to learn specific analytical patterns. After learning from this training data, a supervised learning algorithm can recognize the same patterns in new datasets to interpret the information, predict future events, and create valuable insights and models.
Supervised Learning Examples
For example, a supervised machine learning algorithm could learn the characteristics of emails labeled “spam,” and automatically label new messages as spam when they have the same characteristics.
As another example, imagine you want to predict which high school students will do well on the SAT based on demographic and educational data. With supervised learning, you can take information from previous test takers and label it with grades of “poor,” “average,” “good,” and “excellent.” After learning this model, the supervised learning algorithm can predict the performance of future students based on their demographic and educational data.
Types of Supervised Learning Algorithms
Here are some common supervised learning algorithms:
K-Nearest Neighbor Algorithms: Instead of building a complex model of the data, K-nearest neighbor algorithms classify a new item by comparing it with the most similar labeled items. For example, you could assign a business to a type based on comparable businesses, or assign a song to a genre based on the labeled songs it most resembles.
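As a rough sketch of the idea, the Python below classifies a new song by majority vote among its three nearest labeled neighbors. The song features, the values, and the function name knn_predict are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: (tempo in BPM, loudness score) -> genre.
songs = [
    ((60, 2), "classical"), ((70, 3), "classical"), ((65, 1), "classical"),
    ((140, 9), "dance"), ((128, 8), "dance"), ((135, 7), "dance"),
]

prediction = knn_predict(songs, (130, 8))
```

In this sketch the features are left unscaled, so tempo dominates the distance. With real data you would usually put the features on comparable scales first.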
Here’s a visual representation of a K-nearest neighbor classification map:
Linear Regressions: This algorithm predicts the value of an unknown variable by analyzing other variables. For example, you could train a linear regression model with data pertaining to recently-sold businesses (using data that includes business type, location, size, sale price, sale date, etc.). The linear regression model could then forecast the market value of another business based on location, sector, or a future sale date.
Linear analysis could also reveal a trend of increasing monthly sales and forecast the trend into future months. Furthermore, it could zero-in on unique factors—like a new ad campaign or a change in packaging—to predict the effect of one or more factors on sales revenue.
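The monthly sales forecast described above can be sketched with an ordinary least-squares line. This is a minimal illustration with made-up sales figures, not the method of any particular BI tool.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical monthly sales (in thousands) for months 1 through 6.
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast_month_8 = slope * 8 + intercept  # extend the trend two months ahead
```

Here the slope is the average monthly increase in sales, and the forecast simply extends that trend. Real sales data is noisier, so the fitted line would only approximate the points.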
Here’s an illustration of random data points and their linear regression:
Logistic Regressions: This algorithm estimates the probability of a yes-or-no outcome from one or more input variables. A logistic regression could analyze a dataset to answer these yes-or-no questions:
- Does the number of cigarette packs you smoke per day influence the chances of getting lung cancer (yes or no)?
- Does heart attack risk increase with age (yes or no)?
For logistic regression to work, the outcome variable needs to be “dichotomous.” In other words, the result you’re predicting must be a “yes-or-no” answer, even though the input variables (like age or packs smoked per day) can take many values.
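To show what a yes-or-no prediction looks like in practice, here is a small sketch that fits a one-variable logistic regression with gradient descent. The data is invented, and the learning rate and epoch count are arbitrary assumptions rather than recommended settings.

```python
import math

def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: x = standardized age, y = 1 if a heart attack occurred.
ages = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
outcomes = [0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(ages, outcomes)
risk_young = sigmoid(w * -1.5 + b)  # estimated probability for the youngest
risk_old = sigmoid(w * 1.5 + b)     # estimated probability for the oldest
```

A positive weight w means the estimated risk rises with age in this toy dataset. The model outputs a probability, which you turn into a yes-or-no answer by choosing a cutoff such as 0.5.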
Neural Networks: This algorithm is modeled after the human brain. Neural networks learn through repetition over time:
“First, a collection of software ‘neurons’ are created and connected together, allowing them to send messages to each other. Next, the network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure.”
Neural network models are useful when machine learning systems require fast, rapid-fire responses—like in driverless vehicle technology.
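The "strengthen what works, weaken what fails" idea can be illustrated with a single artificial neuron trained by the classic perceptron rule. This is a deliberately tiny sketch: real neural networks stack many neurons in layers and train them with different methods. The task (learning logical AND) is just a toy example.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train one artificial neuron with the perceptron learning rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output  # 0 when correct, +1 or -1 when wrong
            # Strengthen or weaken each connection in proportion to the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

# Toy problem: learn the logical AND of two inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
```

Each wrong answer nudges the connection weights, and after a few passes over the examples the neuron answers all four cases correctly.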
Time Series Forecasting: These machine learning models predict the best timing for specific actions. For example, a vehicle manufacturer could analyze past data with a time series model to predict when it's necessary to restock parts inventories. A retailer could use time series forecasting to schedule the release of a new product. In one example, a data scientist used the time series forecasting tool Prophet to predict future demand for furniture and office supplies, including the best and worst months for selling such items.
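As a much simpler stand-in for a tool like Prophet, the sketch below forecasts next month's demand with a moving average and estimates how long the current stock will last. The usage numbers, stock level, and function names are hypothetical.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def months_until_restock(stock, history, window=3):
    """Estimate how many whole months the stock lasts at the forecast rate."""
    monthly_demand = moving_average_forecast(history, window)
    return int(stock // monthly_demand)

# Hypothetical monthly part usage for a vehicle manufacturer.
parts_used = [90, 100, 110, 120, 100, 110]
next_month = moving_average_forecast(parts_used)
months_left = months_until_restock(500, parts_used)
```

A moving average ignores seasonality and long-term trends, which dedicated time series models are built to capture, so treat this only as a baseline.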
Decision Trees (a.k.a., Classification or Regression Trees): These are predictive modeling techniques that forecast outcomes based on a set of binary rules. By following the rules, a decision tree algorithm produces the same result with the same input.
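A decision tree's binary rules translate directly into nested conditions. The sketch below hand-codes a small tree for deciding whether to play outside. The specific rules and the humidity cutoff are illustrative assumptions; a real decision tree algorithm would learn them from data.

```python
def should_play_outside(outlook, humidity, windy):
    """Walk a fixed set of binary rules from the root to a leaf."""
    if outlook == "sunny":
        return humidity <= 70  # sunny: play only if it is not too humid
    if outlook == "rainy":
        return not windy       # rainy: play only if it is calm
    return True                # any other outlook (e.g. overcast): play

decision = should_play_outside("sunny", humidity=65, windy=False)
```

Because the rules are fixed, the same input always follows the same path through the tree and produces the same answer.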
Here’s a simple decision tree for playing outside:
2. Unsupervised Learning
Unsupervised learning algorithms are exactly as they sound. Instead of supervising the algorithm with labeled data, the algorithm works on its own to discover a new model/organization/pattern that may not be visible to the human eye. By drawing conclusions from "unlabeled" data like this, unsupervised learning creates novel systems for analyzing, categorizing, and deriving insights from raw information. In other words, the algorithm discovers new structures of interest that it later reports to you.
Data mining through unsupervised learning is particularly valuable when dealing with massive datasets that lack an underlying structure. It can help you find patterns and meanings in datasets that you don’t immediately understand or know what to do with.
Data scientists may use supervised and unsupervised learning separately, but they often combine them. When combined, unsupervised learning identifies the structures that serve as “training data” for the supervised learning process. Since unsupervised learning is a compute-heavy process, make sure the insights you want are worth the expense.
Unsupervised Learning Examples
Remember the above example about the high school students taking the SAT? With unsupervised learning, you’re dealing with an "unlabeled" dataset, so there’s no training data that labels each student’s test result. Nevertheless, you can still use unsupervised learning algorithms to analyze the demographic and educational data of a group of students to find patterns. For example, you could mine raw student information that includes demographic, educational, and SAT score data with an unsupervised learning algorithm. Then the algorithm might discover a correlation between demographic information and SAT results—without you even looking for it.
Alternatively, unsupervised learning could use a k-means clustering algorithm to group together "look-a-like" students or customers within a dataset. That information might inform new education standards or novel marketing strategies.
With unsupervised learning, there’s no correct answer. These systems find their own logic, so they can produce surprisingly new ways to solve problems and understand your data.
Types of Unsupervised Learning Algorithms
Here are some common unsupervised learning models:
Clustering: This is an unsupervised learning process that groups similar items together and separates them from dissimilar ones. As mentioned above, you can use clustering to identify “look-a-likes.” For example, you might learn that 25% of your customers are aged 45 to 50, female, and enjoy red wine. This information could prove valuable when targeting new customers in online advertising campaigns.
Ultimately, clustering identifies relationships between objects in an unstructured dataset to provide a meaningful, searchable, and analyzable structure.
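Here is a minimal k-means sketch in plain Python that groups hypothetical customers into two clusters. The customer data and the fixed starting centroids are assumptions. Production implementations typically choose starting points randomly and run several times.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid in place
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (age, monthly wine purchases).
customers = [(46, 8), (48, 9), (50, 7), (22, 1), (25, 2), (24, 0)]
# Starting centroids are fixed here so the result is reproducible.
centroids, clusters = k_means(customers, centroids=[(46, 8), (22, 1)])
```

No labels were supplied, yet the algorithm separates the older, frequent buyers from the younger, occasional ones. Each final centroid describes the "typical" member of its group.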
Here's a representation of a k-means cluster analysis:
Association Rule Learning: This is a popular algorithm for market researchers. Association learning looks for interesting relationships between variables in massive datasets to reveal items that frequently occur together.
For example, the system might discover that women aged 30 to 40 like to buy products with a specific shade of red. This would tell product designers to include that color in a new product line. Retailers can also use association analysis to find pairs of products that customers buy together and use the information for better purchase recommendations in an online marketplace.
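Finding products that customers buy together usually comes down to two measures: support (how often a pair appears across all baskets) and confidence (how often the second item appears when the first does). The sketch below computes both for item pairs. The baskets and the 0.4 support threshold are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.4):
    """Return {(a, b): (support, confidence of a -> b)} for frequent pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
    return rules

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
    ["bread", "butter", "jam"],
]
rules = pair_rules(baskets)
```

In this toy data, bread and butter appear together in 60% of baskets, and 75% of baskets containing bread also contain butter, which would make butter a reasonable recommendation for bread buyers.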
Principal Component Analysis (PCA): PCA makes data analytics more efficient by combining correlated variables into a smaller set of components that still capture most of the variation in the data. This involves finding the sweet spot between analytical efficiency and analytical accuracy. In most cases, you’ll end up trading a small amount of accuracy in exchange for faster, less compute-heavy data mining.
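To illustrate the trade-off, the sketch below reduces two correlated variables to a single score by finding the first principal component with power iteration. The data is hypothetical, and power iteration is just one simple way to find the component; libraries generally use other numerical methods.

```python
import math

def first_principal_component(rows, iterations=100):
    """Find the direction of greatest variance by power iteration."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vector = [1.0] * dims
    for _ in range(iterations):
        vector = [sum(cov[i][j] * vector[j] for j in range(dims))
                  for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    return means, vector

def project(row, means, component):
    """Reduce one row to a single score along the component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Hypothetical data with two strongly correlated variables (y is about 2x).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]
means, component = first_principal_component(data)
scores = [project(row, means, component) for row in data]
```

Each row is now described by one number instead of two. Because the two variables move together, very little information is lost, and any later analysis has half as many variables to process.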
Data Mining in 2019
Modern organizations are using data mining to inform their business decisions in the following areas:
Understanding Customer Satisfaction and Public Sentiment
Companies are analyzing data from social media platforms through “text mining” to reveal how the public views their products and offerings. Text mining uses natural language processing (NLP) and statistical pattern recognition to understand overall feelings and sentiments based on what people are saying online. Once you understand public sentiment, you can steer your marketing, PR, and product development to improve your reputation.
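At its simplest, sentiment scoring counts positive and negative words. The sketch below does exactly that with a tiny, hypothetical word list and invented posts. Real text mining systems rely on much richer NLP models that handle context, negation, and sarcasm.

```python
import re

# Tiny hypothetical sentiment lexicon; real systems use far larger resources.
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "terrible", "broken", "slow"}

def sentiment_score(text):
    """Return (positive hits - negative hits) for the words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

# Hypothetical social media posts about a product.
posts = [
    "I love this app, the new design is great!",
    "Checkout is slow and the search is broken.",
    "It arrived on Tuesday.",
]
scores = [sentiment_score(post) for post in posts]
```

A positive score suggests a favorable post, a negative score an unfavorable one, and zero a neutral one. Averaging scores over many posts gives a rough read on overall sentiment.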
Targeted Ads, Marketing, and Improved Recommendations
Data mining is helping advertisers identify look-a-like customers, so they can target them with tailored ads and promotions. Companies like Amazon and Netflix use these techniques to offer purchase recommendations based on customer browsing, viewing, and spending habits. Overall, it’s improving user engagement and experience, while boosting sales and retention.
Medical Diagnosis and Patient Risk Assessment
Data mining helps medical researchers improve patient diagnosis and treatment. The statistical models from data mining medical records have allowed doctors to create risk factor warnings and lifestyle recommendations for better preventative care.
When commenting on data mining and medicine, the VP of Machine Learning for Amazon Web Services said, “We believe that machine learning can significantly accelerate the progress of medical research and help translate those advances into treatments and improved experiences for patients.”
Insurance Industry Optimization
Predictive analytics through data mining helps insurance companies understand their customers, and the risks related to accidents, bodily injury, medical conditions, surgical outcomes, and property damage. Data mining also helps insurance companies identify the 1 out of 10 insurance claims that are fraudulent. By comparing one customer’s claim history to thousands, machine learning can find potential cases of fraud.
Credit Risk Assessment
Banks are mining data related to customer credit histories, credit scores, and demographic information, then applying machine learning algorithms to that data to automatically approve or deny loans and calculate more strategic interest rates.
Financial Fraud and White-Collar Crime Prevention
Financial institutions use data mining to red-flag potentially fraudulent transactions, which they pause while requesting customer verification by text or email. These machine learning models monitor customer spending habits to identify transactions that fall outside the norm.
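One simple way to spot a transaction that "falls outside the norm" is to measure how many standard deviations it sits from a customer's usual spending. The sketch below uses that rule with invented amounts and an assumed threshold of 3. Production fraud models combine many more signals than the amount alone.

```python
import statistics

def flag_unusual(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard deviations
    from the mean of the customer's past transactions."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        return amount != mean  # identical history: any change is unusual
    return abs(amount - mean) / spread > threshold

# Hypothetical past card transactions for one customer, in dollars.
past_amounts = [42.0, 38.5, 51.0, 45.2, 40.3, 47.8]
normal = flag_unusual(past_amounts, 49.0)
suspicious = flag_unusual(past_amounts, 900.0)
```

A $49 purchase fits this customer's pattern and passes, while a $900 purchase is flagged and could be paused until the customer verifies it.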
For example, MIT used machine learning to mine a dataset of 900 million transactions, where 122,000 were confirmed as fraudulent. Using insights from the data, MIT has improved fraud detection models for banks to dramatically reduce instances of financial fraud.
Xplenty: Fueling Your Data Mining Tools with the Information They Need
Now that you understand the basics of data mining, you can appreciate why it’s important to fuel your data mining tools with the highest quality information. That's where Xplenty can help. If you're suffering from a data integration bottleneck, Xplenty's automated, cloud-based ETL platform offers a powerful, no-code interface that makes data integration a snap. Check out our hundreds of automatic integrations here!