Computer science is all about answering the question, "How can we tell a computer what to do and what to solve, while recognizing that some problems are inherently hard?"

Statistics is all about answering the questions, "How can we use historical data to predict the future? What conclusions can we draw from that data?"

Machine learning is all about asking, "How can we build a machine, or better yet a system, that automatically improves with experience? How can we train computers to solve problems, including deciding for themselves how to structure their problems and processes to reach their goals?"

Here are a few tools for data mining and machine learning. Note: data mining is often associated with unsupervised learning, while machine learning in this context usually refers to supervised learning.


The Three D's of AI

The three D's of artificial intelligence describe what it can do: detect, decide and develop.

Detect

AI can discover which elements or attributes in a subject matter domain are the most predictive.

Decide

AI can infer rules from data and weigh the most predictive attributes against each other to make a decision.

Develop

AI can grow and mature with each iteration. It can alter its opinion about the environment as well as how it evaluates that environment. In effect, it can program itself.


The Process

So here's the scenario. Your boss knows that you've been collecting event tracking data through Google Analytics, and she wants to know if there are any interesting patterns you can give to the marketing department for an upcoming campaign. She doesn't really know what's there, but she assigns you the task of finding some form of hidden value. So you go back to your office and start thinking about how you can use data mining to find that value.

These are the kinds of services companies like Palantir offer, but your firm doesn't have the time or money to explore that option, so they decide to do it in-house, starting with you.

In order to generate a useful report for your VP or manager, you first need to export the results in a format they can understand and act on. But before you can export anything, you need to run an algorithm over the data. One step before that, you shouldn't run an algorithm on data you haven't cleaned and processed. And of course, you can't process data before you've collected it.

Therefore, before we can start, we need to gather a few ingredients and tools to process data.


The ingredients

The ingredients are pretty simple. Here's the breakdown:

  • Raw or processed data
  • Tools to process raw data
  • Algorithms to extrapolate processed data
  • Tools to present data in a digestible format for managers or decision makers.
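Those four ingredients line up as a pipeline: collect, process, run an algorithm, export. Here is a minimal Python sketch of that flow; every function body and the sample data are hypothetical stand-ins, not a real library.

```python
# A minimal sketch of the ingredients as a pipeline.
# All data and function bodies are invented stand-ins.

import csv
import io

def collect():
    # Stand-in for pulling raw event data (e.g. a Google Analytics export).
    return "page,clicks\nhome,120\npricing,45\nblog,87\n"

def process(raw):
    # Parse the raw CSV text into rows of typed values.
    reader = csv.DictReader(io.StringIO(raw))
    return [{"page": r["page"], "clicks": int(r["clicks"])} for r in reader]

def analyze(rows):
    # A trivial "algorithm": rank pages by click count.
    return sorted(rows, key=lambda r: r["clicks"], reverse=True)

def export(results):
    # Format the results so a manager can read them at a glance.
    return "\n".join(f"{r['page']}: {r['clicks']} clicks" for r in results)

report = export(analyze(process(collect())))
print(report)
```

Each step only depends on the output of the previous one, which is why the ingredients have to be gathered in that order.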

Data sources

I will add more data sources as time goes on, but for now, please refer to my previous blog entry.

Data types as inputs
Not all data is made or formatted the same. When deciding which dataset to use, I'll need to consider the content, the source (and its credibility), the licenses on that data, and the actual format. Here are a few popular formats:

  • Raw text
    • Separated by carriage return
  • Comma separated values (CSV)
  • Tab separated values (TSV)
  • JavaScript object notation (JSON)
  • Extensible markup language (XML)
  • YAML ain't markup language (YAML)
  • Spreadsheets
    • Microsoft Excel
    • Google spreadsheets
    • FileMaker
  • Databases and data stores
    • Microsoft SQL Server
    • MySQL
    • MongoDB
    • CouchDB
    • Cassandra
    • Redis
    • HBase
    • Memcached
  • Processing frameworks (often paired with the stores above)
    • Spark
    • Storm
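To see how two of these formats compare in practice, here is a short Python sketch that parses the same made-up records from CSV and JSON using only the standard library.

```python
# The same two records expressed in CSV and JSON, parsed with
# Python's standard library. The data values are invented examples.

import csv
import io
import json

csv_text = "name,age\nAda,36\nGrace,45\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'

# CSV gives back strings, so numeric fields need explicit conversion.
csv_rows = [{"name": r["name"], "age": int(r["age"])}
            for r in csv.DictReader(io.StringIO(csv_text))]

# JSON preserves types (numbers, booleans, nesting) on its own.
json_rows = json.loads(json_text)

# Once parsed, both formats yield the same in-memory structure.
assert csv_rows == json_rows
print(csv_rows)
```

The format you receive mostly changes the parsing step; the downstream algorithms see the same in-memory structures either way.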

Algorithms

Algorithms require data to do their work (which is why the next section covers software tools like Spring XD), but first let's list a few notable algorithms:

SUPERVISED

  • Naive Bayes Classifier Algorithm is great for classifying things. It comes in handy because it's too difficult to classify a web page, a document, an email or any other lengthy text manually.
    • SPAM filters were originally built using this algorithm.
    • Classifying content, metadata and user data.
    • Categorizing content, metadata, user preferences and lifestyle information.
    • Sentiment analysis from sources like Twitter.
  • Support Vector Machine Algorithm
  • Linear Regression
  • Logistic Regression
  • Artificial Neural Networks
  • Decision Trees
  • Random Forests
  • Nearest Neighbours

UNSUPERVISED

  • K Means Clustering Algorithm is great for clustering data. It can help you organize words into multiple buckets for cross referencing or triangulation.
  • Apriori Algorithm
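To make the Naive Bayes idea concrete, here is a toy from-scratch classifier in the spirit of the original spam filters. The training messages, the whitespace tokenization, and the tiny vocabulary are all simplified assumptions for illustration; a real filter would train on thousands of labelled emails.

```python
# A toy multinomial Naive Bayes text classifier, written from scratch
# so the mechanics are visible. Training data is invented.

import math
from collections import Counter, defaultdict

train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# Count words per class, documents per class, and the overall vocabulary.
word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def classify(text):
    words = text.split()
    scores = {}
    for label in class_counts:
        # log P(class) plus the sum of log P(word | class),
        # with Laplace (add-one) smoothing for unseen words.
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in words:
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free money"))  # prints "spam"
```

The "naive" part is the assumption that words occur independently given the class; it is wrong in general, but the classifier still works remarkably well for text.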

Decision Trees

ID3

Iterative Dichotomiser 3 was invented by Ross Quinlan to build decision trees from datasets. It chooses each split using the information gain criterion.
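ID3's information gain criterion can be worked through in a few lines: it is the drop in entropy of the class labels after partitioning the data on an attribute. The tiny weather-style dataset below is invented purely for illustration.

```python
# Information gain, the splitting criterion ID3 uses.
# The (attribute value, class label) pairs are an invented example,
# e.g. "windy?" -> "play outside?".

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

data = [("yes", "no"), ("yes", "no"), ("no", "yes"),
        ("no", "yes"), ("no", "yes"), ("yes", "yes")]

labels = [label for _, label in data]
base = entropy(labels)  # entropy before splitting

# Weighted average entropy of the subsets after splitting on the attribute.
split = 0.0
for value in {v for v, _ in data}:
    subset = [label for v, label in data if v == value]
    split += len(subset) / len(data) * entropy(subset)

gain = base - split
print(round(gain, 3))  # prints 0.459
```

ID3 computes this gain for every candidate attribute and splits on the one with the highest value, then recurses on each subset.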

C4.5

C4.5 is Quinlan's successor to ID3; among other improvements, it has the ability to work despite missing attribute values.

CHAID

Chi-squared Automatic Interaction Detection builds trees by choosing splits with chi-squared significance tests.

MARS

Multivariate adaptive regression splines is a regression technique that models non-linear relationships using piecewise linear segments.


Software

  • Weka is a data mining toolkit from the University of Waikato in New Zealand that helps you with:
    • Preprocessing data
    • Clustering
    • Classification
    • Regression
    • Association rules
  • Mahout from the Apache Foundation is an open source toolkit that offers libraries that work with these algorithms.
    • Naive Bayes Classifier
    • K Means Clustering
    • Recommendation Engines
    • Random Forest Decision Trees
    • Logistic Regression Classifier
  • Spring XD is good for processing data.
    • Data ingestion
    • Batch processing
    • Analytics
    • Data exporting

Sources