I'm currently studying statistics for my EMBA at USC, and the best way for me to learn is to both write down my notes and include some code.


Mean

Mean - The sum of all data points divided by the total number of observations.

Ruby

weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

# Return the arithmetic mean: the sum of the values divided by how many there are
def mean(array)
  array.inject(0) { |sum, x| sum + x } / array.size.to_f
end

puts %Q{ Mean Weight: #{mean(weight)}, Mean Height: #{mean(height)} }

R

weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)

weight_mean <- mean(weight)
height_mean <- mean(height)

sprintf("Mean Weight: %1.4f, Mean Height: %1.4f", weight_mean, height_mean)

JavaScript

const util = require("util");
const math = require("mathjs");

let weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

var weightMean = math.mean(weight).toFixed(2);
var heightMean = math.mean(height).toFixed(2);

console.log( util.format("Mean Weight: %s, Mean Height: %s", weightMean, heightMean) );

Ruby is such an elegant language that showing your work is fun, but I love how R has a native mean() function. In the JavaScript example, I'm splitting the difference with a little help from the MathJS package.


Median

Median is the midpoint of the data once it is sorted. Suppose you have 25 observations. The midpoint would be the middle observation, or row 13. With an even number of observations, the median is the mean of the two middle values.

When you are given a mean or median number, consider it the beginning of an adventure.

Ruby

weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

# If the array has an odd number of items, pick the one in the middle.
# If the array size is even, take the mean of the two middle items.
def median(array, already_sorted=false)
  return nil if array.empty?
  array = array.sort unless already_sorted
  m_pos = array.size / 2
  # Average the two middle items inline so this snippet does not depend on mean()
  array.size.odd? ? array[m_pos] : (array[m_pos - 1] + array[m_pos]) / 2.0
end

puts %Q{ Median Weight: #{median(weight)}, Median Height: #{median(height)} }
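
Both sample arrays have 15 items, so the example above only exercises the odd-length branch. As a quick check of the even-length case, here is a minimal sketch. The name median_of and the two small samples are made up for illustration and are not part of the original data.

```ruby
# Hypothetical helper, restated here so the snippet runs on its own.
def median_of(values)
  return nil if values.empty?
  sorted = values.sort
  mid = sorted.size / 2
  # Odd count: the single middle value. Even count: mean of the two middle values.
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

odd_sample  = [58, 59, 60, 61, 62]      # 5 observations, middle is the 3rd
even_sample = [58, 59, 60, 61, 62, 63]  # 6 observations, middle is between the 3rd and 4th

puts "Odd: #{median_of(odd_sample)}, Even: #{median_of(even_sample)}"
# Prints: Odd: 60, Even: 60.5
```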

R

weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)

weight_median <- median(weight)
height_median <- median(height)

sprintf("Median Weight: %s, Median Height: %s", weight_median, height_median)

JavaScript

const util = require("util");
const math = require("mathjs");

let weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

var weightMedian = math.median(weight).toFixed(2);
var heightMedian = math.median(height).toFixed(2);

console.log( util.format("Median Weight: %s, Median Height: %s", weightMedian, heightMedian) );

Mode

The mode is the data point that is most prevalent in the data set. It represents the most likely outcome in a dataset.

Ruby

weight = [115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

# The mode is the most frequent item in the array.
# With find_all, every item tied for most frequent is returned; nil means nothing repeats.
def modes(array, find_all=true)
  histogram = array.inject(Hash.new(0)) { |h, n| h[n] += 1; h } 
  modes = nil
  histogram.each_pair do |item, times|
    modes << item if modes && times == modes[0] and find_all
    modes = [times, item] if (!modes && times>1) or (modes && times>modes[0]) 
  end
  return modes ? modes[1...modes.size] : modes 
end

puts %Q{ Mode Weight: #{modes(weight)}, Mode Height: #{modes(height)} }
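
On newer Rubies the counting step can be shorter. Here is a minimal sketch, assuming Ruby 2.7 or later for Enumerable#tally. The name modes_with_tally is hypothetical, and unlike modes above it returns an empty array rather than nil when no value repeats.

```ruby
# Hypothetical alternative to modes(); needs Ruby 2.7+ for Enumerable#tally.
def modes_with_tally(values)
  counts = values.tally              # e.g. {115=>2, 117=>1, ...}
  top = counts.values.max            # highest frequency, or nil for an empty array
  return [] if top.nil? || top < 2   # nothing repeats, so there is no mode
  # Keep every value that is tied for the highest frequency
  counts.select { |_value, count| count == top }.keys
end

weight = [115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]

puts modes_with_tally(weight).inspect
# Prints: [115]
```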

R

weight <- c(115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)

get_mode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

height_mode <- get_mode(height)
weight_mode <- get_mode(weight)

sprintf("Mode Weight: %s, Mode Height: %s", weight_mode, height_mode)

JavaScript

const util = require("util");
const math = require("mathjs");

let weight = [115, 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

var weightMode = math.mode(weight);
var heightMode = math.mode(height);

console.log( util.format("Mode Weight: %s, Mode Height: %s", weightMode, heightMode) );

Standard Deviation

Standard deviation is the square root of the average squared distance from the mean. Said differently, it's a number that measures how close your data set, as a whole, is to the mean.

This number will help you get a better feel for the distribution of your data points.

Ruby

weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

def mean(array)
  array.inject(0) { |sum, x| sum + x } / array.size.to_f
end

def standard_deviation(array)
  m = mean(array)
  # Sum of squared deviations from the mean
  sum_of_squares = array.inject(0) { |sum, x| sum + (x - m) ** 2 }
  # Divide by n - 1 for the sample standard deviation
  standard_deviation = Math.sqrt(sum_of_squares / (array.size - 1))

  # Round floating point to 4 decimals
  format = "%0.4f"
  return format % standard_deviation
end

puts %Q{ Weight SD: #{standard_deviation(weight)}, Height SD: #{standard_deviation(height)} }
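
The function above divides by n - 1, which makes it the sample standard deviation. To see how that differs from the population version (dividing by n), here is a minimal sketch. The helper names sum_of_squares, sample_sd, and population_sd are made up for illustration, and the data is the 15 heights from the Mean section.

```ruby
# Sum of squared deviations from the mean (hypothetical helper)
def sum_of_squares(values)
  m = values.sum / values.size.to_f
  values.sum { |x| (x - m)**2 }
end

# Sample standard deviation: divide by n - 1
def sample_sd(values)
  Math.sqrt(sum_of_squares(values) / (values.size - 1))
end

# Population standard deviation: divide by n
def population_sd(values)
  Math.sqrt(sum_of_squares(values) / values.size)
end

height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

puts format("Sample SD: %.4f, Population SD: %.4f", sample_sd(height), population_sd(height))
# Prints: Sample SD: 4.4721, Population SD: 4.3205
```

The population figure is always a little smaller on the same data, and the gap shrinks as the number of observations grows.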

R

R's sd() function computes the sample standard deviation (dividing by n - 1), not the population standard deviation.

weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)

weight_sd <- sd(weight)
height_sd <- sd(height)

sprintf("Weight SD: %1.4f, Height SD: %1.4f", weight_sd, height_sd)

JavaScript

const util = require("util");
const math = require("mathjs");

let weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72];

var weightSD = math.std(weight).toFixed(4);
var heightSD = math.std(height).toFixed(4);

console.log( util.format("Weight SD: %s, Height SD: %s", weightSD, heightSD) );


Z Scores

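A z-score is a simple arithmetic transformation of an actual measurement: subtract the mean, then divide by the standard deviation. The result tells you how many standard deviations the measurement lies from the mean.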
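
Ruby has no built-in for this, so here is a minimal sketch that writes the formula out as (x - mean) / sample standard deviation. The helper names mean, sample_sd, and z_score are hypothetical, and x = 50 is just the example value used in the R and JavaScript snippets.

```ruby
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

def mean(values)
  values.sum / values.size.to_f
end

# Sample standard deviation (divides by n - 1, like R's sd())
def sample_sd(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.size - 1))
end

# Number of standard deviations that x lies from the mean of values
def z_score(x, values)
  (x - mean(values)) / sample_sd(values)
end

x = 50
puts format("Weight Z: %.2f, Height Z: %.2f", z_score(x, weight), z_score(x, height))
```

A negative score means the measurement sits below the mean, and a score of 0 means it is exactly the mean.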

R

In R, you can calculate the z-score longhand or with the scale() function.

Longhand

This uses the algebraic z-score expression, (x - mean) / standard deviation.

weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
x <- 50

zWeight <- (x - mean(weight) ) / sd(weight)
zHeight <- (x - mean(height) ) / sd(height)
sprintf("Weight Z: %1.2f. Height Z: %1.2f", zWeight, zHeight)

This uses R's scale() function.

weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
x <- 50


zWeight <- scale(x, center = mean(weight), scale = sd(weight))
zHeight <- scale(x, center = mean(height), scale = sd(height))
sprintf("Weight Z: %1.2f. Height Z: %1.2f", zWeight, zHeight)

JavaScript

const util = require("util");
const math = require("mathjs");

let weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164];
let height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72];

//How many standard deviations our datapoints lie from the mean
//This will help you determine if a specific datapoint is an outlier
function zScore(datapoint, mean, std, n=1){
    let score = (datapoint - mean) / (std / Math.sqrt(n) );
    // Number of standard deviations from the mean. 
    return Number(score).toFixed(4);
}


const x = 50;
// Pass each data set's own mean and standard deviation rather than reusing shared variables
const zWeight = zScore(x, math.mean(weight), math.std(weight));
const zHeight = zScore(x, math.mean(height), math.std(height));
console.log( util.format("Weight Z: %s, Height Z: %s", zWeight, zHeight) );

Correlation

This little function in R is convenient. Sometimes you might want to ask yourself, "Are these two variables correlated?" Using R's cor(), it's straightforward to compute the correlation coefficient r, which ranges from -1 to 1.

R

weight <- c(115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164)
height <- c(58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)

# Pearson correlation coefficient, between -1 and 1 (not a percentage)
r <- cor(weight, height)

sprintf("Correlation coefficient: %f", r)
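
Ruby has no cor() in its core library, so here is a minimal sketch of the Pearson correlation coefficient written out by hand: the sum of the paired deviations from each mean, divided by the square root of the product of the two sums of squares. The name pearson_r is made up for illustration.

```ruby
# Hypothetical helper: Pearson correlation coefficient of two equal-length arrays
def pearson_r(xs, ys)
  raise ArgumentError, "arrays must be the same length" unless xs.size == ys.size

  mx = xs.sum / xs.size.to_f
  my = ys.sum / ys.size.to_f

  # Sum of products of paired deviations from each mean
  cross = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  # Sums of squared deviations for each variable
  x_ss = xs.sum { |x| (x - mx)**2 }
  y_ss = ys.sum { |y| (y - my)**2 }

  cross / Math.sqrt(x_ss * y_ss)
end

weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 154, 159, 164]
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

puts format("Correlation coefficient: %.4f", pearson_r(weight, height))
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.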