Julia provides a rich set of functions for performing statistical calculations. This tutorial will cover some of the most common ones, demonstrating their usage with examples.
1. Mean:
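The mean() function, from the Statistics standard library, calculates the arithmetic mean of a collection of numbers. Load the library once with using Statistics; median(), std(), var(), and quantile() below need it as well.
using Statistics
data = [1, 2, 3, 4, 5]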
average = mean(data)
println("Mean: ", average) # Output: Mean: 3.0
data_float = [1.0, 2.5, 3.7, 4.2, 5.1]
avg_float = mean(data_float)
println("Mean (Float): ", avg_float) # Output: Mean (Float): 3.3
# mean() also works on other collection types
data_tuple = (1,2,3,4,5)
avg_tuple = mean(data_tuple)
println("Mean (Tuple): ", avg_tuple) # Output: Mean (Tuple): 3.0
data_set = Set([1,2,3,4,5])
avg_set = mean(data_set)
println("Mean (Set): ", avg_set) # Output: Mean (Set): 3.0
2. Median:
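The median() function returns the middle value of a collection once its elements are ordered; the input itself does not need to be sorted. For an even number of elements, it returns the mean of the two middle values.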
data = [1, 2, 3, 4, 5]
med = median(data)
println("Median: ", med) # Output: Median: 3.0
data_even = [1, 2, 3, 4]
med_even = median(data_even)
println("Median (Even): ", med_even) # Output: Median (Even): 2.5
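To make the odd and even cases concrete, here is a minimal sketch of the same calculation done by hand. The helper name manual_median is hypothetical and is for illustration only; in real code, the library's median() is the better choice.

```julia
# Hypothetical helper, for illustration only; prefer Statistics.median in real code.
function manual_median(values)
    sorted = sort(collect(values))   # copy and sort; the input is left untouched
    n = length(sorted)
    mid = (n + 1) ÷ 2
    # Odd length: the single middle element.
    # Even length: the mean of the two middle elements.
    return isodd(n) ? float(sorted[mid]) : (sorted[mid] + sorted[mid + 1]) / 2
end

manual_median([5, 1, 4, 2, 3])   # 3.0
manual_median([4, 1, 3, 2])      # 2.5
```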
3. Standard Deviation:
The std() function calculates the standard deviation, a measure of the spread of data around the mean. By default it returns the sample standard deviation, which divides by n - 1.
data = [1, 2, 3, 4, 5]
stdev = std(data)
println("Standard Deviation: ", stdev) # Output: Standard Deviation: 1.5811388300841898
# To calculate the population standard deviation (dividing by n instead of n - 1), pass corrected=false.
stdev_pop = std(data, corrected=false)
println("Population Standard Deviation: ", stdev_pop) # Output: Population Standard Deviation: 1.4142135623730951
4. Variance:
The var() function calculates the variance, the square of the standard deviation.
data = [1, 2, 3, 4, 5]
variance = var(data)
println("Variance: ", variance) # Output: Variance: 2.5
# To calculate the population variance (dividing by n instead of n - 1), pass corrected=false.
variance_pop = var(data, corrected=false)
println("Population Variance: ", variance_pop) # Output: Population Variance: 2.0
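The difference between the sample and population figures comes down to the divisor. The sketch below recomputes both variances by hand from the sum of squared deviations; the variable names are illustrative, and the only library calls are mean(), var(), and std() from the Statistics standard library.

```julia
using Statistics

data = [1, 2, 3, 4, 5]
n = length(data)
m = mean(data)

# Sum of squared deviations from the mean
ss = sum((x - m)^2 for x in data)   # 10.0

sample_var = ss / (n - 1)           # 2.5, matches var(data)
population_var = ss / n             # 2.0, matches var(data, corrected=false)

# The standard deviation is the square root of the corresponding variance
sample_std = sqrt(sample_var)
population_std = sqrt(population_var)
```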
5. Minimum and Maximum:
The minimum() and maximum() functions find the smallest and largest values in a collection. extrema() returns both as a (minimum, maximum) tuple.
data = [5, 2, 8, 1, 9]
min_val = minimum(data)
max_val = maximum(data)
extrema_val = extrema(data)
println("Minimum: ", min_val) # Output: Minimum: 1
println("Maximum: ", max_val) # Output: Maximum: 9
println("Extrema: ", extrema_val) # Output: Extrema: (1, 9)
6. Quantiles:
The quantile() function calculates quantiles: cut points below which a given fraction of the sorted data falls. By default, Julia interpolates linearly between neighbouring values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
q25 = quantile(data, 0.25) # 25th percentile
q50 = quantile(data, 0.5) # 50th percentile (median)
q75 = quantile(data, 0.75) # 75th percentile
println("25th Quantile: ", q25) # Output: 25th Quantile: 3.25
println("50th Quantile: ", q50) # Output: 50th Quantile: 5.5
println("75th Quantile: ", q75) # Output: 75th Quantile: 7.75
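The fractional results above (3.25, 5.5, 7.75) come from linear interpolation between neighbouring sorted values. The following sketch reproduces that default rule by hand; manual_quantile is a hypothetical helper name, and it covers only the default interpolation, not the other options quantile() accepts.

```julia
using Statistics

# Hypothetical helper mirroring quantile()'s default linear-interpolation rule.
function manual_quantile(values, p)
    sorted = sort(collect(values))
    n = length(sorted)
    h = (n - 1) * p + 1      # fractional 1-based position in the sorted data
    lo = floor(Int, h)
    hi = min(lo + 1, n)      # clamp so that p = 1 stays in bounds
    return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
end

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
manual_quantile(data, 0.25)   # 3.25
```

For p = 0.25 and ten values, the position is h = 9 * 0.25 + 1 = 3.25, so the result lies a quarter of the way from the third value (3) to the fourth (4).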
7. describe() (from DataFrames.jl):
For a more comprehensive summary of descriptive statistics, the describe() function from the DataFrames.jl package is very helpful, especially when working with tabular data.
using DataFrames
df = DataFrame(A = [1, 2, 3, 4, 5], B = [2.1, 3.2, 4.3, 5.4, 6.5])
summary_stats = describe(df)
println(summary_stats)
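In recent versions of DataFrames.jl, this outputs a table containing the mean, min, median, max, number of missing values, and element type of each column in the DataFrame. To include more, pass extra statistics such as :std, or :all, as additional arguments to describe(). Remember to install DataFrames.jl first by typing ] add DataFrames in the Julia REPL.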
Example: Combining functions:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
avg = mean(data)
stdev = std(data)
println("Mean: ", avg, ", Standard Deviation: ", stdev)
zscores = (data .- avg) ./ stdev # Calculate z-scores
println("Z-scores: ", zscores)
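A quick sanity check on the z-scores is that standardized data should have a mean of 0 and a standard deviation of 1, up to floating-point rounding. The sketch below verifies this for the same data; the tolerance of 1e-12 is an arbitrary choice for illustration.

```julia
using Statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
zscores = (data .- mean(data)) ./ std(data)

# Standardized data should have mean 0 and standard deviation 1 (up to rounding)
mean_is_zero = isapprox(mean(zscores), 0; atol = 1e-12)
std_is_one = std(zscores) ≈ 1
```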
This tutorial covers the basics. Julia’s statistical capabilities extend far beyond these functions. The StatsBase.jl package provides a wide array of more advanced statistical tools. Remember to consult the Julia documentation for more in-depth information and further functions.
Let’s explore more statistical functions available in base Julia and the StatsBase.jl package.
Base Julia (Built-in Functions):
Beyond the basic functions (mean, median, std, var, minimum, maximum, quantile) discussed above, base Julia and its Statistics standard library offer several other useful functions:
sum(x): Calculates the sum of all elements in x.
prod(x): Calculates the product of all elements in x.
cumsum(x): Calculates the cumulative sum of x. Returns a vector of the same length as x.
cumprod(x): Calculates the cumulative product of x.
diff(x): Calculates the differences between consecutive elements of x.
cor(x, y): Computes the Pearson correlation coefficient between vectors x and y (from Statistics).
cov(x, y): Computes the covariance between vectors x and y (from Statistics).
fit(Histogram, x): Creates a histogram holding the bin edges and counts. This function lives in StatsBase.jl; the older hist(x) is no longer part of base Julia. Plots.jl is often used to visualize the histogram.
range(start, stop, length=n): Generates an evenly spaced range of n numbers. Useful for creating data for statistical analysis or plotting.
rand(n): Generates n random numbers uniformly distributed between 0 and 1.
randn(n): Generates n standard normal (mean 0, standard deviation 1) random numbers.
mean(f, x): Calculates the mean of f applied to each element of x, where f is a function. Useful for transformations, for example mean(abs2, x) for the mean of the squares.
map(f, x): Applies the function f to each element of x and returns a new collection. Helpful for data preprocessing before statistical analysis.
filter(f, x): Returns a new collection containing only the elements of x for which the function f returns true. Useful for subsetting your data.
StatsBase.jl Package:
StatsBase.jl significantly expands Julia’s statistical capabilities. Here are some key functions and concepts it provides:
modes(x): Returns the mode(s) of a dataset.
skewness(x): Measures the asymmetry of the data distribution.
kurtosis(x): Measures the “tailedness” of the data distribution, reported as excess kurtosis.
describe(x): Provides a summary of descriptive statistics.
percentile(x, p): Calculates the p-th percentile.
iqr(x): Calculates the interquartile range.
StatsBase.jl integrates well with the Distributions.jl package, which provides a wide variety of probability distributions. You can work with these distributions to perform calculations related to probabilities, quantiles, fitting data to distributions, etc.
StatsBase.jl, in conjunction with other packages, supports hypothesis testing.
Regression modeling is handled by GLM.jl (Generalized Linear Models).
StatsBase.jl provides sampling tools such as sample(), the building blocks for bootstrapping and other resampling techniques.
Many functions in StatsBase.jl accept weights, allowing you to perform weighted statistical calculations.
Example using StatsBase.jl:
using Statistics, StatsBase
data = [1, 2, 2, 3, 4, 4, 4, 5, 6]
println("Mode: ", modes(data)) # Output: Mode: [4]
println("Skewness: ", skewness(data)) # Output: Skewness: approximately 0.008
println("Kurtosis: ", kurtosis(data)) # Output: Kurtosis: approximately -0.948 (excess kurtosis)
println("Percentile (75th): ", percentile(data, 75)) # Output: Percentile (75th): 4.0
println("Interquartile Range: ", iqr(data)) # Output: Interquartile Range: 2.0
describe(data) # Prints summary statistics directly to the console
# Using weights: wrap the raw vector with weights() to build a weight vector
w = [1, 2, 1, 1, 2, 1, 3, 1, 1] # Example weights
println("Weighted Mean: ", mean(data, weights(w))) # Output: Weighted Mean: approximately 3.4615
Key Considerations:
Install StatsBase.jl before using it. In the Julia REPL, type ] add StatsBase.
The StatsBase.jl documentation is your best resource for a complete list of functions and their usage. You can access it online or within the Julia REPL using ?StatsBase.
This expanded list and example should give you a better overview of the statistical functions available in Julia. Remember to consult the documentation for the most up-to-date information and details on function usage.
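As a quick check of several of the base Julia and Statistics functions listed earlier, the sketch below applies them to two small vectors. The data are made up for illustration; y is chosen as exactly twice x so that the correlation is easy to predict.

```julia
using Statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]             # exactly 2x, so the correlation is 1

sum(x)                           # 15
prod(x)                          # 120
cumsum(x)                        # [1, 3, 6, 10, 15]
cumprod(x)                       # [1, 2, 6, 24, 120]
diff(x)                          # [1, 1, 1, 1]
cor(x, y)                        # ≈ 1.0
cov(x, y)                        # 5.0
mean(abs2, x)                    # 11.0, the mean of the squared elements
filter(iseven, x)                # [2, 4]
collect(range(0, 1, length=5))   # [0.0, 0.25, 0.5, 0.75, 1.0]
```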
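The linreg function makes it easy to perform simple and multiple linear regression on datasets with one or more independent variables. Note that linreg was removed from base Julia in version 1.0; on current versions, the lm() function from GLM.jl is the usual way to fit such models.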
The data set represents samples of gasoline of various octane ratings. For each sample, the octane rating was measured along with the component makeup in terms of three components. We aim to model octane rating as a function of the component makeup of gasoline.
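The data set itself is not reproduced here, so the following is only a minimal sketch of how such a model could be fitted, using hypothetical component values and a synthetic response built from known coefficients. It uses ordinary least squares via the backslash operator rather than any particular regression package; all names and numbers are assumptions for illustration.

```julia
using LinearAlgebra

# Hypothetical component values for five samples (illustrative, not the real data set)
c1 = [0.2, 0.3, 0.1, 0.4, 0.5]
c2 = [0.5, 0.1, 0.6, 0.2, 0.3]
c3 = [0.1, 0.4, 0.2, 0.3, 0.1]

# Synthetic response built from known coefficients so the fit can be checked
true_coefs = [80.0, 10.0, 5.0, -3.0]       # intercept, c1, c2, c3
X = hcat(ones(length(c1)), c1, c2, c3)      # design matrix with an intercept column
octane = X * true_coefs

# Ordinary least squares: the backslash operator minimizes ‖X * b - octane‖
coefs = X \ octane
residuals = octane - X * coefs
```

Because the response here is generated without noise, the fitted coefficients recover true_coefs and the residuals are zero up to rounding; with real measurements the residuals would be what the diagnostics inspect.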
There are several Julia packages that can assist you with regression diagnostics. Here are a few of them:
GLM.jl: The Generalized Linear Models package is the go-to for fitting linear and generalized linear models.
Plots.jl and StatsPlots.jl: These packages are excellent for creating diagnostic plots.
HypothesisTests.jl: This package contains various statistical tests, including those for heteroscedasticity and autocorrelation.
RegressionDiagnostics.jl: This package provides tools for checking multicollinearity, among other diagnostic tests.
DataFrames.jl: While not strictly for diagnostics, this package is incredibly useful for data manipulation and preparation.
StatsBase.jl: Provides basic statistical functions that can be helpful in diagnostics.
By combining these packages, you can perform a thorough analysis and diagnostics of your regression models. Each package has its own set of functionalities that complement the others, making the entire diagnostic process smoother and more comprehensive.