About

The RDatasets.jl package

The RDatasets.jl package in Julia provides an easy way for users to access many of the standard datasets available in R. It’s essentially a port of the Rdatasets repository created by Vincent Arel-Bundock, which gathers datasets from various R packages in one convenient location.

With RDatasets.jl, you can load datasets using the dataset() function, which takes the name of the package and the dataset as arguments. For example, you can load the famous iris dataset with dataset("datasets", "iris").
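
A minimal session looks like this, assuming the package is installed:

using RDatasets

# Load the classic iris data from R's built-in "datasets" package
iris = dataset("datasets", "iris")

# The result is a DataFrame; peek at the first few rows
first(iris, 5)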

This package is particularly useful for those who are familiar with R and want to use the same datasets in Julia for analysis or experimentation.


The RDatasets.jl package includes over 700 datasets from various R packages. Some of the datasets are from the core of R, while others are included with many of R’s most popular packages. Here are a few examples of the types of datasets you can find:

  • Iris dataset from the datasets package
  • Neuro dataset from the boot package
  • Ecdat datasets for econometrics
  • HistData datasets from the history of statistics and data visualization
  • ISLR datasets for “An Introduction to Statistical Learning with Applications in R”
  • MASS datasets for support functions and datasets for Venables and Ripley’s MASS
  • SASmixed datasets from the “SAS System for Mixed Models”
  • Zelig datasets from Zelig (“Everyone’s Statistical Software”)

You can use the RDatasets.datasets() function to get a table describing all the included datasets, or pass in a package name for a targeted list of datasets.
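
For example:

using RDatasets

# Table describing every dataset bundled with RDatasets.jl
all_datasets = RDatasets.datasets()

# Restrict the listing to a single R package, e.g. MASS
mass_datasets = RDatasets.datasets("MASS")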

NYC Flights

NYC Flights (nycflights13)

The NYC Flights 13 dataset in R is a popular resource for data analysis. It contains comprehensive information about all domestic flights departing from New York City airports (JFK, LGA, EWR) during the year 2013.

Key Features:

  • Comprehensive Data: Includes flight dates and times (scheduled and actual), departure and arrival delays, carrier information, origin and destination airports, flight numbers, and more.
  • Real-World Data: Offers a realistic and engaging dataset for practicing data manipulation, visualization, and statistical analysis techniques.
  • Tidyverse-Friendly: Easily integrates with the tidyverse ecosystem in R, making it convenient to use with popular packages like dplyr, ggplot2, and tidyr.

Common Uses:

  • Analyzing Flight Delays: Investigating factors that contribute to flight delays, such as weather conditions, carrier performance, and time of year.
  • Visualizing Flight Patterns: Creating maps and charts to explore flight routes, destinations, and travel trends.
  • Data Cleaning and Transformation: Practicing data wrangling techniques, such as filtering, grouping, and joining data from different sources.
  • Statistical Modeling: Building predictive models to forecast flight delays or other aviation-related outcomes.

The NYC Flights 13 dataset provides a valuable resource for learning and applying data analysis skills in R.

using DataFrames, CSV, HTTP

# Download the NYC Flights 13 data from GitHub
# (if this exact path is unavailable, any CSV mirror of the flights table works)
url = "https://raw.githubusercontent.com/tidyverse/nycflights13/master/data-raw/nycflights13.csv"
response = HTTP.get(url)
data = String(response.body)

# Read the data into a DataFrame
flights = CSV.read(IOBuffer(data), DataFrame)

# Inspect the dimensions of the dataset
num_rows, num_cols = size(flights)
println("Number of rows: ", num_rows)
println("Number of columns: ", num_cols)

Explanation:

  1. Import necessary packages:
    • DataFrames: For working with tabular data.
    • CSV: For reading CSV files.
    • HTTP: For downloading data from the web.
  2. Download and load the data:
    • The code downloads the NYC Flights 13 data from GitHub and reads it into a DataFrame.
  3. Inspect dimensions:
    • size(flights) returns a tuple containing the number of rows and columns of the DataFrame.
    • The code extracts the number of rows and columns from the tuple and prints them to the console.

This will output the number of rows and columns in the NYC Flights 13 dataset.

Exercise 1.1: Basic Data Exploration

using DataFrames, CSV, HTTP

# Download and load the data
url = "https://raw.githubusercontent.com/tidyverse/nycflights13/master/data-raw/nycflights13.csv"
response = HTTP.get(url)
data = String(response.body)
flights = CSV.read(IOBuffer(data), DataFrame)

# Print the first 10 rows
println("First 10 rows:")
show(first(flights, 10))

# Determine the number of rows and columns
num_rows, num_cols = size(flights)
println("Number of rows:", num_rows)
println("Number of columns:", num_cols)

# Print data types of each column
println("Data types of each column:")
for (col_name, col_type) in zip(names(flights), eltype.(eachcol(flights)))
    println("$col_name: $col_type")
end

Exercise 1.2: Unique Airlines and Flight Counts

# Find unique airlines
unique_carriers = unique(flights.carrier)
println("Unique airlines:")
println(unique_carriers)

# Count flights for each carrier (countmap comes from StatsBase)
using StatsBase
carrier_counts = countmap(flights.carrier)
println("Flight counts for each carrier:")
println(carrier_counts)

Exercise 2.1: Filtering and Selecting Data

# Filter flights departing from JFK
jfk_flights = filter(:origin => ==("JFK"), flights)

# Select specific columns
selected_columns = select(jfk_flights, [:carrier, :flight, :origin, :dest])
println("Flights departing from JFK:")
show(selected_columns)

Exercise 2.2: Filtering and Calculating Average Delay

# Filter flights with an arrival delay of at least 30 minutes
# (arr_delay contains missing values, so guard against them explicitly)
delayed_flights = filter(:arr_delay => x -> !ismissing(x) && x >= 30, flights)

# Calculate average arrival delay (mean comes from the Statistics standard library)
using Statistics
avg_delay = mean(delayed_flights.arr_delay)
println("Average arrival delay for delayed flights: ", avg_delay)

Exercise 3.1: Creating a New Column and Calculating Average Delay

# Create a new column 'total_delay' (source columns go on the left-hand side of =>)
flights_with_total_delay = transform(flights, [:dep_delay, :arr_delay] => ((x, y) -> x .+ y) => :total_delay)

# Calculate average total delay for each carrier, skipping missing values
avg_total_delay_by_carrier = combine(groupby(flights_with_total_delay, :carrier), :total_delay => (x -> mean(skipmissing(x))) => :avg_total_delay)
println("Average total delay for each carrier:")
show(avg_total_delay_by_carrier)

Exercise 3.2: Grouping and Counting Flights by Month

# The dataset already stores the month as an integer, so copy that column
# into 'flight_month' (no date parsing needed)
flights_with_month = transform(flights, :month => identity => :flight_month)

# Group by 'flight_month' and count flights
monthly_flight_counts = combine(groupby(flights_with_month, :flight_month), nrow => :num_flights)
println("Number of flights per month:")
show(monthly_flight_counts)

Exercise 4.1: Data Visualization (requires Plots.jl)

using Plots

# Histogram of arrival delays
histogram(flights.arr_delay, xlabel="Arrival Delay (minutes)", ylabel="Frequency", title="Histogram of Arrival Delays")

# Bar chart of flights per carrier (collect Dict keys/values into vectors for plotting)
bar(collect(keys(carrier_counts)), collect(values(carrier_counts)), xlabel="Carrier", ylabel="Number of Flights", title="Flights per Carrier")

# Scatter plot of departure delay vs. arrival delay
scatter(flights.dep_delay, flights.arr_delay, xlabel="Departure Delay (minutes)", ylabel="Arrival Delay (minutes)", title="Departure Delay vs. Arrival Delay")

Note: These are basic examples. You can further explore and refine these exercises based on the specific learning objectives and the level of your students.

Remember to install the required packages (DataFrames, CSV, HTTP, StatsBase, Plots) before running these scripts. You can install them using the Julia package manager:

using Pkg
Pkg.add(["DataFrames", "CSV", "HTTP", "StatsBase", "Plots"])

Pima Diabetes Dataset

Pima Diabetes Data

The Pima Indian Diabetes dataset is a widely used dataset in the field of machine learning, particularly for classification tasks. It’s originally from the National Institute of Diabetes and Digestive and Kidney Diseases and contains data from a population of Pima Indian women who live near Phoenix, Arizona.

Here’s what you need to know about the dataset:

Purpose:

  • The main goal of this dataset is to predict whether a patient has diabetes based on various diagnostic measurements.

Content:

  • The dataset contains information about 768 women, all of Pima Indian heritage and at least 21 years old.
  • It includes 8 predictor variables (features) that are believed to be related to diabetes:
    • Pregnancies: Number of times pregnant
    • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    • BloodPressure: Diastolic blood pressure (mm Hg)
    • SkinThickness: Triceps skin fold thickness (mm)
    • Insulin: 2-Hour serum insulin (μU/ml)
    • BMI: Body mass index (weight in kg/(height in m)^2)
    • DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
    • Age: Age (years)
  • There is one target variable (outcome):
    • Outcome: Class variable (0 or 1) where 1 indicates positive for diabetes, and 0 indicates negative.

Why it’s popular:

  • Benchmark Dataset: It’s a well-known and commonly used dataset for learning and experimenting with machine learning algorithms, especially classification algorithms.
  • Real-World Data: It provides a real-world scenario for predicting a health condition, making it relevant for practical applications.
  • Relatively Small: It’s a manageable size, making it suitable for quick experimentation and model development.

Where to find it:

  • You can find this dataset on various platforms, including:
    • Kaggle: A popular website for datasets and machine learning competitions.
    • UCI Machine Learning Repository: A collection of datasets used in machine learning research.
    • GitHub: Many users have uploaded the dataset to GitHub repositories; one such mirror is used in the sketch below.
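
As a minimal sketch in Julia, the snippet below loads a CSV copy and checks the class balance. The mirror URL is an assumption; substitute whichever copy of the data you are using:

using DataFrames, CSV, HTTP, StatsBase

# Assumed CSV mirror of the Pima data; swap in your own source if this path differs
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
response = HTTP.get(url)

# This copy ships without a header row, so supply the column names described above
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
pima = CSV.read(IOBuffer(response.body), DataFrame; header=cols)

# Class balance of the target variable (0 = negative, 1 = positive)
println(countmap(pima.Outcome))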

Important Note:

While this dataset is widely used, it’s important to remember that it represents a specific population (Pima Indian women) and may not be representative of other populations. When using this dataset, it’s crucial to be mindful of potential biases and limitations.

Sonar Dataset

Connectionist Bench Sonar dataset

The Sonar dataset, also known as the Connectionist Bench (Sonar, Mines vs. Rocks) dataset, is a classic dataset used in machine learning for binary classification. It’s designed to help train models that can distinguish between sonar signals bounced off underwater objects. Specifically, it aims to differentiate between signals returned from mines (metal cylinders) and rocks (roughly cylindrical rocks).

Here’s a breakdown of its key characteristics:

  • Task: Binary classification. The goal is to categorize a sonar signal as either a mine (“M”) or a rock (“R”).
  • Data: The dataset consists of 208 instances (examples).
  • Features: Each instance has 60 features. These features represent the energy within a particular frequency band of the sonar signal, integrated over a certain period of time. Think of it as a breakdown of the signal’s characteristics.
  • Labels: Each instance is labeled with either “M” (mine) or “R” (rock). These are the ground truth values used for training and evaluating the model.
  • Source: It’s available from the UCI Machine Learning Repository, a common source for benchmark datasets in machine learning.
  • Real-World Application: The dataset simulates a real-world problem of using sonar to detect underwater mines, which is a crucial task in naval operations.

In simpler terms: Imagine you have a submarine trying to detect mines underwater. Sonar sends out sound waves and listens for the echoes. The Sonar dataset provides examples of these echoes, represented by numbers (the 60 features), and tells you whether each echo came from a mine or a rock. Your job as a machine learning practitioner is to build a model that can learn from these examples and accurately predict whether a new echo comes from a mine or a rock.
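
A minimal Julia sketch of loading the data follows. The UCI path is an assumption; the file ships without a header, with 60 numeric columns followed by the “M”/“R” label:

using DataFrames, CSV, HTTP

# Assumed UCI mirror; 60 feature columns plus a final label column
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
response = HTTP.get(url)
sonar = CSV.read(IOBuffer(response.body), DataFrame; header=false)

# CSV.jl names headerless columns Column1..Column61; column 61 is the class label
rename!(sonar, :Column61 => :label)
println(size(sonar))          # expect (208, 61)
println(unique(sonar.label))  # "M" (mine) and "R" (rock)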

Why is it used?

  • Benchmark Dataset: It’s a well-established dataset, so it’s often used to compare the performance of different machine learning algorithms.
  • Relatively Simple: While 60 features might seem like a lot, it’s considered a relatively small and manageable dataset, making it good for learning and experimentation.
  • Real-World Relevance: It’s based on a real-world problem, which makes it more interesting than purely synthetic datasets.

If you’re new to machine learning, the Sonar dataset is a good place to start practicing binary classification tasks. It allows you to work with real-world-inspired data and experiment with different algorithms and techniques.

Survival Analysis

Datasets for Survival Analysis

Here are some commonly used datasets for teaching survival analysis:

  • Rossi Recidivism Data:
    • This dataset tracks the time to rearrest of a group of male inmates.
    • It’s often used to illustrate basic survival analysis concepts like Kaplan-Meier curves and log-rank tests.
  • Veteran’s Administration Lung Cancer Data:
    • This dataset contains information on lung cancer patients, including treatment and survival times.
    • It’s frequently used for demonstrating Cox proportional hazards models and assessing model fit.
  • Breast Cancer Data:
    • Various breast cancer datasets are available, often containing information on patient characteristics, treatment, and survival outcomes.
    • These datasets can be used to explore various aspects of survival analysis, including time-dependent covariates and competing risks.
  • Leukemia Remission Data:
    • This dataset tracks the time to remission for leukemia patients.
    • It’s a classic dataset used to illustrate survival analysis concepts and techniques.
  • Simulated Datasets:
    • Simulated datasets can be very valuable for teaching purposes.
    • They allow instructors to control the underlying data generating process and create scenarios with specific characteristics to illustrate key concepts.

These datasets are readily available in statistical software packages like R and Python, often within their survival analysis libraries.
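
To illustrate the simulated-data approach, here is a small Julia sketch that generates right-censored survival times and fits a Kaplan-Meier estimator. It assumes the Survival.jl and Distributions.jl packages; R's survival package and Python's lifelines support the same workflow:

using Survival, Distributions, Random

Random.seed!(42)
n = 200

# True event times and independent censoring times, both exponential
event_times  = rand(Exponential(10.0), n)
censor_times = rand(Exponential(15.0), n)

# We observe whichever comes first; status marks whether the event was seen
times  = min.(event_times, censor_times)
status = event_times .<= censor_times

# Nonparametric Kaplan-Meier estimate of the survival function
km = fit(KaplanMeier, times, status)
println(km)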

Key Considerations When Choosing Datasets:

  • Relevance to the Course Objectives: The dataset should align with the specific topics and learning objectives of the course.
  • Data Quality: The dataset should be of good quality, with accurate and complete information.
  • Complexity: The dataset should be appropriate for the level of the students. Beginner-level courses might benefit from simpler datasets, while more advanced courses can use more complex datasets.
  • Availability and Accessibility: The dataset should be easily accessible to students, either through built-in functions in the software or through readily available online repositories.

Lung Cancer Data

The Lung Cancer Dataset

The Lung Cancer Dataset, often associated with the National Lung Screening Trial (NLST), is a valuable resource for researchers studying lung cancer and the effectiveness of screening methods.

Key Features:

  • Origin: The dataset primarily stems from the NLST, a large-scale clinical trial that investigated whether low-dose computed tomography (LDCT) screening could reduce lung cancer mortality compared to chest X-ray screening.
  • Data Types: It encompasses a wide range of data, including:
    • Patient Demographics: Age, sex, smoking history, medical conditions.
    • Screening Results: Results of LDCT and chest X-ray screenings, including nodule detection, size, and characteristics.
    • Diagnostic Procedures: Information on subsequent diagnostic procedures like biopsies, surgeries, and staging.
    • Treatment Information: Details about cancer treatments received, such as surgery, chemotherapy, and radiation.
    • Survival Data: Time to lung cancer diagnosis, time to death, and cause of death.

Applications:

  • Evaluating Screening Effectiveness: Researchers use the dataset to assess the effectiveness of LDCT screening in detecting lung cancer at early stages and reducing lung cancer mortality.
  • Risk Factor Analysis: Identifying risk factors for lung cancer development and progression.
  • Developing Predictive Models: Creating models to predict the risk of lung cancer, the likelihood of developing specific types of lung cancer, and patient outcomes.
  • Assessing Treatment Outcomes: Evaluating the effectiveness of different treatment options for lung cancer.

Access and Usage:

  • The NLST dataset is available through the Cancer Data Access System (CDAS) from the National Cancer Institute (NCI).
  • Researchers must obtain approval from the NCI to access and use the data, which typically involves submitting a research proposal and adhering to specific data use agreements.

Importance in Research:

The Lung Cancer Dataset plays a crucial role in advancing our understanding of lung cancer and improving patient outcomes. It provides a valuable resource for researchers to investigate various aspects of lung cancer, from screening and early detection to treatment and prognosis.

Note:

This is a general overview of the Lung Cancer Dataset. The specific details and availability of the data may vary. It’s essential to refer to the official documentation and guidelines from the NCI for the most accurate and up-to-date information.