RDatasets.jl package
The RDatasets.jl package in Julia provides an easy way for users to access many of the standard datasets available in R. It is essentially a port of the Rdatasets repository created by Vincent Arel-Bundock, which gathers datasets from various R packages in one convenient location.
With RDatasets.jl, you can load datasets using the dataset() function, which takes the name of the package and the dataset as arguments. For example, you can load the famous iris dataset with dataset("datasets", "iris").
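A minimal session might look like this (assuming RDatasets.jl is already installed):
using RDatasets

# Load the classic iris dataset from R's built-in "datasets" package
iris = dataset("datasets", "iris")

# Quick inspection; RDatasets uses CamelCase column names (e.g. :SepalLength)
println(size(iris))   # (150, 5)
first(iris, 5)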
This package is particularly useful for those who are familiar with R and want to use the same datasets in Julia for analysis or experimentation.
The RDatasets.jl package includes over 700 datasets from various R packages. Some of the datasets come from the core of R, while others ship with many of R's most popular packages. Here are a few examples of the source packages whose datasets you can find:
the datasets package (R's built-in datasets)
the boot package
You can use the RDatasets.datasets() function to get a table describing all the included datasets, or pass in a package name for a targeted list of datasets.
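For instance, a quick sketch of browsing the catalog:
using RDatasets

# A DataFrame describing every bundled dataset
all_sets = RDatasets.datasets()

# Or restrict the listing to one source package, e.g. R's "boot" package
boot_sets = RDatasets.datasets("boot")
first(boot_sets, 5)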
NYC Flights 13 (nycflights13)
The NYC Flights 13 dataset in R is a popular resource for data analysis. It contains comprehensive information about all domestic flights departing from New York City airports (JFK, LGA, EWR) during the year 2013.
Key Features:
Common Uses:
The NYC Flights 13 dataset provides a valuable resource for learning and applying data analysis skills in R.
using DataFrames, CSV, HTTP
# Download the NYC Flights 13 data from GitHub
url = "https://raw.githubusercontent.com/tidyverse/nycflights13/master/data-raw/nycflights13.csv"
response = HTTP.get(url)
data = String(response.body)
# Read the data into a DataFrame
flights = CSV.read(IOBuffer(data), DataFrame)
# Inspect the dimensions of the dataset
num_rows, num_cols = size(flights)
println("Number of rows:", num_rows)
println("Number of columns:", num_cols)
Explanation:
DataFrames: For working with tabular data.
CSV: For reading CSV files.
HTTP: For downloading data from the web.
size(flights) returns a tuple containing the number of rows and columns of the DataFrame. This will output the number of rows and columns in the NYC Flights 13 dataset.
Exercise 1.1: Basic Data Exploration
using DataFrames, CSV, HTTP
# Download and load the data
url = "https://raw.githubusercontent.com/tidyverse/nycflights13/master/data-raw/nycflights13.csv"
response = HTTP.get(url)
data = String(response.body)
flights = CSV.read(IOBuffer(data), DataFrame)
# Print the first 10 rows
println("First 10 rows:")
show(first(flights, 10))
# Determine the number of rows and columns
num_rows, num_cols = size(flights)
println("Number of rows:", num_rows)
println("Number of columns:", num_cols)
# Print data types of each column
println("Data types of each column:")
for (col_name, col_type) in pairs(eltype(flights))
println("$col_name: $col_type")
end
Exercise 1.2: Unique Airlines and Flight Counts
using StatsBase  # provides countmap
# Find unique airlines
unique_carriers = unique(flights.carrier)
println("Unique airlines:")
println(unique_carriers)
# Count flights for each carrier
carrier_counts = countmap(flights.carrier)
println("Flight counts for each carrier:")
println(carrier_counts)
Exercise 2.1: Filtering and Selecting Data
# Filter flights departing from JFK
jfk_flights = filter(:origin => ==("JFK"), flights)
# Select specific columns
selected_columns = select(jfk_flights, [:carrier, :flight, :origin, :dest])
println("Flights departing from JFK:")
show(selected_columns)
Exercise 2.2: Filtering and Calculating Average Delay
using Statistics  # provides mean
# Filter flights with arrival delay of at least 30 minutes (skipping missing values)
delayed_flights = filter(:arr_delay => d -> !ismissing(d) && d >= 30, flights)
# Calculate average arrival delay
avg_delay = mean(delayed_flights.arr_delay)
println("Average arrival delay for delayed flights: ", avg_delay)
Exercise 3.1: Creating a New Column and Calculating Average Delay
# Create a new column 'total_delay' as the sum of departure and arrival delays
flights_with_total_delay = transform(flights, [:dep_delay, :arr_delay] => ByRow(+) => :total_delay)
# Calculate average total delay for each carrier, skipping missing values
using Statistics
avg_total_delay_by_carrier = combine(groupby(flights_with_total_delay, :carrier),
    :total_delay => (x -> mean(skipmissing(x))) => :avg_total_delay)
println("Average total delay for each carrier:")
show(avg_total_delay_by_carrier)
Exercise 3.2: Grouping and Counting Flights by Month
# The 'month' column already stores the month number, so copy it into 'flight_month'
flights_with_month = transform(flights, :month => identity => :flight_month)
# Group by 'flight_month' and count flights
monthly_flight_counts = combine(groupby(flights_with_month, :flight_month), nrow => :num_flights)
println("Number of flights per month:")
show(monthly_flight_counts)
Exercise 4.1: Data Visualization (requires Plots.jl)
using Plots

# Histogram of arrival delays (missing values skipped)
histogram(collect(skipmissing(flights.arr_delay)), xlabel="Arrival Delay (minutes)", ylabel="Frequency", title="Histogram of Arrival Delays")

# Bar chart of flights per carrier (keys/values collected into vectors for Plots)
bar(collect(keys(carrier_counts)), collect(values(carrier_counts)), xlabel="Carrier", ylabel="Number of Flights", title="Flights per Carrier")

# Scatter plot of departure delay vs. arrival delay (rows with missing delays dropped)
delays = dropmissing(flights, [:dep_delay, :arr_delay])
scatter(delays.dep_delay, delays.arr_delay, xlabel="Departure Delay (minutes)", ylabel="Arrival Delay (minutes)", title="Departure Delay vs. Arrival Delay")
Note: These are basic examples. You can further explore and refine these exercises based on the specific learning objectives and the level of your students.
Remember to install the required packages (DataFrames, CSV, HTTP, StatsBase, Plots) before running these scripts. You can install them using the Julia package manager:
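# A minimal sketch, from the Julia REPL:
using Pkg
Pkg.add(["DataFrames", "CSV", "HTTP", "StatsBase", "Plots"])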
The Pima Indian Diabetes dataset is a widely used dataset in the field of machine learning, particularly for classification tasks. It’s originally from the National Institute of Diabetes and Digestive and Kidney Diseases and contains data from a population of Pima Indian women who live near Phoenix, Arizona.
Here’s what you need to know about the dataset:
Purpose:
Content:
Why it’s popular:
Where to find it:
Important Note:
While this dataset is widely used, it’s important to remember that it represents a specific population (Pima Indian women) and may not be representative of other populations. When using this dataset, it’s crucial to be mindful of potential biases and limitations.
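If you want to experiment with it in Julia, one option is the Pima.tr/Pima.te splits that ship with R's MASS package, which RDatasets.jl mirrors. The dataset names below are assumptions; check RDatasets.datasets("MASS") if they fail to load.
using RDatasets

# Training/test splits of the Pima data from R's MASS package
# (dataset names "Pima.tr" and "Pima.te" are assumed here)
pima_train = dataset("MASS", "Pima.tr")
pima_test  = dataset("MASS", "Pima.te")
println(size(pima_train), " ", size(pima_test))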
The Sonar dataset, also known as the Connectionist Bench (Sonar, Mines vs. Rocks) dataset, is a classic dataset used in machine learning for binary classification. It’s designed to help train models that can distinguish between sonar signals bounced off underwater objects. Specifically, it aims to differentiate between signals returned from mines (metal cylinders) and rocks (roughly cylindrical rocks).
Here’s a breakdown of its key characteristics:
In simpler terms: Imagine you have a submarine trying to detect mines underwater. Sonar sends out sound waves and listens for the echoes. The Sonar dataset provides examples of these echoes, represented by numbers (the 60 features), and tells you whether each echo came from a mine or a rock. Your job as a machine learning practitioner is to build a model that can learn from these examples and accurately predict whether a new echo comes from a mine or a rock.
Why is it used?
If you’re new to machine learning, the Sonar dataset is a good place to start practicing binary classification tasks. It allows you to work with real-world-inspired data and experiment with different algorithms and techniques.
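To experiment with it in Julia, you can reuse the download pattern from the flights example above. This is only a sketch: the UCI archive URL below is an assumption and may have moved, so verify it before relying on it.
using DataFrames, CSV, HTTP

# The classic sonar table: 60 numeric echo features plus a label column
# ("M" = mine, "R" = rock). URL assumed; check the UCI archive if it 404s.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
sonar = CSV.read(IOBuffer(String(HTTP.get(url).body)), DataFrame; header=false)
println(size(sonar))  # expect (208, 61) for the classic dataset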
Here are some commonly used datasets for teaching survival analysis:
These datasets are readily available in statistical software packages like R and Python, often within their survival analysis libraries.
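In Julia, several of the classic survival tables can likewise be pulled in through RDatasets.jl, which mirrors R's survival package. The dataset names below are assumptions; list what your version actually bundles with RDatasets.datasets("survival").
using RDatasets

# Survival-analysis tables mirrored from R's survival package
# (names "lung" and "ovarian" assumed; availability may vary)
lung    = dataset("survival", "lung")
ovarian = dataset("survival", "ovarian")
println(size(lung), " ", size(ovarian))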
Key Considerations When Choosing Datasets:
The Lung Cancer Dataset, often associated with the National Lung Screening Trial (NLST), is a valuable resource for researchers studying lung cancer and the effectiveness of screening methods.
Key Features:
Applications:
Access and Usage:
Importance in Research:
The Lung Cancer Dataset plays a crucial role in advancing our understanding of lung cancer and improving patient outcomes. It provides a valuable resource for researchers to investigate various aspects of lung cancer, from screening and early detection to treatment and prognosis.
Note:
This is a general overview of the Lung Cancer Dataset. The specific details and availability of the data may vary. It’s essential to refer to the official documentation and guidelines from the NCI for the most accurate and up-to-date information.