
A Data Science Case Study in R

Posted on March 13, 2017 by Robert Grünwald in R bloggers

This article was first published on R-Programming – Statistik Service.

Demanding data science projects are becoming more and more relevant, and conventional evaluation procedures are often no longer sufficient. For this reason, there is a growing need for tailor-made solutions that are adapted to each project’s goal, and these are often implemented in R. To support our readers in their own R programming, we have carried out an example analysis that demonstrates several possible applications of R.

Data Science Projects

Approaching your first data science project can be a daunting task. Luckily, there are rough step-by-step outlines and heuristics that can help you on your way to becoming a data ninja. In this article, we review some of these guidelines and apply them to an example project in R.

For our analysis and the R programming, we will make use of the following R packages:
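A plausible set, reconstructed from the analysis that follows (pipes and data wrangling, plotting, and generalized additive models), is:

    library(dplyr)    # data manipulation and the %>% pipe
    library(ggplot2)  # plots for the final figures
    library(mgcv)     # generalized additive models (GAMs)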

Anatomy of a Data Science project

A basic data science project consists of the following six steps:

  • State the problem you are trying to solve. It has to be an unambiguous question that can be answered with data and a statistical or machine learning model. At least, specify: What is being observed? What has to be predicted?
  • Collect the data, then clean and prepare it. This is commonly the most time-consuming task, but it has to be done in order to fit a prediction model with the data.
  • Explore the data. Get to know its properties and quirks. Check numerical summaries of your metric variables, tables of the categorical data, and plot univariate and multivariate representations of your variables. By this, you also get an overview of the quality of the data and can find outliers.
  • Check if any variables may need to be transformed. Most commonly, this is a logarithmic transformation of skewed measurements such as concentrations or times. Also, some variables might have to be split up into two or more variables.
  • Choose a model and train it on the data. If you have more than one candidate model, apply each and evaluate their goodness-of-fit using independent data that was not used for training the model.
  • Use the best model to make your final predictions.

We apply these principles to an example data set that was used in the ASA’s 2009 Data Expo. The data comprise around 120 million commercial and domestic flights within the USA between 1987 and 2008. Measured variables include departure and arrival airport, airline, and scheduled and actual departure and arrival times.

We will focus on the 2008 subset of this data. Because even this subset is around 600 MB, it makes sense to start by exploring and developing your code on a random sample, and then periodically verify on the complete data set that your results still hold.

The following commands read in our subset data and display the first three observations:
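For illustration, a minimal sketch assuming the 2008 subset was saved as 2008.csv in the working directory:

    flights <- read.csv("2008.csv")
    head(flights, 3)  # first three observations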

Fortunately, the ASA provides a code book with descriptions of each variable here. For example, we now know that for the variable DayOfWeek, a 1 denotes Monday, a 2 is Tuesday, and so on.

The problem

With this data, it is possible to answer many interesting questions. Examples include:

  • Do planes with a delayed departure fly with a faster average speed to make up for the delay?
  • How does the delay of arriving flights vary during the day? Are planes more delayed on weekends?
  • How has the market share of different airlines shifted over these 20 years?
  • Are there specific planes that tend to have longer delays? What characterizes them? Maybe the age, or the manufacturer?

In addition to these concrete questions, the possibilities for explorative, sandbox-style data analysis are nearly limitless.

Here, we will focus on the first two of these questions.

Data cleaning

You should always check out the amount of missing values in your data. For this, we write an sapply-loop over each column in the flights data and report the percentage of missing values:
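A minimal sketch of such a loop:

    # Percentage of missing values per column
    sapply(flights, function(x) round(100 * mean(is.na(x)), 1))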

We see that most variables have at most a negligible amount of missing values. However, the last five variables, starting with CarrierDelay, show almost 80% missing values. Such a high proportion of missing data would usually suggest dropping these variables from the analysis altogether, since not even a sophisticated imputation procedure can help here. But, as further inspection shows, these variables only apply to delayed flights, i.e. flights with a positive value in the ArrDelay column.

When selecting only the arrival delay and the five sub-categories of delays, we see that they add up to the total arrival delay. For our analysis here, we are not interested in the delay reason, but view only the total ArrDelay as our outcome of interest.

The pipe operator %>%, by the way, is a nice feature of the magrittr package (also implemented in dplyr) that resembles the UNIX-style pipe. The following two lines mean and do exactly the same thing, but the second version is much easier to read:
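An illustrative pair (the concrete example may have differed):

    # Nested function calls, read from the inside out
    summary(select(filter(flights, ArrDelay > 0), ArrDelay))

    # The same operation written with the pipe, read from left to right
    flights %>%
      filter(ArrDelay > 0) %>%
      select(ArrDelay) %>%
      summary()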

The pipe operator thus takes the output of the left expression, and makes it the first argument of the right expression.

We have surprisingly clean data where not much has to be done before proceeding to feature engineering.

Explorative analyses

Our main variables of interest are:

  • The date, which conveniently is already split up into the columns Year, Month, and DayOfMonth, and even contains the weekday in DayOfWeek. This is rarely the case; usually you get a single column with a name like date and entries such as "2016-06-24". In that case, the R package lubridate provides helpful functions to work with and manipulate these dates efficiently.
  • CRSDepTime, the scheduled departure time. This will indicate the time of day for our analysis of when flights tend to have higher delays.
  • ArrDelay, the delay in minutes at arrival. We use this variable (rather than the delay at departure) for the outcome in our first analysis, since the arrival delay is what has the impact on our day.
  • For our second question of whether planes with delayed departure fly faster, we need DepDelay, the delay in minutes at departure, as well as a measure of average speed while flying. This variable is not available, but we can compute it from the available variables Distance and AirTime. We will do that in the next section, "Feature Engineering".

Let’s have an exploratory look at all our variables of interest.

Flight date

Since these are exploratory analyses that you usually won’t show anyone else, spending time on pretty graphics does not make sense here. For quick overviews, I mostly use the standard graphics functions from R, without much decoration in terms of titles, colors, and such.


Since we subsetted the data beforehand, it makes sense that all our flights are from 2008. We also see no big changes between the months. There is a slight drop after August, but the remaining changes can be explained by the number of days in a month.

The day of the month shows no influence on the number of flights, as expected. The fact that the 31st has around half the flights of the other days is also expected, since only about half of the months have a 31st day.

When plotting flights per weekday, however, we see that Saturday is the most quiet day of the week, with Sunday being the second most relaxed day. Between the remaining weekdays, there is little variation.

Departure Time


A histogram of the departure time shows that the number of flights is relatively constant from 6am to around 8pm and dropping off heavily before and after that.

Arrival and departure delay


Both arrival and departure delay show a very asymmetric, right-skewed distribution. We should keep this in mind and think about a logarithmic transformation or some other method of acknowledging this fact later.

The structure of the third plot of departure vs. arrival delay suggests that flights that start with a delay usually don’t compensate that delay during the flight. The arrival delay is almost always at least as large as the departure delay.

To get a first overview for our question of how the departure time influences the average delay, we can also plot the departure time against the arrival delay:


Aha! Something looks weird here. There seem to be periods of times with no flights at all. To see what is going on here, look at how the departure time is coded in the data:

A departure of 2:55pm is written as an integer 1455. This explains why the values from 1460 to 1499 are impossible. In the feature engineering step, we will have to recode this variable in a meaningful way to be able to model it correctly.

Distance and AirTime


Plotting the distance against the time needed, we see a linear relationship as expected, with one large outlier. This one point denotes a flight of 2,762 miles with an air time of 823 minutes, which corresponds to an average speed of only 201 mph. A scheduled flight over that distance would not be flown at such a low speed, so this observation is most likely erroneous and we should consider removing it.

Feature Engineering

Feature engineering describes the manipulation of your data set to create variables that a learning algorithm can work with. Often, this consists of transforming a variable (through e.g. a logarithm), extracting specific information from a variable (e.g. the year from a date string), or converting something like a ZIP code into a geographic region.

For our data, we have the following tasks:

  • Convert the weekday into a factor variable so it doesn’t get interpreted linearly.
  • Create a log-transformed version of the arrival and departure delay.
  • Transform the departure time so that it can be used in a model.
  • Create the average speed from the distance and air time variables.

Converting the weekday into a factor is important because otherwise, it would be interpreted as a metric variable, which would result in a linear effect. We want the weekdays to be categories, however, and so we create a factor with nice labels:
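A minimal sketch, using the coding from the ASA code book (1 = Monday):

    flights$DayOfWeek <- factor(flights$DayOfWeek,
                                levels = 1:7,
                                labels = c("Mon", "Tue", "Wed", "Thu",
                                           "Fri", "Sat", "Sun"))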

Log-transform delay times

When looking at the delays, we note that there are a lot of negative values in the data. These denote flights that left or arrived earlier than scheduled. To allow a log-transformation, we set all negative values to zero, which we interpret as "on time":

Now, since there are zeros in these variables, we create the variables log(1+ArrDelay) and log(1+DepDelay):
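A sketch of both steps; the names of the new log columns are placeholders:

    # Early departures/arrivals are interpreted as "on time"
    flights$ArrDelay <- pmax(flights$ArrDelay, 0)
    flights$DepDelay <- pmax(flights$DepDelay, 0)

    # log1p(x) computes log(1 + x)
    flights$LogArrDelay <- log1p(flights$ArrDelay)
    flights$LogDepDelay <- log1p(flights$DepDelay)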

Transform the departure time

The departure time is coded in the format hhmm, which is not helpful for modelling, since we need equal distances between equal durations of time. This way, the distance between 10:10pm and 10:20pm would be 10, but the distance between 10:50pm and 11:00pm, the same 10 minutes, would be 50.

For the departure time, we therefore need to convert the time format. We will use a decimal format, so that 11:00am becomes 11, 11:15am becomes 11.25, and 11:45 becomes 11.75.

The mathematical rule to transform the "old" time in hhmm format into a decimal format is: decimal time = floor(hhmm / 100) + (hhmm mod 100) / 60.

Here, the first part of the sum generates the hours, and the second part takes the remainder when dividing by 100 (i.e., the last two digits), and divides them by 60 to transform the minutes into a fraction of one hour.

Let’s implement that in R:
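A sketch, applied to the scheduled departure time and stored in a new column (the column name is a placeholder):

    # floor(hhmm / 100) extracts the hour, (hhmm %% 100) / 60 the fraction of an hour
    flights$DepTimeDec <- floor(flights$CRSDepTime / 100) +
                          (flights$CRSDepTime %% 100) / 60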

Of course, you should always verify that your code did what you intended by checking the results.

Create average speed

The average flight speed is not available in the data – we have to compute it from the distance and the air time:
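Distance is given in miles and AirTime in minutes, so a sketch of the computation is:

    flights$Speed <- flights$Distance / (flights$AirTime / 60)  # miles per hour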


We have a few outliers with very high, implausible average speeds. Domain knowledge or a quick Google search can tell us that speeds of more than 800mph are not maintainable with current passenger planes. Thus, we will remove these flights from the data:
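A minimal sketch of this filtering step:

    # Drop implausible speeds (rows with missing Speed are dropped here as well)
    flights <- flights %>% filter(Speed <= 800)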

Choosing an appropriate Method

For building an actual model with your data, you have the choice between two worlds, statistical modelling and machine learning.

Broadly speaking, statistical models focus more on quantifying and interpreting the relationships between input variables and the outcome. This is the case in situations such as clinical studies, where the main goal is to describe the effect of a new medication.

Machine learning methods on the other hand focus on achieving a high accuracy in prediction, sometimes sacrificing interpretability. This results in what is called "black box" algorithms, which are good at predicting the outcome, but it’s hard to see how a model computes a prediction for the outcome. A classic example for a question where machine learning is the appropriate answer is the product recommendation algorithm on online shopping websites.

For our questions, we are interested in the effects of certain input variables (speed and time of day / week). Thus we will make use of statistical models, namely a linear model and a generalized additive model.

To answer our first question, we first plot the variables of interest to get a first impression of the relationship. Since these plots will likely make it to the final report or at least a presentation to your supervisors, it now makes sense to spend a little time on generating a pretty image. We will use the ggplot2 package to do this:
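A sketch of such a plot, assuming the variables created above:

    ggplot(flights, aes(x = DepDelay, y = Speed)) +
      geom_point(alpha = 0.1) +
      geom_smooth(method = "lm") +
      labs(x = "Departure delay (minutes)", y = "Average speed (mph)")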


It seems like there is a slight increase in average speed for planes that leave with a larger delay. Let’s fit a linear model to quantify the effect:
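A sketch of such a model, using the variables defined above:

    speed_model <- lm(Speed ~ DepDelay, data = flights)
    summary(speed_model)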

There is a highly significant effect of 0.034 for the departure delay. This represents the increase in average speed for each minute of delay. So, a plane with 60 minutes of delay will fly 2.04mph faster on average.

Even though the effect is highly significant with a p value of less than 0.0001, its actual effect is negligibly small.

For the second question of interest, we need a slightly more sophisticated model. Since we want to know the effect of the time of day on the arrival delay, we cannot assume a linear effect of the time on the delay. Let’s plot the data:
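A sketch of this plot:

    ggplot(flights, aes(x = DepTimeDec, y = ArrDelay)) +
      geom_point(alpha = 0.05) +
      geom_smooth() +
      labs(x = "Scheduled departure time (hours)", y = "Arrival delay (minutes)")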


We plot both the actual delay and our transformed log-delay variable. The smoothing line of the second plot gives a better image of the shape of the delay. It seems that delays are highest at around 8pm, and lowest at 5am. This emphasizes the fact that a linear model would not be appropriate here.

We fit a generalized additive model , a GAM, to this data. Since the response variable is right skewed, a Gamma distribution seems appropriate for the model family. To be able to use it, we have to transform the delay into a strictly positive variable, so we compute the maximum of 1 and the arrival delay for each observation first.
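A sketch using the mgcv package; the variable names are the placeholders introduced above:

    library(mgcv)

    flights$PosArrDelay <- pmax(flights$ArrDelay, 1)  # strictly positive response
    delay_gam <- gam(PosArrDelay ~ s(DepTimeDec),
                     family = Gamma(link = "log"), data = flights)
    plot(delay_gam)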


We again see the trend of lower delays in the morning before 6am, and high delays around 8pm. To differentiate between weekdays, we now include this variable in the model:
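A sketch of the extended model; the original specification may have differed (for example, separate smooths per weekday):

    delay_gam2 <- gam(PosArrDelay ~ s(DepTimeDec) + DayOfWeek,
                      family = Gamma(link = "log"), data = flights)
    summary(delay_gam2)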

With this model, we can create an artificial data frame x_new, which we use to plot one prediction line per weekday:
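A sketch of this prediction step:

    x_new <- expand.grid(
      DepTimeDec = seq(5, 23, by = 0.1),
      DayOfWeek  = factor(levels(flights$DayOfWeek),
                          levels = levels(flights$DayOfWeek))
    )
    x_new$pred <- predict(delay_gam2, newdata = x_new, type = "response")

    ggplot(x_new, aes(x = DepTimeDec, y = pred, colour = DayOfWeek)) +
      geom_line() +
      labs(x = "Scheduled departure time (hours)",
           y = "Expected arrival delay (minutes)")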


We now see several things:

  • The nonlinear trend over the day is the same shape on every day of the week
  • Fridays are the worst days to fly by far, with Sunday being a close second. Expected delays are around 20 minutes during rush-hour (8pm)
  • Wednesdays and Saturdays are the quietest days
  • If you can manage it, fly on a Wednesday morning to minimize expected delays.

Closing remarks

As noted in the beginning of this post, this analysis is only one of many questions that can be tackled with this enormous data set. Feel free to browse the Data Expo website and especially the "Posters & results" section for many other interesting analyses.



CYCLISTIC BIKE SHARE - CASE STUDY WITH R

Idris Oluwatobi, published on September 20, 2022.

Bike Station

INTRODUCTION

This capstone project is the final project in my Google Data Analytics Professional Certificate course. In this case study, I will be analyzing a public dataset for a fictional company called Cyclistic, provided by the course. I will be using the R programming language for this analysis because of its benefits for reproducibility and transparency and its rich tools for statistical analysis and data visualization.

The following data analysis process will be followed: Ask, Prepare, Process, Analyze, Share, and Act.

On each step, the case study roadmap will be followed: guiding questions, key tasks, code (when needed), and deliverables.

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?

Why would casual riders buy Cyclistic annual memberships?

How can Cyclistic use digital media to influence casual riders to become members?

Lily Moreno (director of marketing and my manager) has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

  • Identify the business task
  • The main business objective is to design marketing strategies aimed at converting casual riders into annual members by understanding how the two groups differ.
  • Consider key stakeholders
  • The key stakeholders are the Director of Marketing (Lily Moreno), Marketing Analytics team, and Executive team.

Deliverable

  • A clear statement of the business task
  • To find the differences between the casual riders and annual members.
  • I will use Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under this license. The datasets are available at this link.
  • Download data and store it appropriately.
  • Data has been downloaded and copies have been stored securely on my computer.
  • Identify how it’s organized.
  • The data is in CSV (comma-separated values) format, and there are a total of 13 columns.
  • Sort and filter the data.
  • For this analysis, I will be using data for the year 2019 and 2020.
  • Determine the credibility of the data.
  • For the purposes of this case study, the datasets are appropriate and will enable me to answer the business questions. The data has been made available by Motivate International Inc. This is public data that I can use to explore how different customer types are using Cyclistic bikes. However, data-privacy issues prohibit me from using riders’ personally identifiable information, which prevents me from determining whether riders have purchased multiple single passes. All ride IDs are unique.
  • A description of all data sources used
  • The main source of all the data used was provided by the Cyclistic Company .

Install and load required packages
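A sketch of this step; the package set is an assumption based on the analysis below:

    # install.packages(c("tidyverse", "lubridate", "ggplot2"))
    library(tidyverse)  # import and wrangling
    library(lubridate)  # date functions
    library(ggplot2)    # visualization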

Import data to R Studio

  • Importing data for 2019
  • Importing data for 2020
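A sketch of the import step; the file names are placeholders for the Divvy/Cyclistic CSVs actually downloaded:

    q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
    q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")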

Wrangle and merge all data into a single file

  • I made sure all the data sets have the same number of columns and that the column names match before merging them all together.
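A sketch of the merge; the renamed columns are assumptions about how the 2019 schema maps onto the 2020 one:

    q1_2019 <- q1_2019 %>%
      rename(ride_id       = trip_id,
             started_at    = start_time,
             ended_at      = end_time,
             member_casual = usertype)

    all_trips <- bind_rows(q1_2019, q1_2020)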

Cleaning up data and adding data to prepare for analysis

Check the data for errors.

Choose your tools.

Transform the data so you can work with it effectively.

Document the cleaning process.

Deliverables

  • Documentation of any cleaning or manipulation of data.
  • I inspected the new table that was created using the following code chunks:
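A sketch of the inspection:

    colnames(all_trips)  # column names
    nrow(all_trips)      # number of rows
    dim(all_trips)       # dimensions
    head(all_trips)      # first six rows
    str(all_trips)       # column types
    summary(all_trips)   # statistical summary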

Adding columns that list the date, month, day, and year of each ride.

  • This will allow us to aggregate ride data for each month, day, or year. Therefore, the code chunks used are as follows
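A sketch of these additions:

    all_trips$date        <- as.Date(all_trips$started_at)
    all_trips$month       <- format(all_trips$date, "%m")
    all_trips$day         <- format(all_trips$date, "%d")
    all_trips$year        <- format(all_trips$date, "%Y")
    all_trips$day_of_week <- format(all_trips$date, "%A")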

Adding a “ride_length” calculation to all_trips (in seconds)

Convert “ride_length” from factor to numeric so we can run calculations on the data
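A sketch covering both of these steps:

    all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at)
    all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
    is.numeric(all_trips$ride_length)  # should return TRUE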

Remove “bad” data

  • The dataframe includes a few hundred entries where bikes were taken out of docks and checked for quality by Divvy, or where ride_length was negative or zero, so I created a new version of the dataframe (v2), since data is being removed (a sketch of this filtering step appears after this list).
  • All the data has been stored appropriately and prepared for analysis, so it is ready for exploration.
  • Aggregate your data so it’s useful and accessible.
  • Organize and format your data.
  • Perform calculations.
  • Identify trends and relationships.
  • A summary of your analysis
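The filtering step referenced above might look like this; the quality-check station name ("HQ QR") and the column names are assumptions carried over from the public Divvy data:

    all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" |
                                all_trips$ride_length <= 0), ]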

Conduct descriptive analysis on ride_length (all figures in seconds): the mean, median, maximum, and minimum of all_trips.
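A sketch of the descriptive statistics:

    mean(all_trips_v2$ride_length)    # average ride length
    median(all_trips_v2$ride_length)  # midpoint
    max(all_trips_v2$ride_length)     # longest ride
    min(all_trips_v2$ride_length)     # shortest ride
    summary(all_trips_v2$ride_length) # all of the above at once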

Let’s visualize members and casual riders by total rides taken (ride count).

  • From the above graph, we can observe that there are more rides by members than by casual riders.

Let’s look at the average ride time for each day of the week for members vs. casual users, and then at total rides and average ride time (duration) by day for each group.
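A sketch of the aggregation:

    aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual +
                all_trips_v2$day_of_week, FUN = mean)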

Let’s visualize the above table by days of the week and number of rides taken by member and casual riders.
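A sketch of this chart; the later charts (average duration, rides per month, ride distance) follow the same pattern with a different y variable:

    all_trips_v2 %>%
      group_by(member_casual, day_of_week) %>%
      summarise(number_of_rides  = n(),
                average_duration = mean(ride_length), .groups = "drop") %>%
      ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
      geom_col(position = "dodge")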

  • From the above graph, it is observed that members are quite consistent, with a higher number of rides throughout the week compared to casual riders. Also, the difference in the number of rides between members and casual riders on the weekends (Saturday and Sunday) is not as large as the difference during the other days of the week.

Let’s visualize the average duration of Members and Casual riders Vs. Day of the week

  • From the graph above, it is observed that casual riders ride for longer durations throughout the week, while members ride at a consistent pace during the week, with their longest rides on the weekend.

Let’s create a visualization for Total rides by members and casual riders by month

  • From the above graph, it is observed that members have the highest number of rides throughout the year, with August being the month with the highest number of rides overall for both members and casual riders.

Let’s compare Members and Casual riders depending on ride distance.

  • From the graph above, we can observe that the distance traveled by casual riders is far greater than the distance traveled by members, with a very large difference in kilometers.

This phase involves sharing my findings through visualizations, which can be done in a presentation.

Determine the best way to share your findings.

Create effective data visualizations.

Present your findings.

Ensure your work is accessible.

  • Supporting visualizations and key findings

This phase will be carried out by the executive team, Director of Marketing (Lily Moreno) and the Marketing Analytics team based on my analysis.

Members take more rides than casual riders.

More members ride in every month of the year compared to casual riders.

Casual riders ride for longer durations.

Members ride more throughout the weekdays, while casual riders have a relatively high ride count on the weekends (Saturday and Sunday) compared to the other days of the week.

Casual riders travel farther in terms of distance.

  • Your top 3 recommendations based on your analysis

Offer a discount or promotion for casual riders so they ride more often and experience the benefits of being a member.

Host fun biking competitions with prizes for casual riders on the weekends. Since many casual riders ride on weekends, this will also encourage them to get a membership.

Encourage casual riders to ride throughout the year through advertisements, flyers, and coupons, so as to convince them to become members.

THANK YOU FOR READING, PLEASE PROVIDE YOUR VALUABLE FEEDBACK.


Codes for case studies for the Bekes-Kezdi Data Analysis textbook

gabors-data-analysis/da_case_studies

Data Analysis Case Study codebase for R, Python and Stata

R, Python and Stata code for Data Analysis for Business, Economics, and Policy by Gábor Békés (CEU) and Gábor Kézdi (U. Michigan), published on 6 May 2021 by Cambridge University Press. gabors-data-analysis.com

On the textbook's website, we have detailed discussion of how to set up libraries, get data and code: Overview of data and code

To see options for various languages, check out:

  • R -- How to run code in R
  • Stata -- How to run code in Stata
  • Python -- How to run code in Python

Status (25 November, 2022)

The latest release, 0.8.3 "Ethics Gradient", was released on 25 November.

In the latest release we did some refactoring of the Python and R code. We continuously monitor bugs and make regular, if mostly minor, updates.

Organization

  • Each case study has a separate folder.
  • Within case study folders, codes in different languages are simply stored together.
  • Data should be downloaded and stored in a separate folder.

Code language versions

  • R -- We used R 4.0.2.
  • Stata -- We used version 15; almost all code should work in version 13 and up.
  • Python -- We used Python 3.8.0.

Data is hosted on OSF.io

Get data by datasets

Found an error or have a suggestion?

Awesome, we know there are errors and bugs. Or just much better ways to do a procedure.

To make a suggestion, please open a GitHub issue here with a title containing the case study name. You may also contact us directly. Cheers!

MarketSplash

How To Conduct Data Analysis In R: Essential Steps And Techniques

Embarking on data analysis in R can be a pivotal skill for developers. This article walks you through the key steps of data analysis using R, from importing and manipulating data to statistical analysis and visualization, equipping you with the essential tools and methods needed.

💡 KEY INSIGHTS

  • Effective data visualization can be achieved in R through basic plotting techniques like ggplot() and geom_ functions, essential for transforming complex data into actionable insights.
  • For statistical analysis , starting with fundamental functions like mean(), median(), and sd() provides a basic understanding of data, and advanced methods like t-tests and linear regression reveal deeper patterns and relationships.
  • The article highlights a case study on stratified sampling and model stability, emphasizing the importance of averaging outcomes from multiple iterations to mitigate small sample size impacts in R.
  • Reproducibility in R coding is crucial; using R Markdown for documentation and managing sessions with packrat or renv ensures consistent and shareable results.

Data analysis in R offers a streamlined approach for programmers and developers to handle complex datasets and extract meaningful insights. This article guides you through the essential steps and techniques in R, from data manipulation to visualization, ensuring a practical understanding of this widely-used programming language. With clear examples and straightforward explanations, you'll gain the skills needed to effectively analyze data in your projects.


Importing Data

This article covers importing data (reading CSV files, reading Excel files, connecting to databases, and handling different data formats), data manipulation, data visualization, statistical analysis, and frequently asked questions.

R makes it easy to import data from CSV files using the read.csv() function. For example:

For Excel files, the readxl package is commonly used. First, install and load the package:

Then, use the read_excel() function:

To import data from a database, use the DBI and RMySQL (or a relevant database-specific package) libraries:
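A combined sketch of these three import routes; file names, credentials, and table names are placeholders:

    # CSV
    sales <- read.csv("sales.csv")

    # Excel
    # install.packages("readxl")
    library(readxl)
    sales_xl <- read_excel("sales.xlsx", sheet = 1)

    # Database (MySQL shown)
    library(DBI)
    library(RMySQL)
    con <- dbConnect(RMySQL::MySQL(), dbname = "shop", host = "localhost",
                     user = "analyst", password = "secret")
    orders <- dbGetQuery(con, "SELECT * FROM orders")
    dbDisconnect(con)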

R supports various data formats. Functions like read.table() , read.csv2() , and read.delim() are tailored for different file types and delimiters.

By understanding these methods, you can efficiently import data into R for analysis, setting the stage for further data manipulation and exploration.

Data Manipulation

This section covers selecting columns, filtering rows, adding new columns, summarizing data, grouping data, and joining data frames.

To select specific columns from a dataset, use the select() function from dplyr . For example:

Filtering rows based on conditions is done using the filter() function:

To add new columns based on existing data, use the mutate() function:

For summarizing data, such as finding means or sums, use the summarise() function:

Grouping data based on a column is achieved with group_by() :

To merge data frames, use functions like inner_join() :
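A combined sketch of these verbs; the data frames df and customers and their columns are placeholders:

    library(dplyr)

    df %>% select(customer_id, amount)            # keep specific columns
    df %>% filter(amount > 100)                   # keep rows meeting a condition
    df %>% mutate(amount_eur = amount * 0.92)     # derive a new column
    df %>% summarise(mean_amount = mean(amount))  # collapse to a summary
    df %>% group_by(region) %>%                   # grouped summaries
      summarise(total = sum(amount))
    inner_join(df, customers, by = "customer_id") # merge two data frames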

By mastering these functions, you can effectively manipulate data in R, preparing it for analysis or visualization. These operations form the backbone of most data analysis tasks in R.

Data Visualization

This section covers creating a basic plot, customizing plots, creating bar charts, plotting histograms, and using facets for multiple plots.

To start with a basic plot, use ggplot() along with geom_ functions. For instance, for a scatter plot:

Customization, like changing plot titles or axes labels, enhances plot clarity:

Bar charts are useful for categorical data. Use geom_bar() for this purpose:

Histograms are great for visualizing distributions. They can be created using geom_histogram() :

To compare different subsets of data, use facet_wrap() or facet_grid() :
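A combined sketch of these plot types, again with placeholder data:

    library(ggplot2)

    # Scatter plot with a title and axis labels
    ggplot(df, aes(x = amount, y = profit)) +
      geom_point() +
      labs(title = "Profit vs. amount", x = "Amount", y = "Profit")

    # Bar chart for categorical data
    ggplot(df, aes(x = region)) + geom_bar()

    # Histogram of a distribution
    ggplot(df, aes(x = amount)) + geom_histogram(bins = 30)

    # One panel per region
    ggplot(df, aes(x = amount, y = profit)) + geom_point() + facet_wrap(~ region)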

Effective data visualization in R can transform complex data into understandable and actionable insights. These basic plotting techniques are the foundation for more advanced visualizations.

Statistical Analysis

This section covers descriptive statistics, t-tests, linear regression, ANOVA, and correlation analysis.

To get a basic understanding of your data, start with descriptive statistics. Functions like mean() , median() , and sd() are fundamental:

A t-test helps compare means between two groups. Use t.test() for this purpose:

Linear regression models the relationship between two variables. Use lm() to perform linear regression:

ANOVA (Analysis of Variance) tests differences among group means. The aov() function is used for this:

To examine the relationship between two continuous variables, use cor() :
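A combined sketch of these tests and models, with placeholder variables:

    mean(df$amount); median(df$amount); sd(df$amount)  # descriptive statistics

    t.test(amount ~ group, data = df)                  # compare two group means

    fit <- lm(profit ~ amount, data = df)              # linear regression
    summary(fit)

    summary(aov(amount ~ region, data = df))           # ANOVA across groups

    cor(df$amount, df$profit)                          # correlation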

Statistical analysis in R allows for a deeper understanding of data, revealing patterns, relationships, and insights that can guide decision-making. These basic statistical methods are essential tools in a data analyst's toolkit.

How Do I Ensure My R Code is Reproducible?

To ensure reproducibility, use R Markdown for documenting your analysis, manage your R sessions with packages like packrat or renv , and share your code and data in a clear and organized manner.

Is R Good for Statistical Analysis Compared to Other Tools?

R is particularly strong in statistical analysis, offering a wide range of statistical tests, models, and visualization options. It's highly regarded in academia and industries for statistical computing.

How Can I Improve the Performance of My R Code?

To improve performance, consider using more efficient data manipulation packages like data.table , writing more optimized code, and possibly parallel processing using packages like parallel or foreach .

What's the Difference Between apply() , lapply() , and sapply() ?

apply() is used for applying a function over the margins of an array or matrix, lapply() returns a list and is used for lists or vectors, while sapply() is a user-friendly version of lapply() that simplifies the output to a vector or matrix.


Case studies

Gautier Paux and Alex Dmitrienko

Introduction

Several case studies have been created to facilitate the implementation of simulation-based Clinical Scenario Evaluation (CSE) approaches in multiple settings and help the user understand individual features of the Mediana package. Case studies are arranged in terms of increasing complexity of the underlying clinical trial setting (i.e., trial design and analysis methodology). For example, Case study 1 deals with a number of basic settings and increasingly more complex settings are considered in the subsequent case studies.

Case study 1

This case study serves as a good starting point for users who are new to the Mediana package. It focuses on clinical trials with simple designs and analysis strategies where power and sample size calculations can be performed using analytical methods.

  • Trial with two treatment arms and single endpoint (normally distributed endpoint).
  • Trial with two treatment arms and single endpoint (binary endpoint).
  • Trial with two treatment arms and single endpoint (survival-type endpoint).
  • Trial with two treatment arms and single endpoint (survival-type endpoint with censoring).
  • Trial with two treatment arms and single endpoint (count-type endpoint).

Case study 2

This case study is based on a clinical trial with three or more treatment arms . A multiplicity adjustment is required in this setting and no analytical methods are available to support power calculations.

This example also illustrates a key feature of the Mediana package, namely, a useful option to define custom functions, for example, it shows how the user can define a new criterion in the Evaluation Model.

Clinical trial in patients with schizophrenia

Case study 3

This case study introduces a clinical trial with several patient populations (marker-positive and marker-negative patients). It demonstrates how the user can define independent samples in a data model and then specify statistical tests in an analysis model based on merging several samples, i.e., merging samples of marker-positive and marker-negative patients to carry out a test that evaluated the treatment effect in the overall population.

Clinical trial in patients with asthma

Case study 4

This case study illustrates CSE simulations in a clinical trial with several endpoints and helps showcase the package’s ability to model multivariate outcomes in clinical trials.

Clinical trial in patients with metastatic colorectal cancer

Case study 5

This case study is based on a clinical trial with several endpoints and multiple treatment arms and illustrates the process of performing complex multiplicity adjustments in trials with several clinical objectives.

Clinical trial in patients with rheumatoid arthritis

Case study 6

This case study is an extension of Case study 2 and illustrates how the package can be used to assess the performance of several multiplicity adjustments. The case study also walks the reader through the process of defining customized simulation reports.

Case study 1 deals with a simple setting, namely, a clinical trial with two treatment arms (experimental treatment versus placebo) and a single endpoint. Power calculations can be performed analytically in this setting. Specifically, closed-form expressions for the power function can be derived using the central limit theorem or other approximations.

Several distributions will be illustrated in this case study:

  • Normally distributed endpoint
  • Binary endpoint
  • Survival-type endpoint
  • Survival-type endpoint (with censoring)
  • Count-type endpoint

Suppose that a sponsor is designing a Phase III clinical trial in patients with pulmonary arterial hypertension (PAH). The efficacy of experimental treatments for PAH is commonly evaluated using a six-minute walk test and the primary endpoint is defined as the change from baseline to the end of the 16-week treatment period in the six-minute walk distance.

Define a Data Model

The first step is to initialize the data model:

After the initialization, components of the data model can be added to the DataModel object incrementally using the + operator.

The change from baseline in the six-minute walk distance is assumed to follow a normal distribution. The distribution of the primary endpoint is defined in the OutcomeDist object:

The sponsor would like to perform power evaluation over a broad range of sample sizes in each treatment arm:

As a side note, the seq function can be used to compactly define sample sizes in a data model:
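A sketch of the data model so far, following the Mediana constructors (the sample sizes shown are placeholders):

    library(Mediana)

    case.study1.data.model <- DataModel() +
      OutcomeDist(outcome.dist = "NormalDist") +
      SampleSize(seq(55, 75, 5))  # 55, 60, ..., 75 patients per arm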

The sponsor is interested in performing power calculations under two treatment effect scenarios (standard and optimistic scenarios). Under these scenarios, the experimental treatment is expected to improve the six-minute walk distance by 40 or 50 meters compared to placebo, respectively, with the common standard deviation of 70 meters.

Therefore, the mean change in the placebo arm is set to μ = 0 and the mean changes in the six-minute walk distance in the experimental arm are set to μ = 40 (standard scenario) or μ = 50 (optimistic scenario). The common standard deviation is σ = 70.

Note that the mean and standard deviation are explicitly identified in each list. This is done mainly for the user’s convenience.

After having defined the outcome parameters for each sample, two Sample objects that define the two treatment arms in this trial can be created and added to the DataModel object:
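A sketch of this step; the doubled parameters() call supplies one set of outcome parameters per scenario (standard, then optimistic):

    case.study1.data.model <- case.study1.data.model +
      Sample(id = "Placebo",
             outcome.par = parameters(parameters(mean = 0, sd = 70),
                                      parameters(mean = 0, sd = 70))) +
      Sample(id = "Treatment",
             outcome.par = parameters(parameters(mean = 40, sd = 70),
                                      parameters(mean = 50, sd = 70)))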

Define an Analysis Model

Just like the data model, the analysis model needs to be initialized as follows:

Only one significance test is planned to be carried out in the PAH clinical trial (treatment versus placebo). The treatment effect will be assessed using the one-sided two-sample t -test:

According to the specifications, the two-sample t-test will be applied to Sample 1 (Placebo) and Sample 2 (Treatment). These sample IDs come from the data model defined earlier. As explained in the manual (see Analysis Model), the sample order is determined by the expected direction of the treatment effect. In this case, an increase in the six-minute walk distance indicates a beneficial effect and a numerically larger value of the primary endpoint is expected in Sample 2 (Treatment) compared to Sample 1 (Placebo). This implies that the list of samples to be passed to the t-test should include Sample 1 followed by Sample 2. Note that from version 1.0.6 onward, it is possible to specify an option to indicate whether a larger numeric value is expected in Sample 2 (larger = TRUE) or in Sample 1 (larger = FALSE). By default, this argument is set to TRUE.

To illustrate the use of the Statistic object, the mean change in the six-minute walk distance in the treatment arm can be computed using the MeanStat statistic:
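A sketch of the analysis model with both the test and the statistic; the IDs match those referenced later in the evaluation model:

    case.study1.analysis.model <- AnalysisModel() +
      Test(id = "Placebo vs treatment",
           samples = samples("Placebo", "Treatment"),
           method = "TTest") +
      Statistic(id = "Mean Treatment",
                samples = samples("Treatment"),
                method = "MeanStat")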

Define an Evaluation Model

The data and analysis models specified above collectively define the Clinical Scenarios to be examined in the PAH clinical trial. The scenarios are evaluated using success criteria or metrics that are aligned with the clinical objectives of the trial. In this case it is most appropriate to use regular power or, more formally, marginal power . This success criterion is specified in the evaluation model.

First of all, the evaluation model must be initialized:

Secondly, the success criterion of interest (marginal power) is defined using the Criterion object:

The tests argument lists the IDs of the tests (defined in the analysis model) to which the criterion is applied (note that more than one test can be specified). The test IDs link the evaluation model with the corresponding analysis model. In this particular case, marginal power will be computed for the t-test that compares the mean change in the six-minute walk distance in the placebo and treatment arms (Placebo vs treatment).

In order to compute the average value of the mean statistic specified in the analysis model (i.e., the mean change in the six-minute walk distance in the treatment arm) over the simulation runs, another Criterion object needs to be added:

The statistics argument of this Criterion object lists the ID of the statistic (defined in the analysis model) to which this metric is applied (e.g., Mean Treatment ).
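A sketch of the evaluation model; the method names (MarginalPower, MeanSumm) follow the Mediana documentation and should be verified against the package manual:

    case.study1.evaluation.model <- EvaluationModel() +
      Criterion(id = "Marginal power",
                method = "MarginalPower",
                tests = tests("Placebo vs treatment"),
                labels = c("Placebo vs treatment"),
                par = parameters(alpha = 0.025)) +
      Criterion(id = "Average Mean",
                method = "MeanSumm",
                statistics = statistics("Mean Treatment"),
                labels = c("Mean Treatment"))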

Perform Clinical Scenario Evaluation

After the clinical scenarios (data and analysis models) and evaluation model have been defined, the user is ready to evaluate the success criteria specified in the evaluation model by calling the CSE function.

To accomplish this, the simulation parameters need to be defined in a SimParameters object:

The function call for CSE specifies the individual components of Clinical Scenario Evaluation in this case study as well as the simulation parameters:
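A sketch of the simulation parameters and the CSE call (the number of simulations and the seed are placeholders):

    case.study1.sim.parameters <- SimParameters(n.sims = 1000,
                                                proc.load = "full",
                                                seed = 42938001)

    case.study1.results <- CSE(case.study1.data.model,
                               case.study1.analysis.model,
                               case.study1.evaluation.model,
                               case.study1.sim.parameters)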

The simulation results are saved in a CSE object (case.study1.results). This object contains complete information about this particular evaluation, including the data, analysis and evaluation models specified by the user. The most important component of this object is the data frame contained in the list named simulation.results (case.study1.results$simulation.results). This data frame includes the values of the success criteria and metrics defined in the evaluation model.

Summarize the Simulation Results

Summary of simulation results in the R console

To facilitate the review of the simulation results produced by the CSE function, the user can invoke the summary function. This function displays the data frame containing the simulation results in the R console:

If the user is interested in generating graphical summaries of the simulation results (using the ggplot2 package or other packages), this data frame can also be saved to an object:
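For example:

    # Display the simulation results in the R console
    summary(case.study1.results)

    # The same data frame can be saved for custom (e.g. ggplot2-based) summaries
    case.study1.simulation.results <- case.study1.results$simulation.results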

Generate a Simulation Report

Presentation Model

A very useful feature of the Mediana package is the generation of a Microsoft Word-based report that provides a summary of the Clinical Scenario Evaluation.

To generate a simulation report, the user needs to define a presentation model by creating a PresentationModel object. This object must be initialized as follows:

Project information can be added to the presentation model using the Project object:

The user can easily customize the simulation report by defining report sections and specifying properties of summary tables in the report. The code shown below creates a separate section within the report for each set of outcome parameters (using the Section object) and sets the sorting option for the summary tables (using the Table object). The tables will be sorted by the sample size. Further, in order to define descriptive labels for the outcome parameter scenarios and sample size scenarios, the CustomLabel object needs to be used:
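A sketch of such a presentation model and the report call; the project details and labels are placeholders:

    case.study1.presentation.model <- PresentationModel() +
      Project(username = "[username]",
              title = "Case study 1",
              description = "PAH trial with two treatment arms") +
      Section(by = "outcome.parameter") +
      Table(by = "sample.size") +
      CustomLabel(param = "outcome.parameter",
                  label = c("Standard", "Optimistic")) +
      CustomLabel(param = "sample.size",
                  label = paste0("N = ", seq(55, 75, 5)))

    GenerateReport(presentation.model = case.study1.presentation.model,
                   cse.results = case.study1.results,
                   report.filename = "Case study 1 report.docx")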

Generate a Simulation Report

This case study will also illustrate the process of customizing a Word-based simulation report. This can be accomplished by defining custom sections and subsections to provide a structured summary of the complex set of simulation results.

Create a Customized Simulation Report

Define a presentation model.

Several presentation models will be used to produce customized simulation reports:

A report without subsections.

A report with subsections.

A report with combined sections.

First of all, a default PresentationModel object ( case.study6.presentation.model.default ) will be created. This object will include the common components of the report that are shared across the presentation models. The project information ( Project object), sorting options in summary tables ( Table object) and specification of custom labels ( CustomLabel objects) are included in this object:
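A sketch of this default object (project details and labels are placeholders):

    case.study6.presentation.model.default <- PresentationModel() +
      Project(username = "[username]",
              title = "Case study 6",
              description = "Simulation report for Case study 6") +
      Table(by = "sample.size") +
      CustomLabel(param = "sample.size",
                  label = paste0("N = ", c(50, 55, 60, 65, 70)))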

Report without subsections

The first simulation report will include a section for each outcome parameter set. To accomplish this, a Section object is added to the default PresentationModel object and the report is generated:

Report with subsections

The second report will include a section for each outcome parameter set and, in addition, a subsection will be created for each multiplicity adjustment procedure. The Section and Subsection objects are added to the default PresentationModel object as shown below and the report is generated:

Report with combined sections

Finally, the third report will include a section for each combination of outcome parameter set and each multiplicity adjustment procedure. This is accomplished by adding a Section object to the default PresentationModel object and specifying the outcome parameter and multiplicity adjustment in the section’s by argument.
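A sketch of the three report variants described above; the by values and the case.study6.results object are assumptions that should be checked against the Mediana manual:

    # Report 1: a section for each outcome parameter set
    GenerateReport(presentation.model = case.study6.presentation.model.default +
                     Section(by = "outcome.parameter"),
                   cse.results = case.study6.results,
                   report.filename = "Case study 6 (no subsections).docx")

    # Report 2: subsections for each multiplicity adjustment procedure
    GenerateReport(presentation.model = case.study6.presentation.model.default +
                     Section(by = "outcome.parameter") +
                     Subsection(by = "multiplicity.adjustment"),
                   cse.results = case.study6.results,
                   report.filename = "Case study 6 (subsections).docx")

    # Report 3: one section per combination of outcome parameter set and adjustment
    GenerateReport(presentation.model = case.study6.presentation.model.default +
                     Section(by = c("outcome.parameter", "multiplicity.adjustment")),
                   cse.results = case.study6.results,
                   report.filename = "Case study 6 (combined sections).docx")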

CSE report without subsections

CSE report with subsections

CSE report with combined subsections


Data analysis is a subset of data analytics. It is a process in which the objective is made clear, the relevant data is collected and preprocessed, the analysis is performed (understanding the data and exploring insights), and the results are visualized. The last step, visualization, is important for making people understand what is happening in the firm.

Steps involved in data analysis:


The process of data analysis includes all of these steps for a given problem statement. Example: analyze the products that are being rapidly sold out and the details of frequent customers of a retail shop.

  • Defining the problem statement – Understand the goal and what needs to be done. In this case, our problem statement is: "Which products sell out quickly, and which customers often visit the store?"
  • Collection of data – Not all of the company’s data is necessary; identify the data relevant to the problem. Here the required columns are product ID, customer ID, and date visited.
  • Preprocessing – Cleaning the data to put it in a structured format is mandatory before performing analysis:
  • Removing outliers (noisy data).
  • Removing null or irrelevant values in the columns (e.g., replacing null values with the mean of that column).
  • If there is any missing data, either ignore the tuple or fill it with the mean value of the column.

Data Analysis using the Titanic dataset

You can download the Titanic dataset (it contains data from real passengers of the Titanic) from here. Save the dataset in the current working directory; now we will start the analysis (getting to know our data).

Our dataset contains columns such as the name, age, and gender of each passenger, the class they traveled in, whether they survived or not, etc. To check the class (data type) of each column, the sapply() method can be used.

We can recode the Survived column, mapping 0 to "dead" and 1 to "alive", using the factor() function.

We can examine a summary of all the columns, their values, and data types; summary() can be used for this purpose.
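A sketch of these three steps, assuming the data frame is called titanic:

    sapply(titanic, class)  # data type of each column

    # Recode Survived: 0 = dead, 1 = alive
    titanic$Survived <- factor(titanic$Survived, levels = c(0, 1),
                               labels = c("dead", "alive"))

    summary(titanic)  # summary of all columns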

From the above summary we can extract below observations:

  • Total passengers:  891
  • The number of total people who survived:  342
  • Number of total people dead:  549
  • Number of males in the titanic:  577
  • Number of females in the titanic:  314
  • Maximum age among all people in titanic:  80
  • Median age:  28

Preprocessing of the data is important before analysis, so null values have to be checked and removed.

  • dropnull_train contains only 631 rows because the total rows in the dataset (808) minus the rows with null values (177) leaves 631 rows.
  • Now we will split the survived and dead passengers into separate lists from these 631 rows.
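A sketch of these two steps:

    # Keep only complete rows
    dropnull_train <- titanic[rowSums(is.na(titanic)) == 0, ]

    # Split into survived and dead passengers
    survived <- dropnull_train[dropnull_train$Survived == "alive", ]
    dead     <- dropnull_train[dropnull_train$Survived == "dead", ]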

Now we can visualize the number of males and females who died and survived using bar plots, histograms, and pie charts.
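For example, with base R graphics:

    pie(table(dropnull_train$Survived), main = "Survived vs. dead")
    barplot(table(dropnull_train$Sex),  main = "Passengers by sex")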

[Pie chart: distribution of the Survived column]

From the above pie chart, we can certainly say that there is a data imbalance in the target/Survived column.


Now let’s draw a bar plot to visualize the number of males and females who were there on the titanic ship.

[Bar plot: number of males and females aboard]

From the barplot above we can see that nearly 350 males and about 50 females did not survive the Titanic.


Here we can observe that some passengers were charged extremely high fares. These values can affect our analysis, as they are outliers. Let’s confirm their presence using a boxplot.
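For example:

    boxplot(titanic$Fare, main = "Passenger fares", ylab = "Fare")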

[Boxplot: passenger fares]

Certainly, there are some extreme outliers present in this dataset.


Introduction to R for Data Science: A LISA 2020 Guidebook

Chapter 7 Network Analysis

In this chapter, we will cover concepts and procedures related to network analysis in R. “Networks enable the visualization of complex, multidimensional data as well as provide diverse statistical indices for interpreting the resultant graphs” (Jones et al., 2018). Put otherwise, network analysis is a collection of techniques that visualize and estimate relationships among agents in a social context. Furthermore, network analysis is used “to analyze the social structures that emerge from the recurrence of these relations” where “[the] basic assumption is that better explanations of social phenomena are yielded by analysis of the relations among entities” (Science Direct; Linked Below).

Networks are made up of nodes (i.e., individual actors, people, or things within the network) and the ties , edges , or links (i.e., relationships or interactions) that connect them. The extent to which nodes are connected lends to interpretations of the measured social context.

“By comparison with most other branches of quantitative social science, network analysts have given limited attention to statistical issues. Most techniques and measures examine the structure of specific data sets without addressing sampling variation, measurement error, or other uncertainties. Such issues are complex because of the dependencies inherent in network data, but they are now receiving increased study. The most widely investigated approach to the statistical analysis of networks stresses the detection of formal regularities in local relational structure .

[Figure: common relational structures in social networks (panels A–D)]

The figure above illustrates some of the relational structures commonly found in analyses of social networks.

A: Demonstrates a relationship of reciprocity/mutuality.

B: Demonstrates a directed relationship with a common target.

C: Relationships emerge from a common source.

D: Transitive direct relationships with indirect influences.

Another type is homophily, which is present, for example, when same-sex friendships are more common than between-sex friendships. This involves an interaction between a property of units and the presence of relationships” (Peter V. Marsden, in Encyclopedia of Social Measurement , 2005). This sort of model might reflect the tendency of people to seek out those that are similar to themselves.

7.0.0.1 Measures of Centrality

Measures of centrality provide quantitative context regarding the importance of a node within a network. There are four measures of centrality that we will cover.

Degree Centrality : The degree of a node is the number of other nodes that single node is connected to. Important nodes tend to have more connections to other nodes. Highly connected nodes are interpreted to have high degree centrality.

Eigenvector Centrality : The extent to which adjacent nodes are connected themselves also indicate importance (e.g., Important nodes increase the importance of other nodes).

Closeness centrality : Closeness centrality measures how many steps are required to access every other node from a given node. In other words, important nodes have easy access to other nodes given multiple connections.

Betweenness Centrality : This ranks the nodes based on the flow of connections through the network. Importance is demonstrated through high frequency of connection with multiple other nodes. Nodes with high levels of betweenness tend to serve as a bridge for multiple sets of other important nodes. See this link for a set of journals and books that cover the topic.

Also, examine this (paid) online tool for text-based network analysis: https://www.infranodus.com

7.1 Zachary’s Karate Club Case Study

We will be working with a dataset called Zacharies Karate Club, a seminal dataset in network analysis literature. First we need to install the relevant packages. Today we will need a package called igraph , a package useful for creating, analyzing, and visualizing networks. If you do not have the packages already, install the tidyverse , igraph , ggnetwork , and intergraph . igraph helps us perform network analysis. ggnetwork and intergraph are both packages used for plotting networks in the ggplot framework.
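A sketch of the setup:

    # install.packages(c("tidyverse", "igraph", "ggnetwork", "intergraph"))
    library(tidyverse)
    library(igraph)
    library(ggnetwork)
    library(intergraph)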

Zachary’s Karate Club Background

Taken from Wikipedia: “A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting pairwise links between members who interacted outside the club. During the study a conflict arose between the administrator “John A” and instructor “Mr. Hi” (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other group found a new instructor or gave up karate. Based on network analysis Zachary correctly predicted each member’s decision except member #9, who went with Mr. Hi instead of John A.” In this case study, we will try to infer/predict the group splits with network analysis techniques.

7.1.0.1 Load Data and Extract Model Features

Now it’s time to extract the relevant information that we need from the dataset. We need the associations between members (edges), the groupings after the split of the network, and the labels of the nodes.

Extract the groups and labels of the vertices and store them in vectors. Make sure that the labels are stored as characters rather than factors (check with the “str()” function), as igraph requires character data for vertex labels.
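The chapter does not reproduce the data import here. One readily available copy of the network is the karate object in the igraphdata package (mentioned later in this chapter), which stores the post-split faction of each member as a vertex attribute; treating that as the data source is an assumption. A sketch:

library(igraphdata)                     # assumption: the data come from igraphdata's 'karate' network
data(karate)

edges  <- as_edgelist(karate, names = FALSE)   # pairwise links between members, as a numeric edge list
groups <- V(karate)$Faction                    # group after the split: 1 = Mr. Hi, 2 = John A
labels <- as.character(V(karate)$name)         # member labels as characters, not factors
str(labels)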

7.1.0.2 Creating Networks From Data

Now that we have extracted the relevant data that we need, let’s construct a network of Zachary’s Karate club.

We can also create vertex attributes. Let’s make a vertex attribute for each group (Mr. Hi and John A).

Create a vertex attribute for node label. Call the attribute ‘label’.
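A sketch that builds the network from the extracted edge list and adds both vertex attributes (the objects edges, groups, and labels come from the extraction step above; the object name G follows the text):

# build an undirected network from the two-column edge list
G <- graph_from_edgelist(edges, directed = FALSE)

# vertex attribute for the post-split group (1 = Mr. Hi, 2 = John A)
V(G)$group <- groups

# vertex attribute for the node label
V(G)$label <- labels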

7.1.0.3 Visualizing Networks with base R

Now visualize the network by running the plot function on our network ‘G’.


Let’s change some of the plot aesthetics. We can change the vertex colors, edge colors, vertex sizes, etc. Play around with the arguments for plotting a network.


We can also change the color of our vertices according to group.
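A plotting sketch covering the three steps above (the argument values are illustrative, not the chapter's exact choices):

# basic plot of the network
plot(G)

# adjust some aesthetics: vertex size/color, edge color, label size
plot(G,
     vertex.size      = 10,
     vertex.color     = "lightblue",
     edge.color       = "grey50",
     vertex.label.cex = 0.8)

# color the vertices by group membership (1 = Mr. Hi's faction, 2 = John A's)
plot(G, vertex.color = ifelse(V(G)$group == 1, "gold", "tomato"))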


7.1.0.4 Visualizing Networks with ggnetwork

You can also use ggplot to visualize igraph objects.


Let’s see if we can make the ggplot version look better.


Using ggnetwork and ggplot, color or shape the nodes by karate group. Also make some other plot aesthetic changes to your liking.
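A sketch of one way to do this with ggnetwork (the color values and legend labels are illustrative; the faction coding follows the group attribute created earlier):

ggplot(ggnetwork(G), aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "grey60") +
  geom_nodes(aes(color = factor(group)), size = 6) +
  geom_nodetext(aes(label = label), size = 3) +
  scale_color_manual(values = c("gold", "tomato"),
                     labels = c("Mr. Hi", "John A"),
                     name   = "Faction") +
  theme_blank()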

7.1.0.5 Measuring Centrality


Finally, let’s put all of the centrality measures in one table so that we can compare the outputs.

It makes sense that the most connected members of the network are indeed John A. and Mr. Hi. We can also view the centrality measures from the perspective of the graph itself. Here, we add the object degr_cent to the vertex size to display the nodes scaled by their degree centrality, using base R.
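A sketch of the centrality computations, the comparison table, and the degree-scaled plot (the object name degr_cent follows the text; the other names are illustrative):

degr_cent  <- degree(G)                   # degree centrality
eigen_cent <- eigen_centrality(G)$vector  # eigenvector centrality
close_cent <- closeness(G)                # closeness centrality
betw_cent  <- betweenness(G)              # betweenness centrality

centrality <- data.frame(label       = V(G)$label,
                         degree      = degr_cent,
                         eigenvector = eigen_cent,
                         closeness   = close_cent,
                         betweenness = betw_cent)
head(centrality[order(-centrality$degree), ])

# base R plot with vertex sizes scaled by degree centrality
plot(G, vertex.size = 5 + degr_cent)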


Now, using the tidyverse ! Change the code below to make a graph of our network where node sizes are scaled by the degree centrality.
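A starting sketch in the ggplot/ggnetwork framework (the size mapping is the part to adjust; storing the degree as a vertex attribute lets ggnetwork carry it along):

V(G)$degree <- degree(G)   # store degree as a vertex attribute

ggplot(ggnetwork(G), aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_edges(color = "grey60") +
  geom_nodes(aes(color = factor(group), size = degree)) +
  theme_blank()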


7.1.0.6 Modularity

Modularity is a measure that describes the extent to which community structure is present within a network when the groups are labeled. A modularity score close to 1 indicates the presence of strong community structure in the network. In other words, nodes in the same group are more likely to be connected than nodes in different groups. A modularity score close to -1 indicates the opposite of community structure. In other words, nodes in different groups are more likely to be connected than nodes in the same group. A modularity score close to 0 indicates that no community structure (or anti-community structure) is present in the network.


Compute the modularity of Zachary’s Karate Club network using the modularity() function.
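A sketch, assuming the group attribute created earlier codes the two factions as 1 and 2 (modularity() expects a numeric membership vector; the object name ZCCmod is the one referred to later in the text):

ZCCmod <- modularity(G, membership = V(G)$group)
ZCCmod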

Higher modularity scores are better; however, modularity should not be used alone to assess the presence of communities in a network. Rather, multiple measures should be used together to build an argument for community structure in a network.

7.2 Community Detection

Suppose we no longer have the group labels, but we want to infer the existence of groups in our network. This process is known as community detection. There are many different ways to infer the existence of groups in a network.

7.2.0.1 Via Modularity Maximization

The goal here is to find the groupings of nodes that lead to the highest possible modularity score.
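igraph offers several modularity-based community detection algorithms; cluster_fast_greedy() is one greedy modularity-maximization approach (a sketch, not necessarily the exact algorithm used in the chapter):

mod_comm <- cluster_fast_greedy(G)   # greedy modularity maximization
length(mod_comm)                     # number of communities found
membership(mod_comm)                 # community assignment of each member

# overlay the detected communities on the network
plot(mod_comm, G)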


It turns out that the modularity maximization algorithm finds 3 communities within Zachary’s Karate Club network. But if we merge two of those communities so that only two groups remain, only one node is incorrectly grouped. Let’s try another community detection algorithm.

7.2.0.2 Via Edge Betweenness

Edge betweenness community structure detection is based on the following assumption: edges connecting separate groupings have high edge betweenness, as all of the shortest paths from one module to another must pass through them. Practically, this means that if we gradually remove the edge with the highest edge betweenness score, our network will separate into communities.
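A sketch using igraph's implementation of this algorithm:

eb_comm <- cluster_edge_betweenness(G)   # remove high-betweenness edges iteratively
length(eb_comm)                          # number of communities found
membership(eb_comm)
plot(eb_comm, G)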


7.3 Network Simulation

Say you want to model a network in the absence of data. It is possible to simulate a network to find out whether the structure we observe is actually interesting or simply random. If you are familiar with hypothesis testing, we can view these random networks as our “null models”. We assume that our null model is true until there is enough evidence to suggest that it does not describe the real-life network. If our null model is a good fit, then we have achieved a good representation of our network. If we don’t have a good fit, then there is likely additional structure in the network that is unaccounted for.

Our Question: How can we explain the group structure of our network? Is it random or can we explain it via the degree sequence?

7.3.0.1 Random Network Generation

Erdos-Renyi random networks in R require that we specify a number of nodes \(n\) , and an edge construction probability \(p\) . Essentially, for every pair of nodes, we flip a biased coin with the probability of “heads” being \(p\) . If we get a “heads”, then we draw an edge between that pair of nodes. This process simulates the social connections rather than plotting them from a dataset.
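In igraph this generator is sample_gnp(); a small illustrative example (the values of n and p here are arbitrary):

# an Erdos-Renyi random network with 34 nodes and edge probability 0.1 (illustrative values)
er <- sample_gnp(n = 34, p = 0.1)
plot(er, vertex.size = 8, vertex.label = NA)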


Is this Erdos-Renyi random network a good representative model of Zachary’s Karate Club network? Let’s construct the Erdos-Renyi random network that is most similar to our network.

We can match the parameters of the Erdos-Renyi random graph by specifying the number of nodes and the edge connection probability p. For Zachary’s Karate Club network, we want to use 34 nodes in our graph; if we changed the number of nodes, we would lose the ability to compare our network with the theoretical model. We can estimate the edge probability for the simulated network as the mean of degr_cent divided by the number of nodes minus 1 in the ZKC network.
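A sketch of matching the Erdos-Renyi parameters to the observed network, reusing the degree centralities computed earlier:

n_nodes <- vcount(G)                        # 34 nodes, as in the karate club
p_hat   <- mean(degr_cent) / (n_nodes - 1)  # estimated edge construction probability

er_match <- sample_gnp(n = n_nodes, p = p_hat)
plot(er_match, vertex.size = 8, vertex.label = NA)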


Let’s check out the degree distribution for our random graph and the actual ZKC graph.
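One way to put the two degree distributions side by side (a sketch):

par(mfrow = c(1, 2))
hist(degree(er_match), main = "Erdos-Renyi", xlab = "Degree")
hist(degree(G),        main = "Karate club", xlab = "Degree")
par(mfrow = c(1, 1))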


7.3.0.2 Configuration Model

For this kind of random-graph model, we specify the exact degree sequence of all the nodes. We then construct a random graph that has exactly the degree sequence given.
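In igraph, sample_degseq() draws a random graph with a prescribed degree sequence; a sketch (the "vl" method produces a simple, connected graph):

config <- sample_degseq(degree(G), method = "vl")
plot(config, vertex.size = 8, vertex.label = NA)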


Is the configuration model random network a good representative model of the Zachary’s Karate Club Network?

Let’s see if the configuration model captures the group structure of the model. We are going to perform a permutation test in which we generate 1000 different configuration models (with the same degree sequence as ZKC), and then estimate how the actual value of the ZKC modularity lines up with the distribution of configuration model modularities.

Now let’s plot a histogram of these values, with a vertical line representing the modularity of ZKC network that we computed earlier. This value is stored in the object ZCCmod .
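A sketch of the simulation loop and the plot described above (set.seed() is added for reproducibility; the chapter's exact code may differ):

set.seed(1)
sim_mods <- replicate(1000, {
  g_sim <- sample_degseq(degree(G), method = "vl")      # same degree sequence as ZKC
  modularity(g_sim, membership = V(G)$group)            # score it with the observed faction labels
})

hist(sim_mods, main = "Configuration-model modularities", xlab = "Modularity",
     xlim = range(c(sim_mods, ZCCmod)))
abline(v = ZCCmod, col = "red", lwd = 2)                 # observed ZKC modularity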

[Figure: histogram of the simulated configuration-model modularities, with a vertical line at the observed ZKC modularity]

We can see from the above that our computed modularity is extremely improbable. No simulations had a modularity that was as high as the one in ZKC. This tells us that the particular degree sequence of ZKC does not capture the community structure. Put otherwise, the configuration model does a bad job reflecting the community structure captured in the ZKC dataset.

7.3.0.3 Stochastic Block Model

Stochastic block models are similar to the Erdos-Renyi random network but allow us to specify additional parameters. The stochastic block model adds a group structure to the random graph model: we can specify the group sizes and separate edge construction probabilities for within-group and between-group ties.
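A sketch with sample_sbm(), using the two observed faction sizes and illustrative within- and between-group probabilities (the values in the probability matrix are assumptions, not estimates from the chapter):

block_sizes <- as.numeric(table(V(G)$group))   # sizes of the two factions
pref <- matrix(c(0.25, 0.02,
                 0.02, 0.25), nrow = 2)        # within- vs. between-group edge probabilities (illustrative)

sbm <- sample_sbm(n = sum(block_sizes),
                  pref.matrix = pref,
                  block.sizes = block_sizes)
plot(sbm, vertex.size = 8, vertex.label = NA,
     vertex.color = rep(c("gold", "tomato"), times = block_sizes))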


Is the stochastic block model a good representative model of Zachary’s Karate Club network?


7.4 Advanced Case Study

See this link ( https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01742/ ) to access a paper by Jones, Mair, & McNally (2018), all professors at Harvard University in the Department of Psychology who discuss visualizing psychological networks in R.

See this link ( https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01742/full#supplementary-material ) to access all supplementary material, including the relevant datasets needed for the code below.

Read the paper and run the code alongside the narrative to get the most out of this case study. For a brief overview of the paper see this abstract:

“Networks have emerged as a popular method for studying mental disorders. Psychopathology networks consist of aspects (e.g., symptoms) of mental disorders (nodes) and the connections between those aspects (edges). Unfortunately, the visual presentation of networks can occasionally be misleading. For instance, researchers may be tempted to conclude that nodes that appear close together are highly related, and that nodes that are far apart are less related. Yet this is not always the case. In networks plotted with force-directed algorithms, the most popular approach, the spatial arrangement of nodes is not easily interpretable. However, other plotting approaches can render node positioning interpretable. We provide a brief tutorial on several methods including multidimensional scaling, principal components plotting, and eigenmodel networks. We compare the strengths and weaknesses of each method, noting how to properly interpret each type of plotting approach.”

7.5 Datasets for Network Analysis

There is a package called “igraphdata” that contains many network datasets. Additionally, there are several more datasets at “The Colorado Index of Complex Networks (ICON)”. Here is the link: https://icon.colorado.edu/#!/

In this chapter we introduced network analysis concepts and methods. To make sure you understand this material, there is a practice assessment to go along with this chapter at https://jayholster1.shinyapps.io/NetworksinRAssessment/

7.7 References

Bojanowski, M. (2015). intergraph: Coercion routines for network data objects. R package version 2.0-2. http://mbojan.github.io/intergraph

Csardi, G., Nepusz, T. (2006). “The igraph software package for complex network research.” InterJournal , Complex Systems, 1695. <https://igraph.org> .

Paranyushkin, D. (2019). InfraNodus: Generating insight using text network analysis. In The World Wide Web Conference ( WWW ’19 ). Association for Computing Machinery, New York, NY, USA, 3584–3589. https://doi.org/10.1145/3308558.3314123

Jones, P. J., Mair, P., & McNally, R. J. (2018). Visualizing psychological networks: A tutorial in R. Frontiers in Psychology, 9 , 1742. https://doi.org/10.3389/fpsyg.2018.01742

Tyner, S., Briatte, F., & Hofmann, H. (2017). Network Visualization with ggplot2 , The R Journal 9(1): 27–59. https://briatte.github.io/ggnetwork/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4 (43), 1686. https://doi.org/10.21105/joss.01686 .

7.7.1 R Short Course Series

Video lectures of each guidebook chapter can be found at https://osf.io/6jb9t/ . For this chapter, follow the folder path Network Analysis in R -> AY 2021-2022 Spring to access the video files, R Markdown documents, and other materials for each short course.

7.7.2 Acknowledgements

This guidebook was created with support from the Center for Research Data and Digital Scholarship and the Laboratory for Interdisciplinary Statistical Analysis at the University of Colorado Boulder, as well as the U.S. Agency for International Development under cooperative agreement #7200AA18CA00022. Individuals who contributed to materials related to this project include Jacob Holster, Eric Vance, Michael Ramsey, Nicholas Varberg, and Nickoal Eichmann-Kalwara.


rstudio::conf 2020 (January 30, 2020)

  • R + Tidyverse in Sports. There are many ways in which R and the Tidyverse can be used to analyze sports data, along with the unique considerations involved in applying statistical tools to sports problems.
  • Putting the Fun in Functional Data: A tidy pipeline to identify routes in NFL tracking data. Currently in football many hours are spent watching game film to manually label the routes run on passing plays.
  • Professional Case Studies. The path to becoming a world-class, data-driven organization is daunting.
  • Making better spaghetti (plots): Exploring the individuals in longitudinal data with the brolgar package. There are two main challenges of working with longitudinal (panel) data: 1) visualising the data, and 2) understanding the model.
  • Journalism with RStudio, R, and the Tidyverse. The Associated Press data team primarily uses R and the Tidyverse as the main tool for doing data processing and analysis.
  • How Vibrant Emotional Health Connected Siloed Data Sources and Streamlined Reporting Using R. Vibrant Emotional Health is the mental health not-for-profit behind the US National Suicide Prevention Lifeline, New York City's NYC Well program, and various other emotional health contact center...
  • How to win an AI Hackathon, without using AI. Once "big data" is thrown into the mix, the AI solution is all but certain. But is AI always needed?
  • Building a new data science pipeline for the FT with RStudio Connect. We have recently implemented a new Data Science workflow and pipeline, using RStudio Connect and Google Cloud Services.

rstudio::conf 2018

  • Teach the Tidyverse to beginners (March 4, 2018)
  • Storytelling with R
  • Imagine Boston 2030: Using R-Shiny to keep ourselves accountable and empower the public
  • How I Learned to Stop Worrying and Love the Firewall
  • Differentiating by data science
  • Agile data science
  • Achieving impact with advanced analytics: Breaking down the adoption barrier
  • A SAS-to-R success story. Our biostatistics group has historically utilized SAS for data management and analytics for biomedical research studies, with R only used occasionally for new methods or data visualization. Several years ago and with the encouragement of leadership, we initiated a movement to increase our usage of R significantly.
  • Understanding PCA using Shiny and Stack Overflow data (February 26, 2018)
  • The unreasonable effectiveness of empathy
  • Rapid prototyping data products using Shiny
  • Phrasing: Communicating data science through tweets, gifs, and classic misdirection
  • Open-source solutions for medical marijuana
  • Developing and deploying large scale shiny applications
  • Accelerating cancer research with R


  • Data Analysis with R
  • Programming in R
  • 1-Introduction to data science and R
  • 2-Basics in R
  • 3-Data structures and basic calculations
  • 4-Operators
  • 5-Data wrangling - 1.Import
  • 6-Data wrangling - 2.Tidy Data
  • Data exploration & visualization
  • 7-Data wrangling - 3.Transformation
  • 8-Intro2Visualization - Part 1
  • 9-Intro2Visualization - Part 2(Adjusting plots)
  • 10-Handling and visualization of categorical data
  • 11-R Markdown for communication
  • Statistical Modelling
  • 12-Intro2Statistical Modelling - Part 1
  • 13-Intro2Statistical Modelling - Part 2
  • 14-Model building and selection
  • 15-Modelling demonstration
  • 16-String manipulation and regular expressions
  • 17-Functions and Iteration 1 (Loops)
  • 18-Iteration 2 (purrr and the map family)
  • Quizzes with swirl

Case studies

  • Case study 1
  • Case study 2
  • Case study 1 - solution
  • Case study 2 - solution
  • Lecture 8 - Geom functions
  • Lecture 8 - Code for ggplots in lecture
  • Lecture 14 - Find the model
  • Lecture 15 - Model building demo

Data analysis with R

This is a course taught as part of the curriculum of the Master program MARSYS-MARine EcoSYStem and Fishery Sciences at the Institute of Marine Ecosystem and Fishery Science (IMF) , University of Hamburg, Germany.

The course is designed for 36 hours in total (~120 min per lecture), excluding the time students spend on their case studies, and is held over the duration of one semester with 2 lectures per week. The course can also be used as a short-term (e.g. 1-week) intensive training or in a self-study mode at the Bachelor, Master or PhD level.

Data Science principles

This course will introduce the principles of data science and how to mine insights from data to understand complex behaviors, trends, and inferences. It will teach skills in three major areas, with a focus on marine topics. However, the course can be used by scientists from any other discipline, as the key concepts are the same across disciplines.

Tidyverse packages

Data Analysis with R builds heavily on the tidyverse framework and introduces several of its packages, which provide an R syntax ‘dialect’ that simplifies data import, processing and visualization.

Course learning outcomes

At the end of the course students will

  • understand the principles of data science
  • be trained in formulating and investigating research questions within the marine context
  • feel confident working with one of the most widely used software environments for data analysis
  • be familiar with various data types and data structures
  • import data into and export data from R
  • subset, manipulate and transform data
  • write their own functions and apply iterations such as loops
  • compute descriptive statistics
  • be able to visualize data in various ways, including creating maps
  • understand the principles of statistical modelling and the mathematics behind simple linear regression models
  • apply different linear model families
  • compare and select models
  • visualize model results
  • evaluate model diagnostics using real datasets
  • learn how to work as part of a research team to produce scientific products

Requirements

This course assumes no prior knowledge of computer programming or statistical modelling. Some knowledge of basic statistics, however, will be advantageous. For an efficient workflow, please make sure to download the data and install everything before working through the material provided. The course will be taught in the institute’s computer room using RStudio Server Pro. The server version can be accessed from any location through an internet browser, so no further preparation is required. If you want to work on your own computer using a desktop version, see lecture 1 for installation information.

Course structure

The course provides 18 lectures (each ~120 min) covering the topics Programming in R , Data Exploration & Visualization and Statistical Modelling . Each lecture contains interactive quizzes and exercises throughout, which each student should complete individually. Some of the exercises also require a bit of homework. Please note that the interactive quizzes only work in the browser and not in the PDF files.

During the slide show the following single character keyboard shortcuts enable alternate display modes:

o enables overview mode

w toggles widescreen mode

f enables fullscreen mode

h enables code highlight mode

control (Windows) or command (Mac) AND + / - to zoom in or out

p opens a separate window for additional information (does not work in Safari).

Pressing esc exits all of these modes.

swirl course

As part of the R-Lab 2.0 project at the University of Hamburg, all quiz questions in the lectures have been additionally converted into a swirl course. Now, students can answer the quiz questions directly from within R, without all the additional information shown in the slides.

Two case studies are provided:

  • one on data exploration and visualization using hydrographical data for the Baltic Sea → suitable after lecture 11
  • one on statistical modelling using fish catch data for the Baltic Sea → suitable after lecture 15 or later

These case studies are meant as group exercises (3-4 students) but can easily be split into individual tasks. Each group is expected to work in R Markdown for communicating their work progress and results.

The solution script for case study 2 will be made accessible for a short time after assignments were submitted. If you are not part of the course and interested in the solution script feel free to contact me!


Introduction to Statistical Thinking

Chapter 16 Case Studies

16.1 Student Learning Objective

This chapter concludes this book. We start with a short review of the topics that were discussed in the second part of the book, the part that dealt with statistical inference. The main part of the chapter involves the statistical analysis of 2 case studies. The tools that will be used for the analysis are those that were discussed in the book. We close this chapter and this book with some concluding remarks. By the end of this chapter, the student should be able to:

Review the concepts and methods for statistical inference that were presented in the second part of the book.

Apply these methods to requirements of the analysis of real data.

Develop a resolve to learn more statistics.

16.2 A Review

The second part of the book dealt with statistical inference: the science of making general statements about an entire population on the basis of data from a sample. The basis for these statements is theoretical models that produce the sampling distribution. Procedures for making the inference are evaluated based on their properties in the context of this sampling distribution. Procedures with desirable properties are applied to the data. One may attach to the output of this application summaries that describe these theoretical properties.

In particular, we dealt with two forms of making inference. One form was estimation and the other was hypothesis testing. The goal in estimation is to determine the value of a parameter in the population. Point estimates or confidence intervals may be used in order to fulfill this goal. The properties of point estimators may be assessed using the mean square error (MSE) and the properties of the confidence interval may be assessed using the confidence level.

The target in hypothesis testing is to decide between two competing hypotheses. These hypotheses are formulated in terms of population parameters. The decision rule is called a statistical test and is constructed with the aid of a test statistic and a rejection region. The default hypothesis among the two is rejected if the test statistic falls in the rejection region. The major property a test must possess is a bound on the probability of a Type I error, the probability of erroneously rejecting the null hypothesis. This restriction is called the significance level of the test. A test may also be assessed in terms of its statistical power, the probability of rightfully rejecting the null hypothesis.

Estimation and testing were applied in the context of single measurements and for the investigation of the relations between a pair of measurements. For single measurements we considered both numeric variables and factors. For numeric variables one may attempt to conduct inference on the expectation and/or the variance. For factors we considered the estimation of the probability of obtaining a level, or, more generally, the probability of the occurrence of an event.

We introduced statistical models that may be used to describe the relations between variables. One of the variables was designated as the response. The other variable, the explanatory variable, is identified as a variable which may affect the distribution of the response. Specifically, we considered numeric variables and factors that have two levels. If the explanatory variable is a factor with two levels then the analysis reduces to the comparison of two sub-populations, each one associated with a level. If the explanatory variable is numeric then a regression model may be applied, either linear or logistic regression, depending on the type of the response.

The foundations of statistical inference are the assumptions that we make in the form of statistical models. These models attempt to reflect reality. However, one is advised to apply healthy skepticism when using the models. First, one should be aware of what the assumptions are. Then one should ask oneself how reasonable these assumptions are in the context of the specific analysis. Finally, one should check, as much as one can, the validity of the assumptions in light of the information at hand. It is useful to plot the data and compare the plot to the assumptions of the model.

16.3 Case Studies

Let us apply the methods that were introduced throughout the book to two examples of data analysis. Both examples are taken from the case studies of the Rice Virtual Lab in Statistics and can be found in their Case Studies section. The analysis of these case studies may involve any of the tools that were described in the second part of the book (and some from the first part). It may be useful to read again Chapters  9 – 15 before reading the case studies.

16.3.1 Physicians’ Reactions to the Size of a Patient

Overweight and obesity are common in many developed countries. In some cultures, obese individuals face discrimination in employment, education, and relationship contexts. The current research, conducted by Mikki Hebl and Jingping Xu 87 , examines physicians’ attitudes toward overweight and obese patients in comparison to their attitudes toward patients who are not overweight.

The experiment included a total of 122 primary care physicians affiliated with one of three major hospitals in the Texas Medical Center of Houston. These physicians were sent a packet containing a medical chart similar to the one they view upon seeing a patient. This chart portrayed a patient who was displaying symptoms of a migraine headache but was otherwise healthy. Two variables (the gender and the weight of the patient) were manipulated across six different versions of the medical charts. The weight of the patient, described in terms of Body Mass Index (BMI), was average (BMI = 23), overweight (BMI = 30), or obese (BMI = 36). Physicians were randomly assigned to receive one of the six charts, and were asked to look over the chart carefully and complete two medical forms. The first form asked physicians which of 42 tests they would recommend giving to the patient. The second form asked physicians to indicate how much time they believed they would spend with the patient, and to describe the reactions that they would have toward this patient.

In this presentation, only the question on how much time the physicians believed they would spend with the patient is analyzed. Although three patient weight conditions were used in the study (average, overweight, and obese) only the average and overweight conditions will be analyzed. Therefore, there are two levels of patient weight (average and overweight) and one dependent variable (time spent).

The data for the given collection of responses from 72 primary care physicians is stored in the file “ discriminate.csv ” 88 . We start by reading the content of the file into a data frame by the name “ patient ” and presenting the summary of the variables:
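A sketch of the import step (the file’s URL appears in the footnotes at the end of this chapter; the book’s exact code may differ):

patient <- read.csv("http://pluto.huji.ac.il/~msby/StatThink/Datasets/discriminate.csv")
patient$weight <- factor(patient$weight)   # make sure the grouping variable is a factor
summary(patient)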

Observe that of the 72 “patients”, 38 are overweight and 33 have an average weight. The time spent with the patient, as predicted by the physicians, is distributed between 5 minutes and 1 hour, with an average of 27.82 minutes and a median of 30 minutes.

It is good practice to have a look at the data before doing the analysis. In this examination one should see that the numbers make sense and one should identify special features of the data. Even in this very simple example we may want to have a look at the histogram of the variable “ time ”:
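A sketch of the histogram call:

hist(patient$time)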


A feature in this plot that catches attention is the fact that there is a high concentration of values in the interval between 25 and 30. Together with the fact that the median is equal to 30, one may suspect that, as a matter of fact, a large number of the values are actually equal to 30. Indeed, let us produce a table of the response:
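A sketch of the table call:

table(patient$time)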

Notice that 30 of the 72 physicians marked “ 30 ” as the time they expect to spend with the patient. This is the middle value in the range, and may just be the default value one marks if one only needs to complete a form and does not really place much importance on the question that was asked.

The goal of the analysis is to examine the relation between overweight and the physician’s response. The explanatory variable is a factor with two levels. The response is numeric. A natural tool to use in order to test this hypothesis is the \(t\) -test, which is implemented with the function “ t.test ”.

First we plot the relation between the response and the explanatory variable and then we apply the test:
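A sketch of the plot and the test described here (the box plot follows from plotting a numeric response against a two-level factor; the book’s exact code may differ):

boxplot(time ~ weight, data = patient)
t.test(time ~ weight, data = patient)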


Nothing seems problematic in the box plot. The two distributions, as they are reflected in the box plots, look fairly symmetric.

When we consider the report that is produced by the function “ t.test ” we may observe that the \(p\) -value is equal to 0.005774. This \(p\) -value is computed for testing the null hypothesis that the expectations of the response for both types of patients are equal, against the two-sided alternative. Since the \(p\) -value is less than 0.05 we reject the null hypothesis.

The estimated value of the difference between the expectation of the response for a patient with BMI=23 and a patient with BMI=30 is \(31.36364 -24.73684 \approx 6.63\) minutes. The confidence interval is (approximately) equal to \([1.99, 11.27]\) . Hence, it looks as if the physicians expect to spend more time with the average weight patients.

After analyzing the effect of the explanatory variable on the expectation of the response one may want to examine the presence, or lack thereof, of such effect on the variance of the response. Towards that end, one may use the function “ var.test ”:
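A sketch of the call described in the text:

var.test(time ~ weight, data = patient)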

In this test we do not reject the null hypothesis that the two variances of the response are equal, since the \(p\) -value is larger than \(0.05\) . The sample variances are almost equal to each other (their ratio is \(1.044316\) ), with a confidence interval for the ratio that essentially ranges between 1/2 and 2.

The production of \(p\) -values and confidence intervals is just one aspect in the analysis of data. Another aspect, which typically is much more time consuming and requires experience and healthy skepticism is the examination of the assumptions that are used in order to produce the \(p\) -values and the confidence intervals. A clear violation of the assumptions may warn the statistician that perhaps the computed nominal quantities do not represent the actual statistical properties of the tools that were applied.

In this case, we have noticed the high concentration of the response at the value “ 30 ”. What is the situation when we split the sample between the two levels of the explanatory variable? Let us apply the function “ table ” once more, this time with the explanatory variable included:
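A sketch of the two-way table:

table(patient$time, patient$weight)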

Not surprisingly, there is still a high concentration at the level “ 30 ”. But one can see that only 2 of the responses in the “ BMI=30 ” group are above that value, in comparison to a much more symmetric distribution of responses for the other group.

The simulations of the significance level of the one-sample \(t\) -test for an Exponential response that were conducted in Question  \[ex:Testing.2\] may cast some doubt on how trustworthy the nominal \(p\) -values of the \(t\) -test are when the measurements are skewed. The skewness of the response for the group “ BMI=30 ” is a reason to worry.

We may consider a different test, which is more robust, in order to validate the significance of our findings. For example, we may turn the response into a factor by setting a level for values larger or equal to “ 30 ” and a different level for values less than “ 30 ”. The relation between the new response and the explanatory variable can be examined with the function “ prop.test ”. We first plot and then test:
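A sketch of the dichotomization, the plot, and the test (the prop.test expression mirrors the one given in the footnotes; plotting a two-way table produces a mosaic plot):

time30 <- patient$time >= 30           # TRUE: a prediction of 30 minutes or more
plot(table(time30, patient$weight))    # mosaic plot of the new factor against weight
prop.test(table(time30, patient$weight))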


The mosaic plot presents the relation between the explanatory variable and the new factor. The level “ TRUE ” is associated with a value of the predicted time spent with the patient being 30 minutes or more. The level “ FALSE ” is associated with a prediction of less than 30 minutes.

The computed \(p\) -value is equal to \(0.05409\) , which almost reaches the significance level of 5% 89 . Notice that the probabilities that are being estimated by the function are the probabilities of the level “ FALSE ”. Overall, one may see the outcome of this test as supporting evidence for the conclusion of the \(t\) -test. However, the \(p\) -value provided by the \(t\) -test may overemphasize the evidence in the data for a significant difference in the physicians’ attitude towards overweight patients.

16.3.2 Physical Strength and Job Performance

The next case study involves an attempt to develop a measure of physical ability that is easy and quick to administer, does not risk injury, and is related to how well a person performs the actual job. The current example is based on a study by Blakley et al.  90 , published in the journal Personnel Psychology.

There are a number of very important jobs that require, in addition to cognitive skills, a significant amount of strength to be able to perform at a high level. Construction workers, electricians, and auto mechanics all require strength in order to carry out critical components of their jobs. An interesting applied problem is how to select the best candidates from among a group of applicants for physically demanding jobs in a safe and cost-effective way.

The data presented in this case study, which may be used for the development of a method for selection among candidates, were collected from 147 individuals working in physically demanding jobs. Two measures of strength were gathered from each participant. These included grip and arm strength. A piece of equipment known as the Jackson Evaluation System (JES) was used to collect the strength data. The JES can be configured to measure the strength of a number of muscle groups. In this study, grip strength and arm strength were measured. The outcomes of these measurements were summarized in two scores of physical strength called “ grip ” and “ arm ”.

Two separate measures of job performance are presented in this case study. First, the supervisors for each of the participants were asked to rate how well their employee(s) perform on the physical aspects of their jobs. This measure is summarized in the variable “ ratings ”. Second, simulations of physically demanding work tasks were developed. The summary score of these simulations is given in the variable “ sims ”. Higher values of either measure of performance indicate better performance.

The data for the 4 variables and 147 observations is stored in “ job.csv ” 91 . We start by reading the content of the file into a data frame by the name “ job ”, presenting a summary of the variables, and their histograms:
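A sketch of the import and the first look (the file’s URL appears in the footnotes; the book’s exact code may differ):

job <- read.csv("http://pluto.huji.ac.il/~msby/StatThink/Datasets/job.csv")
summary(job)

par(mfrow = c(2, 2))
hist(job$grip); hist(job$arm); hist(job$ratings); hist(job$sims)
par(mfrow = c(1, 1))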


All variables are numeric. Examination of the 4 summaries and histograms does not produce interesting findings. All variables are, more or less, symmetric, with the distribution of the variable “ ratings ” tending perhaps to be more uniform than the other three.

The main analyses of interest are attempts to relate the two measures of physical strength “ grip ” and “ arm ” with the two measures of job performance, “ ratings ” and “ sims ”. A natural tool to consider in this context is a linear regression analysis that relates a measure of physical strength as an explanatory variable to a measure of job performance as a response.

FIGURE 16.1: Scatter Plots and Regression Lines

Let us consider the variable “ sims ” as a response. The first step is to plot a scatter plot of the response and explanatory variable, for both explanatory variables. To the scatter plot we add the line of regression. In order to add the regression line we fit the regression model with the function “ lm ” and then apply the function “ abline ” to the fitted model. The plot for the relation between the response and the variable “ grip ” is produced by the code:
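A sketch of that code (the model object name sims.grip is illustrative, not necessarily the book’s):

sims.grip <- lm(sims ~ grip, data = job)
plot(sims ~ grip, data = job)
abline(sims.grip)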

The plot that is produced by this code is presented on the upper-left panel of Figure  16.1 .

The plot for the relation between the response and the variable “ arm ” is produced by this code:
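The analogous sketch for the second explanatory variable (again, the object name sims.arm is illustrative):

sims.arm <- lm(sims ~ arm, data = job)
plot(sims ~ arm, data = job)
abline(sims.arm)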

The plot that is produced by the last code is presented on the upper-right panel of Figure  16.1 .

Both plots show similar characteristics. There is an overall linear trend in the relation between the explanatory variable and the response. The value of the response increases with the increase in the value of the explanatory variable (a positive slope). The regression line seems to follow, more or less, the trend that is demonstrated by the scatter plot.

A more detailed analysis of the regression model is possible by the application of the function “ summary ” to the fitted model. First the case where the explanatory variable is “ grip ”:
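A sketch of the call, using the fitted model from above:

summary(sims.grip)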

Examination of the report reveals a clear statistical significance for the effect of the explanatory variable on the distribution of the response. The value of R-squared, the ratio of the variance of the response explained by the regression, is \(0.4094\) . The square root of this quantity, \(\sqrt{0.4094} \approx 0.64\) , is the proportion of the standard deviation of the response that is explained by the explanatory variable. Hence, about 64% of the variability in the response can be attributed to the measure of the strength of the grip.

For the variable “ arm ” we get:
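A sketch of the corresponding call:

summary(sims.arm)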

This variable is also statistically significant. The value of R-squared is \(0.4706\) . The proportion of the standard deviation that is explained by the strength of the arm is \(\sqrt{0.4706} \approx 0.69\) , which is slightly higher than the proportion explained by the grip.

Overall, the explanatory variables do a fine job in the reduction of the variability of the response “ sims ” and may be used as substitutes for the response in order to select among candidates. A better prediction of the response based on the values of the explanatory variables can be obtained by combining the information in both variables. The production of such a combination is not discussed in this book, though it is similar in principle to the methods of linear regression that are presented in Chapter  14 . The produced score 92 takes the form:

\[\mbox{\texttt{score}} = -5.434 + 0.024\cdot \mbox{\texttt{grip}}+ 0.037\cdot \mbox{\texttt{arm}}\;.\] We use this combined score as an explanatory variable. First we form the score and plot the relation between it and the response:
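A sketch that forms the score from the coefficients given above, refits the regression, and adds the line to the plot:

job$score <- -5.434 + 0.024 * job$grip + 0.037 * job$arm   # combined score from the formula above
sims.score <- lm(sims ~ score, data = job)
plot(sims ~ score, data = job)
abline(sims.score)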

The scatter plot that includes the regression line can be found in the lower-left panel of Figure  16.1 . Indeed, the linear trend is more pronounced for this scatter plot and the regression line is a better description of the relation between the response and the explanatory variable. A summary of the regression model produces the report:
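A sketch of the call, using the fitted model from above:

summary(sims.score)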

Indeed, the score is highly significant. More importantly, the R-squared coefficient that is associated with the score is \(0.5422\) , which corresponds to a ratio of the standard deviation that is explained by the model of \(\sqrt{0.5422} \approx 0.74\) . Thus, almost 3/4 of the variability is accounted for by the score, so the score is a reasonable means of guessing what the results of the simulations will be. This guess is based only on the results of the simple tests of strength that are conducted with the JES device.

Before putting the final seal on the results let us examine the assumptions of the statistical model. First, with respect to the two explanatory variables. Does each of them really measure a different property or do they actually measure the same phenomena? In order to examine this question let us look at the scatter plot that describes the relation between the two explanatory variables. This plot is produced using the code:
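A sketch of one way to produce that scatter plot (the orientation of the axes is a guess):

plot(grip ~ arm, data = job)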

It is presented in the lower-right panel of Figure  16.1 . Indeed, one may see that the two measurements of strength are not independent of each other but tend to produce an increasing linear trend. Hence, it should not be surprising that the relation of each of them with the response produces essentially the same goodness of fit. The computed score gives a slightly improved fit, but still, it basically reflects either of the original explanatory variables.

In light of this observation, one may want to consider other measures of strength that represent features of strength not captured by these two variables. Namely, measures that show less of a joint trend than the two considered.

Another element that should be examined is the set of probabilistic assumptions that underlie the regression model. We described the regression model only in terms of the functional relation between the explanatory variable and the expectation of the response. In the case of linear regression, for example, this relation was given in terms of a linear equation. However, another part of the model corresponds to the distribution of the measurements about the line of regression. The assumption that led to the computation of the reported \(p\) -values is that this distribution is Normal.

A method that can be used in order to investigate the validity of the Normal assumption is to analyze the residuals from the regression line. Recall that these residuals are computed as the difference between the observed value of the response and its estimated expectation, namely the fitted regression line. The residuals can be computed via the application of the function “ residuals ” to the fitted regression model.

Specifically, let us look at the residuals from the regression line that uses the score that is combined from the grip and arm measurements of strength. One may plot a histogram of the residuals:
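A sketch of the two panels described below (the qqnorm call mirrors the one given in the footnotes; qqline adds an optional reference line that is not mentioned in the text):

res <- residuals(sims.score)

par(mfrow = c(2, 1))
hist(res)                  # upper panel: histogram of the residuals
qqnorm(res); qqline(res)   # lower panel: Normal Quantile-Quantile plot
par(mfrow = c(1, 1))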


The produced histogram is presented on the upper panel. The histogram portrays a symmetric distribution that may result from Normally distributed observations. A better method to compare the distribution of the residuals to the Normal distribution is to use the Quantile-Quantile plot . This plot can be found on the lower panel. We do not discuss here the method by which this plot is produced 93 . However, we do say that any deviation of the points from a straight line is an indication of a violation of the assumption of Normality. In the current case, the points seem to be on a single line, which is consistent with the assumptions of the regression model.

The next task should be an analysis of the relations between the explanatory variables and the other response “ ratings ”. In principle one may use the same steps that were presented for the investigation of the relations between the explanatory variables and the response “ sims ”. But of course, the conclusion may differ. We leave this part of the investigation as an exercise to the students.

16.4 Summary

16.4.1 Concluding Remarks

The book included a description of some elements of statistics, elements that we thought are simple enough to be explained as part of an introductory course to statistics and are the minimum that is required for any person who is involved in academic activities in any field in which the analysis of data is required. Now, as you finish the book, it is as good a time as any to say a few words regarding the elements of statistics that are missing from this book.

One element is more of the same. The statistical models that were presented are as simple as a model can get. A typical application will require more complex models. Each of these models may require specific methods for estimation and testing. The characteristics of inference, e.g. significance or confidence levels, rely on assumptions that the models are assumed to possess. The user should be familiar with computational tools that can be used for the analysis of these more complex models. Familiarity with the probabilistic assumptions is required in order to be able to interpret the computer output, to diagnose possible divergence from the assumptions and to assess the severity of the possible effect of such divergence on the validity of the findings.

Statistical tools can be used for tasks other than estimation and hypothesis testing. For example, one may use statistics for prediction. In many applications it is important to assess what the values of future observations may be and in what range of values they are likely to occur. Statistical tools such as regression are natural in this context. However, the required task is not testing or estimating the values of parameters, but the prediction of future values of the response.

A different role of statistics is in the design stage. We hinted in that direction when we talked in Chapter  \[ch:Confidence\] about the selection of a sample size in order to assure a confidence interval with a given accuracy. In most applications, the selection of the sample size emerges in the context of hypothesis testing and the criterion for selection is the minimal power of the test, a minimal probability to detect a true finding. Yet, statistical design is much more than the determination of the sample size. Statistics may have a crucial input in the decision of how to collect the data. With an eye on the requirements for the final analysis, an experienced statistician can make sure that the data that is collected is indeed appropriate for that final analysis. Too often is the case where a researcher steps into the statistician’s office with data that he or she collected and asks, when it is already too late, for help in the analysis of data that cannot provide a satisfactory answer to the research question the researcher tried to address. It may be said, with some exaggeration, that good statisticians are required for the final analysis only in the case where the initial planning was poor.

Last, but not least, is the theoretical mathematical theory of statistics. We tried to introduce as little as possible of the relevant mathematics in this course. However, if one seriously intends to learn and understand statistics then one must become familiar with the relevant mathematical theory. Clearly, deep knowledge in the mathematical theory of probability is required. But apart from that, there is a rich and rapidly growing body of research that deals with the mathematical aspects of data analysis. One cannot be a good statistician unless one becomes familiar with the important aspects of this theory.

I should have started the book with the famous quotation: “Lies, damned lies, and statistics”. Instead, I am using it to end the book. Statistics can be used and can be misused. Learning statistics can give you the tools to tell the difference between the two. My goal in writing the book is achieved if reading it will mark for you the beginning of the process of learning statistics and not the end of the process.

16.4.2 Discussion in the Forum

In the second part of the book we have learned many subjects. Most of these subjects, especially for those that had no previous exposure to statistics, were unfamiliar. In this forum we would like to ask you to share with us the difficulties that you encountered.

What was the topic that was most difficult for you to grasp? In your opinion, what was the source of the difficulty?

When forming your answer to this question we will appreciate if you could elaborate and give details of what the problem was. Pointing to deficiencies in the learning material and confusing explanations will help us improve the presentation for the future editions of this book.

Hebl, M. and Xu, J. (2001). Weighing the care: Physicians’ reactions to the size of a patient. International Journal of Obesity, 25, 1246-1252. ↩

The file can be found on the internet at http://pluto.huji.ac.il/~msby/StatThink/Datasets/discriminate.csv . ↩

One may propose splitting the response into two groups, with one group being associated with values of “ time ” strictly larger than 30 minutes and the other with values less or equal to 30. The resulting \(p\) -value from the expression “ prop.test(table(patient$time>30,patient$weight)) ” is \(0.01276\) . However, the number of subjects in one of the cells of the table is equal only to 2, which is problematic in the context of the Normal approximation that is used by this test. ↩

Blakley, B.A., Quiñones, M.A., Crawford, M.S., and Jago, I.A. (1994). The validity of isometric strength tests. Personnel Psychology, 47, 247-274. ↩

The file can be found on the internet at http://pluto.huji.ac.il/~msby/StatThink/Datasets/job.csv . ↩

The score is produced by the application of the function “ lm ” to both variables as explanatory variables. The code expression that can be used is “ lm(sims ~ grip + arm, data=job) ”. ↩

Generally speaking, the plot is composed of the empirical percentiles of the residuals, plotted against the theoretical percentiles of the standard Normal distribution. The current plot is produced by the expression “ qqnorm(residuals(sims.score)) ”. ↩
