A Glimpse into R and the Data Scientist’s Toolbox

I recently completed Coursera’s “The Data Scientist’s Toolbox” course presented by Johns Hopkins University. This 4-week course offers a broad overview of what data science is and how to set up your “toolbox” for R programming and analysis of data. Below are my takeaways that have inspired me to continue to learn more about data science and R:

  • Data science begins with formulating the right questions and finding the right data set before bringing in math/statistics and hacking (programming) knowledge. This is part of the experimental design process.
  • A data scientist is broadly defined as someone:

    “who combines the skills of software programmer, statistician and storyteller slash artist to extract the nuggets of gold hidden under mountains of data”

  • RStudio is a go-to graphical interface (application) to start developing R projects right away.
  • RStudio plays well with GitHub.
  • R’s strength is statistical computing.
  • R has a markdown package, R Markdown! It can “knit” your project together into an HTML, PDF, or Word document.
  • There are 6 general categories of data analyses:
    1. descriptive: summarizing a data set (U.S. census)
    2. exploratory: exploring data to find relationships (% of women in specific work sectors)
    3. inferential: generalizing from a small sample to a larger population (air pollution measured in a small area is used to infer how all U.S. residents are affected by pollution)
    4. predictive: using historical data to predict what happens next (elections)
    5. causal: exploring cause & effect of variables upon each other (trials for drugs)
    6. mechanistic: understanding the exact changes in variables that lead to changes in other variables (material science experiments)
  • Experimental design begins with choosing variables (independent and dependent) in order to formulate a hypothesis (expected outcome) about which variables will be affected or changed.
  • An independent variable (factor) is often plotted on the X-axis.
  • The dependent variable is often plotted on the Y-axis.
  • Big data is defined by volume (more data), velocity (data is being generated quickly), and variety (data is available in several formats).
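
To make a few of these takeaways concrete, here is a minimal sketch in base R. It is an illustration rather than course material: it assumes the built-in mtcars data set as a stand-in for a real project, treats car weight (wt) as the independent variable and miles per gallon (mpg) as the dependent variable, and the names weight_mpg_cor and fit are made up for this example.

```r
# Assumption: the built-in mtcars data set stands in for "the right data set".
data(mtcars)

# Descriptive: summarize one variable (miles per gallon).
print(summary(mtcars$mpg))

# Exploratory: check for a relationship between weight and fuel economy.
# The correlation is negative: heavier cars tend to get fewer miles per gallon.
weight_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
print(weight_mpg_cor)

# A simple linear model of the dependent variable (mpg) on the
# independent variable (wt).
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Independent variable on the X-axis, dependent variable on the Y-axis.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
abline(fit)  # overlay the fitted line
```

Run inside RStudio, the plot shows up in the Plots pane. Run with Rscript instead, base R writes it to a file named Rplots.pdf in the working directory.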

Important note: I did not pay to take this course. I audited it for its core content, so I was unable to take any of the module quizzes or submit a course project. This is a good course for finding out whether data science is an area you’d like to explore.

Statistics was a weakness of mine in undergraduate coursework, but asking questions, doing math, and writing code don’t scare me. So! Next up in my exploration of data science: R Programming on Coursera.
