Advances in information technology have made it possible to collect huge amounts of data in every walk of life and beyond. These vast troves of data have enabled scientists, social scientists, government agencies, and companies to ask increasingly complex questions aimed at understanding the physical and human world, informing public policies, and improving productivity.
However, data alone are not enough. Solving a domain problem or answering a domain question takes a principled statistical investigation process. This process includes problem formulation, data collection (ideally through careful experimental design), data cleaning, data visualization, algorithm/model development, validation, and post-hoc analysis. It rests on suitable computing platforms, appropriate choices of scalable algorithms/models, and careful consideration of domain knowledge and information from the data. With both computation and narratives (domain knowledge) forming its foundation, the statistical investigation process is one of rigorous evidence-seeking to reach trustworthy data-driven conclusions that are useful for the domain problem and accessible to domain experts.
Often, the most impactful contributions are made when domain experts (scientists, for example) and statisticians work together to brainstorm and ask questions. These domain experts are not only key to formalizing the ideas, but are also integral in generating the data. Engaging with the individuals who collected the data in the first place allows the statistician to learn about the context in which the data live and, subsequently, to conduct an effective analysis capable of answering the question being asked.
This course will demonstrate what it is like to be an applied statistician or data scientist in today's data-rich world. We emphasize the goal of answering questions outside of statistics using data and domain knowledge through working with domain experts. Through lectures, class discussions, data labs, and homework assignments, we illustrate the many steps involved in the iterative process of information extraction, or of a statistical investigation. Specifically, students will learn together and critically engage with the technical topics of EDA (exploratory data analysis), prediction algorithms (e.g., least squares, random forests), identification of sources of randomness in data, probabilistic models (e.g., linear regression), inference, and interpretation. We use the concept of three realms to separate current data, algorithms/models, and future data, and we discuss when and how to connect them. The PCS framework (workflow and documentation), based on the three principles of data science, namely predictability, computability, and stability (PCS), will be applied as an overarching theme.
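To give a small taste of two of the prediction algorithms named above, the sketch below fits least squares and a random forest and compares them on held-out data, which stands in for the "future data" realm. This is only an illustration, not course material: the synthetic dataset and the use of numpy and scikit-learn are assumptions made for the example.

```python
# A minimal sketch of two prediction algorithms mentioned in the course
# description: ordinary least squares and a random forest.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data (a stand-in for a real-data problem): a linear signal
# in two of three features, plus noise.
X = rng.uniform(-2, 2, size=(500, 3))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=500)

# The held-out test set plays the role of "future data": we judge each
# model by how well it predicts observations it was not fit on.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("least squares", LinearRegression()),
                    ("random forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.3f}")
```

Because the synthetic signal here is truly linear, least squares should predict well; on real data with unknown structure, comparisons like this, repeated under perturbations of the data and modeling choices, are the kind of predictability and stability check the PCS framework calls for.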
The lectures (and labs) will be based on real-data problems, and students will learn useful statistical concepts and methods in the contexts of these problems. The aim is to illustrate how judgment and common sense are crucial to the statistical investigation process. We introduce the technical topics through a first-principles approach so that students gain the skills necessary to develop new techniques for solving problems in unfamiliar situations in the future.
The essential elements of applied statistics are captured in Bin’s piece entitled “Data Wisdom”. Students are asked to read the piece after the first lecture.