NGCM Summer Academy 2016: Pandas

As part of the second annual NGCM Summer Academy, students and academics received a two-day training course on the Pandas data analysis library for the Python programming language, taught by Chris Fonnesbeck from the Department of Biostatistics at Vanderbilt University and Skipper Seabold from Civis Analytics.

Pandas is an open-source Python library that makes data analysis faster and more straightforward. It provides high-performance, easy-to-use data structures and data analysis tools.

The first day of the workshop opened with Introduction to NumPy and Pandas, a presentation led by Skipper Seabold. He began with a brief explanation of the basic functionality of Pandas, followed by an overview of the performance, methods and functions of arrays. Skipper then introduced the fast and efficient DataFrame object, which simplifies data manipulation, performs well when merging and joining data sets, and comes with tools for reading and writing data in a wide range of formats.
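To give a flavour of the merging and joining mentioned above, here is a minimal sketch. It is not taken from the workshop material; the two small tables and their column names are invented purely for illustration.

```python
import pandas as pd

# Hypothetical example data, not from the workshop.
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "name": ["Ana", "Ben", "Cy"]})
scores = pd.DataFrame({"student_id": [1, 2, 4],
                       "score": [88, 92, 75]})

# An inner join on the shared key keeps only the ids present in both tables.
merged = pd.merge(students, scores, on="student_id", how="inner")
print(merged)
```

Changing `how` to `"left"`, `"right"` or `"outer"` keeps unmatched rows from one or both tables and fills the gaps with missing values.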

In parallel, Chris Fonnesbeck gave a presentation on integrated indexing and hierarchical axis indexing, an intuitive way of working with high-dimensional data in a lower-dimensional data structure. Chris then closed the first day by chairing the workshop on data alignment and integrated handling of missing data, where he explained in detail how to reshape and pivot data sets flexibly and how to aggregate or transform data with the groupby method.
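The ideas in this session can be sketched in a few lines. The example below is an illustration under our own assumptions (the site names, years and readings are made up) rather than an excerpt from the course notebooks.

```python
import numpy as np
import pandas as pd

# Hypothetical readings indexed by two levels (site, year): a two-dimensional
# layout stored in a one-dimensional Series through a hierarchical index.
index = pd.MultiIndex.from_product(
    [["site_a", "site_b"], [2015, 2016]], names=["site", "year"]
)
readings = pd.Series([1.0, np.nan, 3.0, 5.0], index=index, name="reading")

# Selecting on the outer level returns a Series indexed by year.
site_a = readings.loc["site_a"]

# Missing data: replace NaN explicitly...
filled = readings.fillna(0.0)

# ...or let the aggregation skip it, which mean() does by default.
site_means = readings.groupby(level="site").mean()

# Reshaping: move the inner index level into columns.
wide = readings.unstack("year")
print(wide)
```

Whether to fill missing values or let aggregations skip them depends on the analysis; the sketch shows both only to contrast them.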

On the second day, the lecturers led an introductory workshop on high-level plotting with Pandas and Seaborn, which make statistical graphics in Python more attractive and informative. Later, parallel computing with the Dask library was demonstrated, focusing on task graphs and on task scheduling, where different schedulers offer different performance guarantees and operate in different contexts.

Additionally, participants were introduced to methods for statistical data modeling, including fitting linear and non-linear models and bootstrapping methods. Finally, Dr. Fonnesbeck presented a range of machine learning algorithms available in the scikit-learn library for Python.
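As a rough illustration of the bootstrapping idea, the sketch below resamples a synthetic data set with replacement and builds a percentile interval for its mean. The data, the seed and the number of resamples are arbitrary choices of ours, not values from the workshop.

```python
import numpy as np

# Synthetic sample, generated here only for illustration.
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Draw resamples of the same size with replacement and record each mean.
n_resamples = 1000  # arbitrary choice
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_resamples)
])

# Percentile bootstrap: the central 95% of the resampled means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

The percentile interval is only one of several bootstrap interval constructions; it is used here because it is the simplest to write down.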

The demonstration sessions culminated in a set of exercises on each topic covered during the two-day workshop.

All the material for the workshop can be downloaded from GitHub by following this link.

Posted by Alejandra Vergara