Date: 8 January 2024 @ 13:00 - 16:00

Timezone: Brussels

Duration: 3 hours

Researchers often spend a significant amount of time on data-wrangling tasks, such as reformatting, cleaning, and integrating data from different sources. Despite the availability of software tools, they often end up with difficult-to-reuse workflows that require manual steps. Omnipy is a new Python library that offers a systematic and scalable approach to research data and metadata wrangling. It allows researchers to import data in various formats and continuously reshape it through typed transformations. For large datasets, Omnipy seamlessly scales up data flows for deployment on external compute resources, with the user in full control of the orchestration.
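
To make the idea of typed transformations concrete, the rough sketch below parses raw rows into typed records up front, so that downstream steps can rely on the declared types. It uses plain pydantic rather than Omnipy's own classes, and the field names are invented for the example.

```python
# Generic illustration of typed parsing (plain pydantic, not Omnipy's API).
from pydantic import BaseModel, ValidationError


class SampleRecord(BaseModel):
    sample_id: str
    read_count: int  # coerced from a numeric string where possible


raw_rows = [
    {"sample_id": "S1", "read_count": "1204"},
    {"sample_id": "S2", "read_count": "not-a-number"},
]

parsed, failed = [], []
for row in raw_rows:
    try:
        parsed.append(SampleRecord(**row))  # parse into a typed object
    except ValidationError as err:
        failed.append((row, str(err)))

print(parsed)  # typed records, with read_count now an int
print(failed)  # rows that could not be parsed
```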

This workshop will build on the half-day workshop "Using Omnipy for data wrangling and metadata mapping (beginner level)". In this second workshop, participants will learn how to develop various types of data flows in Omnipy, including integration with web services. They will make use of the powerful, industry-developed Prefect orchestration engine to scale up their data flows and deploy high-throughput ETL flows on external compute resources.

The workshop is divided into three parts:

  1. The first part will introduce the concepts behind the slogan "parse, don't validate" and show how they are implemented in Omnipy (the sketch above gives the general idea). Against this background, we will introduce the three types of data flows supported by Omnipy: linear, DAG, and function flows. Through hands-on examples, we will also show how to use various job modifiers to power up and customise predefined tasks and flows into more complex data flows.
  2. The second part will focus on integrating data flows with web services through REST APIs. We will mainly focus on extracting data from data sources, but will also touch upon loading results into data sinks. Hands-on examples will introduce tasks and flows that flatten JSON data into relational tabular form for mapping and then restructure the results back to JSON (see the flattening sketch after this list).
  3. The last part will introduce Omnipy's integration with S3-based cloud storage and the Prefect ETL orchestration library (see the flow sketch after this list). As a hands-on exercise, participants will scale up the data flow developed in the second part of the workshop by deploying it on external compute infrastructure, potentially the Kubernetes-based NIRD Toolkit from Sigma2 (if Prefect integration in NIRD is finalised in time for the workshop).
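
To give a feel for the flattening step in part 2, here is a minimal stand-in using pandas.json_normalize rather than the Omnipy tasks covered in the workshop; the study/sample structure and field names are invented for the example.

```python
# Sketch of flattening nested JSON into a relational table with pandas
# (stand-in for the Omnipy flattening tasks; field names are hypothetical).
import json
import pandas as pd

nested = [
    {"study": "ST001", "samples": [{"id": "S1", "tissue": "liver"},
                                   {"id": "S2", "tissue": "brain"}]},
    {"study": "ST002", "samples": [{"id": "S3", "tissue": "liver"}]},
]

# One row per sample, with the parent study id repeated on each row
samples_table = pd.json_normalize(nested, record_path="samples", meta=["study"])
print(samples_table)

# Restructure back to nested JSON after mapping/cleaning
restructured = [
    {"study": study, "samples": group.drop(columns="study").to_dict(orient="records")}
    for study, group in samples_table.groupby("study")
]
print(json.dumps(restructured, indent=2))
```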
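
Similarly, the overall shape of an ETL flow under Prefect (part 3) can be sketched as below, using the plain Prefect 2 decorators rather than Omnipy's own Prefect integration; the API URL and field names are placeholders.

```python
# Minimal Prefect 2 sketch of an extract-transform-load flow
# (independent of Omnipy; the URL and field names are placeholders).
import requests
from prefect import flow, task


@task(retries=2)
def extract(url: str) -> list[dict]:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


@task
def transform(records: list[dict]) -> list[dict]:
    # Keep only the fields needed downstream (hypothetical field names)
    return [{"id": r["id"], "name": r.get("name", "")} for r in records]


@task
def load(records: list[dict]) -> None:
    # Placeholder for writing results to a data sink, e.g. S3-based storage
    print(f"Would upload {len(records)} records")


@flow(log_prints=True)
def etl_flow(url: str = "https://example.org/api/records") -> None:
    records = extract(url)
    cleaned = transform(records)
    load(cleaned)


if __name__ == "__main__":
    etl_flow()
```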

Venue: Georg Sverdrups hus, Moltke Moes vei 39

City: Oslo

Region: Oslo kommune

Country: Norway

Postcode: 0851

Prerequisites:

Participants should have some experience with Python programming and scripting. We will also assume a basic understanding of the JSON format and of how to make use of REST APIs. Experience with an Integrated Development Environment (IDE) and the command line is helpful, but not a prerequisite. Note that we will also assume that participants have attended the beginner-level workshop "Using Omnipy for Data Wrangling and Metadata Mapping", which we are holding before lunch.

Learning objectives:

  • Put the concepts behind the slogan "parse, don't validate" into practice
  • Define the three fundamental flow types in Omnipy
  • Reuse and repurpose existing tasks and flows by applying job modifiers
  • Extract data from external REST APIs
  • Transform nested JSON output into normalised tables
  • Load results to external services
  • Scale up a data flow by deploying it on external compute resources
  • Orchestrate flow runs using the Prefect web-based GUI and inspect data output from external runs

Organizer: The workshop is provided by ELIXIR Oslo as part of an extended event organised by the Digital Scholarship Center (DSC), USIT, Carpentry@UiO, CodeRefinery, dScience, Simula, the Data Managers Network at UiO, ELIXIR Oslo, the Norwegian Reproducibility Network (NORRN) and the University of Oslo Library.

Host institutions: University of Oslo

Target audience: PhD, Postdoctoral Fellows, Researchers, Engineers

Capacity: 20

Event types:

  • Workshops and courses

Cost basis: Free to all

Scientific topics: Data curation and archival, Data identity and mapping, Data quality management, Data governance, Workflows

Operations: Data handling
