Date: 9 December 2022 @ 13:00 - 16:00

Timezone: Brussels

Duration: 3 hours


Life science researchers often need to extract, manipulate and integrate data and/or metadata from different sources, such as repositories, databases or flat files. Much research time is spent on trivial and not-so-trivial details of data wrangling: reformatting data structures, cleaning up errors, removing duplicate data, or mapping and integrating dataset fields. Software for data wrangling and analysis, such as Pandas, R or Frictionless, is useful, but researchers still regularly end up with hard-to-reuse scripts, often with manual steps.

uniFAIR is a new Python library with a systematic and scalable approach to research data wrangling. With uniFAIR, researchers can import (meta)data in almost any shape or form: nested JSON, tabular (relational) data, binary streams, or other data structures. Data is continuously parsed and reshaped through a step-by-step process according to a series of data model transformations. uniFAIR provides a catalogue of generic task and subflow templates that the researcher can refine and apply to carry out the transformations needed to wrangle data into the required shape. For large datasets, uniFAIR allows local test jobs on sample-sized data to be seamlessly scaled up to the full datasets and offloaded to external compute resources. Persistent access to the state of the data is available at every step.

This workshop will introduce you to the technical and conceptual background needed to make use of uniFAIR, including the new type hints in Python. Participants will follow hands-on tutorials based on a series of use cases from genomics, proteomics, and machine learning.
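To give a flavour of the kind of reshaping described above, here is a minimal sketch in plain Python (standard library only) that splits nested JSON into two normalized tables without duplicate data. It does not use the uniFAIR API; the field names (`sample`, `organism`, `taxon_id`, `name`) and the function `normalize` are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical nested input: each sample embeds a full copy of its organism.
nested = json.loads("""
[
  {"sample": "S1", "organism": {"taxon_id": 9606, "name": "Homo sapiens"}},
  {"sample": "S2", "organism": {"taxon_id": 9606, "name": "Homo sapiens"}},
  {"sample": "S3", "organism": {"taxon_id": 10090, "name": "Mus musculus"}}
]
""")


def normalize(records):
    """Split nested records into a samples table and an organisms table."""
    organisms = {}  # keyed by taxon_id, so each organism is stored only once
    samples = []
    for rec in records:
        org = rec["organism"]
        organisms[org["taxon_id"]] = {
            "taxon_id": org["taxon_id"],
            "name": org["name"],
        }
        # The samples table keeps only a reference (foreign key) to the organism.
        samples.append({"sample": rec["sample"], "taxon_id": org["taxon_id"]})
    return samples, list(organisms.values())


samples, organisms = normalize(nested)
```

Here `samples` holds one row per sample and `organisms` one row per distinct organism, linked through `taxon_id`.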

Venue: Ole-Johan Dahl's House, 23B Gaustadalléen

City: Oslo

Region: Oslo kommune

Country: Norway

Postcode: 0373

Prerequisites:

The participant should have at least an intermediate level of experience with Python programming. Experience with type hints in Python is useful, but not required.

Learning objectives:

  • Use type hints in Python in general and to define data models in uniFAIR/Pydantic
  • Understand the ideas behind the slogan "parse, don't validate"
  • Know the architecture of uniFAIR and its main classes, and have an overview of the different modules and their usage
  • Define, refine, apply and revise tasks and flows in uniFAIR
  • Import data from external REST APIs and flat files
  • Develop data transformation flows to solve a selection of use cases
  • Inspect data after each transformation step and make informed choices on how to configure the next tasks
  • Transform nested JSON output into normalized tables (without duplicate data)
  • Map (meta)data fields from the input data model to the user-defined output model
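As a rough illustration of the first two objectives, the sketch below uses type hints and a standard-library dataclass to show the idea behind "parse, don't validate": loosely typed input is converted once into a typed model, so downstream code never needs to re-check it. This is not uniFAIR or Pydantic code; the model `GeneRecord`, its fields and the function `parse_gene_record` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical data model: any instance is known to be well-typed."""
    gene_id: str
    length: int


def parse_gene_record(raw: dict[str, object]) -> GeneRecord:
    """Convert loosely typed input into a GeneRecord, or raise ValueError."""
    gene_id = str(raw.get("gene_id", "")).strip()
    if not gene_id:
        raise ValueError("gene_id is missing or empty")
    try:
        # Accept both 84193 and "84193"; a missing value fails the conversion.
        length = int(str(raw.get("length")))
    except ValueError as err:
        raise ValueError(f"length is not an integer: {raw.get('length')!r}") from err
    if length < 0:
        raise ValueError("length must not be negative")
    return GeneRecord(gene_id=gene_id, length=length)


# Parsing happens once at the boundary; the result carries its guarantees in its type.
record = parse_gene_record({"gene_id": " BRCA2 ", "length": "84193"})
```

A validation-only approach would check the dictionary and then pass the same untyped dictionary onwards; parsing instead returns a new object whose type records what has been checked.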

Due to time constraints, the following outcomes will be demonstrated rather than practised hands-on:

  • Scale up the data import from a representative sample to a large dataset and deploy the flow on external compute resources (NIRD service platform)
  • Orchestrate flow runs using the Prefect web-based GUI and inspect data output from external runs
  • Get started with contributing to the Open Source catalogue of uniFAIR modules

Organizer: The workshop is provided by ELIXIR Oslo as part of an extended event organised by the Student Committee of the Centre for Bioinformatics at the University of Oslo, in collaboration with the ISCB Regional Student Group in Norway.

Host institutions: University of Oslo

Target audience: PhD, Postdoctoral Fellows, Researchers, Engineers

Capacity: 20

Event types:

  • Workshops and courses

Cost basis: Free to all

Scientific topics: Data submission, annotation, and curation, Data identity and mapping, Data quality management, Data governance, Workflows

Operations: Data handling

