Learning Pathway: Artificial Intelligence and Machine Learning in Life Sciences using Python
Date: No date given
Artificial intelligence (AI) has permeated our lives, transforming how we live and work. Over the past few years, progress in AI has accelerated rapidly and disruptively, driven by significant advances in data availability, computing power and machine learning. Particularly remarkable strides have been made in the development of foundation models: AI models trained on extensive volumes of unlabelled data. At the same time, the falling cost of high-throughput technologies means that large amounts of omics data are being generated and made accessible to researchers; analysing these complex, high-volume data is not trivial, and classical statistics cannot exploit their full potential. Machine Learning (ML) and Artificial Intelligence (AI) have therefore been recognized as key opportunity areas for ELIXIR, as evidenced by a number of ongoing activities and efforts throughout the community. Beyond the technological advances, however, it is equally important that individual researchers acquire the knowledge and skills needed to take full advantage of ML. Awareness of the challenges, opportunities and constraints that ML applications entail is critical to ensuring high-quality research in the life sciences.
Keywords: ai, elixir, ml
Learning objectives:
- Compare the embeddings of wild-type and mutated sequences to quantify the impact of mutations using L2 distance as a metric.
- Explain the concept of recurrent neural networks (RNNs)
- Explain the concept of attention
- Explain the concepts of backpropagation and training epochs
- Explain the concept of convolutional filters
- Explain the concept of pooling layers
- Configure and implement data collation to organize tokenized data into batches for efficient training.
- Define and configure hyperparameters for pretraining a model, such as learning rate and batch size.
- Define and configure training parameters to optimize the model's performance on the classification task.
- Describe the process of generating synthetic DNA sequences using pre-trained language models and explain the significance of temperature settings in controlling sequence variability.
- Develop a complete workflow for training a language model on DNA sequences, from data preparation to model evaluation, and apply it to real-world bioinformatics tasks.
- Develop a comprehensive plan for documenting and sharing AI model configurations, datasets, and evaluation results to enhance transparency and reproducibility in their research.
- Develop a pipeline to detect open reading frames (ORFs) within generated DNA sequences and translate them into amino acid sequences, demonstrating the potential for creating novel synthetic genes.
- Develop a script to automate the process of predicting mutation impacts using zero-shot learning, enabling researchers to apply this method to their own datasets efficiently.
- Evaluate the fine-tuned model's accuracy and robustness in distinguishing between different classes of DNA sequences.
- Explain the concept of zero-shot learning and its application in predicting the impact of DNA mutations using pre-trained large language models (LLMs).
- Explain the importance of data provenance and dataset splits in ensuring the integrity and reproducibility of AI research.
- Explain the role of a tokenizer in converting DNA sequences into numerical tokens for model processing.
- Describe the forward pass of a neural network
- Describe how model parameters are learned
- Identify and load a pre-trained language model (LLM) suitable for DNA sequence analysis.
- Implement an RNN (code)
- Implement an attention mechanism (code)
- Implement fine-tuning (code)
- Initialize a model with convolutional layers (code)
- Initialize a model with a single layer (code)
- Initialize a model with multiple layers (code)
- Represent input data for model training
- Interpret the results of the L2 distance calculations to determine the significance of mutation effects and discuss potential implications in genomics research.
- Learn the fundamentals of programming in Python
- Load a pre-trained model and modify its architecture to include a classification layer.
- Define a loss function
- Express a model as an equation
- Monitor and evaluate the model's performance during training to ensure effective learning.
- Perform BLAST searches to assess the novelty of generated DNA sequences and interpret the results to determine the biological relevance and uniqueness of the synthetic sequences.
- Make predictions, and save and load trained models
- Prepare and preprocess labeled DNA sequences for fine-tuning.
- Prepare and tokenize DNA sequence datasets for model training and evaluation.
- Set up a computational environment (e.g., Google Colab) and configure a pre-trained language model to generate synthetic DNA sequences, ensuring all necessary libraries are installed and configured.
- Train a model (code)
- Implement the training steps (code)
- Use k-mer counts and Principal Component Analysis (PCA) to compare generated synthetic DNA sequences with real genomic sequences, identifying similarities and differences.
- Use the trained model to generate embeddings for DNA sequences and interpret these embeddings for downstream bioinformatics applications.
- Utilize a pre-trained DNA LLM from Hugging Face to compute embeddings for wild-type and mutated DNA sequences.
- Apply cross-validation and a held-out test set
- Fine-tune an LLM
- Introduce general scikit-learn (sklearn) syntax
- Choose evaluation metrics and handle class imbalance
- Diagnose overfitting and underfitting
- Pretrain an LLM on DNA sequences
- Explain the need for regularization
- Perform zero-shot prediction for DNA variants and synthetic DNA sequence generation
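Several of the objectives above (tokenizing DNA into k-mers, computing embeddings, and comparing wild-type and mutated sequences via L2 distance) can be illustrated together in a minimal, dependency-free sketch. Here a k-mer count vector stands in for a learned embedding; in the actual exercises this vector would come from a pre-trained DNA LLM. The k-mer size and the two example sequences are arbitrary choices for illustration.

```python
import math
from collections import Counter
from itertools import product

def kmer_embedding(seq, k=3):
    """Crude stand-in for a learned embedding: a vector of k-mer counts
    over the fixed vocabulary of all 4**k possible k-mers."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[km] for km in vocab]

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

wild_type = "ATGGCGTACGTTAGC"
mutant    = "ATGGCGTACCTTAGC"  # single G->C substitution

d = l2_distance(kmer_embedding(wild_type), kmer_embedding(mutant))
print(f"L2 distance between wild-type and mutant embeddings: {d:.3f}")
```

A larger L2 distance suggests the mutation perturbs the sequence's representation more strongly; with real LLM embeddings this is the basis of the zero-shot mutation-impact prediction described above.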
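The ORF-detection objective can likewise be sketched without external libraries. The demo sequence and the simplifications below (forward strand only, ATG start, standard codon table) are illustrative; a real pipeline would typically scan all six reading frames, e.g. with Biopython.

```python
# Standard codon table, built compactly from the canonical TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def find_orfs(seq, min_aa=2):
    """Scan the forward strand for ORFs (ATG ... stop) in all 3 frames,
    returning (start_index, peptide) pairs."""
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        protein = []
        for i in range(start, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3])
            if aa is None:          # incomplete/invalid codon: abandon
                break
            if aa == "*":           # stop codon: ORF complete
                if len(protein) >= min_aa:
                    orfs.append((start, "".join(protein)))
                break
            protein.append(aa)
    return orfs

demo = "CCATGGCTTATTGAGG"  # contains ATG GCT TAT TGA -> peptide "MAY"
print(find_orfs(demo))
```

In the generation exercise, such detected ORFs from synthetic sequences would then be checked for novelty, e.g. with a BLAST search as described in the objectives.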
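The role of the temperature setting in controlling sequence variability (see the synthetic-DNA-generation objective) can be shown with a toy sampler. The logits below are invented for illustration; in the course, they would be produced by a pre-trained language model at each generation step.

```python
import math
import random

def sample_base(logits, temperature=1.0, rng=random):
    """Sample one nucleotide from per-base logits after temperature scaling.

    Low temperature sharpens the softmax distribution (near-deterministic
    output); high temperature flattens it (more variable output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exp = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exp) for e in exp]
    return rng.choices("ACGT", weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, 0.1, 0.1]  # invented logits favouring "A"
cold = "".join(sample_base(logits, temperature=0.1, rng=rng) for _ in range(20))
hot  = "".join(sample_base(logits, temperature=5.0, rng=rng) for _ in range(20))
print("T=0.1:", cold)  # nearly all "A"
print("T=5.0:", hot)   # a much more mixed sequence
```

Dividing logits by the temperature before the softmax is the standard mechanism: as the temperature approaches 0 the sampler becomes greedy, and as it grows the distribution approaches uniform.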
Event types:
- Workshops and courses