Learning Pathway: Artificial Intelligence and Machine Learning in Life Sciences using Python
Date: No date given
Artificial intelligence (AI) has permeated our lives, transforming how we live and work. Over the past few years, progress in AI has accelerated rapidly and disruptively, driven by significant advances in data availability, computing power and machine learning. Particularly remarkable strides have been made in the development of foundation models: AI models trained on extensive volumes of unlabelled data. At the same time, the falling cost of high-throughput technologies means that large amounts of omics data are being generated and made accessible to researchers; analysing these complex, high-volume data is not trivial, and classical statistics cannot exploit their full potential. Machine Learning (ML) and Artificial Intelligence (AI) have therefore been recognized as key opportunity areas for ELIXIR, as evidenced by a number of ongoing activities and efforts throughout the community. Beyond the technological advances, however, it is equally important that individual researchers acquire the knowledge and skills needed to take full advantage of ML. Awareness of the challenges, opportunities and constraints that ML applications entail is critical to ensuring high-quality research in the life sciences.
Keywords: ai, elixir, ml
Learning objectives:
- Compare the embeddings of wild-type and mutated sequences to quantify the impact of mutations using L2 distance as a metric.
- Explain the concept of recurrent neural networks (RNNs)
- Explain the concept of attention
- Explain the concepts of backpropagation and training epochs
- Explain the concept of convolutional filters
- Explain the concept of pooling layers
- Configure and implement data collation to organize tokenized data into batches for efficient training.
- Define and configure hyperparameters for pretraining a model, such as learning rate and batch size.
- Define and configure training parameters to optimize the model's performance on the classification task.
- Describe the process of generating synthetic DNA sequences using pre-trained language models and explain the significance of temperature settings in controlling sequence variability.
- Develop a complete workflow for training a language model on DNA sequences, from data preparation to model evaluation, and apply it to real-world bioinformatics tasks.
- Develop a comprehensive plan for documenting and sharing AI model configurations, datasets, and evaluation results to enhance transparency and reproducibility in their research.
- Develop a pipeline to detect open reading frames (ORFs) within generated DNA sequences and translate them into amino acid sequences, demonstrating the potential for creating novel synthetic genes.
- Develop a script to automate the process of predicting mutation impacts using zero-shot learning, enabling researchers to apply this method to their own datasets efficiently.
- Evaluate the fine-tuned model's accuracy and robustness in distinguishing between different classes of DNA sequences.
- Explain the concept of zero-shot learning and its application in predicting the impact of DNA mutations using pre-trained large language models (LLMs).
- Explain the importance of data provenance and dataset splits in ensuring the integrity and reproducibility of AI research.
- Explain the role of a tokenizer in converting DNA sequences into numerical tokens for model processing.
- Describe the forward pass of a neural network
- Describe how model parameters are learned
- Identify and load a pre-trained language model (LLM) suitable for DNA sequence analysis.
- Implement an RNN (code)
- Implement an attention mechanism (code)
- Implement fine-tuning (code)
- Initialize a model with convolutional layers (code)
- Initialize a model with a single layer (code)
- Initialize a model with multiple layers (code)
- Represent input data for model training
- Interpret the results of the L2 distance calculations to determine the significance of mutation effects and discuss potential implications in genomics research.
- Learn the fundamentals of programming in Python
- Load a pre-trained model and modify its architecture to include a classification layer.
- Define a loss function
- Express a model as an equation
- Monitor and evaluate the model's performance during training to ensure effective learning.
- Perform BLAST searches to assess the novelty of generated DNA sequences and interpret the results to determine the biological relevance and uniqueness of the synthetic sequences.
- Make predictions, and save and load trained models
- Prepare and preprocess labeled DNA sequences for fine-tuning.
- Prepare and tokenize DNA sequence datasets for model training and evaluation.
- Set up a computational environment (e.g., Google Colab) and configure a pre-trained language model to generate synthetic DNA sequences, ensuring all necessary libraries are installed and configured.
- Train a model (code)
- Implement the training steps (code)
- Use k-mer counts and Principal Component Analysis (PCA) to compare generated synthetic DNA sequences with real genomic sequences, identifying similarities and differences.
- Use the trained model to generate embeddings for DNA sequences and interpret these embeddings for downstream bioinformatics applications.
- Utilize a pre-trained DNA LLM from Hugging Face to compute embeddings for wild-type and mutated DNA sequences.
- Apply cross-validation and a held-out test set
- Fine-tune an LLM
- Introduce general scikit-learn (sklearn) syntax
- Choose evaluation metrics and handle class imbalance
- Diagnose overfitting and underfitting
- Pretrain an LLM on DNA sequences
- Explain the need for regularization
- Perform zero-shot prediction for DNA variants and synthetic DNA sequence generation
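Several of the objectives above (tokenizing DNA into k-mers, computing embeddings, and comparing wild-type and mutated sequences via L2 distance) can be illustrated together in a minimal, dependency-free sketch. Here a k-mer count vector stands in for a learned embedding; in the actual exercises this vector would come from a pre-trained DNA LLM. The k-mer size and the two example sequences are arbitrary choices for illustration.

```python
import math
from collections import Counter
from itertools import product

def kmer_embedding(seq, k=3):
    """Crude stand-in for a learned embedding: a vector of k-mer counts
    over the fixed vocabulary of all 4**k possible k-mers."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[km] for km in vocab]

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

wild_type = "ATGGCGTACGTTAGC"
mutant    = "ATGGCGTACCTTAGC"  # single G->C substitution

d = l2_distance(kmer_embedding(wild_type), kmer_embedding(mutant))
print(f"L2 distance between wild-type and mutant embeddings: {d:.3f}")
```

A larger L2 distance suggests the mutation perturbs the sequence's representation more strongly; with real LLM embeddings this is the basis of the zero-shot mutation-impact prediction described above.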
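The ORF-detection objective can likewise be sketched without external libraries. The demo sequence and the simplifications below (forward strand only, ATG start, standard codon table) are illustrative; a real pipeline would typically scan all six reading frames, e.g. with Biopython.

```python
# Standard codon table, built compactly from the canonical TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def find_orfs(seq, min_aa=2):
    """Scan the forward strand for ORFs (ATG ... stop) in all 3 frames,
    returning (start_index, peptide) pairs."""
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        protein = []
        for i in range(start, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3])
            if aa is None:          # incomplete/invalid codon: abandon
                break
            if aa == "*":           # stop codon: ORF complete
                if len(protein) >= min_aa:
                    orfs.append((start, "".join(protein)))
                break
            protein.append(aa)
    return orfs

demo = "CCATGGCTTATTGAGG"  # contains ATG GCT TAT TGA -> peptide "MAY"
print(find_orfs(demo))
```

In the generation exercise, such detected ORFs from synthetic sequences would then be checked for novelty, e.g. with a BLAST search as described in the objectives.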
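The role of the temperature setting in controlling sequence variability (see the synthetic-DNA-generation objective) can be shown with a toy sampler. The logits below are invented for illustration; in the course, they would be produced by a pre-trained language model at each generation step.

```python
import math
import random

def sample_base(logits, temperature=1.0, rng=random):
    """Sample one nucleotide from per-base logits after temperature scaling.

    Low temperature sharpens the softmax distribution (near-deterministic
    output); high temperature flattens it (more variable output).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                               # subtract max for stability
    exp = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exp) for e in exp]
    return rng.choices("ACGT", weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, 0.1, 0.1]  # invented logits favouring "A"
cold = "".join(sample_base(logits, temperature=0.1, rng=rng) for _ in range(20))
hot  = "".join(sample_base(logits, temperature=5.0, rng=rng) for _ in range(20))
print("T=0.1:", cold)  # nearly all "A"
print("T=5.0:", hot)   # a much more mixed sequence
```

Dividing logits by the temperature before the softmax is the standard mechanism: as the temperature approaches 0 the sampler becomes greedy, and as it grows the distribution approaches uniform.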
Event types:
- Workshops and courses