Fine-tuning a LLM for DNA Sequence Classification
Abstract
Having prepared, trained, and used a language model for DNA sequences, we can now fine-tune a pre-trained Large Language Model (LLM) for a specific DNA sequence classification task. Here, we take a pre-trained model from Hugging Face, Mistral-DNA-v1-17M-hg38, and adapt it to classify DNA sequences by biological function. The objective is to classify sequences according to whether they bind transcription factors.
About This Material
This is a hands-on tutorial from the GTN, usable either for individual self-study or as teaching material in a classroom.
Questions this will address
- How can a DNA sequence be classified according to whether or not it binds a protein (here, a transcription factor)?
Learning Objectives
- Load a pre-trained model and modify its architecture to include a classification layer.
- Prepare and preprocess labeled DNA sequences for fine-tuning.
- Define and configure training parameters to optimize the model's performance on the classification task.
- Evaluate the fine-tuned model's accuracy and robustness in distinguishing between different classes of DNA sequences.
Licence: Creative Commons Attribution 4.0 International
Keywords: AI & ML, ELIXIR, Large Language Model, Statistics and machine learning, jupyter-notebook, work-in-progress
Target audience: Students
Resource type: e-learning
Version: 2
Status: Draft
Prerequisites:
- Deep Learning (without Generative Artificial Intelligence) using Python
- Foundational Aspects of Machine Learning using Python
- Introduction to Python
- Neural networks using Python
- Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences
- Python - Warm-up for statistics and machine learning
Date modified: 2025-04-25
Date published: 2025-04-17
Contributors: Anup Kumar, Björn Grüning, Bérénice Batut, Wandrille Duchemin, olisand
Scientific topics: Statistics and probability