Generating Artificial Yeast DNA Sequences using a DNA LLM

Abstract

Generating synthetic DNA sequences with pre-trained language models brings together synthetic biology and artificial intelligence, enabling the creation of novel DNA sequences that closely mimic natural genomes. Biologically relevant sequences generated this way could support genetic engineering, drug discovery, and the study of genomic function.

About This Material

This is a hands-on tutorial from the Galaxy Training Network (GTN), usable for individual self-study or as teaching material in a classroom.

Questions this will address

How do you set up a computational environment for generating synthetic DNA sequences using pre-trained language models?
What role does the temperature parameter play in controlling the variability of generated DNA sequences?
How can you compare generated synthetic DNA sequences with real genomic sequences using k-mer counts and PCA?
What is the significance of performing BLAST searches on generated DNA sequences, and how do you interpret the results?
How can you detect open reading frames (ORFs) in generated DNA sequences and translate them into amino acid sequences?
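
As a rough illustration of the temperature question above, the following minimal sketch rescales a set of next-token scores before sampling. The logits are made-up numbers and the four-base vocabulary is an assumption for readability; a real DNA language model may use k-mer or subword tokens, and this is not taken from the tutorial itself.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_base(logits, temperature, rng):
    """Draw one base from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices("ACGT", weights=probs, k=1)[0]


# Hypothetical scores for the next base, in the order A, C, G, T.
toy_logits = [2.0, 1.0, 0.5, 0.1]

cold = softmax_with_temperature(toy_logits, 0.5)  # sharper: favours the top base
hot = softmax_with_temperature(toy_logits, 2.0)   # flatter: more varied output

rng = random.Random(0)
sampled = "".join(sample_base(toy_logits, 1.0, rng) for _ in range(20))
```

Lower temperatures concentrate probability on the highest-scoring token, so generated sequences become more repetitive; higher temperatures flatten the distribution and increase variability.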

Learning Objectives

Describe the process of generating synthetic DNA sequences using pre-trained language models and explain the significance of temperature settings in controlling sequence variability.
Set up a computational environment (e.g., Google Colab) and configure a pre-trained language model to generate synthetic DNA sequences, ensuring all necessary libraries are installed and configured.
Use k-mer counts and Principal Component Analysis (PCA) to compare generated synthetic DNA sequences with real genomic sequences, identifying similarities and differences.
Perform BLAST searches to assess the novelty of generated DNA sequences and interpret the results to determine the biological relevance and uniqueness of the synthetic sequences.
Develop a pipeline to detect open reading frames (ORFs) within generated DNA sequences and translate them into amino acid sequences, demonstrating the potential for creating novel synthetic genes.
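
The ORF detection and translation objective above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the tutorial's pipeline: it scans only the forward strand, uses the standard genetic code, and the function names and the `min_codons` threshold are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for forward-strand ORFs (ATG to in-frame stop).

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons from ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def translate(orf):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf) - 2, 3)
    )


# Toy sequence, not real yeast DNA.
example = "CCATGGCTTGGTAAGG"
orfs = find_orfs(example)
proteins = [translate(example[s:e]) for s, e in orfs]
```

A fuller pipeline would also scan the reverse complement and apply a more realistic minimum ORF length before translating candidates.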

Licence: Creative Commons Attribution 4.0 International

Keywords: Large Language Model, Statistics and machine learning, ai-ml, elixir, jupyter-notebook

Target audience: Students

Resource type: e-learning

Version: 4

Status: Active

Prerequisites:

Deep Learning (without Generative Artificial Intelligence) using Python
Fine-tuning a LLM for DNA Sequence Classification
Foundational Aspects of Machine Learning using Python
Introduction to Python
Neural networks using Python
Predicting Mutation Impact with Zero-shot Learning using a pretrained DNA LLM
Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences
Python - Warm-up for statistics and machine learning

Date modified: 2025-05-22

Date published: 2025-04-17

Authors: Bérénice Batut, Raphael Mourad

Contributors: Anup Kumar, Björn Grüning, Bérénice Batut, Wandrille Duchemin, olisand

Scientific topics: Statistics and probability

External resources: